PySpark Native Functions

In this tutorial, we will explore some common PySpark native functions.

PySpark provides a variety of built-in functions that can be used to perform operations on columns in a DataFrame. These functions are part of the pyspark.sql.functions module and can be imported as follows:

from pyspark.sql.functions import *
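
One caveat with the wildcard import: it shadows Python's built-in sum(), min(), and max(). A common alternative, shown here only as an illustration, is to import the module under an alias and qualify each call:

from pyspark.sql import functions as F

df.agg(F.sum("column1"))  # no clash with Python's built-in sum()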

Some examples of commonly used functions include (a combined, runnable sketch follows this list):

  • sum() function: It is used to calculate the sum of a column.

df.agg(sum("column1"))

  • avg() function: It is used to calculate the average of a column.

df.agg(avg("column1"))

  • min() function: It is used to calculate the minimum value of a column.

df.agg(min("column1"))

  • max() function: It is used to calculate the maximum value of a column.

df.agg(max("column1"))

  • concat() function: It is used to concatenate two or more columns.

df.select(concat(col("column1"), col("column2")))
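
For context, here is a minimal, self-contained sketch combining the calls above. The SparkSession setup and the toy data (columns column1 and column2) are assumptions made for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg, min, max, concat, col

spark = SparkSession.builder.appName("native-functions-demo").getOrCreate()

# Hypothetical sample data
df = spark.createDataFrame(
    [(10, "a"), (20, "b"), (30, "c")],
    ["column1", "column2"],
)

# Several aggregations in a single agg() call
df.agg(
    sum("column1").alias("sum_column1"),
    avg("column1").alias("avg_column1"),
    min("column1").alias("min_column1"),
    max("column1").alias("max_column1"),
).show()

# concat() works on string columns, so cast the numeric column first
df.select(
    concat(col("column1").cast("string"), col("column2")).alias("combined")
).show()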

These functions can be used with the select() and agg() methods to perform operations on DataFrame columns.

df.select(sum("column1").alias("sum_column1"))
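
These aggregations are frequently paired with groupBy(); a brief sketch, assuming column2 is a grouping key (an assumption for illustration):

df.groupBy("column2").agg(sum("column1").alias("sum_column1"))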

You can also use column expressions built with col() in the filter() method to filter the DataFrame based on a condition:

df.filter(col("column1") > 10)
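
To filter on more than one condition, combine column expressions with & (and) or | (or), wrapping each comparison in parentheses; the column names below are illustrative:

df.filter((col("column1") > 10) & (col("column2") == "b"))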

These functions can also be used with the withColumn() method to add a new column to a DataFrame.

df.withColumn("new_column", col("column1") + col("column2"))

You can also use the when() function together with its otherwise() method to create a new column based on a condition.

from pyspark.sql.functions import when
df.withColumn("new_column", when(col("column1") > 10, "high").otherwise("low"))

You can also use the ifnull() and nullif() functions to handle missing values. Note that these were added to pyspark.sql.functions only in recent releases (PySpark 3.5 and later), and both arguments must be column expressions, so literals need to be wrapped with lit():

from pyspark.sql.functions import ifnull, nullif, lit, col
df.select(ifnull(col("column1"), lit(0)))
df.select(nullif(col("column1"), lit(0)))
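
On PySpark versions where ifnull() and nullif() are not available, coalesce() or DataFrame.fillna() are commonly used instead; a sketch under that assumption:

from pyspark.sql.functions import coalesce, lit

df.select(coalesce(col("column1"), lit(0)))   # replace nulls in column1 with 0
df.fillna({"column1": 0})                     # same idea at the DataFrame level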

These are just some examples of the built-in functions provided by PySpark. There are many more functions available and it's always good to check the documentation for the latest updates and options.

When working with Databricks, also make sure that the required libraries are installed on your cluster.