PySpark Transformation Methods

In this tutorial, we will explore some of the most common data transformation methods in PySpark.

  • Data transformations are operations that create a new DataFrame from an existing one. DataFrames are immutable, so each transformation returns a new DataFrame rather than modifying the original, and transformations are evaluated lazily, only running when an action such as show() or count() is triggered. PySpark provides a variety of transformation methods for manipulating the data in a DataFrame.

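To follow along, you can create a small DataFrame like the one below. The SparkSession setup and the column names (column1, column2, column3) are illustrative assumptions used by the snippets in this tutorial, not a fixed schema.

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession, the entry point for DataFrame operations
spark = SparkSession.builder.appName("transformations-demo").getOrCreate()

# A small, hypothetical DataFrame for the snippets below
df = spark.createDataFrame(
    [(5, "a", 1.0), (12, "b", 2.5), (12, "b", 2.5)],
    ["column1", "column2", "column3"],
)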
  • One of the most commonly used transformation methods is select(), which allows you to select specific columns from a DataFrame. You can pass one or more column names, or Column expressions, as shown below.

df.select("column1","column2")
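The select() method also accepts Column expressions, so you can rename or derive columns in one pass. A quick sketch (the alias names here are made up for illustration):

from pyspark.sql.functions import col

# Select by expression: rename column1 and derive a new value from column3
df.select(col("column1").alias("id"), (col("column3") * 2).alias("doubled"))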
  • Another commonly used transformation method is filter(), which keeps only the rows that satisfy a condition (where() is an alias). You pass it a boolean Column expression or an SQL expression string.

df.filter(df.column_name > 10)
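Conditions can be combined with & (and), | (or), and ~ (not). Each condition must be wrapped in parentheses because of Python operator precedence; for example (with illustrative column names):

from pyspark.sql.functions import col

# Keep rows where column1 > 10 AND column2 equals "b"
df.filter((col("column1") > 10) & (col("column2") == "b"))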
  • You can also use groupBy() to group the DataFrame by one or more columns and then apply aggregate functions such as count, sum, and avg to each group. Note that these aggregate functions must be imported from pyspark.sql.functions.

from pyspark.sql.functions import avg, max
df.groupBy("column1", "column2").agg(avg("column3"), max("column4"))
  • The join() method allows you to join two DataFrames on one or more columns. It supports various join types, such as inner, outer, left, and right, selected through its third argument.

df1.join(df2, "column_name", "inner")
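For example, given two small hypothetical DataFrames that share an "id" column (reusing the spark session from the setup above), the third argument selects the join type:

# employees and salaries are illustrative names, not part of the tutorial's data
employees = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
salaries = spark.createDataFrame([(1, 50000)], ["id", "salary"])

# "inner" keeps only matching ids; "left" would keep Bob with a null salary
employees.join(salaries, "id", "inner")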
  • The drop() method allows you to drop one or more columns from a DataFrame.

df.drop("column1","column2")
  • The withColumn() method allows you to add a new column or replace an existing one.

df.withColumn("new_column", df.column1 + df.column2)
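To add a constant column rather than a computed one, wrap the value in lit(), since plain Python values are not accepted as columns directly (the column name and value below are illustrative):

from pyspark.sql.functions import lit

# lit() wraps a literal value so Spark treats it as a constant Column
df.withColumn("source", lit("tutorial"))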
  • The sort() method (orderBy() is an alias) allows you to sort the DataFrame by one or more columns in ascending or descending order.

df.sort(df.column1.desc())
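Multiple sort keys can be mixed in either direction:

# Sort by column1 descending, then break ties by column2 ascending
df.sort(df.column1.desc(), df.column2.asc())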
  • The distinct() method removes duplicate rows from a DataFrame.

df.distinct()
  • The limit() method allows you to restrict a DataFrame to its first n rows.

df.limit(10)
  • The repartition() method allows you to change the number of partitions of a DataFrame, which is useful when you want to increase or decrease its parallelism. Note that it performs a full shuffle of the data.

df.repartition(10)
  • The coalesce() method allows you to decrease the number of partitions of a DataFrame without a full shuffle, which makes it cheaper than repartition() when you only need fewer partitions; the sketch below contrasts the two.

df.coalesce(5)
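The practical difference: repartition() shuffles and can both increase and decrease the partition count, while coalesce() avoids a shuffle and can only decrease it. A quick way to inspect the effect:

print(df.rdd.getNumPartitions())                  # current partition count
print(df.repartition(10).rdd.getNumPartitions())  # 10, via a full shuffle
print(df.coalesce(2).rdd.getNumPartitions())      # at most 2, no shuffle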
  • The cast() method, called on a Column, allows you to change the data type of that column; pair it with withColumn() to write the result back to the DataFrame.

from pyspark.sql.functions import col
df = df.withColumn("column1", col("column1").cast("int"))
  • The replace() method allows you to replace specific values in a DataFrame. When two lists are passed, values are matched positionally, so "value1" becomes "new_value1".

df = df.replace(["value1", "value2"], ["new_value1", "new_value2"], subset="column_name")
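The replace() method also accepts a dictionary, which keeps the old-to-new mapping explicit (the values and column name here are placeholders):

# Dictionary form: each key is replaced by its value in the listed columns
df = df.replace({"value1": "new_value1", "value2": "new_value2"}, subset=["column_name"])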
  • The dropna() method allows you to drop rows containing null values from a DataFrame.

df.dropna()
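The optional how and subset arguments control which rows are dropped:

# how="any" (the default) drops a row if any listed column is null;
# how="all" drops it only when all listed columns are null
df.dropna(how="all", subset=["column1", "column2"])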
  • The fillna() method allows you to fill null values with a specified value, optionally limited to particular columns.

df.fillna(0, subset="column_name")
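The fillna() method also accepts a dictionary, which is handy for filling different columns with different defaults (the fill values here are illustrative):

# Fill nulls per column: 0 for a numeric column, "unknown" for a string column
df.fillna({"column1": 0, "column2": "unknown"})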

These are some of the most commonly used data transformation methods in PySpark. Many more methods and options are available depending on the specific use case, so it is always worth checking the official PySpark documentation for the latest updates and options.
