PySpark Transformation Methods

In this tutorial, we will explore some of the most common data transformation methods in PySpark.

  • Data transformations are operations that create a new DataFrame from an existing one. PySpark provides a variety of data transformation methods that can be used to manipulate data in a DataFrame.
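
The one-liner snippets below assume an active SparkSession and a DataFrame named df; the column names (column1, column2, ...) are placeholders for your own columns. A minimal setup sketch, assuming you are running outside Databricks (which already provides spark), might look like this:

from pyspark.sql import SparkSession

# Build or reuse a SparkSession (Databricks notebooks provide `spark` automatically)
spark = SparkSession.builder.appName("transformation-demo").getOrCreate()

# A small illustrative DataFrame; column1..column4 stand in for real columns
data = [(1, "a", 5.0, 7), (2, "b", 3.0, 9), (3, "a", 8.0, 2)]
df = spark.createDataFrame(data, ["column1", "column2", "column3", "column4"])
df.show()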

  • One of the most commonly used transformation methods is select(), which allows you to select specific columns from a DataFrame. You can pass one or more column names (or Column expressions) to select().

df.select("column1","column2")
  • Another commonly used transformation method is filter(), which allows you to filter rows based on a condition. You pass a condition built on one or more columns (or an equivalent SQL expression string), as shown below.

df.filter(df.column_name > 10)
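
Conditions can be combined with the & (and), | (or), and ~ (not) operators; wrap each condition in its own parentheses. An illustrative sketch, reusing the placeholder columns from the setup above:

from pyspark.sql.functions import col

# Keep rows where column1 is greater than 1 AND column2 equals "a"
df.filter((col("column1") > 1) & (col("column2") == "a"))

# Equivalent SQL-style string expression
df.filter("column1 > 1 AND column2 = 'a'")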
  • You can also use groupBy() to group the DataFrame by one or more columns and apply aggregate functions such as count, sum, avg, etc. to the grouped data. The aggregate functions come from pyspark.sql.functions.

from pyspark.sql.functions import avg, max

df.groupBy("column1", "column2").agg(avg("column3"), max("column4"))
  • join() method allows you to join two DataFrames on one or more common columns. It supports various join types such as inner, outer, left, and right joins.

df1.join(df2, "column_name", "inner")
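
As an illustrative sketch (the DataFrames and columns here are made up for the example), an inner join keeps only the keys that appear on both sides:

# Two small illustrative DataFrames sharing a customer_id key
orders = spark.createDataFrame([(1, "laptop"), (2, "phone")], ["customer_id", "product"])
customers = spark.createDataFrame([(1, "Alice"), (3, "Bob")], ["customer_id", "name"])

# Inner join: only customer_id = 1 exists in both, so a single row is returned
orders.join(customers, "customer_id", "inner").show()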
  • drop() method allows you to drop one or more columns from a DataFrame.

df.drop("column1","column2")
  • withColumn() method allows you to add or replace a column in a DataFrame.

df.withColumn("new_column", df.column1 + df.column2)
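
Two further illustrative uses of withColumn(), assuming the placeholder columns from the setup sketch: adding a constant column with lit() and a conditional column with when():

from pyspark.sql.functions import lit, when

df = df.withColumn("source", lit("batch"))                                    # constant-valued column
df = df.withColumn("is_large", when(df.column3 > 5, True).otherwise(False))  # conditional flag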
  • sort() method allows you to sort the DataFrame by one or more columns in ascending or descending order.

df.sort(df.column1.desc())
  • distinct() method removes duplicate rows from a DataFrame.

df.distinct()
  • limit() method allows you to limit the number of rows in a DataFrame.

df.limit(10)
  • repartition() method allows you to change the number of partitions in a DataFrame. This can be useful when you want to increase or decrease the parallelism of a DataFrame.

df.repartition(10)
  • coalesce() method allows you to decrease the number of partitions in a DataFrame. Unlike repartition(), it avoids a full shuffle, so it is the cheaper option when you only need to reduce the partition count.

df.coalesce(5)
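
You can check the effect of repartition() and coalesce() on the partition count through the underlying RDD; a small illustrative sketch:

print(df.rdd.getNumPartitions())          # current number of partitions

df_more = df.repartition(10)              # full shuffle; can increase or decrease partitions
print(df_more.rdd.getNumPartitions())     # 10

df_fewer = df_more.coalesce(5)            # merges existing partitions without a full shuffle
print(df_fewer.rdd.getNumPartitions())    # 5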
  • cast() method allows you to change the data type of a column.

from pyspark.sql.functions import col
df = df.withColumn("column1", col("column1").cast("int"))
  • replace() method allows you to replace specific values in a DataFrame.

df = df.replace(["value1", "value2"], ["new_value1", "new_value2"], subset="column_name")
  • dropna() method allows you to drop rows with null values from a DataFrame.

df.dropna()
  • fillna() method allows you to fill null values with a specific value.

df.fillna(0, subset="column_name")
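
Putting several of these together, here is a sketch of a chained pipeline on the illustrative df from the setup above. Transformations are lazy, so nothing executes until an action such as show() is called:

from pyspark.sql.functions import avg, col

result = (
    df.select("column1", "column2", "column3")
      .filter(col("column3") > 2)
      .dropna()
      .groupBy("column2")
      .agg(avg("column3").alias("avg_column3"))
      .sort(col("avg_column3").desc())
      .limit(10)
)
result.show()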

These are some of the commonly used data transformation methods in PySpark. There are many more methods and options available depending on the specific use case. It's always good to check the official PySpark documentation for the latest updates and options.