PySpark Native Functions

In this tutorial, we will explore some common PySpark native functions.

PySpark provides a variety of built-in functions that can be used to perform operations on columns in a DataFrame. These functions are part of the pyspark.sql.functions module and can be imported as follows:

from pyspark.sql.functions import *
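
One caveat with the wildcard import: it shadows Python's built-in sum(), min(), and max(). A common alternative, shown here only as an illustration, is to import the module under an alias and qualify each call:

from pyspark.sql import functions as F

df.agg(F.sum("column1"))  # no clash with Python's built-in sum()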

Some examples of commonly used functions include (a combined, runnable sketch follows this list):

  • sum() function: It is used to calculate the sum of a column.

df.agg(sum("column1"))

  • avg() function: It is used to calculate the average of a column.

df.agg(avg("column1"))

  • min() function: It is used to calculate the minimum value of a column.

df.agg(min("column1"))

  • max() function: It is used to calculate the maximum value of a column.

df.agg(max("column1"))

  • concat() function: It is used to concatenate two or more columns.

df.select(concat(col("column1"), col("column2")))
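
For context, here is a minimal, self-contained sketch combining the calls above. The SparkSession setup and the toy data (columns column1 and column2) are assumptions made for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg, min, max, concat, col

spark = SparkSession.builder.appName("native-functions-demo").getOrCreate()

# Hypothetical sample data
df = spark.createDataFrame(
    [(10, "a"), (20, "b"), (30, "c")],
    ["column1", "column2"],
)

# Several aggregations in a single agg() call
df.agg(
    sum("column1").alias("sum_column1"),
    avg("column1").alias("avg_column1"),
    min("column1").alias("min_column1"),
    max("column1").alias("max_column1"),
).show()

# concat() works on string columns, so cast the numeric column first
df.select(
    concat(col("column1").cast("string"), col("column2")).alias("combined")
).show()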

These functions can be used with the select() and agg() methods to perform operations on DataFrame columns.

df.select(sum("column1").alias("sum_column1"))
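
These aggregations are frequently paired with groupBy(); a brief sketch, assuming column2 is a grouping key (an assumption for illustration):

df.groupBy("column2").agg(sum("column1").alias("sum_column1"))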

You can also use column expressions built with col() in the filter() method to filter the DataFrame based on a condition:

df.filter(col("column1") > 10)
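
To filter on more than one condition, combine column expressions with & (and) or | (or), wrapping each comparison in parentheses; the column names below are illustrative:

df.filter((col("column1") > 10) & (col("column2") == "b"))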

These functions can also be used with the withColumn() method to add a new column to a DataFrame.

df.withColumn("new_column", col("column1") + col("column2"))

You can also use the when() function together with its otherwise() method to create a new column based on a condition.

from pyspark.sql.functions import when
df.withColumn("new_column", when(col("column1") > 10, "high").otherwise("low"))

You can also use the ifnull() and nullif() functions to handle missing values. Note that these were added to pyspark.sql.functions only in recent releases (PySpark 3.5 and later), and both arguments must be column expressions, so literals need to be wrapped with lit():

from pyspark.sql.functions import ifnull, nullif, lit, col
df.select(ifnull(col("column1"), lit(0)))
df.select(nullif(col("column1"), lit(0)))
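
On PySpark versions where ifnull() and nullif() are not available, coalesce() or DataFrame.fillna() are commonly used instead; a sketch under that assumption:

from pyspark.sql.functions import coalesce, lit

df.select(coalesce(col("column1"), lit(0)))   # replace nulls in column1 with 0
df.fillna({"column1": 0})                     # same idea at the DataFrame level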

These are just some examples of the built-in functions provided by PySpark. There are many more functions available and it's always good to check the documentation for the latest updates and options.

When working with Databricks, also make sure that the required libraries are installed on your cluster.