PySpark Action Methods

In this tutorial we will look at some of the common action methods in PySpark.

  • Action methods are operations that return a value to the driver or produce a side effect. Unlike transformations, which are evaluated lazily, actions trigger the actual computation and are used to retrieve or collect data from a DataFrame.

  • One of the most commonly used action methods is count(), which returns the number of rows in a DataFrame.

df.count()
  • Another commonly used action method is show(), which displays the first n rows of a DataFrame. By default it shows 20 rows, but you can specify a different number.

df.show(n=10)
  • The collect() method retrieves all the rows of a DataFrame as a list of Row objects. Use it with caution: it brings all the data to the driver and can cause the driver to run out of memory if the DataFrame is too large.

df.collect()
  • The first() method retrieves the first row of a DataFrame.

df.first()
  • The take() method retrieves the first n rows of a DataFrame as a list of Row objects.

df.take(5)
  • The foreach() method applies a function to each row of a DataFrame. Note that the function runs on the executors, so any print() output appears in the executor logs rather than on the driver console.

def my_function(row):
    print(row)

df.foreach(my_function)
  • The foreachPartition() method applies a function to each partition of a DataFrame. The function receives an iterator of rows, which is useful when you want to pay an expensive setup cost (such as opening a database connection) once per partition instead of once per row.

def my_function(iterator):
    for row in iterator:
        print(row)

df.foreachPartition(my_function)
  • The toPandas() method converts a Spark DataFrame to a pandas DataFrame. Like collect(), it materializes all rows on the driver, so it can run out of memory if the DataFrame is too large.

df.toPandas()

These are some of the most commonly used action methods in PySpark. Many more methods and options are available depending on the specific use case, so it's always good to check the official PySpark documentation for the latest updates and options.