PySpark Action Methods

This tutorial covers some of the most common action methods in PySpark.

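  • Action methods are operations that return a value to the driver or produce a side effect. Transformations are lazy and only describe a computation; calling an action is what triggers Spark to actually run it. Actions are typically used to retrieve or collect data from a DataFrame.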

  • One of the most commonly used action methods is count(), which returns the number of rows in a DataFrame as a Python integer.

df.count()
  • Another commonly used action method is show(), which prints the first n rows of a DataFrame to standard output and returns nothing. By default it shows 20 rows and truncates long string values to 20 characters, but you can specify a different number of rows.

df.show(n=10)
  • The collect() method retrieves all the rows of a DataFrame as a Python list of Row objects. Use it with caution: every row is brought back to the driver, so the driver can run out of memory if the DataFrame is too large.

df.collect()
  • The first() method retrieves the first row of a DataFrame as a Row object. It returns None if the DataFrame is empty.

df.first()
  • The take() method retrieves the first n rows of a DataFrame as a list of Row objects.

df.take(5)
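  • The foreach() method applies a function to each Row of a DataFrame. The function runs on the executors and its return value is discarded, so it is used only for side effects. On a cluster, the output of print() ends up in the executor logs rather than on the driver.

# Called once per row, on the executors.
def print_row(row):
    print(row)

df.foreach(print_row)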
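  • The foreachPartition() method applies a function once to each partition of a DataFrame. The function receives an iterator over the rows of that partition, which is useful when per-call setup, such as opening a database connection, is expensive.

# Called once per partition, with an iterator over that partition's rows.
def print_partition(iterator):
    for row in iterator:
        print(row)

df.foreachPartition(print_partition)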
  • The toPandas() method converts a DataFrame to a pandas DataFrame. It requires pandas to be installed and, like collect(), brings every row back to the driver, so it can cause the driver to run out of memory if the DataFrame is too large.

df.toPandas()

These are some of the most commonly used action methods in PySpark. Many more methods and options are available, and the exact parameters can differ between Spark versions, so check the documentation for the version you are running.
