PySpark

Introduction

  • PySpark is the Python library for Spark programming. It allows you to harness the power of Apache Spark, a fast and general-purpose cluster computing system, using Python.

  • To use PySpark, you will first need to install Spark on your machine. You can download Spark from the official website (https://spark.apache.org/downloads.html) and follow the instructions for your operating system.

  • Once you have Spark installed, you can start using PySpark by importing the library in your Python script:

from pyspark import SparkContext, SparkConf
  • The next step is to create a SparkConf and SparkContext. The SparkConf allows you to configure various settings for your Spark application, while the SparkContext is the entry point to the Spark cluster and the main object you will use to interact with it.

conf = SparkConf().setAppName("MyApp").setMaster("local")
sc = SparkContext(conf=conf)
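SparkConf also accepts arbitrary configuration properties through its set method, chained before the context is created. A minimal sketch of a more detailed configuration (the memory value here is purely illustrative, not a recommendation):

conf = (
    SparkConf()
    .setAppName("MyApp")
    .setMaster("local[*]")                # local mode, using all available cores
    .set("spark.executor.memory", "2g")   # illustrative value; tune for your workload
)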
  • With the SparkContext, you can now create RDDs (Resilient Distributed Datasets), which are the basic data structure in Spark. You can create RDDs from a variety of data sources, such as local files, HDFS, or even other RDDs.

rdd = sc.textFile("path/to/file.txt")
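textFile is only one of several ways to build an RDD; you can also distribute an in-memory Python collection with parallelize, which is convenient for small experiments. A minimal sketch:

# Create an RDD from a local Python list and run a simple action on it
numbers = sc.parallelize([1, 2, 3, 4, 5])
print(numbers.sum())   # 15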
  • RDDs support two types of operations: transformations and actions. Transformations create a new RDD from an existing one, such as map, filter, and groupByKey. Actions return a value to the driver or write data to external storage, such as count, collect, and saveAsTextFile. A short example follows, with a fuller sketch after it.

# Transformation
rdd2 = rdd.filter(lambda x: "error" in x)
# Action
print(rdd2.count())
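The same pattern extends to the other operations mentioned above: transformations such as map and flatMap only build up a lineage, and nothing executes until an action runs. A small word-count sketch, using reduceByKey in place of groupByKey for the aggregation (the output path is just a placeholder):

# Transformations: build the computation lazily
words = rdd.flatMap(lambda line: line.split())    # split each line into words
pairs = words.map(lambda word: (word, 1))         # pair each word with a count of 1
counts = pairs.reduceByKey(lambda a, b: a + b)    # sum the counts per word

# Actions: trigger execution
print(counts.collect())                           # bring the results back to the driver
counts.saveAsTextFile("path/to/output")           # write the results to storage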
  • PySpark also provides a DataFrame and SQL API, which will feel familiar if you have worked with pandas or SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.show()
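Once the DataFrame is loaded, you can query it either through DataFrame methods or with plain SQL after registering it as a temporary view. A brief sketch, assuming the CSV above has an "age" column (adjust to your own schema):

# DataFrame API: filter and select, similar in spirit to pandas
df.filter(df["age"] > 30).select("age").show()

# SQL API: register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT COUNT(*) AS n FROM people WHERE age > 30").show()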
  • Finally, remember to stop the SparkContext when you are done:

sc.stop()

This is a brief summary of PySpark basics, but it is just the tip of the iceberg. There are many more features and capabilities of PySpark, such as machine learning libraries, streaming, and graph processing.

If you are writing PySpark code in Databricks, however, you can skip the setup above: Databricks notebooks provide a preconfigured SparkSession (spark) out of the box.

In the next chapter we will learn about using PySpark in Databricks.
