Reading Data with PySpark

In this tutorial we will see how to read data in PySpark on Databricks.

  • To read data in PySpark in Databricks, you will first need to create a SparkSession. You can create a SparkSession by running the following command:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

The above step can be skipped in Databricks, because Databricks notebooks come with a SparkSession already preloaded as the variable spark.

  • Once you have a SparkSession, you can use its spark.read methods to read data from various sources. Databricks supports various file formats such as CSV, JSON, Parquet, and many more. You can read a CSV file from DBFS (Databricks File System) by running the following command:

df = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)
  • Databricks also supports reading data from external sources such as Amazon S3, Azure Blob Storage, and more. You just need to provide a path with the appropriate URI scheme for the storage system (for example, s3a:// for S3).

df = spark.read.csv("s3a://path/to/file.csv", header=True, inferSchema=True)
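Reading from S3 usually also requires credentials. A minimal configuration sketch is shown below; the key values are placeholders, and in practice instance profiles or Databricks secret scopes are preferable to hard-coding credentials:

```python
from pyspark.sql import SparkSession

# Sketch only: the access/secret key values are placeholders, not real credentials.
# The "spark.hadoop." prefix forwards these settings to the Hadoop S3A connector.
spark = (
    SparkSession.builder
    .appName("S3ReadExample")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

df = spark.read.csv("s3a://path/to/file.csv", header=True, inferSchema=True)
```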
  • Once you have read the data, you can perform various operations on the dataframe such as filtering, aggregation, and joins. Databricks also provides a built-in visualization function called display, which you can use to plot the dataframe.

from pyspark.sql.functions import avg

display(df.filter(df.column_name == 'value').groupBy('column_name').agg(avg('column_name')))
  • You can also save the dataframe back to DBFS or to external storage such as S3 or Azure Blob Storage using the df.write methods.

  • To read data from other sources such as Hive or JDBC, you can use the spark.read.format() method with the appropriate options.

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql:dbserver")
    .option("dbtable", "schema.tablename")
    .option("user", "username")
    .option("password", "password")
    .load()
)
  • Always make sure that you have the required credentials to access the data and that the required libraries (such as the JDBC driver for your database) are installed on the cluster.

These are some basic notes for reading data using PySpark in Databricks. There are many more options and configurations available for reading data, depending on the specific data source and use case.
