Reading Data with PySpark

In this tutorial, we will see how to read data with PySpark in Databricks.

  • To read data with PySpark in Databricks, you first need a SparkSession. You can create one by running the following command:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

This step can be skipped in Databricks, because Databricks notebooks already have a SparkSession preloaded as the spark variable.
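
For example, in a Databricks notebook you can use the preloaded session straight away (the exact output depends on your cluster):

print(spark.version)                 # Spark version of the attached cluster
print(spark.sparkContext.appName)    # name of the preloaded Spark application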

  • Once you have a SparkSession, you can use the spark.read method to read data from many sources. Databricks supports file formats such as CSV, JSON, Parquet, and more. You can read a CSV file from DBFS (Databricks File System) by running the following command:

df = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)
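
The same spark.read interface works for the other file formats as well; the paths below are placeholders:

# JSON: by default each line of the file is expected to be a JSON object
df_json = spark.read.json("/path/to/file.json")

# Parquet: the schema is stored in the file itself, so inferSchema is not needed
df_parquet = spark.read.parquet("/path/to/file.parquet")

# Generic form: name the format and set options explicitly
df_csv = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/path/to/file.csv")
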
  • Databricks also supports reading data from external sources such as Amazon S3, Azure Blob Storage, and more; you just need to provide the appropriate path for the file and configure access to it.

df = spark.read.csv("s3a://path/to/file.csv", header=True, inferSchema=True)
  • Once you have read the data, you can perform various operations on the dataframe, such as filtering, aggregation, and joins. Databricks also provides a built-in visualization tool called display, which you can use to plot the dataframe.

from pyspark.sql.functions import avg

display(df.filter(df.column_name == 'value').groupBy('column_name').agg(avg('column_name')))
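
A join works in the same way; the dataframes and the customer_id key below are hypothetical:

# Hypothetical example: join two dataframes on a shared key column
orders = spark.read.csv("/path/to/orders.csv", header=True, inferSchema=True)
customers = spark.read.csv("/path/to/customers.csv", header=True, inferSchema=True)

joined = orders.join(customers, on="customer_id", how="inner")
display(joined)
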
  • You can also save the dataframe back to DBFS or to external storage such as S3 or Azure Blob Storage.

df.write.parquet("/path/to/save/data.parquet")
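
You can also control how the data is written, for example the save mode and the folder layout; the path and column name below are placeholders:

# Overwrite any existing output and partition the folder by a column
df.write.mode("overwrite").partitionBy("column_name").parquet("/path/to/save/data")

# Or save as CSV with a header row
df.write.mode("overwrite").option("header", "true").csv("/path/to/save/csv_data")
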
  • To read data from other sources such as Hive or JDBC, you can use the spark.read method with the appropriate options.

df = spark.read.format("jdbc").options(url="jdbc:postgresql:dbserver", dbtable="schema.tablename", user="username", password="password").load()
  • Always make sure that you have the required credentials to access the data and that the required libraries (such as the JDBC driver for your database) are installed; one way to manage credentials in Databricks is sketched below.
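
One way to avoid hard-coding credentials in Databricks is the secrets utility; this is a sketch that assumes you have already created a secret scope and keys with these (hypothetical) names:

# Sketch: fetch JDBC credentials from a Databricks secret scope instead of hard-coding them
# "jdbc-scope", "db-user", and "db-password" are assumed names created beforehand via the CLI or UI
user = dbutils.secrets.get(scope="jdbc-scope", key="db-user")
password = dbutils.secrets.get(scope="jdbc-scope", key="db-password")

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql:dbserver")
      .option("dbtable", "schema.tablename")
      .option("user", user)
      .option("password", password)
      .load())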

These are some basic notes for reading data using PySpark in Databricks. There are many more options and configurations available for reading data, depending on the specific data source and use case.
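
For example, instead of relying on inferSchema you can pass an explicit schema, which avoids an extra pass over the file and keeps column types predictable; the column names below are placeholders:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Define the schema up front so Spark does not need to infer it from the data
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])

df = spark.read.csv("/path/to/file.csv", header=True, schema=schema)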
