Reading Data with PySpark
In this tutorial, we'll see how to read data into PySpark from local files, S3, and a relational database, and how to write a DataFrame back out.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

# Create (or reuse) a SparkSession, the entry point for DataFrame operations.
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Read a CSV file from the local filesystem. header=True treats the first row
# as column names; inferSchema=True makes Spark scan the data to guess types.
df = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)

# The same reader works against S3 via the s3a:// scheme (this requires the
# hadoop-aws connector and S3 credentials to be configured).
df = spark.read.csv("s3a://path/to/file.csv", header=True, inferSchema=True)

# Filter rows, then group and aggregate. display() renders the result in
# Databricks notebooks; use df.show() in a plain PySpark session.
display(df.filter(df.column_name == "value")
          .groupBy("column_name")
          .agg(avg("column_name")))

# Write the DataFrame out in Parquet format.
df.write.parquet("/path/to/save/data.parquet")

# Read a table from a relational database over JDBC.
df = (spark.read.format("jdbc")
      .options(url="jdbc:postgresql:dbserver", dbtable="schema.tablename",
               user="username", password="password")
      .load())
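Note that inferSchema=True triggers an extra pass over the data, so for large or frequently read files it is usually better to declare the schema up front. Here is a minimal sketch; the columns "name" and "amount" are hypothetical and should be adjusted to match your file:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema for illustration; field names and types are assumptions.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])

# Passing an explicit schema skips the inference pass entirely.
df = spark.read.csv("/path/to/file.csv", header=True, schema=schema)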
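To confirm that the Parquet write above round-trips correctly, you can read the file back and inspect it. A quick sketch using the same path as the example:

# Parquet stores the schema with the data, so no header or inferSchema
# options are needed when reading it back.
df2 = spark.read.parquet("/path/to/save/data.parquet")
df2.printSchema()  # column names and types recovered from the file
df2.show(5)        # preview the first five rows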
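By default the JDBC read above pulls the whole table through a single connection. Spark's JDBC source also accepts partitioning options that split the read across executors. A sketch assuming a hypothetical numeric "id" column and value range; tune the bounds and partition count to your table:

# partitionColumn must be a numeric, date, or timestamp column; the bounds
# here ("1" to "100000", 8 partitions) are illustrative assumptions.
df = (spark.read.format("jdbc")
      .options(url="jdbc:postgresql:dbserver", dbtable="schema.tablename",
               user="username", password="password",
               partitionColumn="id", lowerBound="1", upperBound="100000",
               numPartitions="8")
      .load())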