Handling Duplicate Data

In this tutorial, we will look at some common ways to handle duplicate rows in a PySpark DataFrame.

  • Removing duplicate rows based on all columns:

df = df.distinct()
  • Removing duplicate rows based on specific columns:

df = df.dropDuplicates(["column1", "column2"])
  • Keeping only the first occurrence within each group of duplicates, by an explicit ordering. Note that PySpark's dropDuplicates() does not accept a keep argument — that option belongs to pandas' drop_duplicates(). In Spark, which row survives deduplication is arbitrary unless you enforce an order with a window function (here "order_col" stands in for whatever column defines "first" in your data):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

w = Window.partitionBy("column1", "column2").orderBy(col("order_col").asc())
df = df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn")
  • Keeping only the last occurrence: use the same window with a descending order:

w = Window.partitionBy("column1", "column2").orderBy(col("order_col").desc())
df = df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn")
  • Removing every row that has a duplicate (the equivalent of pandas' keep=False):

from pyspark.sql.functions import count

w = Window.partitionBy("column1", "column2")
df = df.withColumn("cnt", count("*").over(w)).filter(col("cnt") == 1).drop("cnt")

It's worth noting that dropDuplicates() called without any arguments is equivalent to distinct(); the difference is that dropDuplicates() can also take a subset of columns to consider. You can use either function depending on your preference, and it's always good to check the documentation for the latest updates and options.
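To make the keep-first, keep-last, and drop-all-duplicates semantics above concrete, here is a minimal plain-Python sketch (not Spark code — the sample rows and column names are made up for illustration):

```python
from collections import Counter

# Sample rows: (column1, column2, value). The (column1, column2) pair
# is the duplicate key, mirroring dropDuplicates(["column1", "column2"]).
rows = [
    ("a", 1, "first"),
    ("a", 1, "second"),
    ("b", 2, "only"),
    ("c", 3, "first"),
    ("c", 3, "second"),
]

def key(r):
    return (r[0], r[1])

def keep_first(rows):
    # Keep the first row seen for each key, in input order.
    seen, out = set(), []
    for r in rows:
        if key(r) not in seen:
            seen.add(key(r))
            out.append(r)
    return out

def keep_last(rows):
    # Keep-last is keep-first on the reversed input, re-reversed.
    return list(reversed(keep_first(list(reversed(rows)))))

def drop_all_duplicated(rows):
    # Keep only keys that occur exactly once (pandas' keep=False).
    counts = Counter(key(r) for r in rows)
    return [r for r in rows if counts[key(r)] == 1]

print(keep_first(rows))          # keeps ("a", 1, "first") and ("c", 3, "first")
print(keep_last(rows))           # keeps ("a", 1, "second") and ("c", 3, "second")
print(drop_all_duplicated(rows)) # keeps only ("b", 2, "only")
```

The same three outcomes are what the Spark window snippets above produce, with the window's orderBy playing the role of the input order here.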