Handling Duplicate Data

In this tutorial, we will look at some common methods for handling duplicate rows in a PySpark DataFrame.

  • Removing duplicate rows based on all columns:

df = df.distinct()
  • Removing duplicate rows based on specific columns:

df = df.dropDuplicates(["column1", "column2"])
  • Removing duplicate rows based on specific columns, keeping the first occurrence: dropDuplicates() already does this by default. Note that, unlike pandas' drop_duplicates(), PySpark's dropDuplicates() does not accept a keep argument:

df = df.dropDuplicates(["column1", "column2"])
  • Keeping the last occurrence instead: since dropDuplicates() cannot do this directly, rank the rows within each group of duplicates with a window function and keep only the top-ranked row (here order_col is whatever column defines "last", e.g. a timestamp):

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("column1", "column2").orderBy(F.desc("order_col"))
df = df.withColumn("rn", F.row_number().over(w)).filter(F.col("rn") == 1).drop("rn")
  • Dropping every row that has a duplicate (the equivalent of pandas' keep=False): count the rows in each group and keep only the groups of size one:

w = Window.partitionBy("column1", "column2")
df = df.withColumn("cnt", F.count("*").over(w)).filter(F.col("cnt") == 1).drop("cnt")

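To make the first/last/drop-all semantics above concrete without needing a running Spark cluster, here is a minimal sketch of the same logic in plain Python over a list of records. The function name dedupe, the keys argument, and the sample records are illustrative, not part of any Spark API:

```python
from collections import Counter

records = [
    {"column1": "a", "column2": 1, "value": 10},
    {"column1": "a", "column2": 1, "value": 20},
    {"column1": "b", "column2": 2, "value": 30},
]

def dedupe(rows, keys, keep="first"):
    """Drop duplicate rows by `keys`; keep 'first', 'last', or False (drop all)."""
    key_of = lambda r: tuple(r[k] for k in keys)
    if keep is False:
        # Keep only rows whose key appears exactly once.
        counts = Counter(key_of(r) for r in rows)
        return [r for r in rows if counts[key_of(r)] == 1]
    # Scan forward for 'first', backward for 'last', keeping the first hit per key.
    source = rows if keep == "first" else list(reversed(rows))
    seen, out = set(), []
    for r in source:
        if key_of(r) not in seen:
            seen.add(key_of(r))
            out.append(r)
    return out if keep == "first" else out[::-1]

print(dedupe(records, ["column1", "column2"], keep="first"))  # keeps value 10 for ("a", 1)
print(dedupe(records, ["column1", "column2"], keep="last"))   # keeps value 20 for ("a", 1)
print(dedupe(records, ["column1", "column2"], keep=False))    # only the ("b", 2) row survives
```

This mirrors what the PySpark window-function patterns do at cluster scale: ranking within each key group and filtering on the rank.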
It's worth noting that dropDuplicates() called with no arguments is equivalent to distinct(), so you can use either of those forms depending on your preference, and it's always good to check the documentation for the latest updates and options.
