Handling Duplicate Data
This tutorial covers common methods for handling duplicate data in a DataFrame.
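# PySpark: remove rows that are identical across all columns
df = df.distinct()
# PySpark: remove rows that share the same values in the listed columns
df = df.dropDuplicates(["column1", "column2"])
# keep= is a pandas option; PySpark's dropDuplicates() does not accept it.
# The calls below assume df is a pandas DataFrame.
df = df.drop_duplicates(subset=["column1", "column2"], keep="first")  # keep the first occurrence
df = df.drop_duplicates(subset=["column1", "column2"], keep="last")   # keep the last occurrence
df = df.drop_duplicates(subset=["column1", "column2"], keep=False)    # drop every duplicated row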
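The three keep values used above ('first', 'last', False) are easy to mix up, so here is a minimal plain-Python sketch of what each one means. It does not use any DataFrame library: the helper drop_duplicates and the sample rows are hypothetical, written only for illustration, with rows modeled as a list of dicts.

```python
from collections import Counter


def drop_duplicates(rows, subset, keep="first"):
    """Illustrative helper (not a library API): deduplicate a list of dicts.

    keep="first" keeps the first row per key, keep="last" keeps the last,
    and keep=False drops every row whose key occurs more than once.
    """
    def key(row):
        return tuple(row[col] for col in subset)

    if keep is False:
        counts = Counter(key(row) for row in rows)
        return [row for row in rows if counts[key(row)] == 1]
    if keep == "last":
        # Keep the first occurrence in the reversed list, then restore order.
        return drop_duplicates(rows[::-1], subset, keep="first")[::-1]
    if keep != "first":
        raise ValueError("keep must be 'first', 'last', or False")

    seen = set()
    result = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            result.append(row)
    return result


# Hypothetical sample data: the first two rows share the same key columns.
rows = [
    {"column1": 1, "column2": "a", "value": 10},
    {"column1": 1, "column2": "a", "value": 20},
    {"column1": 2, "column2": "b", "value": 30},
]
subset = ["column1", "column2"]

first = drop_duplicates(rows, subset, keep="first")
last = drop_duplicates(rows, subset, keep="last")
unique = drop_duplicates(rows, subset, keep=False)

print([r["value"] for r in first])   # [10, 30]
print([r["value"] for r in last])    # [20, 30]
print([r["value"] for r in unique])  # [30]
```

"First" and "last" only make sense when the rows have a defined order. A list has one, but a distributed DataFrame generally does not, so an explicit ordering column is usually needed before choosing which duplicate to keep.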