Handling Duplicate Data

In this tutorial, we will look at some common methods for handling duplicate data in PySpark.


  • Removing duplicate rows based on all columns:

df = df.distinct()
  • Removing duplicate rows based on specific columns:

df = df.dropDuplicates(["column1", "column2"])
  • Keeping the first or last occurrence: note that, unlike pandas' drop_duplicates(), PySpark's dropDuplicates() does not accept a keep argument. For each group of duplicates it keeps exactly one row, but which one is not guaranteed, since Spark DataFrames have no inherent row order. If you need "keep the first occurrence" or "keep the last occurrence" semantics, define the ordering explicitly with a window function, as shown in the sketch after this list.

It's worth noting that calling dropDuplicates() with no arguments is equivalent to distinct(), so you can use either depending on your preference. It's always a good idea to check the documentation for the latest updates and options.
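As a quick sanity check, here is a minimal sketch (with made-up sample data) showing that dropDuplicates() with no arguments gives the same result as distinct(), while passing a subset changes what counts as a duplicate:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("a", 1), ("a", 2), ("b", 2)],
    ["column1", "column2"],
)

print(df.distinct().count())        # 3 -- all columns considered
print(df.dropDuplicates().count())  # 3 -- same result with no arguments

# With a subset, only the named columns decide what is a duplicate:
print(df.dropDuplicates(["column1"]).count())  # 2 -- one row per column1 value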