# Handling Duplicate Data

* Removing duplicate rows based on all columns:

```python
df = df.distinct()
```

* Removing duplicate rows based on specific columns (one arbitrary row is kept per group of duplicates):

```python
df = df.dropDuplicates(["column1", "column2"])
```

* Removing duplicate rows based on specific columns, keeping only the first occurrence. PySpark's `dropDuplicates()` takes no `keep` argument (that is pandas `drop_duplicates` syntax); instead, use a window ordered by a column that defines "first" (here an assumed `order_col`, e.g. a timestamp):

```python
from pyspark.sql import Window, functions as F

# Rank rows within each duplicate group and keep the earliest one
w = Window.partitionBy("column1", "column2").orderBy("order_col")
df = df.withColumn("_rn", F.row_number().over(w)).filter("_rn = 1").drop("_rn")
```

* Removing duplicate rows based on specific columns, keeping only the last occurrence. Same pattern, with the window ordered in descending order:

```python
from pyspark.sql import Window, functions as F

# Rank rows within each duplicate group and keep the latest one
w = Window.partitionBy("column1", "column2").orderBy(F.col("order_col").desc())
df = df.withColumn("_rn", F.row_number().over(w)).filter("_rn = 1").drop("_rn")
```

* Removing every row that appears more than once in the given columns (what pandas expresses as `keep=False`). Count rows per group and keep only the groups of size one:

```python
from pyspark.sql import Window, functions as F

# Keep only rows whose (column1, column2) combination is unique
w = Window.partitionBy("column1", "column2")
df = df.withColumn("_cnt", F.count("*").over(w)).filter("_cnt = 1").drop("_cnt")
```

It's worth noting that `dropDuplicates()` with no arguments behaves the same as `distinct()`, so you can use either of these functions depending on your preference; once you need to deduplicate on a subset of columns, only `dropDuplicates()` applies. It's always good to check the [documentation](https://docs.databricks.com/) for the latest updates and options.
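The three "keep" policies above can be illustrated in plain Python, outside of Spark. This is only a sketch of the semantics on an in-memory list of rows (the column names `column1`, `column2`, and `order_col` mirror the snippets above), not something you would run on a real DataFrame:

```python
from collections import Counter

# Toy data: the two "a" rows are duplicates on (column1, column2)
rows = [
    {"column1": "a", "column2": 1, "order_col": 10},
    {"column1": "a", "column2": 1, "order_col": 20},
    {"column1": "b", "column2": 2, "order_col": 30},
]

def key(r):
    return (r["column1"], r["column2"])

def keep_first(rows):
    # First occurrence by order_col wins within each duplicate group
    seen = {}
    for r in sorted(rows, key=lambda r: r["order_col"]):
        seen.setdefault(key(r), r)
    return list(seen.values())

def keep_last(rows):
    # Last occurrence by order_col overwrites earlier ones
    seen = {}
    for r in sorted(rows, key=lambda r: r["order_col"]):
        seen[key(r)] = r
    return list(seen.values())

def drop_all_duplicated(rows):
    # Discard every row whose key occurs more than once
    counts = Counter(key(r) for r in rows)
    return [r for r in rows if counts[key(r)] == 1]

print([r["order_col"] for r in keep_first(rows)])          # [10, 30]
print([r["order_col"] for r in keep_last(rows)])           # [20, 30]
print([r["order_col"] for r in drop_all_duplicated(rows)]) # [30]
```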
