# PySpark Transformation Methods

* Data transformations are operations that create a new DataFrame from an existing one. PySpark provides a variety of transformation methods for manipulating the data in a DataFrame. Transformations are lazy: Spark only executes them when an action such as `show()` or `count()` is called.
* One of the most commonly used transformation methods is `select()`, which lets you pick specific columns from a DataFrame. You can pass one or more columns by name or as `Column` expressions.

```python
df.select("column1", "column2")
```
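
For concreteness, the sketches added throughout this page reuse one small illustrative DataFrame (the app name, data, and column names below are made up for the example):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()

# Illustrative sample data; in practice you'd read from a real source
df = spark.createDataFrame(
    [(1, "alice", 34), (2, "bob", 45), (3, "carol", 29)],
    ["id", "name", "age"],
)

# Columns can be selected by name or as Column expressions
df.select("id", "name")
df.select(col("id"), col("name").alias("full_name"))
```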

* Another commonly used transformation method is `filter()`, which keeps only the rows that satisfy a condition. You pass the condition as a boolean `Column` expression (or a SQL expression string).

```python
df.filter(df.column_name > 10)
```
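
A common gotcha when filtering: multiple conditions must be combined with `&`, `|`, and `~` (not Python's `and`/`or`), and each condition needs its own parentheses. A sketch against the sample `df` above:

```python
from pyspark.sql.functions import col

# Parenthesize each condition when combining with & / | / ~
df.filter((col("age") > 30) & (col("name") != "bob"))

# where() is an alias for filter()
df.where(col("age") > 30)
```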

* You can also use `groupBy()` to group the DataFrame by one or more columns and apply aggregate functions such as `count`, `sum`, and `avg` to each group. The aggregate functions live in `pyspark.sql.functions` and must be imported.

```python
from pyspark.sql.functions import avg, max  # note: this shadows Python's builtin max

df.groupBy("column1", "column2").agg(avg("column3"), max("column4"))
```
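
By default, aggregate columns get generated names like `avg(age)`; aliasing them keeps downstream code readable. A sketch reusing the sample `df` from above:

```python
from pyspark.sql.functions import avg, count

# alias() gives aggregate columns readable names
df.groupBy("name").agg(
    avg("age").alias("avg_age"),
    count("*").alias("n_rows"),
)
```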

* The `join()` method lets you join two DataFrames on one or more key columns. It supports several join types, including inner, outer, left, and right joins.

```python
df1.join(df2, "column_name", "inner")
```
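
When the key columns have different names in the two DataFrames, you can pass an explicit join expression instead of a column name. A sketch using a second, made-up DataFrame alongside the sample `df` and `spark` session from above:

```python
# A second illustrative DataFrame with a differently named key
cities = spark.createDataFrame([(1, "NY"), (3, "LA")], ["person_id", "city"])

# Explicit join expression when the key names differ
df.join(cities, df.id == cities.person_id, "left")
```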

* The `drop()` method lets you drop one or more columns from a DataFrame.

```python
df.drop("column1", "column2")
```
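
Note that `drop()` silently ignores column names that don't exist, so it never raises for a missing column:

```python
# No error even if "missing_col" is absent from df
df.drop("missing_col")
```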

* The `withColumn()` method adds a new column to a DataFrame or replaces an existing one.

```python
df.withColumn("new_column", df.column1 + df.column2)
```
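
Two details worth knowing: constant values must be wrapped in `lit()`, and reusing an existing column name overwrites that column. A sketch against the sample `df`:

```python
from pyspark.sql.functions import col, lit

# Constants must be wrapped in lit()
df.withColumn("source", lit("demo"))

# Reusing an existing name replaces that column
df.withColumn("age", col("age") + 1)
```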

* The `sort()` method sorts the DataFrame by one or more columns in ascending or descending order.

```python
df.sort(df.column1.desc())
```
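
Sort keys can mix directions, and `orderBy()` is an alias for `sort()`. A sketch using the sample columns:

```python
from pyspark.sql.functions import col

# Ascending by name, then descending by age within each name
df.sort(col("name").asc(), col("age").desc())
```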

* The `distinct()` method removes duplicate rows from a DataFrame.

```python
df.distinct()
```
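
`distinct()` compares entire rows. If you only need uniqueness over certain columns, the related `dropDuplicates()` method accepts a column subset:

```python
# Keep one row per distinct name (the other column values
# come from an arbitrary surviving row)
df.dropDuplicates(["name"])
```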

* The `limit()` method restricts a DataFrame to at most the given number of rows.

```python
df.limit(10)
```
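
Like every transformation here, `limit()` is lazy; nothing is computed until an action runs:

```python
# limit() only takes effect once an action (here show()) runs
df.limit(10).show()
```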

* The `repartition()` method changes the number of partitions in a DataFrame, which is useful when you want to increase or decrease parallelism. Note that it performs a full shuffle of the data.

```python
df.repartition(10)
```
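
`repartition()` can also partition by one or more columns, so that rows sharing a key land in the same partition. A sketch using the sample `df`:

```python
# Hash-partition by "id" into 8 partitions
df.repartition(8, "id")
```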

* The `coalesce()` method decreases the number of partitions in a DataFrame. Unlike `repartition()`, it avoids a full shuffle, making it the cheaper option when you only need fewer partitions.

```python
df.coalesce(5)
```
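
A common use is shrinking output before a write, for example to produce a single file (the output path below is illustrative):

```python
# coalesce(1) yields a single output file; path is made up
df.coalesce(1).write.mode("overwrite").csv("/tmp/demo_output")
```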

* The `cast()` method changes the data type of a column. It is a method on `Column`, not on DataFrame, so it is typically combined with `withColumn()`.

```python
from pyspark.sql.functions import col
df = df.withColumn("column1", col("column1").cast("int"))
```
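
Instead of a type name string, you can pass a `DataType` instance from `pyspark.sql.types`:

```python
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

# Equivalent to .cast("int")
df.withColumn("column1", col("column1").cast(IntegerType()))
```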

* The `replace()` method lets you replace specific values in a DataFrame.

```python
df = df.replace(["value1", "value2"], ["new_value1", "new_value2"], subset="column_name")
```
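
`replace()` also accepts a dict mapping old values to new ones, in which case the second argument is omitted:

```python
# Dict form: each key is replaced by its mapped value
df.replace({"value1": "new_value1", "value2": "new_value2"}, subset="column_name")
```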

* The `dropna()` method drops rows containing null values from a DataFrame.

```python
df.dropna()
```
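
`dropna()` takes `how`, `thresh`, and `subset` parameters to control exactly which rows are dropped. A sketch against the sample `df`:

```python
df.dropna(subset=["age"])   # only consider nulls in "age"
df.dropna(how="all")        # drop rows where every column is null
df.dropna(thresh=2)         # keep rows with at least 2 non-null values
```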

* The `fillna()` method fills null values with a specified value.

```python
df.fillna(0, subset="column_name")
```
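
A dict form fills each column with its own default (sketched here with the sample columns):

```python
# Per-column defaults; each value's type must match its column
df.fillna({"age": 0, "name": "unknown"})
```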

These are some of the most commonly used data transformation methods in PySpark. Many more methods and options are available depending on the use case, so it's worth checking the [documentation](https://docs.databricks.com/) for the latest updates and options.

