# Bucketing

Bucketing is a feature in PySpark that groups rows into a fixed number of "buckets" to improve query performance. Rows are assigned to buckets by hashing one or more columns in your DataFrame, so rows with the same value in those columns always land in the same bucket. Each bucket is written to its own file or set of files in the underlying file system. Because Spark knows how the data is laid out, it can often avoid a shuffle when joining or aggregating on the bucketing columns, and for equality filters on those columns it can read only the matching buckets instead of scanning the entire data set.

Fixed-size hash-based bucketing is a technique in which data is divided into a fixed number of buckets based on a hash of the values in a specific column or columns. Each value is hashed, and the hash value determines which bucket the row belongs to. The number of buckets is fixed when the table is written. Buckets end up with roughly the same number of rows only when the hashed values are spread evenly; a column dominated by a few values produces uneven buckets.
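The assignment rule described above can be sketched in plain Python. This is only an illustration: `bucket_for` is a hypothetical helper, and CRC32 is a stand-in for Spark's internal hash function, so the bucket numbers it produces will not match what Spark writes.

```python
import zlib


def bucket_for(value, num_buckets):
    """Return a bucket index in the range [0, num_buckets) for a value.

    CRC32 is an illustrative stand-in; Spark uses its own hash function.
    """
    digest = zlib.crc32(str(value).encode("utf-8"))
    return digest % num_buckets


# Hypothetical user IDs, some of them repeated.
user_ids = [101, 102, 103, 101, 104, 102]

# Equal values always map to the same bucket, which is what keeps
# related rows together.
assignments = [(uid, bucket_for(uid, 4)) for uid in user_ids]
for uid, bucket in assignments:
    print(f"user_id={uid} -> bucket {bucket}")
```

The key property is that the mapping is deterministic: every row with the same `user_id` ends up in the same bucket, no matter which task writes it.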

This technique is useful when we want to divide data into a small, fixed number of buckets, while still keeping related data together. For example, in a data analysis use case, we may want to divide a large dataset into a small number of buckets based on specific values of a certain column, such as user IDs or timestamps. With fixed-size hash-based bucketing, we can quickly and efficiently retrieve data for specific buckets and process the data within those buckets.

Here's an example of how you might write a PySpark DataFrame as a bucketed table:

```python
# `df` is an existing DataFrame; bucketing needs no extra imports.
# saveAsTable() writes the table and returns None, so don't assign the result.
df.write.bucketBy(10, "column1").sortBy("column2").saveAsTable("table_name")
```

In this example, we're grouping the data into 10 buckets based on the values in the "column1" column. The data within each bucket is then sorted by the values in the "column2" column.
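The same write can be wrapped in a small helper. The sketch below is one possible shape, not a standard API: `write_bucketed` and its parameters are hypothetical names, `df` is assumed to be a `pyspark.sql.DataFrame`, and the `"overwrite"` save mode is an assumption chosen for the example.

```python
def write_bucketed(df, table_name, num_buckets, bucket_col, sort_col=None):
    """Save `df` as a bucketed table (hypothetical helper).

    `df` is assumed to be a pyspark.sql.DataFrame. Sorting within each
    bucket is applied only when `sort_col` is given.
    """
    writer = df.write.bucketBy(num_buckets, bucket_col)
    if sort_col is not None:
        writer = writer.sortBy(sort_col)
    # "overwrite" is an assumption for this sketch; pick the mode you need.
    writer.mode("overwrite").saveAsTable(table_name)


# Usage, assuming `df` already exists in an active Spark session:
# write_bucketed(df, "table_name", 10, "column1", sort_col="column2")
```

The helper returns nothing, mirroring `saveAsTable`, so the original DataFrame stays available for further work.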

Bucketing is most effective when the buckets are roughly the same size. To ensure this, choose the number of buckets carefully and check how the values are distributed in the columns you use for bucketing: a column where a few values account for most rows will overload a few buckets.
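To see why the distribution matters, the sketch below counts how many rows land in each bucket for an evenly spread key and for a skewed one. It is a plain-Python simulation on made-up data: `bucket_sizes` is a hypothetical helper, and CRC32 again stands in for Spark's internal hash, so treat the numbers as illustrative only.

```python
import zlib
from collections import Counter


def bucket_sizes(keys, num_buckets):
    """Count rows per bucket using a stand-in hash (CRC32, not Spark's)."""
    counts = Counter(
        zlib.crc32(str(key).encode("utf-8")) % num_buckets for key in keys
    )
    return [counts.get(bucket, 0) for bucket in range(num_buckets)]


# 10,000 distinct keys: rows should spread fairly evenly.
even_keys = list(range(10_000))

# One key accounts for 9,000 of 10,000 rows: a single bucket absorbs them all.
skewed_keys = [0] * 9_000 + list(range(1, 1_001))

even = bucket_sizes(even_keys, 10)
skewed = bucket_sizes(skewed_keys, 10)

print("even:  ", even)
print("skewed:", skewed)
```

Running a similar count on a sample of your real bucketing column before writing the table is a cheap way to spot skew early.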

{% hint style="info" %}
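Bucketing only takes effect when the data is saved as a table with `saveAsTable`; it is not available when writing directly to a path with `save`. It is not limited to Parquet: other file-based formats such as ORC can be bucketed too. Sorting with `sortBy` is optional, but `sortBy` can only be used together with `bucketBy`.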
{% endhint %}

