PySpark in Databricks

In this tutorial, we will see how to get started with Databricks.

Databricks offers two platforms: the full, paid platform and the free Community Edition.

In this tutorial, we will be focusing on the Databricks Community Edition.

To get started, follow these steps:

Step 1: Go to the Databricks Community Platform website (https://community.cloud.databricks.com/) and click on the "Sign Up" button to create a new account.

Step 2: Fill in the required information to create an account, such as your name, email address, and password.

Step 3: Verify your email address by clicking on the link sent to your email.

Step 4: Once your account is created, you will be redirected to the Databricks Community Platform dashboard.

Step 5: To start working with Databricks, you will need to create a new workspace. Click on the "Workspaces" button on the left sidebar, and then click on the "Create Workspace" button.

Step 6: Give your workspace a name and select the "Community Edition" plan.

Step 7: Once your workspace is created, you will be taken to the workspace dashboard.

Step 8: Now you can create a new cluster by clicking on the "Clusters" button on the left sidebar and then clicking on the "Create Cluster" button.

Step 9: Give your cluster a name, select the appropriate settings and click on the "Create Cluster" button.

Step 10: Once the cluster is created, you can create a new notebook by clicking on the "Workspaces" button on the left sidebar, then clicking the "Create" button and selecting "Notebook".

Step 11: Give your notebook a name, select the language (Python or Scala), and attach the cluster you created to the notebook.

Step 12: You can now start writing and running code in your notebook using PySpark or Scala, as in the short example below.
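
For example, once your notebook is attached to a running cluster, a minimal sanity-check cell like the one below (the sample data is arbitrary) should run without any further setup, because Databricks preloads a Spark session named spark:

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()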

Please note that the Community Edition of Databricks has some limitations, such as restrictions on how long a cluster can run, and certain features may not be available.

That's it! You are now ready to start working with the Databricks Community Platform. Remember that you can always refer to the Databricks documentation for more information and help on specific features and functionality. Additionally, you can always reach out to Databricks community support for any questions or issues you may encounter.

It's important to note that, as a Community Edition user, you may not have access to all the features available in the paid version, but it's still a great way to get started and learn about the platform. The Community Edition should be enough for small-scale data processing and experimentation.

A basic example

  • To begin working with PySpark in Databricks, you'll first need to create a new notebook and attach it to a Spark cluster. This can be done by clicking on the "New Notebook" button in the Databricks workspace and selecting "Python" as the language.

  • Once you have a new notebook open, you can start working with PySpark by importing the modules you need. The core library is pyspark, and in particular the pyspark.sql module, which provides the DataFrame API.

  • Databricks notebooks come with a SparkSession preloaded under the name "spark", so you do not need to create one yourself.
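
  • For example, you can confirm the preloaded session and import the commonly used functions module (a minimal sketch; the alias F is simply a convention):

from pyspark.sql import functions as F

print(spark.version)   # Spark version of the cluster attached to the notebook
print(type(spark))     # confirms this is a SparkSession object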

  • To read data from a file (e.g. a CSV), you can use the spark.read interface, which returns a DataFrame:

df = spark.read.csv("path_to_file.csv", header=True, inferSchema=True)
  • Once you have a DataFrame, you can perform various transformations and operations on it, such as selecting columns, filtering rows, and grouping data. Each of these returns a new DataFrame rather than modifying df in place. For example:

df.select("column1", "column2")   # keep only these two columns
df.filter(df["column1"] > 10)     # keep rows where column1 is greater than 10
df.groupBy("column1").mean()      # mean of the numeric columns, per value of column1
  • To write the DataFrame to a file, you can use the following command:

df.write.parquet("path_to_file.parquet")   # saves the DataFrame in Parquet format at the given path
  • Finally, you can use the show() method to display the results of your analysis. For example:

df.show()   # prints the first 20 rows of the DataFrame in tabular form

These are the basic commands you need to get started with PySpark in Databricks. There are many more functions and options available, and it's always good to check the documentation for the latest updates.
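
Putting the pieces together, here is a small end-to-end sketch. It assumes a hypothetical CSV file at /FileStore/tables/sales.csv with columns region and amount; adjust the path and column names to match your own data:

from pyspark.sql import functions as F

# Read the CSV into a DataFrame, using the first row as the header
# and letting Spark infer the column types (hypothetical path and columns).
df = spark.read.csv("/FileStore/tables/sales.csv", header=True, inferSchema=True)

# Keep rows with a positive amount, then compute the average amount per region.
summary = (
    df.filter(F.col("amount") > 0)
      .groupBy("region")
      .agg(F.avg("amount").alias("avg_amount"))
)

summary.show()

# Persist the result as Parquet; mode("overwrite") replaces any previous output.
summary.write.mode("overwrite").parquet("/FileStore/tables/sales_summary.parquet")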
