# Spark Streaming

## What to expect

This guide covers setting up your Spark development environment, the theoretical and practical concepts behind Spark, and big data in general.

We will also do some fun projects along the way.

It is a one-stop resource for learning and practicing the Spark framework. So sit back and follow along.

### Installation

{% tabs %}
{% tab title="Windows user" %}

#### 1. Install Java 8 <a href="#ftoc-heading-2" id="ftoc-heading-2"></a>

Check whether Java 8 is already installed on your system:

```bash
java -version
```

If Java is installed, it will respond with the following output:

![](/files/-MhKBozGubkqXdF_Ind8)

If it is not, you need to install Java 8.

To install Java 8, visit the following link and click the download button:

{% embed url="<https://java.com/en/download/>" %}

{% hint style="warning" %}
This guide uses Java 8, which Spark 3.0.3 supports. Other Java versions may not work well with this setup, so please use the link above.
{% endhint %}

Once downloaded, double-click on the file and complete the installation.

After the installation is complete, open a new command prompt and check the Java version as follows:

```bash
java -version
```
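The version check can also be scripted. The following is a minimal Python sketch, assuming Python is available on the machine (it is installed in the next step). The helper names `parse_java_major` and `installed_java_major` are illustrative and not part of any library. The sketch relies on the fact that Java 8 reports itself as `1.8.x`, while later releases report `11.x`, `17.x`, and so on.

```python
import re
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier use the 1.x scheme, e.g. "1.8.0_251" means Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major() -> Optional[int]:
    """Run `java -version` and return the major version, or None if Java is missing."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None  # `java` is not on the Path
    # `java -version` prints its report to stderr, not stdout.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    print("Java not found" if major is None else f"Java major version: {major}")
```

If the script reports a major version other than 8, the installed Java is not the one this guide assumes.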

#### 2. Install Python 3

First, check whether Python 3 is already installed.

```bash
python --version
```

The above command should display a result like this:

![](/files/-MhKDdtRe778EPs1oW7q)

If the Python version does not show up and you get an error, double-check that Python is properly installed and that the Python path is added to the environment variables.

To install the latest Python version, visit the following link and download the installer.

{% embed url="<https://www.python.org/>" %}

![](/files/-MhKK2EuYWTk0C3K8Nv9)

Once downloaded, run the installer.

During the installation, make sure you check the option that adds the Python path to the environment variables, as shown below:

![](/files/uYAOqe3a8z5yBBd01EE3)

![](/files/cozxSeur0mpTTaRA4wmW)

> *Make sure that Python is installed with the "**all users**" option.*

![](/files/KOa5eQP9kjcYfBnJKmc4)

Now open a new command prompt and run the following command to check the Python version.

```bash
python --version
```

The output should show the Python version.

#### 3. Install Spark

Use the following command to install Spark through the `pyspark` package.

```bash
pip install pyspark
```
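To confirm that the package actually landed in the Python environment you are using, you can query the installed package metadata. This is a small sketch using only the standard library. The helper name `installed_version` is hypothetical, and the check only tells you that pip registered the package, not that Spark can start.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("pyspark")
    if version is None:
        print("pyspark is not installed in this Python environment")
    else:
        print(f"pyspark {version} is installed")
```

Run it with the same `python` command you used for the version check, so that it inspects the same environment pip installed into.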

If the `pip install pyspark` command does not work, use the manual installation steps below.

### Spark Manual Installation (only if the pip command in step 3 doesn't work)

#### 1: Download Setup

Open the following link

{% embed url="<https://spark.apache.org/downloads.html>" %}

Under the **Download Apache Spark** heading, there are two drop-down menus. Use a non-preview release; this guide uses 3.0.3.

In the **Choose a Spark release** drop-down menu, select **3.0.3 (Jun 23, 2021)**. In the second drop-down menu, **Choose a package type**, leave the selection at **Pre-built for Apache Hadoop 2.7**.

Click the **spark-3.0.3-bin-hadoop2.7.tgz** link.

![](/files/-MjTcWNJXZbzffGyJkWy)

A page with a list of mirror links loads where you can see different servers to download from. Pick any from the list and save the file.

#### 2: Install Apache Spark <a href="#ftoc-heading-6" id="ftoc-heading-6"></a>

Create a new folder named Spark in the root of your C: drive. From a command line, enter the following:

```bash
cd \

mkdir Spark
```

In Explorer, locate the Spark file you downloaded.

Right-click the file and extract it to C:\Spark using the tool you have on your system (e.g., 7-Zip).

#### 3: Add winutils.exe File <a href="#ftoc-heading-7" id="ftoc-heading-7"></a>

Navigate to <https://github.com/cdarlint/winutils>.

Select the folder that matches the Hadoop version of your Spark download.

Then, inside that folder's **bin** folder, locate winutils.exe and click it.

![](/files/-MjTfhECfVZPA-_rdlW8)

Find the **Download** button on the right side to download the file.

Create a new folder named **hadoop** on C: and a **bin** folder inside it, using Windows Explorer or the Command Prompt.

Copy the winutils.exe file from the Downloads folder to **C:\hadoop\bin**.
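A quick way to confirm the file ended up in the right place is to build the expected path and test for it. This is a sketch only: the helper names `winutils_location` and `has_winutils` are illustrative, and `C:\hadoop` is the folder assumed by this guide.

```python
import os
from pathlib import PureWindowsPath


def winutils_location(hadoop_home: str) -> PureWindowsPath:
    """Return the path where winutils.exe is expected inside the Hadoop folder."""
    return PureWindowsPath(hadoop_home) / "bin" / "winutils.exe"


def has_winutils(hadoop_home: str) -> bool:
    """Return True if winutils.exe exists under <hadoop_home>\\bin."""
    return os.path.isfile(str(winutils_location(hadoop_home)))


if __name__ == "__main__":
    # C:\hadoop is the folder created in this step; change it if yours differs.
    home = r"C:\hadoop"
    print(winutils_location(home), "found" if has_winutils(home) else "missing")
```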

#### 4: Configure Environment Variables <a href="#ftoc-heading-8" id="ftoc-heading-8"></a>

Click **Start** and type *environment*.

Select the result labeled ***Edit the system environment variables***.

A System Properties dialog box appears. In the lower-right corner, click **Environment Variables** and then click **New** in the next window.

![](/files/-MjTgdyu88HHjadk7l8R)

For *Variable Name* type ***SPARK\_HOME***.

For *Variable Value* type **C:\Spark\spark-3.0.3-bin-hadoop2.7** and click OK. If you changed the folder path, use that one instead.

![](/files/-MjTh8ZPpw1rmDu5xqQ-)

In the top box, click the **Path** entry, then click **Edit**. Be careful when editing the system path, and avoid deleting any entries already on the list.

![](/files/-MjThHKFs5inIg64NO_Z)

You should see a box with entries on the left. On the right, click **New**.

The system highlights a new line. Enter the path to the Spark folder **%SPARK\_HOME%\bin**.

![](/files/-MjThRBFCjqV2t-slruI)

Repeat this process for Hadoop and Java.

* For Hadoop, the variable name is **HADOOP\_HOME** and the value is the path of the folder you created earlier: **C:\hadoop**. In the **Path** variable, add **%HADOOP\_HOME%\bin** (the literal **C:\hadoop\bin** also works, but we recommend the variable form).
* For Java, the variable name is **JAVA\_HOME** and the value is the path to your Java JDK directory (in our case it’s **C:\Program Files\Java\jdk1.8.0\_251**).

Click **OK** to close all open windows.
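The settings above can be sanity-checked from a new command prompt with a short script. The following sketch assumes the three variable names used in this guide and that each one's `bin` folder was added to **Path**. The function name `missing_settings` is hypothetical.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME")


def _normalize(path: str) -> str:
    # Windows paths are case-insensitive; ignore trailing separators too.
    return path.strip().replace("/", "\\").rstrip("\\").lower()


def missing_settings(env: Mapping[str, str], path_sep: str = ";") -> List[str]:
    """Return a list of problems found in the given environment (empty if none)."""
    raw_path = env.get("PATH") or env.get("Path") or ""
    path_entries = {_normalize(p) for p in raw_path.split(path_sep) if p.strip()}
    problems = []
    for name in REQUIRED_VARS:
        home = env.get(name)
        if not home:
            problems.append(name + " is not set")
            continue
        expanded = _normalize(home) + "\\bin"             # e.g. c:\hadoop\bin
        literal = _normalize("%" + name + "%") + "\\bin"  # unexpanded variable form
        if expanded not in path_entries and literal not in path_entries:
            problems.append("%" + name + "%\\bin is not on Path")
    return problems


if __name__ == "__main__":
    issues = missing_settings(os.environ, os.pathsep)
    print("All settings look fine" if not issues else "\n".join(issues))
```

Open a new command prompt before running it, because windows that were already open do not pick up changed environment variables.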

#### 5: Launch Spark <a href="#ftoc-heading-9" id="ftoc-heading-9"></a>

Open a new command prompt window by right-clicking the Command Prompt shortcut and selecting **Run as administrator**.

To start Spark, enter:

```bash
spark-shell
```

If the **Path** entries from the previous step are set correctly, the **`spark-shell`** command works from any directory.

The system should display several lines indicating the status of the application. You may get a Java pop-up. Select **Allow access** to continue.

Finally, the Spark logo appears, and the prompt displays the **Scala shell**.

![](/files/-MjTiRpMsOEufX0LoCWk)

Open a web browser and navigate to **<http://localhost:4040/>**.

You can replace **localhost** with the name of your system.

You should see an Apache Spark shell Web UI. The example below shows the *Executors* page.

![](/files/-MjTicX9_8OMMCYx6S8B)
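The same check can be done without a browser. This is a minimal sketch, assuming the default address `http://localhost:4040/` mentioned above and a Spark shell that is still running. The function name `spark_ui_is_up` is illustrative.

```python
import http.client
import urllib.request


def spark_ui_is_up(url: str = "http://localhost:4040/", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url` with status 200."""
    try:
        # urlopen follows redirects, so a redirect to another UI page still counts.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


if __name__ == "__main__":
    print("Spark UI reachable" if spark_ui_is_up() else "Spark UI not reachable")
```

A result of "not reachable" usually just means that no Spark shell is currently running on this machine.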

To exit Spark and close the Scala shell, press **`Ctrl-D`** in the command prompt window.
{% endtab %}

{% tab title="Linux user" %}

{% endtab %}
{% endtabs %}

