Spark Streaming

A to-the-point Instruction/Guide for any Spark enthusiast

What to expect

This guide covers setting up your Spark development environment, along with the theoretical and practical concepts behind Spark and big data.

We will also do some fun projects along the way.

It is one-stop documentation to learn and practice the Spark framework. So, just sit back and follow along.

Installation

1. Install Java 8

Check whether Java 8 is already installed on your system:

java -version

If Java is installed, it will respond with the following output:

If it is not, you need to install Java 8.

To install Java 8, visit the following link and click on the download button

This guide is written for Java 8. Spark 3.0.x supports only Java 8 and 11, so a newer Java version may not work well. Please use the link above.

Once downloaded, double-click on the file and complete the installation.

After the installation is complete, open a new command prompt and check the Java version again:

java -version
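The same check can be scripted once Python is available (see the next step). The following is only a minimal sketch: the helper names parse_java_major and installed_java_major are hypothetical, and it assumes the usual `version "..."` text printed by java -version, where Java 8 reports itself as 1.8.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` text.

    Java 8 and earlier report "1.8.0_251"; Java 9 and later report "11.0.2".
    Returns None when no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    if first == 1 and second is not None:
        return int(second)  # legacy "1.x" numbering: the second part is the major version
    return first


def installed_java_major():
    """Run `java -version` and return the major version, or None if it cannot be determined."""
    if shutil.which("java") is None:
        return None  # java is not on the PATH
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # `java -version` writes its report to stderr, not stdout.
    return parse_java_major(result.stderr + result.stdout)


print(parse_java_major('java version "1.8.0_251"'))  # prints 8
print("Java found on this machine:", installed_java_major())
```

A result of 8 from installed_java_major() means the command prompt is picking up Java 8; None usually means Java is missing from the PATH.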

2. Install Python 3

First, check whether Python 3 is already installed:

python --version

The above command should display a result like this:

If the Python version does not show up and you get an error, please double-check that Python is properly installed and that the Python path is added to the environment variables.

To install the latest Python version, visit the following link and download the installer.

Once downloaded, run the installer.

During the installation, make sure you check the option that adds the Python path to the environment variables, as shown below.

Make sure that Python is installed with the "all users" option.

Now open a new command prompt and run the following command to check the Python version:

python --version

The output should show the Python version.
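If the version shows up but you suspect the wrong interpreter is being used, a short script can report which Python is running and which one the PATH resolves to. This is a minimal sketch using only the standard library; the function name describe_python is hypothetical.

```python
import shutil
import sys


def describe_python():
    """Return details about the running interpreter and the `python` found on PATH."""
    return {
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "is_python3": sys.version_info.major == 3,
        "running_from": sys.executable,
        "python_on_path": shutil.which("python"),  # None if PATH has no `python`
    }


info = describe_python()
for key, value in info.items():
    print(f"{key}: {value}")
```

If python_on_path is None, the Python folder was probably not added to the environment variables during installation.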

3. Install Spark

Use the following command to install PySpark, which bundles Spark:

pip install pyspark
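To confirm that the pip installation worked, you can check that the package is importable and then run a tiny local job. The following is a sketch, not an official verification step: the function names and the app name install-check are hypothetical, and run_smoke_test assumes that both pyspark and a working Java installation are present.

```python
import importlib.util


def pyspark_installed():
    """Return True if the `pyspark` package can be found by this interpreter."""
    return importlib.util.find_spec("pyspark") is not None


def run_smoke_test():
    """Start a local Spark session, count five rows, and stop the session.

    Needs both pyspark and a working Java installation.
    """
    # Imported here so that pyspark_installed() still works when pyspark is missing.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")        # one local thread, no cluster needed
        .appName("install-check")  # hypothetical name, used only for illustration
        .getOrCreate()
    )
    try:
        return spark.range(5).count()  # DataFrame with ids 0..4
    finally:
        spark.stop()


print("pyspark importable:", pyspark_installed())
```

When pyspark is importable and Java is set up, calling run_smoke_test() should return 5. If the import check prints False, fall back to the manual installation.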

If the above command does not work, use the manual installation steps below.

Spark Manual Installation (only if the pip command in step 3 doesn't work)

1: Download Setup

Open the following link

Under the Download Apache Spark heading, there are two drop-down menus. Use the current non-preview version.

In the Choose a Spark release drop-down menu, select 3.0.3 (Jun 23, 2021). In the second drop-down menu, Choose a package type, leave the selection as Pre-built for Apache Hadoop 2.7.

Click the spark-3.0.3-bin-hadoop2.7.tgz link.

A page with a list of mirror links loads where you can see different servers to download from. Pick any from the list and save the file.

2: Install Apache Spark

Create a new folder named Spark in the root of your C: drive. From a command line, enter the following:

cd \

mkdir Spark

In Explorer, locate the Spark file you downloaded.

Right-click the file and extract it to C:\Spark using the tool you have on your system (e.g., 7-Zip).
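If no archive tool is installed, Python's standard tarfile module can extract the .tgz file as well. This is a minimal sketch under that assumption; the function name extract_spark is hypothetical, and the archive location on your machine will differ.

```python
import tarfile
from pathlib import Path


def extract_spark(archive_path, dest_dir):
    """Extract a .tgz archive into dest_dir and return its top-level folder names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # creates the folder if it does not exist yet
    with tarfile.open(archive_path, "r:gz") as archive:
        top_level = sorted(
            {Path(name).parts[0] for name in archive.getnames() if Path(name).parts}
        )
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return top_level
```

For example, extract_spark(r"C:\Users\<you>\Downloads\spark-3.0.3-bin-hadoop2.7.tgz", r"C:\Spark") would unpack the download into C:\Spark; the Downloads path here is a placeholder for wherever you saved the file.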

3: Add winutils.exe File

Navigate to this URL https://github.com/cdarlint/winutils

Select the folder whose Hadoop version matches your Spark download (Hadoop 2.7 for the package chosen above).

Then, inside that folder's bin directory, locate winutils.exe and click it.

Find the Download button on the right side to download the file.

Create a new folder named hadoop in the root of your C: drive, and a bin folder inside it, using Windows Explorer or the Command Prompt.

Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.

4: Configure Environment Variables

Click Start and type environment.

Select the result labeled Edit the system environment variables.

A System Properties dialog box appears. In the lower-right corner, click Environment Variables and then click New in the next window.

For Variable Name type SPARK_HOME.

For Variable Value type C:\Spark\spark-3.0.3-bin-hadoop2.7 and click OK. If you changed the folder path, use that one instead.

In the top box, click the Path entry, then click Edit. Be careful with editing the system path. Avoid deleting any entries already on the list.

You should see a box with entries on the left. On the right, click New.

The system highlights a new line. Enter the path to the Spark folder %SPARK_HOME%\bin.

Repeat this process for Hadoop and Java.

  • For Hadoop, the variable name is HADOOP_HOME and the value is the path of the folder you created earlier: C:\hadoop. In the Path variable, add %HADOOP_HOME%\bin. Entering C:\hadoop\bin directly also works, but we recommend the variable form.

  • For Java, the variable name is JAVA_HOME and for the value use the path to your Java JDK directory (in our case it’s C:\Program Files\Java\jdk1.8.0_251).

Click OK to close all open windows.
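After closing the dialogs, open a new command prompt so the changes take effect. The variables can then be checked with a short script. This is a sketch only: the function name check_env is hypothetical, and it assumes each of the three folders contains a bin subfolder, as in the layout described above.

```python
import os
from pathlib import Path

# The three variables configured in this step.
VARIABLES = ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME")


def check_env(env=None):
    """Return a list of problems; an empty list means the variables look fine."""
    env = os.environ if env is None else env
    problems = []
    for name in VARIABLES:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not (Path(value) / "bin").is_dir():
            problems.append(f"{name} is set to {value}, but it has no bin folder")
    return problems


issues = check_env()
if issues:
    for issue in issues:
        print(issue)
else:
    print("SPARK_HOME, HADOOP_HOME and JAVA_HOME look fine")
```

The check only confirms that the folders exist. It does not verify that the Path entries were added, which is what the launch step tests.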

5: Launch Spark

Open a new command prompt window by right-clicking Command Prompt and selecting Run as administrator:

To start Spark, enter:

spark-shell

If the environment path is set correctly, this command works from any folder.

The system should display several lines indicating the status of the application. You may get a Java pop-up. Select Allow access to continue.

Finally, the Spark logo appears, and the prompt displays the Scala shell.

Open a web browser and navigate to http://localhost:4040/.

You can replace localhost with the name of your system.
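If the page does not load, it helps to check whether anything is listening on port 4040 while spark-shell is still running. This is a minimal sketch using only the standard library; the function name port_is_open is hypothetical.

```python
import socket


def port_is_open(host="localhost", port=4040, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and unknown hosts
        return False


print("Spark UI reachable:", port_is_open())
```

A result of False while the shell is open suggests the UI started on a different port or is being blocked, for example by a firewall.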

You should see an Apache Spark shell Web UI. The example below shows the Executors page.

To exit Spark and close the Scala shell, press Ctrl-D in the command prompt window.
