Spark Streaming
A to-the-point Instruction/Guide for any Spark enthusiast
What to expect
This guide covers setting up your Spark development environment, along with the theoretical and practical concepts behind Spark and big data.
We will also do some fun projects along the way.
It is one-stop documentation to learn and practice the Spark framework. So, just sit back and follow along.
Installation
1. Install Java 8
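Check whether Java 8 is already installed on your system by running java -version in a command prompt.
If Java is installed, the command prints the installed version. For Java 8, the version string starts with 1.8.
If it is not installed, you need to install Java 8.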
To install Java 8, visit the following link and click on the download button
This guide assumes Java 8. Spark 3.0.x also runs on Java 11, but newer Java versions may not work well with it, so please use the above link only.
Once downloaded, double-click on the file and complete the installation.
After installation is complete, open a new command prompt and check the Java version by running java -version.
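The version check above can also be scripted. The following is a minimal sketch, not part of the original guide: the helper names parse_java_major and installed_java_major are hypothetical, and it assumes the java launcher is on your Path.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_251").
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Run `java -version` and return the major version, or None if Java is missing."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except OSError:
        # The java launcher was not found or could not be started.
        return None
    # `java -version` writes its report to stderr rather than stdout.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    print("Java major version:", installed_java_major())
```

If the script prints 8, the Java 8 installation is visible from the command prompt. A result of None suggests Java is missing or not on the Path.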
2. Install Python 3
First, check whether Python 3 is already installed by running python --version in a command prompt.
The command should print the installed version, for example Python 3.x.y.
If the Python version does not show up and you get an error, please double-check that Python is properly installed and that the Python path is added to the environment variables.
To install the latest python version visit the following link and download the installer.
Once downloaded, run the installer.
During the installation, make sure you check the option that adds Python to the PATH environment variable.
Make sure that Python is installed with the "all users" option.
Now open a new command prompt and run python --version to check the Python version.
The output should show the installed Python version.
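The same check can be made from inside Python itself, which also confirms that the interpreter you launch is the one you just installed. This is a small sketch. The function name is_python3 is a hypothetical helper, not something the installer provides.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple describes a Python 3 interpreter."""
    return version_info[0] == 3


# Print the running interpreter's version and location.
print("Python", ".".join(str(part) for part in sys.version_info[:3]))
print("Interpreter path:", sys.executable)
print("Is Python 3:", is_python3())
```

The interpreter path is useful when several Python installations exist, because it shows which one the command prompt actually picked up.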
3. Install Spark
Use pip to install Spark by running pip install pyspark in a command prompt.
If the above command does not work then you can use the manual installation steps as mentioned below.
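Before falling back to the manual steps, it can help to confirm whether the pip installation actually succeeded. The sketch below assumes the package installed by pip is importable as pyspark. The helper name is_module_installed is hypothetical.

```python
import importlib.util


def is_module_installed(name):
    """Return True if a top-level module with this name can be found for import."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if is_module_installed("pyspark"):
        print("pyspark is installed; the manual installation is not needed.")
    else:
        print("pyspark was not found; continue with the manual installation.")
```

This only checks that the module can be located. It does not start Spark, so a working Java installation is still required afterwards.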
Spark Manual Installation (only if pip step 3 doesn't work)
1: Download Setup
Open the following link
Under the Download Apache Spark heading, there are two drop-down menus. Use the current non-preview version.
In the Choose a Spark release drop-down menu, select 3.0.3 (Jun 23, 2021). In the second drop-down menu, Choose a package type, keep the selection Pre-built for Apache Hadoop 2.7.
Click the spark-3.0.3-bin-hadoop2.7.tgz link.
A page with a list of mirror links loads where you can see different servers to download from. Pick any from the list and save the file.
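The package file name encodes both the Spark version and the Hadoop version it was built for, and the Hadoop version matters later when choosing winutils.exe. The sketch below reads both from the file name. It assumes the spark-X-bin-hadoopY.tgz naming pattern seen in this guide, and parse_spark_package is a hypothetical helper.

```python
import re


def parse_spark_package(filename):
    """Return (spark_version, hadoop_version) parsed from a package name, or None."""
    pattern = r"spark-(\d+(?:\.\d+)*)-bin-hadoop(\d+(?:\.\d+)*)\.tgz"
    match = re.fullmatch(pattern, filename)
    if match is None:
        # The name does not follow the assumed pattern.
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(parse_spark_package("spark-3.0.3-bin-hadoop2.7.tgz"))
```

For the file used in this guide, the function returns the Spark version 3.0.3 and the Hadoop version 2.7.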
2: Install Apache Spark
Create a new folder named Spark in the root of your C: drive. From a command prompt, you can do this by entering cd \ followed by mkdir Spark.
In Explorer, locate the Spark file you downloaded.
Right-click the file and extract it to C:\Spark using the tool you have on your system (e.g., 7-Zip).
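If no archive tool such as 7-Zip is available, Python's standard tarfile module can extract the .tgz file instead. This is a sketch under stated assumptions: the function name extract_spark_archive is hypothetical, and the download location in the usage line is only an example that you should adjust.

```python
import tarfile
from pathlib import Path


def extract_spark_archive(archive_path, target_dir):
    """Extract a gzip-compressed tar archive into target_dir and list its top-level entries."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters; "data" blocks unsafe members.
            archive.extractall(target, filter="data")
        else:
            archive.extractall(target)
    return sorted(entry.name for entry in target.iterdir())


if __name__ == "__main__":
    # Assumed download location; change it to wherever you saved the file.
    downloaded = Path.home() / "Downloads" / "spark-3.0.3-bin-hadoop2.7.tgz"
    if downloaded.exists():
        print(extract_spark_archive(downloaded, r"C:\Spark"))
    else:
        print("Archive not found at", downloaded)
```

After extraction, C:\Spark should contain a folder named after the archive, which is the folder SPARK_HOME points to later.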
3: Add winutils.exe File
Navigate to this URL https://github.com/cdarlint/winutils
Select the folder whose Hadoop version matches the Hadoop version of your Spark download.
Then, inside that folder's bin folder, locate winutils.exe and click it.
Find the Download button on the right side to download the file.
Create a new folder named hadoop on C:, and a folder named bin inside it, using Windows Explorer or the Command Prompt.
Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.
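The folder creation and copy steps above can be sketched in a few lines of Python. The function name install_winutils is hypothetical, and the Downloads location in the usage line is an assumption about where your browser saved the file.

```python
import shutil
from pathlib import Path


def install_winutils(downloaded_file, hadoop_home):
    """Create <hadoop_home>/bin if needed and copy winutils.exe into it."""
    bin_dir = Path(hadoop_home) / "bin"
    bin_dir.mkdir(parents=True, exist_ok=True)
    destination = bin_dir / "winutils.exe"
    shutil.copy2(downloaded_file, destination)
    return destination


if __name__ == "__main__":
    # Assumed download location; adjust it if your browser saves files elsewhere.
    source = Path.home() / "Downloads" / "winutils.exe"
    if source.exists():
        print("Copied to", install_winutils(source, r"C:\hadoop"))
    else:
        print("winutils.exe not found at", source)
```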
4: Configure Environment Variables
Click Start and type environment.
Select the result labeled Edit the system environment variables.
A System Properties dialog box appears. In the lower-right corner, click Environment Variables and then click New in the next window.
For Variable Name type SPARK_HOME.
For Variable Value type C:\Spark\spark-3.0.3-bin-hadoop2.7 and click OK. If you changed the folder path, use that one instead.
In the top box, click the Path entry, then click Edit. Be careful with editing the system path. Avoid deleting any entries already on the list.
You should see a box with entries on the left. On the right, click New.
The system highlights a new line. Enter the path to the Spark folder %SPARK_HOME%\bin.
Repeat this process for Hadoop and Java.
For Hadoop, the variable name is HADOOP_HOME and the value is the path of the folder you created earlier: C:\hadoop. Then add the bin folder to the Path variable. You can enter C:\hadoop\bin directly, but we recommend using %HADOOP_HOME%\bin.
For Java, the variable name is JAVA_HOME and for the value use the path to your Java JDK directory (in our case it’s C:\Program Files\Java\jdk1.8.0_251).
Click OK to close all open windows.
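Once the dialogs are closed, a short script run from a new command prompt can confirm that the variables are visible. The following is a sketch, not an official tool: find_problems is a hypothetical helper, and it assumes the Windows convention of separating Path entries with semicolons.

```python
import os

# Variables this guide asks you to create.
REQUIRED_VARIABLES = ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME")


def find_problems(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    # Normalize Path entries: trim spaces, drop trailing slashes, ignore case.
    path_entries = [
        entry.strip().rstrip("\\/").lower()
        for entry in env.get("PATH", "").split(";")
        if entry.strip()
    ]
    for name in REQUIRED_VARIABLES:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        expected = (value.rstrip("\\/") + "\\bin").lower()
        if expected not in path_entries:
            problems.append(f"{name}\\bin is not on Path")
    return problems


if __name__ == "__main__":
    # On Windows, os.environ stores variable names in upper case, so "PATH" matches Path.
    issues = find_problems(dict(os.environ))
    print("\n".join(issues) if issues else "All variables look correct.")
```

Environment changes only apply to command prompts opened after the variables were saved, so run the check in a new window.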
5: Launch Spark
Open a new command prompt window by right-clicking it and selecting Run as administrator.
To start Spark, enter the full path to the launcher: C:\Spark\spark-3.0.3-bin-hadoop2.7\bin\spark-shell
If you set the environment path correctly, you can simply type spark-shell to launch Spark.
The system should display several lines indicating the status of the application. You may get a Java pop-up. Select Allow access to continue.
Finally, the Spark logo appears, and the prompt displays the Scala shell.
Open a web browser and navigate to http://localhost:4040/.
You can replace localhost with the name of your system.
You should see an Apache Spark shell Web UI. The example below shows the Executors page.
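As a scripted alternative to opening the browser, the sketch below checks whether the Web UI answers while the Spark shell is running. It uses only the Python standard library. The function name spark_ui_is_up is hypothetical, and the default URL is the one given in this guide.

```python
import urllib.error
import urllib.request


def spark_ui_is_up(url="http://localhost:4040/", timeout=2.0):
    """Return True if the given URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Spark Web UI reachable:", spark_ui_is_up())
```

The check returns False when no Spark shell is running, because the Web UI is only served while an application is active.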
To exit Spark and close the Scala shell, press Ctrl-D in the command prompt window.