Run Spark on Windows; Pair PyCharm & PySpark

Fakhredin Khorasani
3 min read · Mar 8, 2020


I recently set up Spark on Windows and configured it for both PyCharm and Jupyter. I had to read several walk-throughs to piece together fixes for the issues that came up, issues that will likely affect others too, so I decided to write the whole solution down step by step.

Before we start, here is how to reach the environment-variable settings: press Windows Key + R to open the Run dialog, type SystemPropertiesAdvanced, and press Enter.

Click on Environment Variables.

Throughout this guide, the HOME variables are created in the User variables box with the New button, while the bin directories are appended to Path in the System variables box with the Edit button.
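(Optional) If you prefer the command line, user variables can also be set from CMD with the built-in setx command. For example, the JAVA_HOME variable defined in the next section could be created like this; note that setx only affects CMD windows opened afterwards:

cmd> setx JAVA_HOME "C:\Progra~1\Java\jdk1.8.0_241"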

JAVA 8

  1. Download & Install Java 8.
  2. Create a new user variable with JAVA_HOME as the variable name and C:\Progra~1\Java\jdk1.8.0_241 as the variable value.

Double-check that the path matches the JDK folder actually installed on your machine (the update number, 0_241 here, will differ for other releases).

If Java was installed under Program Files (x86), write “Progra~2” instead of “Progra~1”. These 8.3 short names avoid the space in “Program Files”, which some tools handle badly.

3. Add C:\Progra~1\Java\jdk1.8.0_241\bin to the Path variable.
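To confirm the variables were picked up (a quick sanity check, not part of the original steps), open a new CMD window and run the commands below; the first should echo the JDK path and the second should report a 1.8 version:

cmd> echo %JAVA_HOME%
cmd> java -version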

Spark

  1. Download Spark.
  2. The download is a .tgz archive. Extract it with WinRAR; because it is a gzip-compressed tar, you may need to extract twice, once for the outer .gz layer and once for the inner .tar. Put the resulting spark-2.4.5-bin-hadoop2.7 folder in a directory such as C:\Spark.

3. Add two user variables:

SPARK_HOME = C:\Spark\spark-2.4.5-bin-hadoop2.7

HADOOP_HOME = C:\Spark\spark-2.4.5-bin-hadoop2.7

(Make sure the version in these paths matches the version of Spark you actually downloaded.)

4. Add to Path: C:\Spark\spark-2.4.5-bin-hadoop2.7\bin
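Again, an optional sanity check: open a fresh CMD window so the new variables are loaded, then verify that Windows can find the Spark scripts:

cmd> echo %SPARK_HOME%
cmd> where pyspark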

Run Spark through CMD

Open CMD, run the commands below, and check the result:

cmd> pyspark                       
>>> nums = sc.parallelize([1,2,3,4])
>>> nums.map(lambda x: x*x).collect()
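If everything is configured correctly, the last line prints [1, 4, 9, 16], the squares of the input list. Type exit() to leave the PySpark shell.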

PyCharm

  1. Create a Python project named SparkHelloWorld.
  2. Go to File > Settings > Project: SparkHelloWorld > Project Structure.
  3. Press Add Content Root twice and add these two entries from your Spark installation:
     the python folder (C:\Spark\spark-2.4.5-bin-hadoop2.7\python)
     the py4j zip file under python\lib (the exact name depends on the Spark release, e.g. py4j-0.10.7-src.zip for Spark 2.4.5)
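(Optional) If you would rather not edit the project structure, an alternative is to install PySpark directly into the project interpreter with pip install pyspark; the content-root approach above, however, keeps PyCharm pointed at exactly the Spark build you downloaded.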

Create a Python file and write this simple code:

from pyspark.sql import SparkSession


def init_spark():
    spark = SparkSession.builder.appName("HelloWorld").getOrCreate()
    sc = spark.sparkContext
    return spark, sc


def main():
    spark, sc = init_spark()
    nums = sc.parallelize([1, 2, 3, 4])
    print(nums.map(lambda x: x * x).collect())


if __name__ == '__main__':
    main()

Result:

[1, 4, 9, 16]
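One small optional addition, not shown above: calling spark.stop() at the end of main() shuts the session down cleanly before the script exits.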

Jupyter

  1. Download & install Anaconda.
  2. In the Anaconda Prompt, install findspark:
pip install findspark

Then run Jupyter:

jupyter notebook

In the notebook, before writing any Spark code, initialize findspark:

import findspark
findspark.init()
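If findspark cannot locate your installation (for example, because SPARK_HOME was not set), init() also accepts the Spark directory explicitly. This variant is optional and assumes the C:\Spark path used earlier:

import findspark
findspark.init(r"C:\Spark\spark-2.4.5-bin-hadoop2.7")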

Then you can run Spark code in the notebook just as before.
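As a minimal sketch (the appName below is arbitrary), a cell like the following should reproduce the earlier result:

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession
spark = SparkSession.builder.appName("JupyterHelloWorld").getOrCreate()
sc = spark.sparkContext

# Same sanity check as before: square a small list of numbers
nums = sc.parallelize([1, 2, 3, 4])
print(nums.map(lambda x: x * x).collect())   # [1, 4, 9, 16]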

For additional optional configuration that helps avoid some future errors, see Doron Vainrub's full article in the references.

References:

https://www.youtube.com/watch?v=RsALKtZvqFo

Doron Vainrub Medium Post
