Run Spark on Windows; Pair PyCharm & PySpark
I recently set up Spark on Windows and configured it with PyCharm and Jupyter. I went through several walk-throughs looking for a comprehensive way to solve the issues that others are likely to hit as well, so I decided to write the solution down step by step.
Before we start, here is how to reach the environment variables dialog: press Windows Key + R to open Run, type SystemPropertiesAdvanced, and press Enter.
Click on Environment Variables.
Keep in mind that the HOME variables below are created in the User variables box with the New button, while the Path entries are added to the existing Path in the System variables box with the Edit button.
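If you prefer the command line, a User variable can also be created from CMD with setx, which is roughly equivalent to the New button (the variable name and value here are just placeholders, and the change only becomes visible in newly opened windows):
cmd> setx MY_VARIABLE "C:\some\path"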
JAVA 8
1. Download & install Java 8.
2. Create a new User variable with JAVA_HOME as the variable name and C:\Progra~1\Java\jdk1.8.0_241 as the variable value.
Double-check that the value matches the actual JDK installation path on your machine.
If Java is installed under Program Files (x86), write “Progra~2” instead of “Progra~1”.
3. Add C:\Progra~1\Java\jdk1.8.0_241\bin to the Path variable in the System variables box.
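To confirm the Java setup, open a new CMD window and check both the version and the variable (the exact build number may differ on your machine):
cmd> java -version
cmd> echo %JAVA_HOME%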
Spark
1. Download Spark.
2. Extract the downloaded .gz file with WinRAR and rename the resulting file so that it ends in .zip. Then extract that archive into a directory such as C:\Spark.
3. Add the following User variables:
SPARK_HOME = C:\Spark\spark-2.4.5-bin-hadoop2.7
HADOOP_HOME = C:\Spark\spark-2.4.5-bin-hadoop2.7
(Make sure the version in the paths above matches the version you actually downloaded.)
4. Add to Path: C:\Spark\spark-2.4.5-bin-hadoop2.7\bin
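Before moving on, it is worth opening a fresh CMD window and checking that the variable is visible and that Spark is found on the Path (the output should mention the version you installed, 2.4.5 in this walk-through):
cmd> echo %SPARK_HOME%
cmd> spark-submit --version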
Run Spark through CMD
Open CMD, run the commands below, and check the result:
cmd> pyspark
>>> nums = sc.parallelize([1,2,3,4])
>>> nums.map(lambda x: x*x).collect()
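If everything is set up correctly, pyspark starts with the Spark banner and the last command returns the squared values:
[1, 4, 9, 16]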
PyCharm
- Create a Python project named SparkHelloWorld.
- Go to File > Settings > Project: SparkHelloWorld > Project Structure.
- Press Add Content Root twice and add the python folder and the py4j source zip from your Spark installation (under C:\Spark\spark-2.4.5-bin-hadoop2.7, these are the python folder and the py4j-*-src.zip inside python\lib).
- Create a Python file and write this simple code:
from pyspark.sql import SparkSession

def init_spark():
    # Create (or reuse) a SparkSession and expose its SparkContext
    spark = SparkSession.builder.appName("HelloWorld").getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def main():
    spark, sc = init_spark()
    nums = sc.parallelize([1, 2, 3, 4])
    print(nums.map(lambda x: x * x).collect())

if __name__ == '__main__':
    main()
Result:
[1, 4, 9, 16]
Jupyter
- Download & install Anaconda.
- In the Anaconda Prompt, install findspark:
pip install findspark
Then run Jupyter:
jupyter notebook
In the Jupyter notebook, before you start writing Spark code, you should initialize findspark:
import findspark
findspark.init()
Then you can run Spark code like the example below.
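This is a minimal sketch along the lines of the earlier PyCharm example; the app name is arbitrary, and findspark.init() can optionally be given the SPARK_HOME path explicitly if it is not picked up automatically:
import findspark
findspark.init()  # or findspark.init("C:\\Spark\\spark-2.4.5-bin-hadoop2.7")

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JupyterHelloWorld").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4])
print(nums.map(lambda x: x * x).collect())  # expected: [1, 4, 9, 16]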
For more optional configuration that helps avoid some possible future errors, you can read the full Doron Vainrub article listed in the references.
References: