Run Spark on Windows; Pair PyCharm & PySpark
I recently set up Spark on Windows and configured it with PyCharm and Jupyter. I went through several walk-throughs looking for a comprehensive way to solve the issues that others are likely to hit as well, so I decided to write the solution down step by step.
Before we start, here is how to reach the environment variables dialog: press Windows Key + R to open Run, type SystemPropertiesAdvanced, and press Enter.
Click on Environment Variables.
Keep in mind that the HOME variables below are created in the User variables box with the New button, while the Path entries are added to the existing Path in the System variables box with the Edit button.
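If you prefer the command line, a User variable can also be created from CMD with setx, which is roughly equivalent to the New button (the variable name and value here are just placeholders, and the change only becomes visible in newly opened windows):
cmd> setx MY_VARIABLE "C:\some\path"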
JAVA 8
1. Download & install Java 8.
2. Create a new User variable with JAVA_HOME as the variable name and C:\Progra~1\Java\jdk1.8.0_241 as the variable value.
Double-check that the value matches the actual JDK installation path on your machine.
If Java is installed under Program Files (x86), write “Progra~2” instead of “Progra~1”.
3. Add C:\Progra~1\Java\jdk1.8.0_241\bin to the Path variable in the System variables box.
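To confirm the Java setup, open a new CMD window and check both the version and the variable (the exact build number may differ on your machine):
cmd> java -version
cmd> echo %JAVA_HOME%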
Spark
1. Download Spark.
2. Extract the downloaded .gz file with WinRAR and rename the resulting file so that it ends in .zip. Then extract that archive into a directory such as C:\Spark.
3. Add the following User variables:
SPARK_HOME = C:\Spark\spark-2.4.5-bin-hadoop2.7
HADOOP_HOME = C:\Spark\spark-2.4.5-bin-hadoop2.7
(Make sure the version in the paths above matches the version you actually downloaded.)
4. Add to Path: C:\Spark\spark-2.4.5-bin-hadoop2.7\bin
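Before moving on, it is worth opening a fresh CMD window and checking that the variable is visible and that Spark is found on the Path (the output should mention the version you installed, 2.4.5 in this walk-through):
cmd> echo %SPARK_HOME%
cmd> spark-submit --version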
Run Spark through CMD
Open CMD, run the commands below, and check the result:
cmd> pyspark
>>> nums = sc.parallelize([1,2,3,4])
>>> nums.map(lambda x: x*x).collect()
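If everything is set up correctly, pyspark starts with the Spark banner and the last command returns the squared values:
[1, 4, 9, 16]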
PyCharm
- Create a Python project named SparkHelloWorld.
- Go to File > Settings > Project: SparkHelloWorld > Project Structure.
- Press Add Content Root twice and add the python folder and the py4j source zip from your Spark installation (under C:\Spark\spark-2.4.5-bin-hadoop2.7, these are the python folder and the py4j-*-src.zip inside python\lib).
- Create a Python file and write this simple code:
from pyspark.sql import SparkSession

def init_spark():
    # Create (or reuse) a SparkSession and expose its SparkContext
    spark = SparkSession.builder.appName("HelloWorld").getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def main():
    spark, sc = init_spark()
    nums = sc.parallelize([1, 2, 3, 4])
    print(nums.map(lambda x: x * x).collect())

if __name__ == '__main__':
    main()
Result:
[1, 4, 9, 16]
Jupyter
- Download & install Anaconda.
- In the Anaconda Prompt, install findspark:
pip install findspark
Then run Jupyter:
jupyter notebook
In the Jupyter notebook, before you start writing Spark code, you should initialize findspark:
import findspark
findspark.init()
Then you can run Spark code like the example below.
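This is a minimal sketch along the lines of the earlier PyCharm example; the app name is arbitrary, and findspark.init() can optionally be given the SPARK_HOME path explicitly if it is not picked up automatically:
import findspark
findspark.init()  # or findspark.init("C:\\Spark\\spark-2.4.5-bin-hadoop2.7")

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JupyterHelloWorld").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4])
print(nums.map(lambda x: x * x).collect())  # expected: [1, 4, 9, 16]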
For more optional configuration that helps avoid some possible future errors, you can read the full Doron Vainrub article listed in the references.
References: