Example: Distributing Dependencies on a PySpark Cluster
Although Python is a popular choice for data scientists, it is not straightforward to make a Python library available on a distributed PySpark cluster. To determine which dependencies are required on the cluster, you must understand that Spark application code runs in Spark executor processes distributed throughout the cluster. If the Python code you are running uses any third-party libraries, those libraries must be accessible to the executors on every remote node where tasks run.
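One common way to satisfy this requirement is to ship the libraries with the application at submit time. The following is a minimal sketch, in which deps.zip is a hypothetical archive containing the pure Python packages the job imports:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ship-dependencies-sketch")
    # Equivalent to passing --py-files deps.zip to spark-submit:
    # executors add the archive's contents to their Python path
    # before running any tasks.
    .config("spark.submit.pyFiles", "deps.zip")
    .getOrCreate()
)

# A file can also be added after the session exists; executors fetch it
# before executing tasks that import from it.
spark.sparkContext.addPyFile("deps.zip")

Note that this approach only works for pure Python code; libraries that include compiled extensions must instead be installed on, or packaged for, every node in the cluster.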
This example demonstrates how to run a Python script (nltk_sample.py), which includes pure Python libraries (nltk), on a distributed PySpark cluster.
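The script itself is not reproduced here; a minimal sketch of what such a script could contain follows. The sample sentences and the use of nltk.word_tokenize are illustrative assumptions, not the original nltk_sample.py, and nltk (plus any tokenizer data it needs, such as the punkt model) must be present on every executor for this to run.

from pyspark.sql import SparkSession

def tokenize_partition(lines):
    # Import inside the function so the module is resolved on the
    # executor, not just on the driver.
    import nltk
    for line in lines:
        yield nltk.word_tokenize(line)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("nltk-sample-sketch").getOrCreate()
    lines = spark.sparkContext.parallelize([
        "Executors need their own copy of nltk.",
        "Pure Python dependencies can be shipped as a zip archive.",
    ])
    # Each partition is tokenized on whichever executor processes it,
    # which is why the library must be available cluster-wide.
    print(lines.mapPartitions(tokenize_partition).collect())
    spark.stop()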