
Commit 9448637

Author: ziang jia
Commit message: initial commit
1 parent acb9849 · commit 9448637

26 changed files with 1,152 additions and 1 deletion

.gitignore

7 additions
@@ -127,3 +127,10 @@ dmypy.json

 # Pyre type checker
 .pyre/
+
+# credentials
+credentials/
+spark-py/
+
+customized-pvc-pv.yaml
+customized-config.yaml

README.md

197 additions, 1 deletion
# jupyterhub-k8s-apache-spark

Deploy Apache Spark in client mode on a Kubernetes cluster.
Allow users to interact with the Spark cluster through Jupyter notebooks.
Serve Jupyter notebooks through a JupyterHub server.
Manage dependencies with Docker images.

# Infrastructure
Choose a cloud provider and set up the cloud infrastructure. Below is a list of required services. Note that one can optionally build all of this infrastructure on premises.

1. virtual network
2. Kubernetes cluster
3. container registry
4. storage account
5. service principal or cloud service account

For demonstration purposes, azure.sh includes all the commands one might need to set up this infrastructure in the Azure environment.
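For readers setting this up from scratch, a minimal sketch of the kind of Azure CLI calls involved is shown below. The region, address prefixes, node count and SKUs are illustrative assumptions, not values from the repo; azure.sh remains the authoritative reference.

```bash
# Illustrative sketch only -- see azure.sh for the full, working set of commands.
az group create --name $RESOURCE_GROUP --location eastus

az network vnet create \
  --resource-group $RESOURCE_GROUP \
  --name $VNET_NAME \
  --address-prefixes 10.0.0.0/16 \
  --subnet-name aks-subnet \
  --subnet-prefixes 10.0.0.0/22

az acr create --resource-group $RESOURCE_GROUP --name $ACR_NAME --sku Basic

az aks create \
  --resource-group $RESOURCE_GROUP \
  --name $AKS_NAME \
  --kubernetes-version $AKS_VERSION \
  --node-count 1 \
  --generate-ssh-keys

az storage account create \
  --resource-group $RESOURCE_GROUP \
  --name $ADLS_ACCOUNT_NAME \
  --sku Standard_LRS \
  --enable-hierarchical-namespace true
```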
## Environment variables
The following environment variables need to be pre-defined in the dev environment.

```txt
TENANT_ID=<your_tenant_id>
SERVICE_PRINCIPLE_ID=<your_service_principle_id>
SERVICE_PRINCIPLE_SECRET=<your_service_principle_secret>
ADLS_ACCOUNT_NAME=<adls_account_name>
ADLS_ACCOUNT_KEY=<adls_account_sharekey>
RESOURCE_GROUP=JHUB-SPARK
VNET_NAME=AKS-VNET
ACR_NAME=jhubsparkacr
AKS_NAME=jhubsparkaks
AKS_VERSION=1.22.6
JHUB_NAMESPACE=jhubspark-jhub
BASE_IMAGE=justmodeling/spark-py39:v3.2.1
SPARK_NDOE_POOL=sparkpool
ACR_PULL_SECRET=jhubsparkacr-secret
SERVICE_ACCOUNT=spark-admin
USER_FS=test-jhub-user
PROJECT_FS=test-jhub-project
```
One can place all of these in a text file named "env" and export them at once:

```bash
set -a
. ./credentials/env
set +a
```

## Kubernetes cluster management
It is always good practice to segment pods by their functionality and assign them to dedicated node pools to maximize performance and minimize cost.

In my case, I set up four node pools as follows (a sketch of how such a pool is created comes after this list):

* **systempool**. This is where the Kubernetes scheduler pod and api-server pod are assigned. No other pod can be scheduled on this node pool, and it does not scale up automatically.

* **apppool**. This is the node pool for application pods, such as JupyterHub. They are most likely to be deployed through Helm. This node pool can scale up automatically depending on usage.

* **jhubuserpool**. This node pool mainly hosts the single-user pods for JupyterHub. When users log in and spawn their own workspace, each of them gets a pod that runs their notebook server. These pods are created by JupyterHub, so they are assigned to this dedicated pool. The pool can autoscale and has more memory and compute resources, as it is where users might run heavy computation or data manipulation. When there are no users, it can scale down to 0 nodes automatically.

* **sparkpool**. This pool is dedicated to Spark workers. In this architecture, the single-user pod acts as the driver, which submits Spark jobs to run on worker nodes. The node pool scales up automatically when Spark jobs are submitted by any user in the jhubuserpool. Each job gets its own set of pods that run the Spark executors.
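As a hedged sketch of how one of these pools can be created on AKS (the VM size, min/max counts and label below are assumptions, not values from the repo):

```bash
# Illustrative: add an autoscaling node pool for Spark executors.
az aks nodepool add \
  --resource-group $RESOURCE_GROUP \
  --cluster-name $AKS_NAME \
  --name $SPARK_NDOE_POOL \
  --node-vm-size Standard_D8s_v3 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10 \
  --labels pool=spark
```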
# Spark, Delta Lake and Python
To run Spark in client mode on Kubernetes, one needs a driver pod where the Jupyter notebook server runs, and multiple worker pods where the Spark executors run. To guarantee that the driver and the workers use the same versions of Spark, Hadoop and Python, the driver container is built on top of the worker container.

To build the worker container, choose your desired Spark version and define it in BASE_IMAGE, then build the worker image with the following command. For example, I chose justmodeling/spark-py39:v3.2.1 as the base image; you can build your own base image and reference it here as well.

```bash
docker build \
  --build-arg BASE_IMAGE=$BASE_IMAGE \
  -f pyspark-notebook/Dockerfile.spark \
  -t $ACR_NAME.azurecr.io/pyspark-worker:v3.2.1 ./pyspark-notebook
```

To build the driver container, run the following command, where WORK_IMAGE is the image built in the step above.

```bash
docker build \
  --build-arg ADLS_ACCOUNT_NAME=$ADLS_ACCOUNT_NAME \
  --build-arg ADLS_ACCOUNT_KEY=$ADLS_ACCOUNT_KEY \
  --build-arg ACR_NAME=$ACR_NAME \
  --build-arg ACR_PULL_SECRET=$ACR_PULL_SECRET \
  --build-arg JHUB_NAMESPACE=$JHUB_NAMESPACE \
  --build-arg WORK_IMAGE=$ACR_NAME.azurecr.io/pyspark-worker:v3.2.1 \
  --build-arg SPARK_NDOE_POOL=$SPARK_NDOE_POOL \
  --build-arg SERVICE_ACCOUNT=$SERVICE_ACCOUNT \
  --build-arg USER_FS_PVC=pvc-$USER_FS \
  --build-arg PROJECT_FS_PVC=pvc-$PROJECT_FS \
  -f pyspark-notebook/Dockerfile \
  -t $ACR_NAME.azurecr.io/pyspark-notebook:v3.2.1 ./pyspark-notebook
```
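Both images need to be available to the Kubernetes cluster. Assuming the Azure container registry defined above, a typical way to publish them is:

```bash
# Log in to the registry and push the worker and driver images built above
az acr login --name $ACR_NAME
docker push $ACR_NAME.azurecr.io/pyspark-worker:v3.2.1
docker push $ACR_NAME.azurecr.io/pyspark-notebook:v3.2.1
```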

Note that Delta Lake is enabled in justmodeling/spark-py39:v3.2.1, but it is not necessary for most Spark jobs.
# JupyterHub
The [zero-to-jupyterhub-k8s](https://zero-to-jupyterhub.readthedocs.io/en/latest/) project is an open-source Helm application that helps deploy JupyterHub on Kubernetes automatically.

One might need to customize the hub image if additional dependencies are required to enable a third-party authentication plugin. To build a customized hub image, run the following command:

```bash
docker build -t $ACR_NAME.azurecr.io/k8s-hub:latest -f jupyter-k8s-hub/Dockerfile ./jupyter-k8s-hub
```

When deploying JupyterHub with Helm, make sure the customized images are referenced in the config.yaml file. There are two pre-defined placeholders whose values can be replaced with the `sed` command. If further customization is needed, simply make a copy of this YAML file and modify the configuration directly.
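A hedged sketch of this preparation step is shown below; the placeholder names `<HUB_IMAGE>` and `<NOTEBOOK_IMAGE>` are illustrative (check config.yaml in the repo for the actual ones), while the chart repository URL is the standard zero-to-jupyterhub source.

```bash
# Add the JupyterHub Helm chart repository (standard zero-to-jupyterhub source)
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update

# Fill in the two image placeholders with sed.
# The placeholder names below are illustrative, not taken from the repo's config.yaml.
sed -e "s|<HUB_IMAGE>|$ACR_NAME.azurecr.io/k8s-hub:latest|g" \
    -e "s|<NOTEBOOK_IMAGE>|$ACR_NAME.azurecr.io/pyspark-notebook:v3.2.1|g" \
    config.yaml > customized-config.yaml
```

With the chart repository added and customized-config.yaml prepared, deploying JupyterHub with Helm is straightforward: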
```bash
helm upgrade --install spark-jhub jupyterhub/jupyterhub \
  --namespace $JHUB_NAMESPACE \
  --version=1.2.0 \
  --values customized-config.yaml \
  --timeout=5000s
```
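To confirm that the hub came up and to find the public IP used in the test run below, one can query the JupyterHub namespace (the proxy-public service name follows the zero-to-jupyterhub chart defaults):

```bash
# Check the hub and proxy pods, then look up the external IP of the proxy service
kubectl get pods --namespace $JHUB_NAMESPACE
kubectl get service proxy-public --namespace $JHUB_NAMESPACE
```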
# Spark UI
In many use cases, developers would like to access the Spark UI to monitor and debug their applications. It is very convenient to have this UI proxied through JupyterHub.

To do this, simply install [jupyter-sparkui-proxy](https://github.com/yuvipanda/jupyter-sparkui-proxy) as a dependency in the driver container.

```docker
COPY jupyter-sparkui-proxy /opt/jupyter-sparkui-proxy
RUN cd /opt && chown -R jovyan jupyter-sparkui-proxy

USER $NB_UID
RUN cd /opt/jupyter-sparkui-proxy \
    && pip3 install .
```

One can choose to install it directly from the PyPI registry, but there is a bug in the published version, so it is suggested to install it from the original GitHub repository directly:

```docker
RUN pip3 install git+https://github.com/yuvipanda/jupyter-sparkui-proxy.git@master
```
# Test run with PySpark
Once everything is deployed, users should be able to access JupyterHub at http://<IP-Address>/hub/spawn/<User-Name>
![jhub-ui](images/jhub-ui.png)

There can be multiple projects if configured in the config.yaml file when deploying JupyterHub. For demonstration purposes, I have only one here.

Assuming there is no active node in the jhubuserpool, one should expect the Kubernetes cluster to scale up automatically. This might take a few minutes depending on the cloud service provider.
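One way to watch the scale-up from a terminal (assuming AKS, which exposes the pool name through the agentpool node label):

```bash
# Watch nodes join the jhubuserpool as the cluster autoscaler reacts to the pending user pod
kubectl get nodes --label-columns agentpool --watch
```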
![jhub-autoscale](images/jhub-autoscale.png)

Once the server is ready, one can see the JupyterLab UI. To test Spark jobs, simply create a test application as follows. Note that MY_POD_IP is already an environment variable in the Jupyter notebook pod; we set it as the driver host here.

```python
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
import os

# Config Spark session
conf = SparkConf()
conf.set('spark.app.name', 'spark-test')
conf.set('spark.driver.host', os.environ['MY_POD_IP'])
conf.set('spark.submit.deployMode', 'client')
conf.set('spark.executor.cores', '3')
conf.set('spark.executor.memory', '12g')
conf.set('spark.executor.instances', '6')

# Create Spark context
# This step takes ~5-10 mins
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```

One should see the sparkpool scaling up to provide the resources requested above. The machines in this pool have 8 vCores and 32 GB of memory; since we requested 6 executors with 3 cores and 12 GB each, this results in 3 machines spinning up.
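The executor pods themselves can be listed from a terminal; Spark on Kubernetes labels them with spark-role=executor (this sketch assumes the executors run in the same namespace as the notebook pod):

```bash
# List the Spark executor pods created for this session
kubectl get pods --namespace $JHUB_NAMESPACE -l spark-role=executor -o wide
```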
![spark-autoscale](images/spark-autoscale.png)

Once the resources are up and running, the Spark session is registered.
![spark-session](images/spark-session.png)

Now users can access the Spark UI to monitor their jobs. With the proxy set up above, the Spark UI is reachable at http://<IP-Address>/user/<User-Name>/sparkui
![spark-ui](images/spark-ui.png)
