[ZEPPELIN-3840] Zeppelin on Kubernetes
### What type of PR is it?
This PR adds the ability to run Zeppelin on Kubernetes. It aims for:

 - Zero configuration to start Zeppelin (and Spark) on Kubernetes.
 - Running everything on Kubernetes: Zeppelin, interpreters, Spark.
 - High customizability, to accommodate various user configurations and extensions.

Key features are

 - Provides a `zeppelin-server.yaml` file that `kubectl` can use to run Zeppelin server
 - Every interpreter automatically runs as a Pod.
 - The Spark interpreter is automatically configured to use [Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html)
 - A reverse proxy is configured to give access to the Spark UI

To do
 - [x] Document how the reverse proxy for the Spark UI works and how to configure a custom domain.
 - [x] Document how to customize the zeppelin-server and interpreter yaml.
 - [x] Document the new configurations
 - [x] Document how to mount volumes for notebooks and configurations

### How it works

#### Run Zeppelin Server on Kubernetes
`k8s/zeppelin-server.yaml` is provided to run Zeppelin Server with a few sidecars and configurations.
This single file is easy to publish (users can fetch it with `curl`) and highly customizable, while still including everything needed.

#### K8s Interpreter launcher
This PR adds a new module, `launcher-k8s-standard`, under the `zeppelin/zeppelin-plugins/launcher/k8s-standard/` directory. This launcher is [selected automatically](https://github.com/apache/zeppelin/pull/3240/files#diff-82fddd2ffb77aaffc4b9cf7b5b1eaa79) when Zeppelin is running on Kubernetes. It handles both the Spark interpreter and all other interpreters.

The launcher launches each interpreter as a Pod using the template [k8s/interpreter/100-interpreter-pod.yaml](https://github.com/apache/zeppelin/pull/3240/files#diff-d9ce62e2c992d32f0184d7edb862f3c4).
The filename has the `100-` prefix because the launcher consumes all files in the directory in alphabetical order on interpreter start/stop. Users can drop more files here to extend or customize the interpreter, and use filenames to control the order. The templates are rendered by [jinjava](https://github.com/HubSpot/jinjava).
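
As a rough illustration of how filename prefixes control ordering, the sketch below lists template files in plain alphabetical (bytewise) order. It is not the launcher's code (the launcher is written in Java); the function name, the `.yaml` filter, and the extra filenames used in the usage note are assumptions made for this example.

```shell
# Hypothetical helper, for illustration only: print the template files of a
# directory in the alphabetical order a consumer would process them.
list_templates_in_order() {
  for f in "$1"/*.yaml; do
    # Skip the unexpanded glob when the directory has no .yaml files.
    if [ -e "$f" ]; then
      basename "$f"
    fi
  done | LC_ALL=C sort   # bytewise sort, so "100-" comes before "200-"
}
```

For example, a directory holding `100-interpreter-pod.yaml` plus a user-added `200-extra-service.yaml` would be listed in that order. Under a purely alphabetical sort, a prefix such as `1000-` would come before `200-`, so keeping prefixes the same width is the safer choice.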

#### Spark interpreter

When the interpreter group is `spark`, K8sRemoteInterpreterProcess automatically [sets the necessary Spark configuration](https://github.com/apache/zeppelin/pull/3240/files#diff-6d1d3084f55bdd519e39ede4a619e73dR297) to use [Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html). The user doesn't have to configure anything. It uses client mode.

#### Spark UI

We could make users manually configure port-forwarding or take some other step to access the Spark UI, but that's not optimal. It is best if the Spark UI is automatically accessible whenever the user has access to the Zeppelin UI, without any extra configuration.

To enable this, the Zeppelin server Pod has a reverse proxy as a sidecar, which splits traffic between Zeppelin server and the Spark UI running in another Pod. It assumes both `service.domain.com` and `*.service.domain.com` point to the nginx proxy address. `service.domain.com` is directed to ZeppelinServer, and `*.service.domain.com` is directed to the interpreter Pod.

`<port>-<interpreter pod svc name>.service.domain.com` is the convention for accessing any application running in an interpreter Pod. If the Spark interpreter Pod is running with the name `spark-axefeg` and the Spark UI is running on port 4040,

```
4040-spark-axefeg.service.domain.com
```

is the address to access the Spark UI. The default service domain is [local.zeppelin-project.org:8080](https://github.com/apache/zeppelin/pull/3240/files#diff-56ccb2e2c2617b27dbaae866d9431e51R22). Both `local.zeppelin-project.org` and `*.local.zeppelin-project.org` resolve to `127.0.0.1`, so it works with `kubectl port-forward`.
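
The naming convention above can be written down as a one-line helper. This is only a sketch of the `<port>-<interpreter pod svc name>.<domain>` pattern; the function name is hypothetical and is not part of this PR.

```shell
# Hypothetical helper: build the hostname for an application running in an
# interpreter Pod, following "<port>-<svc name>.<service domain>".
interpreter_app_host() {
  # $1 = port, $2 = interpreter Pod service name, $3 = service domain
  printf '%s-%s.%s\n' "$1" "$2" "$3"
}
```

Calling `interpreter_app_host 4040 spark-axefeg service.domain.com` prints the address from the example above.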

### What is the Jira issue?
https://issues.apache.org/jira/browse/ZEPPELIN-3840

### How should this be tested?

Prepare a Kubernetes cluster with enough resources (cpus > 5, mem > 6g).
If you're using [minikube](https://github.com/kubernetes/minikube), check your capacity using the `kubectl describe node` command before you start.

You'll need to build a Zeppelin docker image and a Spark docker image to test. Please follow the guide in docs/quickstart/kubernetes.md.

To quickly try it without building docker images, I have uploaded pre-built images to Docker Hub: `moon/zeppelin:0.9.0-SNAPSHOT` and `moon/spark:2.4.0`. Try the following commands

```
ZEPPELIN_SERVER_YAML="curl -s https://raw.githubusercontent.com/Leemoonsoo/zeppelin/kubernetes/k8s/zeppelin-server.yaml"
$ZEPPELIN_SERVER_YAML | sed 's/apache\/zeppelin:0.9.0-SNAPSHOT/moon\/zeppelin:0.9.0-SNAPSHOT/' | sed 's/spark:2.4.0/moon\/spark:2.4.0/' | kubectl apply -f -
```
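
To see what the two `sed` substitutions in the pipeline above do before anything is applied to a cluster, they can be wrapped in a function and run on sample lines. The function name and the sample input lines in the usage note are assumptions for illustration; only the `sed` expressions come from the commands above.

```shell
# Hypothetical wrapper around the two substitutions used above.
# Reads YAML on stdin and writes the rewritten YAML to stdout.
use_prebuilt_images() {
  sed 's/apache\/zeppelin:0.9.0-SNAPSHOT/moon\/zeppelin:0.9.0-SNAPSHOT/' \
    | sed 's/spark:2.4.0/moon\/spark:2.4.0/'
}
```

For example, `echo 'image: apache/zeppelin:0.9.0-SNAPSHOT' | use_prebuilt_images` shows the rewritten image line without touching the cluster.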

And port forward

```
kubectl port-forward zeppelin-server 8080:80
```

And browse http://localhost:8080

To clean up

```
$ZEPPELIN_SERVER_YAML | sed 's/apache\/zeppelin:0.9.0-SNAPSHOT/moon\/zeppelin:0.9.0-SNAPSHOT/' | sed 's/spark:2.4.0/moon\/spark:2.4.0/' | kubectl delete -f -
```

### Screenshots (if appropriate)
See this video https://youtu.be/7E4ZGn4pnTo

### Future work

 - Per-interpreter docker image
 - Blocking communication between interpreter Pods.
 - The Spark interpreter Pod has a Role with CRUD permissions on any pod/service in the same namespace, which should be restricted to Spark executor Pods only.
 - Per-note interpreter mode by default when Zeppelin is running on Kubernetes

### Questions:
* Do the license files need an update? no
* Are there breaking changes for older versions? no
* Does this need documentation? yes

Author: Lee moon soo <[email protected]>

Closes apache#3240 from Leemoonsoo/kubernetes and squashes the following commits:

0100a36 [Lee moon soo] update how it works on docs, add some comments on yaml files
423412a [Lee moon soo] zeppelin.k8s.mode -> zeppelin.run.mode
4e7d817 [Lee moon soo] localtest.me -> local.zeppelin-project.org
993a0e4 [Lee moon soo] document configurations
9ab6fc4 [Lee moon soo] address code review
22e090f [Lee moon soo] logger -> LOGGER
11960dd [Lee moon soo] update corresponding test as well
3b652a4 [Lee moon soo] Make spark executor set ownerreference correctly
1a3a070 [Lee moon soo] Set ownerreference to Role and Rolebinding of interpreter
e2dc88a [Lee moon soo] suppress error log when wait target is already removed
fa36c18 [Lee moon soo] Make spark master configurable
b4f58a9 [Lee moon soo] sig term for quick termination
64a56b5 [Lee moon soo] Add docs
e9ce64f [Lee moon soo] update dockerfile
ec09b8b [Lee moon soo] add test
3078bac [Lee moon soo] spark ui support
9341fcb [Lee moon soo] install kubectl and configure log4j in docker image
0f7c0d4 [Lee moon soo] add license
f305611 [Lee moon soo] rename file
2b579ff [Lee moon soo] let user override namespace
f4166ad [Lee moon soo] make spark container image configurable
0d472ea [Lee moon soo] load properties and environment variables
b0e2c36 [Lee moon soo] Rbac role, rolebinding
2960dcb [Lee moon soo] configure namespace
a4072e6 [Lee moon soo] add signal handler
7a87367 [Lee moon soo] configure spark on kubernetes
263d859 [Lee moon soo] use headless service for interpreter pod
7fe9823 [Lee moon soo] interpreter pod cascade delete on zeppelin-server delete
86e8764 [Lee moon soo] add services on RBAC
18b8f68 [Lee moon soo] print spec file contents on debug log
0dea383 [Lee moon soo] create and connect interpreter pod
9f1b7a1 [Lee moon soo] run kubernetes launcher
2fd2ac8 [Lee moon soo] kubernetes mode configuration
58f9f19 [Lee moon soo] add rbac
36cf391 [Lee moon soo] correct plugin name
52bb6c7 [Lee moon soo] add k8s dir in package
5f602a6 [Lee moon soo] K8sRemoteInterpreterProcess
07489f7 [Lee moon soo] kubectl with exec
d2f3d5b [Lee moon soo] add k8s-standard launcher module
Leemoonsoo committed Jan 18, 2019
1 parent 966a392 commit b13651c
Showing 27 changed files with 2,164 additions and 9 deletions.
29 changes: 29 additions & 0 deletions conf/zeppelin-site.xml.template
@@ -584,4 +584,33 @@
<description>Notebook cron folders</description>
</property>
-->

<property>
<name>zeppelin.run.mode</name>
<value>auto</value>
<description>'auto|local|k8s'</description>
</property>

<property>
<name>zeppelin.k8s.portforward</name>
<value>false</value>
<description>Port forward to interpreter rpc port. Set 'true' only on local development when zeppelin.run.mode is 'k8s'</description>
</property>

<property>
<name>zeppelin.k8s.container.image</name>
<value>apache/zeppelin:0.9.0-SNAPSHOT</value>
<description>Docker image for interpreters</description>
</property>

<property>
<name>zeppelin.k8s.spark.container.image</name>
<value>apache/spark:latest</value>
<description>Docker image for Spark executors</description>
</property>

<property>
<name>zeppelin.k8s.template.dir</name>
<value>k8s</value>
<description>Kubernetes yaml spec files</description>
</property>
</configuration>
3 changes: 3 additions & 0 deletions dev/change_zeppelin_version.sh
@@ -62,6 +62,9 @@ sed -i '' 's/"version": "'"${FROM_VERSION}"'",/"version": "'"${TO_VERSION}"'",/g
# Change version in Dockerfile
sed -i '' 's/Z_VERSION="'"${FROM_VERSION}"'"/Z_VERSION="'"${TO_VERSION}"'"/g' scripts/docker/zeppelin/bin/Dockerfile

# Change docker image version in configuration
sed -i '' 's/zeppelin:'"${FROM_VERSION}"'/zeppelin:'"${TO_VERSION}"'/g' conf/zeppelin-site.xml.template

# When preparing new dev version from release tag, doesn't need to change docs version
if is_dev_version "${FROM_VERSION}" || ! is_dev_version "${TO_VERSION}"; then
# When prepare new rc for the maintenance release
1 change: 1 addition & 0 deletions docs/_includes/themes/zeppelin/_navigation.html
@@ -25,6 +25,7 @@
<ul class="dropdown-menu">
<li class="title"><span>Getting Started</span></li>
<li><a href="{{BASE_PATH}}/quickstart/install.html">Install</a></li>
<li><a href="{{BASE_PATH}}/quickstart/kubernetes.html">Kubernetes</a></li>
<li><a href="{{BASE_PATH}}/quickstart/explore_ui.html">Explore UI</a></li>
<li><a href="{{BASE_PATH}}/quickstart/tutorial.html">Tutorial</a></li>
<li role="separator" class="divider"></li>
253 changes: 253 additions & 0 deletions docs/quickstart/kubernetes.md
@@ -0,0 +1,253 @@
---
layout: page
title: "Zeppelin on Kubernetes"
description: "This page will help you get started with running Apache Zeppelin on Kubernetes."
group: quickstart
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
{% include JB/setup %}

# Zeppelin on Kubernetes

Zeppelin can run on clusters managed by [Kubernetes](https://kubernetes.io/). When Zeppelin runs in a Pod, it creates a Pod for each individual interpreter. In addition, the Spark interpreter is automatically configured to use Spark on Kubernetes in client mode.

Key benefits are

- Interpreter scale-out
- The Spark interpreter auto-configures Spark on Kubernetes
- Customizable Kubernetes yaml files
- Spark UI access

## Prerequisites

- Zeppelin >= 0.9.0 docker image
- Spark >= 2.4.0 docker image (if you use the Spark interpreter)
- A running Kubernetes cluster with access configured to it using [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
- [Kubernetes DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/) configured in your cluster
- Enough CPU and memory in your Kubernetes cluster. We recommend 4 CPUs and 6g of memory to be able to start the Spark interpreter with a few executors.

- If you're using [minikube](https://kubernetes.io/docs/setup/minikube/), check your cluster capacity (`kubectl describe node`) and increase it if necessary

```
$ minikube delete # otherwise configuration won't apply
$ minikube config set cpus <number>
$ minikube config set memory <number in MB>
$ minikube start
$ minikube config view
```

## Quickstart

Get `zeppelin-server.yaml` from the GitHub repository or find it in the Zeppelin distribution package.

```
# Get it from Zeppelin distribution package.
$ ls <zeppelin-distribution>/k8s/zeppelin-server.yaml
# or download it from github
$ curl -s -O https://raw.githubusercontent.com/apache/zeppelin/master/k8s/zeppelin-server.yaml
```

Start Zeppelin on the Kubernetes cluster,

```
kubectl apply -f zeppelin-server.yaml
```

Port forward the Zeppelin server port,

```
kubectl port-forward zeppelin-server 8080:80
```

and browse [localhost:8080](http://localhost:8080).
Try running some paragraphs and see that each interpreter runs as a Pod (check with `kubectl get pods`) instead of a local process.

To shut down,

```
kubectl delete -f zeppelin-server.yaml
```

## Spark Interpreter

Build a Spark docker image to use the Spark interpreter.
Download a Spark binary distribution and run the following command.
Spark 2.4.0 or later is required.

```
# if you're using minikube, set docker-env
$ eval $(minikube docker-env)
# build docker image
$ <spark-distribution>/bin/docker-image-tool.sh -m -t 2.4.0 build
```

Run `docker images` and check that `spark:2.4.0` has been created.
Then set `sparkContainerImage` in the `zeppelin-server-conf` ConfigMap of `zeppelin-server.yaml` to that image.


Create a note and configure the number of executors (default 1)

```
%spark.conf
spark.executor.instances 5
```

And then start your Spark interpreter

```
%spark
sc.parallelize(1 to 100).count
...
```
As long as the `master` property of the Spark interpreter starts with `k8s://` (the default is `k8s://https://kubernetes.default.svc` when Zeppelin is started using zeppelin-server.yaml), Spark executors are created automatically in your Kubernetes cluster.
The Spark UI is accessible by clicking `SPARK JOB` on the paragraph.
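
The condition on the `master` property is a simple prefix check. The sketch below only illustrates that rule; the function name is hypothetical and the real check lives in Zeppelin's Java code.

```shell
# Hypothetical helper: succeed when a Spark master URL targets Kubernetes,
# i.e. when it starts with "k8s://".
is_k8s_master() {
  case "$1" in
    k8s://*) return 0 ;;
    *) return 1 ;;
  esac
}
```

With this rule, `k8s://https://kubernetes.default.svc` qualifies, while a value such as `local[*]` does not.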

See [Running Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html) to learn more.


## Build Zeppelin image manually

To build your own Zeppelin image, first build the Zeppelin project with the `-Pbuild-distr` flag.

```
$ mvn package -DskipTests -Pbuild-distr <your flags>
```

The binary package will be created under the `zeppelin-distribution/target` directory. Move the created package file to the `scripts/docker/zeppelin/bin/` directory.

```
$ mv zeppelin-distribution/target/zeppelin-*.tar.gz scripts/docker/zeppelin/bin/
```

`scripts/docker/zeppelin/bin/Dockerfile` downloads the package from the internet. Modify the file to add the package from the local filesystem instead.

```
...
# Find following section and comment out
#RUN echo "$LOG_TAG Download Zeppelin binary" && \
# wget -O /tmp/zeppelin-${Z_VERSION}-bin-all.tgz http://archive.apache.org/dist/zeppelin/zeppelin-${Z_VERSION}/zeppelin-${Z_VERSION}-bin-all.tgz && \
# tar -zxvf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
# rm -rf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
# mv /zeppelin-${Z_VERSION}-bin-all ${Z_HOME}
# Add following lines right after the commented line above
ADD zeppelin-${Z_VERSION}.tar.gz /
RUN ln -s /zeppelin-${Z_VERSION} /zeppelin
...
```

Then build the docker image.

```
# configure docker env, if you're using minikube
$ eval $(minikube docker-env)
# change directory
$ cd scripts/docker/zeppelin/bin/
# build image. Replace <tag>.
$ docker build -t <tag> .
```

Finally, set the custom image `<tag>` you just created as both the `image` and the `ZEPPELIN_K8S_CONTAINER_IMAGE` env variable of the `zeppelin-server` container spec in the `zeppelin-server.yaml` file.

Currently, a single docker image is used for both the Zeppelin server and interpreter Pods. Therefore,

| Pod | Number of instances | Image | Note |
| --- | --- | --- | --- |
| Zeppelin Server | 1 | Zeppelin docker image | User creates/deletes with kubectl command |
| Zeppelin Interpreters | n | Zeppelin docker image | Zeppelin Server creates/deletes |
| Spark executors | m | Spark docker image | Spark Interpreter creates/deletes |

Currently, the Zeppelin docker image is quite big. The Zeppelin project plans to provide lightweight images for each individual interpreter in the future.


## How it works

### Zeppelin on Kubernetes

`k8s/zeppelin-server.yaml` is provided to run Zeppelin Server with a few sidecars and configurations.
Once Zeppelin Server is started inside Kubernetes, it automatically configures itself to use `K8sStandardInterpreterLauncher`.

The launcher creates each interpreter in a Pod using the templates located under the `k8s/interpreter/` directory.
Templates in the directory are applied in alphabetical order. They are rendered by [jinjava](https://github.com/HubSpot/jinjava),
and all interpreter properties are accessible inside the templates.

### Spark on Kubernetes

When the interpreter group is `spark`, Zeppelin automatically sets the necessary Spark configuration to use Spark on Kubernetes.
It uses client mode, so the Spark interpreter Pod works as the Spark driver and Spark executors are launched in separate Pods.
This auto configuration can be overridden by manually setting the `master` property of the Spark interpreter.


### Accessing Spark UI (or Service running in interpreter Pod)

The Zeppelin server Pod has a reverse proxy as a sidecar, which splits traffic between Zeppelin server and the Spark UI running in the other Pods.
It assumes both `<your service domain>` and `*.<your service domain>` point to the nginx proxy address.
`<your service domain>` is directed to ZeppelinServer, and `*.<your service domain>` is directed to interpreter Pods.

`<port>-<interpreter pod svc name>.<your service domain>` is the convention for accessing any application running in an interpreter Pod.


For example, when your service domain name is `local.zeppelin-project.org`, the Spark interpreter Pod is running with the name `spark-axefeg`, and the Spark UI is running on port 4040,

```
4040-spark-axefeg.local.zeppelin-project.org
```

is the address to access the Spark UI.

The default service domain is `local.zeppelin-project.org:8080`. `local.zeppelin-project.org` and `*.local.zeppelin-project.org` are configured to resolve to `127.0.0.1`.
This allows access to Zeppelin and the Spark UI with `kubectl port-forward zeppelin-server 8080:80`.
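
To make the routing rule concrete, the sketch below splits a hostname of the form `<port>-<interpreter pod svc name>.<your service domain>` back into its port and service name, which is the information a proxy needs to pick a target. It is an illustration only: the function name is hypothetical, the real routing is done by the nginx sidecar, and any `:8080` port suffix on the host is assumed to have been stripped already.

```shell
# Hypothetical helper: split "<port>-<svc>.<domain>" into "<port> <svc>".
# Fails (non-zero status) when the host does not follow the convention.
parse_interpreter_host() {
  host="$1"
  domain="$2"
  label="${host%."$domain"}"           # drop the trailing ".<domain>"
  [ "$label" != "$host" ] || return 1  # host is not a subdomain of <domain>
  case "$label" in
    *.*) return 1 ;;                   # more than one extra DNS label
    *-*) ;;                            # has the "<port>-<svc>" separator
    *) return 1 ;;
  esac
  port="${label%%-*}"                  # text before the first "-"
  svc="${label#*-}"                    # text after the first "-"
  case "$port" in
    ''|*[!0-9]*) return 1 ;;           # port must be all digits
  esac
  [ -n "$svc" ] || return 1
  printf '%s %s\n' "$port" "$svc"
}
```

Running `parse_interpreter_host 4040-spark-axefeg.local.zeppelin-project.org local.zeppelin-project.org` yields the port and service name from the example above, while the bare service domain is rejected because it belongs to Zeppelin server itself.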


If you'd like to use your own custom domain

1. Configure an [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) in your Kubernetes cluster for the `http` port of the `zeppelin-server` service defined in `k8s/zeppelin-server.yaml`.
2. Configure DNS records so that your service domain and its wildcard subdomain point to the IP addresses of your Ingress.
3. Modify `serviceDomain` in the `zeppelin-server-conf` ConfigMap in the `k8s/zeppelin-server.yaml` file.
4. Apply the changes (e.g. `kubectl apply -f k8s/zeppelin-server.yaml`)


## Persist /notebook and /conf directory

Notebooks and configurations are not persisted by default. If you need them to persist, configure a volume and update `k8s/zeppelin-server.yaml`
to use that volume for the /notebook and /conf directories.


## Customization

### Zeppelin Server Pod
Edit `k8s/zeppelin-server.yaml` and apply.

### Interpreter Pod
Since the interpreter Pod is created/deleted by ZeppelinServer using the templates under the `k8s/interpreter` directory,
to customize it,

1. Prepare a `k8s/interpreter` directory with your customizations (edit existing yaml files or create new ones) in a Kubernetes volume.
2. Modify `k8s/zeppelin-server.yaml` to mount the prepared volume directory `k8s/interpreter` at `/zeppelin/k8s/interpreter/`.
3. Apply the modified `k8s/zeppelin-server.yaml`.
4. Run a paragraph; the interpreter will be created using the modified yaml files.


## Future work

- Smaller interpreter docker image.
- Blocking communication between interpreter Pods.
- The Spark interpreter Pod has a Role with CRUD permissions on any pod/service in the same namespace, which should be restricted to Spark executor Pods only.
- Per-note interpreter mode by default when Zeppelin is running on Kubernetes
30 changes: 30 additions & 0 deletions docs/setup/operation/configuration.md
@@ -365,6 +365,36 @@ If both are defined, then the **environment variables** will take priority.
<td>token</td>
<td>GitHub remote name. Default is `origin`</td>
</tr>
<tr>
<td><h6 class="properties">ZEPPELIN_RUN_MODE</h6></td>
<td><h6 class="properties">zeppelin.run.mode</h6></td>
<td>auto</td>
<td>Run mode. 'auto|local|k8s'. 'auto' autodetects the environment. 'local' runs interpreters as local processes. 'k8s' runs interpreters on a Kubernetes cluster</td>
</tr>
<tr>
<td><h6 class="properties">ZEPPELIN_K8S_PORTFORWARD</h6></td>
<td><h6 class="properties">zeppelin.k8s.portforward</h6></td>
<td>false</td>
<td>Port forward to the interpreter rpc port. Set 'true' only for local development when zeppelin.run.mode is 'k8s'. Don't use 'true' in a production environment</td>
</tr>
<tr>
<td><h6 class="properties">ZEPPELIN_K8S_CONTAINER_IMAGE</h6></td>
<td><h6 class="properties">zeppelin.k8s.container.image</h6></td>
<td>apache/zeppelin:{{ site.ZEPPELIN_VERSION }}</td>
<td>Docker image for interpreters</td>
</tr>
<tr>
<td><h6 class="properties">ZEPPELIN_K8S_SPARK_CONTAINER_IMAGE</h6></td>
<td><h6 class="properties">zeppelin.k8s.spark.container.image</h6></td>
<td>apache/spark:latest</td>
<td>Docker image for Spark executors</td>
</tr>
<tr>
<td><h6 class="properties">ZEPPELIN_K8S_TEMPLATE_DIR</h6></td>
<td><h6 class="properties">zeppelin.k8s.template.dir</h6></td>
<td>k8s</td>
<td>Kubernetes yaml spec files</td>
</tr>
</table>

