Skip to content

Commit

Permalink
ZEPPELIN-1515. Notebook: HDFS as a backend storage (Use hadoop client…
Browse files Browse the repository at this point in the history
… jar)

### What is this PR for?
This PR is trying to add hdfs as another implementation for `NotebookRepo`. There's another PR about using webhdfs to implement that. Actually hdfs client library is compatibility cross major versions. See http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility, if using webhdfs, the code become more complicated and may lose some features of hdfs.

This PR is also required for HA of zeppelin, so that multiple zeppelin instances can share notes via hdfs.  I add hadoop-client in pom file. So zeppelin will package hadoop client jar into its binary distribution. This is because zeppelin may be installed in a gateway machine where no hadoop is installed (only hadoop configuration file is existed in this machine) And since the hadoop client will work with multiple versions of hadoop, so it is fine to package into binary distribution. Spark also package hadoop client jar in its binary distribution.

### What type of PR is it?
[Feature]

### Todos
* [ ] - Task

### What is the Jira issue?
* https://issues.apache.org/jira/browse/ZEPPELIN-1515

### How should this be tested?
Unit test is added.  Also manually verify it in a single node cluster.

### Screenshots (if appropriate)

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No

Author: Jeff Zhang <[email protected]>

Closes apache#2455 from zjffdu/ZEPPELIN-1515 and squashes the following commits:

b3e83ab [Jeff Zhang] ZEPPELIN-1515. Notebook: HDFS as a backend storage (Read & Write Mode)
  • Loading branch information
zjffdu committed Aug 24, 2017
1 parent dfc62f5 commit 30bfcae
Show file tree
Hide file tree
Showing 11 changed files with 418 additions and 90 deletions.
4 changes: 4 additions & 0 deletions bin/zeppelin-daemon.sh
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,10 @@ if [[ -d "${ZEPPELIN_HOME}/zeppelin-server/target/classes" ]]; then
ZEPPELIN_CLASSPATH+=":${ZEPPELIN_HOME}/zeppelin-server/target/classes"
fi

if [[ -n "${HADOOP_CONF_DIR}" ]] && [[ -d "${HADOOP_CONF_DIR}" ]]; then
ZEPPELIN_CLASSPATH+=":${HADOOP_CONF_DIR}"
fi

# Add jdbc connector jar
# ZEPPELIN_CLASSPATH+=":${ZEPPELIN_HOME}/jdbc/jars/jdbc-connector-jar"

Expand Down
4 changes: 4 additions & 0 deletions bin/zeppelin.sh
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,10 @@ addJarInDir "${ZEPPELIN_HOME}/zeppelin-web/target/lib"

ZEPPELIN_CLASSPATH="$CLASSPATH:$ZEPPELIN_CLASSPATH"

if [[ -n "${HADOOP_CONF_DIR}" ]] && [[ -d "${HADOOP_CONF_DIR}" ]]; then
ZEPPELIN_CLASSPATH+=":${HADOOP_CONF_DIR}"
fi

if [[ ! -d "${ZEPPELIN_LOG_DIR}" ]]; then
echo "Log dir doesn't exist, create ${ZEPPELIN_LOG_DIR}"
$(mkdir -p "${ZEPPELIN_LOG_DIR}")
Expand Down
20 changes: 20 additions & 0 deletions conf/zeppelin-site.xml.template
Original file line number Diff line number Diff line change
Expand Up @@ -173,6 +173,26 @@
</property>
-->

<!-- Notebook storage layer using hdfs file system
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.HdfsNotebookRepo</value>
<description>hdfs notebook persistence layer implementation</description>
</property>
<property>
<name>zeppelin.hdfs.keytab</name>
<value></value>
<description>keytab for accessing kerberized hdfs</description>
</property>
<property>
<name>zeppelin.hdfs.principal</name>
<value></value>
<description>principal for accessing kerberized hdfs</description>
</property>
-->

<!-- For connecting your Zeppelin with ZeppelinHub -->
<!--
<property>
Expand Down
17 changes: 17 additions & 0 deletions docs/setup/storage/storage.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ There are few notebook storage systems available for a use out of the box:

* (default) use local file system and version it using local Git repository - `GitNotebookRepo`
* all notes are saved in the notebook folder in your local File System - `VFSNotebookRepo`
* all notes are saved in the notebook folder in hdfs - `HdfsNotebookRepo`
* storage using Amazon S3 service - `S3NotebookRepo`
* storage using Azure service - `AzureNotebookRepo`
* storage using MongoDB - `MongoNotebookRepo`
Expand All @@ -51,6 +52,22 @@ To enable versioning for all your local notebooks though a standard Git reposito
</property>
```

</br>

## Notebook Storage in Hdfs repository <a name="Hdfs"></a>

Notes may be stored in hdfs, so that multiple Zeppelin instances can share the same notes. It supports all the versions of hadoop 2.x. If you use `HdfsNotebookRepo`, then `zeppelin.notebook.dir` is the path on hdfs. And you need to specify `HADOOP_CONF_DIR` in `zeppelin-env.sh` so that zeppelin can find the right hadoop configuration files.
If your hadoop cluster is kerberized, then you need to specify `zeppelin.hdfs.keytab` and `zeppelin.hdfs.principal`

```
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.HdfsNotebookRepo</value>
<description>hdfs notebook persistence layer implementation</description>
</property>
```


</br>

## Notebook Storage in S3 <a name="S3"></a>
Expand Down
89 changes: 0 additions & 89 deletions zeppelin-server/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -195,101 +195,12 @@
<artifactId>websocket-server</artifactId>
<version>${jetty.version}</version>
</dependency>
<!--
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-jsp</artifactId>
<version>${jetty.version}</version>
<exclusions>
<exclusion>
<groupId>org.eclipse.jetty.orbit</groupId>
<artifactId>javax.servlet.jsp</artifactId>
</exclusion>
<exclusion>
<groupId>org.eclipse.jetty.orbit</groupId>
<artifactId>javax.servlet</artifactId>
</exclusion>
<exclusion>
<groupId>org.eclipse.jetty.orbit</groupId>
<artifactId>org.apache.jasper.glassfish</artifactId>
</exclusion>
</exclusions>
</dependency>

<dependency>
<groupId>org.eclipse.jetty.orbit</groupId>
<artifactId>javax.servlet.jsp</artifactId>
<version>2.2.0.v201112011158</version>
<exclusions>
<exclusion>
<groupId>org.eclipse.jetty.orbit</groupId>
<artifactId>javax.servlet</artifactId>
</exclusion>
</exclusions>
</dependency>
-->
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
</dependency>

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop-common.version}</version>
<exclusions>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-core</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-json</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-server</artifactId>
</exclusion>

<exclusion>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.jackrabbit</groupId>
<artifactId>jackrabbit-webdav</artifactId>
</exclusion>
<exclusion>
<groupId>io.netty</groupId>
<artifactId>netty</artifactId>
</exclusion>
<exclusion>
<groupId>commons-httpclient</groupId>
<artifactId>commons-httpclient</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
</exclusion>
<exclusion>
<groupId>org.eclipse.jgit</groupId>
<artifactId>org.eclipse.jgit</artifactId>
</exclusion>
<exclusion>
<groupId>com.jcraft</groupId>
<artifactId>jsch</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.commons</groupId>
<artifactId>commons-compress</artifactId>
</exclusion>
</exclusions>
</dependency>

<dependency>
<groupId>org.quartz-scheduler</groupId>
<artifactId>quartz</artifactId>
Expand Down
66 changes: 66 additions & 0 deletions zeppelin-zengine/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@

<properties>
<!--library versions-->
<hadoop.version>2.6.0</hadoop.version>
<commons.lang3.version>3.4</commons.lang3.version>
<commons.vfs2.version>2.0</commons.vfs2.version>
<aws.sdk.s3.version>1.10.62</aws.sdk.s3.version>
Expand Down Expand Up @@ -301,6 +302,71 @@
<version>1.5</version>
</dependency>

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>

<exclusions>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-core</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-json</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-server</artifactId>
</exclusion>

<exclusion>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.jackrabbit</groupId>
<artifactId>jackrabbit-webdav</artifactId>
</exclusion>
<exclusion>
<groupId>io.netty</groupId>
<artifactId>netty</artifactId>
</exclusion>
<exclusion>
<groupId>commons-httpclient</groupId>
<artifactId>commons-httpclient</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
</exclusion>
<exclusion>
<groupId>org.eclipse.jgit</groupId>
<artifactId>org.eclipse.jgit</artifactId>
</exclusion>
<exclusion>
<groupId>com.jcraft</groupId>
<artifactId>jsch</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.commons</groupId>
<artifactId>commons-compress</artifactId>
</exclusion>
<exclusion>
<groupId>xml-apis</groupId>
<artifactId>xml-apis</artifactId>
</exclusion>
<exclusion>
<groupId>xerces</groupId>
<artifactId>xercesImpl</artifactId>
</exclusion>
</exclusions>
</dependency>
</dependencies>

<build>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -681,7 +681,10 @@ public static enum ConfVars {
ZEPPELIN_SERVER_XFRAME_OPTIONS("zeppelin.server.xframe.options", "SAMEORIGIN"),
ZEPPELIN_SERVER_JETTY_NAME("zeppelin.server.jetty.name", null),
ZEPPELIN_SERVER_STRICT_TRANSPORT("zeppelin.server.strict.transport", "max-age=631138519"),
ZEPPELIN_SERVER_X_XSS_PROTECTION("zeppelin.server.xxss.protection", "1");
ZEPPELIN_SERVER_X_XSS_PROTECTION("zeppelin.server.xxss.protection", "1"),

ZEPPELIN_HDFS_KEYTAB("zeppelin.hdfs.keytab", ""),
ZEPPELIN_HDFS_PRINCIPAL("zeppelin.hdfs.principal", "");

private String varName;
@SuppressWarnings("rawtypes")
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -106,6 +106,7 @@ public class Note implements ParagraphJobListener, JsonSerializable {


public Note() {
generateId();
}

public Note(NotebookRepo repo, InterpreterFactory factory,
Expand Down
Loading

0 comments on commit 30bfcae

Please sign in to comment.