hgi-systems-cluster-spark

A reboot of HGI's IaC project: a Terraform and Ansible codebase to provision clusters (e.g. Hail/Spark) at Sanger. This specific project has been created to address one simple, initial objective: the lifecycle management of a Spark cluster.

Why a reboot?

The code was no longer effective: the team was not confident with the codebase or the build process, and the infrastructure generated by the code was missing a number of must-have features for today's infrastructures. We chose to have a fresh start on the IaC, rather than refactoring legacy code. This lets us choose simple and effective objectives, outline better requirements, and design around operability from the very beginning.

Guide

Using this repository

  1. Ensure a terraform 0.11 executable is available anywhere in your PATH
  2. Ensure a packer 1.4 executable is available anywhere in your PATH
  3. Ensure a docker distribution is installed
  4. Ensure that the following packages are installed:
    • build-essential
    • cmake
    • g++
    • libatlas3-base
    • liblz4-dev
    • libnetlib-java
    • libopenblas-base
    • make
    • openjdk-8-jdk
    • python3
    • python3-dev
    • python3-pip
    • r-base
    • r-recommended
    • scala
  5. Ensure that the Python requirements in requirements.txt are installed (an example covering steps 4 and 5 follows this list)
  6. Follow the setup runbook
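
As a minimal sketch for steps 4 and 5, assuming a Debian/Ubuntu host with sudo rights and that requirements.txt sits in the repository root, the installation could look like this:

sudo apt-get update
sudo apt-get install --yes \
  build-essential cmake g++ libatlas3-base liblz4-dev libnetlib-java \
  libopenblas-base make openjdk-8-jdk python3 python3-dev python3-pip \
  r-base r-recommended scala

# install the Python requirements used by the repository's tooling
python3 -m pip install --user --requirement requirements.txt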

Running tasks

invoke.sh is a shell script made to wrap pyinvoke's quite extensive list of tasks and collections, and to make its usage even easier. To understand how to use invoke.sh, you can run:

bash invoke.sh --help

To get an idea of what the tasks are and what they do, please have a look at the tasks documentation. For a quick list of example usages, please refer to the users or ops runbooks.

Try your Jupyter notebook

Open your hail-master Jupyter URL http://<IP_OR_NAME>/jupyter/ in a web browser, create a notebook, then initialise Hail in it:

import os
import hail
import pyspark

tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
sc = pyspark.SparkContext()

hail.init(sc=sc, tmp_dir=tmp_dir)

Interactive pyspark

ssh into your hail-master node (an example .ssh/config snippet to make this easier is shown after the command):

$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<IP_OR_NAME>
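
As a sketch, an entry like the following in your ~/.ssh/config would let you run just ssh hail-master; the host alias is an arbitrary choice, and the relaxed host-key options simply mirror the command above (acceptable only because cluster nodes are disposable):

Host hail-master
    # replace with the actual IP or name of your hail-master node
    HostName <IP_OR_NAME>
    User ubuntu
    # mirror the options used above: skip host-key verification
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null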

Once you've logged in, become the application user (i.e. hgi, for now):

$ sudo --login --user=hgi --group=hgi

The --login option will create a login shell with a lot of pre-configured environment variables and commands, including a pre-configured alias for pyspark, so you should not need to remember any options. Once you have started pyspark, you can initialise hail like this:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)
SparkSession available as 'spark'.
>>> import os
>>> import hail
>>> tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
>>> hail.init(sc=sc, tmp_dir=tmp_dir)

Non-interactive pyspark

Hail initialisation in a non-interactive pyspark session is the same as for the Jupyter Notebooks:

import os
import hail
import pyspark

tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
sc = pyspark.SparkContext()

hail.init(sc=sc, tmp_dir=tmp_dir)
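
For example, assuming the snippet above is saved as init_hail.py (a hypothetical file name) and that spark-submit is on the application user's PATH, it can be submitted as a batch job:

$ spark-submit init_hail.py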

How to contribute

Read the CONTRIBUTING.md file

License

Read the LICENSE.md file