hgi-systems-cluster-spark

A reboot of HGI's IaC project: a Terraform and Ansible codebase to provision clusters (e.g. Hail/Spark) at Sanger. This specific project has been created to address one simple, initial objective: the lifecycle management of a Spark cluster.

Why a reboot?

The code was no longer effective: the team was not confident with the codebase or the build process, and the infrastructure generated by the code was missing a number of must-have features for today's infrastructures. We chose to have a fresh start on the IaC, rather than refactoring legacy code. This lets us choose simple and effective objectives, outline better requirements, and design around operability from the very beginning.

Guide

Using this repository

  1. Ensure a terraform 0.11 executable is available anywhere in your PATH
  2. Ensure a packer 1.4 executable is available anywhere in your PATH
  3. Ensure a docker distribution is installed
  4. Ensure that the following packages are installed:
    • build-essential
    • cmake
    • g++
    • libatlas3-base
    • liblz4-dev
    • libnetlib-java
    • libopenblas-base
    • make
    • openjdk-8-jdk
    • python3
    • python3-dev
    • python3-pip
    • r-base
    • r-recommended
    • scala
  5. Ensure that the Python requirements in requirements.txt are installed (an example covering steps 4 and 5 follows this list)
  6. Follow the setup runbook
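
As a minimal sketch for steps 4 and 5, assuming a Debian/Ubuntu host with sudo rights and that requirements.txt sits in the repository root, the installation could look like this:

sudo apt-get update
sudo apt-get install --yes \
  build-essential cmake g++ libatlas3-base liblz4-dev libnetlib-java \
  libopenblas-base make openjdk-8-jdk python3 python3-dev python3-pip \
  r-base r-recommended scala

# install the Python requirements used by the repository's tooling
python3 -m pip install --user --requirement requirements.txt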

Running tasks

invoke.sh is a shell script made to wrap pyinvoke's quite extensive list of tasks and collections, and to make its usage even easier. To understand how to use invoke.sh, you can run:

bash invoke.sh --help

To get an idea of what the tasks are and what they do, please have a look at the tasks documentation. For a quick list of example usages, please refer to the users or ops runbooks.

Try your Jupyter notebook

Open your hail-master Jupyter URL http://<IP_OR_NAME>/jupyter/ in a web browser, create a notebook, then initialise Hail in it:

import os
import hail
import pyspark

tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
sc = pyspark.SparkContext()

hail.init(sc=sc, tmp_dir=tmp_dir)

Interactive pyspark

ssh into your hail-master node (an example .ssh/config snippet to make this easier is shown after the command):

$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<IP_OR_NAME>
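
As a sketch, an entry like the following in your ~/.ssh/config would let you run just ssh hail-master; the host alias is an arbitrary choice, and the relaxed host-key options simply mirror the command above (acceptable only because cluster nodes are disposable):

Host hail-master
    # replace with the actual IP or name of your hail-master node
    HostName <IP_OR_NAME>
    User ubuntu
    # mirror the options used above: skip host-key verification
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null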

Once you've logged in, become the application user (i.e. hgi, for now):

$ sudo --login --user=hgi --group=hgi

The --login option will create a login shell with a lot of pre-configured environment variables and commands, including a pre-configured alias for pyspark, so you should not need to remember any options. Once you have started pyspark, you can initialise hail like this:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)
SparkSession available as 'spark'.
>>> import os
>>> import hail
>>> tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
>>> hail.init(sc=sc, tmp_dir=tmp_dir)

Non-interactive pyspark

Hail initialisation in a non-interactive pyspark session is the same as for the Jupyter Notebooks:

import os
import hail
import pyspark

tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
sc = pyspark.SparkContext()

hail.init(sc=sc, tmp_dir=tmp_dir)
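
For example, assuming the snippet above is saved as init_hail.py (a hypothetical file name) and that spark-submit is on the application user's PATH, it can be submitted as a batch job:

$ spark-submit init_hail.py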

How to contribute

Read the CONTRIBUTING.md file

License

Read the LICENSE.md file