Before you continue reading, there are some terms and values you are expected to know. For any unfamiliar term, please read the Setup Guide, which also contains a short glossary and other useful links.
The following is a list of the values / information you need to know. If unsure, ask a member of the HGI team:
- The current name-tag of the datacenter (e.g., `eta`)
- The name-tag of the programme / team that is hosting your cluster (e.g., `hgi`)
- The name-tag of the working environment (e.g., `dev`)
- Your OpenStack username (e.g., `ld14`)
- The name of the deployment (e.g., `hail`)
Using a `bash`-like syntax:

```bash
datacenter="eta"
programme="hgi"
environment="dev"
os_username="ld14"
deployment_name="hail"
```
```bash
bash invoke.sh user create
bash invoke.sh user destroy
bash invoke.sh user create    # you may want to do this because you have lost / replaced your keypair
bash invoke.sh user destroy --yes-also-the-bucket
bash invoke.sh hail create
# TODO: yet to be implemented
bash invoke.sh hail scale --spark-slaves-count 8
bash invoke.sh hail destroy
```
There are two sets of configuration files:

- For the infrastructure provisioning:

  ```bash
  ls -la terraform/vars/${datacenter}/${programme}/${environment}/${os_username}/
  ```

- For the software provisioning:

  ```bash
  ls -la ansible/vars/${datacenter}/${programme}/${environment}/${os_username}/${deployment_name}/
  ```
In the following guides, `${ip_address}` is the IP address of your Hail master node. This can be looked up by examining the configuration files for the infrastructure provisioning (see above).
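For example, one (entirely hypothetical) way to hunt for the address is to search those files for anything shaped like an IPv4 address; the exact file names and layout depend on your deployment:

```bash
# Hypothetical: search the infrastructure vars for IPv4-shaped strings.
# Which file holds the master node's address depends on your deployment.
grep -rE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
  terraform/vars/${datacenter}/${programme}/${environment}/${os_username}/
```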
**Important** Any data which you need to process in Hail must be accessible to the Hail workers. Ideally, you should keep such data (e.g., VCF files) in S3. However, you may also put data in Hail's temporary working directory, which is shared amongst all workers in your Hail cluster. Its path is `${HAIL_HOME}/tmp`.

`${HAIL_HOME}`, as of writing, is `/opt/sanger.ac.uk/hgi/hail`. By default, this directory is 31GB in size. Note that this is different to the Jupyter directory, where your Jupyter notebooks will live; that space (1GB, by default) is only meant for your source code, and any data files placed there won't be accessible to your Hail workers.
See the SSH documentation, below, for instructions on how to get data on to your cluster, if you can't use S3 for the purpose.
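As a sanity check of how much of that 31GB is currently in use, one (hypothetical) option is to query the disk usage over SSH; this sketch assumes the default path above and working SSH access to the master node, as described below:

```bash
# Hypothetical sketch: report usage of the shared temporary working directory.
# Assumes the default path /opt/sanger.ac.uk/hgi/hail/tmp and SSH access as the ubuntu user.
ssh ubuntu@${ip_address} df -h /opt/sanger.ac.uk/hgi/hail/tmp
```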
Your Jupyter Notebook will be accessible on the internal network at `http://${ip_address}`, which will redirect you to `http://${ip_address}/jupyter/tree`. You will be prompted for the password to allow you to log in.
Note that the underlying Spark service status can be accessed at `http://${ip_address}/spark/`. This exists largely as a curiosity for end users, but can be useful to help with debugging.
In newer versions of the Hail cluster, you will see a `working` directory in the Jupyter file browser. This is a symlink to `${HAIL_HOME}/tmp`, which allows you to get working data into your cluster more easily (i.e., using the Jupyter interface, rather than `scp`).
**Note** You will still need to refer to this location as `${HAIL_HOME}/tmp` in your notebooks, for access. Conveniently, this is the same temporary directory as you'll specify when initialising Hail:

```python
import os
import hail
import pyspark

sc = pyspark.SparkContext()
temp_dir = working_dir = os.path.join(os.environ["HAIL_HOME"], "tmp")
hail.init(sc=sc, tmp_dir=temp_dir)

mt = hail.import_vcf(f"{working_dir}/some-variants.vcf.bgz")
```
In early versions of the provisioned Notebook, there was a "Quit" button in the upper right corner which, if pressed, would shut down the Jupyter service on your Hail master node. This button has since been disabled, but if you are using an older version and still have it:

- Do not press the "Quit" button!
- If you do press it, reflect on your life choices and then restart the Jupyter Notebook service by running the following command from a machine on which your private key exists, where `${ip_address}` is the IP address of your Hail master node:

  ```bash
  ssh ${ip_address} sudo systemctl restart jupyter-notebook.service
  ```
You can SSH into your master node to get data on and off it, if needed, as well as to run non-interactive Hail jobs.

**Note** You are advised to store data in S3, rather than embargoing it in the Hail temporary working directory. Sometimes, however, this is not an option; for example, if your Hail script needs to save output (e.g., a plot) and doesn't support writing to S3.

To upload data, so it's available to the Hail cluster for processing:

```bash
scp my_files ${ip_address}:/opt/sanger.ac.uk/hgi/hail/tmp/
```

To download data from the Hail cluster:

```bash
scp ${ip_address}:/path/to/your/file /destination/path/
```

For additional documentation, please see the `scp` man pages.
You have SSH access to your cluster, from which you can launch non-interactive Hail jobs. The default user on your Hail master node is `ubuntu`, so you can log in with:

```bash
ssh ubuntu@${ip_address}
```

Once you've established a session, you will need to change to the `hgi` user in order to run Hail scripts:

```bash
sudo -iu hgi
```
To avoid having to remember `${ip_address}`, you can create an SSH configuration using a more memorable name. For example, in your `~/.ssh/config` file, replacing `${ip_address}` appropriately in the following:

```
Host hail
    Hostname ${ip_address}
    User ubuntu
```

This will allow you to access your cluster's master node, either as a login shell or for copying data, using the name `hail` instead of `ubuntu@${ip_address}`.
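For example (a sketch, assuming the `Host hail` entry above is in place), the earlier SSH and `scp` commands become:

```bash
# Log in to the master node using the alias defined in ~/.ssh/config
ssh hail

# Upload data to the shared temporary working directory via the same alias
scp my_files hail:/opt/sanger.ac.uk/hgi/hail/tmp/
```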
The Spark shell can be used both to run non-interactive Hail scripts (e.g., those which aren't exploratory in nature and potentially require a long run time) and as a REPL (read-evaluate-print loop), not conceptually dissimilar to a Jupyter notebook.
If you need S3 access, you should first configure your environment (see below). Otherwise, to start a REPL, run:

```bash
pyspark
```

You can then initiate an interactive Hail session using the following code:

```python
import os
import hail

# The pyspark REPL provides the Spark Context, sc, automatically
tmp_dir = os.path.join(os.environ["HAIL_HOME"], "tmp")
hail.init(sc=sc, tmp_dir=tmp_dir)
```
Alternatively, to run a non-interactive script, it can be submitted using:

```bash
spark-submit /path/to/your/script.py
```
The "boilerplate" for non-interactive scripts is slightly different than the code run in the REPL, because you need to acquire a Spark Context (the REPL provides this for you automatically). The code should change to:
import os
import pyspark
import hail
sc = pyspark.SparkContext()
tmp_dir = os.path.join(os.environ["HAIL_HOME"], "tmp")
hail.init(sc=sc, tmp_dir=tmp_dir)
This is the same boilerplate that you would use in a Jupyter session.
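Putting it together, a minimal non-interactive script might look like the following sketch; the input path is hypothetical and assumes a block-gzipped VCF has already been uploaded to the temporary working directory:

```python
import os
import pyspark
import hail

# Acquire a Spark Context and initialise Hail against the shared temporary directory
sc = pyspark.SparkContext()
tmp_dir = os.path.join(os.environ["HAIL_HOME"], "tmp")
hail.init(sc=sc, tmp_dir=tmp_dir)

# Hypothetical input: a block-gzipped VCF previously uploaded to ${HAIL_HOME}/tmp
mt = hail.import_vcf(os.path.join(tmp_dir, "some-variants.vcf.bgz"))

# Report the number of variants and samples in the dataset
n_variants, n_samples = mt.count()
print(f"{n_variants} variants across {n_samples} samples")
```

You would then submit this with `spark-submit /path/to/your/script.py`, as above.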
**Note** The following guidance is for older clusters, where your S3 credentials weren't set automatically. On newer clusters, you no longer need to do this, unless you wish to access data that exists under a different set of keys.

To get your S3 credentials into your Jupyter Notebook, you need to add the following commands after acquiring the Spark Context and before initialising Hail:

```python
import os
import pyspark
import hail

sc = pyspark.SparkContext()
tmp_dir = os.path.join(os.environ["HAIL_HOME"], "tmp")

# This bit is for configuring S3 access:
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.set("fs.s3a.access.key", "XXX")  # Replace XXX with your AWS Access Key
hadoop_config.set("fs.s3a.secret.key", "YYY")  # Replace YYY with your AWS Secret Key

hail.init(sc=sc, tmp_dir=tmp_dir)
```
The `XXX` and `YYY` in the above need to be replaced with your AWS access key and secret key, respectively. These can be found in your `.s3cfg` file.
**Important** If you are using source control (e.g., Git), take care not to check in secrets such as your AWS keys. If you accidentally push these to, say, GitHub, the whole world now has access to your S3 credentials. Ideally, these should be kept in a separate configuration file, which is not checked in, and loaded into your Jupyter Notebook. HGI can assist with setting this up.
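As a sketch of that approach, assuming your `.s3cfg` follows the usual s3cmd layout (an `access_key` and `secret_key` under a `[default]` section) and that `sc` is the Spark Context from the boilerplate above, you could read the keys at runtime rather than hard-coding them:

```python
import configparser
import os

# Read the AWS keys from ~/.s3cfg (s3cmd-style INI file), so no secrets live in the notebook
s3cfg = configparser.ConfigParser()
s3cfg.read(os.path.expanduser("~/.s3cfg"))

hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.set("fs.s3a.access.key", s3cfg["default"]["access_key"])
hadoop_config.set("fs.s3a.secret.key", s3cfg["default"]["secret_key"])
```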
When running a script using `pyspark` or `spark-submit`, you do not have to explicitly set the AWS keys within your script. Instead, these are determined by Spark from the environment. To this end, you must set the appropriate environment variables before running your script:

```bash
export AWS_ACCESS_KEY_ID="XXX"
export AWS_SECRET_ACCESS_KEY="YYY"
```
The `XXX` and `YYY` in the above need to be replaced with your AWS access key and secret key, respectively. These can be found in your `.s3cfg` file.
**Tip** The environment variables only need to be set once per session and, to avoid the session terminating prematurely, you can do this within a `tmux` or `screen` session. You can then log back in to your master node with its working state preserved.
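For example, a typical session might look like the following sketch (assuming `tmux` is available on the master node; the key values and script path are placeholders):

```bash
# Start a named tmux session on the master node
tmux new -s hail

# Inside tmux: set the keys once, then run your script
export AWS_ACCESS_KEY_ID="XXX"        # replace with your AWS access key
export AWS_SECRET_ACCESS_KEY="YYY"    # replace with your AWS secret key
spark-submit /path/to/your/script.py

# Detach with Ctrl-b, then d; reattach later with:
tmux attach -t hail
```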
- Do not use underscores in your bucket names; they are not supported by Spark's S3 driver and using them will give you misleading errors.
- Do not include colons or other special characters in your object names. To be safe, limit yourself to alphanumeric characters, full stops (`.`), dashes (`-`) and underscores (`_`).
- When reading and writing S3 using Hail, you must use the URL scheme `s3a://`, rather than `s3://`. So, say you have a file named `chr10.vcf.gz` in a bucket called `my-project`, then its address (inasmuch as Hail can understand it) would be `s3a://my-project/chr10.vcf.gz`.
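For instance, a minimal sketch of reading that file from a notebook or script, assuming Hail has already been initialised as above and that the `.gz` file is actually block-gzipped (hence `force_bgz=True`):

```python
import hail

# Hypothetical bucket and object from the example above; note the s3a:// scheme
mt = hail.import_vcf("s3a://my-project/chr10.vcf.gz", force_bgz=True)
print(mt.count())
```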