This repository has been archived by the owner on Aug 13, 2021. It is now read-only.

Initial containerisation of Luigi (#106)
* Luigi running in a container with local files copied in.

* Create docker-compose for arxlive.

* Add basic .dockerignore file.

* Add an extra level of iteration so find_filepath_from_pathstub doesn't exit early when 1 level from HOME. 

* Change relative paths to absolute for arxiv pipeline so calling from outside folder works.

* Add documentation for containerised luigi.

* Tidy up Dockerfile and implement arg for python version.
russwinch committed Jun 18, 2019
1 parent cb9f6ed commit 3e49fa6
Showing 14 changed files with 162 additions and 15 deletions.
4 changes: 4 additions & 0 deletions .dockerignore
@@ -0,0 +1,4 @@
docker-compose*
Dockerfile
.git
docs/*
67 changes: 67 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,67 @@
########
# Python dependencies builder
#
ARG PYTHON_VERSION=3.6
FROM python:$PYTHON_VERSION-slim AS builder

WORKDIR /app
# Sets utf-8 encoding for Python et al
ENV LANG=C.UTF-8
# Turns off writing .pyc files; superfluous on an ephemeral container.
ENV PYTHONDONTWRITEBYTECODE=1
# Disables stdout/stderr buffering so log output appears immediately
ENV PYTHONUNBUFFERED=1

# Ensures that the python and pip executables used
# in the image will be those from our virtualenv.
ENV PATH="/venv/bin:$PATH"

# Install OS package dependencies.
RUN apt-get update && apt-get install -y git

# Setup the virtualenv
RUN python -m venv /venv

# Install Python dependencies
# TODO: Replace with clone from git as below
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

# TODO: implement clone from github with arg for branch or tag
# RUN git clone https://github.com/nestauk/nesta.git --branch master --depth 1 --single-branch
# RUN pip install --no-cache-dir -r nesta/requirements.txt

# Install packages not in requirements.
# TODO: mysql-connector-repackaged doesn't work, investigate switching over.
RUN pip install awscli mysql-connector-python


########
# app container
#
FROM python:$PYTHON_VERSION-slim AS app

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV LANG=C.UTF-8
ENV PIP_DISABLE_PIP_VERSION_CHECK=1
ENV PATH="/venv/bin:$PATH"
ENV PYTHONPATH /app
ENV MYSQLDB /app/nesta/production/config/mysqldb.config
ENV LUIGI_CONFIG_DIR /app/nesta/production/config
ENV LUIGI_CONFIG_PATH /app/nesta/production/config/luigi.cfg

WORKDIR /app

# Copy in Python environment
COPY --from=builder /venv /venv

# Copy in the rest of the app from local to pick up configs
# TODO: replace when secrets are implemented
COPY ./ ./

RUN mkdir -p /var/log/luigi && \
mv docker/run.sh /usr/bin/run.sh && \
chmod +x /usr/bin/run.sh

ENTRYPOINT ["run.sh"]
61 changes: 61 additions & 0 deletions docker/README.rst
@@ -0,0 +1,61 @@
Containerised Luigi
===================

Build
-----

The build uses a multi-stage Dockerfile to speed up rebuilds after code changes:

1. requirements are pip installed into a virtual environment in the first (builder) stage
2. the environment is copied into the second image along with the codebase

From the root of the repository:
:code:`docker build -f docker/Dockerfile -t name:tag .`

where :code:`name` is the name of the created image and :code:`tag` is the chosen tag,
e.g. :code:`arxlive:dev`. Naming the image makes the run step easier than using the generated image id.

Rebuilds after code changes should reuse the cached first stage and only rebuild the second; if a full rebuild is required then include:
:code:`--no-cache`

Python version defaults to 3.6 but can be set during build by including the flag:
:code:`--build-arg PYTHON_VERSION=3.7`

Run
---

As only one pipeline runs in the container the :code:`luigid` scheduler is not used.

There is a :code:`docker-compose` file which mounts your local :code:`~/.aws` folder to provide AWS credentials, as this folder is outside Docker's build context.
This could be adapted for each pipeline.

:code:`docker-compose -f docker/docker-compose.yml run luigi --module module_path params`

where:

- :code:`docker-compose.yml` is the docker-compose file containing the image: :code:`image_name:tag` from the build
- :code:`module_path` is the full python path to the module
- :code:`params` are any other params to supply as per normal, e.g. :code:`--date`, :code:`--production` etc

e.g. :code:`docker-compose -f docker/docker-compose-arxlive-dev.yml run luigi --module nesta.production.routines.arxiv.arxiv_iterative_root_task RootTask --date 2019-04-16`

Important points
----------------

- keep any built images secure, they contain credentials
- you only need to rebuild if code has changed
- as there is no central scheduler there is nothing stopping you from running the task more than once at the same time
- the graphical interface is not enabled without the scheduler

Debugging
---------

If necessary, it's possible to debug inside the container, but the :code:`entrypoint` needs to be overridden with :code:`bash`:

:code:`docker run --entrypoint /bin/bash -itv ~/.aws:/root/.aws:ro image_name:tag`

where :code:`image_name:tag` is the image from the build step.
This command also mounts the .aws folder.

Almost nothing is installed other than Python (not even vi!!), so run :code:`apt-get update` and then :code:`apt-get install` whatever you need.
6 changes: 6 additions & 0 deletions docker/docker-compose-arxlive-dev.yml
@@ -0,0 +1,6 @@
version: '3'
services:
  luigi:
    image: arxlive:dev
    volumes:
      - ~/.aws:/root/.aws:ro
6 changes: 6 additions & 0 deletions docker/run.sh
@@ -0,0 +1,6 @@
#!/usr/bin/env bash
# uncomment to enable the central scheduler (maybe useful for the graphical interface)
# luigid &

# pass any arguments straight on to luigi
luigi --local-scheduler "$@"
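
As a rough illustration of what the entrypoint does with the arguments supplied to `docker-compose ... run luigi`, the Python sketch below assembles the same command line. The function name `build_luigi_command` is hypothetical and not part of the repository; it only mirrors the `luigi --local-scheduler "$@"` line above and does not launch anything.

```python
import shlex


def build_luigi_command(container_args):
    """Mirror run.sh: prepend `luigi --local-scheduler` to the forwarded args.

    Hypothetical helper for illustration only; the container itself uses the
    shell script above.
    """
    return ["luigi", "--local-scheduler", *container_args]


# Arguments as they would arrive from the README example.
args = shlex.split(
    "--module nesta.production.routines.arxiv.arxiv_iterative_root_task "
    "RootTask --date 2019-04-16"
)
command = build_luigi_command(args)
print(" ".join(command))
```

Because the script always adds `--local-scheduler`, every run is self-contained, which is why the README warns that nothing prevents the same task from running twice at once.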
1 change: 1 addition & 0 deletions docs/source/nesta.production.containers.rst
@@ -0,0 +1 @@
.. include:: ../../docker/README.rst
1 change: 1 addition & 0 deletions docs/source/nesta.production.rst
@@ -32,4 +32,5 @@ Code and scripts
nesta.production.luigihacks
nesta.production.scripts
nesta.production.elasticsearch
nesta.production.containers
nesta.production.troubleshooting
6 changes: 3 additions & 3 deletions nesta/production/luigihacks/misctools.py
@@ -1,10 +1,10 @@
'''
A collection of miscellaneous tools.
'''

import configparser
import os


def get_config(file_name, header):
'''Get the configuration from a file in the luigi config path
directory, and convert the key-value pairs under the config :code:`header`
@@ -46,17 +46,17 @@ def find_filepath_from_pathstub(path_stub):
by moving the current working directory backwards, one step at a time until
the file (or directory) is found. If the HOME directory is reached, the algorithm
raises :obj:`FileNotFoundError`.
Args:
path_stub (str): The partial file (or directory) path stub to find.
Returns:
The full path to the partial file (or directory) path stub.
'''
     relative = 0
     while True:
-        relative += 1
         for path in get_paths_from_relative(relative):
             if path.rstrip("/") == os.environ["HOME"]:
                 raise FileNotFoundError(f"Could not find {path_stub}")
             if path.endswith(path_stub.rstrip("/")):
                 return path
+        relative += 1
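
The search described in the docstring above can be sketched in a self-contained form. This is a simplified stand-in, not the repository's implementation: `find_upwards`, `start` and `stop` are hypothetical names, `stop` plays the role that `HOME` plays in the original, and the stop check is applied to the search root rather than to every path visited.

```python
import os
import tempfile


def find_upwards(path_stub, start, stop):
    """Search under `start`, then under each parent in turn, for a path
    ending with `path_stub`. Raise FileNotFoundError once the search root
    reaches `stop`. Hypothetical helper for illustration only.
    """
    stub = path_stub.rstrip("/")
    current = os.path.abspath(start)
    stop = os.path.abspath(stop)
    while True:
        if current == stop:
            raise FileNotFoundError(f"Could not find {path_stub}")
        # Zero backward steps first, so the starting directory is searched
        # before any parent is considered.
        for root, dirs, files in os.walk(current):
            for name in dirs + files:
                candidate = os.path.join(root, name)
                if candidate.endswith(stub):
                    return candidate
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            raise FileNotFoundError(f"Could not find {path_stub}")
        current = parent


with tempfile.TemporaryDirectory() as home:
    home = os.path.realpath(home)
    target = os.path.join(home, "repo", "nesta", "packages")
    workdir = os.path.join(home, "repo", "nesta", "production")
    os.makedirs(target)
    os.makedirs(workdir)

    found = find_upwards("nesta/packages", start=workdir, stop=home)

    try:
        find_upwards("nesta/package", start=workdir, stop=home)
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True
```

Starting the search at the current directory is the point of the commit message's "extra level of iteration": a stub that lives one level below the stop directory is still found before the stop condition fires.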
10 changes: 5 additions & 5 deletions nesta/production/luigihacks/tests/test_misctools.py
@@ -2,16 +2,16 @@
from nesta.production.luigihacks.misctools import get_config
from nesta.production.luigihacks.misctools import find_filepath_from_pathstub


class TestMiscTools(TestCase):

     def test_get_config(self):
         get_config("mysqldb.config", "mysqldb")
         with self.assertRaises(KeyError):
-            get_config("mysqldb.config", "mysqld")
+            get_config("mysqldb.config", "invalid")
         with self.assertRaises(KeyError):
-            get_config("mysqldb.confi", "mysqldb")
+            get_config("not_found.config", "mysqldb")
 
     def test_find_filepath_from_pathstub(self):
         find_filepath_from_pathstub("nesta/packages")
         with self.assertRaises(FileNotFoundError):
-            find_filepath_from_pathstub("nesta/package")
+            find_filepath_from_pathstub("nesta/package")
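
The tests above expect `KeyError` both for an unknown header and for a file that does not exist. One plausible reason, assuming `get_config` reads the file with the `configparser` module that `misctools.py` imports, is sketched below. `read_section` is a hypothetical stand-in and not the repository's function.

```python
import configparser
import os
import tempfile


def read_section(config_dir, file_name, header):
    """Hypothetical get_config-style lookup built on configparser."""
    parser = configparser.ConfigParser()
    # ConfigParser.read silently skips files it cannot open...
    parser.read(os.path.join(config_dir, file_name))
    # ...so a missing file and a missing section both surface as KeyError.
    return dict(parser[header])


with tempfile.TemporaryDirectory() as config_dir:
    with open(os.path.join(config_dir, "mysqldb.config"), "w") as handle:
        handle.write("[mysqldb]\nhost = localhost\nport = 3306\n")

    settings = read_section(config_dir, "mysqldb.config", "mysqldb")

    outcomes = {}
    for file_name, header in [("mysqldb.config", "invalid"),
                              ("not_found.config", "mysqldb")]:
        try:
            read_section(config_dir, file_name, header)
            outcomes[(file_name, header)] = "ok"
        except KeyError:
            outcomes[(file_name, header)] = "KeyError"
```

In the container the directory would come from `LUIGI_CONFIG_DIR`, which the Dockerfile sets to `/app/nesta/production/config`.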
2 changes: 1 addition & 1 deletion nesta/production/routines/arxiv/arxiv_grid_task.py
@@ -2,7 +2,7 @@
import luigi
import logging

-from arxiv_mag_sparql_task import MagSparqlTask
+from nesta.production.routines.arxiv.arxiv_mag_sparql_task import MagSparqlTask
from nesta.packages.arxiv.collect_arxiv import add_article_institutes, create_article_institute_links
from nesta.packages.grid.grid import ComboFuzzer, grid_name_lookup
from nesta.packages.misc_utils.batches import BatchWriter
5 changes: 3 additions & 2 deletions nesta/production/routines/arxiv/arxiv_iterative_date_task.py
@@ -10,11 +10,12 @@
import luigi
from sqlalchemy.sql import text

-from arxiv_collect_iterative_task import CollectNewTask
+from nesta.production.routines.arxiv.arxiv_collect_iterative_task import CollectNewTask
from nesta.packages.arxiv.collect_arxiv import extract_last_update_date
from nesta.production.orms.orm_utils import get_mysql_engine, db_session


# prefix for the task name in luigi_table_updates
UPDATE_PREFIX = 'ArxivIterativeCollect'


@@ -59,7 +60,7 @@ def requires(self):
try:
latest_update = extract_last_update_date(UPDATE_PREFIX, previous_updates)
except ValueError:
-raise ValueError("Date for iterative data collection could not be determined")
+raise ValueError("Date for iterative data collection could not be determined. Set the date manually with --articles-from-date")
self.articles_from_date = datetime.strftime(latest_update, '%Y-%m-%d')

logging.info(f"Updating arxiv data from date: {self.articles_from_date}")
4 changes: 2 additions & 2 deletions nesta/production/routines/arxiv/arxiv_iterative_root_task.py
@@ -8,7 +8,7 @@
import logging
import luigi

-from arxiv_grid_task import GridTask
+from nesta.production.routines.arxiv.arxiv_grid_task import GridTask


class RootTask(luigi.WrapperTask):
@@ -43,4 +43,4 @@ def requires(self):
mag_config_path='mag.config',
test=not self.production,
insert_batch_size=self.insert_batch_size,
-articles_from_date=self.articles_from_date)
+articles_from_date=self.articles_from_date)
2 changes: 1 addition & 1 deletion nesta/production/routines/arxiv/arxiv_mag_sparql_task.py
@@ -9,7 +9,7 @@
import luigi
import logging

-from arxiv_mag_task import QueryMagTask
+from nesta.production.routines.arxiv.arxiv_mag_task import QueryMagTask
from nesta.packages.arxiv.collect_arxiv import update_existing_articles
from nesta.packages.mag.query_mag_sparql import update_field_of_study_ids_sparql, extract_entity_id, query_articles_by_doi, query_authors
from nesta.packages.misc_utils.batches import BatchWriter
2 changes: 1 addition & 1 deletion nesta/production/routines/arxiv/arxiv_mag_task.py
@@ -10,7 +10,7 @@
import logging
import pprint

-from arxiv_iterative_date_task import DateTask
+from nesta.production.routines.arxiv.arxiv_iterative_date_task import DateTask
from nesta.packages.arxiv.collect_arxiv import BatchedTitles, update_existing_articles
from nesta.packages.misc_utils.batches import BatchWriter
from nesta.packages.mag.query_mag_api import build_expr, query_mag_api, dedupe_entities, update_field_of_study_ids
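
The import changes across the arxiv tasks all follow one pattern: sibling-style imports only resolve when the task's own folder is on `sys.path`, whereas package-qualified imports resolve from the project root, which the Dockerfile puts on `PYTHONPATH` as `/app`. The sketch below reproduces the difference with a throwaway package; `demo_pkg` and its module names are made up for the example.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as project_root:
    package_dir = os.path.join(project_root, "demo_pkg")
    os.makedirs(package_dir)
    files = {
        "__init__.py": "",
        "demo_upstream_task.py": "VALUE = 42\n",
        # Package-qualified import: resolves from the project root.
        "absolute_style.py": "from demo_pkg.demo_upstream_task import VALUE\n",
        # Sibling-style import: needs the package folder itself on sys.path.
        "sibling_style.py": "from demo_upstream_task import VALUE\n",
    }
    for name, body in files.items():
        with open(os.path.join(package_dir, name), "w") as handle:
            handle.write(body)

    # Only the project root is importable, as with PYTHONPATH=/app.
    sys.path.insert(0, project_root)
    try:
        importlib.invalidate_caches()
        absolute_value = importlib.import_module("demo_pkg.absolute_style").VALUE
        try:
            importlib.import_module("demo_pkg.sibling_style")
            sibling_error = None
        except ModuleNotFoundError as error:
            sibling_error = error.name
    finally:
        sys.path.remove(project_root)
```

The sibling-style module fails because `demo_upstream_task` is not a top-level module from the project root, which is the situation the old imports would hit when a task is loaded from outside its own folder.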