
Update mozetl-databricks.py for running external mozetl-compatible modules #316

Merged · 13 commits · Feb 21, 2019

Conversation

@acmiyaguchi (Contributor)

This updates the Databricks workflow for supporting external modules that follow the mozetl convention. The convention is as follows:

  • The module uses the click library to define a command-line interface
  • All arguments use the @click.option decorator; arguments that are required should set required=True
    • click can also read option values from environment variables when the entry point is invoked with an auto_envvar_prefix
  • The module has a top-level cli submodule, such that from my_module import cli is a valid import

A module that adheres to these basic conventions can be scheduled via Airflow using standard tooling; a minimal sketch follows.
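For illustration, here is a hypothetical module that follows the conventions above. The module name (my_module) and the options are made up; only the click usage mirrors the convention.

# my_module/cli.py -- hypothetical example of a mozetl-compatible module
import click

@click.command()
@click.option("--submission-date", required=True, help="Date partition to process.")
@click.option("--bucket", default="my-output-bucket", help="Output location (hypothetical).")
def entry_point(submission_date, bucket):
    """Run the job for a single submission date."""
    click.echo("processing {} into {}".format(submission_date, bucket))

if __name__ == "__main__":
    # With auto_envvar_prefix, click also reads option values from
    # environment variables such as MY_MODULE_SUBMISSION_DATE.
    entry_point(auto_envvar_prefix="MY_MODULE")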

@acmiyaguchi (Contributor, Author)

I've updated the script to address a few of the issues found during review; here is the log from a run of the system_check module.

$ python bin/mozetl-databricks.py \
>     --git-path https://github.com/acmiyaguchi/python_mozetl.git \
>     --git-branch federated-mozetl \
>     --token <TOKEN> \
>     system_check
DEBUG:root:
# This runner has been auto-generated from mozilla/python_mozetl/bin/mozetl-databricks.py.
# Any changes made to the runner file will be over-written on subsequent runs.
from mozetl import cli

try:
    cli.entry_point(auto_envvar_prefix="MOZETL")
except SystemExit:
    # avoid calling sys.exit() in databricks
    # http://click.palletsprojects.com/en/7.x/api/?highlight=auto_envvar_prefix#click.BaseCommand.main
    pass

DEBUG:root:{
  "path": "/FileStore/airflow/mozetl_runner.py", 
  "overwrite": true, 
  "contents": "CiMgVGhpcyBydW5uZXIgaGFzIGJlZW4gYXV0by1nZW5lcmF0ZWQgZnJvbSBtb3ppbGxhL3B5dGhvbl9tb3pldGwvYmluL21vemV0bC1kYXRhYnJpY2tzLnB5LgojIEFueSBjaGFuZ2VzIG1hZGUgdG8gdGhlIHJ1bm5lciBmaWxlIHdpbGwgYmUgb3Zlci13cml0dGVuIG9uIHN1YnNlcXVlbnQgcnVucy4KZnJvbSBtb3pldGwgaW1wb3J0IGNsaQoKdHJ5OgogICAgY2xpLmVudHJ5X3BvaW50KGF1dG9fZW52dmFyX3ByZWZpeD0iTU9aRVRMIikKZXhjZXB0IFN5c3RlbUV4aXQ6CiAgICAjIGF2b2lkIGNhbGxpbmcgc3lzLmV4aXQoKSBpbiBkYXRhYnJpY2tzCiAgICAjIGh0dHA6Ly9jbGljay5wYWxsZXRzcHJvamVjdHMuY29tL2VuLzcueC9hcGkvP2hpZ2hsaWdodD1hdXRvX2VudnZhcl9wcmVmaXgjY2xpY2suQmFzZUNvbW1hbmQubWFpbgogICAgcGFzcwo="
}
INFO:root:status: 200 reason: OK
INFO:root:{}
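For reference, the payload above has the shape of the Databricks REST API's dbfs/put endpoint. A minimal sketch of that upload step, assuming the requests library and a placeholder Databricks host (the actual script may differ):

# Sketch: upload the generated runner to DBFS.  <shard> and <TOKEN> are
# placeholders for the Databricks host and the API token passed on the CLI.
import base64
import json
import requests

with open("mozetl_runner.py") as f:
    runner_source = f.read()

payload = {
    "path": "/FileStore/airflow/mozetl_runner.py",
    "overwrite": True,
    # dbfs/put expects the file contents base64-encoded
    "contents": base64.b64encode(runner_source.encode("utf-8")).decode("ascii"),
}
resp = requests.post(
    "https://<shard>.cloud.databricks.com/api/2.0/dbfs/put",
    headers={"Authorization": "Bearer <TOKEN>"},
    data=json.dumps(payload),
)
print("status: {} reason: {}".format(resp.status_code, resp.reason))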

DEBUG:root:{
  "libraries": {
    "pypi": {
      "package": "git+https://github.com/acmiyaguchi/python_mozetl.git@federated-mozetl"
    }
  }, 
  "new_cluster": {
    "node_type_id": "c3.4xlarge", 
    "spark_version": "4.3.x-scala2.11", 
    "aws_attributes": {
      "instance_profile_arn": "arn:aws:iam::144996185633:instance-profile/databricks-ec2", 
      "availability": "ON_DEMAND"
    }, 
    "num_workers": 2
  }, 
  "spark_python_task": {
    "python_file": "dbfs:/FileStore/airflow/mozetl_runner.py", 
    "parameters": [
      "system_check"
    ]
  }, 
  "run_name": "mozetl local submission"
}
INFO:root:status: 200 reason: OK
INFO:root:{"run_id":6141}

@wlach (Contributor) left a comment

I'm going to question whether we need/want to support this. I'd personally rather people add these types of jobs inside this repository, so they can benefit from our linter/tests/review/etc.

If we do absolutely need to support this workflow for some reason, could you update the README with instructions on how to use this?

@acmiyaguchi (Contributor, Author)

The dependency management issues with Spark/pandas over the last few weeks have shown that managing dependencies in a mono-repo can be difficult. I'm concerned about pulling Python 3-only libraries into the repository, which is troublesome given the existing Python 2 support. This PR targets support for the scripts in bgbb_airflow, which uses new dependencies like numba and an up-to-date pandas; those could cause issues here because of the tight coupling of jobs (although the test suite does a fairly good job of catching errors).

I wouldn't recommend starting a new repo as the default mode, because it's much simpler to follow the conventions here for writing a new job, testing it, and deploying it. However, I think it's good to have it as an option, especially for a project that's pulling very large dependencies. Regardless of how the jobs are managed, our Airflow repository is the gatekeeper for production-ready jobs. Jobs that are effectively ported notebooks should be scrutinized if they aren't part of mozetl or tbv.

@wlach (Contributor) commented Feb 21, 2019

Ok, that makes sense! I'm ok with landing as long as people aren't scheduling jobs outside of python_mozetl as a way to do an end-run around proper review.

@acmiyaguchi (Contributor, Author)

The mozetl-databricks.py script should be Python 2/3 compatible now. I've tested it against the system_check job successfully.

I also updated the README with some pointers on setting the git-path and module-name options for testing, along with a link to this PR.
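For reference, the usual stdlib shim for making the HTTP calls work on both interpreters looks like this; this is a sketch of one common approach, not necessarily the exact code in the script:

try:
    # Python 3
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2
    from urllib2 import Request, urlopen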

@acmiyaguchi (Contributor, Author)

@wlach Thanks for the review!
