
Update mozetl-databricks.py for running external mozetl-compatible modules #316

Merged · 13 commits · Feb 21, 2019

Conversation

@acmiyaguchi (Contributor)

This updates the Databricks workflow for supporting external modules that follow the mozetl convention. The convention is as follows:

  • The module uses the click library to define a command-line interface
  • All arguments use the @click.option decorator; arguments that are required should set required=True
    • click can also read option values from environment variables when the entry point is invoked with an auto_envvar_prefix
  • The module has a top-level cli submodule, such that from my_module import cli is a valid import

A module that adheres to these basic conventions can be scheduled via Airflow using standard tooling; a minimal sketch follows.
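For illustration, here is a hypothetical module that follows the conventions above. The module name (my_module) and the options are made up; only the click usage mirrors the convention.

# my_module/cli.py -- hypothetical example of a mozetl-compatible module
import click

@click.command()
@click.option("--submission-date", required=True, help="Date partition to process.")
@click.option("--bucket", default="my-output-bucket", help="Output location (hypothetical).")
def entry_point(submission_date, bucket):
    """Run the job for a single submission date."""
    click.echo("processing {} into {}".format(submission_date, bucket))

if __name__ == "__main__":
    # With auto_envvar_prefix, click also reads option values from
    # environment variables such as MY_MODULE_SUBMISSION_DATE.
    entry_point(auto_envvar_prefix="MY_MODULE")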

@acmiyaguchi (Contributor, Author)

I've updated the script to address a few of the issues found during review; here is the log from a run of the system_check module.

$ python bin/mozetl-databricks.py \
>     --git-path https://github.com/acmiyaguchi/python_mozetl.git \
>     --git-branch federated-mozetl \
>     --token <TOKEN> \
>     system_check
DEBUG:root:
# This runner has been auto-generated from mozilla/python_mozetl/bin/mozetl-databricks.py.
# Any changes made to the runner file will be over-written on subsequent runs.
from mozetl import cli

try:
    cli.entry_point(auto_envvar_prefix="MOZETL")
except SystemExit:
    # avoid calling sys.exit() in databricks
    # http://click.palletsprojects.com/en/7.x/api/?highlight=auto_envvar_prefix#click.BaseCommand.main
    pass

DEBUG:root:{
  "path": "/FileStore/airflow/mozetl_runner.py", 
  "overwrite": true, 
  "contents": "CiMgVGhpcyBydW5uZXIgaGFzIGJlZW4gYXV0by1nZW5lcmF0ZWQgZnJvbSBtb3ppbGxhL3B5dGhvbl9tb3pldGwvYmluL21vemV0bC1kYXRhYnJpY2tzLnB5LgojIEFueSBjaGFuZ2VzIG1hZGUgdG8gdGhlIHJ1bm5lciBmaWxlIHdpbGwgYmUgb3Zlci13cml0dGVuIG9uIHN1YnNlcXVlbnQgcnVucy4KZnJvbSBtb3pldGwgaW1wb3J0IGNsaQoKdHJ5OgogICAgY2xpLmVudHJ5X3BvaW50KGF1dG9fZW52dmFyX3ByZWZpeD0iTU9aRVRMIikKZXhjZXB0IFN5c3RlbUV4aXQ6CiAgICAjIGF2b2lkIGNhbGxpbmcgc3lzLmV4aXQoKSBpbiBkYXRhYnJpY2tzCiAgICAjIGh0dHA6Ly9jbGljay5wYWxsZXRzcHJvamVjdHMuY29tL2VuLzcueC9hcGkvP2hpZ2hsaWdodD1hdXRvX2VudnZhcl9wcmVmaXgjY2xpY2suQmFzZUNvbW1hbmQubWFpbgogICAgcGFzcwo="
}
INFO:root:status: 200 reason: OK
INFO:root:{}
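For reference, the payload above has the shape of the Databricks REST API's dbfs/put endpoint. A minimal sketch of that upload step, assuming the requests library and a placeholder Databricks host (the actual script may differ):

# Sketch: upload the generated runner to DBFS.  <shard> and <TOKEN> are
# placeholders for the Databricks host and the API token passed on the CLI.
import base64
import json
import requests

with open("mozetl_runner.py") as f:
    runner_source = f.read()

payload = {
    "path": "/FileStore/airflow/mozetl_runner.py",
    "overwrite": True,
    # dbfs/put expects the file contents base64-encoded
    "contents": base64.b64encode(runner_source.encode("utf-8")).decode("ascii"),
}
resp = requests.post(
    "https://<shard>.cloud.databricks.com/api/2.0/dbfs/put",
    headers={"Authorization": "Bearer <TOKEN>"},
    data=json.dumps(payload),
)
print("status: {} reason: {}".format(resp.status_code, resp.reason))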

DEBUG:root:{
  "libraries": {
    "pypi": {
      "package": "git+https://github.com/acmiyaguchi/python_mozetl.git@federated-mozetl"
    }
  }, 
  "new_cluster": {
    "node_type_id": "c3.4xlarge", 
    "spark_version": "4.3.x-scala2.11", 
    "aws_attributes": {
      "instance_profile_arn": "arn:aws:iam::144996185633:instance-profile/databricks-ec2", 
      "availability": "ON_DEMAND"
    }, 
    "num_workers": 2
  }, 
  "spark_python_task": {
    "python_file": "dbfs:/FileStore/airflow/mozetl_runner.py", 
    "parameters": [
      "system_check"
    ]
  }, 
  "run_name": "mozetl local submission"
}
INFO:root:status: 200 reason: OK
INFO:root:{"run_id":6141}

@wlach (Contributor) left a comment

I'm going to question whether we need/want to support this. I'd personally rather people add these types of jobs inside this repository, so they can benefit from our linter/tests/review/etc.

If we do absolutely need to support this workflow for some reason, could you update the README with instructions on how to use this?

@acmiyaguchi (Contributor, Author)

The dependency management issues with Spark/pandas over the last few weeks have shown that managing dependencies in a mono-repo can be difficult. I'm concerned about pulling Python 3-only libraries into the repository, which is troublesome given the existing Python 2 support. This PR targets support for the scripts in bgbb_airflow, which uses new dependencies like numba and an up-to-date pandas; those could cause issues here because of the tight coupling of jobs (although the test suite does a fairly good job of catching errors).

I wouldn't recommend starting a new repo as the default mode, because it's much simpler to follow the conventions here for writing a new job, testing it, and deploying it. However, I think it's good to have it as an option, especially for a project that's pulling very large dependencies. Regardless of how the jobs are managed, our Airflow repository is the gatekeeper for production-ready jobs. Jobs that are effectively ported notebooks should be scrutinized if they aren't part of mozetl or tbv.

@wlach (Contributor) commented Feb 21, 2019

Ok, that makes sense! I'm ok with landing as long as people aren't scheduling jobs outside of python_mozetl as a way to do an end-run around proper review.

@acmiyaguchi (Contributor, Author)

The mozetl-databricks.py script should be Python 2/3 compatible now. I've tested it against the system_check job successfully.

I also updated the README with some pointers on setting the git-path and module-name options for testing, along with a link to this PR.
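For reference, the usual stdlib shim for making the HTTP calls work on both interpreters looks like this; this is a sketch of one common approach, not necessarily the exact code in the script:

try:
    # Python 3
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2
    from urllib2 import Request, urlopen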

@acmiyaguchi (Contributor, Author)

@wlach Thanks for the review!
