Update mozetl-databricks.py
for running external mozetl-compatible modules
#316
Conversation
f3f66c8 to 2052c9e
I've updated the script to address a few of the issues found during review. This is the log from a run of:

```shell
$ python bin/mozetl-databricks.py \
    --git-path https://github.com/acmiyaguchi/python_mozetl.git \
    --git-branch federated-mozetl \
    --token <TOKEN> \
    system_check
```
DEBUG:root:
# This runner has been auto-generated from mozilla/python_mozetl/bin/mozetl-databricks.py.
# Any changes made to the runner file will be over-written on subsequent runs.
from mozetl import cli
try:
cli.entry_point(auto_envvar_prefix="MOZETL")
except SystemExit:
# avoid calling sys.exit() in databricks
# http://click.palletsprojects.com/en/7.x/api/?highlight=auto_envvar_prefix#click.BaseCommand.main
pass
DEBUG:root:{
"path": "/FileStore/airflow/mozetl_runner.py",
"overwrite": true,
"contents": "CiMgVGhpcyBydW5uZXIgaGFzIGJlZW4gYXV0by1nZW5lcmF0ZWQgZnJvbSBtb3ppbGxhL3B5dGhvbl9tb3pldGwvYmluL21vemV0bC1kYXRhYnJpY2tzLnB5LgojIEFueSBjaGFuZ2VzIG1hZGUgdG8gdGhlIHJ1bm5lciBmaWxlIHdpbGwgYmUgb3Zlci13cml0dGVuIG9uIHN1YnNlcXVlbnQgcnVucy4KZnJvbSBtb3pldGwgaW1wb3J0IGNsaQoKdHJ5OgogICAgY2xpLmVudHJ5X3BvaW50KGF1dG9fZW52dmFyX3ByZWZpeD0iTU9aRVRMIikKZXhjZXB0IFN5c3RlbUV4aXQ6CiAgICAjIGF2b2lkIGNhbGxpbmcgc3lzLmV4aXQoKSBpbiBkYXRhYnJpY2tzCiAgICAjIGh0dHA6Ly9jbGljay5wYWxsZXRzcHJvamVjdHMuY29tL2VuLzcueC9hcGkvP2hpZ2hsaWdodD1hdXRvX2VudnZhcl9wcmVmaXgjY2xpY2suQmFzZUNvbW1hbmQubWFpbgogICAgcGFzcwo="
}
INFO:root:status: 200 reason: OK
INFO:root:{}
DEBUG:root:{
"libraries": {
"pypi": {
"package": "git+https://github.com/acmiyaguchi/python_mozetl.git@federated-mozetl"
}
},
"new_cluster": {
"node_type_id": "c3.4xlarge",
"spark_version": "4.3.x-scala2.11",
"aws_attributes": {
"instance_profile_arn": "arn:aws:iam::144996185633:instance-profile/databricks-ec2",
"availability": "ON_DEMAND"
},
"num_workers": 2
},
"spark_python_task": {
"python_file": "dbfs:/FileStore/airflow/mozetl_runner.py",
"parameters": [
"system_check"
]
},
"run_name": "mozetl local submission"
}
INFO:root:status: 200 reason: OK
INFO:root:{"run_id":6141}
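For reference, the `contents` field in the DBFS put request above is simply the base64-encoded runner script. A minimal sketch of how such a payload can be built — this is an illustration of the request shape seen in the log, not the exact code from `mozetl-databricks.py`:

```python
import base64

# An abridged copy of the auto-generated runner logged above.
runner = """\
# This runner has been auto-generated from mozilla/python_mozetl/bin/mozetl-databricks.py.
# Any changes made to the runner file will be over-written on subsequent runs.
from mozetl import cli

try:
    cli.entry_point(auto_envvar_prefix="MOZETL")
except SystemExit:
    # avoid calling sys.exit() in databricks
    pass
"""

# Request body matching the shape of the logged dbfs/put call:
# the file contents are base64-encoded before upload.
payload = {
    "path": "/FileStore/airflow/mozetl_runner.py",
    "overwrite": True,
    "contents": base64.b64encode(runner.encode("utf-8")).decode("ascii"),
}
```

Decoding `payload["contents"]` round-trips back to the runner source, which is how the log output above was produced.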
I'm going to question whether we need/want to support this? I'd personally rather people add these types of jobs inside this repository, that way they can benefit from our linter/tests/review/etc.
If we do absolutely need to support this workflow for some reason, could you update the README with instructions on how to use this?
The dependency management issues with Spark/Pandas over the last few weeks have shown that managing dependencies in a monorepo can be difficult. I'm concerned about pulling Python 3-only libraries into the repository, which is troublesome because of the existing Python 2 support. This PR targets support for the scripts in bgbb_airflow, which uses new dependencies like numba and an up-to-date pandas that could cause issues here because of the tight coupling between jobs (although the test suite does a fairly good job of catching errors).

I wouldn't recommend starting a new repo as the default mode, because it's much simpler to follow the conventions here for writing a new job, testing it, and deploying it. However, I think it's good to have as an option, especially for a project that pulls in very large dependencies. Regardless of how the jobs are managed, our Airflow repository is the gatekeeper for production-ready jobs. Jobs that are effectively ported notebooks should be scrutinized if they aren't part of mozetl or tbv.
Ok, that makes sense! I'm ok with landing this as long as people aren't scheduling jobs outside of python_mozetl as a way to do an end-run around proper review.
I also updated the README with some pointers on setting the
@wlach Thanks for the review!
This updates the Databricks workflow for supporting external modules that follow the mozetl convention. The convention is as follows:

- the module uses the `click` library for defining a command-line interface
- options use the `@click.option` decorator; arguments that are required should use the `required=True` parameter
- `click` will expose command-line arguments via environment variables
- the module exposes a `cli` submodule such that `from my_module import cli` is a valid command

A module that adheres to these basic conventions can be scheduled via Airflow using standard tooling.
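A minimal sketch of a module following these conventions — the module name (`my_module`), command, and options here are hypothetical, chosen only to illustrate the shape:

```python
# my_module/cli.py -- the `cli` submodule of a hypothetical external module.
# The generated runner would call cli.entry_point(auto_envvar_prefix=...).
import click


@click.group()
def entry_point():
    """Top-level command group, mirroring mozetl's own cli.entry_point."""


@entry_point.command("my_job")
@click.option("--date", required=True, help="run date to process")
@click.option("--bucket", default="example-bucket", help="output location")
def my_job(date, bucket):
    # A real job would kick off its Spark work here.
    click.echo("running my_job for {} into {}".format(date, bucket))
```

With `auto_envvar_prefix` set, click also reads options from environment variables derived from the prefix, command, and option names (e.g. something like `MOZETL_MY_JOB_DATE` for the sketch above), which is what lets Airflow pass arguments through the environment.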