Databricks Workflow (Alpha)

This repository is an example of how to set up a multi-environment data processing pipeline using Databricks.

If you are part of a Data Engineering or Data Science team and want to start a project in Databricks, you can use this repository as a jump start.

This template shows some of the best practices and processes we have set up for production, research and development at Quby.

This code can help answer the questions you might have when starting from scratch, such as:

  • How do I set up a production pipeline in Databricks?
  • How do I run unit tests on Spark transformations?
  • How do I run integration tests on notebook workflows?
  • How do I rapidly prototype data transformations on real data?
  • How do I refactor notebook experiments into tested code?
  • How do I organize the code?
  • How do I manage multiple environments and configurations?

This repository is also an open invitation to all developers out there who see room for improvement in our current practices. Don't hesitate to contribute and make a pull request with your suggestions!

Getting started

To get started with a paid or trial version of Databricks, follow the steps below. If you want to get started with the Databricks Community Edition instead, follow these steps.

  1. Set up your Databricks account (this requires an AWS or Azure account)
  2. Create an authentication token on your Databricks account
  3. Install and configure the Databricks CLI (this example repository uses your default Databricks profile)
    1. pip install databricks-cli
    2. databricks configure --token
  4. Install jq; we use it to parse and combine the JSON configuration files for the jobs.
    1. brew install jq
  5. Clone this repository on your local machine
    1. git clone git@github.com:quby-io/databricks-workflow.git

You are good to go :)

Run make help to see the available actions.

Project structure

The project is structured into three folders:

  • jobs: Contains the job definitions and configuration of the scheduled Databricks jobs, along with the notebooks they execute. More details can be found in the jobs Readme.md
  • scala: Contains the Scala code and related unit tests. The content of this directory gets compiled, packaged, and deployed to each environment.
  • scripts: Contains bash scripts that are used for managing the environment deployments and development workflow. These scripts are triggered through the make targets.

Development workflow

When developing a new feature against real data, you can proceed as follows:

  1. Create a new environment
    1. Duplicate /jobs/environments/staging.json and rename it to reflect the feature's intent, e.g. /jobs/environments/dev_my_feature.json
    2. Adjust the configuration of the new environment to fit your needs (e.g. change the featureDb parameter)
    3. Deploy the whole stack (notebooks and cluster) with make dev env=dev_my_feature job=create_features. You should now have a new cluster with your libraries installed and a copy of your notebook to work on.
  2. Navigate to your notebook directory /dev/dev_my_feature/create_features and attach your notebook to the newly created cluster dev_my_feature_create_features
  3. Explore and try out all the changes you need
  4. Import the notebooks back to your local development environment with make import_dev_notebooks env=dev_my_feature job=create_features
  5. Refactor your code by extracting the new logic into a transformation function (see the sketch after this list)
  6. Add unit tests for your function
  7. Run the integration tests with make integration_test
  8. Deploy the new job in your isolated environment with make deploy env=dev_my_feature
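
To make step 5 concrete: the refactoring usually means moving the Spark logic out of the notebook and into a plain function in the scala directory, keeping reads and writes in the notebook. A minimal sketch of what such a function could look like (the object name, column names, and threshold are hypothetical, not taken from this repository):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical example: the names and columns are illustrative only.
object FeatureTransformations {

  // A pure transformation: DataFrame in, DataFrame out, no I/O.
  // Keeping reads/writes in the notebook and the logic here makes
  // the function easy to unit test off-cluster.
  def flagHighConsumption(readings: DataFrame, thresholdKwh: Double): DataFrame =
    readings.withColumn("high_consumption", col("consumption_kwh") > thresholdKwh)
}
```

Because the function neither reads nor writes data itself, the notebook shrinks to I/O plus a call to this function, and the logic becomes testable without a cluster.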

Deploy an environment

You can deploy an environment by using the make deploy target.

E.g. make deploy env=staging

By default, you have two environments available: staging and production.

There is a third environment called integration_test, which is deployed without any scheduling and is used for running integration tests. There is no need to deploy the integration_test environment explicitly; the integration test script takes care of that.

Run unit tests

Execute make test
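
As an illustration, a unit test for a transformation like the one sketched in the development workflow section can run Spark in local mode. This sketch assumes ScalaTest; the test framework and helpers actually used in this repository may differ:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class FeatureTransformationsTest extends AnyFunSuite {

  // Local-mode SparkSession: no cluster required, so the suite can run
  // wherever the JVM and the Spark libraries are available.
  private val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("unit-tests")
    .getOrCreate()

  import spark.implicits._

  test("readings above the threshold are flagged") {
    val input = Seq(("home-1", 5.0), ("home-2", 12.5))
      .toDF("home_id", "consumption_kwh")

    val result = FeatureTransformations.flagHighConsumption(input, thresholdKwh = 10.0)

    val flagged = result.filter($"high_consumption").select("home_id").as[String].collect()
    assert(flagged.sameElements(Array("home-2")))
  }
}
```

Because the SparkSession runs in local mode, make test can exercise this logic on your machine before anything is deployed.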

Run integration tests

The integration tests run on the Databricks platform. The integration test script deploys an independent environment called integration_test and sequentially executes all the jobs defined under the .active_jobs section of integration_test.json.

make integration_test
