forked from trinodb/trino.io
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
8639845
commit 220b85a
Showing
3 changed files
with
221 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,220 @@ | ||
--- | ||
layout: episode | ||
title: "21: Trino + dbt = a match made in SQL heaven?" | ||
date: 2021-07-08 | ||
tags: trino dbt etl sql | ||
youtube_id: "k0xQPWnLm_A" | ||
wistia_id: "e1p3d9d277" | ||
sections: | ||
- title: "Question of the week" | ||
desc: "Can dbt connect to different databases in the same project?" | ||
time: 1098 | ||
- title: "Concept of the week" | ||
desc: "Trino + dbt = a match made in SQL heaven?" | ||
time: 1288 | ||
- title: "Demo" | ||
desc: "Querying Trino from a dbt project" | ||
time: 2841 | ||
- title: "PR of the week" | ||
desc: "PR 8283 Externalised destination table cache expiry duration for BigQuery Connector" | ||
time: 4873 | ||
--- | ||
|
||
## Guests | ||
|
||
* Amy Chen, Partner Solutions Architect at [dbt Labs (formerly Fishtown Analytics)](https://www.getdbt.com/) | ||
([@yuanamychen](https://www.linkedin.com/in/yuanamychen/)) | ||
* Victor Coustenoble, Solutions Architect at [Starburst](https://www.starburst.io/) | ||
([@victorcouste](https://twitter.com/victorcouste)) | ||
|
||
## Release 359 | ||
|
||
Martin: | ||
|
||
* Row pattern recognition for window functions | ||
* Support for `SET TIME ZONE` | ||
* Support for `timestamp(n)` with precision higher than 3 in MySQL | ||
* ARM64-compatible docker image | ||
* Support for granting `UPDATE` privilege | ||
|
||
Manfred: | ||
|
||
* `SET TIME ZONE` is a feature from our guest Marius from last time! | ||
* ARM64 compatible docker image as well as already existing tar.gz and rpm means usage of Graviton and other ARM64 processors is now available also for Kubernetes users, there are significant cost/performance benefits, try it out | ||
* wow .. this time it took a whole month from 358 to 359 | ||
* breaking change - need Java 11.0.11 | ||
* more materialized view stuff, and I am working on docs! | ||
* Fix handling of multiple LDAP user bind patterns - for those of us in larger orgs.. | ||
* network logging in CLI | ||
* rename `connector.name` from `hive-hadoop2` to `hive` | ||
|
||
More info at <https://trino.io/docs/current/release/release-359.html>. | ||
|
||
## Question of the week: Can dbt connect to different databases in the same project? | ||
|
||
This week we are going a little out of order from our usual sequence on this | ||
show. The question really gets to the heart of the concept of the week. We'll | ||
cover this first then jump into the concept. | ||
|
||
This question was asked on [StackOverflow](https://stackoverflow.com/questions/63002171): | ||
|
||
> It seems dbt only works for a single database. If my data is in a different | ||
> database, will that still work? For example, if my datalake is using delta, | ||
> but I want to run dbt using Redshift, would dbt still work for this case? | ||
Our guest Victor replied: | ||
|
||
You can use Trino with dbt to connect to multiple databases in the same project. | ||
|
||
The GitHub example project [https://github.com/victorcouste/trino-dbt-demo](https://github.com/victorcouste/trino-dbt-demo) | ||
contains a fully working setup, that you can replicate and adapt to your needs. | ||
|
||
|
||
## Concept of the week: | ||
|
||
### What is dbt? | ||
|
||
dbt is a transformation workflow tool that lets teams quickly and collaboratively | ||
deploy analytics code, following software engineering best practices like | ||
modularity, CI/CD, testing, and documentation. It enables anyone who knows SQL | ||
to build production-grade data pipelines. | ||
|
||
When referring to dbt, it can mean two slightly different things. dbt core is | ||
the open source framework that provides the SQL compiler and framework to manage | ||
your SQL workflow. You can interact with it via a command line interface. In | ||
addition, dbtlabs offers the fully managed SaaS product dbt Cloud. You can use | ||
it to handle all of your dbt projects from development to deployment in a single | ||
browser based tool. It provides useful features like a full IDE to develop and | ||
test code, orchestration, logging, and alerting. At the moment, dbt Cloud is not | ||
available for Trino users. | ||
|
||
The framework allows you to check the quality of results, document the lineage, | ||
manage the changes/versions in the SQL scripts and orchestrate the queries, like | ||
a CI/CD framework but for your data. dbt is not an extract and load tool. The | ||
focus is on transforming what is already in your data warehouse/data lake. | ||
|
||
Check out these links to learn more: | ||
|
||
* [https://www.getdbt.com/](https://www.getdbt.com/) | ||
* [https://docs.getdbt.com/docs/introduction](https://docs.getdbt.com/docs/introduction) | ||
|
||
### Goals of dbt and how that differs from Trino | ||
|
||
<p align="center"> | ||
<img align="center" width="75%" height="100%" src="/assets/episode/21/dbt-trino-architecture.png"/><br/> | ||
Commander Bun Bun is diving deep to find anomalies! | ||
</p> | ||
|
||
Trino is the execution SQL engine and dbt is the framework to manage your SQL | ||
statements. dbt won't execute the SQL itself, rather it pushes all of the | ||
compute down to the SQL engine. This SQL engine can be Trino, or an engine | ||
included in the data source like the database itself. Using Trino as the SQL | ||
execution engine allows you to use the same SQL dialect for all connected data | ||
sources. This includes data sources that natively do not support SQL like object | ||
storage systems, Kafka, Elasticsearch, and many others. | ||
|
||
### Transformation vs ad-hoc joins | ||
|
||
Transformations done by dbt are in general used to clean and prepare data for | ||
analytics purposes. It's often used to go from the raw data to a ready-to-use | ||
data for reporting and analysis. dbt creates database objects like tables or | ||
views to be consumed by business users and analytics tools. | ||
|
||
On the other hand, even if Trino can also execute SQL to create tables and | ||
views, these SQL queries are not managed but just executed. Trino doesn't have, | ||
like dbt, all the framework to version, audit, document and orchestrate SQL | ||
script and execution. Trino is more used to execute SQL SELECT | ||
statements generated by users or BI tools to analyze data in an interactive way. | ||
|
||
### Cases for why you need both | ||
|
||
Trino and dbt are complementary when you need to access different sources from | ||
a single SQL query or when you need to run SQL query with good performance on | ||
object storage systems like S3, GCS, ADLS, or HDFS. | ||
|
||
It's where Trino can complement dbt, as dbt can only access a single data | ||
warehouse connection in a SQL query. In dbt there is no way to query multiple | ||
storage systems at the same time. | ||
|
||
Trino is recognized for great performance with object storage/data lake | ||
processing. With dbt it can transform and prepare data at scale. Trino also | ||
allows you to run dbt on a traditional, on-premise data warehouse where | ||
normally dbt only runs on a modern cloud data warehouse like Snowflake, | ||
BigQuery, or Redshift. | ||
|
||
### dbt basics | ||
|
||
dbtlabs offers a [good tutorial](https://docs.getdbt.com/tutorial/setting-up) | ||
which covers the fundamental topics of dbt for you to learn: | ||
|
||
* Project: A directory of SQL and YAML files defined with a single project file. | ||
* Models: A model is a single SQL file where you define your transformations to create a table or a view. | ||
* Profile: To define connections to your data sources. | ||
|
||
Then you have other resources like seeds, macros, tests, sources, snapshots. | ||
|
||
## Demo: Contributing to Trino | ||
|
||
Victor shows us a demo from | ||
[his blog post that inspired this episode](https://medium.com/geekculture/trino-dbt-a-match-in-sql-heaven-1df2a3d12b5e). | ||
|
||
If you looked at the code, you may have noticed that the code used an adapter | ||
called `db-presto-trino`. This adapter derives from the outdated presto naming and is | ||
still there for interaction with legacy Presto clusters. Although it can work | ||
it uses an outdated python client to interact with Trino and there is an open | ||
[issue to create an official `dbt-trino` adapter](https://github.com/dbt-labs/dbt-presto/issues/39) | ||
that uses the updated [trino-python-client](https://github.com/trinodb/trino-python-client). | ||
|
||
If you want to help with this, reach out on the issue itself and join the | ||
`#db-presto-trino` channel on the dbt Slack. | ||
[https://community.getdbt.com/](https://community.getdbt.com/) | ||
|
||
After the show [Marius Grama](https://twitter.com/findinpath), started [work on | ||
dbt-trino in his own repository](https://github.com/findinpath/dbt-trino). | ||
Thanks for the quick turnaround Marius! | ||
|
||
## PR of the week: PR 8283 Externalised destination table cache expiry duration for BigQuery Connector | ||
|
||
The [PR of the week](https://github.com/trinodb/trino/pull/8283), was committed | ||
by Ayush Bilala([Twitter](https://twitter.com/ayushbilala)), ([LinkedIn](https://www.linkedin.com/in/ayush-bilala/)), a Staff Software Engineer at | ||
Walmart Global Tech. | ||
|
||
This fixes [issue 8263](https://github.com/trinodb/trino/issues/8236) by adding | ||
a new configuration for the Big Query connector, `bigquery.views-cache-ttl` | ||
to allow configuring the cache expiration for BigQuery views. | ||
|
||
Thanks Ayush! | ||
|
||
## Events, news, and various links | ||
|
||
News | ||
- The "frog" book has been [translated to Chinese](https://item.jd.com/10028492426649.html)! | ||
Keep your eyes peeled for the rebrand into Trino for the translation. | ||
|
||
Trino Meetup groups | ||
- Virtual | ||
- [Virtual Trino Americas](https://www.meetup.com/trino-americas/) | ||
- [Virtual Trino EMEA](https://www.meetup.com/trino-emea/) | ||
- [Virtual Trino APAC](https://www.meetup.com/trino-apac/) | ||
- East Coast (US) | ||
- [Trino Boston](https://www.meetup.com/trino-boston/) | ||
- [Trino NYC](https://www.meetup.com/trino-nyc/) | ||
- West Coast (US) | ||
- [Trino San Fransisco](https://www.meetup.com/trino-san-francisco/) | ||
- [Trino Los Angeles](https://www.meetup.com/trino-los-angeles/) | ||
- Mid West (US) | ||
- [Trino Chicago](https://www.meetup.com/trino-chicago/) | ||
|
||
Latest training from David, Dain, and Martin(Now with timestamps!): | ||
- [Advanced SQL Training]({% post_url 2020-07-15-training-advanced-sql %}) | ||
- [Query Tuning Training]({% post_url 2020-07-30-training-query-tuning %}) | ||
- [Security Training]({% post_url 2020-08-13-training-security %}) | ||
- [Performance and Tuning Training]({% post_url 2020-08-27-training-performance %}) | ||
|
||
If you want to learn more about Trino, check out the definitive guide from | ||
OReilly. You can download | ||
[the free PDF](https://www.starburst.io/info/oreilly-trino-guide/) or | ||
buy the book online. | ||
|
||
Music for the show is from the [Megaman 6 Game Play album by Krzysztof | ||
Słowikowski](https://krzysztofslowikowski.bandcamp.com/album/mega-man-6-gp). |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.