
Initialize lift + shift for cross-db macros #359

Merged: 7 commits merged into main, Jun 17, 2022
Conversation

jtcohen6
Contributor

@jtcohen6 jtcohen6 commented May 18, 2022

Two Macros (Utils're Coming Home)

Follow-up to dbt-labs/dbt-core#5265 dbt-labs/dbt-core#5298

No more spark-utils??? Not quite, but close. I've opened a follow-on PR there to ensure backwards compatibility for those who celebrate: dbt-labs/spark-utils#25

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-spark next" section.

@jtcohen6
Contributor Author

jtcohen6 commented May 18, 2022

The datediff macro is failing in the session connection method with:

DEBUG    configured_file:functions.py:235 10:43:13.599281 [debug] [Thread-7  ]: Runtime Error in model test_datediff (models/test_datediff.sql)
  org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Insert of object "org.apache.hadoop.hive.metastore.model.MTable@5c1a0c62" using statement "INSERT INTO TBLS (TBL_ID,CREATE_TIME,DB_ID,LAST_ACCESS_TIME,OWNER,RETENTION,IS_REWRITE_ENABLED,SD_ID,TBL_NAME,TBL_TYPE,VIEW_EXPANDED_TEXT,VIEW_ORIGINAL_TEXT) VALUES (?,?,?,?,?,?,?,?,?,?,?,?)" failed : A truncation error was encountered trying to shrink LONG VARCHAR 'with data as (
  
      select * from test16528705702281762972_t&' to length 32700.)

This macro does compile SQL that is just about unreadably long. Perhaps it could be "minified"? :)

FWIW, it's passing on all other methods, so we might just mark it with @pytest.mark.skip_profile('session') for now.

Comment on lines +3 to +5
{% if order_by_clause %}
{{ exceptions.warn("order_by_clause is not supported for listagg on Spark/Databricks") }}
{% endif %}
Contributor Author

The docs make it pretty clear that:

The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

The only way I've found to do this requires a subquery to calculate rank() first, then pass the result into collect_list (with a struct and array_sort to boot, probably)
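The rank-then-collect approach described above can be illustrated in plain Python. This is purely a sketch of the shape (pair each value with its sort key like a struct, sort the collected pairs like array_sort, then concatenate like concat_ws); it is not the macro's actual SQL, and all names here are made up:

```python
# Illustrative only: emulates an order-aware listagg the way the comment
# describes for Spark SQL. Hypothetical helper, not part of dbt-spark.
def ordered_listagg(rows, measure, order_by, delimiter=","):
    """rows: list of dicts; measure/order_by: column names."""
    pairs = [(row[order_by], row[measure]) for row in rows]  # like struct(order_by, measure)
    pairs.sort(key=lambda p: p[0])                           # like array_sort
    return delimiter.join(str(p[1]) for p in pairs)          # like concat_ws

rows = [{"k": 2, "v": "b"}, {"k": 1, "v": "a"}, {"k": 3, "v": "c"}]
print(ordered_listagg(rows, "v", "k"))  # a,b,c
```

The point of the struct/sort step is exactly the determinism issue the Spark docs call out: without an explicit sort key, the collected order depends on the shuffle.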

@jtcohen6 jtcohen6 marked this pull request as ready for review June 16, 2022 10:37
@dbeatty10 dbeatty10 left a comment (Contributor)

:shipit: Overall, I feel good about shipping this since it's mainly moving the code from one repo to another.

For long-term peace-of-mind, would prefer that we invest in finding or creating a robust collection of test cases for dateadd and datediff. That way, we can have more confidence that these macros will produce the same results across adapters for the tricky edge cases. See inline comment for slightly more detail.

Comment on lines +51 to +53
@pytest.mark.skip_profile('spark_session')
class TestDateDiff(BaseDateDiff):
pass
Contributor

Why are the tests for datediff skipped for spark_session?

Time logic is complicated, and we have a highly custom implementation, so it feels crucial to test it if at all possible.

On the other hand, I think this logic has been battle-tested for a couple years, which gives a nice vote of confidence.

I'd also like to re-review the BaseDateDiff implementation to see if it has coverage of each possible datepart.

Would be awesome if we could find a robust suite of test cases that cover well-known edge cases like timestamps with a non-00 UTC offset, daylight savings boundaries, leap years, leap seconds, etc.

Spark changed its Julian vs. (Proleptic) Gregorian calendar handling between Spark 2.4 and 3.0, but not sure if we need to worry about that piece at all (talk, slides).
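As one small illustration of why such a suite matters, even plain calendar arithmetic has leap-year traps. A pure-Python example (illustrative only; these are not the macro's tests):

```python
from datetime import date

# Illustrative only: leap-year edge cases of the kind a shared cross-adapter
# test suite for datediff would want to pin down.
feb_2020 = (date(2020, 3, 1) - date(2020, 2, 1)).days  # 2020 is a leap year
feb_2021 = (date(2021, 3, 1) - date(2021, 2, 1)).days
print(feb_2020, feb_2021)  # 29 28
```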

Contributor Author

Would be awesome if we could find a robust suite of test cases that cover well-known edge cases like timestamps with a non-00 UTC offset, daylight savings boundaries, leap years, leap seconds, etc.

Agreed. This feels important for our work around foundational data types + current_timestamp as well.

A bunch of tests have been failing for spark_session. I'm skipping them now for expediency, but these tests are running on the four other connection types for which we have testing. Those connection types are:

  • thrift for local (self-hosted) Spark
  • Databricks interactive cluster via HTTP
  • Databricks interactive cluster via ODBC
  • Databricks SQL endpoint via ODBC

I'll be the first to admit that the spark_session connection method is an advanced capability that I don't know lots about :) It's useful for advanced users / PySpark superusers when testing locally, but I would not recommend using it in production. It will never be supported in dbt Cloud. We've documented it as such.

It's true that the datediff (and dateadd) macros produce a LUDICROUS amount of compiled SQL. The alternative to tons of repeated code would be to run each snippet as an introspective query, store its result, and template it into the subsequent operation.
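The reason the compiled SQL balloons is the per-datepart branching. A Python sketch of the semantics (illustrative only, not the macro's actual logic; the function name is made up):

```python
from datetime import date

# Illustrative only: per-datepart branching like what the dbt macro does in
# templated Spark SQL -- each supported datepart adds another branch, which
# is one reason the compiled output gets so long.
def datediff(datepart, start, end):
    if datepart == "day":
        return (end - start).days
    if datepart == "month":
        # boundary crossings, not elapsed whole months
        return (end.year - start.year) * 12 + (end.month - start.month)
    if datepart == "year":
        return end.year - start.year
    raise ValueError(f"unsupported datepart: {datepart}")

print(datediff("month", date(2022, 1, 31), date(2022, 2, 1)))  # 1
```

Note the "boundary crossings" convention: Jan 31 to Feb 1 is one month boundary crossed, even though only a day elapsed.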

That talk about calendar switching is ... cool!

I'm sure there are edge cases with these implementations. The work involved in reaching parity / consistency for just the integration test cases that we already have in place was immense. I think that's the feature, for now..?

@jtcohen6 jtcohen6 merged commit 9614bca into main Jun 17, 2022
@jtcohen6 jtcohen6 deleted the jerco/utils-lift-shift branch June 17, 2022 14:06
@dbeatty10
Contributor

🥳

ueshin added a commit to databricks/dbt-databricks that referenced this pull request Jul 11, 2022
### Description

Ports tests for lift + shift for cross-db macros from [dbt-labs/dbt-spark#359](dbt-labs/dbt-spark#359).
@McKnight-42 McKnight-42 mentioned this pull request Nov 17, 2022