
Initialize lift + shift for cross-db macros #359

Merged: 7 commits merged into main, Jun 17, 2022
Conversation

jtcohen6
Contributor

@jtcohen6 jtcohen6 commented May 18, 2022

Two Macros (Utils're Coming Home)

Follow-up to dbt-labs/dbt-core#5265 dbt-labs/dbt-core#5298

No more spark-utils??? Not quite, but close. I've opened a follow-on PR there to ensure backwards compatibility for those who celebrate: dbt-labs/spark-utils#25

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-spark next" section.

@jtcohen6
Contributor Author

jtcohen6 commented May 18, 2022

The datediff macro is failing in the session connection method with:

DEBUG    configured_file:functions.py:235 10:43:13.599281 [debug] [Thread-7  ]: Runtime Error in model test_datediff (models/test_datediff.sql)
  org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Insert of object "org.apache.hadoop.hive.metastore.model.MTable@5c1a0c62" using statement "INSERT INTO TBLS (TBL_ID,CREATE_TIME,DB_ID,LAST_ACCESS_TIME,OWNER,RETENTION,IS_REWRITE_ENABLED,SD_ID,TBL_NAME,TBL_TYPE,VIEW_EXPANDED_TEXT,VIEW_ORIGINAL_TEXT) VALUES (?,?,?,?,?,?,?,?,?,?,?,?)" failed : A truncation error was encountered trying to shrink LONG VARCHAR 'with data as (
  
      select * from test16528705702281762972_t&' to length 32700.)

This macro does compile SQL that is just about unreadably long. Perhaps it could be "minified"? :)

FWIW, it's passing on all other methods, so we might just mark it with @pytest.mark.skip_profile('session') for now.

Comment on lines +3 to +5
{% if order_by_clause %}
{{ exceptions.warn("order_by_clause is not supported for listagg on Spark/Databricks") }}
{% endif %}
Contributor Author

The docs make it pretty clear that:

The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

The only way I've found to do this requires a subquery to calculate rank() first, then pass the result into collect_list (with a struct and array_sort to boot, probably)
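The rank-then-collect approach described above can be illustrated in plain Python. This is purely a sketch of the shape (pair each value with its sort key like a struct, sort the collected pairs like array_sort, then concatenate like concat_ws); it is not the macro's actual SQL, and all names here are made up:

```python
# Illustrative only: emulates an order-aware listagg the way the comment
# describes for Spark SQL. Hypothetical helper, not part of dbt-spark.
def ordered_listagg(rows, measure, order_by, delimiter=","):
    """rows: list of dicts; measure/order_by: column names."""
    pairs = [(row[order_by], row[measure]) for row in rows]  # like struct(order_by, measure)
    pairs.sort(key=lambda p: p[0])                           # like array_sort
    return delimiter.join(str(p[1]) for p in pairs)          # like concat_ws

rows = [{"k": 2, "v": "b"}, {"k": 1, "v": "a"}, {"k": 3, "v": "c"}]
print(ordered_listagg(rows, "v", "k"))  # a,b,c
```

The point of the struct/sort step is exactly the determinism issue the Spark docs call out: without an explicit sort key, the collected order depends on the shuffle.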

@jtcohen6 jtcohen6 marked this pull request as ready for review June 16, 2022 10:37
@dbeatty10 dbeatty10 left a comment (Contributor)

:shipit: Overall, I feel good about shipping this since it's mainly moving the code from one repo to another.

For long-term peace-of-mind, would prefer that we invest in finding or creating a robust collection of test cases for dateadd and datediff. That way, we can have more confidence that these macros will produce the same results across adapters for the tricky edge cases. See inline comment for slightly more detail.

Comment on lines +51 to +53
@pytest.mark.skip_profile('spark_session')
class TestDateDiff(BaseDateDiff):
pass
Contributor

Why are the tests for datediff skipped for spark_session?

Time logic is complicated, and we have a highly custom implementation, so it feels crucial to test it if at all possible.

On the other hand, I think this logic has been battle-tested for a couple years, which gives a nice vote of confidence.

I'd also like to re-review the BaseDateDiff implementation to see if it has coverage of each possible datepart.

Would be awesome if we could find a robust suite of test cases that cover well-known edge cases like timestamps with a non-00 UTC offset, daylight savings boundaries, leap years, leap seconds, etc.

Spark changed its Julian vs. (Proleptic) Gregorian calendar handling between Spark 2.4 and 3.0, but not sure if we need to worry about that piece at all (talk, slides).
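As one small illustration of why such a suite matters, even plain calendar arithmetic has leap-year traps. A pure-Python example (illustrative only; these are not the macro's tests):

```python
from datetime import date

# Illustrative only: leap-year edge cases of the kind a shared cross-adapter
# test suite for datediff would want to pin down.
feb_2020 = (date(2020, 3, 1) - date(2020, 2, 1)).days  # 2020 is a leap year
feb_2021 = (date(2021, 3, 1) - date(2021, 2, 1)).days
print(feb_2020, feb_2021)  # 29 28
```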

Contributor Author

Would be awesome if we could find a robust suite of test cases that cover well-known edge cases like timestamps with a non-00 UTC offset, daylight savings boundaries, leap years, leap seconds, etc.

Agreed. This feels important for our work around foundational data types + current_timestamp as well.

A bunch of tests have been failing for spark_session. I'm skipping them now for expediency, but these tests are running on the four other connection types for which we have testing. Those connection types are:

  • thrift for local (self-hosted) Spark
  • Databricks interactive cluster via HTTP
  • Databricks interactive cluster via ODBC
  • Databricks SQL endpoint via ODBC

I'll be the first to admit that the spark_session connection method is an advanced capability that I don't know lots about :) It's useful for advanced users / PySpark superusers when testing locally, but I would not recommend using it in production. It will never be supported in dbt Cloud. We've documented it as such.

It's true that the datediff (and dateadd) macros produce a LUDICROUS amount of compiled SQL. The alternative to tons of repeated code would be to run each snippet as an introspective query, store its result, and template it into the subsequent operation.
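The reason the compiled SQL balloons is the per-datepart branching. A Python sketch of the semantics (illustrative only, not the macro's actual logic; the function name is made up):

```python
from datetime import date

# Illustrative only: per-datepart branching like what the dbt macro does in
# templated Spark SQL -- each supported datepart adds another branch, which
# is one reason the compiled output gets so long.
def datediff(datepart, start, end):
    if datepart == "day":
        return (end - start).days
    if datepart == "month":
        # boundary crossings, not elapsed whole months
        return (end.year - start.year) * 12 + (end.month - start.month)
    if datepart == "year":
        return end.year - start.year
    raise ValueError(f"unsupported datepart: {datepart}")

print(datediff("month", date(2022, 1, 31), date(2022, 2, 1)))  # 1
```

Note the "boundary crossings" convention: Jan 31 to Feb 1 is one month boundary crossed, even though only a day elapsed.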

That talk about calendar switching is ... cool!

I'm sure there are edge cases with these implementations. The work involved in reaching parity / consistency for just the integration test cases that we already have in place was immense. I think that's the feature, for now..?

@jtcohen6 jtcohen6 merged commit 9614bca into main Jun 17, 2022
@jtcohen6 jtcohen6 deleted the jerco/utils-lift-shift branch June 17, 2022 14:06
@dbeatty10
Contributor

🥳

ueshin added a commit to databricks/dbt-databricks that referenced this pull request Jul 11, 2022
### Description

Ports tests for lift + shift for cross-db macros from [dbt-labs/dbt-spark#359](dbt-labs/dbt-spark#359).
@McKnight-42 McKnight-42 mentioned this pull request Nov 17, 2022