Skip to content

v2.0.0

Latest
Compare
Choose a tag to compare
@henrydavidge henrydavidge released this 12 Mar 01:16
· 32 commits to main since this release

What's Changed

Major changes

  • Support Spark 3.4 and 3.5
  • Add functions for left and left semi joins with overlap criteria accelerated by Databricks' range join optimization
  • Register SQL functions via SQL extension service provider interface, so glow.register is no longer necessary if Glow is on the classpath when Spark is launched

Other user facing changes

  • Remove Hail integration
  • Remove features that frequently cause incompatibilities between versions (aggregate_by_index, CSV pipe transformer). Workarounds are provided in the documentation.

Internal changes

  • Future proof for Spark 4.0 / Scala 2.13 / JDK 17
  • Migrate CI and release process to GitHub Actions

Overlap join benchmarks

On a dataset with 1B left rows and 1M right rows and varying percentages of SNPs in the left table (tested with 1 4 core executor due to quota):

Inner range join + left join, all SNP percentages: 4h
Glow join, 0% SNPs: 4h
Glow join, 50% SNPs: 2h9m
Glow join, 90% SNPs: 0h42m

Other notes

The Python source artifact is built from tag v2.0.0-conda in order to fix Glow's conda recipe.

New Contributors

  • @dvcastillo made their first contribution in #505
  • @dtzeng made their first contribution in #519
  • @srowen made their first contribution in #524
  • @a-li made their first contribution in #522
  • @scala-steward-projectglow made their first contribution in #555

Full Changelog: v1.2.1...v2.0.0