Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement relationship and refine join match warning #6753

Merged
merged 5 commits into from
Feb 24, 2023

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Feb 22, 2023

Closes #6731
Closes #6717
Joint PR on the vctrs side r-lib/vctrs#1791, which contains even more detail

This PR hopefully fixes most of the complaints we've seen about the multiple match warning. At a high level:

  • We previously warned on one-to-many and many-to-many relationships
  • We now only warn on many-to-many relationships

As outlined in #6731, only warning on many-to-many makes much more sense, as one-to-many is generally pretty useful, and is symmetric with many-to-one which we don't warn on, and that didn't make much sense. As further proof that we should warn on many-to-many, most SQL dialects won't even let you create a many-to-many "relationship" between two tables; instead you have to create a 3rd junction table to break it into two one-to-many relationships.

This PR accomplishes this in two steps by:

  • Changing the multiple default to "all" (same as SQL), and deprecating the NULL, "error", and "warning" options
  • Introducing a new relationship argument which replaces the multiple = "error" case and holistically handles multiple matches between x and y

The relationship argument takes on various options:

  • NULL (default, chooses vctrs options of "none" or "warn-many-to-many" automatically)
  • "one-to-one"
  • "one-to-many"
  • "many-to-one"
  • "many-to-many"

These are inspired by database table relationships, which end up being orthogonal to the idea of the "kind" of join you are doing (i.e. left/right/full/inner), so we can add an argument for this without any ambiguity or conflicts.

The choice of relationship activates a constraint on x and y, for example, "many-to-one" says that:

  • Each row of x can match at most 1 row in y
  • Each row of y can match any number of rows in x

Which makes relationship = "many-to-one" a nice replacement for multiple = "error". But we also now have "one-to-one" and "one-to-many" too!

Note that at most 1 means this doesn't handle the 0 match case, which is instead handled jointly by unmatched and your choice of join.

For NULL:

  • With equality joins, it won't assume any expected relationship, but instead will check to see if a many-to-many relationship is present and will warn if it is, encouraging you to either look at your data or explicitly set "many-to-many" if that is expected. This is the only time we now automatically warn, so it should come up much less often.
  • With inequality, rolling, and overlap joins, it won't perform any checks.

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed analysis on this!

R/join.R Outdated Show resolved Hide resolved
By removing `"none"` and `"warn-many-to-many"`, which are covered by `NULL`
@DavisVaughan

This comment was marked as duplicate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants