IP: Undo split indices for sensitive text detection #4904

Open
wants to merge 4 commits into base: main

sarayourfriend
Contributor

@sarayourfriend sarayourfriend commented Sep 10, 2024

Fixes

Part of #3336 by @AetherUnbound

Description

This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site. Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.

Current round

This discussion is currently in the Decision round.

The deadline for review of this round is 2024-09-25.

Checklist

  • My pull request has a descriptive title (not a vague title like "Update index.md").
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • [N/A] I added or updated tests for the changes I made (if applicable).
  • [N/A] I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (ov just catalog/generate-docs for catalog
    PRs) or the media properties generator (ov just catalog/generate-docs media-props
    for the catalog or ov just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository 🧱 stack: documentation Related to Sphinx documentation 🧭 project: implementation plan An implementation plan for a project labels Sep 10, 2024
@sarayourfriend sarayourfriend requested a review from a team as a code owner September 10, 2024 06:07
@sarayourfriend sarayourfriend requested review from krysal, stacimc and dhruvkb and removed request for a team and krysal September 10, 2024 06:07

Full-stack documentation: https://docs.openverse.org/_preview/4904

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

New files ➕:

Comment on lines +87 to +106
approach. The API will use a feature flag to determine which querying approach
to use. This makes it easy for us to switch to the old querying approach in case
of failure, without needing to roll back or revert code, which could impact
other ongoing projects.
Member

Excellent 👍
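
For readers following along, the flag-gated switch in the quoted passage could look roughly like the sketch below. This is only an illustration: the flag name `USE_NEW_SENSITIVE_QUERY`, the environment-variable mechanism, and both placeholder query builders are assumptions and are not taken from the plan.

```python
import os


def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a boolean feature flag from the environment (assumed mechanism)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def query_old_approach(search_term: str) -> dict:
    # Placeholder standing in for the existing querying approach.
    return {"approach": "old", "q": search_term}


def query_new_approach(search_term: str) -> dict:
    # Placeholder standing in for the querying approach the plan introduces.
    return {"approach": "new", "q": search_term}


def build_query(search_term: str) -> dict:
    # Flipping the flag switches approaches; no rollback or revert is needed.
    if flag_enabled("USE_NEW_SENSITIVE_QUERY"):
        return query_new_approach(search_term)
    return query_old_approach(search_term)
```

Because the flag is read on each call in this sketch, switching back to the old approach is a configuration change rather than a code change.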

Member

@dhruvkb dhruvkb left a comment

The plan looks good to me. The steps are logical, the changes to the API look correct and the approximate analysis of the performance impact also makes sense.

Member

@zackkrida zackkrida left a comment

@sarayourfriend this looks excellent but I'd like to suggest one addition: Could you define specific prerequisites for the "cleanup" steps? My thinking is that in the past on some projects, Nuxt 3 being a recent example, we have jumped into cleanup work somewhat hastily and perhaps without sufficient assurance that our changes were stable.

@sarayourfriend
Contributor Author

Could you define specific prerequisites for the "cleanup" steps?

Sure thing, good call out. When I get to revision (after Staci reviews for clarification round), I'll add something like the following:

Clean-up should occur only after 2 weeks of running the new approach in production, including two full production data refreshes. This is to ensure we sufficiently exercise the new approach, both during the data refresh and at query time, before taking actions that would make rolling back much more cumbersome.

Does that sound alright?

Contributor

@stacimc stacimc left a comment

This looks great to me, @sarayourfriend -- I had a question about the indexer worker in the local dev environment, but that should be easily handled. I'm curious about your thoughts on the ingestion approach, but I think this approach will work well and I see the tradeoffs.

- A constant `SENSITIVE_TERMS_LOC` set to
`Path(__file__).parent / "sensitive_terms.txt"`. Indexer workers are
ephemeral, so it is safe to write to disk without risking accidental
reuse across runs.
Contributor

The indexer worker in local development is not recreated between runs of the data refresh, actually, so we may need to handle the case where the file already exists.
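
One way to handle a leftover file might be to always overwrite it, as in the sketch below. The helper name `write_sensitive_terms` and the one-term-per-line format are assumptions for illustration; the plan only names `SENSITIVE_TERMS_LOC`.

```python
from pathlib import Path


def write_sensitive_terms(terms: list[str], dest: Path) -> Path:
    """Write terms to dest, replacing any file left over from a previous run."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary sibling first so readers never see a partial file.
    tmp = dest.with_name(dest.name + ".tmp")
    tmp.write_text("\n".join(terms) + "\n", encoding="utf-8")
    # Path.replace overwrites an existing destination file.
    tmp.replace(dest)
    return dest
```

With this shape, a worker that is reused across data refreshes always ends up with the most recently retrieved list, whether or not the file already existed.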

Contributor Author

I've pinged Staci in Slack to clarify one small detail about the indexer worker usage, and then I will be able to make a revision to clarify this point.

If the indexer workers manage at most one task, then we can simply move the call to retrieve_sensitive_terms into the handler for POST /task, and do so in all cases. If the indexer workers can receive multiple calls to POST /task in a single "lifecycle" (1 data refresh), then we make the call at API startup as already specified, but also call the function again in the POST /task handler whenever ENVIRONMENT != "production".
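
The two cases above could be sketched roughly as follows. This is framework-free and purely illustrative: `make_task_handler`, the `single_task_workers` switch, and the response shape are assumptions, and `retrieve_sensitive_terms` is passed in as a stub because its real signature is not shown here.

```python
from typing import Callable


def make_task_handler(
    retrieve_sensitive_terms: Callable[[], None],
    environment: str,
    single_task_workers: bool,
) -> Callable[[dict], dict]:
    """Build a stand-in for the POST /task handler."""

    def handle_post_task(payload: dict) -> dict:
        # Case 1: a worker handles at most one task, so always retrieve here.
        # Case 2: a worker may be reused, so startup covers production and
        # the handler retrieves again only outside production.
        if single_task_workers or environment != "production":
            retrieve_sensitive_terms()
        return {"status": "accepted", "payload": payload}

    return handle_post_task
```

In the second case the startup call would still happen elsewhere; the sketch only covers the handler side.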

@sarayourfriend
Contributor Author

I've also been thinking about this IP for the last week and regretting my recommendation of the sensitivity list. I think instead, an object of boolean properties like sensitivity: { text: boolean, user_reported: boolean } would be better. It could also have an any: boolean field as a normalised version of all the booleans in the object, which we could query against for simpler non-sensitive queries, which are the predominant kind of queries we make. Regardless of the version we go with, I want to change the IP to go with this approach. It has the following advantages:

  1. It does not require using a Painless script to update the document in the index (way simpler, less fiddly, easier to test and maintain).
  2. It produces an identical type of must_not term query on the boolean property/ies, matching our current query. Therefore, we can be confident it will have identical performance characteristics to our current query.
  3. It is potentially more flexible long term, as an object can much more naturally grow in properties to include things like which fields had sensitive text, which sensitive terms were detected, and so forth. These may be valuable, or even essential, for improving how our sensitive text detection works in the future.

The last advantage particularly applies in the context of a catalogue-based approach like the one Staci asked about in this comment.
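
A rough sketch of the sensitivity object and the matching query, to make the shape concrete. The helper names and the `title` match clause are assumptions for illustration; only the `text`, `user_reported`, and `any` properties and the `must_not` term query come from the description above.

```python
def build_sensitivity(text: bool, user_reported: bool) -> dict:
    """Build the sensitivity object, with `any` as the OR of the other flags."""
    return {
        "text": text,
        "user_reported": user_reported,
        "any": text or user_reported,
    }


def non_sensitive_query(search_term: str) -> dict:
    """Elasticsearch bool query that excludes documents flagged as sensitive."""
    return {
        "bool": {
            # Hypothetical match clause; the real query is more involved.
            "must": [{"match": {"title": search_term}}],
            # A single term filter on the denormalised boolean.
            "must_not": [{"term": {"sensitivity.any": True}}],
        }
    }
```

Queries that need finer control could filter on `sensitivity.text` or `sensitivity.user_reported` individually with the same kind of term clause.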

@sarayourfriend
Contributor Author

@zackkrida I've added details for a cool-off period in the plan in this commit: 13c2beb

@stacimc I've added details about the discussion we had yesterday re: moving the check into Airflow in this commit: 27b3781

That second commit also includes the update to use a sensitivity object with boolean properties and a denormalised any.

I am waiting on one last question to clarify from @stacimc that I sent in Slack regarding how the indexer workers are used, and then I will be able to make a small change (it will be small either way) to address this clarification Staci mentioned regarding the ephemerality of the indexer workers.

@sarayourfriend
Contributor Author

@dhruvkb and @stacimc this is ready for y'all to take another look and make a decision or raise blockers. @dhruvkb I know you left an approval before but just wanted to wait until the decision round to lock it in so feel free to change your mind, of course! 🙂

Labels
📄 aspect: text Concerns the textual material in the repository 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🧭 project: implementation plan An implementation plan for a project 🧱 stack: documentation Related to Sphinx documentation
Projects
Status: Pending proposal
Status: ✅ Approved
4 participants