IP: Undo split indices for sensitive text detection #4904
b5f73fd
to
d616bf8
Compare
Full-stack documentation: https://docs.openverse.org/_preview/4904
Please note that GitHub Pages takes a little time to deploy newly pushed code. If the link above doesn't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
approach. The API will use a feature flag to determine which querying approach
to use. This makes it easy for us to switch to the old querying approach in case
of failure, without needing to roll back or revert code, which could impact
other ongoing projects.
Excellent 👍
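For readers skimming the thread, the flag-based switch described in the quoted plan text could look roughly like the sketch below. It is only an illustration under stated assumptions: the flag name, the index names, the `sensitive` field, and the query shapes are all hypothetical placeholders and are not taken from the implementation plan.

```python
import os

# Hypothetical flag name; the quoted plan text does not name the flag.
FLAG_ENV_VAR = "USE_SINGLE_INDEX_SENSITIVITY_QUERY"


def flag_enabled(environ=None) -> bool:
    """Return True when the hypothetical flag is switched on."""
    environ = os.environ if environ is None else environ
    value = environ.get(FLAG_ENV_VAR, "")
    return value.strip().lower() in {"1", "true", "yes", "on"}


def query_with_split_indices(term: str) -> dict:
    # Placeholder for the old approach: target a separate filtered index.
    return {"index": "image-filtered", "query": {"match": {"title": term}}}


def query_with_single_index(term: str) -> dict:
    # Placeholder for the new approach: one index, exclude flagged documents.
    return {
        "index": "image",
        "query": {
            "bool": {
                "must": [{"match": {"title": term}}],
                "must_not": [{"term": {"sensitive": True}}],
            }
        },
    }


def build_query(term: str, environ=None) -> dict:
    """Pick the querying approach based on the flag, with no code revert needed."""
    if flag_enabled(environ):
        return query_with_single_index(term)
    return query_with_split_indices(term)
```

With a switch of this shape, falling back to the old approach is a configuration change rather than a rollback, which is the property the plan text relies on.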
The plan looks good to me. The steps are logical, the changes to the API look correct and the approximate analysis of the performance impact also makes sense.
...ecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md
@sarayourfriend this looks excellent but I'd like to suggest one addition: Could you define specific prerequisites for the "cleanup" steps? My thinking is that in the past on some projects, Nuxt 3 being a recent example, we have jumped into cleanup work somewhat hastily and perhaps without sufficient assurance that our changes were stable.
Sure thing, good call out. When I get to revision (after Staci reviews for clarification round), I'll add something like the following:
Does that sound alright?
This looks great to me, @sarayourfriend -- I had a question about the indexer worker in the local dev environment, but that should be easily handled. I'm curious about your thoughts on the ingestion approach, but I think this approach will work well and I see the tradeoffs.
- A constant `SENSITIVE_TERMS_LOC` set to
  `Path(__file__).parent / "sensitive_terms.txt"`. Indexer workers are
  ephemeral, so it is safe to write to disk without risking accidental
  reuse across runs.
The indexer worker in local development is not recreated between runs of the data refresh, actually, so we may need to handle the case where the file already exists.
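One way to handle a file left over from an earlier run is to overwrite it unconditionally and atomically on every fetch. The sketch below is an assumption-laden illustration, not the plan's implementation: `write_sensitive_terms` and its `fetch_terms` parameter are hypothetical names, and `dest` stands in for the `SENSITIVE_TERMS_LOC` constant quoted above.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def write_sensitive_terms(
    fetch_terms: Callable[[], Iterable[str]], dest: Path
) -> Path:
    """Write freshly fetched terms to dest, replacing any file from an earlier run.

    fetch_terms is a hypothetical callable standing in for however the
    terms are actually retrieved.
    """
    terms = [t.strip() for t in fetch_terms() if t.strip()]
    # Write to a temporary file in the same directory, then swap it in,
    # so a reader never sees a half-written or stale-plus-new file.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write("\n".join(terms) + "\n")
        os.replace(tmp_name, dest)  # overwrites an existing dest
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)
        raise
    return dest
```

Because the destination is always replaced, it does not matter whether the worker container was recreated between data refreshes.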
I've pinged Staci in Slack to clarify one small detail about the indexer worker usage, and then I will be able to make a revision to clarify this point.
If the indexer workers manage at most one task, then we can simply move the call to `retrieve_sensitive_terms` into the handler for `POST /task`, and do so in all cases. If the indexer workers can receive multiple calls to `POST /task` in a single "lifecycle" (1 data refresh), then we do the call in API start-up as already specified, but also call the function again in the `POST /task` handler whenever `ENVIRONMENT != "production"`.
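The two alternatives in this comment can be compared side by side in a small sketch. Everything except the `retrieve_sensitive_terms` name and the `ENVIRONMENT != "production"` condition is assumed for illustration: the class, its methods, and the `one_task_per_lifecycle` switch are hypothetical and do not reflect the real indexer worker code.

```python
from typing import Callable


class IndexerWorkerSketch:
    """Hypothetical stand-in for the indexer worker; not the real class."""

    def __init__(
        self,
        environment: str,
        one_task_per_lifecycle: bool,
        retrieve_sensitive_terms: Callable[[], None],
    ) -> None:
        self.environment = environment
        self.one_task_per_lifecycle = one_task_per_lifecycle
        self.retrieve_sensitive_terms = retrieve_sensitive_terms

    def on_startup(self) -> None:
        # Second alternative only: fetch once at API start-up.
        if not self.one_task_per_lifecycle:
            self.retrieve_sensitive_terms()

    def handle_post_task(self) -> None:
        # First alternative: always fetch inside the POST /task handler.
        # Second alternative: fetch again here only outside production.
        if self.one_task_per_lifecycle or self.environment != "production":
            self.retrieve_sensitive_terms()
```

In the second alternative, a production worker fetches the terms once per lifecycle, while a long-lived local worker refreshes them on every task.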
I've also been thinking about this IP for the last week and regretting my recommendation of the
The last advantage particularly applies in the context of a catalogue-based approach like the one Staci asked about in this comment.
@zackkrida I've added details for a cool-off period in the plan in this commit: 13c2beb

@stacimc I've added details about the discussion we had yesterday re: moving the check into Airflow in this commit: 27b3781

That second commit also includes the update to use a sensitivity object with boolean properties and a denormalised

I am waiting on one last question to clarify from @stacimc that I sent in Slack regarding how the indexer workers are used, and then I will be able to make a small change (it will be small either way) to address this clarification Staci mentioned regarding the ephemerality of the indexer workers.
27b3781
to
e68381a
Compare
Fixes
Part of #3336 by @AetherUnbound
Description
This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site. Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.
Current round
This discussion is currently in the Decision round.
The deadline for review of this round is 2024-09-25.
Checklist
- […] `Update index.md`).
- […] (`main`) or a parent feature branch.
- […] (`ov just catalog/generate-docs` for catalog PRs) or the media properties generator (`ov just catalog/generate-docs media-props` for the catalog or `ov just api/generate-docs` for the API) where applicable.
Developer Certificate of Origin