IP: Undo split indices for sensitive text detection #4904

sarayourfriend · 2024-09-10T06:07:53Z

Fixes

Description

This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site. Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.

Current round

This discussion is currently in the Decision round.

The deadline for review of this round is 2024-09-25.

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
[N/A] I added or updated tests for the changes I made (if applicable).
[N/A] I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
[N/A] I ran the DAG documentation generator (ov just catalog/generate-docs for catalog
PRs) or the media properties generator (ov just catalog/generate-docs media-props
for the catalog or ov just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

github-actions · 2024-09-10T06:28:07Z

Full-stack documentation: https://docs.openverse.org/_preview/4904

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

New files ➕:

https://docs.openverse.org/_preview/4904/projects/proposals/trust_and_safety/detecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.html

zackkrida · 2024-09-10T13:56:51Z

...ecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md

+approach. The API will use a feature flag to determine which querying approach
+to use. This makes it easy for us to switch to the old querying approach in case
+of failure, without needing to roll back or revert code, which could impact
+other ongoing projects.



Excellent 👍
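For readers following along, the flag-based switch in the quoted passage could look roughly like the Python sketch below. It is only an illustration: the helper names `flag_enabled` and `build_search_target`, the flag name used in the usage note, the `-filtered` index suffix and the `sensitivity.any` field are assumptions made for this example and are not taken from the plan excerpt above.

```python
import os
from typing import Mapping, Optional


def flag_enabled(name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean feature flag from environment-style settings."""
    env = os.environ if env is None else env
    return env.get(name, "false").strip().lower() in {"1", "true", "yes"}


def build_search_target(
    media_type: str, include_sensitive: bool, new_approach: bool
) -> dict:
    """Return the index to query and any extra exclusion clauses.

    Old approach: non-sensitive searches go to a separate filtered index.
    New approach: every search hits one index and excludes flagged documents.
    """
    if include_sensitive:
        return {"index": media_type, "must_not": []}
    if new_approach:
        # Hypothetical field name, used here only for illustration.
        clause = {"term": {"sensitivity.any": True}}
        return {"index": media_type, "must_not": [clause]}
    # Hypothetical naming scheme for the old, separate filtered index.
    return {"index": f"{media_type}-filtered", "must_not": []}
```

With something like this, switching back to the old querying approach would amount to flipping the flag (for example a hypothetical `USE_SINGLE_INDEX` setting) rather than reverting code.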

dhruvkb

The plan looks good to me. The steps are logical, the changes to the API look correct and the approximate analysis of the performance impact also makes sense.


zackkrida

@sarayourfriend this looks excellent but I'd like to suggest one addition: Could you define specific prerequisites for the "cleanup" steps? My thinking is that in the past on some projects, Nuxt 3 being a recent example, we have jumped into cleanup work somewhat hastily and perhaps without sufficient assurance that our changes were stable.

sarayourfriend · 2024-09-12T04:20:23Z

Could you define specific prerequisites for the "cleanup" steps?

Sure thing, good call out. When I get to revision (after Staci reviews for clarification round), I'll add something like the following:

Clean-up should occur only after 2 weeks of running the new approach in production, including two full production data refreshes. This is to ensure we sufficiently exercise the new approach both during the data refresh and at query time before starting to take actions that will make rolling back much more cumbersome.

Does that sound alright?
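To make the proposed gate concrete, the two conditions (two weeks in production and two full production data refreshes) can be expressed as a single predicate. This is a sketch only; `cleanup_allowed` and its parameters are hypothetical names, and nothing in the thread says the check would be automated.

```python
from datetime import datetime, timedelta

COOL_OFF = timedelta(weeks=2)   # minimum time running in production
REQUIRED_REFRESHES = 2          # full production data refreshes


def cleanup_allowed(enabled_at: datetime, full_refreshes: int, now: datetime) -> bool:
    """Return True once both cool-off conditions from the proposal hold."""
    long_enough = now - enabled_at >= COOL_OFF
    refreshed_enough = full_refreshes >= REQUIRED_REFRESHES
    return long_enough and refreshed_enough
```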

stacimc

This looks great to me, @sarayourfriend -- I had a question about the indexer worker in the local dev environment, but that should be easily handled. I'm curious about your thoughts on the ingestion approach, but I think this approach will work well and I see the tradeoffs.

stacimc · 2024-09-13T16:47:24Z

...ecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md

+      - A constant `SENSITIVE_TERMS_LOC` set to
+        `Path(__file__).parent / "sensitive_terms.txt"`. Indexer workers are
+        ephemeral, so it is safe to write to disk without risking accidental
+        reuse across runs.


The indexer worker in local development is not recreated between runs of the data refresh, actually, so we may need to handle the case where the file already exists.
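One way to handle the "file already exists" case is to overwrite the file on every run instead of assuming a fresh disk. The sketch below assumes a hypothetical helper `write_sensitive_terms` that receives the terms as a list; how the terms are actually fetched is not shown in this excerpt and is left out.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def write_sensitive_terms(terms: List[str], dest: Path) -> Path:
    """Write one term per line to dest, replacing any file from an earlier run."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory and then swap it in,
    # so a reader never sees a half-written list.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write("".join(f"{term}\n" for term in terms))
        os.replace(tmp_name, dest)  # overwrites dest if it already exists
    except BaseException:
        # Clean up the temporary file if anything went wrong before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return dest
```

Because the destination is always replaced, a long-lived local worker would pick up the current list on each run, while an ephemeral production worker behaves as before.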

I've pinged Staci in Slack to clarify one small detail about the indexer worker usage, and then I will be able to make a revision to clarify this point.

If the indexer workers manage at most one task, then we can simply move the call to retrieve_sensitive_terms into the handler for POST /task, and do so in all cases. If the indexer workers can receive multiple calls to POST /task in a single "lifecycle" (1 data refresh), then we make the call at API startup as already specified, but also call the function again in the POST /task handler whenever ENVIRONMENT != "production".
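The two cases above can be sketched without committing to a web framework. In this illustration `make_task_handler`, `run_task` and the `single_task_workers` switch are hypothetical; only `retrieve_sensitive_terms`, `POST /task` and the `ENVIRONMENT` check come from the comment, and the real signature of `retrieve_sensitive_terms` is assumed to take no arguments.

```python
from typing import Callable


def make_task_handler(
    retrieve_sensitive_terms: Callable[[], None],
    run_task: Callable[[dict], dict],
    environment: str,
    single_task_workers: bool,
) -> Callable[[dict], dict]:
    """Build a handler for POST /task covering both cases described above."""
    if not single_task_workers:
        # Case 2: workers may serve several tasks, so fetch once at startup.
        retrieve_sensitive_terms()

    def handle_post_task(payload: dict) -> dict:
        # Case 1: a single-task worker always fetches in the handler.
        # Case 2: fetch again per task only outside production, where the
        # worker is long-lived and the list on disk may be stale.
        if single_task_workers or environment != "production":
            retrieve_sensitive_terms()
        return run_task(payload)

    return handle_post_task
```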


sarayourfriend · 2024-09-15T23:53:42Z

I've also been thinking about this IP for the last week and regretting my recommendation of the sensitivity list. I think instead, an object of boolean properties like sensitivity: { text: boolean, user_reported: boolean } would be better. It could also have an any: boolean field as a normalised version of all the booleans in the object, which we could query against for simpler non-sensitive queries, which are the predominant kind of queries we make. Regardless of the version we go with, I want to change the IP to go with this approach. It has the following advantages:

It does not require using a Painless script to update the document in the index (way simpler, less fiddly, easier to test and maintain).
It produces an identical type of must_not term query on the boolean property/ies, matching our current query. Therefore, we can be confident it will have identical performance characteristics to our current query.
It is potentially more flexible long term, as an object can much more naturally grow in properties to include things like which fields had sensitive text, which sensitive terms were detected, and so forth. These may be valuable or even essential information for improving how our sensitive text detection works in the future.

The last advantage particularly applies in the context of a catalogue-based approach like the one Staci asked about in this comment.
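As a rough illustration of the object described above, here is a Python sketch of the document fragment and the matching exclusion query. The field names `text`, `user_reported` and `any` come from the comment; the helper names `build_sensitivity` and `exclude_sensitive` are hypothetical, and the query is written as a plain Elasticsearch query DSL dictionary, not as the API's actual query-building code.

```python
def build_sensitivity(text: bool, user_reported: bool) -> dict:
    """Build the proposed object, with `any` precomputed from the other flags."""
    return {
        "text": text,
        "user_reported": user_reported,
        "any": text or user_reported,
    }


def exclude_sensitive(query: dict) -> dict:
    """Wrap a query so documents with sensitivity.any set to true are excluded."""
    return {
        "bool": {
            "must": [query],
            "must_not": [{"term": {"sensitivity.any": True}}],
        }
    }
```

Since the object is plain data, an update could presumably send it as a partial document (a body of the form `{"doc": {"sensitivity": ...}}`) in place of a script, which is the simplification the first advantage points at.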

sarayourfriend · 2024-09-19T05:08:39Z

@zackkrida I've added details for a cool-off period in the plan in this commit: 13c2beb

@stacimc I've added details about the discussion we had yesterday re: moving the check into Airflow in this commit: 27b3781

That second commit also includes the update to use a sensitivity object with boolean properties and a denormalised any.

I am waiting on one last question to clarify from @stacimc that I sent in Slack regarding how the indexer workers are used, and then I will be able to make a small change (it will be small either way) to address this clarification Staci mentioned regarding the ephemerality of the indexer workers.


sarayourfriend · 2024-09-20T00:36:40Z

@dhruvkb and @stacimc this is ready for y'all to take another look and make a decision or raise blockers. @dhruvkb I know you left an approval before but just wanted to wait until the decision round to lock it in so feel free to change your mind, of course! 🙂

sarayourfriend requested a review from a team as a code owner September 10, 2024 06:07

sarayourfriend requested review from krysal, stacimc and dhruvkb and removed request for a team and krysal September 10, 2024 06:07

sarayourfriend force-pushed the add/undo-split-filtered-index branch from b5f73fd to d616bf8 Compare September 10, 2024 06:19

zackkrida reviewed Sep 10, 2024


dhruvkb approved these changes Sep 11, 2024


zackkrida reviewed Sep 11, 2024


stacimc reviewed Sep 13, 2024


sarayourfriend added 4 commits September 20, 2024 10:15

IP: Undo split indices for sensitive text detection

1b886df

Clarify cool-off period before starting clean-up

5b22670

Switch to using a nested sensitivity object and explain why we will not put the check in Airflow

d7da7f8

Clear up ingestion worker ephemerality assumptions

e68381a

sarayourfriend force-pushed the add/undo-split-filtered-index branch from 27b3781 to e68381a Compare September 20, 2024 00:35

sarayourfriend requested review from stacimc and dhruvkb September 20, 2024 00:35
