
feat: python-rq integration using datasets reindex as proof of concept #4427

Closed
wants to merge 13 commits into from

Conversation

jfcalvo
Member

@jfcalvo jfcalvo commented Dec 18, 2023

Description

This PR includes changes as a proof of concept to show how to integrate the rq background processor with Argilla.

The changes also include two new endpoints:

  • PUT /api/v1/datasets/:dataset_id/reindex
    • This endpoint returns an HTTP 202 (Accepted) status.
    • A background job is enqueued to reindex the dataset.
    • The response body includes the id of the job and its status (queued, if everything went fine).
    • Users can use the job id to check the status of the job.
  • GET /api/v1/jobs/:job_id
    • This endpoint is used to obtain information about one specific job (returning the id and status).
    • Jobs are currently not stored in the database; I'm using the rq API to get information about their status.
    • rq keeps job information in Redis for 500 seconds, so after a job finishes or fails the user has 500 seconds to retrieve information about it.

Possible improvements:

  • Define a proper Redis connection using a pool of connections and settings from environment variables. Redis uses a connection pool by default, and I have added a new environment variable (ARGILLA_REDIS_URL) to configure the connection.
  • Define a better way to store our jobs, maybe using a new jobs table in the Argilla database that also lets us save job results there. We will start by using rq results stored in Redis and, for more complex flows in the future, consider adding some data to our database if necessary.
  • Define an rq queue only for search engine purposes. We will use the default queue for now.
  • Once the Reindexer class code is merged from the PR adding the reindex CLI task we can remove it from this PR. Update: that PR is already merged, and the jobs now import and use it.
  • Add a result field to the Job schema so we can include the result of the job in the response. (Useful to know if there were errors or to expose additional information about the process.)
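The last improvement above, a Job schema extended with a result field, could look something like the following. This is a hypothetical stdlib-only sketch (the real API schema would live in Argilla's schema modules, and the field names are illustrative):

```python
# Hypothetical sketch of a Job response schema with an optional `result`
# field, using a stdlib dataclass as a stand-in for the actual API schema.
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class JobSchema:
    id: str
    status: str
    # Optional payload: errors or additional information about the process.
    result: Optional[Any] = None


# Illustrative values only; a real job id is generated by rq.
job = JobSchema(id="abc123", status="finished", result={"records_reindexed": 1000})
print(asdict(job))
```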

Things to investigate/discuss:

  • How to run Redis in our Docker images, especially the Quickstart images on HF.
  • Alternatives to a real Redis server, e.g. using the fakeredis Python library instead.

Type of change

(Please delete options that are not relevant. Remember to title the PR according to the type of change)

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (change restructuring the codebase without changing functionality)
  • Improvement (change adding some improvement to an existing functionality)
  • Documentation update

How Has This Been Tested

(Please describe the tests that you ran to verify your changes. And ideally, reference tests)

  • Test A
  • Test B

Checklist

  • I added relevant documentation
  • My code follows the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I filled out the contributor form (see text above)
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

@jfcalvo jfcalvo changed the title feat: first iteration of python-rq integration using datasets reindex as proof of concept feat: python-rq integration using datasets reindex as proof of concept Dec 18, 2023
@jfcalvo jfcalvo marked this pull request as ready for review December 19, 2023 11:29
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. area: api Indicates that an issue or pull request is related to the Fast API server or REST endpoints language: python Pull requests or issues that update Python code team: backend Indicates that the issue or pull request is owned by the backend team type: integration Indicates integrations with third parties labels Dec 19, 2023

The URL of the deployed environment for this PR is https://argilla-quickstart-pr-4427-ki24f765kq-no.a.run.app


codecov bot commented Dec 19, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6630d7b) 90.13% compared to head (de3721e) 91.21%.
Report is 578 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4427      +/-   ##
===========================================
+ Coverage    90.13%   91.21%   +1.07%     
===========================================
  Files          233      351     +118     
  Lines        12493    19912    +7419     
===========================================
+ Hits         11261    18163    +6902     
- Misses        1232     1749     +517     
Flag Coverage Δ
pytest ?

Flags with carried forward coverage won't be shown.


Comment on lines 32 to 33
async with AsyncSessionLocal() as db:
async for search_engine in get_search_engine():
Member

Suggested change
- async with AsyncSessionLocal() as db:
-     async for search_engine in get_search_engine():
+ async with AsyncSessionLocal() as db:
+     async with SearchEngine.get_by_name(settings.search_engine) as engine:

It's better to use the SearchEngine context manager directly than the get_search_engine function, which is intended for use with fastapi.Depends.

Member


we could also combine the two lines above into one:

async with AsyncSessionLocal() as db, SearchEngine.get_by_name(settings.search_engine) as engine:
    ...
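The combined form suggested above can be sketched with stand-in context managers. Here fake_session and fake_engine are hypothetical placeholders for AsyncSessionLocal and SearchEngine.get_by_name, just to show that both resources are acquired in one `async with` statement:

```python
# Sketch: combining two async context managers in a single `async with`,
# with stand-ins for AsyncSessionLocal and SearchEngine.get_by_name.
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def fake_session():
    yield "db-session"  # placeholder for an AsyncSession


@asynccontextmanager
async def fake_engine():
    yield "search-engine"  # placeholder for a SearchEngine instance


async def main() -> tuple:
    # Both context managers are entered left to right and exited in reverse.
    async with fake_session() as db, fake_engine() as engine:
        return db, engine


result = asyncio.run(main())
print(result)  # -> ('db-session', 'search-engine')
```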

Member Author


Done!

Comment on lines 76 to 77
redis_host: str = "localhost"
redis_port: int = 6379
Member

Here we could use the pydantic.RedisDsn type.

Member Author

@jfcalvo jfcalvo Jan 10, 2024


I have found RedisDsn to be not very flexible when we want to specify attributes in the Redis URL, so instead I'm using a str redis_url settings attribute where we can set the entire connection URL.
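The plain-string approach described above can be sketched with stdlib tools only. This is an illustrative stand-in, not Argilla's actual settings class, and the default URL is an assumption:

```python
# Stdlib-only sketch: a plain string setting holding the entire Redis URL,
# read from the ARGILLA_REDIS_URL environment variable. The real
# implementation lives in Argilla's settings module.
import os
from dataclasses import dataclass, field


@dataclass
class Settings:
    # A full URL keeps flexibility for extra attributes,
    # e.g. redis://user:pass@host:6379/0?socket_timeout=5
    redis_url: str = field(
        default_factory=lambda: os.environ.get(
            "ARGILLA_REDIS_URL", "redis://localhost:6379/0"
        )
    )


os.environ["ARGILLA_REDIS_URL"] = "redis://cache.example:6380/1"
settings = Settings()
print(settings.redis_url)  # -> redis://cache.example:6380/1
```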

@@ -1 +1,2 @@
datasets
rq == 1.15.1
Member


We can remove this from here.

We have this requirements file just to install the datasets required by the docker/scripts/load_data.py of the Docker quickstart image.

Member Author

@jfcalvo jfcalvo Jan 10, 2024


As far as I understand these are the requirements of the image, so I don't see any problem adding an additional requirement. Remember that we need rq to run the worker that consumes the enqueued jobs (as a separate process).

@jfcalvo jfcalvo closed this Apr 16, 2024
@jfcalvo jfcalvo deleted the feat/add-rq branch April 16, 2024 09:45