feat: python-rq integration using datasets reindex as proof of concept #4427
Conversation
… as proof of concept
The URL of the deployed environment for this PR is https://argilla-quickstart-pr-4427-ki24f765kq-no.a.run.app
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@            Coverage Diff             @@
##           develop    #4427      +/-   ##
===========================================
+ Coverage    90.13%   91.21%    +1.07%
===========================================
  Files          233      351      +118
  Lines        12493    19912     +7419
===========================================
+ Hits         11261    18163     +6902
- Misses        1232     1749      +517
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry.
…ariables to config Redis connection for background jobs
async with AsyncSessionLocal() as db:
    async for search_engine in get_search_engine():
Suggested change:
-    async with AsyncSessionLocal() as db:
-        async for search_engine in get_search_engine():
+    async with AsyncSessionLocal() as db:
+        async with SearchEngine.get_by_name(settings.search_engine) as engine:
It's better to use the SearchEngine context manager directly rather than the get_search_engine function, which is intended for use with fastapi.Depends.
We could also combine the two lines above into one:
async with AsyncSessionLocal() as db, SearchEngine.get_by_name(settings.search_engine) as engine:
    ...
Done!
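For context, here is a minimal sketch of what the resulting background job setup could look like with both context managers combined as suggested above. The import paths and the job body are assumptions for illustration, not the exact code merged in this PR.

```python
# Hypothetical sketch: a background job that opens its own DB session and
# search engine instead of relying on FastAPI dependency injection.
# Import paths below are assumptions based on the files touched in this PR.
from argilla.server.database import AsyncSessionLocal
from argilla.server.search_engine import SearchEngine
from argilla.server.settings import settings


async def reindex_dataset(dataset_id):
    # Both context managers combined into a single `async with`, as suggested above
    async with AsyncSessionLocal() as db, SearchEngine.get_by_name(settings.search_engine) as engine:
        ...  # load the dataset with `db` and reindex its records with `engine`
```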
src/argilla/server/settings.py
Outdated
redis_host: str = "localhost"
redis_port: int = 6379
Here we could use the pydantic.RedisDsn type.
I have found RedisDsn to be not very flexible when we want to specify extra attributes on the Redis URL, so instead I'm using a str redis_url settings attribute where the entire connection URL can be set.
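As a rough sketch of that approach, the setting could be a plain string parsed by redis-py. This assumes pydantic v1 BaseSettings and redis-py; apart from the ARGILLA_REDIS_URL variable mentioned in the PR description, the names and defaults below are illustrative.

```python
# Sketch of a plain-string Redis URL setting (assumes pydantic v1 BaseSettings
# and redis-py; names mirror the discussion but are not the exact PR code).
from pydantic import BaseSettings
from redis import Redis


class Settings(BaseSettings):
    # Full URL, including any extra attributes, e.g.
    # "redis://:password@localhost:6379/0?health_check_interval=30"
    redis_url: str = "redis://localhost:6379/0"

    class Config:
        env_prefix = "ARGILLA_"  # so ARGILLA_REDIS_URL overrides the default


settings = Settings()
# redis-py parses the URL and manages a connection pool internally
redis_connection = Redis.from_url(settings.redis_url)
```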
@@ -1 +1,2 @@
 datasets
+rq == 1.15.1
We can remove this from here. We have this requirements file just to install the datasets library required by the docker/scripts/load_data.py script of the Docker quickstart image.
As far as I understand these are the requirements of the image, so I don't see any problem adding an additional requirement. Remember that we need rq to run the worker that will consume the enqueued jobs (as a separate process).
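For reference, such a worker can be started as a separate process with the rq worker CLI or programmatically. Below is a minimal programmatic sketch; the queue name and Redis URL are assumptions (the PR configures the URL via ARGILLA_REDIS_URL).

```python
# Hypothetical sketch of an rq worker process consuming the "default" queue.
from redis import Redis
from rq import Queue, Worker

connection = Redis.from_url("redis://localhost:6379/0")
worker = Worker([Queue("default", connection=connection)], connection=connection)
worker.work()  # blocks, processing enqueued jobs until stopped
```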
Description
This PR includes changes as a proof of concept to check how to integrate the rq background processor with Argilla. The changes also include two new endpoints:
- PUT /api/v1/datasets/:dataset_id/reindex: enqueues a reindex job for the dataset and returns a 202 (Accepted) status with the id of the job and its status (queued in this case if everything was fine). The id of the job can then be used to get information about the status of the job.
- GET /api/v1/jobs/:job_id: returns information about the job (id and status), using the rq API to get its status. rq is saving job information for 500 seconds on Redis, so after a job is finished or failed the user has 500 seconds to get information about it. (A rough sketch of both endpoints follows this list.)
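A hedged sketch of what these two endpoints could look like with FastAPI and rq. Handler names, the dotted path to the job function, and the 404 handling are assumptions for illustration, not the code in this PR.

```python
# Illustrative sketch only: endpoint shapes follow the description above,
# but names, imports and error handling are assumptions.
from uuid import UUID

from fastapi import APIRouter, HTTPException
from redis import Redis
from rq import Queue
from rq.exceptions import NoSuchJobError
from rq.job import Job

redis_connection = Redis.from_url("redis://localhost:6379/0")
queue = Queue("default", connection=redis_connection)
router = APIRouter()


@router.put("/api/v1/datasets/{dataset_id}/reindex", status_code=202)
async def reindex_dataset(dataset_id: UUID):
    # Enqueue the reindex task; the dotted path to the task is hypothetical
    job = queue.enqueue("argilla.server.jobs.reindex_dataset", dataset_id)
    return {"id": job.id, "status": job.get_status()}


@router.get("/api/v1/jobs/{job_id}")
async def get_job(job_id: str):
    try:
        # rq keeps finished/failed job data for 500 seconds by default
        job = Job.fetch(job_id, connection=redis_connection)
    except NoSuchJobError:
        raise HTTPException(status_code=404, detail="Job not found")
    return {"id": job.id, "status": job.get_status()}
```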
Possible improvements:
- ~~Define a proper Redis connection using a pool of connections and getting settings from environment variables.~~ Redis is using a pool of connections by default and I have added a new environment variable to set the connection (ARGILLA_REDIS_URL).
- ~~Define a better way to store our jobs, maybe using a new jobs table on the Argilla database and allowing to save results of the jobs there.~~ We will start with this approach of using rq results stored in Redis, and in the future, for more complex flows, we will think about adding some data to our database if necessary.
- ~~Define a rq queue only for search engine purposes.~~ We will use the default queue for now.
- ~~Once the Reindexer class code is merged from the PR adding the reindex CLI task, we can remove it from the code in this PR.~~ We already merged the PR adding the reindex CLI task and now the jobs are importing it and using it.
- Add a result field to the Job schema so we can include the result of the job inside it. (Useful to know if there are errors or additional information about the process; see the sketch after this list.)
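As a hedged illustration of that last item, the Job schema with an optional result field could look like the following; field names beyond id and status are assumptions.

```python
# Hypothetical Job schema sketch with the proposed `result` field.
from typing import Any, Optional

from pydantic import BaseModel


class Job(BaseModel):
    id: str
    status: str
    # Errors or any additional information about the finished process
    result: Optional[Any] = None
```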
Things to investigate/discuss:
- … the fakeredis python library instead.
Type of change
(Please delete options that are not relevant. Remember to title the PR according to the type of change)
How Has This Been Tested
(Please describe the tests that you ran to verify your changes. And ideally, reference tests)
Checklist