
feat: python-rq integration using datasets reindex as proof of concept #4427

Closed
wants to merge 13 commits into from

Conversation

jfcalvo
Member

@jfcalvo jfcalvo commented Dec 18, 2023

Description

This PR includes changes as a proof of concept to show how to integrate the rq background processor with Argilla.

The changes also include two new endpoints:

  • PUT /api/v1/datasets/:dataset_id/reindex
    • This endpoint returns an HTTP 202 (Accepted) status.
    • A background job is enqueued to reindex the dataset.
    • The response body includes the id of the job and its status (queued, if everything went fine).
    • Users can use the job id to check the status of the job.
  • GET /api/v1/jobs/:job_id
    • This endpoint is used to obtain information about one specific job (returning the id and status).
    • Jobs are currently not stored in the database; I'm using the rq API to get information about their status.
    • rq keeps job information in Redis for 500 seconds, so after a job finishes or fails the user has 500 seconds to retrieve information about it.

Possible improvements:

  • Define a proper Redis connection using a pool of connections and settings from environment variables. Redis uses a connection pool by default, and I have added a new environment variable (ARGILLA_REDIS_URL) to configure the connection.
  • Define a better way to store our jobs, maybe using a new jobs table in the Argilla database that also lets us save job results there. We will start by using rq results stored in Redis and, for more complex flows in the future, consider adding some data to our database if necessary.
  • Define an rq queue only for search engine purposes. We will use the default queue for now.
  • Once the Reindexer class code is merged from the PR adding the reindex CLI task we can remove it from this PR. Update: that PR is already merged, and the jobs now import and use it.
  • Add a result field to the Job schema so we can include the result of the job in the response. (Useful to know if there were errors or to expose additional information about the process.)
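The last improvement above, a Job schema extended with a result field, could look something like the following. This is a hypothetical stdlib-only sketch (the real API schema would live in Argilla's schema modules, and the field names are illustrative):

```python
# Hypothetical sketch of a Job response schema with an optional `result`
# field, using a stdlib dataclass as a stand-in for the actual API schema.
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class JobSchema:
    id: str
    status: str
    # Optional payload: errors or additional information about the process.
    result: Optional[Any] = None


# Illustrative values only; a real job id is generated by rq.
job = JobSchema(id="abc123", status="finished", result={"records_reindexed": 1000})
print(asdict(job))
```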

Things to investigate/discuss:

  • How to run Redis in our Docker images, especially the Quickstart images on HF.
  • Alternatives to a real Redis server, e.g. using the fakeredis Python library instead.

Type of change

(Please delete options that are not relevant. Remember to title the PR according to the type of change)

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (change restructuring the codebase without changing functionality)
  • Improvement (change adding some improvement to an existing functionality)
  • Documentation update

How Has This Been Tested

(Please describe the tests that you ran to verify your changes. And ideally, reference tests)

  • Test A
  • Test B

Checklist

  • I added relevant documentation
  • My code follows the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I filled out the contributor form (see text above)
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

@jfcalvo jfcalvo changed the title feat: first iteration of python-rq integration using datasets reindex as proof of concept feat: python-rq integration using datasets reindex as proof of concept Dec 18, 2023
@jfcalvo jfcalvo marked this pull request as ready for review December 19, 2023 11:29
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. area: api Indicates that an issue or pull request is related to the Fast API server or REST endpoints language: python Pull requests or issues that update Python code team: backend Indicates that the issue or pull request is owned by the backend team type: integration Indicates integrations with third parties labels Dec 19, 2023

The URL of the deployed environment for this PR is https://argilla-quickstart-pr-4427-ki24f765kq-no.a.run.app


codecov bot commented Dec 19, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6630d7b) 90.13% compared to head (de3721e) 91.21%.
Report is 578 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4427      +/-   ##
===========================================
+ Coverage    90.13%   91.21%   +1.07%     
===========================================
  Files          233      351     +118     
  Lines        12493    19912    +7419     
===========================================
+ Hits         11261    18163    +6902     
- Misses        1232     1749     +517     
Flag Coverage Δ
pytest ?

Flags with carried forward coverage won't be shown.


Comment on lines 32 to 33
async with AsyncSessionLocal() as db:
async for search_engine in get_search_engine():
Member

Suggested change
- async with AsyncSessionLocal() as db:
-     async for search_engine in get_search_engine():
+ async with AsyncSessionLocal() as db:
+     async with SearchEngine.get_by_name(settings.search_engine) as engine:

It's better to use the SearchEngine context manager directly than the get_search_engine function, which is intended for use with fastapi.Depends.

Member


we could also combine the two lines above into one:

async with AsyncSessionLocal() as db, SearchEngine.get_by_name(settings.search_engine) as engine:
    ...
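The combined form suggested above can be sketched with stand-in context managers. Here fake_session and fake_engine are hypothetical placeholders for AsyncSessionLocal and SearchEngine.get_by_name, just to show that both resources are acquired in one `async with` statement:

```python
# Sketch: combining two async context managers in a single `async with`,
# with stand-ins for AsyncSessionLocal and SearchEngine.get_by_name.
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def fake_session():
    yield "db-session"  # placeholder for an AsyncSession


@asynccontextmanager
async def fake_engine():
    yield "search-engine"  # placeholder for a SearchEngine instance


async def main() -> tuple:
    # Both context managers are entered left to right and exited in reverse.
    async with fake_session() as db, fake_engine() as engine:
        return db, engine


result = asyncio.run(main())
print(result)  # -> ('db-session', 'search-engine')
```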

Member Author


Done!

Comment on lines 76 to 77
redis_host: str = "localhost"
redis_port: int = 6379
Member

Here we could use the pydantic.RedisDsn type.

Member Author

@jfcalvo jfcalvo Jan 10, 2024


I have found RedisDsn to be not very flexible when we want to specify attributes in the Redis URL, so instead I'm using a str redis_url settings attribute where we can set the entire connection URL.
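The plain-string approach described above can be sketched with stdlib tools only. This is an illustrative stand-in, not Argilla's actual settings class, and the default URL is an assumption:

```python
# Stdlib-only sketch: a plain string setting holding the entire Redis URL,
# read from the ARGILLA_REDIS_URL environment variable. The real
# implementation lives in Argilla's settings module.
import os
from dataclasses import dataclass, field


@dataclass
class Settings:
    # A full URL keeps flexibility for extra attributes,
    # e.g. redis://user:pass@host:6379/0?socket_timeout=5
    redis_url: str = field(
        default_factory=lambda: os.environ.get(
            "ARGILLA_REDIS_URL", "redis://localhost:6379/0"
        )
    )


os.environ["ARGILLA_REDIS_URL"] = "redis://cache.example:6380/1"
settings = Settings()
print(settings.redis_url)  # -> redis://cache.example:6380/1
```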

@@ -1 +1,2 @@
datasets
rq == 1.15.1
Member


We can remove this from here.

We have this requirements file just to install the datasets required by the docker/scripts/load_data.py of the Docker quickstart image.

Member Author

@jfcalvo jfcalvo Jan 10, 2024


As far as I understand these are the requirements of the image, so I don't see any problem adding an additional requirement. Remember that we need rq to run the worker that consumes the enqueued jobs (as a separate process).

@jfcalvo jfcalvo closed this Apr 16, 2024
@jfcalvo jfcalvo deleted the feat/add-rq branch April 16, 2024 09:45