Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Validating multi-label text classification records makes new copies of them instead of updating #3265

Closed
marcelbusch opened this issue Jun 26, 2023 · 12 comments · Fixed by #3328
Assignees
Labels
type: bug Indicates an unexpected problem or unintended behavior
Milestone

Comments

@marcelbusch
Copy link

marcelbusch commented Jun 26, 2023

Describe the bug
In multi-label text classification, whenever I annotate records, they get copied to a validated version, but the original record stays the same. So I have a dataset with 1000 records, and after annotating 20 of them, I now have 1020 records in my dataset, the 1000 records with status "default" and the 20 annotated records. Is this expected behaviour? Because like this I can't filter for records I haven't annotated yet, when I filter for status:default I still get all 1000 records.

I run argilla and elasticsearch with the supplied docker-compose file without any modifications.

To Reproduce
Steps to reproduce the behavior:

DATASET_NAME = '...'
labels = [...]
settings = rg.TextClassificationSettings(label_schema=labels)
rg.configure_dataset_settings(name=DATASET_NAME, settings=settings)

# sample is a pandas dataframe containing 1000 records
records = sample.apply(lambda x: rg.TextClassificationRecord(
    inputs={"user": x.username, "body": x.rawContent},
    metadata={"hashtags": x.hashtags},
    id=x.id,
    multi_label=True
), axis=1).values.tolist()

rg.log(records, DATASET_NAME)

Expected behavior
I expect the original records to change status, so after annotating 20 records I should still have 1000 records in my dataset, now 980 with status "default" and 20 with status "validated"

Environment (please complete the following information):

  • OS: macOS 11.7.5
  • Browser: Firefox
  • Argilla Version: 1.10.0
  • Docker Image: docker-compose 3.9 (argilla-server:latest, elasticsearch:8.5.3)

Edit: Update on some more behaviour:

  • when I query for a word that results in more than 1 result, I get the unlabeled and the validated records
  • when I query for a word/combination of words that exist(s) only in 1 record, I also get both records (via .load and in the ui), but in the ui I can only see the validated record, but there is blank space where the unlabeled record would be (there is no element for it in the dom though, the validated record is offset via translateY)
@davidberenstein1957
Copy link
Member

@marcelbusch do you know for sure that you are the only one working on this. What version of the server and Python package are you using?

I think it might be good to evaluate the query syntax w.r.t. the query bahaviour. Are you using it correctly?

@marcelbusch
Copy link
Author

Yes, I am sure I'm the only one working on this. Server and python package version are both 1.10.0

I'm attaching a screenshot to clearify which behaviour I mean: I labeled and validated a record (so now there is 2 versions of it in my dataset) now I query for a combination of words that only exists in this specific record, naturally I get both versions, in python via rg.load, and in the UI it also shows that there are 2 records at the bottom, but I can only see one of them
screenshot-argilla

@davidberenstein1957
Copy link
Member

@marcelbusch described in our query syntax overview you would need to query a bit differently in that case "nurnberg AND demonstration AND wegstrecke" should be what you intend to achieve, correct?

w.r.t. the duplicating records, is that still happening? Could you show me a video?

@marcelbusch
Copy link
Author

No, when I connect them with AND it's still the same behaviour...

Yes that's still happening, here's a video:

argilla-duplication.mp4

@davidberenstein1957
Copy link
Member

@frascuchon

@davidberenstein1957
Copy link
Member

Thanks @marcelbusch, this really helps us.

@frascuchon frascuchon added the type: bug Indicates an unexpected problem or unintended behavior label Jun 28, 2023
@frascuchon frascuchon added this to the v1.12.0 milestone Jun 28, 2023
@davidberenstein1957
Copy link
Member

Hi @marcelbusch, we cannot reproduce this behaviour. Could your provide us with more context w.r.t. your data? Also, could you run pip install argilla --update, delete your current docker images, and set their tag specific to 1.10.0 during re-deployment?

@frascuchon
Copy link
Member

Thanks for your feedback @marcelbusch! Is also happening when you validate a single record (without bulk validation action)

Can you share also the extra record info (you can access it by clicking the ... button) for both records?
Screenshot 2023-06-28 at 12 52 01

@marcelbusch
Copy link
Author

Could your provide us with more context w.r.t. your data?

I could send you a csv of my pandas dataframe, or what do you mean?

Also, could you run pip install argilla --update, delete your current docker images, and set their tag specific to 1.10.0 during re-deployment?

So I should run server version 1.10.0 with argilla python version 1.11.0? And the docker tag would be v1.10.0 or releases-1.10.0?

Is also happening when you validate a single record (without bulk validation action)

Yes, same thing when using the validate button at the bottom and when setting records per page to 1

Can you share also the extra record info (you can access it by clicking the ... button) for both records?>

They both have the same id, could this be because i set the id manually during record creation?
argilla-info-default
argilla-info-validated

@alvarobartt alvarobartt modified the milestones: v1.12.0, v1.13 Jun 29, 2023
@marcelbusch
Copy link
Author

I took a look in my debugger at this point: updateDatasetRecords and in the record that get's passed in there when I click on a label the id is incorrect, in fact it seems to be rounded to the nearest 100: the record has an initial id of 1478720431326732291, the record that gets passed in the updateDatasetRecords function has the id 1478720431326732300

@frascuchon
Copy link
Member

Thanks for all this info @marcelbusch. It's really helpful. We'll work on fixing that.

As a temporal workaround, I think you can set the record id as a string instead of a number. This should avoid this problem.

@frascuchon
Copy link
Member

I've been doing some test and it looks like is a javascript limit

Using curl:

curl -X 'POST' \
  "http://localhost:6900/api/datasets/test-dataset/TextClassification:search?include_metrics=false&workspace=argilla&limit=50&from=0" \
  -H 'accept: application/json' \
  -H 'X-Argilla-Api-Key: argilla.apikey' \
  -H 'Content-Type: application/json' \
  -d '{}'
{"total":1,"records":[{"id":10805720881385292014,"status":"Default","metrics":{},"last_updated":"2023-06-30T13:23:45.462574","inputs":{"additionalProp1":"string","additionalProp3":"string","additionalProp2":"string"},"multi_label":true}],"aggregations":{"predicted_as":{},"annotated_as":{},"annotated_by":{},"predicted_by":{},"status":{"Default":1},"predicted":{},"score":{},"words":{"string":1},"metadata":{}}}

From UI:
Screenshot 2023-06-30 at 15 53 52

@leiyre @damianpumar @keithCuniah Any ideas?

@frascuchon frascuchon linked a pull request Jul 4, 2023 that will close this issue
8 tasks
frascuchon added a commit that referenced this issue Jul 6, 2023
<!-- Thanks for your contribution! As part of our Community Growers
initiative 🌱, we're donating Justdiggit bunds in your name to reforest
sub-Saharan Africa. To claim your Community Growers certificate, please
contact David Berenstein in our Slack community or fill in this form
https://tally.so/r/n9XrxK once your PR has been merged. -->

# Description

This PR deprecates the integer support for record id. Also, warn if big
integers are used as record id


Refs #3265

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

**How Has This Been Tested**

After the changes the, issue cannot be reproduced anymore. Tested in dev
environments

**Checklist**

- [ ] I added relevant documentation
- [X] follows the style guidelines of this project
- [X] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [X] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [X] I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: David Berenstein <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants