feat: create default text_descriptives as metadata via `utils.modeling` (argilla-io#4400)

# Description

Local PR from argilla-io#4083 
New addition of documentation

Closes argilla-io#4017 

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor (change restructuring the codebase without changing
functionality)
- [ ] Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

- [ ] Test A
- [ ] Test B

**Checklist**

- [ ] I added relevant documentation
- [ ] I followed the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)

---------

Co-authored-by: m-newhauser <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: David Berenstein <[email protected]>
4 people committed Dec 19, 2023
1 parent d453cab commit 1c80def
Showing 8 changed files with 671 additions and 3 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -18,7 +18,8 @@ These are the section headers that we use:

### Added

- Added strategy to handle and translate errors from server for `401 http status code`. ([#4362](https://github.com/argilla-io/argilla/pull/4362))
- Added strategy to handle and translate errors from the server for `401` HTTP status code. ([#4362](https://github.com/argilla-io/argilla/pull/4362))
- Added integration for `textdescriptives` using `TextDescriptivesExtractor` to configure `metadata_properties` in `FeedbackDataset` and `FeedbackRecord`. ([#4400](https://github.com/argilla-io/argilla/pull/4400)). Contributed by @m-newhauser
- Added `POST /api/v1/me/responses/bulk` endpoint to create responses in bulk for current user. ([#4380](https://github.com/argilla-io/argilla/pull/4380))
- Added new CLI task to reindex datasets and records into the search engine. ([#4404](https://github.com/argilla-io/argilla/pull/4404))

@@ -33,4 +33,4 @@ active_learning
weak_supervision
semantic_search
job_scheduling
```
```
@@ -122,7 +122,7 @@ The following arguments apply to specific metadata types:
```

```{note}
You can also define metadata properties after the dataset has been configured or add them to an existing dataset in Argilla. To do that use the `add_metadata_property` method as explained [here](/practical_guides/create_update_dataset/metadata.md).
You can also define metadata properties after the dataset has been configured or add them to an existing dataset in Argilla using the `add_metadata_property` method. In addition, you can now add text descriptives of your fields as metadata automatically with the `TextDescriptivesExtractor`. For more info, take a look [here](/practical_guides/create_update_dataset/metadata.md).
```

##### Define `vectors`
52 changes: 52 additions & 0 deletions docs/_source/practical_guides/create_update_dataset/metadata.md
@@ -114,6 +114,58 @@ dataset.update_records(modified_records)
You can also follow the same strategy to modify existing metadata.
```

### Add Text Descriptives

You can easily add text descriptives to your records or datasets using the `TextDescriptivesExtractor` based on the [TextDescriptives](https://github.com/HLasse/TextDescriptives) library, which will add the corresponding metadata properties and metadata automatically. The `TextDescriptivesExtractor` can be used on a `FeedbackDataset` or a `RemoteFeedbackDataset` and accepts the following arguments:

- `model` (optional): The language of the spaCy model to use. Defaults to `en`. See the [spaCy models page](https://spacy.io/usage/models) for the available languages and models.
- `metrics` (optional): A list of metrics to extract. The default extracted metrics are: `n_tokens`, `n_unique_tokens`, `n_sentences`, `perplexity`, `entropy`, and `flesch_reading_ease`. You can also select metrics by group: `descriptive_stats`, `readability`, `dependency_distance`, `pos_proportions`, `coherence`, `quality`, and `information_theory`. For more information about each group, see the [TextDescriptives documentation](https://hlasse.github.io/TextDescriptives/descriptivestats.html).
- `fields` (optional): A list of field names to extract metrics from. All fields will be used by default.
- `visible_for_annotators` (optional): Whether the extracted metrics should be visible to annotators. Defaults to `True`.
- `show_progress` (optional): Whether to show a progress bar when extracting metrics. Defaults to `True`.
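
To build intuition for three of the default metrics listed above (`n_tokens`, `n_unique_tokens`, and `n_sentences`), here is a rough, standard-library-only sketch. The function name `naive_descriptive_stats` and its regex-based splitting are assumptions made for illustration. TextDescriptives relies on spaCy for tokenization and sentence segmentation, so its values will generally differ from these.

```python
import re


def naive_descriptive_stats(text: str) -> dict:
    """Rough stand-ins for three default metrics (hypothetical helper, not spaCy-based)."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Treat runs of word characters as tokens; punctuation is ignored here.
    tokens = re.findall(r"\w+", text.lower())
    return {
        "n_tokens": len(tokens),
        "n_unique_tokens": len(set(tokens)),
        "n_sentences": len(sentences),
    }


stats = naive_descriptive_stats("Argilla is open source. It is built for data teams!")
print(stats)
```

Values of this kind are what the extractor attaches to each record as metadata, which annotators can see when `visible_for_annotators` is `True`.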

For a practical example, check our [tutorial on adding text descriptives as metadata](/tutorials_and_integrations/integrations/add_text_descriptives_as_metadata.ipynb).

::::{tab-set}

:::{tab-item} Records
```python
from argilla.client.feedback.integrations.textdescriptives import TextDescriptivesExtractor

records = [...] # FeedbackRecords or RemoteFeedbackRecords

tde = TextDescriptivesExtractor(
    model="en",
    metrics=None,
    fields=None,
    visible_for_annotators=True,
    show_progress=True,
)

tde.update_records(records)
```
:::

:::{tab-item} Dataset
```python
from argilla.client.feedback.integrations.textdescriptives import TextDescriptivesExtractor

dataset = ...  # FeedbackDataset or RemoteFeedbackDataset

tde = TextDescriptivesExtractor(
    model="en",
    metrics=None,
    fields=None,
    visible_for_annotators=True,
    show_progress=True,
)

tde.update_dataset(dataset)
```
:::

::::
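
The snippets above keep `metrics=None` and `fields=None`. As a minimal sketch of narrowing the extraction, the hypothetical helper `extractor_kwargs` below checks the requested names against the metric groups listed in the arguments section and builds the constructor arguments. The helper, the `METRIC_GROUPS` constant, and the field name `"text"` are assumptions for illustration and are not part of Argilla.

```python
from typing import Any, Dict, List, Optional

# Group names copied from the `metrics` argument description above.
METRIC_GROUPS = {
    "descriptive_stats",
    "readability",
    "dependency_distance",
    "pos_proportions",
    "coherence",
    "quality",
    "information_theory",
}


def extractor_kwargs(
    metrics: Optional[List[str]] = None,
    fields: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate group names and build constructor arguments."""
    unknown = sorted(set(metrics or []) - METRIC_GROUPS)
    if unknown:
        raise ValueError(f"Unknown metric groups: {unknown}")
    return {
        "model": "en",
        "metrics": metrics,
        "fields": fields,
        "visible_for_annotators": True,
        "show_progress": True,
    }


# "text" is an assumed field name; use the field names of your own dataset.
kwargs = extractor_kwargs(metrics=["descriptive_stats", "readability"], fields=["text"])

# With Argilla and the `integrations` extra installed, pass the arguments on:
# from argilla.client.feedback.integrations.textdescriptives import TextDescriptivesExtractor
# tde = TextDescriptivesExtractor(**kwargs)
# tde.update_dataset(dataset)
```

Validating the names up front surfaces a mistyped group before the spaCy model is loaded and the records are processed.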


## Other datasets

1 change: 1 addition & 0 deletions environment_dev.yml
@@ -58,6 +58,7 @@ dependencies:
- trl>=0.5.0
- sentence-transformers
- rich!=13.1.0
- textdescriptives>=2.7.0,<3.0.0
- ipynbname>=2023.2.0.0
# install Argilla in editable mode
- -e .[server,listeners]
1 change: 1 addition & 0 deletions pyproject.toml
@@ -107,6 +107,7 @@ integrations = [
"sentence-transformers",
"setfit>=0.7.0",
"span_marker",
"textdescriptives>=2.7.0,<3.0.0",
"openai>=0.27.10,<1.0.0",
"peft",
"trl>=0.5.0",