feature: Add list like aggregation support for metadata (argilla-io#4414)

<!-- Thanks for your contribution! As part of our Community Growers
initiative 🌱, we're donating Justdiggit bunds in your name to reforest
sub-Saharan Africa. To claim your Community Growers certificate, please
contact David Berenstein in our Slack community or fill in this form
https://tally.so/r/n9XrxK once your PR has been merged. -->

# Description

This PR adds list support for term metadata values.
 
Closes argilla-io#4359

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor (change restructuring the codebase without changing
functionality)
- [X] Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

Tested locally with this code snippet:
```python
import argilla as rg

# Assumes an Argilla server is reachable and the client is already
# initialised (for example via `rg.init(...)`).
dataset = rg.FeedbackDataset(
    fields=[rg.TextField(name="text"), rg.TextField(name="optional", required=False)],
    questions=[rg.TextQuestion(name="question")],
    metadata_properties=[
        rg.TermsMetadataProperty(name="terms-metadata", values=["a", "b", "c"]),
        rg.IntegerMetadataProperty(name="integer-metadata"),
        rg.FloatMetadataProperty(name="float-metadata", min=0.0, max=10.0),
    ],
)

ds = dataset.push_to_argilla("ds", workspace="argilla")

records = [
    # A single term, as supported before this change.
    rg.FeedbackRecord(fields={"text": "Hello world!"}, metadata={"terms-metadata": "a"}),
    # A list of terms, which this PR adds support for.
    rg.FeedbackRecord(fields={"text": "Hello world!"}, metadata={"terms-metadata": ["b", "a"]}),
]

ds.add_records(records)
```


**Checklist**

- [ ] I added relevant documentation
- [X] I followed the style guidelines of this project
- [X] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [X] My changes generate no new warnings
- [X] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [X] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)

---------

Co-authored-by: David Berenstein <[email protected]>
Co-authored-by: kursathalat <[email protected]>
3 people authored and damianpumar committed Dec 20, 2023
1 parent 3ef731c commit 2691b2c
Showing 8 changed files with 130 additions and 16 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -21,6 +21,7 @@ These are the section headers that we use:
- Added strategy to handle and translate errors from the server for `401` HTTP status code` ([#4362](https://github.com/argilla-io/argilla/pull/4362))
- Added integration for `textdescriptives` using `TextDescriptivesExtractor` to configure `metadata_properties` in `FeedbackDataset` and `FeedbackRecord`. ([#4400](https://github.com/argilla-io/argilla/pull/4400)). Contributed by @m-newhauser
- Added `POST /api/v1/me/responses/bulk` endpoint to create responses in bulk for current user. ([#4380](https://github.com/argilla-io/argilla/pull/4380))
- Added list support for term metadata properties. (Closes [#4359](https://github.com/argilla-io/argilla/issues/4359))
- Added new CLI task to reindex datasets and records into the search engine. ([#4404](https://github.com/argilla-io/argilla/pull/4404))

### Changed
23 changes: 20 additions & 3 deletions docs/_source/practical_guides/create_update_dataset/metadata.md
@@ -72,18 +72,35 @@ dataset.delete_metadata_properties(metadata_properties="groups")

### Format `metadata`

Record metadata can include any information about the record that is not part of the fields in the form of a dictionary. If you want the metadata to correspond with the metadata properties configured for your dataset so that these can be used for filtering and sorting records, make sure that the key of the dictionary corresponds with the metadata property `name`. When the key doesn't correspond, this will be considered extra metadata that will get stored with the record (as long as `allow_extra_metadata` is set to `True` for the dataset), but will not be usable for filtering and sorting.
Record metadata can include any information about the record that is not part of the fields in the form of a dictionary. If you want the metadata to correspond with the metadata properties configured for your dataset so that these can be used for filtering and sorting records, make sure that the key of the dictionary corresponds with the metadata property `name`. When the key doesn't correspond, this will be considered extra metadata that will get stored with the record (as long as `allow_extra_metadata` is set to `True` for the dataset), but will not be usable for filtering and sorting. For any metadata property, you can define a single metadata value in the form of a string or integer, or multiple metadata values in the form of a list of strings or integers.

::::{tab-set}

:::{tab-item} Single Metadata

```python
record = rg.FeedbackRecord(
fields={...},
metadata={"source": "encyclopedia", "text_length":150}
)
```
:::

:::{tab-item} Multiple Metadata
```python
record = rg.FeedbackRecord(
fields={...},
metadata={"source": ["encyclopedia", "wikipedia"], "text_length":150}
)
```

:::

::::
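The key-matching rule described above can be sketched in plain Python. This is only an illustration, not Argilla's implementation: `split_metadata` and its parameters are hypothetical names, and raising an error for unmatched keys when `allow_extra_metadata` is `False` is an assumption about what rejection might look like.

```python
from typing import Any, Dict, Iterable, Tuple


def split_metadata(
    metadata: Dict[str, Any],
    property_names: Iterable[str],
    allow_extra_metadata: bool = True,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: separate keys that match a configured metadata
    property from keys that would only be stored as extra metadata."""
    names = set(property_names)
    matched = {k: v for k, v in metadata.items() if k in names}
    extra = {k: v for k, v in metadata.items() if k not in names}
    if extra and not allow_extra_metadata:
        # Assumption: unmatched keys are rejected when extras are disabled.
        raise ValueError(f"Unexpected metadata keys: {sorted(extra)}")
    return matched, extra


matched, extra = split_metadata(
    {"source": ["encyclopedia", "wikipedia"], "text_length": 150, "note": "draft"},
    property_names=["source", "text_length"],
)
print(matched)  # keys that can be used for filtering and sorting
print(extra)    # keys stored with the record but not filterable
```

Only the dictionary key decides which group a value lands in; whether the value is a single term or a list of terms makes no difference to the matching.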

#### Add `metadata`

Once the `metadata_properties` were defined, to add metadata to the records, it slightly depends on whether you are using a `FeedbackDataset` or a `RemoteFeedbackDataset`. For an end-to-end example, check our [tutorial on adding metadata](/tutorials_and_integrations/tutorials/feedback/end2end_examples/add-metadata-003.ipynb).
Once the `metadata_properties` were defined, to add metadata to the records, it slightly depends on whether you are using a `FeedbackDataset` or a `RemoteFeedbackDataset`. For an end-to-end example, check our [tutorial on adding metadata](/tutorials_and_integrations/tutorials/feedback/end2end_examples/add-metadata-003.ipynb). Remember that you can either define a single metadata value for a metadata property or aggregate metadata values for the `TermsMetadataProperty` in the form of a list for the cases where one record falls into multiple metadata categories.

```{note}
The dataset not yet pushed to Argilla or pulled from HuggingFace Hub is an instance of `FeedbackDataset` whereas the dataset pulled from Argilla is an instance of `RemoteFeedbackDataset`. The difference between the two is that the former is a local one and the changes made on it stay locally. On the other hand, the latter is a remote one and the changes made on it are directly reflected on the dataset on the Argilla server, which can make your process faster.
@@ -202,4 +219,4 @@ for record in dataset:
record.metadata["my_metadata"] = "my_value"
modified_records.append(record)
rg.log(name="my_dataset", records=modified_records)
```
```
24 changes: 21 additions & 3 deletions docs/_source/practical_guides/create_update_dataset/records.md
@@ -22,7 +22,7 @@ After configuring a `FeedbackDataset`, as shown in the [previous guide](/practic
record = rg.FeedbackRecord(
fields={
"question": "Why can camels survive long without water?",
"answer": "Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time."
"answer": "Camels use the fat in their humps to keep them filled with energy and hydration for long periods."
},
metadata={"source": "encyclopedia"},
vectors={"my_vector": [...], "my_other_vector": [...]},
@@ -46,14 +46,32 @@ record = rg.FeedbackRecord(
```

#### Format `metadata`
Record metadata can include any information about the record that is not part of the fields in the form of a dictionary. If you want the metadata to correspond with the metadata properties configured for your dataset so that these can be used for filtering and sorting records, make sure that the key of the dictionary corresponds with the metadata property `name`. When the key doesn't correspond, this will be considered extra metadata that will get stored with the record (as long as `allow_extra_metadata` is set to `True` for the dataset), but will not be usable for filtering and sorting.

Record metadata can include any information about the record that is not part of the fields in the form of a dictionary. If you want the metadata to correspond with the metadata properties configured for your dataset so that these can be used for filtering and sorting records, make sure that the key of the dictionary corresponds with the metadata property `name`. When the key doesn't correspond, this will be considered extra metadata that will get stored with the record (as long as `allow_extra_metadata` is set to `True` for the dataset), but will not be usable for filtering and sorting. As well as adding one metadata property to a single record, you can also add aggregate metadata values for the `TermsMetadataProperty` in the form of a list.

::::{tab-set}

:::{tab-item} Single Metadata

```python
record = rg.FeedbackRecord(
fields={...},
metadata={"source": "encyclopedia", "text_length":150}
)
```
:::

:::{tab-item} Multiple Metadata
```python
record = rg.FeedbackRecord(
fields={...},
metadata={"source": ["encyclopedia", "wikipedia"], "text_length":150}
)
```

:::

::::

#### Format `vectors`
You can associate vectors, like text embeddings, to your records. This will enable the [semantic search](filter_dataset.md#semantic-search) in the UI and the Python SDK. These are saved as a dictionary, where the keys correspond to the `name`s of the vector settings that were configured for your dataset and the value is a list of floats. Make sure that the length of the list corresponds to the dimensions set in the vector settings.
@@ -510,4 +528,4 @@ rg.delete_records(name="example-dataset", query="metadata.code=33", discard_only
```
:::

::::
::::
@@ -300,7 +300,7 @@
"\n",
"### TermsMetadataProperty\n",
"\n",
"The `TermsMetadaProperty` is a metadata property that can be used to filter the metadata of a record based on a list of possible terms or values."
"The `TermsMetadataProperty` is a metadata property that can be used to filter the metadata of a record based on a list of possible terms or values."
]
},
{
@@ -439,6 +439,31 @@
"dataset_remote.update_records(modified_records)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Aggregate Metadata Values\n",
"\n",
"In addition, we have the opportunity to add multiple metadata values for the `TermsMetadataProperty` to a single record. This is quite useful when a record falls into multiple categories. For the example case at hand, let us imagine that one of the records (or any number of them) is to be annotated by two groups. We can simply encode this information by giving a list of the metadata values. Let us see how it is done for the local `FeedbackDataset` and it is just the same process for the `RemoteFeedbackDataset` as above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset[1].metadata[\"group\"] = [\"group-1\", \"group-2\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have seen an example of how to add aggregate metadata values for `TermsMetadataProperty` here. Please note that this is also applicable for `IntegerMetadataProperty` and `FloatMetadataProperty`, and you can add them in the same way."
]
},
{
"cell_type": "markdown",
"metadata": {},
4 changes: 3 additions & 1 deletion src/argilla/client/feedback/constants.py
@@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Union

from pydantic import StrictFloat, StrictInt, StrictStr

@@ -23,7 +24,7 @@
FIELD_TYPE_TO_PYTHON_TYPE = {FieldTypes.text: str}
# We are using `pydantic`'s strict types to avoid implicit type conversions
METADATA_PROPERTY_TYPE_TO_PYDANTIC_TYPE = {
MetadataPropertyTypes.terms: StrictStr,
MetadataPropertyTypes.terms: Union[StrictStr, List[StrictStr]],
MetadataPropertyTypes.integer: StrictInt,
MetadataPropertyTypes.float: StrictFloat,
}
@@ -32,4 +33,5 @@
StrictInt: int,
StrictFloat: float,
StrictStr: str,
Union[StrictStr, List[StrictStr]]: (str, list),
}
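The two mappings in this file chain a property type to a pydantic type, and then to plain Python types suitable for `isinstance`. Below is a minimal stdlib-only sketch of that second step. It substitutes `str` for `StrictStr` so that it runs without pydantic, and `TERMS_TYPE`, `TYPE_TO_PYTHON_TYPE` and `has_expected_python_type` are names made up for illustration.

```python
from typing import Any, List, Union

# Stand-in for Union[StrictStr, List[StrictStr]]; plain `str` is used so the
# sketch runs without pydantic (an assumption made for illustration only).
TERMS_TYPE = Union[str, List[str]]

# typing constructs are hashable, so they can serve as dictionary keys.
TYPE_TO_PYTHON_TYPE = {
    int: int,
    float: float,
    str: str,
    TERMS_TYPE: (str, list),
}


def has_expected_python_type(value: Any, declared_type: Any) -> bool:
    """Hypothetical helper: isinstance accepts a single type or a tuple."""
    return isinstance(value, TYPE_TO_PYTHON_TYPE[declared_type])


print(has_expected_python_type("a", TERMS_TYPE))         # True
print(has_expected_python_type(["b", "a"], TERMS_TYPE))  # True
print(has_expected_python_type(3, TERMS_TYPE))           # False
print(has_expected_python_type([1, 2], TERMS_TYPE))      # True: elements are not inspected
```

As the last line shows, a check against `(str, list)` only looks at the container, so the element types still depend on the stricter validation applied elsewhere.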
20 changes: 15 additions & 5 deletions src/argilla/client/feedback/schemas/metadata.py
@@ -145,11 +145,21 @@ def server_settings(self) -> Dict[str, Any]:
settings["values"] = self.values
return settings

def _all_values_exist(self, introduced_value: Optional[str] = None) -> Optional[str]:
if introduced_value is not None and self.values is not None and introduced_value not in self.values:
raise ValueError(
f"Provided '{self.name}={introduced_value}' is not valid, only values in {self.values} are allowed."
)
def _all_values_exist(self, introduced_value: Optional[Union[str, List[str]]] = None) -> Optional[str]:
if introduced_value is None or self.values is None:
return introduced_value

if isinstance(introduced_value, str):
values = [introduced_value]
else:
values = introduced_value

for value in values:
if value not in self.values:
raise ValueError(
f"Provided '{self.name}={value}' is not valid, only values in {self.values} are allowed."
)

return introduced_value

def _validator(self, value: Any) -> Any:
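The normalise-then-check logic of the new `_all_values_exist` can be tried in isolation with a sketch like the one below. `check_terms` is a hypothetical free function: the real method lives on the metadata property class and reads `self.name` and `self.values`. The return annotation here is widened to cover lists, since the value is handed back unchanged.

```python
from typing import List, Optional, Union

TermValue = Union[str, List[str]]


def check_terms(
    name: str,
    allowed: Optional[List[str]],
    introduced_value: Optional[TermValue] = None,
) -> Optional[TermValue]:
    """Hypothetical free-function version of the method in the diff."""
    # Nothing to validate when no value is given or any term is allowed.
    if introduced_value is None or allowed is None:
        return introduced_value

    # Treat a single string as a one-element list.
    if isinstance(introduced_value, str):
        values = [introduced_value]
    else:
        values = introduced_value

    for value in values:
        if value not in allowed:
            raise ValueError(
                f"Provided '{name}={value}' is not valid, only values in {allowed} are allowed."
            )
    return introduced_value


print(check_terms("terms-metadata", ["a", "b", "c"], ["b", "a"]))
try:
    check_terms("terms-metadata", ["a", "b", "c"], ["a", "z"])
except ValueError as exc:
    print(exc)  # the message names the first offending term
```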
14 changes: 11 additions & 3 deletions src/argilla/server/models/metadata_properties.py
@@ -43,9 +43,17 @@ class TermsMetadataPropertySettings(BaseMetadataPropertySettings):
type: Literal[MetadataPropertyType.terms]
values: Optional[List[str]] = None

def check_metadata(self, value: str) -> None:
if self.values is not None and value not in self.values:
raise ValueError(f"'{value}' is not an allowed term.")
def check_metadata(self, value: Union[str, List[str]]) -> None:
if self.values is None:
return

values = value
if isinstance(values, str):
values = [value]

for v in values:
if v not in self.values:
raise ValueError(f"'{v}' is not an allowed term.")


NT = TypeVar("NT", int, float)
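One detail worth spelling out in `check_metadata` is the `isinstance(values, str)` branch. Python strings are iterable, so looping over a bare string would test each character rather than the term itself. The sketch below contrasts the two behaviours; `naive_check` and `normalized_check` are hypothetical names and return booleans instead of raising, purely to keep the comparison short.

```python
from typing import List, Union


def naive_check(value: Union[str, List[str]], allowed: List[str]) -> bool:
    # Iterates a bare string character by character.
    return all(v in allowed for v in value)


def normalized_check(value: Union[str, List[str]], allowed: List[str]) -> bool:
    # Wrap a single string in a list first, as the diff does.
    values = [value] if isinstance(value, str) else value
    return all(v in allowed for v in values)


allowed = ["a", "b", "ab"]
print(naive_check("ba", allowed))              # True: "b" and "a" are each allowed
print(normalized_check("ba", allowed))         # False: "ba" itself is not a term
print(normalized_check(["a", "ab"], allowed))  # True
```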
33 changes: 33 additions & 0 deletions tests/unit/server/api/v1/test_records.py
@@ -276,6 +276,39 @@ async def test_update_record_with_no_metadata(
}
mock_search_engine.index_records.assert_not_called()

async def test_update_record_with_list_terms_metadata(
self, async_client: "AsyncClient", mock_search_engine: SearchEngine, owner_auth_header: dict
):
dataset = await DatasetFactory.create()
await TermsMetadataPropertyFactory.create(name="terms-metadata-property", dataset=dataset)
record = await RecordFactory.create(dataset=dataset)

response = await async_client.patch(
f"/api/v1/records/{record.id}",
headers=owner_auth_header,
json={
"metadata": {
"terms-metadata-property": ["a", "b", "c"],
},
},
)

assert response.status_code == 200
assert response.json() == {
"id": str(record.id),
"fields": {"text": "This is a text", "sentiment": "neutral"},
"metadata": {
"terms-metadata-property": ["a", "b", "c"],
},
"external_id": record.external_id,
"responses": [],
"suggestions": [],
"vectors": {},
"inserted_at": record.inserted_at.isoformat(),
"updated_at": record.updated_at.isoformat(),
}
mock_search_engine.index_records.assert_called_once_with(dataset, [record])

async def test_update_record_with_no_suggestions(
self, async_client: "AsyncClient", db: "AsyncSession", mock_search_engine: SearchEngine, owner_auth_header: dict
):