A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Updated Jun 25, 2024 - Python
Data repository for pretrained NLP models and NLP corpora.
Official source for Spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)
Automatic categorization of documents consists of assigning a category to a text based on the information it contains. We'll follow several different supervised machine learning approaches.
Unannotated Spanish 3 Billion Words Corpora
The Official Repository for 👉 CCAE: A Corpus of Chinese-based Asian Englishes @ NLPCC 2023
Named Entity Recognition for biomedical entities
Official source for Spanish pretrained biomedical and clinical language models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).
Repository for Tibetan corpora
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and Billboard charts
The Potsdam Twitter Sentiment Corpus
OPUS (opus.nlpl.eu) Python3 API
Multilingual text corpus designed to study multilingual and cross-lingual natural language understanding (NLU) models and the strategies of localization of virtual assistants
Measure the similarity of text corpora for 74 languages
Scripts for building a geo-located web corpus using Common Crawl data