corpora

Here are 58 public repositories matching this topic...

juand-r / entity-recognition-datasets

A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

nlp natural-language-processing annotations named-entity-recognition corpora datasets ner nlp-resources entity-extraction entity-recognition

Updated Jun 25, 2024
Python

nltk / nltk_data

NLTK Data

nlp natural-language-processing linguistics nltk corpora

Updated Jul 29, 2024
Python

piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.

dataset gensim corpora pretrained-models word2vec-model lda-model lsi-model glove-model

Updated Mar 16, 2018
Python

PlanTL-GOB-ES / lm-spanish

Official source for Spanish language models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

nlp transformers embeddings benchmarks corpora language-model

Updated Jul 27, 2023
Python

zliucr / CrossNER

CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

dataset named-entity-recognition corpora multi-domain ner cross-domain sequence-labeling domain-adaptation low-resource multi-domain-adaptation

Updated Jan 5, 2021
Python

jfainberg / self_dialogue_corpus

The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports

nlp dialogue corpora

Updated Mar 19, 2024
Python

saidziani / Arabic-News-Article-Classification

Automatic document categorization assigns a category to a text based on the information it contains. This project applies different supervised machine learning approaches.

nlp machine-learning python3 nltk corpora arabic-nlp arabic-language text-categorization

Updated Jan 1, 2019
Python

josecannete / spanish-corpora

Unannotated Spanish 3 Billion Words Corpora

nlp natural-language-processing linguistics spanish corpora spanish-language

Updated Oct 20, 2022
Python

jacklanda / CCAE

The Official Repository for 👉 CCAE: A Corpus of Chinese-based Asian Englishes @ NLPCC 2023

nlp transfer-learning corpora language-model

Updated Dec 6, 2023
Python

hu-ner / huner

Named Entity Recognition for biomedical entities

named-entity-recognition neural-networks corpora ner bionlp

Updated Jan 11, 2023
Python

CyberZHG / wiki-dump-reader

Extract corpora from Wikipedia dumps

nlp wikipedia corpora

Updated Mar 26, 2019
Python

PlanTL-GOB-ES / lm-biomedical-clinical-es

Official source for Spanish pretrained biomedical and clinical language models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

nlp transformers clinical spanish corpora language-model biomedical

Updated Nov 16, 2022
Python

Esukhia / Corpora

Repository for Tibetan corpora

corpora tibetan-nlp tibetan-corpora tibetan-speech

Updated Apr 10, 2023
Python

EdwardSeley / lyrics-corpora

An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts

python music artists lyrics corpus songs python-api corpora corpus-linguistics scrapper scraping-websites corpus-tools billboard-charts

Updated Jul 2, 2018
Python

WladimirSidorenko / PotTS

The Potsdam Twitter Sentiment Corpus

nlp social-media sentiment-analysis corpora opinion-mining

Updated Jan 15, 2020
Python

NetherlandsForensicInstitute / demeuk

Demeuk is a simple tool to clean up corpora (like dictionaries) or any dataset containing plain text strings.

encoding corpus cleanup passwords corpora hacktoberfest

Updated Aug 26, 2024
Python

korenyoni / opus-api

OPUS (opus.nlpl.eu) Python3 API

python api machine-learning corpus corporate opus corpora language-model parallel-corpus parallel-corpora

Updated Sep 19, 2024
Python

cartesinus / leyzer

Multilingual text corpus designed to study multilingual and cross-lingual natural language understanding (NLU) models and localization strategies for virtual assistants

machine-translation nlu corpora virtual-assistant

Updated Jul 16, 2023
Python

jonathandunn / corpus_similarity

Measure the similarity of text corpora for 74 languages

nlp language natural-language-processing text corpus corpora corpus-linguistics corpus-tools corpus-processing

Updated Jan 26, 2024
Python

jonathandunn / common_crawl_corpus

Scripts for building a geo-located web corpus using Common Crawl data

corpora corpus-linguistics web-crawling corpus-tools corpus-processing

Updated Mar 13, 2024
Python
