
Automatic Error Correction Using the Wikipedia Page Revision History

Error correction is one of the most crucial and time-consuming steps of data preprocessing. State-of-the-art error correction systems leverage various signals, such as predefined data constraints or user-provided correction examples, to fix erroneous values in a semi-supervised manner. While these approaches reduce human involvement to a few labeled tuples, they still require supervision to fix data errors. In this paper, we propose a novel error correction approach that automatically fixes data errors in dirty datasets. Our approach pretrains a set of error corrector models on correction examples extracted from the Wikipedia page revision history. It then fine-tunes these models on the dirty dataset at hand without requiring any user labels. Finally, our approach aggregates the fine-tuned error corrector models to find the actual correction of each data error. As our experiments show, our approach automatically fixes a large portion of the data errors in various dirty datasets with high precision.
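A minimal, hypothetical sketch of the three stages is shown below; the names, the toy corrector models, and the majority-voting aggregation are illustrative assumptions and do not mirror the actual implementation in this repository.

```python
# Hypothetical sketch: several corrector models (standing in for models
# pretrained on Wikipedia revisions and fine-tuned on the dirty dataset)
# each propose a correction, and the proposals are aggregated.
from collections import Counter
from typing import Callable, List

Corrector = Callable[[str], str]  # maps a dirty value to a candidate correction

def aggregate_corrections(correctors: List[Corrector], dirty_value: str) -> str:
    """Aggregate candidate corrections by simple majority voting
    (the aggregation strategy of the actual system may differ)."""
    candidates = [corrector(dirty_value) for corrector in correctors]
    best, _ = Counter(candidates).most_common(1)[0]
    return best

# Toy corrector models for illustration only.
correctors = [
    lambda v: v.replace("Berln", "Berlin"),
    lambda v: v,  # a corrector that proposes no change
    lambda v: v.replace("Berln", "Berlin"),
]
print(aggregate_corrections(correctors, "Berln"))  # -> "Berlin"
```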

How to run the system

Our approach requires only the dirty dataset as input. Place the dataset in the datasets directory and run the notebook Error_Corrrection.ipynb. All required packages can be installed directly from within the notebook.
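As a hedged, optional sanity check (not part of the repository), one can verify that the dataset is readable from the expected location before launching the notebook; the file name below is only an example and should be replaced with your own dataset.

```python
# Hypothetical pre-flight check: confirm the dirty dataset sits in the
# datasets directory and loads cleanly before running the notebook.
import pandas as pd

dirty_df = pd.read_csv("datasets/my_dirty_dataset.csv", dtype=str)  # example path
print(dirty_df.shape)
print(dirty_df.head())
```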

Resources we used for our system

Wiki Dump Files

Wiki Dump File Link: July 2020

Wiki Revision Table & Infobox Parser

mwparserfromhell

wikitextparser
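As an illustration of how these parsers are typically used, the hedged sketch below extracts infobox parameter values from a revision's wikitext with mwparserfromhell; the revision text is made up, and the repository's own extraction code may differ.

```python
# Hedged example: pull infobox parameter values out of one revision's wikitext.
import mwparserfromhell

revision_text = """
{{Infobox settlement
| name       = Berln
| population = 3664088
}}
"""

wikicode = mwparserfromhell.parse(revision_text)
for template in wikicode.filter_templates():
    if str(template.name).strip().lower().startswith("infobox"):
        for param in template.params:
            print(str(param.name).strip(), "=", str(param.value).strip())
# Applying the same extraction to consecutive revisions of a page yields the
# field values that are later compared to obtain old/new correction pairs.
```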

Extract Old_New Values

Difflib
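The sketch below illustrates how old/new value pairs can be obtained with Python's difflib; the two value lists are invented and stand for the same infobox fields in two consecutive revisions of a page.

```python
# Hedged example: align the values of two consecutive revisions and keep
# only the positions where the value was replaced (i.e., corrected).
import difflib

old_values = ["Berln", "3664088", "Germany"]   # values in the earlier revision
new_values = ["Berlin", "3664088", "Germany"]  # values in the later revision

matcher = difflib.SequenceMatcher(None, old_values, new_values)
corrections = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "replace":
        corrections.extend(zip(old_values[i1:i2], new_values[j1:j2]))

print(corrections)  # -> [('Berln', 'Berlin')]
```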

Model: Pretraining & Fine-tuning

Edit_Distance

Gensim

fastText
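To illustrate how such signals can be combined, the hedged sketch below scores candidate corrections with a character-aware fastText embedding (trained via Gensim on a toy corpus) and an edit-distance-style similarity; the corpus, the equal weighting, and the function names are assumptions for illustration, not the repository's actual model.

```python
# Hedged example: rank candidate corrections for a dirty value by combining
# fastText embedding similarity with a ratio-based edit similarity.
import difflib
from gensim.models import FastText

# Tiny toy corpus standing in for values harvested from Wikipedia revisions.
corpus = [["berlin", "germany"], ["hamburg", "germany"], ["munich", "bavaria"]]
model = FastText(sentences=corpus, vector_size=32, window=3, min_count=1, epochs=20)

def score(dirty: str, candidate: str) -> float:
    embedding_sim = float(model.wv.similarity(dirty, candidate))
    edit_sim = difflib.SequenceMatcher(None, dirty, candidate).ratio()
    return 0.5 * embedding_sim + 0.5 * edit_sim  # equal weights, chosen arbitrarily

candidates = ["berlin", "hamburg", "munich"]
print(max(candidates, key=lambda c: score("berln", c)))  # most likely "berlin"
```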

Code Adapted From

Raha

Typo Error

