Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal
Open Source Checklist:
- Release Model (Encoder + Text decoder)
- Release Most Scripts
- Vision Decoder / Weights (due to ethical considerations around fake-document generation, we plan to release this functionality as an Azure API)
- Demo
UDOP unifies vision, text, and layout through a vision-text-layout Transformer and unified generative pretraining tasks, including vision, text, layout, and mixed tasks. We show the task prompts (left) and task targets (right) for all self-supervised objectives (joint text-layout reconstruction, visual text recognition, layout modeling, and masked autoencoding) and two example supervised objectives (question answering and layout analysis).
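As a concrete illustration of how layout enters the same generative vocabulary as text, the sketch below quantizes a normalized bounding box into discrete layout tokens. The bin count (500) and the `<loc_{i}>` token format are illustrative assumptions, not the repository's exact implementation.

```python
NUM_BINS = 500  # assumed number of quantization buckets per coordinate

def box_to_layout_tokens(box, num_bins=NUM_BINS):
    """Quantize a normalized (x0, y0, x1, y1) box into discrete layout tokens."""
    tokens = []
    for coord in box:
        # clamp to [0, 1], then discretize into num_bins buckets
        coord = min(max(coord, 0.0), 1.0)
        bucket = min(int(coord * num_bins), num_bins - 1)
        tokens.append(f"<loc_{bucket}>")
    return tokens

# a word occupying the center-left of the page:
print(box_to_layout_tokens((0.10, 0.45, 0.30, 0.55)))
```

With tokens like these interleaved into the text stream, layout modeling and joint text-layout reconstruction become ordinary sequence-to-sequence targets.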
```bash
conda create -n UDOP python=3.8  # you can also use another environment
pip install -r requirements.txt
```
Switch the model type with:

```bash
--model_type "UdopDual"
--model_type "UdopUnimodel"
```
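For readers wiring the flag into their own scripts, a hypothetical sketch of how `--model_type` could select between the two variants is shown below; the registry and `build_model` helper are illustrative assumptions, not the repository's actual code.

```python
import argparse

# stand-ins for the two UDOP variants (placeholders, not the real classes)
class UdopDual: ...
class UdopUnimodel: ...

MODEL_REGISTRY = {"UdopDual": UdopDual, "UdopUnimodel": UdopUnimodel}

def build_model(model_type: str):
    """Instantiate the variant named by --model_type, failing loudly otherwise."""
    try:
        return MODEL_REGISTRY[model_type]()
    except KeyError:
        raise ValueError(f"unknown --model_type {model_type!r}; "
                         f"expected one of {sorted(MODEL_REGISTRY)}")

parser = argparse.ArgumentParser()
parser.add_argument("--model_type", choices=sorted(MODEL_REGISTRY),
                    default="UdopUnimodel")
args = parser.parse_args(["--model_type", "UdopDual"])
model = build_model(args.model_type)
print(type(model).__name__)  # → UdopDual
```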
Download RVL-CDIP first and change the dataset path. For OCR, you might need to customize your own code.

```bash
bash scripts/finetune_rvlcdip.sh  # finetuning on RVL-CDIP
```
Download the DUE Benchmark and follow its procedure to preprocess the data. The training code adapted to our framework is hosted at benchmarker; run:

```bash
bash scripts/finetune_duebenchmark.sh  # finetuning on the DUE Benchmark; switch tasks by changing the dataset path
```
The generated outputs can be evaluated with the DUE Benchmark's due_evaluator.
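For a sense of what the QA-style evaluation computes, here is a minimal reference implementation of ANLS (Average Normalized Levenshtein Similarity), one of the metrics due_evaluator reports. This is an illustrative sketch, not the due_evaluator code itself; lowercasing and the 0.5 threshold follow the common DocVQA convention.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """Average over questions of the best per-gold-answer similarity score."""
    scores = []
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            denom = max(len(pred), len(gold)) or 1
            nl = levenshtein(pred.lower(), gold.lower()) / denom
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores)

print(anls(["udop"], [["UDOP"]]))  # exact match up to case → 1.0
```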
The model checkpoints are hosted on the Hugging Face Hub.
| Model | Hugging Face Weights |
|---|---|
| Unimodel 512 | udop-unimodel-large-512.zip |
| Unimodel 512 (new, trained for more steps) | udop-unimodel-large-512-300k-steps.zip |
| Dual 224 | udop-dual-large-224.zip |
| Unimodel 224 | udop-unimodel-large-224.zip |
```bibtex
@article{tang2022unifying,
  title={Unifying Vision, Text, and Layout for Universal Document Processing},
  author={Tang, Zineng and Yang, Ziyi and Wang, Guoxin and Fang, Yuwei and Liu, Yang and Zhu, Chenguang and Zeng, Michael and Zhang, Cha and Bansal, Mohit},
  journal={arXiv preprint arXiv:2212.02623},
  year={2022}
}
```
Zineng Tang ([email protected])