i-Code-Doc

Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal

Open Source Checklist:

  • Release Model (Encoder + Text decoder)
  • Release Most Scripts
  • Vision Decoder / Weights (because of ethical concerns about fake document generation, we plan to release this functionality as an Azure API)
  • Demo

Introduction

UDOP unifies vision, text, and layout through a Vision-Text-Layout Transformer and unified generative pretraining tasks, including vision, text, layout, and mixed tasks. We show the task prompts (left) and task targets (right) for all self-supervised objectives (joint text-layout reconstruction, visual text recognition, layout modeling, and masked autoencoding) and two example supervised objectives (question answering and layout analysis).

Install

Set up the Python environment

conda create -n UDOP python=3.8   # You can also use another environment.

Install other dependencies

pip install -r requirements.txt

Run Scripts

Switch model type by:

--model_type "UdopDual"

--model_type "UdopUnimodel"
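As a rough illustration of how a flag like this is typically consumed, the sketch below parses `--model_type` with `argparse` and maps the value to a class. The class names `UdopDualSketch` and `UdopUnimodelSketch`, the `MODEL_REGISTRY` dictionary, and the `parse_model_type` helper are hypothetical placeholders for illustration only; the repository's actual classes and argument handling may differ.

```python
import argparse


# Hypothetical stand-ins for the two model variants. The repository's real
# classes may have different names and constructors.
class UdopDualSketch:
    name = "UdopDual"


class UdopUnimodelSketch:
    name = "UdopUnimodel"


# Maps each accepted --model_type value to the class it selects.
MODEL_REGISTRY = {
    "UdopDual": UdopDualSketch,
    "UdopUnimodel": UdopUnimodelSketch,
}


def parse_model_type(argv=None):
    """Parse --model_type and return the matching (placeholder) class."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_type",
        choices=sorted(MODEL_REGISTRY),
        required=True,
        help="Which model variant to build.",
    )
    args = parser.parse_args(argv)
    return MODEL_REGISTRY[args.model_type]


if __name__ == "__main__":
    selected = parse_model_type(["--model_type", "UdopDual"])
    print(selected.name)
```

Restricting the flag with `choices` means a misspelled model type fails at argument parsing rather than later during model construction.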

Finetuning on RVLCDIP

Download RVLCDIP first and update the dataset path to point to it. For OCR, you might need to customize your code.

bash scripts/finetune_rvlcdip.sh   # Finetuning on RVLCDIP
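Before launching the script, it can help to confirm that the downloaded dataset is readable from the path you configured. The sketch below assumes the commonly distributed RVLCDIP layout, in which each line of a label file (for example `labels/train.txt`) holds a relative image path followed by a class index. That layout and the helper name `read_rvlcdip_labels` are assumptions, not something this repository specifies.

```python
from pathlib import Path


def read_rvlcdip_labels(label_file):
    """Return a list of (relative_image_path, class_index) pairs.

    Assumes one sample per line in the form "<relative path> <label>".
    """
    samples = []
    for line in Path(label_file).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last space so paths containing spaces stay intact.
        rel_path, label = line.rsplit(" ", 1)
        samples.append((rel_path, int(label)))
    return samples
```

Calling it on the training label file and checking that a few of the returned paths exist under the image directory is a quick sanity check of the configured path.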

Finetuning on DUE Benchmark

Download Duebenchmark and follow its procedure to preprocess the data.

The training code adapted to our framework is hosted at benchmarker. Run it with:

bash scripts/finetune_duebenchmark.sh   # Finetuning on DUE Benchmark; switch tasks by changing the path to the dataset

The generated outputs can be evaluated with the Duebenchmark due_evaluator.

Model Checkpoints

The model checkpoints are hosted on the Huggingface Hub.

Models                                      Huggingface Weights Address
Unimodel 512                                udop-unimodel-large-512.zip
Unimodel 512 (new, trained on more steps)   udop-unimodel-large-512-300k-steps.zip
Dual 224                                    udop-dual-large-224.zip
Unimodel 224                                udop-unimodel-large-224.zip
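The checkpoints are distributed as zip archives, so they need to be unpacked after download. A minimal sketch using only the Python standard library follows. The helper name `extract_checkpoint` and the local paths in the usage note are hypothetical, and the internal layout of each archive is not documented here, so the function simply reports what it finds.

```python
import zipfile
from pathlib import Path


def extract_checkpoint(zip_path, target_dir):
    """Unpack a downloaded checkpoint archive and list its contents."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        # Return the member names so the caller can locate the weights.
        return sorted(archive.namelist())
```

For example, `extract_checkpoint("udop-unimodel-large-224.zip", "checkpoints/")` would unpack that archive into a local `checkpoints/` directory (a hypothetical location) and return the list of extracted member names.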

Citation

@article{tang2022unifying,
  title={Unifying Vision, Text, and Layout for Universal Document Processing},
  author={Tang, Zineng and Yang, Ziyi and Wang, Guoxin and Fang, Yuwei and Liu, Yang and Zhu, Chenguang and Zeng, Michael and Zhang, Cha and Bansal, Mohit},
  journal={arXiv preprint arXiv:2212.02623},
  year={2022}
}

Contact

Zineng Tang ([email protected])