This repository contains the full training code to reproduce all TF-ID models. We also open-source the model weights and the human-annotated dataset, all under the MIT license.
TF-ID (Table/Figure IDentifier), created by Yifei Hu, is a family of object detection models fine-tuned to extract tables and figures from academic papers. They come in four versions:
Model | Model size | Model Description |
---|---|---|
TF-ID-base[HF] | 0.23B | Extract tables/figures and their caption text |
TF-ID-large[HF] (Recommended) | 0.77B | Extract tables/figures and their caption text |
TF-ID-base-no-caption[HF] | 0.23B | Extract tables/figures without caption text |
TF-ID-large-no-caption[HF] (Recommended) | 0.77B | Extract tables/figures without caption text |
All TF-ID models are finetuned from microsoft/Florence-2 checkpoints.
- Use `python inference.py` to extract bounding boxes from one given image.
- Use `python pdf_to_table_figures.py` to extract all tables and figures from one PDF paper and save the cropped figures and tables under `./sample_output`.
- TF-ID-large is used in the scripts by default. You can switch to a different variant by changing the `model_id` in the scripts, but the large models are always recommended.
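The scripts above return bounding boxes in Florence-2's object-detection output format, a dict of `bboxes` and `labels`. As a rough illustration of the post-processing step, here is a minimal, hypothetical helper (the actual logic lives in inference.py / pdf_to_table_figures.py and may differ) that clamps boxes to the page bounds before cropping:

```python
# Hypothetical post-processing helper — not the repo's actual code.
# Florence-2's <OD>-style output looks like:
# {"bboxes": [[x1, y1, x2, y2], ...], "labels": ["table", "figure", ...]}

def clamp_boxes(parsed, width, height):
    """Clamp bounding boxes to the image bounds and pair them with labels."""
    results = []
    for (x1, y1, x2, y2), label in zip(parsed["bboxes"], parsed["labels"]):
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        if x2 > x1 and y2 > y1:  # drop degenerate boxes
            results.append((label, (x1, y1, x2, y2)))
    return results

parsed = {"bboxes": [[-5.0, 10.0, 120.0, 90.0]], "labels": ["table"]}
print(clamp_boxes(parsed, 100, 100))  # box clipped to the 100x100 page
```

The resulting `(label, box)` pairs can then be passed to an image-cropping call to save each table/figure as its own file.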
- Clone the repo:
git clone https://github.com/ai8hyf/TF-ID
cd TF-ID
- Download the TF-ID-arxiv-papers dataset from Hugging Face: huggingface.co/datasets/yifeihu/TF-ID-arxiv-papers
- Move annotations_with_caption.json to `./annotations` (use annotations_no_caption.json if you don't want the bounding boxes to include caption text)
- Unzip arxiv_paper_images.zip and move the .png images to `./images`
- Convert the COCO-format dataset to Florence-2 format: `python coco_to_florence.py`
- You should see train.jsonl and test.jsonl under `./annotations`
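The conversion step maps COCO's `[x, y, w, h]` pixel boxes into Florence-2's location-token strings. As a sketch of the core idea (assuming the common 1000-bin quantization used in Florence-2 fine-tuning examples; the repo's coco_to_florence.py may differ in detail):

```python
def coco_box_to_florence(box, width, height, label):
    """Convert a COCO [x, y, w, h] pixel box into a Florence-2 target string.

    Assumes the 1000-bin coordinate quantization commonly used when
    fine-tuning Florence-2; check coco_to_florence.py for the real logic.
    """
    x, y, w, h = box
    x1, y1, x2, y2 = x, y, x + w, y + h
    # Normalize each coordinate into an integer bin in [0, 999].
    bins = [
        int(min(max(x1 / width, 0), 1) * 999),
        int(min(max(y1 / height, 0), 1) * 999),
        int(min(max(x2 / width, 0), 1) * 999),
        int(min(max(y2 / height, 0), 1) * 999),
    ]
    return label + "".join(f"<loc_{b}>" for b in bins)

print(coco_box_to_florence([50, 100, 200, 150], 1000, 1000, "table"))
# table<loc_49><loc_99><loc_249><loc_249>
```

Each training example in train.jsonl pairs an image with such a target string, so the model learns to emit the class name followed by four location tokens per box.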
- Train the model with Accelerate: `accelerate launch train.py`
- The checkpoints will be saved under `./model_checkpoints`
With microsoft/Florence-2-large-ft, `BATCH_SIZE=4` requires at least 40GB of VRAM on a single GPU. The microsoft/Florence-2-base-ft model takes much less VRAM. Adjust the `BATCH_SIZE` and `CHECKPOINT` parameters in train.py before you start training.
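For example, to train the smaller model on a GPU with limited VRAM, the two parameters might be set like this (the values shown are illustrative; pick a batch size that fits your hardware):

```python
# Near the top of train.py — adjust to your hardware.
CHECKPOINT = "microsoft/Florence-2-base-ft"  # base-ft needs far less VRAM than large-ft
BATCH_SIZE = 2                               # lower this if you hit out-of-memory errors
```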
We tested the models on paper pages outside the training dataset. The test pages are a subset of Hugging Face Daily Papers. An output counts as correct only if the model draws a correct bounding box for every table/figure on the given page.
Model | Total Images | Correct Output | Success Rate |
---|---|---|---|
TF-ID-base[HF] | 258 | 251 | 97.29% |
TF-ID-large[HF] | 258 | 253 | 98.06% |
TF-ID-base-no-caption[HF] | 261 | 253 | 96.93% |
TF-ID-large-no-caption[HF] | 261 | 254 | 97.32% |
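The success rates above are simply correct outputs divided by total images, as a quick sanity check shows:

```python
def success_rate(correct, total):
    """Success rate as a percentage, rounded to two decimal places."""
    return round(100 * correct / total, 2)

# Reproduce the table's last column from its Total/Correct columns.
for name, correct, total in [
    ("TF-ID-base", 251, 258),
    ("TF-ID-large", 253, 258),
    ("TF-ID-base-no-caption", 253, 261),
    ("TF-ID-large-no-caption", 254, 261),
]:
    print(f"{name}: {success_rate(correct, total)}%")
```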
Depending on the use case, some "incorrect" outputs could still be totally usable. For example, the model may draw two bounding boxes for one figure that has two child components.
- I learned how to work with Florence-2 models from Roboflow's awesome tutorial.
- My friend Yi Zhang helped annotate some of the data used to train our proof-of-concept models, including a YOLO-based TF-ID model.
If you find TF-ID useful, please cite this project as:
@misc{TF-ID,
author = {Yifei Hu},
title = {TF-ID: Table/Figure IDentifier for academic papers},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ai8hyf/TF-ID}},
}