This repository contains the full training code to reproduce all TF-ID models. We also open-source the model weights and the human-annotated dataset, all under the MIT license.
TF-ID (Table/Figure IDentifier), created by Yifei Hu, is a family of object detection models fine-tuned to extract tables and figures from academic papers. They come in four versions:
Model | Model size | Model Description |
---|---|---|
TF-ID-base[HF] | 0.23B | Extract tables/figures and their caption text |
TF-ID-large[HF] (Recommended) | 0.77B | Extract tables/figures and their caption text |
TF-ID-base-no-caption[HF] | 0.23B | Extract tables/figures without caption text |
TF-ID-large-no-caption[HF] (Recommended) | 0.77B | Extract tables/figures without caption text |
All TF-ID models are finetuned from microsoft/Florence-2 checkpoints.
- Use `python inference.py` to extract bounding boxes from one given image.
- Use `python pdf_to_table_figures.py` to extract all tables and figures from one PDF paper and save the cropped figures and tables under `./sample_output`.
- TF-ID-large is used in the scripts by default. You can switch to a different variant by changing the `model_id` in the scripts, but the large models are always recommended.
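The scripts above return bounding boxes in Florence-2's object-detection output format, a dict of `bboxes` and `labels`. As a rough illustration of the post-processing step, here is a minimal, hypothetical helper (the actual logic lives in inference.py / pdf_to_table_figures.py and may differ) that clamps boxes to the page bounds before cropping:

```python
# Hypothetical post-processing helper — not the repo's actual code.
# Florence-2's <OD>-style output looks like:
# {"bboxes": [[x1, y1, x2, y2], ...], "labels": ["table", "figure", ...]}

def clamp_boxes(parsed, width, height):
    """Clamp bounding boxes to the image bounds and pair them with labels."""
    results = []
    for (x1, y1, x2, y2), label in zip(parsed["bboxes"], parsed["labels"]):
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        if x2 > x1 and y2 > y1:  # drop degenerate boxes
            results.append((label, (x1, y1, x2, y2)))
    return results

parsed = {"bboxes": [[-5.0, 10.0, 120.0, 90.0]], "labels": ["table"]}
print(clamp_boxes(parsed, 100, 100))  # box clipped to the 100x100 page
```

The resulting `(label, box)` pairs can then be passed to an image-cropping call to save each table/figure as its own file.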
- Clone the repo:
git clone https://github.com/ai8hyf/TF-ID
cd TF-ID
- Download the TF-ID-arxiv-papers dataset from Hugging Face: huggingface.co/datasets/yifeihu/TF-ID-arxiv-papers
- Move annotations_with_caption.json to `./annotations` (use annotations_no_caption.json if you don't want the bounding boxes to include caption text)
- Unzip arxiv_paper_images.zip and move the .png images to `./images`
- Convert the COCO-format dataset to Florence-2 format: `python coco_to_florence.py`
- You should see train.jsonl and test.jsonl under `./annotations`
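The conversion step maps COCO's `[x, y, w, h]` pixel boxes into Florence-2's location-token strings. As a sketch of the core idea (assuming the common 1000-bin quantization used in Florence-2 fine-tuning examples; the repo's coco_to_florence.py may differ in detail):

```python
def coco_box_to_florence(box, width, height, label):
    """Convert a COCO [x, y, w, h] pixel box into a Florence-2 target string.

    Assumes the 1000-bin coordinate quantization commonly used when
    fine-tuning Florence-2; check coco_to_florence.py for the real logic.
    """
    x, y, w, h = box
    x1, y1, x2, y2 = x, y, x + w, y + h
    # Normalize each coordinate into an integer bin in [0, 999].
    bins = [
        int(min(max(x1 / width, 0), 1) * 999),
        int(min(max(y1 / height, 0), 1) * 999),
        int(min(max(x2 / width, 0), 1) * 999),
        int(min(max(y2 / height, 0), 1) * 999),
    ]
    return label + "".join(f"<loc_{b}>" for b in bins)

print(coco_box_to_florence([50, 100, 200, 150], 1000, 1000, "table"))
# table<loc_49><loc_99><loc_249><loc_249>
```

Each training example in train.jsonl pairs an image with such a target string, so the model learns to emit the class name followed by four location tokens per box.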
- Train the model with Accelerate: `accelerate launch train.py`
- The checkpoints will be saved under `./model_checkpoints`
With microsoft/Florence-2-large-ft, `BATCH_SIZE=4` requires at least 40GB of VRAM on a single GPU. The microsoft/Florence-2-base-ft model takes much less VRAM. Adjust the `BATCH_SIZE` and `CHECKPOINT` parameters in train.py before you start training.
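For example, to train the smaller model on a GPU with limited VRAM, the two parameters might be set like this (the values shown are illustrative; pick a batch size that fits your hardware):

```python
# Near the top of train.py — adjust to your hardware.
CHECKPOINT = "microsoft/Florence-2-base-ft"  # base-ft needs far less VRAM than large-ft
BATCH_SIZE = 2                               # lower this if you hit out-of-memory errors
```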
We tested the models on paper pages outside the training dataset. The test pages are a subset of Hugging Face Daily Papers. An output counts as correct only if the model draws a correct bounding box for every table/figure on the given page.
Model | Total Images | Correct Output | Success Rate |
---|---|---|---|
TF-ID-base[HF] | 258 | 251 | 97.29% |
TF-ID-large[HF] | 258 | 253 | 98.06% |
TF-ID-base-no-caption[HF] | 261 | 253 | 96.93% |
TF-ID-large-no-caption[HF] | 261 | 254 | 97.32% |
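The success rates above are simply correct outputs divided by total images, as a quick sanity check shows:

```python
def success_rate(correct, total):
    """Success rate as a percentage, rounded to two decimal places."""
    return round(100 * correct / total, 2)

# Reproduce the table's last column from its Total/Correct columns.
for name, correct, total in [
    ("TF-ID-base", 251, 258),
    ("TF-ID-large", 253, 258),
    ("TF-ID-base-no-caption", 253, 261),
    ("TF-ID-large-no-caption", 254, 261),
]:
    print(f"{name}: {success_rate(correct, total)}%")
```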
Depending on the use case, some "incorrect" outputs could still be totally usable. For example, the model may draw two bounding boxes for one figure that has two child components.
- I learned how to work with Florence-2 models from Roboflow's awesome tutorial.
- My friend Yi Zhang helped annotate some of the data used to train our proof-of-concept models, including a YOLO-based TF-ID model.
If you find TF-ID useful, please cite this project as:
@misc{TF-ID,
author = {Yifei Hu},
title = {TF-ID: Table/Figure IDentifier for academic papers},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ai8hyf/TF-ID}},
}