The official implementation of UPOCR: Towards Unified Pixel-Level OCR Interface (ICML 2024). UPOCR is a simple yet effective generalist model that provides a unified pixel-level OCR interface. By unifying paradigms, architectures, and training strategies, a single UPOCR model simultaneously excels at diverse pixel-level OCR tasks. The framework of UPOCR is shown below.
We recommend using Anaconda to manage environments. Run the following commands to install dependencies.
conda create -n upocr python=3.9 -y
conda activate upocr
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
git clone https://github.com/shannanyinxiang/UPOCR.git
cd UPOCR
pip install -r requirements.txt
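To sanity-check the environment, you can verify that the expected PyTorch build is installed and that CUDA is visible (an optional check, not part of the official setup):

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```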
- Download the SCUT-EnsText [repo], TextSeg [repo], and Tampered-IC13 [repo] datasets.
- Preprocess the SCUT-EnsText dataset following [link].
- Arrange the datasets according to the file structure below.
data
├─TamperedTextDetection
│ └─Tampered-IC13
│ ├─test_gt
│ ├─test_img
│ ├─train_gt
│ └─train_img
├─TextRemoval
│ └─SCUT-EnsText
│ ├─train
│ │ ├─image
│ │ ├─label
│ │ └─mask
│ └─test
│ ├─image
│ ├─label
│ └─mask
└─TextSegmentation
└─TextSeg
├─image
├─semantic_label
└─split.json
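After arranging the files, an optional quick listing can confirm that the directory layout matches the structure above:

```bash
find data -maxdepth 5 -type d | sort
```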
- Download the UPOCR weights at [link].
- Run the following command to perform model inference on the TextSeg dataset.
dataset=textseg # or scut-enstext or tampered-ic13
output_dir=./output/upocr-infer/
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES=0 \
torchrun \
--master_port=3140 \
--nproc_per_node=1 \
main.py \
--output_dir ${output_dir} \
--data_cfg_paths data_configs/train/scut-enstext.yaml data_configs/train/tampered-ic13.yaml data_configs/train/textseg.yaml \
--eval true \
--resume pretrained/upocr.pth \
--eval_data_cfg_path data_configs/eval/${dataset}.yaml \
--visualize true \
--textseg_conf_thres 0.4 # Tune this argument for optimal text segmentation performance.
Change the `dataset` variable to `scut-enstext` or `tampered-ic13` to run inference on the SCUT-EnsText or Tampered-IC13 datasets, respectively.
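To evaluate all three benchmarks sequentially, a simple loop over the `dataset` variable should also work (a sketch based on the inference command above; reuse or adjust `output_dir` and the port as needed):

```bash
for dataset in textseg scut-enstext tampered-ic13; do
  CUDA_VISIBLE_DEVICES=0 \
  torchrun \
    --master_port=3140 \
    --nproc_per_node=1 \
    main.py \
    --output_dir ${output_dir} \
    --data_cfg_paths data_configs/train/scut-enstext.yaml data_configs/train/tampered-ic13.yaml data_configs/train/textseg.yaml \
    --eval true \
    --resume pretrained/upocr.pth \
    --eval_data_cfg_path data_configs/eval/${dataset}.yaml \
    --visualize true \
    --textseg_conf_thres 0.4
done
```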
- For the text removal task, run the following commands to calculate the image-quality evaluation metrics. For the other two tasks, the metrics are calculated automatically during the inference step above.
python -u eval/text_removal/evaluation.py \
--gt_path data/TextRemoval/SCUT-EnsText/test/label/ \
--target_path output/upocr-infer/SCUT-EnsText
python -m pytorch_fid \
data/TextRemoval/SCUT-EnsText/test/label/ \
output/upocr-infer/SCUT-EnsText \
--device cuda:0
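The FID command relies on the pytorch-fid package; if it is not already pulled in by requirements.txt, install it first:

```bash
pip install pytorch-fid
```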
- Download the pre-training weights for UPOCR at [link].
- Run the following command for model training.
output_dir=./output/upocr-train/
log_path=${output_dir}log_train.txt
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES=0,1 \
torchrun \
--master_port=3140 \
--nproc_per_node=2 \
main.py \
--output_dir ${output_dir} \
--data_cfg_paths data_configs/train/scut-enstext.yaml data_configs/train/tampered-ic13.yaml data_configs/train/textseg.yaml \
--pretrained_model pretrained/pretraining_weights.pth \
--amp true | tee -a ${log_path}
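To train with a different number of GPUs, change CUDA_VISIBLE_DEVICES and --nproc_per_node together. For example, a four-GPU run would look like the sketch below (the official instructions do not say whether hyperparameters need rescaling for other GPU counts):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun \
--master_port=3140 \
--nproc_per_node=4 \
main.py \
--output_dir ${output_dir} \
--data_cfg_paths data_configs/train/scut-enstext.yaml data_configs/train/tampered-ic13.yaml data_configs/train/textseg.yaml \
--pretrained_model pretrained/pretraining_weights.pth \
--amp true | tee -a ${log_path}
```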
@inproceedings{peng2024upocr,
title={{UPOCR}: Towards Unified Pixel-Level {OCR} Interface},
author={Peng, Dezhi and Yang, Zhenhua and Zhang, Jiaxin and Liu, Chongyu and Shi, Yongxin and Ding, Kai and Guo, Fengjun and Jin, Lianwen},
booktitle={International Conference on Machine Learning},
year={2024},
}
This repository can only be used for non-commercial research purposes.
For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).
Copyright 2024, Deep Learning and Vision Computing Lab, South China University of Technology.