- HierText. Follow the official repo of HierText to download the dataset images. I label and provide the pixel-level text (stroke) segmentation ground-truths (png format, binary, 0 for background, 255 for text foreground), which can be downloaded with the following OneDrive links: train_gt (131MB), validation_gt (26MB), test_gt (25MB).
To train Hi-SAM for hierarchical text segmentation, download the training gt json file, which is derived from the gt in the HierText repo using HierText/process_gt.py.
- Total-Text. Follow the official repo of Total-Text to download the dataset. For text (stroke) segmentation, please download the character-level mask ground-truths.
- TextSeg. Follow the official repo of TextSeg to apply for the dataset.
(1) For Total-Text, rename groundtruth_pixel/Train/img61.JPG to groundtruth_pixel/Train/img61.jpg.
(2) For TextSeg, see TextSeg/process_textseg.py and use it to split the original data.
(3) Organize the datasets in the following structure:
|- HierText
| |- train
| |- train_gt
| |- validation
| |- validation_gt
| |- test
| |- test_gt
| └ train_shrink_vert.json
|- TotalText
| |- groundtruth_pixel
| | |- Test
| | └ Train
| └ Images
|   |- Test
|   └ Train
|- TextSeg
| |- train_images
| |- train_gt
| |- val_images
| |- val_gt
| |- test_images
| └ test_gt
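
Steps (1) and (3) above can be scripted. The following is a minimal sketch, not part of the repo: the function names `rename_totaltext_mask` and `missing_paths` are hypothetical, and the `root` argument is assumed to be whichever directory holds the `HierText`, `TotalText`, and `TextSeg` folders. It assumes that `Test` and `Train` sit under both `groundtruth_pixel` and `Images`, as the rename path in step (1) suggests. It uses only the Python standard library.

```python
from pathlib import Path

# Expected entries, transcribed from the directory tree above.
# Paths are relative to the dataset root (an assumption; pass your own root).
EXPECTED_PATHS = [
    "HierText/train",
    "HierText/train_gt",
    "HierText/validation",
    "HierText/validation_gt",
    "HierText/test",
    "HierText/test_gt",
    "HierText/train_shrink_vert.json",
    "TotalText/groundtruth_pixel/Test",
    "TotalText/groundtruth_pixel/Train",
    "TotalText/Images/Test",
    "TotalText/Images/Train",
    "TextSeg/train_images",
    "TextSeg/train_gt",
    "TextSeg/val_images",
    "TextSeg/val_gt",
    "TextSeg/test_images",
    "TextSeg/test_gt",
]


def rename_totaltext_mask(root):
    """Step (1): rename img61.JPG to img61.jpg. Returns True if a rename happened."""
    train_dir = Path(root) / "TotalText" / "groundtruth_pixel" / "Train"
    if not train_dir.is_dir():
        return False
    # Compare exact entry names so the check is not fooled by
    # case-insensitive filesystems.
    names = {p.name for p in train_dir.iterdir()}
    if "img61.JPG" not in names:
        return False
    (train_dir / "img61.JPG").rename(train_dir / "img61.jpg")
    return True


def missing_paths(root):
    """Step (3): list the expected entries that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

Call `rename_totaltext_mask(root)` once after downloading Total-Text, then `missing_paths(root)`; an empty list means every folder and file in the tree above is in place.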