Data Preparation

Step1. Download

HierText. Follow the official repo of HierText to download the dataset images. I labeled and provide the pixel-level text (stroke) segmentation ground truths as binary PNG files (0 for background, 255 for text foreground), which can be downloaded from the following OneDrive links: train_gt (131MB), validation_gt (26MB), test_gt (25MB).
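
As a quick sanity check after downloading, the sketch below tests whether a mask contains only the two documented values (0 and 255). It is not part of this repo: the function names `is_binary_mask` and `check_mask_file` are hypothetical, and loading a file assumes Pillow is installed.

```python
from typing import Iterable


def is_binary_mask(pixels: Iterable[int]) -> bool:
    """Return True if every value is 0 (background) or 255 (text foreground)."""
    return set(pixels) <= {0, 255}


def check_mask_file(path: str) -> bool:
    """Load a mask PNG as 8-bit grayscale and check that it is binary.

    Pillow is an assumed dependency; it is imported lazily so that
    is_binary_mask() can be used without it.
    """
    from PIL import Image

    with Image.open(path) as img:
        # Iterating over bytes yields one integer per grayscale pixel.
        return is_binary_mask(img.convert("L").tobytes())
```

For example, `check_mask_file("train_gt/<some_mask>.png")` should return `True` for a correctly downloaded ground-truth file, where `<some_mask>` stands for any file name in the folder.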

To train Hi-SAM for hierarchical text segmentation, download the training gt json file, which is derived from the gt in the HierText repo using HierText/process_gt.py.

Total-Text. Follow the official repo of Total-Text to download the dataset. For text (stroke) segmentation, please download the character-level mask ground-truths.

TextSeg. Follow the official repo of TextSeg to apply for the dataset.

Step2. Process & Organization

(1) For Total-Text, rename groundtruth_pixel/Train/img61.JPG to groundtruth_pixel/Train/img61.jpg .
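
The rename in (1) can also be scripted. This is a minimal sketch, assuming it is run from the Total-Text dataset folder; the helper name `lowercase_extension` is hypothetical and not part of this repo.

```python
from pathlib import Path


def lowercase_extension(path: Path) -> Path:
    """Rename `path` so that its file extension is lowercase.

    Returns the resulting path. The file is left untouched if the
    extension is already lowercase.
    """
    target = path.with_suffix(path.suffix.lower())
    if target.name != path.name:
        path.rename(target)
    return target


if __name__ == "__main__":
    src = Path("groundtruth_pixel/Train/img61.JPG")
    if src.exists():
        print(lowercase_extension(src))
```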

(2) For TextSeg, see TextSeg/process_textseg.py and use it to split the original data.

(3) Organize the datasets as the following structure:

|- HierText
|  |- train
|  |- train_gt
|  |- validation
|  |- validation_gt
|  |- test
|  |- test_gt
|  └  train_shrink_vert.json
|- TotalText
|  |- groundtruth_pixel
|  |  |- Test
|  |  └  Train
|  └  Images
|     |- Test
|     └  Train
|- TextSeg
|  |- train_images
|  |- train_gt
|  |- val_images
|  |- val_gt
|  |- test_images
|  └  test_gt
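
Before training, the layout above can be checked with a short script. This is a sketch only: the `EXPECTED` list is transcribed from the tree above, while the helper name `missing_entries` and the choice of the current directory as dataset root are assumptions, not something this repo defines.

```python
from pathlib import Path
from typing import List

# Paths relative to the dataset root, transcribed from the tree above.
EXPECTED = [
    "HierText/train",
    "HierText/train_gt",
    "HierText/validation",
    "HierText/validation_gt",
    "HierText/test",
    "HierText/test_gt",
    "HierText/train_shrink_vert.json",
    "TotalText/groundtruth_pixel/Test",
    "TotalText/groundtruth_pixel/Train",
    "TotalText/Images/Test",
    "TotalText/Images/Train",
    "TextSeg/train_images",
    "TextSeg/train_gt",
    "TextSeg/val_images",
    "TextSeg/val_gt",
    "TextSeg/test_images",
    "TextSeg/test_gt",
]


def missing_entries(root: str = ".") -> List[str]:
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Dataset layout looks complete.")
```

An empty result only confirms that the folders and the json file exist; it does not verify their contents.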
