English | 中文
DBNet: Real-time Scene Text Detection with Differentiable Binarization
DBNet++: Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion
DBNet is a segmentation-based scene text detection method. Segmentation-based methods are gaining popularity for scene text detection because they can more accurately describe scene text of various shapes, such as curved text. The drawback of current segmentation-based SOTA methods is the binarization post-processing (converting probability maps into text bounding boxes), which often requires a manually set threshold (reducing prediction accuracy) and complex algorithms for grouping pixels (resulting in a considerable time cost during inference). To eliminate this problem, DBNet integrates an adaptive threshold called Differentiable Binarization (DB) into the architecture. DB simplifies post-processing and enhances the performance of text detection. Moreover, it can be removed at the inference stage without sacrificing performance. [1]
Figure 1. Overall DBNet architecture
The overall architecture of DBNet is presented in Figure 1. It consists of multiple stages:
- Feature extraction from a backbone at different scales. ResNet-50 is used as a backbone, and features are extracted from stages 2, 3, 4, and 5.
- The extracted features are upscaled and summed up with the previous stage features in a cascade fashion.
- The resulting features are upscaled once again to match the size of the largest feature map (from stage 2) and concatenated along the channel axis.
- Then, the final feature map (shown in dark blue) is used to predict both the probability and threshold maps by applying a 3×3 convolutional operator and two de-convolutional operators with stride 2.
- The probability and threshold maps are merged into one approximate binary map by the Differentiable Binarization module. The approximate binary map is then used to generate text bounding boxes.
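For intuition, the DB module combines the probability map P and the threshold map T into the approximate binary map B̂ = 1 / (1 + e^(-k(P - T))), where k is the amplifying factor (50 by default). Below is a minimal NumPy sketch of this formula; it is an illustration only, not the MindOCR implementation, and the map shapes are assumptions:

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binary map: B = 1 / (1 + exp(-k * (P - T))).

    A smooth, differentiable surrogate for hard thresholding, so the
    binarization step can be trained end-to-end. k is the amplifying factor.
    """
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# Toy usage with random probability/threshold maps of shape (H, W)
prob = np.random.rand(736, 1280).astype(np.float32)
thresh = np.random.rand(736, 1280).astype(np.float32)
approx_binary = differentiable_binarization(prob, thresh)
```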
DBNet++ is an extension of DBNet and thus replicates its architecture. The only difference is that, instead of concatenating the extracted and scaled features from the backbone as DBNet does, DBNet++ fuses those features adaptively with the Adaptive Scale Fusion (ASF) module (Figure 2). ASF improves the scale robustness of the network by fusing features of different scales adaptively. As a result, DBNet++'s ability to detect text instances of diverse scales is distinctly strengthened. [2]
Figure 2. Overall DBNet++ architecture
Figure 3. Detailed architecture of the Adaptive Scale Fusion module
ASF consists of two attention modules, stage-wise attention and spatial attention, with the latter integrated into the former as shown in Figure 3. The stage-wise attention module learns the weights of the feature maps of different scales, while the spatial attention module learns attention across the spatial dimensions. The combination of these two modules leads to scale-robust feature fusion. DBNet++ performs better in detecting text instances of diverse scales, especially large-scale text instances where DBNet may generate inaccurate or discrete bounding boxes.
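For intuition only, the sketch below shows one way spatial and stage-wise attention can be combined to fuse multi-scale features. It is a simplified NumPy stand-in, not the MindOCR ASF module: the learned convolutions of the real module are replaced by plain averages, and the feature maps are assumed to already share one resolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_scales(features):
    """Toy scale-adaptive fusion over N feature maps of identical shape (C, H, W).

    1. A spatial attention map highlights informative locations.
    2. Stage-wise weights score each scale (softmax over the N stages).
    3. The scales are fused as a weighted sum.
    """
    x = np.stack(features)                                   # (N, C, H, W)
    spatial = sigmoid(x.mean(axis=(0, 1), keepdims=True))    # (1, 1, H, W)
    stage_scores = (x * spatial).mean(axis=(1, 2, 3))        # one score per scale, (N,)
    stage_weights = np.exp(stage_scores) / np.exp(stage_scores).sum()
    return np.tensordot(stage_weights, x, axes=1)            # (C, H, W)

# Toy usage: four stages already upscaled to the same spatial size
maps = [np.random.rand(64, 160, 160).astype(np.float32) for _ in range(4)]
fused = fuse_scales(maps)
```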
mindspore | ascend driver | firmware | cann toolkit/kernel |
---|---|---|---|
2.3.1 | 24.1.RC2 | 7.3.0.1.231 | 8.0.RC2.beta1 |
Please refer to the installation instructions in MindOCR.
Please download the ICDAR2015 dataset, and convert the labels to the desired format referring to dataset_converters.
The prepared dataset file structure should be:
.
├── test
│ ├── images
│ │ ├── img_1.jpg
│ │ ├── img_2.jpg
│ │ └── ...
│ └── test_det_gt.txt
└── train
├── images
│ ├── img_1.jpg
│ ├── img_2.jpg
│   │   └── ...
└── train_det_gt.txt
Please download the MSRA-TD500 dataset, and convert the labels to the desired format referring to dataset_converters.
The prepared dataset file structure should be:
MSRA-TD500
├── test
│ ├── IMG_0059.gt
│ ├── IMG_0059.JPG
│ ├── IMG_0080.gt
│ ├── IMG_0080.JPG
│ ├── ...
│   ├── test_det_gt.txt
├── train
│ ├── IMG_0030.gt
│ ├── IMG_0030.JPG
│ ├── IMG_0063.gt
│ ├── IMG_0063.JPG
│ ├── ...
│   ├── train_det_gt.txt
Please download the SCUT-CTW1500 dataset, and convert the labels to the desired format referring to dataset_converters.
The prepared dataset file structure should be:
ctw1500
├── test_images
│ ├── 1001.jpg
│ ├── 1002.jpg
│ ├── ...
├── train_images
│ ├── 0001.jpg
│ ├── 0002.jpg
│ ├── ...
├── test_det_gt.txt
├── train_det_gt.txt
Please download the Total-Text dataset, and convert the labels to the desired format referring to dataset_converters.
The prepared dataset file structure should be:
totaltext
├── Images
│ ├── Train
│ │ ├── img1001.jpg
│ │ ├── img1002.jpg
│ │ ├── ...
│ ├── Test
│ │ ├── img1.jpg
│ │ ├── img2.jpg
│ │ ├── ...
├── test_det_gt.txt
├── train_det_gt.txt
The MLT2017 dataset is a multilingual text detection and recognition dataset that includes nine languages: Chinese, Japanese, Korean, English, French, Arabic, Italian, German, and Hindi. Please download MLT2017 and extract the dataset. Then convert the .gif format images in the data to .jpg or .png format, and convert the labels to the desired format referring to dataset_converters.
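As a hedged example of that .gif conversion step (the use of Pillow and the MLT_2017 directory path are assumptions for illustration, not part of MindOCR):

```python
from pathlib import Path

from PIL import Image  # assumes Pillow is installed

# Convert every .gif under the extracted MLT2017 folder to .png in place
for gif_path in Path("MLT_2017").rglob("*.gif"):
    Image.open(gif_path).convert("RGB").save(gif_path.with_suffix(".png"))
    gif_path.unlink()  # remove the original .gif after conversion
```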
The prepared dataset file structure should be:
MLT_2017
├── train
│ ├── img_1.png
│ ├── img_2.png
│ ├── img_3.jpg
│ ├── img_4.jpg
│ ├── ...
├── validation
│ ├── img_1.jpg
│ ├── img_2.jpg
│ ├── ...
├── train_det_gt.txt
├── validation_det_gt.txt
If users want to use their own dataset for training, please convert the labels to the desired format referring to dataset_converters. Then configure the yaml file, and run train.py on a single device or multiple devices for training. For detailed information, please refer to the following tutorials.
Please download the SynthText dataset and process it as described in dataset_converters.
.
├── SynthText
│ ├── 1
│ │ ├── img_1.jpg
│ │ ├── img_2.jpg
│ │ └── ...
│ ├── 2
│ │ ├── img_1.jpg
│ │ ├── img_2.jpg
│ │ └── ...
│ ├── ...
│ ├── 200
│ │ ├── img_1.jpg
│ │ ├── img_2.jpg
│ │ └── ...
│ └── gt.mat
⚠️ Additionally, it is strongly recommended to pre-process the SynthText dataset before using it, as it contains some faulty data:

python tools/dataset_converters/convert.py --dataset_name=synthtext --task=det --label_dir=/path-to-data-dir/SynthText/gt.mat --output_path=/path-to-data-dir/SynthText/gt_processed.mat

This operation will generate a filtered output in the same format as the original SynthText dataset.
Update the configs/det/dbnet/db_r50_icdar15.yaml configuration file with the data paths, specifically the following parts. The dataset_root will be concatenated with data_dir and label_file respectively to form the complete dataset directory path and label file path.
...
train:
ckpt_save_dir: './tmp_det'
dataset_sink_mode: False
dataset:
type: DetDataset
dataset_root: dir/to/dataset <--- Update
data_dir: train/images <--- Update
label_file: train/train_det_gt.txt <--- Update
...
eval:
dataset_sink_mode: False
dataset:
type: DetDataset
dataset_root: dir/to/dataset <--- Update
data_dir: test/images <--- Update
label_file: test/test_det_gt.txt <--- Update
...
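For clarity, here is a small sketch of how these fields are expected to combine into full paths (assuming plain path joining, as described above; the loader's exact behavior may differ):

```python
import os

dataset_root = "dir/to/dataset"          # dataset_root from the yaml
data_dir = "train/images"                # data_dir from the yaml
label_file = "train/train_det_gt.txt"    # label_file from the yaml

image_dir = os.path.join(dataset_root, data_dir)     # -> dir/to/dataset/train/images
label_path = os.path.join(dataset_root, label_file)  # -> dir/to/dataset/train/train_det_gt.txt
```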
Optionally, change num_workers according to the number of CPU cores.
DBNet consists of 3 parts: backbone, neck, and head. Specifically:
model:
type: det
transform: null
backbone:
name: det_resnet50 # Only ResNet50 is supported at the moment
pretrained: True # Whether to use weights pretrained on ImageNet
neck:
name: DBFPN # FPN part of the DBNet
out_channels: 256
bias: False
use_asf: False # Adaptive Scale Fusion module from DBNet++ (use it for DBNet++ only)
head:
name: DBHead
k: 50 # amplifying factor for Differentiable Binarization
bias: False
adaptive: True # True for training, False for inference
- Standalone training
Please set distribute in the yaml config file to False.
python tools/train.py -c=configs/det/dbnet/db_r50_icdar15.yaml
- Distributed training
Please set distribute in the yaml config file to True.
# n is the number of NPUs
mpirun --allow-run-as-root -n 2 python tools/train.py --config configs/det/dbnet/db_r50_icdar15.yaml
The training result (including checkpoints, per-epoch performance, and curves) will be saved in the directory specified by the arg ckpt_save_dir in the yaml config file. The default directory is ./tmp_det.
To evaluate the accuracy of the trained model, you can use eval.py. Please set the checkpoint path in the arg ckpt_load_path in the eval section of the yaml config file, set distribute to False, and then run:
python tools/eval.py -c=configs/det/dbnet/db_r50_icdar15.yaml
Here we present general-purpose models that were trained on a wide variety of tasks (real-world photos, street views, documents, etc.) and challenges (straight texts, curved texts, long text lines, etc.) with two primary languages: Chinese and English. These models can be used right off-the-shelf in your applications or for initialization of your models.
The models were trained on 12 public datasets (CTW, LSVT, RCTW-17, TextOCR, etc.) that contain a wide range of images. The training set has 153,511 images and the validation set has 9,786 images.
The test set consists of 598 images manually selected from the above-mentioned datasets.
Experiments are tested on Ascend 910* with MindSpore 2.3.1 in graph mode.
coming soon
Experiments are tested on Ascend 910 with MindSpore 2.3.1 in graph mode.
coming soon
DBNet and DBNet++ were trained on the ICDAR2015, MSRA-TD500, SCUT-CTW1500, Total-Text, and MLT2017 datasets. In addition, we conducted pre-training on the SynthText dataset and provided a URL to download pretrained weights. All training results are as follows:
Experiments are tested on Ascend 910* with MindSpore 2.3.1 in graph mode.
model name | backbone | pretrained | cards | batch size | jit level | graph compile | ms/step | img/s | recall | precision | f-score | recipe | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DBNet | MobileNetV3 | ImageNet | 1 | 10 | O2 | 403.87 s | 65.69 | 152.23 | 74.68% | 79.38% | 76.95% | yaml | ckpt |
DBNet | MobileNetV3 | ImageNet | 8 | 8 | O2 | 405.35 s | 54.46 | 1175.12 | 76.27% | 76.06% | 76.17% | yaml | ckpt |
DBNet | ResNet-50 | ImageNet | 1 | 10 | O2 | 147.81 s | 155.62 | 64.25 | 84.50% | 85.36% | 84.93% | yaml | ckpt |
DBNet | ResNet-50 | ImageNet | 8 | 10 | O2 | 151.23 s | 159.22 | 502.4 | 81.15% | 87.63% | 84.26% | yaml | ckpt |
The input_shape for the exported DBNet MindIR and DBNet++ MindIR in the links are (1,3,736,1280) and (1,3,1152,2048), respectively.
Experiments are tested on Ascend 910 with MindSpore 2.3.1 in graph mode.
model name | backbone | pretrained | cards | batch size | jit level | graph compile | ms/step | img/s | recall | precision | f-score | recipe | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DBNet | MobileNetV3 | ImageNet | 1 | 10 | O2 | 321.15 s | 100 | 100 | 76.31% | 78.27% | 77.28% | yaml | ckpt | mindir |
DBNet | MobileNetV3 | ImageNet | 8 | 8 | O2 | 309.39 s | 66.64 | 960 | 76.22% | 77.98% | 77.09% | yaml | Coming soon |
DBNet | ResNet-18 | ImageNet | 1 | 20 | O2 | 75.23 s | 185.19 | 108 | 80.12% | 83.41% | 81.73% | yaml | ckpt | mindir |
DBNet | ResNet-50 | ImageNet | 1 | 10 | O2 | 110.54 s | 132.98 | 75.2 | 83.53% | 86.62% | 85.05% | yaml | ckpt | mindir |
DBNet | ResNet-50 | ImageNet | 8 | 10 | O2 | 107.91 s | 183.92 | 435 | 82.62% | 88.54% | 85.48% | yaml | Coming soon |
DBNet++ | ResNet-50 | SynthText | 1 | 32 | O2 | 184.74 s | 409.21 | 78.2 | 86.81% | 86.85% | 86.86% | yaml | ckpt | mindir |
The input_shape for the exported DBNet MindIR and DBNet++ MindIR in the links are (1,3,736,1280) and (1,3,1152,2048), respectively.
model name | backbone | pretrained | cards | batch size | jit level | graph compile | ms/step | img/s | recall | precision | f-score | recipe | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DBNet | ResNet-18 | SynthText | 1 | 20 | O2 | 76.18 s | 163.34 | 121.7 | 79.90% | 88.07% | 83.78% | yaml | ckpt |
DBNet | ResNet-50 | SynthText | 1 | 20 | O2 | 108.45 s | 280.90 | 71.2 | 84.02% | 87.48% | 85.71% | yaml | ckpt |
The MSRA-TD500 dataset has 300 training images and 200 testing images. Following the reference paper Real-time Scene Text Detection with Differentiable Binarization, we trained with an extra 400 training images from HUST-TR400. You can download all of the data used for training.
model name | backbone | pretrained | cards | batch size | jit level | graph compile | ms/step | img/s | recall | precision | f-score | recipe | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DBNet | ResNet-18 | SynthText | 1 | 20 | O2 | 73.18 s | 163.80 | 122.1 | 85.68% | 85.33% | 85.50% | yaml | ckpt |
DBNet | ResNet-50 | SynthText | 1 | 20 | O2 | 110.34 s | 180.11 | 71.4 | 87.83% | 84.71% | 86.25% | yaml | ckpt |
model name | backbone | pretrained | cards | batch size | jit level | graph compile | ms/step | img/s | recall | precision | f-score | recipe | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DBNet | ResNet-18 | SynthText | 1 | 20 | O2 | 77.78 s | 206.40 | 96.9 | 83.66% | 87.61% | 85.59% | yaml | ckpt |
DBNet | ResNet-50 | SynthText | 1 | 20 | O2 | 109.15 s | 289.44 | 69.1 | 84.79% | 87.07% | 85.91% | yaml | ckpt |
model name | backbone | pretrained | cards | batch size | jit level | graph compile | ms/step | img/s | recall | precision | f-score | recipe | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DBNet | ResNet-18 | SynthText | 8 | 20 | O2 | 73.76 s | 464.00 | 344.8 | 73.62% | 83.93% | 78.44% | yaml | ckpt |
DBNet | ResNet-50 | SynthText | 8 | 20 | O2 | 105.12 s | 523.60 | 305.6 | 76.04% | 84.51% | 80.05% | yaml | ckpt |
model name | backbone | pretrained | cards | batch size | jit level | graph compile | ms/step | img/s | train loss | recipe | weight |
---|---|---|---|---|---|---|---|---|---|---|---|
DBNet | ResNet-18 | ImageNet | 1 | 16 | O2 | 78.46 s | 131.83 | 121.37 | 2.41 | yaml | ckpt |
DBNet | ResNet-50 | ImageNet | 1 | 16 | O2 | 108.93 s | 195.07 | 82.02 | 2.25 | yaml | ckpt |
- Note that the training time of DBNet is highly affected by data processing and varies across different machines.
[1] Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, Xiang Bai. Real-time Scene Text Detection with Differentiable Binarization. arXiv:1911.08947, 2019
[2] Minghui Liao, Zhisheng Zou, Zhaoyi Wan, Cong Yao, Xiang Bai. Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion. arXiv:2202.10304, 2022