Abstract
Text information in natural scene images serves as important clues for many image-based applications such as scene understanding, content-based image retrieval, assistive navigation, and automatic geocoding. However, locating text in complex backgrounds with multiple colors is a challenging task. In this paper, we explore a new framework to detect text strings with arbitrary orientations in complex natural scene images. Our proposed framework of text string detection consists of two steps: 1) image partition to find text character candidates based on local gradient features and color uniformity of character components; 2) character candidate grouping to detect text strings based on joint structural features of text characters in each text string, such as character size differences, distances between neighboring characters, and character alignment. By assuming that a text string has at least three characters, we propose two algorithms of text string detection: 1) an adjacent character grouping method and 2) a text line grouping method. The adjacent character grouping method calculates the sibling groups of each character candidate as string segments and then merges the intersecting sibling groups into text strings. The text line grouping method performs a Hough transform to fit text lines to the centroids of character candidates. Each fitted text line describes the orientation of a potential text string. The detected text string is represented by a rectangular region covering all characters whose centroids are cascaded in its text line. To improve efficiency and accuracy, our algorithms are carried out at multiple scales. The proposed methods outperform the state-of-the-art results on the public Robust Reading Dataset, which contains text only in horizontal orientation. Furthermore, the effectiveness of our methods in detecting text strings with arbitrary orientations is evaluated on our Oriented Scene Text Dataset, which contains text strings in non-horizontal orientations.
Index Terms: Adjacent character grouping, Character property, Image partition, Text line grouping, Text string detection, Text string structure
I. Introduction
As indicative marks in natural scene images, text information provides brief and significant clues for many image-based applications such as scene understanding, content-based image retrieval, assistive navigation, and automatic geocoding. To extract text information from camera-captured document images (i.e., images in which most of the content is well-organized text on a clean background), many algorithms and commercial optical character recognition (OCR) systems have been developed [2, 32]. Liang et al. [18] used texture flow analysis to perform geometric rectification of planar and curved documents. Burns et al. [3] performed topic-based partition of document images to distinguish text, white space, and figures. Banerjee et al. [1] employed the consistency of text characters in different sections to restore document images from severe degradation, based on a Markov random field model. Lu et al. [20] proposed a word shape coding scheme using three topological features of characters for text recognition in document images. All the above algorithms share the same assumption that the locations of text characters are approximately predictable and that background interference does not resemble text characters.
Different from document images, in which text characters are normalized into regular poses and proper resolutions, natural scene images embed text with arbitrary shapes, sizes, and orientations into complex backgrounds, as shown in Fig. 1. It is impossible to recognize text in natural scene images directly because off-the-shelf OCR software cannot handle complex background interference and arbitrarily oriented text lines. Thus we need to detect image regions containing text strings and their corresponding orientations. This is compatible with the detection and localization procedure described in the surveys of text extraction algorithms [11, 38]. With knowledge of text string orientations, we can normalize them to horizontal. Some algorithms for scene text normalization are introduced in [4, 18, 26], but the algorithms described in this paper focus on text detection.
Fig. 1.
Examples of text in natural scene images
Previous work on text detection can be roughly classified into two categories. The first category focuses on text region initialization and extension by using distinct features of text characters. To extract candidates of text regions, Kasar et al. [12] first assigned a bounding box to the boundary of each candidate character in the edge image and then detected text characters based on a boundary model (i.e., no more than two inner holes in each bounding box of an English character); Tran et al. [35] calculated ridge points at different scales to describe text skeletons at high-resolution levels and text orientations at low-resolution levels; Liu et al. [19] designed a stroke filter to extract stroke-like structures; Sobottka et al. [33] combined a top-down analysis based on color variations in each row and column with a bottom-up analysis based on region growing by color similarity; Hasan et al. [8] and Park et al. [29] designed robust morphological processing; Wolf et al. [37] improved Otsu's method to binarize text regions from the background, followed by a sequence of morphological operations to reduce noise and correct classification errors. To group together text characters and filter out false positives, these algorithms employed similar character-level constraints, such as minimum and maximum size, aspect ratio, contrast between character strokes and background, and the number of inner holes. However, they usually fail to remove the background noise resulting from foliage, window panes, bars, or other background objects that resemble text characters. To reduce background noise, the algorithms in the second category partition images into blocks and then group the blocks that are verified by features of text characters. Shivakumara et al. [31] applied different edge detectors to search for blocks containing the most apparent edges of text characters; Lefevre et al. [17] further designed a fusion strategy to combine detectors of color, texture, contour, and temporal invariance; Weinman et al. [36] used a group of filters to analyze texture features in each block and joint texture distributions between adjacent blocks by using a conditional random field. One limitation of these algorithms is that they used non-content-based image partition, dividing the image spatially into blocks of equal size before grouping is performed. Non-content-based image partition is very likely to break up text characters or text strings into fragments that fail to satisfy the texture constraints. Thus Phan et al. [30] performed a line-by-line scan in edge images to combine rows and columns with a high density of edge pixels into text regions. Gao et al. [7] and Suen et al. [34] performed heuristic grouping and layout analysis to cluster edges of objects with similar color, position, and size into text regions. However, these algorithms are not compatible with slanted text lines. Myers et al. [26] rectified text lines in 3D scene images by using horizontal and vertical features of text strings, but their work does not focus on detecting text lines on complex backgrounds. Recently, Epshtein et al. [6] designed a content-based partition named the stroke width transform to extract text characters with stable stroke widths. In addition, the color uniformity of text characters in natural scene images has been taken into account for content-based partition [4, 13, 14, 24, 25].
However, unexpected background noise might share the same colors as text characters, so texture features of characters are still required. The algorithms in our proposed framework belong to this category of partition and grouping, but our content-based partition employs both gradient features and color features.
Different from the rule-based text detection of the above algorithms, Chen et al. [5] and Ho et al. [9] adopted AdaBoost learning methods, using text features to establish the corresponding classifiers; Pan et al. [28] took edge segments of text characters as features for sparse representation and applied K-SVD based learning for text detection; Jiang et al. [10] took the size and shape of text characters, denoted by connected components (CCs), as features and used cascaded SVM classifiers to detect text in scene images, while Kumar et al. [16] took globally matched wavelets as features to perform text extraction and used a Markov random field (MRF) to refine the extracted text regions.
Although many research efforts have been made to detect text regions in natural scene images, more robust and effective methods are still needed to handle variations of scale, orientation, and cluttered backgrounds.
II. Overview of Our Framework
In this paper, we propose a new framework to extract text strings with multiple sizes and colors, and arbitrary orientations, from scene images with complex and cluttered backgrounds. Fig. 2 depicts the flowchart of our framework. The proposed framework consists of two main steps: a) Image partition to find text character candidates based on gradient features and color uniformity. In this step, we propose two methods to partition scene images into binary maps of non-overlapping connected components: a gradient-based method and a color-based method. Post-processing is then performed to remove connected components that are not text characters, based on size, aspect ratio, and the number of inner holes. b) Character candidate grouping to detect text strings based on joint structural features of text characters in each text string, such as character sizes, distances between two neighboring characters, and character alignment. In this step, we propose two methods for structural analysis of text strings: an adjacent character grouping method and a text line grouping method. The proposed framework is able to effectively detect text strings in arbitrary locations, sizes, orientations, and colors, and under slight variations of illumination or of the shape of the attachment surface. Compared with existing methods that focus on independent analysis of single characters, text string structure is more robust for distinguishing background interference from text information. Experiments demonstrate that our framework outperforms the state of the art on the Robust Reading Dataset and is effective in detecting text strings with arbitrary orientations on our newly collected Oriented Scene Text Dataset.
Fig. 2.
Flowchart of the proposed framework of text string detection.
Overall, the work introduced in this paper offers the following main contributions to robust detection of text strings with variations of scale, color, and orientation from natural scene images with cluttered backgrounds:
Most existing work on text detection from natural scene images focuses on detecting text in horizontal orientation or on independent analysis of single characters. We propose a new framework to robustly detect text strings with variations of orientation and scale from complex natural scene images with cluttered backgrounds by integrating different types of features of text strings.
We formally draw a clear distinction between text characters and text strings through an incremental process consisting of image partition, to extract candidate character components, and connected component grouping, to extract text strings.
We model text character by features of local gradient and stroke structure. Under this model, we develop a gradient-based partition algorithm to compute connected components of candidate characters. It is more robust and achieves better results than directly using morphological processing operators.
We model a text string as a text line obtained by cascading connected component centroids based on the Hough transform. We then extend the set of text features from single character components to text line structure, which is used to detect text strings in arbitrary orientations.
We collect an Oriented Scene Text Dataset (OSTD) with text strings in arbitrary orientations, which is more challenging than the existing datasets for text detection. Text string regions are manually labeled in an XML file. The OSTD dataset contains 89 images of colorful logos, indoor scenes, and street views. The resolutions of most images range from 600×450 to 1280×960. Each image contains two text strings on average. The OSTD dataset will be released to the public on our research website.
The paper is organized as follows: Section I gives the introduction and related works. Section II briefly overviews the proposed framework. Section III describes our proposed algorithms of image partition for extracting text character candidates. Section IV introduces two grouping methods to extract text strings. Section V presents the system implementation in multiple scales. Experiments and result analysis are described in Section VI. We conclude the paper and future research directions in Section VII.
III. Image Partition
To extract text information from complex background, image partition is first performed to group together pixels that belong to the same text character, obtaining a binary map of candidate character components. Based on local gradient features and uniform colors of text characters, we design a gradient-based partition algorithm and a color-based partition algorithm respectively.
A. Gradient-Based Partition by Connecting Path of Pixel Couple
Although text characters and strings vary in font, size, color, and orientation, their strokes, which serve as basic elements, keep closed boundaries and uniform interior intensities. In [6], each pixel is mapped to the width of the stroke it is located in, and the consistency of stroke width is then used to extract candidate character components. In our method, each pixel is mapped to the connecting path of a pixel couple, defined by two edge pixels p and q on the edge map with approximately equal gradient magnitudes and opposite directions, as shown in Fig. 3(a). Each pixel couple is connected by a path. Then the distribution of gradient magnitudes at pixels on the connecting path is computed to extract candidate character components.
Fig. 3.
(a) Examples of pixel couples; (b) connecting paths of all pixel couples are marked as white foreground while other pixels are marked as black background.
Fig. 3(a) depicts that a character boundary consists of a number of pixel couples. We model a character by the distribution of gradient magnitudes and the stroke size, including width, height, and aspect ratio. The partitioned components are calculated from connecting paths of pixel couples that cross pixels with small gradient magnitudes.
On the gradient map, Gmag(p) and dp (−π < dp ≤ π) are used to represent the gradient magnitude and direction at pixel p, respectively. We take an edge pixel p from the edge map as a starting point and probe for its partner along a path in the gradient direction. If another edge pixel q is reached where the gradient magnitudes satisfy |Gmag(p) − Gmag(q)| < 20 and the directions satisfy |dq − (dp − (dp/|dp|) · π)| < π/6, we obtain a pixel couple and its connecting path from p to q. This algorithm is applied to calculate the connecting paths of all pixel couples. Fig. 3(b) marks all the connecting paths shorter than 30 as white foreground. To perform the gradient-based partition, we employ the gradient magnitude at each pixel on the connecting path and the length l of the connecting path, which describes the size of the connected component to be partitioned. The partition process is divided into two rounds. In the first round, the length range of the connecting path is set to 0 < l ≤ 30 to describe stroke width. For each pixel couple whose connecting path falls within this length range, we establish an exponential distribution of the gradient magnitudes of the pixels on its connecting path, denoted by (1).
P(Gmag(x)) = λ · exp(−λ · Gmag(x)), x ∈ path(p, q)     (1)
where the decay rate λ is estimated by its maximum-likelihood value λ = |path(p, q)| / Σx∈path(p,q) Gmag(x), i.e., the reciprocal of the mean gradient magnitude on the connecting path. A larger decay rate leads to faster falloff of gradient magnitudes on a connecting path. It means that the connecting path crosses a number of pixels with small gradient magnitudes on the gradient map. This feature is consistent with the intensity uniformity inside character strokes. Thus connecting paths with greater decay rates are marked as white foreground representing candidate character components, as shown in Fig. 4. To extract the complete stroke in rectangular shape, we start the second round to analyze the connecting paths along the stroke height (the larger side). Since the aspect ratio of a rectangular stroke is assumed to be no more than 6:1, we extend the length range of the connecting path to 0 < l ≤ 180. Then we repeat the same analysis of gradient magnitudes for connecting paths that not only fall within this length range but also pass through the regions of candidate character components obtained from the first round. At last, we perform morphological closing and opening as post-processing to refine the extracted connected components, as shown in Fig. 5. The refined connected components are taken as candidate character components.
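The following Python sketch illustrates the first-round probing and decay-rate test described above. It is a simplified illustration rather than the authors' implementation: the Canny settings, the maximum-likelihood decay-rate estimate, the min_decay cutoff, and the simplified opposite-direction test are our assumptions.

```python
import cv2
import numpy as np

def gradient_partition_first_round(gray, mag_tol=20, ang_tol=np.pi / 6,
                                   max_len=30, min_decay=0.1):
    """First-round gradient-based partition: mark pixels on connecting paths
    whose gradient magnitudes decay quickly (low interior magnitudes).
    mag_tol, ang_tol, and max_len follow the thresholds quoted in the text;
    min_decay is an assumed cutoff on the estimated exponential decay rate."""
    edges = cv2.Canny(gray, 100, 200)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                  # gradient direction in (-pi, pi]
    mask = np.zeros_like(edges)

    for y, x in zip(*np.nonzero(edges)):
        dy, dx = np.sin(ang[y, x]), np.cos(ang[y, x])
        path = []
        for step in range(1, max_len + 1):    # probe along the gradient direction
            py, px = int(round(y + step * dy)), int(round(x + step * dx))
            if not (0 <= py < gray.shape[0] and 0 <= px < gray.shape[1]):
                break
            path.append((py, px))
            if edges[py, px]:                 # candidate partner pixel q
                # Roughly opposite directions (a simplified form of the test in
                # the text) and approximately equal gradient magnitudes.
                opposite = abs(abs(ang[y, x] - ang[py, px]) - np.pi)
                if abs(mag[y, x] - mag[py, px]) < mag_tol and opposite < ang_tol:
                    interior = [mag[p] for p in path[:-1]]
                    if interior:
                        # ML estimate of the decay rate: reciprocal of the mean
                        # gradient magnitude along the connecting path (Eq. (1)).
                        rate = 1.0 / (np.mean(interior) + 1e-6)
                        if rate > min_decay:  # stroke interior: small magnitudes
                            for p in path[:-1]:
                                mask[p] = 255
                break
    return mask
```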
Fig. 4.
(a) Two connecting paths of pixel couples marked by blue and red circles respectively. (b) The corresponding exponential distribution of gradient magnitudes on the connecting paths. (c) Partitioned components obtained from the first round.
Fig. 5.
Connecting path of pixel couple along the larger side of rectangle stroke is analyzed in the second round partition. Top row shows pixel couples in purple across the larger side of rectangle strokes. Bottom row presents the partitioned components obtained from the first round and the second round respectively.
The gradient-based partition generates a binary map of candidate character components on a black background. Using the model of local gradient features of character strokes, we can filter out background outliers while preserving the structure of text characters. Fig. 6 demonstrates that the gradient-based partition performs better on character component extraction than morphological processing.
Fig. 6.
Connected components obtained from direct morphological processing on gray images and corresponding binary images. We compare results of four morphological operators with result of our gradient-based partition.
B. Color-Based Partition by Color Reduction
In most scene images, text strings are usually composed of characters with similar colors. Thus we can locate text information by extracting pixels with similar colors. To label a region of connected pixels with similar colors as a connected component, we develop a color-based partition method. Inspired by [27], we perform color reduction by using a color histogram and weighted K-means clustering through the following steps.
First, the Canny edge detector is employed to obtain an edge image. Second, we calculate the color histogram of the original input image. To capture the dominant colors and avoid drastic color variations around edge pixels, only non-edge pixels are sampled for the color histogram calculation, yielding a set of sampled pixels P. Third, after mapping all the pixels from the spatial domain to RGB color space, as shown in Fig. 7(b), weighted K-means clustering is performed to group together the pixels with similar colors. Using an initial mean point pi, which is randomly selected from the sampled pixels, and an initial radius h, a color cluster in RGB color space is established (see Fig. 7(c)), covering any pixel q whose color is close to pi (see (2)). The process of cluster establishment is repeated until all the pixels have been covered by at least one color cluster, as described in (3).
cluster(pi) = {q | K1(pi, q) ≤ h}     (2)
where K1 represents the color distance K1(pi, q) = max{|R(pi) − R(q)|, |G(pi) − G(q)|, |B(pi) − B(q)|}, and R, G, B denote the three color components respectively in the RGB color model.
∪i cluster(pi) ⊇ P     (3)
Fig. 7.
(a) A scene image with multiple colors; (b) color distribution in RGB space; (c) four of the initial cube color clusters with radius h.
Thus the K value is fixed by the number of color clusters. Taking the color histogram as a weight table, a weighted average is calculated iteratively to update the values of the cluster mean points until the change in distance is smaller than a predefined threshold. Fourth, the clusters whose mean points are close enough are merged together to produce final clusters, each of which corresponds to a color layer. The number of color layers depends on the number of dominant hues in the original image and the initial radius h. The larger the cluster radius is, the more pixels will be covered by each color cluster, so the total number of color clusters is reduced, which results in fewer color layers. Experiments are performed to compare the detection rate for different radii h, and the results are presented in Section VI-C.
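A minimal Python sketch of this color-reduction procedure is given below, assuming OpenCV and NumPy. The random seeding, the merge distance, the convergence tolerance, and the per-layer mask helper are our assumptions rather than the paper's exact settings.

```python
import cv2
import numpy as np

def reduce_colors(image_bgr, h=32, merge_dist=20.0, tol=1.0, max_iter=20):
    """Sketch of the color-reduction step: seed cube clusters of radius h on
    non-edge pixel colors, refine the cluster means by histogram-weighted
    averaging, and merge nearby means into color layers."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    samples = image_bgr[edges == 0].reshape(-1, 3).astype(np.float64)

    # The color histogram of the sampled pixels acts as the weight table.
    colors, weights = np.unique(samples, axis=0, return_counts=True)

    # Seed cube clusters until every sampled color is covered (Eqs. (2)-(3)),
    # which also fixes the K value of the weighted K-means step.
    rng = np.random.default_rng(0)
    means, covered = [], np.zeros(len(colors), dtype=bool)
    while not covered.all():
        seed = colors[rng.choice(np.flatnonzero(~covered))]
        covered |= np.all(np.abs(colors - seed) <= h, axis=1)
        means.append(seed.copy())
    means = np.array(means)

    # Weighted K-means: iterate until the largest mean shift is below tol.
    for _ in range(max_iter):
        labels = np.argmin(
            np.linalg.norm(colors[:, None, :] - means[None, :, :], axis=2), axis=1)
        new_means = np.array([
            np.average(colors[labels == k], axis=0, weights=weights[labels == k])
            if np.any(labels == k) else means[k] for k in range(len(means))])
        shift = np.max(np.linalg.norm(new_means - means, axis=1))
        means = new_means
        if shift < tol:
            break

    # Merge means that are close together; each surviving mean is a color layer.
    layers = []
    for m in means:
        if all(np.linalg.norm(m - l) >= merge_dist for l in layers):
            layers.append(m)
    return np.array(layers)

def layer_mask(image_bgr, layer_color, h=32):
    """Binary map of pixels whose color lies within the cube of one layer."""
    diff = np.abs(image_bgr.astype(np.float64) - layer_color)
    return np.where(diff.max(axis=2) <= h, 255, 0).astype(np.uint8)
```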
Some examples of the color-based image partition method are displayed in Fig. 8. Each input image is partitioned into several color layers. A color layer that consists of only one foreground color on a white background is a binary map of candidate character components. Then connected component analysis is performed to label foreground regions of connected pixels.
Fig. 8.
Some examples of color-based partition, where the left column contains original images and other columns contain the corresponding dominant color layers.
IV. Connected Components Grouping
The image partition creates a set of connected components S from an input image, including both text characters and unwanted noise. Observing that text information appears as one or more text strings in most natural scene images, we perform heuristic grouping and structural analysis of text strings to distinguish connected components representing text characters from those representing noise. Assuming that a text string has at least three characters in alignment, we develop two methods to locate regions containing text strings: adjacent character grouping and text line grouping. In both algorithms, a connected component C is described by four metrics: height(.), width(.), centroid(.), and area(.). In addition, we use D(.) to represent the distance between the centroids of two neighboring characters.
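For concreteness, a small Python helper computing these metrics from a binary map is sketched below using OpenCV's connected component statistics; the dictionary layout and the Euclidean centroid distance D are our illustrative choices, not the paper's code.

```python
import cv2
import numpy as np

def component_metrics(binary_map):
    """Compute the four metrics used below for each connected component of a
    binary map; the dictionary layout is a hypothetical helper, not the paper's."""
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_map, connectivity=8)
    comps = []
    for i in range(1, n):                              # label 0 is the background
        comps.append({
            "label":    i,
            "width":    int(stats[i, cv2.CC_STAT_WIDTH]),
            "height":   int(stats[i, cv2.CC_STAT_HEIGHT]),
            "area":     int(stats[i, cv2.CC_STAT_AREA]),   # number of pixels
            "centroid": (float(centroids[i][0]), float(centroids[i][1])),  # (x, y)
        })
    return comps, labels

def D(c1, c2):
    """Euclidean distance between the centroids of two components."""
    (x1, y1), (x2, y2) = c1["centroid"], c2["centroid"]
    return float(np.hypot(x1 - x2, y1 - y2))
```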
A. Adjacent Character Grouping
Text strings in natural scene images usually appear in alignment; namely, each text character in a text string possesses character siblings at adjacent positions. The structural features among sibling characters can be used to determine whether connected components belong to text characters or to unexpected noise. Here, five constraints are defined to decide whether two connected components are siblings of each other.
Considering capital and lowercase characters, the height ratio of the two components should fall between 1/T1 and T1.
Two adjacent characters should not be too far from each other despite the variations of width, so the distance between two connected components should not be greater than T2 times the width of the wider one.
For text strings aligned approximately horizontally, the difference between y-coordinates of the connected component centroids should not be greater than T3 times the height of the higher one.
Two adjacent characters usually appear in the same font size, thus their area ratio should be greater than 1/T4 and less than T4.
If the connected components are obtained from gradient-based partition as described in Section III-A, the color difference between them should be lower than a predefined threshold T5 because the characters in the same string have similar colors.
In our system, we set T1 = T4 = 2, T2 = 3, T3 = 0.5 and T5 = 40. According to the five constraints, a left/right sibling set FL/FR is defined for each connected component C as the set of sibling components located on the left/right of C.
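A compact sketch of this sibling test is shown below (see also Table I). It reuses the hypothetical component dictionaries from the earlier helper; the 'color' field holding a component's mean RGB value is an assumed extension, relevant only when the components come from the gradient-based partition.

```python
def are_siblings(c1, c2, T1=2, T2=3, T3=0.5, T4=2, T5=40):
    """Check the five sibling constraints with the thresholds given in the text."""
    h1, h2 = c1["height"], c2["height"]
    if not (1 / T1 < h1 / h2 < T1):                      # 1) comparable heights
        return False
    dx = abs(c1["centroid"][0] - c2["centroid"][0])
    if dx > T2 * max(c1["width"], c2["width"]):          # 2) horizontally close
        return False
    dy = abs(c1["centroid"][1] - c2["centroid"][1])
    if dy > T3 * max(h1, h2):                            # 3) roughly the same row
        return False
    if not (1 / T4 < c1["area"] / c2["area"] < T4):      # 4) comparable areas
        return False
    if "color" in c1 and "color" in c2:                  # 5) similar mean colors
        # 'color' is an assumed mean-RGB field; the per-channel comparison is ours.
        if any(abs(a - b) >= T5 for a, b in zip(c1["color"], c2["color"])):
            return False
    return True
```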
To extract regions containing text strings based on adjacent character grouping, we first remove the small connected components (area < TS) from the set of connected components S. In our system, we set TS = 20. Then a left-sibling set FL and a right-sibling set FR are initialized for each connected component C to record its sibling connected components on the left and right, respectively. Two connected components C and C′ can be grouped together as sibling components if the above five constraints are satisfied. When C and C′ are grouped together, their sibling sets are updated according to their relative locations. That is, when C is located to the left of C′, C′ is added into the right-sibling set of C, and C is simultaneously added into the left-sibling set of C′. The reverse operation is applied when C is located to the right of C′. When a connected component corresponds to a text character, the five constraints ensure that its sibling set contains sibling characters rather than foliage, panes, or irregular grain.
For a connected component C, if both sibling sets are not empty and their size difference does not exceed 3, a sibling group SG(C) is defined as the union of the two sibling sets and the connected component itself. At this point, each sibling group can be considered as a fragment of a text string. To create sibling groups corresponding to complete text strings, we merge together any two sibling groups SG(C1) and SG(C2) when the intersection SG(C1) ∩ SG(C2) contains no less than 2 connected components. Repeat the merge process until no sibling groups can be merged together. As shown in Fig. 9, the resulting union of connected components is defined as adjacent character group denoted by AG, which is a subset of the set of connected components S. Table I summarizes our algorithm in detail.
Fig. 9.
(a) Sibling group of the connected component ‘r’ where ‘B’ comes from the left sibling set and ‘o’ comes from the right sibling set; (b) Merge the sibling groups into an adjacent character group corresponding to the text string “Brolly?”; (c) Two detected adjacent character groups marked in red and green respectively.
TABLE I.
The procedure of adjacent character grouping
Locating Text Strings by Adjacent Character Groups

S := S − {C | C ∈ S, area(C) < Ts}
for every connected component C ∈ S
    initialize the sibling sets FL and FR
endfor
for two connected components C and C′ with sibling sets FL ∪ FR and FL′ ∪ FR′ respectively
    if 1/T1 < height(C)/height(C′) < T1
       & D(centroid(C).x, centroid(C′).x) ≤ T2 · max{width(C), width(C′)}
       & D(centroid(C).y, centroid(C′).y) ≤ T3 · max{height(C), height(C′)}
       & 1/T4 < area(C)/area(C′) < T4
       & difference of mean RGB color values is less than T5
        if centroid(C).x ≤ centroid(C′).x
            FR := FR ∪ {C′}; FL′ := FL′ ∪ {C}
        else
            FL := FL ∪ {C′}; FR′ := FR′ ∪ {C}
        endif
    endif
endfor
for every connected component C
    if FL ≠ Ø & FR ≠ Ø & ||FR| − |FL|| ≤ 3
        SG(C) := FL ∪ FR ∪ {C}
    endif
endfor
repeat until no merge is performed (the remaining sibling groups are upgraded to adjacent character groups)
    for two sibling groups SG(C1) and SG(C2)
        if |SG(C1) ∩ SG(C2)| ≥ 2
            SG(C1) := SG(C1) ∪ SG(C2); SG(C2) := Ø
        endif
    endfor
Filter out false positives by the three filters based on area, distance, and stroke width respectively.
Calculate the extracted regions based on the adjacent character groups.
Each text string can be mapped into an adjacent character group. However, some adjacent character groups correspond to unexpected false positives instead of real text strings. We design three filters based on the structure of text strings to remove these false positive adjacent character groups. Each filter is described by the coefficient of variation CV, which is defined as the ratio of the standard deviation σ to the mean μ. First, inside an adjacent character group, the coefficient of variation of connected component areas (measured by the number of pixels) should be confined by an upper bound (see (4)). Second, the distances between every two neighboring connected component centroids in the same text string should be relatively stable (see (5)). Third, the coefficient of variation of the stroke width, measured by the horizontal scan lines passing through the connected components of the adjacent character group corresponding to a text string, should be less than a predefined threshold.
σ({area(Ci)}) / μ({area(Ci)}) ≤ Tarea     (4)

σ({D(mi, mi+1)}) / μ({D(mi, mi+1)}) ≤ Tdist     (5)

where mi = centroid(Ci), the Ci are the connected components of the adjacent character group ordered by position, and Tarea and Tdist are predefined upper bounds.
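The sketch below applies the area and distance filters to one adjacent character group; the threshold values and the omission of the stroke-width filter are our simplifications.

```python
import numpy as np

def coefficient_of_variation(values):
    """CV = standard deviation / mean."""
    values = np.asarray(values, dtype=np.float64)
    return values.std() / values.mean()

def passes_string_filters(group, cv_area_max=1.0, cv_dist_max=0.5):
    """Area and distance filters of Eqs. (4)-(5) for one adjacent character
    group ordered left to right; the threshold values are assumptions."""
    if coefficient_of_variation([c["area"] for c in group]) > cv_area_max:
        return False
    centers = np.array([c["centroid"] for c in group])
    dists = np.linalg.norm(np.diff(centers, axis=0), axis=1)  # neighboring centroid distances
    return coefficient_of_variation(dists) <= cv_dist_max
```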
Text string in scene images can be described by corresponding adjacent character groups. To extract a region containing a text string, we calculate the minimum circumscribed rectangle covering all the connected components in the corresponding adjacent character group.
B. Text Line Grouping
In order to locate text strings with arbitrary orientations, we develop a text line grouping method. To group together the connected components that correspond to text characters in the same, possibly non-horizontal, string, we use the centroid as the descriptor of each connected component. Given a set of connected component centroids, groups of collinear character centroids are computed, as shown in (6) and (7).
M = {centroid(C) | C ∈ S}     (6)

L = {l ⊆ M | |l| ≥ 3 and the centroids in l are collinear}     (7)
where M denotes the set of centroids of all the connected components obtained from image partition, and L denotes the set of text lines which are composed of text character centroids in alignment.
A naive solution is to search for satisfying centroid groups in the power set of M, but the complexity of this algorithm would be O(2^|M|), where |M| represents the number of centroids in the set M. We instead design an efficient algorithm to extract regions containing text strings. At first, we remove centroids from the set M if the areas of their corresponding connected components are smaller than the predefined threshold TS. Then three points mi, mj, and mk are randomly selected from the set M to form two line segments. We calculate the length ratio Δd and incline angle difference Δθ between line segments mimj and mjmk, as shown in (8) and (9). The three centroids are approximately collinear if 1/T6 ≤ Δd ≤ T6 and Δθ ≤ T7. In our system, we set T6 = 2; the angle threshold T7 is set to a small predefined value. Thus the three centroids compose a preliminary fitted line lu = {mi, mj, mk}, where u is the index of the fitted line. After finding all the preliminary fitted lines, we apply the Hough transform to describe each fitted line lu by 〈ru, θu〉, resulting in lu = {m | h(ru, θu, m) = 0}, where h(ru, θu, m) = 0 is the equation of the fitted line in Hough space. Thus other collinear centroids along lu can be conveniently added to form the complete fitted line. Table II summarizes our algorithm in detail.
Δd = d(mi, mj) / d(mj, mk)     (8)

Δθ = |θ(mi mj) − θ(mj mk)|     (9)
where d(·) denotes the length of a line segment (d > 0) and θ(·) denotes its incline angle (0 ≤ θ < π).
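A minimal sketch of the preliminary triple test of Eqs. (8)-(9) is given below; the value π/12 chosen for the angle threshold T7 is our assumption, since the paper only states that a predefined threshold is used. The Hough-transform extension that absorbs the remaining collinear centroids (Table II) is omitted for brevity.

```python
import itertools
import numpy as np

def preliminary_fitted_lines(centroids, T6=2.0, T7=np.pi / 12):
    """Collect triples of approximately collinear centroids using the length
    ratio and incline-angle tests of Eqs. (8)-(9)."""
    lines = []
    for i, j, k in itertools.combinations(range(len(centroids)), 3):
        mi, mj, mk = (np.asarray(centroids[t], dtype=float) for t in (i, j, k))
        d1, d2 = np.linalg.norm(mj - mi), np.linalg.norm(mk - mj)
        if d1 == 0 or d2 == 0:
            continue
        delta_d = d1 / d2                                  # Eq. (8): length ratio
        theta1 = np.arctan2(mj[1] - mi[1], mj[0] - mi[0]) % np.pi   # incline in [0, pi)
        theta2 = np.arctan2(mk[1] - mj[1], mk[0] - mj[0]) % np.pi
        delta_theta = abs(theta1 - theta2)                 # Eq. (9)
        delta_theta = min(delta_theta, np.pi - delta_theta)
        if 1 / T6 <= delta_d <= T6 and delta_theta <= T7:
            lines.append({i, j, k})                        # one preliminary fitted line
    return lines
```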
TABLE II.
The procedure of text line grouping
Locating Text Strings by Grouping Centroids of Connected Components into Fitted Text Lines

M := M − {m | m = centroid(C), area(C) < Ts}
for every three points mi, mj, mk ∈ M
    calculate Δd and Δθ
    if 1/T6 ≤ Δd ≤ T6 and Δθ ≤ T7
        lu := {mi, mj, mk}
    endif
endfor
for every preliminary fitted line lu
    〈ru, θu〉 := Hough(lu), where lu = {m | h(ru, θu, m) = 0}
    for every mt ∈ M with mt ∉ lu
        if h(ru, θu, mt) < ε
           & fitted line lu ∪ {mt} meets the two constraints
            lu := lu ∪ {mt}
            〈ru, θu〉 := Hough(lu)
        endif
    endfor
    L := L ∪ {lu}
endfor
Filter out false positive fitted lines in L by the coefficients of variation of areas and distances, and calculate the extracted regions based on the positive fitted lines.
The centroids of noise components can also be aligned on a line. To remove these false positive fitted lines, two constraints based on the structural features of text strings are further used to distinguish fitted lines corresponding to text strings from those generated by unexpected noise. For a text string, the coefficient of variation CV of the areas of the connected components corresponding to the centroids located on the fitted line should be smaller than a predefined threshold, and the distances between every two neighboring centroids should not have a large coefficient of variation. Fig. 10 illustrates the process of fitted line refinement.
Fig. 10.
(a) Centroids of connected components in a color layer shown in Fig. 8; (b) D(mA, mB) approximately equals D(mB, mC) in the text region, while D(mA, mP) is much larger than D(mP, mQ) in the background, where D(.) represents Euclidean distance; (c) Three neighboring connected components in red share similar areas while those in green have very different areas; (d) Resulting fitted lines from centroid cascading. The red line corresponds to a text region while the cyan lines are false positives to be removed.
Now each text string is described by a fitted line. The location and size of the region containing a text string are defined by the connected components whose centroids are cascaded in the corresponding fitted line. The orientation of the text string is denoted by the incline angle θ of the fitted line. To cover these connected components properly, we calculate the minimum circumscribed rectangle as the extracted text region.
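The minimum circumscribed rectangle of the components on a fitted line can be obtained directly from OpenCV's minAreaRect, as in the sketch below; the labels_map / line_labels arguments are hypothetical names for the connected component label image and the labels selected by the line.

```python
import cv2
import numpy as np

def oriented_text_region(labels_map, line_labels):
    """Minimum circumscribed (oriented) rectangle covering the connected
    components whose centroids are cascaded on one fitted line."""
    ys, xs = np.nonzero(np.isin(labels_map, list(line_labels)))
    pts = np.column_stack([xs, ys]).astype(np.float32)
    rect = cv2.minAreaRect(pts)          # ((cx, cy), (w, h), rotation angle)
    corners = cv2.boxPoints(rect)        # the 4 corners of the oriented rectangle
    return rect, corners
```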
V. Multi-scale Computing
To detect text strings of different sizes and to alleviate the computational complexity of adjacent character grouping and text line fitting, the scene images are processed in scale space. Text characters of different sizes are processed at different scales, and connected components whose areas are smaller than TS at the current scale are directly removed by the grouping algorithms. Since the characters in the same text string have similar font sizes, we can also save the computational cost of grouping together two connected components with an excessive area difference. According to Lindeberg's theory of scale invariance [23], we perform scale space analysis by the convolution of the original image I(x, y) with Gaussian kernels G(x, y, σ), where σ is the scale and x, y are coordinates. In scale space, increasing the scale level σ results in image blur, so that connected components are gradually removed. As the scale level increases, smaller connected components disappear before larger ones. We calculate the images Iσ(x, y) in scale space by (10).
Iσ(x, y) = I(x, y) * G(x, y, σ)     (10)
A character searches for partners with similar font sizes scale by scale, from the coarse level to the fine level at which it shows up. Larger characters are processed at coarser scales, while smaller characters are visited at finer scales, as shown in Fig. 11. Once a character is integrated into an adjacent character group or a text line at some scale, it will not be involved in processing at subsequent scale levels. The extracted regions will be scaled up but disabled (red regions in Fig. 11). Thus sibling grouping or line fitting is prevented across the centroids of different scale levels. The multi-scale computing keeps our algorithm computationally tractable for scene images with sizes up to 2000 × 2000.
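A minimal sketch of the scale-space generation of Eq. (10) is shown below; the particular σ values and the coarse-to-fine loop described in the trailing comment are illustrative assumptions, not the paper's settings.

```python
import cv2

def scale_space(image, sigmas=(8.0, 4.0, 2.0, 1.0)):
    """Blurred images I_sigma of Eq. (10), generated coarse to fine."""
    for sigma in sigmas:
        # ksize=(0, 0) lets OpenCV derive the kernel size from sigma.
        yield sigma, cv2.GaussianBlur(image, (0, 0), sigmaX=sigma)

# Coarse-to-fine use: detect strings at a coarse level first, scale the detected
# regions up to full resolution, and mask them out before processing finer levels,
# so that grouping is never performed across centroids from different scales.
```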
Fig. 11.
Multi-scale computing. Regions detected at the current scale level are shown as blue rectangles; red regions containing the characters processed at coarser scale level will be skipped.
VI. Experimental Results
A. Datasets
Two datasets are employed to evaluate the proposed algorithms. The first is the Robust Reading Dataset [40] from ICDAR 2003. This dataset contains 509 images in total, of which 258 are prepared for training and 251 for testing. The image regions containing text strings are labeled in an XML file. Each image contains about four text regions on average. All the text strings in this dataset are horizontal. In our testing, we selected 420 images that are compatible with the assumption that a text string contains at least three characters with relatively uniform color. Furthermore, to verify that text line grouping can detect text strings with arbitrary orientations, we collected 89 scene images with non-horizontal text strings to construct the Oriented Scene Text Dataset (OSTD). The resolutions of most images range from 600×450 to 1280×960. Each image contains two text strings on average. Text string regions are also manually labeled in an XML file. The OSTD dataset contains colorful logos, indoor scenes, and street views.
B. Performance Evaluation
To evaluate the performance, we calculate two metrics, precision p and recall r, as in [21, 22]. Here, precision p is the ratio of the area of the successfully extracted text regions to the area of the whole detected region, and recall r is the ratio of the area of the successfully extracted text regions to the area of the ground truth regions. The area of a region is the number of pixels inside it. Low precision indicates over-detection, while low recall indicates under-detection. To combine p and r, a standard f measure is defined by (11), where α represents the relative weight between the two metrics. In our evaluation, we set α = 0.5.
f = 1 / (α/p + (1 − α)/r)     (11)
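A one-line helper and a hypothetical worked example make the combination explicit.

```python
def f_measure(p, r, alpha=0.5):
    """Combine precision and recall as in Eq. (11)."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

# Worked example with the area-based definitions above (hypothetical numbers):
# 800 of 1000 detected pixels fall inside ground-truth text (p = 0.8), and those
# 800 pixels cover half of the 1600 ground-truth text pixels (r = 0.5), so
# f = 1 / (0.5 / 0.8 + 0.5 / 0.5) ≈ 0.62.
```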
Since two algorithms are designed for each of the partition and grouping steps in our framework, we evaluate their four combinations on the Robust Reading Dataset (RRD): 1) gradient-based partition with adjacent character grouping (GA); 2) color-based partition with adjacent character grouping (CA); 3) gradient-based partition with text line grouping (GT); and 4) color-based partition with text line grouping (CT). Since text line grouping can detect text strings with arbitrary orientations, we furthermore evaluate gradient-based partition with text line grouping (GT) and color-based partition with text line grouping (CT) on our Oriented Scene Text Dataset.
C. Results and Discussions
The experimental results on the Robust Reading dataset are illustrated in Fig. 12, where blue bars denote results of GA, cyan bars denote results of CA, yellow bars denote results of GT, and red bars denote results of CT. The average time of text string detection is presented in the upper boxes.
Fig. 12.
The performance evaluation of the 4 combinations of partition and grouping on the Robust Reading Dataset, where the box presents the average time of text string detection in each scene image.
The combination of color-based partition and adjacent character grouping (CA) achieves the highest precision and recall. In most cases, color uniformity acts as a stronger indicator for distinguishing the connected components of text characters from the surrounding background. However, color-based partition takes more computing time than gradient-based partition. Also, color-based partition requires adjacent character grouping to be performed in each of the color layers. Color-based partition still performs better when adjacent character grouping is replaced by text line grouping. Fig. 12 also illustrates that text line grouping gives lower efficiency and precision than adjacent character grouping for either partition method. Adjacent character grouping is supported by known text orientation, while text line grouping must consider arbitrary text orientations, so its computational cost is higher. Meanwhile, the uncertainty of text orientation produces more false positive fitted lines.
Since the results of color-based partition depend on the initial radius of the color cluster in RGB space, we perform another experiment on the Robust Reading Dataset to select the radius h that achieves the highest f-measure in text detection. Fig. 13 illustrates that the best radius is 32. When the radius becomes smaller, a text character is very likely to break up into different color layers because the partition becomes sensitive to illumination variation. When the radius becomes larger, more background noise is assigned to the same color layer as the text characters. Thus the radius 32 is a tradeoff between these two negative situations.
Fig. 13.
The variation of f-measure obtained from color-based partition and adjacent character grouping according to the radius h defined in the color-based partition.
Compared with the algorithms presented in the ICDAR text locating competitions, our algorithm achieves the highest precision, while its recall and f-measure are comparable with those of the best-performing algorithms, as shown in Table III. As a rule-based algorithm, it cannot apply trained classifiers directly, so it takes more time to perform text string detection than the learning-based algorithms, when the time spent on training is not taken into account.
TABLE III.
The comparison between our algorithm and the text detection algorithms presented in [21, 22] on the Robust Reading Dataset.
Method | Precision | Recall | f-measure
---|---|---|---
Ours | 0.71 | 0.62 | 0.62
H. Becker | 0.62 | 0.67 | 0.62
A. Chen | 0.60 | 0.60 | 0.58
Ashida | 0.55 | 0.46 | 0.50
HWDavid | 0.44 | 0.46 | 0.45
Q. Zhu | 0.33 | 0.40 | 0.33
Wolf | 0.30 | 0.44 | 0.35
J. Kim | 0.22 | 0.28 | 0.22
Todoran | 0.19 | 0.18 | 0.18
N. Ezaki | 0.18 | 0.36 | 0.22
Some example results of text string detection on the Robust Reading Dataset are presented in Fig. 14. Instead of using an axis-aligned rectangle to denote the borders of text regions, we calculate the minimum oriented rectangle E (marked in cyan in Fig. 14) to cover the detected text strings.
Fig. 14.
Some example results of text string detection on the Robust Reading Dataset. The detected regions of text strings are marked in cyan.
The experiment on the Oriented Scene Text Dataset demonstrates that the text line grouping in our framework is able to detect text strings with arbitrary orientations, as shown in Fig. 15. Without prior knowledge of text line orientation, multi-line text can lead to grouping across the characters of different text strings. This over-fitting appears as overlapping rectangular text regions. However, it does not influence the location of detected text strings because we take the union of the text regions from the fitted text lines as the final detected text regions.
Fig. 15.
Example results of text detection on the Oriented Scene Text dataset (OSTD) which contain non-horizontal text strings. The detected regions of text strings are marked in cyan.
In this experiment, color-based partition and text line grouping (CT) are performed to extract text strings with arbitrary orientations. The resulting precision, recall, and f-measure are 0.56, 0.64, and 0.55, respectively. We can see that the performance, especially the precision, is lower than the experimental results on the horizontal text dataset. Based on the definition in Section VI-B, precision p is related to the slant angles of text lines, as shown in (12).
p(Θ) = |l(Φ) ∩ h(Θ)| / |h(Θ)|     (12)
where h(Θ) denotes the set of detected text lines with slant angles Θ, l(Φ) denotes the set of ground truth text lines with slant angles Φ, and |·| denotes the number of text lines in a set. Suppose all the positive text lines have been detected and we only consider the false positives; then h(Θ) ⊇ h(Φ), so |h(Θ)| ≥ |h(Φ)|. Since l(θi) = Ø for every θi ∉ Φ, we have l(Φ) ∩ h(Θ) = l(Φ) ∩ h(Φ). According to (12), we calculate
p(Θ) = |l(Φ) ∩ h(Φ)| / |h(Θ)| ≤ |l(Φ) ∩ h(Φ)| / |h(Φ)| = p(Φ)     (13)
If the orientations Φ of the text strings were known, we could obtain the precision p(Φ) exactly. However, there is no prior knowledge of text string orientations in the process of text line grouping. False positives related to the slant angles of text strings are therefore introduced, lowering the precision.
Some typical false positive text regions are presented in Fig. 16. They originate from the linear alignment of noise components with similar sizes.
Fig. 16.
Some results containing false positive text regions which are marked in red, while the true text regions are marked in cyan color.
Kim et al. [15] developed a framework to detect text information in video. Most of the text strings in video have uniform color and horizontally aligned characters, which are compatible with our algorithms. We are able to extract video text strings by using color-based partition and adjacent character grouping (CA), as shown in Fig. 17.
Fig. 17.
Some results of video text string detection.
Fig. 18 depicts some examples in which our method fails to locate the text information because of very small size, overexposure, characters with non-uniform or faded colors, strings with fewer than three characters, and occlusions caused by other objects such as wire mesh.
Fig. 18.
Examples of images where our method fails.
VII. Conclusions and Future Work
Due to unpredictable text appearances and complex backgrounds, text detection in natural scene images is still an unsolved problem. To locate text regions embedded in those images, we propose a new framework based on image partition and connected components grouping. Structural analysis is performed from text characters to text strings. First, we select candidate text characters from connected components by gradient features and color features. Then character grouping is performed to combine the candidate text characters into text strings that contain at least three characters in alignment. Experiments show that color-based partition performs better than gradient-based partition, but it takes more time because text strings must be detected on each color layer. Text line grouping is able to extract text strings with arbitrary orientations. The combination of color-based partition and adjacent character grouping (CA) gives the best performance, outperforming the algorithms presented in ICDAR.
Our future work will focus on developing learning-based methods for text extraction from complex backgrounds and on text normalization for OCR recognition. We will also attempt to improve efficiency and to port the algorithms into a navigation system for the wayfinding of visually impaired people.
Acknowledgments
The authors thank the anonymous reviewers for their constructive comments and insightful suggestions that improved the quality of this manuscript.
This work was supported in part by NIH 1R21EY020990, NSF grant IIS-0957016, and ARO grant W911NF-09-1-0565.
Biographies
Chucai Yi received his B.S. and M.S. degrees from the Department of Electronic and Information Engineering, Huazhong University of Science and Technology, Wuhan, China, in 2007 and 2009, respectively. Since 2009 he has been a Ph.D. student in Computer Science at the Graduate Center, City University of New York, New York, NY, USA.
His research focuses on text detection and recognition in natural scene images. His research interests include object recognition, image processing, and machine learning.
YingLi Tian (M'99–SM'01) received her B.S. and M.S. from Tianjin University, China, in 1987 and 1990, respectively, and her Ph.D. from the Chinese University of Hong Kong, Hong Kong, in 1996. After holding a faculty position at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing, she joined Carnegie Mellon University in 1998, where she was a postdoctoral fellow at the Robotics Institute. She then worked as a research staff member at the IBM T. J. Watson Research Center from 2001 to 2008. She is currently an associate professor in the Department of Electrical Engineering at the City College of New York. Her current research focuses on a wide range of computer vision problems, from motion detection and analysis to human identification, facial expression analysis, and video surveillance. She is a senior member of the IEEE.
Contributor Information
Chucai Yi, Graduate Center, City University of New York, New York, NY 10016 USA (phone: 212-650-8917; fax: 212-650-8249; cyi@gc.cuny.edu).
YingLi Tian, City College, City University of New York, New York, NY 10031 USA (ytian@ccny.cuny.edu). Prior to joining the City College in September 2008, she was with IBM T.J. Watson Research Center, Yorktown Heights, NY 10598 USA.
References
1. Banerjee J, Namboodiri AM, Jawahar CV. Contextual restoration of severely degraded document images. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 517–524.
2. Breuel TM. The OCRopus open source OCR system. Proceedings of IS&T/SPIE 20th Annual Symposium, 2008.
3. Burns TJ, Corso JJ. Robust unsupervised segmentation of degraded document images with topic models. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1287–1294.
4. Chen X, Yang J, Zhang J, Waibel A. Automatic detection and recognition of signs from natural scenes. IEEE Transactions on Image Processing, vol. 13, no. 1, pp. 87–99, 2004. doi: 10.1109/tip.2003.819223.
5. Chen X, Yuille AL. Detecting and reading text in natural scenes. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2004, pp. 366–373.
6. Epshtein B, Ofek E, Wexler Y. Detecting text in natural scenes with stroke width transform. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
7. Gao J, Yang J. An adaptive algorithm for text detection from natural scenes. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 84–89, 2001.
8. Hasan YMY, Karam LJ. Morphological text extraction from images. IEEE Transactions on Image Processing, vol. 9, no. 11, Nov. 2000. doi: 10.1109/83.877220.
9. Ho WT, Tay YH. On detecting spatially similar and dissimilar objects using AdaBoost. Proceedings of International Symposium on Information Technology, 2008, pp. 899–903.
10. Jiang R, Qi F, Xu L, Wu G. A learning-based method to detect and segment text from scene images. Journal of Zhejiang University, vol. 8, pp. 568–574, Apr. 2007.
11. Jung KC, Kim KI, Jain AK. Text information extraction in images and video: a survey. Pattern Recognition, vol. 5, pp. 977–997, May 2004.
12. Kasar T, Kumar J, Ramakrishnan AG. Font and background color independent text binarization. Second International Workshop on Camera-Based Document Analysis and Recognition, 2007, pp. 3–9.
13. Kim JS, Kim SH, Yang HJ, Son HJ, Kim WP. Text extraction for spam-mail image filtering using a text color estimation technique. Proceedings of the 20th International Conference on Industrial, Engineering, and Other Applications of Applied Intelligent Systems, 2007, pp. 105–114.
14. Kim PK. Automatic text location in complex color images using local color quantization. IEEE TENCON, vol. 1, pp. 629–632, 1999.
15. Kim W, Kim C. A new approach for overlay text detection and extraction from complex video scene. IEEE Transactions on Image Processing, vol. 18, no. 2, pp. 401–411, 2009. doi: 10.1109/TIP.2008.2008225.
16. Kumar S, Gupta R, Khanna N, Chaudhury S, Joshi SD. Text extraction and document image segmentation using matched wavelets and MRF model. IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2117–2128, 2007. doi: 10.1109/tip.2007.900098.
17. Lefevre S, Vincent N. Caption localisation in video sequences by fusion of multiple detectors. Proceedings of the Eighth International Conference on Document Analysis and Recognition, 2005, pp. 106–110.
18. Liang J, DeMenthon D, Doermann D. Geometric rectification of camera-captured document images. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 591–605, 2008. doi: 10.1109/TPAMI.2007.70724.
19. Liu Q, Jung C, Moon Y. Text segmentation based on stroke filter. Proceedings of International Conference on Multimedia, 2006, pp. 129–132.
20. Lu S, Tan CL. Retrieval of machine-printed Latin documents through word shape coding. Pattern Recognition, vol. 41, no. 5, pp. 1799–1809, May 2008.
21. Lucas SM, Panaretos A, Sosa L, Tang A, Wong S, Young R. ICDAR 2003 robust reading competitions. 7th International Conference on Document Analysis and Recognition, 2003.
22. Lucas SM. ICDAR 2005 text locating competition results. Proceedings of the International Conference on Document Analysis and Recognition, vol. 1, pp. 80–84, 2005.
23. Lindeberg T. Scale-space theory: a basic tool for analysing structures at different scales. Journal of Applied Statistics, vol. 21, no. 2, pp. 224–270, 1994.
24. Mancas-Thillou C, Gosselin B. Color text extraction with selective metric-based clustering. Computer Vision and Image Understanding, vol. 107, pp. 97–107, 2007.
25. Mancas-Thillou C, Gosselin B. Spatial and color spaces combination for natural scene text extraction. IEEE International Conference on Image Processing (ICIP), 2006.
26. Myers GK, Bolles RC, Luong QT, Herson JA, Aradhye HB. Rectification and recognition of text in 3-D scenes. International Journal on Document Analysis and Recognition, pp. 147–158, 2004.
27. Nikolaou N, Papamarkos N. Color reduction for complex document images. International Journal of Imaging Systems and Technology, vol. 19, pp. 14–26, 2009.
28. Pan WM, Bui TD, Suen CY. Text detection from natural scene images using topographic maps and sparse representations. International Conference on Image Processing, 2009.
29. Park CJ, Moon KA, Oh WG, Choi HM. An efficient extraction of character string positions using morphological operator. IEEE International Conference on Systems, Man, and Cybernetics, 2000.
30. Phan T, Shivakumara P, Tan CL. A Laplacian method for video text detection. 10th International Conference on Document Analysis and Recognition, 2009, pp. 66–70.
31. Shivakumara P, Huang W, Tan CL. An efficient edge based technique for text detection in video frames. Eighth IAPR Workshop on Document Analysis Systems, 2008.
32. Smith R. An overview of the Tesseract OCR engine. International Conference on Document Analysis and Recognition (ICDAR), 2007.
33. Sobottka K, Kronenberg H, Perroud T, Bunke H. Text extraction from colored book and journal covers. 10th International Conference on Document Analysis and Recognition, 1999.
34. Suen HM, Wang JF. Segmentation of uniform colored text from color graphics background. VISP, no. 6, pp. 317–332, Dec. 1997.
35. Tran H, Lux A, Nguyen HL, Boucher A. A novel approach for text detection in images using structural features. 3rd International Conference on Advances in Pattern Recognition, 2005, pp. 627–635.
36. Weinman J, Hanson A, McCallum A. Sign detection in natural images with conditional random fields. IEEE International Workshop on Machine Learning for Signal Processing, 2004.
37. Wolf C, Jolion JM, Chassaing F. Text localization, enhancement and binarization in multimedia documents. Proceedings of the International Conference on Pattern Recognition, vol. 4, pp. 1037–1040, 2002.
38. Zhang J, Kasturi R. Extraction of text objects in video documents: recent progress. Eighth IAPR Workshop on Document Analysis Systems (DAS), 2008, pp. 5–17.
39. Zhou J, Xiao LB, Dai R, Si S. A robust system for text extraction in video. Proceedings of International Conference on Machine Vision, 2007, pp. 119–124.
40. Robust Reading Dataset. http://algoval.essex.ac.uk/icdar/Datasets.html