Author manuscript; available in PMC: 2014 Feb 1.
Published in final edited form as: Pattern Recognit Lett. 2013 Feb 1;34(3):315–321. doi: 10.1016/j.patrec.2012.10.025

A New Distance Measure Based on Generalized Image Normalized Cross-Correlation for Robust Video Tracking and Image Recognition

Arie Nakhmani a,1, Allen Tannenbaum b
PMCID: PMC3596837  NIHMSID: NIHMS422783  PMID: 23503649

Abstract

We propose two novel distance measures for image matching, normalized between 0 and 1 and based on normalized cross-correlation. These distance measures explicitly utilize the fact that for natural images there is a high correlation between spatially close pixels. Image matching is used in various computer vision tasks, and the requirements on the distance measure are application dependent. Image recognition applications require measures that are more robust to shift and rotation. In contrast, registration and tracking applications require better localization and noise tolerance. In this paper, we explore the different advantages of our distance measures, and compare them to other popular measures, including Normalized Cross-Correlation (NCC) and Image Euclidean Distance (IMED). We show which of the proposed measures is more appropriate for tracking, and which is more appropriate for image recognition tasks.

Keywords: Correlation, NCC, Image distance, Template matching

1. Introduction

Many computer vision tasks require a comparison between pairs of images, where the distance measure is used to describe quantitatively how much one image is similar to another. In registration and stereo pair matching, the images are aligned to obtain the highest similarity between them. In visual tracking and robotic navigation, the current video frame is searched for the predefined target templates or features, and the location of best match is associated with the true target's location. The image can be retrieved from a large database by sending the request of matching to a given pencil sketch. Similarly, the content of an image can be extracted, analyzed, and recognized by comparing to the predefined templates. In this paper we will concentrate on the tracking and image recognition applications.

Unfortunately, no single distance measure works well for all tasks. Different tasks demand different measure properties. For example, in image recognition we want to determine whether the extracted object is of a certain type. Therefore, small deformations (translations, rotations, scaling, etc.) should change the distance very little, if at all. In tracking, we want to localize the target, so the distance should be much less tolerant to translation, but robust to the additive and multiplicative noise found in real-world scenes. Even within tracking, the requirements on the distance measure may conflict. If only a binary decision (target located or not) is needed, then we want the distance to be zero at the right location and infinite otherwise; but if the distance is used by an iterative algorithm to adjust the location (from the current measurement), then we want the distance to increase gradually when moving away from the desired location. When searching for an image in a database, we want the distance to rise gradually with increasing deformation. In Subsection 1.1, we describe a few popular distance measures, their properties, and their suitability for tracking and image recognition.

Image matching has a huge literature devoted to it, so we can only briefly sketch the methodologies most relevant to this paper. Some earlier low-level image distance measures (e.g., SAD, SSD, NCC) are described in Brown [1]. A comparison of the properties of 19 similarity measures is given in Aschwanden and Guggenbuhl [2]. A more recent survey of distance and similarity measures is found in Zitova and Flusser [3]. The basic region-based image similarity measures can be divided roughly into two groups: Euclidean distance based and correlation based [2]. In many cases the correlation based measures provide superior performance [4], but they are more computationally demanding. For the popular Zero-mean Normalized Cross-Correlation (ZNCC), methods for fast computation have been developed by Lewis [4] and more recently by Yoo and Han [5]. The algorithm in [5] does not use multiplications in computing ZNCC. To improve the image matching results, Pratt [6] applies low-pass filtering to the compared images before computing the correlation, and Zhong et al. [7] proposed a nonlinear preprocessing procedure to improve the matching.

Another interesting approach by Trujillo and Izquierdo [8] is the median NCC, which uses the median function instead of the mean in ZNCC. The aforementioned correlation similarity measures are not appropriate for matching scaled or rotated targets. Scaling invariant correlation is proposed by Cahn von Seelen and Bajcsy [9]. Rotation and scaling invariant correlation is introduced in Zhao et al. [10].

Unfortunately, many of the aforementioned measures do not consider the spatial relationship between different pixels of each image; thus a larger deformation does not necessarily cause a larger dissimilarity, and images that look dissimilar to a human observer may have a high similarity measure. Moreover, gradual deformation of the image may produce abrupt changes in the distance or similarity measure. Our algorithms are proposed to address these deficiencies.

A very interesting approach to measuring distances between images was proposed by Wang et al. [11]. This distance is called IMage Euclidean Distance (IMED), and takes into account the spatial relationships between the pixels. Sun and Feng [12] proposed an algorithm for improving the computation time of IMED. This distance can be easily embedded into other high- and medium-level matching procedures, and it is robust to small image deformations and perturbations [11]. These properties make IMED appropriate for image recognition and visual tracking [13]. A generalization of IMED for tolerance to affine transforms has also been proposed by Liao et al. [14]. Recently, Li and Lu [15] proposed adaptive IMED, and Sun et al. [16] proposed learning IMED, which can learn the distance from a training set of images.

Inspired by the work by Wang et al. [11], we use an idea similar to correlation instead of Euclidean distance in Section 2. The general idea is that for natural images the correlation of gray-levels between the adjacent pixels is high, thus one should compute the correlation for each pixel by weighted average of correlations with the neighboring pixels. The radial weight should fade with the increasing distance from the current pixel. Moreover, the proposed similarity measure is normalized to the interval [0, 1]. We propose two different normalizations, one is based on the Normalized Cross-Correlation (NCC), and another on Zero-mean NCC (ZNCC) [2]. We show how NCC and ZNCC can be obtained from the generalized cross-correlation formulation.

In Section 3, we test the robustness of the proposed distance measures to small deformations that may occur in tracking. Regarding the tolerance to translation, rotation, scaling and additive noise, we compare our distance measures with Euclidean distance, NCC, ZNCC, and IMED. We conclude on our findings in the Section 4.

1.1. Background

In this subsection, we describe a few popular template matching techniques, and outline their properties. We suppose that both compared images are of the same size.

Let X and Y denote the intensity values of two M × N images. By lexicographic ordering, X and Y can be transformed to the MN × 1 vectors x and y. Using this notation, we define four different image similarity measures.

Euclidean Distance (ED) [2]:

$$d_{ED} = \sqrt{\sum_{i=1}^{MN} (x_i - y_i)^2} \tag{1}$$

This distance is used as a matching criterion: a smaller distance indicates a better match. In general, $d_{ED}$ takes values in the range $[0, \sqrt{MN}\,F]$, where F denotes the maximum gray-level value (F = 255 for regular 8-bit images). Thus, the absolute distance value depends on the image size. As with all pixel-wise distances, which ignore the spatial connections between the pixels, the Euclidean distance may be large for small deformations (e.g., the ED between two edge images shifted by one pixel is large). In addition, ED is sensitive to noise and to a constant change in brightness (e.g., for a small brightness change of 1, Y = X + 1, the distance is $\sqrt{MN}$).

Normalized Cross-Correlation (NCC) [2]:

$$\mathrm{NCC} = \frac{\sum_{i=1}^{MN} x_i y_i}{\sqrt{\sum_{i=1}^{MN} x_i^2 \sum_{i=1}^{MN} y_i^2}} \tag{2}$$

The NCC takes values in the interval [0, 1], where 1 indicates the best match. On the one hand, this measure is more robust than ED in noisy scenes. On the other hand, NCC is not invariant to a constant change in brightness. To provide this invariance, the Zero-mean Normalized Cross-Correlation is used.

Zero-mean Normalized Cross-Correlation (ZNCC) [2]:

$$\mathrm{ZNCC} = \frac{\sum_{i=1}^{MN} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{MN} (x_i - \bar{x})^2 \sum_{i=1}^{MN} (y_i - \bar{y})^2}} \tag{3}$$

where x̄ and ȳ are the mean intensity values.

The range of ZNCC is the interval [–1, 1] (1 for a perfect match, and 0 for “no correlation”). Although ZNCC is invariant to constant brightness changes, it is not defined for constant-intensity images, and it shows close-to-one correlation between approximately white and approximately black images. This measure is also known as the Pearson correlation coefficient, and is widely used in tracking applications. Note that in tracking only positive correlation is of interest, thus max(0, ZNCC) is used as the similarity measure.
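The three pixel-wise measures (1)-(3) can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation; the function names and the small 2 × 3 test image are assumptions made here, and images are passed as flat lists in lexicographic order.

```python
import math

def euclidean_distance(x, y):
    # Eq. (1): square root of the summed squared pixel differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def ncc(x, y):
    # Eq. (2): inner product normalized by the two vector norms
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def zncc(x, y):
    # Eq. (3): NCC of the mean-subtracted images (undefined for constant images)
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return ncc([a - mx for a in x], [b - my for b in y])

# Hypothetical 2 x 3 image and a copy with brightness raised by 1 (Y = X + 1)
x = [10.0, 50.0, 90.0, 130.0, 170.0, 210.0]
y = [a + 1.0 for a in x]

d_ed = euclidean_distance(x, y)   # equals sqrt(MN) = sqrt(6)
s_ncc = ncc(x, y)                 # slightly below 1: NCC is not brightness invariant
s_zncc = zncc(x, y)               # 1: ZNCC removes the constant offset
print(d_ed, s_ncc, s_zncc)
```

The brightness-shift example reproduces the behaviour described above: ED grows with the image size, NCC drops slightly below 1, and ZNCC stays at 1.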

All the above similarity measures and distances were pixel-wise, and ignored the spatial connection between the pixels. In contrast, the following distance measure, proposed by Wang et al. [11], explores this spatial connection, and uses it to improve the tolerance and robustness of Euclidean distance.

IMage Euclidean Distance (IMED) [11]:

$$d_{IMED} = \sqrt{\sum_{i=1}^{MN} \sum_{j=1}^{MN} g_{ij} (x_i - y_i)(x_j - y_j)} \tag{4}$$

where $G = \{g_{ij}\}_{MN \times MN}$ is a symmetric positive definite matrix created by some positive definite function of the distance between the pixels i and j.

Wang et al. [11] argued that for any reasonable image distance this function should be continuous and monotonically decreasing with the increasing distance between the pixels, and proposed the following Gaussian function:

$$g_{ij} = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{\mathrm{dist}(P_i, P_j)^2}{2\sigma^2}\right) \tag{5}$$

where dist(P_i, P_j) denotes the spatial distance between the pixels i and j.

With this distance measure, smaller deformations cause smaller changes in the distance. Note that in the particular case where G is the identity matrix $I_{MN \times MN}$, $d_{IMED} = d_{ED}$; however, such a matrix is not created by a continuous function, so this case is not IMED.

The clear advantage of IMED over the other similarity measures can be shown by comparing the distances and the correlations for two images of vertical white lines spaced one pixel apart, where the second image is shifted one pixel horizontally. These images may look very similar to a human observer. For example, for the following images:

$$\begin{pmatrix} 255 & 0 & 255 & 0 \\ 255 & 0 & 255 & 0 \\ 255 & 0 & 255 & 0 \\ 255 & 0 & 255 & 0 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} 0 & 255 & 0 & 255 \\ 0 & 255 & 0 & 255 \\ 0 & 255 & 0 & 255 \\ 0 & 255 & 0 & 255 \end{pmatrix} \tag{6}$$

The Euclidean distance takes its maximal possible value, ED = 1020, the Normalized Cross-Correlation is zero, and ZNCC = –1. In contrast, IMED = 274, which is more reasonable for these images.
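The numbers quoted for example (6) can be checked with a short script. The sketch below builds the Gaussian weight matrix (5) and evaluates (1)-(4) on the two 4 × 4 images. It assumes σ = 1 (the text does not state σ for this example; 1 is the value used later in the experiments) and pixel coordinates on a unit grid. Under these assumptions IMED comes out at roughly 273, close to the quoted 274.

```python
import math

def gaussian_weights(rows, cols, sigma=1.0):
    # Eq. (5): g_ij from the Euclidean distance between pixel grid positions
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return [[norm * math.exp(-((r1 - r2) ** 2 + (c1 - c2) ** 2) / (2.0 * sigma ** 2))
             for (r2, c2) in coords] for (r1, c1) in coords]

def imed(x, y, G):
    # Eq. (4): sqrt of (x - y)^T G (x - y)
    d = [a - b for a, b in zip(x, y)]
    quad = sum(di * sum(g * dj for g, dj in zip(row, d)) for di, row in zip(d, G))
    return math.sqrt(quad)

rows = cols = 4
# Vertical white lines, one pixel apart, and the same image shifted by one column
x = [255.0 if c % 2 == 0 else 0.0 for r in range(rows) for c in range(cols)]
y = [255.0 - v for v in x]

ed = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
ncc = sum(a * b for a, b in zip(x, y)) / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
mx, my = sum(x) / len(x), sum(y) / len(y)
xc, yc = [a - mx for a in x], [b - my for b in y]
zncc = sum(a * b for a, b in zip(xc, yc)) / math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))
d_imed = imed(x, y, gaussian_weights(rows, cols, sigma=1.0))

print(ed, ncc, zncc, d_imed)
```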

2. Image Normalized Cross-Correlation

We denote the vectorized images as in subsection 1.1 by x and y, with dimensions MN × 1. Let $f: \mathbb{R}^{MN} \to \mathbb{R}^{MN}$ be a continuous function, and $G = \{g_{ij}\}_{MN \times MN}$ be a symmetric positive definite matrix.

The Generalized Cross-Correlation is defined by:

$$\mathrm{GCC} = f(x)^T G f(y) \tag{7}$$

The NCC (2) and ZNCC (3) are particular cases of this general definition. For NCC, $G = I_{MN \times MN}$ and $f(x) = x / \sqrt{\sum_{i=1}^{MN} x_i^2}$. For ZNCC, $G = I_{MN \times MN}$ and $f(x) = (x - \bar{x}) / \sqrt{\sum_{i=1}^{MN} (x_i - \bar{x})^2}$.
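As a worked check of the NCC case (a step spelled out here, not in the original text), substituting G = I and the stated f into (7) recovers (2) directly. The ZNCC case follows the same way with x and y replaced by their mean-subtracted versions.

```latex
\mathrm{GCC}
  = f(x)^T I \, f(y)
  = \left(\frac{x}{\sqrt{\sum_{i=1}^{MN} x_i^2}}\right)^{T}
    \left(\frac{y}{\sqrt{\sum_{i=1}^{MN} y_i^2}}\right)
  = \frac{\sum_{i=1}^{MN} x_i y_i}
         {\sqrt{\sum_{i=1}^{MN} x_i^2 \, \sum_{i=1}^{MN} y_i^2}}
  = \mathrm{NCC}
```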

We propose to investigate two more important particular cases, where G is not the identity matrix, but is chosen similarly to the IMED matrix (5). The idea is to use a locally weighted normalized cross-correlation for each pixel, where the weight decreases with increasing distance from the current pixel. In other words, we use all pairwise pixel correlations in the computation, and thereby incorporate the spatial relations between pixels. Following the discussion in [11], the angle between the metric coefficients gij is smaller than π/2, and it can be adjusted using the parameter σ. Thus, the proposed measures are robust to small perturbations, and their tolerance to deformation can be adjusted.

The new similarity measures are defined as follows.

IMage Normalized Cross-Correlation (IMNCC):

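$$\mathrm{IMNCC} = \frac{\sum_{i=1}^{MN} \sum_{j=1}^{MN} g_{ij} x_i y_j}{\sqrt{\sum_{i=1}^{MN} \sum_{j=1}^{MN} g_{ij} x_i x_j \; \sum_{i=1}^{MN} \sum_{j=1}^{MN} g_{ij} y_i y_j}} \tag{8}$$

where $g_{ij}$ is defined as in IMED (5).

IMage Zero-mean Normalized Cross-Correlation (IMZNCC):

$$\mathrm{IMZNCC} = \frac{\sum_{i=1}^{MN} \sum_{j=1}^{MN} g_{ij} (x_i - \bar{x})(y_j - \bar{y})}{\sqrt{s_1 s_2}} \tag{9}$$

where x̄ and ȳ are the mean intensity values, $s_1 = \sum_{i=1}^{MN} \sum_{j=1}^{MN} g_{ij} (x_i - \bar{x})(x_j - \bar{x})$ and $s_2 = \sum_{i=1}^{MN} \sum_{j=1}^{MN} g_{ij} (y_i - \bar{y})(y_j - \bar{y})$.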

It is interesting to note that for gij ≡ 1 (equal weight for all pairs of pixels) there is no mutual information computed between the images, and the absolute value of both similarity measures is identically equal to 1. This can be easily proven by separating i and j dependent variables in (8) and (9).
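A direct, unoptimized evaluation of (8) and (9) might look as follows. This is a sketch in plain Python with hypothetical helper names (`gaussian_weights`, `bilinear`), written for clarity rather than speed; the 3 × 3 test vectors are arbitrary. It also checks the remark above for IMNCC: with all weights equal to 1 the sums factorize and the result has absolute value 1.

```python
import math

def gaussian_weights(rows, cols, sigma=1.0):
    # Eq. (5) on a unit pixel grid
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return [[norm * math.exp(-((r1 - r2) ** 2 + (c1 - c2) ** 2) / (2.0 * sigma ** 2))
             for (r2, c2) in coords] for (r1, c1) in coords]

def bilinear(u, G, v):
    # u^T G v = sum_i sum_j g_ij u_i v_j
    return sum(ui * sum(g * vj for g, vj in zip(row, v)) for ui, row in zip(u, G))

def imncc(x, y, G):
    # Eq. (8)
    return bilinear(x, G, y) / math.sqrt(bilinear(x, G, x) * bilinear(y, G, y))

def imzncc(x, y, G):
    # Eq. (9): Eq. (8) applied to the mean-subtracted images
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return imncc([a - mx for a in x], [b - my for b in y], G)

# Two arbitrary 3 x 3 images in lexicographic order (illustrative values only)
x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0]
y = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0, 2.0]

G = gaussian_weights(3, 3, sigma=1.0)
all_ones = [[1.0] * 9 for _ in range(9)]
identity = [[1.0 if i == j else 0.0 for j in range(9)] for i in range(9)]

s_gauss = imncc(x, y, G)
s_zero_mean = imzncc(x, y, G)
s_ones = imncc(x, y, all_ones)      # weights g_ij = 1 for all pairs
s_identity = imncc(x, y, identity)  # reduces to the plain NCC of Eq. (2)
print(s_gauss, s_zero_mean, s_ones, s_identity)
```

Building the full MN × MN matrix costs O((MN)²) memory and time, which is the computational load the paper mentions; a practical implementation would likely exploit the separable Gaussian structure instead.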

Lemma 1. |IMNCC| ≤ 1 and |IMZNCC| ≤ 1.

Proof. For IMNCC, we have to prove that

$$\left| \sum_{i=1}^{MN} \sum_{j=1}^{MN} g_{ij} x_i y_j \right|^2 \le \sum_{i=1}^{MN} \sum_{j=1}^{MN} g_{ij} x_i x_j \; \sum_{i=1}^{MN} \sum_{j=1}^{MN} g_{ij} y_i y_j$$

In matrix form it can be written as [17]: $|x^T G y|^2 \le (x^T G x)(y^T G y)$. Note that for a symmetric positive definite matrix G, $x^T G y$ is a general inner product 〈x, y〉. We apply the Cauchy-Schwarz inequality, which states |〈x, y〉|² ≤ 〈x, x〉 〈y, y〉, to this inner product, and conclude the proof for IMNCC. For IMZNCC the same proof steps can be applied to the transformed variables x − x̄ and y − ȳ.

From Lemma 1 it follows that distance measures associated with IMNCC and IMZNCC, normalized between 0 and 1, can be defined by:

$$d_{IMNCC} = 1 - \max(0, \mathrm{IMNCC}) \tag{10}$$

and

$$d_{IMZNCC} = 1 - \max(0, \mathrm{IMZNCC}) \tag{11}$$
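The mapping in (10) and (11) is the same for both measures, so it can be sketched as one small helper. The function name is an assumption made here for illustration; the input is any similarity value in [–1, 1], as guaranteed by Lemma 1.

```python
def similarity_to_distance(similarity):
    # Eqs. (10)-(11): negative correlations are clipped to 0, so they map to distance 1
    return 1.0 - max(0.0, similarity)

# A perfect match, a partial match, and an anti-correlated pair
examples = [similarity_to_distance(s) for s in (1.0, 0.25, -0.3)]
print(examples)  # [0.0, 0.75, 1.0]
```

Clipping means that anti-correlated patches are treated the same as uncorrelated ones, which matches the earlier remark that only positive correlation is of interest in tracking.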

2.1. Translation Analysis

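Let the image y be a pure translation t of the image x. Since the metric coefficients $g_{ij}$ are translation invariant ($g_{ij}$ is not a function of absolute pixel locations), $\sum_{ij} g_{ij} x_i x_j = \sum_{ij} g_{ij} y_i y_j = \mathrm{const}$, which does not depend on the translation t. In matrix form, $\mathrm{IMNCC} = (X^T G X)^{-1} X^T G Y$, where $(X^T G X)^{-1} X^T G$ is a constant vector of weights that does not depend on t. Thus, the IMNCC of translated images can be interpreted as a weighted average of the translated image y. For simplicity of computation, let us assume that the translation and the images are continuous, and that an M × N rectangular image is translated vertically from top to bottom. Then IMNCC is proportional to $X^T G Y$, and thus [17]:

$$\begin{aligned} \mathrm{IMNCC} &\propto \int_{t}^{M+t} \int_{0}^{N} \int_{0}^{M} \int_{0}^{N} e^{-\frac{(i-x)^2 + (j-y)^2}{2\sigma^2}} \, dj \, di \, dy \, dx \\ &= \sqrt{2\pi}\, e^{-\frac{N^2}{2\sigma^2}} \sigma^2 \left[ \sigma - e^{\frac{N^2}{2\sigma^2}} \left( \sigma - N \sqrt{\frac{\pi}{2}}\, \mathrm{Erf}\!\left(\frac{N}{\sqrt{2}\sigma}\right) \right) \right] \\ &\quad \times \left[ \sqrt{\frac{2}{\pi}}\, \sigma \left( e^{-\frac{(M-t)^2}{2\sigma^2}} - 2 e^{-\frac{t^2}{2\sigma^2}} + e^{-\frac{(M+t)^2}{2\sigma^2}} \right) + (M-t)\, \mathrm{Erf}\!\left(\frac{M-t}{\sqrt{2}\sigma}\right) - 2t\, \mathrm{Erf}\!\left(\frac{t}{\sqrt{2}\sigma}\right) + (M+t)\, \mathrm{Erf}\!\left(\frac{M+t}{\sqrt{2}\sigma}\right) \right] \end{aligned}$$

For a small translation t < σ, this expression is close to its maximum (good tolerance for small deformations). For σ < t << M, $\mathrm{IMNCC} \propto M - \sqrt{\frac{2}{\pi}}\, \sigma e^{-\frac{t^2}{2\sigma^2}} - t\, \mathrm{Erf}\!\left(\frac{t}{\sqrt{2}\sigma}\right)$, which is an almost linear function of t (dashed line in Figure 1).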
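The t-dependent factor of the closed form and its large-M approximation can be compared numerically. The sketch below uses only `math.erf` and `math.exp` from the Python standard library; the function names and the values M = 100, σ = 1 are illustrative assumptions, not taken from the paper.

```python
import math

def translation_factor(M, t, sigma):
    # Second bracket of the closed-form expression (the only part that depends on t)
    s2 = 2.0 * sigma ** 2
    r2 = math.sqrt(2.0) * sigma
    gauss = (math.exp(-(M - t) ** 2 / s2)
             - 2.0 * math.exp(-t ** 2 / s2)
             + math.exp(-(M + t) ** 2 / s2))
    erfs = ((M - t) * math.erf((M - t) / r2)
            - 2.0 * t * math.erf(t / r2)
            + (M + t) * math.erf((M + t) / r2))
    return math.sqrt(2.0 / math.pi) * sigma * gauss + erfs

def large_m_approximation(M, t, sigma):
    # M - sqrt(2/pi) * sigma * exp(-t^2 / (2 sigma^2)) - t * Erf(t / (sqrt(2) sigma))
    return (M - math.sqrt(2.0 / math.pi) * sigma * math.exp(-t ** 2 / (2.0 * sigma ** 2))
            - t * math.erf(t / (math.sqrt(2.0) * sigma)))

M, sigma = 100.0, 1.0
for t in range(0, 16, 3):
    # The approximation equals half of the exact factor once M is much larger than t and sigma
    print(t, translation_factor(M, t, sigma) / 2.0, large_m_approximation(M, t, sigma))
```

For t well above σ the exponential term vanishes and Erf saturates at 1, so the approximation reduces to roughly M − t, which is the near-linear decay described above.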

Figure 1. Comparison of IMNCC with Euclidean Distance (ED), NCC, and IMED, with regard to translation of 10 random images. The distances are normalized between 0 and 1.

To demonstrate visually the behavior of IMNCC distance with regard to other distances, we have chosen 10 random images of the same size, and shifted them vertically from t = 0 to t = 15. Then we normalized the distances from 0 to 1. The results can be seen in Figure 1.

A similar analysis can be done for IMZNCC, where the mean is subtracted from the images x and y.

2.2. Noise Tolerance Analysis

Suppose that y is a noisy version of x, i.e., y = x + η, where η ~ N(0, Σ) is normally distributed noise. In this case,

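$$\mathrm{IMNCC} = \frac{\sum_{ij} g_{ij} x_i (x_j + \eta_j)}{\sqrt{\sum_{ij} g_{ij} x_i x_j \; \sum_{ij} g_{ij} (x_i + \eta_i)(x_j + \eta_j)}} = \frac{\sum_{ij} g_{ij} x_i x_j + \sum_{ij} g_{ij} x_i \eta_j}{\sqrt{\sum_{ij} g_{ij} x_i x_j \; \sum_{ij} \left( g_{ij} x_i x_j + g_{ij} x_i \eta_j + g_{ij} \eta_i x_j + g_{ij} \eta_i \eta_j \right)}}$$

Since the noise is independent and identically distributed, and the expected correlation between the noise and the image should be zero,

$$\mathrm{IMNCC} \approx \frac{\sum_{ij} g_{ij} x_i x_j}{\sqrt{\sum_{ij} g_{ij} x_i x_j \left( \sum_{ij} g_{ij} x_i x_j + \sum_{i} g_{ii} \eta_i^2 \right)}}$$

For zero noise variance, the expression is identically equal to 1 (best correlation). The term $\sum_i g_{ii} \eta_i^2$ is proportional to the noise variance, so the IMNCC will decrease with increasing noise variance, but it will decrease more slowly than NCC, because the average correlation of random noise with an image region should be lower than the correlation with a single pixel.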
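The claim that IMNCC decays more slowly than NCC under additive noise can be probed with a small Monte Carlo experiment. The sketch below is a rough illustration under stated assumptions: a synthetic 10 × 10 image with uniformly random non-negative intensities stands in for a real image, σ = 1 in (5), the noise is i.i.d. Gaussian, and 20 trials are averaged per noise level. None of these settings come from the paper.

```python
import math
import random

def gaussian_weights(rows, cols, sigma=1.0):
    # Eq. (5) on a unit pixel grid
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return [[norm * math.exp(-((r1 - r2) ** 2 + (c1 - c2) ** 2) / (2.0 * sigma ** 2))
             for (r2, c2) in coords] for (r1, c1) in coords]

def bilinear(u, G, v):
    return sum(ui * sum(g * vj for g, vj in zip(row, v)) for ui, row in zip(u, G))

def ncc(x, y):
    return sum(a * b for a, b in zip(x, y)) / math.sqrt(
        sum(a * a for a in x) * sum(b * b for b in y))

def imncc(x, y, G):
    return bilinear(x, G, y) / math.sqrt(bilinear(x, G, x) * bilinear(y, G, y))

def mean_similarities(x, G, noise_std, trials=20):
    # Average NCC and IMNCC between x and noisy copies y = x + eta
    ncc_sum = imncc_sum = 0.0
    for _ in range(trials):
        y = [v + random.gauss(0.0, noise_std) for v in x]
        ncc_sum += ncc(x, y)
        imncc_sum += imncc(x, y, G)
    return ncc_sum / trials, imncc_sum / trials

random.seed(1)
rows = cols = 10
G = gaussian_weights(rows, cols, sigma=1.0)
x = [random.uniform(0.0, 255.0) for _ in range(rows * cols)]  # synthetic stand-in image

for noise_std in (0.0, 30.0, 60.0, 90.0):
    print(noise_std, mean_similarities(x, G, noise_std))
```

With non-negative intensities every off-diagonal term g_ij x_i x_j adds to the signal part of the denominator, while the noise contributes mainly through the diagonal terms, which is the mechanism behind the slower decay argued above.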

In the following section, we describe representative results for the IMNCC and IMZNCC measures, and compare their tolerance to translation, rotation, scaling, and additive noise with that of the other measures discussed in subsection 1.1.

3. Experiments and Discussion

In all the experiments the weights $g_{ij}$ of the matrix G were chosen as in (5), with σ = 1.

3.1. Tracking

In this subsection, we have chosen a noisy and low contrast video sequence of 10 frames, and manually selected the target in these frames (see Figure 2 top row). Also, we have selected randomly 10 different images of the same size from the video, one for each frame (see Figure 2 bottom row). Then we have computed the pairwise Euclidean Distance (ED), IMED, IMNCC and IMZNCC distance measures between all possible pairs of these 20 images.

Figure 2. (Top row) Manually selected target, walking woman, in 10 consecutive frames [18]. (Bottom row) Ten randomly selected images (same size, but different from the targets) from the same consecutive frames.

The graphical result is shown in Figure 3. The distances are color coded: black is for zero distance, and white is for maximal distance. The color bar under the distances map shows the range of the distances. The left 10 × 10 part of the distances matrix shows the distances between the selected true targets in different frames, thus we want it to be as close to black as possible. Naturally this matrix is symmetric. The right 10 × 10 part shows the distances between the true targets and the random images (not targets), thus we expect this part to be as white as possible.
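The layout of such a distance map can be sketched as a 10 × 20 matrix whose rows are the targets and whose columns are the targets followed by the non-target images. The code below is only an illustration of that layout: the patches are synthetic (noisy copies of one random 5 × 5 patch as "targets", unrelated random patches as "non-targets"), and plain Euclidean distance is used as a placeholder for whichever measure is being evaluated.

```python
import math
import random

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise_distances(row_images, col_images, distance):
    # Entry [i][j] is the distance between row image i and column image j
    return [[distance(a, b) for b in col_images] for a in row_images]

random.seed(0)
n_pixels = 25  # hypothetical 5 x 5 patches, flattened
base = [random.uniform(0.0, 255.0) for _ in range(n_pixels)]
targets = [[v + random.gauss(0.0, 5.0) for v in base] for _ in range(10)]
non_targets = [[random.uniform(0.0, 255.0) for _ in range(n_pixels)] for _ in range(10)]

D = pairwise_distances(targets, targets + non_targets, euclidean_distance)

left = [D[i][j] for i in range(10) for j in range(10)]        # target vs target
right = [D[i][j] for i in range(10) for j in range(10, 20)]   # target vs non-target
print(sum(left) / len(left), sum(right) / len(right))
```

A selective measure gives a small mean for the left block and a large mean for the right block; passing a different `distance` function allows the same comparison for IMED, IMNCC, or IMZNCC.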

Figure 3. Pairwise distances between the images in Figure 2. For the left 10×10 matrix: darker is better. For the right 10×10 matrix: brighter is better.

3.2. Tolerance to translation

In this subsection we have chosen to test the distances on a “camera man” image [18] (see Figure 4). We have taken the 15 × 15 sub-image (inside the rectangle) and shifted it in different directions. For each shift we have measured and compared the shifted image with the original image below it (cropped to the same size) by the distance measures described in subsection 1.1.

Figure 4. Camera man image, with the tested sub-image outlined by the rectangle.

In order to obtain distances of the same order, the ED and IMED were divided by $\sqrt{MN}\,F$ (see subsection 1.1), and NCC and ZNCC were converted to distances similarly to (10) or (11). A representative result for the vertical translation is shown in Figure 5a.
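A translation sweep with this normalization can be sketched as follows. Since the camera man image is not reproduced here, the code uses a smooth synthetic 40 × 40 image as a stand-in, a 9 × 9 sub-image, and shifts from –5 to 5 pixels; these sizes (smaller than the 15 × 15 and ±10 used in the paper) and all helper names are assumptions chosen to keep the pure-Python example fast.

```python
import math

def gaussian_weights(rows, cols, sigma=1.0):
    # Eq. (5) on a unit pixel grid
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return [[norm * math.exp(-((r1 - r2) ** 2 + (c1 - c2) ** 2) / (2.0 * sigma ** 2))
             for (r2, c2) in coords] for (r1, c1) in coords]

def bilinear(u, G, v):
    return sum(ui * sum(g * vj for g, vj in zip(row, v)) for ui, row in zip(u, G))

def crop(image, top, left, size):
    # Flatten a size x size window in lexicographic order
    return [image[r][c] for r in range(top, top + size) for c in range(left, left + size)]

F = 255.0   # maximum gray level
SIZE = 9    # side of the tested sub-image (assumption; the paper uses 15)
image = [[127.5 + 100.0 * math.sin(r / 4.0) * math.cos(c / 5.0) for c in range(40)]
         for r in range(40)]
G = gaussian_weights(SIZE, SIZE, sigma=1.0)
template = crop(image, 15, 15, SIZE)

curves = {}
for shift in range(-5, 6):
    patch = crop(image, 15 + shift, 15, SIZE)  # vertical translation
    ed = math.sqrt(sum((a - b) ** 2 for a, b in zip(template, patch)))
    ed_norm = ed / (math.sqrt(SIZE * SIZE) * F)            # ED divided by sqrt(MN) * F
    s = bilinear(template, G, patch) / math.sqrt(
        bilinear(template, G, template) * bilinear(patch, G, patch))
    curves[shift] = (ed_norm, 1.0 - max(0.0, s))           # Eq. (10)

for shift in sorted(curves):
    print(shift, curves[shift])
```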

Figure 5. (a) Distance measures comparison for vertical translation from –10 to 10 pixels. (b) Distance measures comparison for scaling from 5 × 5 to 25 × 25 image sizes (original image size is 15 × 15). (c) Distance measures comparison for rotation in the [–45°, 45°] interval. (d) Distance measures comparison for normal noise addition up to σ = 90.

3.3. Tolerance to scale

We have resized the 15 × 15 sub-image from 5 × 5 to 25 × 25 sizes and computed the distances between the scaled and the centered (and cropped) original image. The results are shown in Figure 5b.

3.4. Tolerance to rotation

We have rotated the image between –45° and 45° and compared the cropped image with the 15 × 15 sub-image. The results are shown in Figure 5c.

3.5. Tolerance to noise

We have computed the distances between the 15 × 15 sub-image and the noised version of this sub-image. We have used additive Gaussian (Normally distributed) noise with the standard deviations from 0 to 90. The results are shown in the Figure 5d.

3.6. Discussion

We can see a gradual improvement in selectivity in Figure 3 from (a) to (d); therefore IMZNCC shows the best performance in tracking. We repeated this test with several video sequences, with similar results. The proposed distances can be easily incorporated into high-level tracking algorithms, and they have the clear advantage of a normalized [0, 1] range, which is easy to handle. The disadvantage is the higher computational load.

From Figure 5 we see that IMNCC is the most tolerant to small deformations, and is thus appropriate for image recognition tasks. On the other hand, IMZNCC is less tolerant and, like ZNCC, is more appropriate for localization tasks. Both IMNCC and IMZNCC are more robust to noise than ZNCC, and comparable to IMED. The response of IMNCC and IMZNCC to translation is almost linear, so an intuitively reasonable gradual change in the distance may be expected when the target is slightly off-center. Regarding rotations, the results are inconclusive: similar results were obtained with the different distance measures. This should be expected, because when the image is rotated, pixels closer to the center of rotation move less than pixels farther away, so the spatial relations between the pixels differ across image regions. In IMNCC and IMZNCC we assumed “equal rights” for all image pixels; to increase the tolerance to rotation, one needs to emphasize the pixels far from the center by choosing weights gij with a high σ.

In addition, we have tested the same algorithms of ZNCC and IMZNCC with median used instead of the mean, as proposed by Trujillo and Izquierdo [8]. This change seems to improve the tracking results.

4. Conclusion

We have proposed a generalization of the popular Normalized Cross-Correlation similarity measures, which considers the spatial relations between the pixels in images. The IMNCC is tolerant to noise and scaling, and changes almost linearly with translation. The conclusion is that this measure may be appropriate for image recognition applications, where small object deformations should not influence the distance. The IMZNCC has shown high selectivity in the visual tracking task, where falsely detected background image patches should be separated from the true targets. Both measures have a predefined range of values (between 0 and 1), and can be incorporated in high-level algorithms for tracking or image recognition.

Although IMNCC and IMZNCC have shown in our tests superior performance in the matching used for tracking applications and good tolerance to small deformations, they are computationally demanding, and an algorithm for their fast computation should be developed. If more tolerance to rotation is needed, then more complex spatial relations between the pixels can be taken into account by adjusting the weights gij. Also, the influence of the median (instead of the mean) on IMZNCC should be investigated in more depth.

  • Two novel distance measures, normalized between 0 and 1, for image matching

  • Advantage: Robustness of our distance measures to translation, scaling, and noise

  • Comparison to other popular measures

  • The first distance measure, IMNCC, is more appropriate for recognition tasks

  • The second distance measure, IMZNCC, is more appropriate for visual tracking

Acknowledgments

This work was supported in part by grants from AFOSR and ARO. Further, this project was supported by grants from the National Center for Research Resources (P41-RR-013218) and the National Institute of Biomedical Imaging and Bioengineering (P41-EB-015902) of the National Institutes of Health. This work is part of the National Alliance for Medical Image Computing (NAMIC), funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant U54 EB005149. Information on the National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  • 1. Brown L. A survey of image registration techniques. ACM Computing Surveys (CSUR). 1992;24(4):376.
  • 2. Aschwanden P, Guggenbuhl W. Experimental results from a comparative study on correlation-type registration algorithms. In: Robust Computer Vision. Wichmann; 1992. pp. 268–289.
  • 3. Zitova B, Flusser J. Image registration methods: a survey. Image and Vision Computing. 2003;21(11):977–1000.
  • 4. Lewis J. Fast normalized cross-correlation. Vision Interface. 1995;10:120–123.
  • 5. Yoo J-C, Han T. Fast Normalized Cross-Correlation. Circuits, Systems, and Signal Processing. 2009;28(6):819–843.
  • 6. Pratt W. Correlation techniques of image registration. IEEE Transactions on Aerospace and Electronic Systems. 1974;10:353–358.
  • 7. Zhong S, Cao H, Zhang T. Intensity-based correlation for heterogeneous images scene matching. Proceedings of SPIE. 2007;6786:67863B.
  • 8. Trujillo M, Izquierdo E. A robust correlation measure for correspondence estimation. Proceedings of 2nd International Symposium on 3D Data Processing, Visualization and Transmission. 2004:155–162.
  • 9. Cahn von Seelen U, Bajcsy R. Adaptive correlation tracking of targets with changing scale. In: Reconnaissance, Surveillance, and Target Acquisition for the Unmanned Ground Vehicle. Vol. 1. Morgan Kaufmann Publishers; 1997. pp. 313–322.
  • 10. Zhao F, Huang Q, Gao W. Image matching by normalized cross-correlation. IEEE International Conference on Acoustics, Speech and Signal Processing. 2006;2.
  • 11. Wang L, Zhang Y, Feng J. On the Euclidean distance of images. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005;27(8):1334–1339. doi: 10.1109/TPAMI.2005.165.
  • 12. Sun B, Feng J. A Fast Algorithm for Image Euclidean Distance. Chinese Conference on Pattern Recognition. 2008:1–5.
  • 13. Mei X, Zhou S, Wu H. Integrated detection, tracking and recognition for IR video-based vehicle classification. IEEE International Conference on Acoustics, Speech and Signal Processing. 2006;5.
  • 14. Liao M, Wang J-Y, Chen W-F, Tang Y. An affine invariant Euclidean distance between images. International Conference on Machine Learning and Cybernetics. 2006:4133–4137.
  • 15. Li J, Lu B. An adaptive image Euclidean distance. Pattern Recognition. 2009;42(3):349–357.
  • 16. Sun B, Feng J, Wang L. Learning IMED via shift-invariant transformation. IEEE Conference on Computer Vision and Pattern Recognition. 2009:1398–1405.
  • 17. Strang G. Computational Science and Engineering. Wellesley-Cambridge Press; 2007.
  • 18. https://sites.google.com/site/nakhmania/image.
