Abstract
A decision-in, decision-out fusion architecture can be used to fuse the outputs of multiple classifiers from different diagnostic sources. In this paper, Dempster-Shafer Theory (DST) is used to fuse classification results of breast cancer data from two different sources: gene-expression patterns in peripheral blood cells and Fine-Needle Aspirate Cytology (FNAc) data. Each source is classified individually by a Support Vector Machine (SVM) with linear, polynomial and Radial Basis Function (RBF) kernels. The output beliefs of the classifiers for both data sources are combined to arrive at one final decision. Dynamic uncertainty assessment is based on class differentiation of the breast cancer. Experimental results show that the proposed breast cancer data fusion methodology outperforms single classification models.
Keywords: Data fusion, Dempster-Shafer Theory, Classification, SVM, Breast cancer, Microarray, FNAc
Background
Medical practitioners diagnose on the basis of information collected from different sources, effectively fusing that information to reach a decision. Information fusion refers to the combination of data originating from multiple sources to improve decision tasks such as classification, estimation and prediction. Ultimately, it provides a better understanding of the phenomenon under consideration. In the case of breast cancer, a number of factors, such as heterogeneity in diet, age, race, environmental factors, geographic location, number of pregnancies, and genetic makeup, determine the risk of malignancy. [1,2] The complexity of the disease is further increased by chromosomal rearrangements frequently associated with the pre-malignant disease. The cellular pathways altered by these aberrations have been difficult to evaluate in patients, especially during the early stages of the disease process. [1,2]
Since a number of factors determine the risk of breast cancer, it is not advisable to rely on just one source of information for diagnosis. There are well-established associations between different symptoms of breast cancer. For example, germline BRCA1 or BRCA2 mutations are associated with an increased lifetime risk of developing breast cancer [3], but not all mutation carriers develop breast cancer, and the age of onset remains unpredictable. [4] There is also a well-established association between atypical ductal epithelium identified by histological biopsy, nipple aspiration (NA) or fine needle aspiration (FNA) and an increased risk of future breast cancer. [4] The relative risk of developing invasive breast carcinoma for women found to have atypical ductal hyperplasia on breast biopsy is 4.3 times that of the general population and, when combined with a positive family history, the relative risk of invasive breast cancer rises to 9.7 times that of the general population. [5]
Associations between different symptoms are not the only motivation for fusing information from different sources; the limitations of individual diagnostic methods are another major one. For example, mammographic screening is the most reliable method, but it often fails to detect tumors that are less than 5 mm in size, and dense breast tissue is difficult to interpret. [6] The limitations of FNA can be either technical or related to the nature of the lesion itself. [6]
Medical information fusion has been demonstrated by Azuaje et al. [9], who presented an information fusion technique based on a knowledge discovery model and the case-based reasoning decision framework, using data from the heart disease risk estimation domain. Their fusion techniques combine information at the retrieval-outcome level and data at the discovery-input level. [9] Paquerault et al. proposed a technique based on the fusion of one-view and two-view information to improve the performance of mass detection in mammography for breast cancer. A classifier was trained to differentiate true mass pairs from false pairs. A final fusion stage combined the two-view object pair information with the one-view object scores. [10]
Fine Needle Aspiration (FNA) and microarray analysis are widely recognized as principal diagnostic methods. [1,18] Microarray methodology involves placing a large number of DNA fragments, corresponding to the different genes to be studied, on a glass slide or glass wafer. [3] Microarray analysis determines the level of expression of many genes in a tissue sample simultaneously. Microarray experiments generate large datasets with expression values for thousands of genes [4] but usually no more than a few dozen arrays, which gives rise to the curse of dimensionality. FNA cytology is a technique that involves inserting a fine needle (between 21 and 25 gauge) into a lesion and extracting a small sample of cellular material, which is smeared onto glass slides. The cells are stained and examined morphologically by cytopathologists. [6] The features are computed from a digitized image of a fine needle aspirate of a breast mass. They describe characteristics of the cell nuclei present in the image. [6]
The aim of our work is to study and apply the Dempster-Shafer theory of mathematical belief to fuse breast cancer data obtained from different diagnostic techniques in the management of breast disease. The input data, consisting of feature vectors, are fed into three different classifiers: SVMs with linear, polynomial and RBF kernels. Each classifier provides beliefs for two classes, benign and malignant. These beliefs are then combined using Dempster's combination formula to reach a final diagnosis. The experiments are carried out on two types of breast cancer data. One is Fine Needle Aspirate Cytology (FNAc) data; the other is obtained from gene-expression patterns in peripheral blood cells. The FNAc breast cancer data, collected by physician W.H. Wolberg, University of Wisconsin Hospitals, contain 699 samples, of which 458 are benign and 241 are malignant. [1] The FNAc data set is publicly available from the UCI Machine Learning Repository. The gene-expression data consist of 60 blood samples obtained from 56 different women, of which 24 were malignant and 36 were benign. [6] We used leave-one-out cross-validation. In cross-validation, the available data are divided into k disjoint sets; k models are trained, each on a different combination of k-1 partitions, and tested on the remaining partition. Leave-one-out is the case in which each partition holds a single sample. Cross-validation makes good use of the available data, as each sample is used both as training and as test data. It is therefore especially useful where the amount of available data is insufficient to form the usual training, validation and test partitions required for split-sample training. [12]
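The partitioning scheme described above can be sketched in a few lines. This is only an illustration of the index bookkeeping, not the authors' implementation; the function names `k_fold_splits` and `leave_one_out_splits` are hypothetical, and the sample count of 60 is taken from the gene-expression data set.

```python
from typing import Iterator, List, Tuple


def k_fold_splits(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k disjoint partitions."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so partition sizes differ by at most one sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        # Train on the other k-1 partitions.
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def leave_one_out_splits(n_samples: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Leave-one-out is k-fold with k equal to the number of samples."""
    return k_fold_splits(n_samples, n_samples)


# 60 samples, as in the gene-expression data set.
splits = list(leave_one_out_splits(60))
```

Each of the resulting splits holds out exactly one sample for testing and trains on the remaining 59, so every sample serves as test data once.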
Methodology
To describe the methodology shown in Figure 1, we start with the visualization of the FNAc and microarray data. Each element of the FNA cytology pattern set consists of 9 cytological characteristics. Each of these 9 characteristics of breast Fine Needle Aspirates (FNA) differs between benign and malignant samples.
Figure 1.
FNA-Cytology and gene-expression data fusion methodology using Dempster-Shafer theory of evidence
The nine independent parameters of the FNAc data are: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. [18] Each of these characteristics is assigned a number between 1 and 10, with larger numbers generally indicating a greater likelihood of malignancy. However, no single measurement by itself can be used to determine whether a sample is benign or malignant.
Figure 3 shows that the benign samples have lower parameter values than the malignant samples shown in Figure 2. Simple frequency distribution histograms of all nine parameters, plotted together for each class, graphically illustrate the differences between the two classes, as highlighted in Figures 2 and 3.
Figure 3.
Visualization of FNAc benign data set
Figure 2.
Visualization of FNAc malignant data set
The second data set is from Sharma et al. and consists of microarray gene-expression patterns of 1368 genes in peripheral blood cells of 24 women with malignant breast cancer and 36 women with benign cancer. [6] Out of the 1368 genes, a panel of 37 genes was identified with distinct expression patterns in malignant versus benign samples. We used a data matrix of 60 samples and 37 genes with two classes, benign and malignant.
The relative expression levels of 11 features of the selected genes are presented in Figures 4 and 5. Figure 4 shows the expression levels of these genes for the 36 samples from women with malignant cancer. Figure 5 shows the expression levels of the 11 genes for the 24 samples with benign cancer.
Figure 4.
Visualization of microarray malignant data set
Figure 5.
Visualization of microarray benign data set
Results & Discussion
This section provides the results of the individual classifiers as well as the combination of classifiers using DST for breast cancer data. The performance of the SVM-based classifiers with linear, polynomial and RBF kernels has been evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and accuracy. To define these parameters for some class A: a result is a True Positive (TP) if a sample of class A is predicted as A, and a False Negative (FN) if a sample of class A is predicted as non-A. A result is a False Positive (FP) if a non-A sample is predicted as A, and a True Negative (TN) if a non-A sample is predicted as non-A. The following parameters are used to characterize classifier performance:
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{14}$$
$$\text{Specificity} = \frac{TN}{FP + TN} \tag{15}$$
$$\text{PPV} = \frac{TP}{TP + FP} \tag{16}$$
$$\text{NPV} = \frac{TN}{FN + TN} \tag{17}$$
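The four ratios in Equations 14 to 17 can be computed directly from the TP, FN, FP and TN counts. The sketch below is illustrative only: the function name `binary_metrics` and the example counts are hypothetical, and the accuracy formula (correct predictions over all predictions) is the standard definition, assumed here because the text names accuracy without giving its formula.

```python
from typing import Dict


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Per-class performance measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # Eq. 14
        "specificity": tn / (fp + tn),  # Eq. 15
        "ppv": tp / (tp + fp),          # Eq. 16
        "npv": tn / (fn + tn),          # Eq. 17
        # Assumed standard definition; not stated explicitly in the text.
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical counts, chosen only to exercise the formulas.
example = binary_metrics(tp=8, fn=2, fp=1, tn=9)
```

Swapping the roles of the two classes (treating non-A as the positive class) exchanges sensitivity with specificity and PPV with NPV, which is why the two rows of each performance table mirror each other.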
Sensitivity is the probability that a class A sample is correctly predicted as class A; specificity is the probability that a non-A sample is correctly predicted as non-A; PPV is the probability that a sample predicted as class A actually belongs to class A; and NPV is the probability that a sample predicted as non-A actually does not belong to class A. For each classification method and each class, these parameters are listed in the tables below. Table 1 shows the results of the SVM classifiers on FNA data: the overall accuracy is 90.25%, with a sensitivity of 91.6% for malignant and 88.8% for benign. Table 2 shows the performance on the microarray data: the overall accuracy is 81.94%, with a sensitivity of 80.50% for malignant and 83.30% for benign. Table 3 shows the result of applying DST to fuse the classifiers: the overall accuracy is 94.4%, with a sensitivity of 97.1% for malignant and 94.4% for benign. Table 3 thus shows improved accuracy using information fusion with DST.
Table 1. Performance of the Support Vector Machine Classifier on FNA data.
| Class | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|
| Malignant | 0.916 | 0.888 | 0.891 | 0.923 |
| Benign | 0.888 | 0.916 | 0.923 | 0.891 |
Table 2. Performance of the Support Vector Machine Classifier on gene expression data.
| Class | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|
| Malignant | 0.805 | 0.833 | 0.828 | 0.878 |
| Benign | 0.833 | 0.805 | 0.878 | 0.828 |
Table 3. Performance of the combined result of fusion using DST.
| Class | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|
| Malignant | 0.971 | 0.944 | 0.921 | 0.972 |
| Benign | 0.944 | 0.971 | 0.972 | 0.921 |
The confusion matrices in Table 4 show the classification results for the Malignant (M) and Benign (B) classes. Fusion with DST gives the highest accuracy: 35 malignant samples were correctly identified while 1 was classified as benign. Using FNA data, the SVM classifiers correctly identified 33 malignant samples and incorrectly classified 3 as benign. Table 5 shows that fusing the two single sources of data, gene expression and FNA cytology, using Dempster-Shafer theory gave higher accuracy.
Table 4. Confusion matrices of individual classifiers and the combined result of fusion using Dempster Shafer Theory.
| Actual class | SVM-Micro: M | SVM-Micro: B | SVM-FNA: M | SVM-FNA: B | Combined-Fusion (DST): M | Combined-Fusion (DST): B |
|---|---|---|---|---|---|---|
| M | 29 | 7 | 33 | 3 | 35 | 1 |
| B | 6 | 30 | 4 | 32 | 3 | 33 |
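As a worked check, overall accuracy can be recomputed from the counts in Table 4 as the share of samples on the diagonal of each confusion matrix. The sketch below assumes rows are actual classes and columns are predicted classes; the names `confusion` and `overall_accuracy` are illustrative and not part of the original work.

```python
from typing import Dict, List

# Counts copied from Table 4; row order and column order are both [M, B].
confusion: Dict[str, List[List[int]]] = {
    "SVM-Micro": [[29, 7], [6, 30]],
    "SVM-FNA": [[33, 3], [4, 32]],
    "Combined-Fusion (DST)": [[35, 1], [3, 33]],
}


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Percentage of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


accuracies = {name: overall_accuracy(m) for name, m in confusion.items()}
for name, value in accuracies.items():
    print(f"{name}: {value:.2f}%")
```

With 72 test cases per matrix, the computed percentages come out close to the overall accuracies reported in Table 5, differing only in the last digits.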
Table 5. Accuracy of classifiers for test cases on malignant and benign.
| | SVM-Micro | SVM-FNA | Combined-Fusion (DST) |
|---|---|---|---|
| Overall accuracy (%) | 82.00 | 90.27 | 94.44 |
Overall accuracies of individual and DST classifiers in Table 5 show that fusion by DST has improved the breast cancer prediction as compared to individual classifiers.
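For readers unfamiliar with the fusion step, Dempster's rule of combination for a two-class frame {M, B} can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mass values, the function name `combine` and the variables `m_fna` and `m_micro` are hypothetical, and the way beliefs and uncertainty are derived from the SVM outputs is not reproduced here. Each source assigns mass to "M", to "B", and to the whole frame ("theta"), which represents uncertainty.

```python
from typing import Dict

Mass = Dict[str, float]  # keys: "M", "B", "theta" (the whole frame {M, B})


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule for masses on {M}, {B} and the full frame."""
    # Conflict: mass the two sources assign to contradictory singletons.
    conflict = m1["M"] * m2["B"] + m1["B"] * m2["M"]
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    norm = 1.0 - conflict
    fused = {}
    for cls in ("M", "B"):
        # Intersections that yield {cls}: cls with cls, cls with the frame.
        fused[cls] = (
            m1[cls] * m2[cls]
            + m1[cls] * m2["theta"]
            + m1["theta"] * m2[cls]
        ) / norm
    # Only frame-with-frame leaves the result fully uncertain.
    fused["theta"] = m1["theta"] * m2["theta"] / norm
    return fused


# Hypothetical masses for one sample from the two sources.
m_fna = {"M": 0.7, "B": 0.1, "theta": 0.2}
m_micro = {"M": 0.5, "B": 0.3, "theta": 0.2}
fused = combine(m_fna, m_micro)
decision = max(("M", "B"), key=lambda cls: fused[cls])
```

In this example both sources lean towards malignant, so the fused mass on "M" exceeds either individual mass while the mass left on the full frame shrinks, which is the behaviour that makes the rule attractive for combining partially uncertain classifiers.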
Conclusions
We have looked at the fusion of data from disparate sources for the prediction of breast cancer tumors. We have demonstrated our methodology for fusing data from the FNAc and microarray data sets to achieve a better overall prediction of breast cancer tumors. The paper has presented a method for fusing medical data using multiple classifiers where uncertainty and unequal costs of errors are present. A fusion framework has been presented for the computation of belief functions and uncertainty values from individual classifiers and for data fusion through the Dempster-Shafer theory, in which class differentiation quality is used for the computation of uncertainties. The fusion approach showed the best accuracy for breast tumor classification and remained robust in the presence of fairly different classifier performances. The ability to handle such situations robustly, and to classify samples in the presence of classifier uncertainty, makes this approach attractive for healthcare applications.
Footnotes
Citation: Raza et al., Bioinformation 1(5): 170-175 (2006)
References
- 1. Russo J, Russo IH. Oncology Research. 1999;11:169.
- 2. Ketcham AS, Sindelar WF. Progress in Clinical Cancer. 1975;6:99.
- 3. Antoniou A, et al. Am J Hum Genet. 2003;72:1117. doi: 10.1086/375033.
- 4. Mitchell G, et al. Breast Cancer Research. 2005;7:R1122. doi: 10.1186/bcr1348.
- 5. Page DL, et al. Cancer. 1985;55:2698. doi: 10.1002/1097-0142(19850601)55:11<2698::aid-cncr2820551127>3.0.co;2-a.
- 6. Sharma P, et al. Breast Cancer Research. 2005;7:R634. doi: 10.1186/bcr1203.
- 7. Torhorst J, et al. American Journal of Pathology. 2001;159:2249. doi: 10.1016/S0002-9440(10)63075-1.
- 8. Miki Y, et al. Science. 1994;266:66. doi: 10.1126/science.7545954.
- 9. Francisco A, et al. IEEE Trans On Biomedical Eng. 1999;46:1181.
- 10. Paquerault S, et al. Medical Physics. 2002;29:238. doi: 10.1118/1.1446098.
- 11. Dudoit S, et al. Journal of the American Statistical Association. 2002;77
- 12. Stone M. Journal of the Royal Statistical Society. 1974;36:111.
- 13. Brown PS, et al. Proc Natl Acad Sci. 1997;97:262.
- 14. Furey TS, et al. Bioinformatics. 2000;10:906. doi: 10.1093/bioinformatics/16.10.906.
- 15. Chapelle O, et al. Machine Learning. 2002;46:131.
- 16. Freund Y, et al. Journal of Machine Learning Research. 2003;4:933.
- 17. Freund Y, et al. Journal of Computer and System Sciences. 1997;55:119.
- 18. Wolberg WH, Mangasarian OL. Proc Natl Acad Sci. 1990;87:9193. doi: 10.1073/pnas.87.23.9193.