Frontiers in Genetics. 2022 Apr 19;13:858005. doi: 10.3389/fgene.2022.858005

Analysis and Allocation of Cancer-Related Genes Using Vague DNA Sequence Data

Muhammad Aslam 1,*, Mohammed Albassam 1
PMCID: PMC9061958  PMID: 35518359

Abstract

To test the equality of several independent multinomial distributions, the chi-square test for count data is applied. The existing test can be applied only when complete information about the data is available. For a complex process such as DNA counting, the existing test under classical statistics may be misleading. To overcome this issue, a modification of the chi-square test for the multinomial distribution under neutrosophic statistics is presented in this paper. The modified form of the chi-square test statistic under indeterminacy/uncertainty is presented and applied to DNA count data. From the DNA count data analysis, simulation, and comparative studies, the proposed test is found to be informative, flexible, and adequate compared with the existing tests.

Keywords: multinomial distribution, chi-square test, classical statistics, neutrosophy, DNA data

Introduction

Without statistical analysis, it is not possible to check the significance of the variables under study. Statistical tests of significance are applied in a variety of fields (Ali & Bhaskar, 2016; Greenland et al., 2016). The chi-square test for the multinomial distribution is applied to test whether the allocation of objects to different groups is equally likely. It tests the null hypothesis that the allocation of objects to different groups is equal vs. the alternative hypothesis that the allocation is unequal. The test statistic is computed from the data, and the null hypothesis is accepted if the value of the statistic falls within the acceptance region. Cohen, Kolassa, & Sackrowitz (2006) use the test for equality of multinomial distributions. Chafai & Concordet (2009) study confidence intervals for the multinomial distribution in the case of small samples. Turner, Deng, & Houle (2020) use statistical tests for head and face data. Shin, Yamamoto, Brady, Lee, & Haynes (2019) and Mollan et al. (2019) discuss the applications of statistical tests.

Statistical methods are widely used in analyzing and testing the significance of DNA data, and a rich literature on such methods is available. Goldman (1993a) applies statistical tests using DNA data. Buldyrev et al. (1998) and Kugiumtzis & Provata (2004) analyze DNA data using statistical physics. Yoshida, Kobayashi, Futagami, & Fujikoshi (1999) use statistical analysis for DNA data. Pai, Mathew, & Anindya (2021) work on prediction using DNA data. Yao, Jin, & Lee (2018) improve the statistical analysis for genetic data. Gunasekaran et al. (2021) analyze DNA data using hybrid models. Halla-aho & Lähdesmäki (2021) use statistical analysis for DNA cancer data. More applications of statistical techniques for DNA data can be seen in Goldman (1993b), Keinduangjun, Piamsa-nga, & Poovorawan (2005), Rodriguez et al. (2012), and Pai et al. (2021).

Fuzzy-based statistical tests are applied when the data in hand contain vague or incomplete information. Viertl (2006) mentions that “statistical data are frequently not precise numbers but more or less non-precise also called fuzzy. Measurements of continuous variables are always fuzzy to a certain degree.” Several studies using the fuzzy-based multinomial distribution are available in the literature. Amirzadeh, Mashinchi, & Yaghoobi (2008) study the multinomial distribution using fuzzy logic. Mashuri & Ahsan (2018) work on a fuzzy-based chart using the multinomial distribution. More information on the fuzzy-based multinomial distribution can be seen in Amirzadeh et al. (2008) and Hrafnkelsson, Oddsson, & Unnthorsson (2016).

Smarandache (2013) discusses that neutrosophic logic is more efficient than interval- and fuzzy-based analysis. Neutrosophic statistics are applied to analyze data containing neutrosophic numbers; see Smarandache (2014). Interval statistics capture the data within an interval only and are silent about the measure of indeterminacy. Fuzzy-based analysis, on the other hand, gives information only about the measures of truth and falseness. Neutrosophic statistics reduce to classical statistics when no indeterminate information is found in the data. Chen et al. (2017a,b) introduced methods to deal with neutrosophic data. Later on, Sherwani et al. (2021), Aslam (2021), and Albassam, Khan, & Aslam (2021) introduced statistical tests under neutrosophic statistics.

The chi-square test for the multinomial distribution available in the literature can be applied when full information about the data is given. Complex processes, or processes under uncertainty, do not provide full information about the data or the level of significance. There is therefore a gap in the design of the chi-square test for the multinomial distribution under neutrosophic statistics. In this study, the chi-square test for the multinomial distribution using neutrosophic statistics is introduced for the first time, to the best of the authors’ knowledge. The application of the proposed test is given with the aid of DNA cancer data. It is expected that the proposed test will be more competent than the existing tests in terms of flexibility, deftness, and goodness.

Methods

The existing test for the equality of multinomial distributions can be used only when no vague information is present. To overcome this issue, a modification of the existing test is necessary. In this section, the existing test under classical statistics is modified under neutrosophic statistics, with the expectation that the proposed test for the equality of multinomial distributions performs better for testing the null hypothesis in an uncertain environment. The main objective of the paper is to introduce the test for the equality of $h_N$ independent neutrosophic multinomial distributions. Let $Y_{1jN}, Y_{2jN}, \ldots, Y_{kjN}$ $(j = 1, 2, \ldots, h_N)$ represent the neutrosophic frequencies for the neutrosophic events $A_{1N}, A_{2N}, \ldots, A_{kN}$. Let $p_{ijN} = P(A_{iN})$; $i_N = 1, 2, \ldots, k_N$; $j_N = 1, 2, \ldots, h_N$. The neutrosophic form of $p_{ijN} \in [p_{ijL}, p_{ijU}]$ is expressed as

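$$p_{ijN} = p_{ijL} + p_{ijU} I_{p_{ijN}}; \quad I_{p_{ijN}} \in [I_{p_{ijL}}, I_{p_{ijU}}] \tag{1}$$

where $p_{ijL}$ represents the determinate part, $p_{ijU} I_{p_{ijN}}$ represents the indeterminate part, and $I_{p_{ijN}} \in [I_{p_{ijL}}, I_{p_{ijU}}]$ is the measure of indeterminacy. An alternative expression of Eq. 1 is

$$p_{ijN} = (1 + I_{p_{ijN}}) p_{ij}; \quad I_{p_{ijN}} \in [I_{p_{ijL}}, I_{p_{ijU}}] \tag{2}$$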

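The $j$th experiment is carried out $n_{jN}$ times under the assumption that the $n_{jN}$ instances are independent. The modified form of the test statistic $Q_N \in [Q_L, Q_U]$ is expressed as follows:

$$Q_N = Q_L + Q_U I_{Q_N}; \quad I_{Q_N} \in [I_{Q_L}, I_{Q_U}] \tag{3}$$

where

$$Q_N = \sum_{j=1}^{h_N} \sum_{i=1}^{k_N} \frac{(Y_{ijN} - n_{jN} p_{ijN})^2}{n_{jN} p_{ijN}}$$

The proposed statistic $Q_N \in [Q_L, Q_U]$ can be written as

$$Q_N = \sum_{j=1}^{h_L} \sum_{i=1}^{k_L} \frac{(Y_{ijL} - n_{jL} p_{ijL})^2}{n_{jL} p_{ijL}} + \sum_{j=1}^{h_U} \sum_{i=1}^{k_U} \frac{(Y_{ijU} - n_{jU} p_{ijU})^2}{n_{jU} p_{ijU}} I_{Q_N}; \quad I_{Q_N} \in [I_{Q_L}, I_{Q_U}] \tag{4}$$

The simplified form of the statistic can be written as

$$Q_N = (1 + I_{Q_N}) \sum_{j=1}^{h_N} \sum_{i=1}^{k_N} \frac{(Y_{ijN} - n_{jN} p_{ijN})^2}{n_{jN} p_{ijN}}; \quad I_{Q_N} \in [I_{Q_L}, I_{Q_U}] \tag{5}$$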
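As an illustration of the simplified form in Eq. 5, the following minimal Python sketch computes the determinate Pearson sum for known cell probabilities and then scales it by the bounds of the indeterminacy interval. It assumes that the lower and upper counts coincide, so that a single determinate sum is scaled by $(1 + I_{Q_N})$. The function names `pearson_q` and `neutrosophic_q` and the example counts are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def pearson_q(counts: Sequence[Sequence[float]],
              probs: Sequence[Sequence[float]]) -> float:
    """Determinate part: sum over groups j and categories i of
    (Y_ij - n_j * p_ij)^2 / (n_j * p_ij)."""
    q = 0.0
    for row, p_row in zip(counts, probs):
        n_j = sum(row)  # number of trials in group j
        for y, p in zip(row, p_row):
            expected = n_j * p
            q += (y - expected) ** 2 / expected
    return q


def neutrosophic_q(counts: Sequence[Sequence[float]],
                   probs: Sequence[Sequence[float]],
                   i_lower: float,
                   i_upper: float) -> Tuple[float, float]:
    """Interval [Q_L, Q_U] obtained by scaling the determinate sum
    by (1 + I) at both ends of the indeterminacy interval (Eq. 5)."""
    q = pearson_q(counts, probs)
    return (1 + i_lower) * q, (1 + i_upper) * q


# Hypothetical example: two groups, three equally likely categories.
example_counts: List[List[int]] = [[18, 22, 20], [25, 15, 20]]
example_probs = [[1 / 3] * 3, [1 / 3] * 3]

q_low, q_high = neutrosophic_q(example_counts, example_probs, 0.0, 0.10)
print(f"Q_N lies in [{q_low:.4f}, {q_high:.4f}]")
```

With `i_lower = i_upper = 0`, the interval collapses to the classical Pearson statistic.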

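Note that the proposed test $Q_N \in [Q_L, Q_U]$ is a generalization of the test under classic statistics. The proposed test $Q_N \in [Q_L, Q_U]$ reduces to the classic test under classic statistics when $I_{Q_L} = 0$. The proposed test is also a generalization of the tests under interval statistics and fuzzy-based logic. The proposed test $Q_N \in [Q_L, Q_U]$ follows the neutrosophic chi-square distribution with $h_N(k_N - 1)$ degrees of freedom. The proposed test $Q_N \in [Q_L, Q_U]$ is applied to test the following null hypothesis:

$$H_{0N}: p_{i1} = p_{i2} = \cdots = p_{ih_N} = p_{iN}, \quad i = 1, 2, 3, \ldots, k_N \tag{6}$$

Under the null hypothesis, we estimate $k_N - 1$ probabilities from

$$\hat{p}_{iN} = \frac{\sum_{j=1}^{h_L} Y_{ijL}}{\sum_{j=1}^{h_L} n_{jL}} + \frac{\sum_{j=1}^{h_U} Y_{ijU}}{\sum_{j=1}^{h_U} n_{jU}} I_{\hat{p}_{iN}}; \quad I_{\hat{p}_{iN}} \in [I_{\hat{p}_{iL}}, I_{\hat{p}_{iU}}] \tag{7}$$

The statistic $Q_N \in [Q_L, Q_U]$ based on $\hat{p}_{iN} \in [\hat{p}_{iL}, \hat{p}_{iU}]$ is expressed as

$$Q_N = \sum_{j=1}^{h_L} \sum_{i=1}^{k_L} \frac{(Y_{ijL} - n_{jL} \hat{p}_{ijL})^2}{n_{jL} \hat{p}_{ijL}} + \sum_{j=1}^{h_U} \sum_{i=1}^{k_U} \frac{(Y_{ijU} - n_{jU} \hat{p}_{ijU})^2}{n_{jU} \hat{p}_{ijU}} I_{Q_N}; \quad I_{Q_N} \in [I_{Q_L}, I_{Q_U}] \tag{8}$$

The simplified form of the statistic can be written as

$$Q_N = (1 + I_{Q_N}) \sum_{j=1}^{h_N} \sum_{i=1}^{k_N} \frac{(Y_{ijN} - n_{jN} \hat{p}_{ijN})^2}{n_{jN} \hat{p}_{ijN}}; \quad I_{Q_N} \in [I_{Q_L}, I_{Q_U}] \tag{9}$$

Note that $Q_N \in [Q_L, Q_U]$ based on $\hat{p}_{iN} \in [\hat{p}_{iL}, \hat{p}_{iU}]$ follows the neutrosophic chi-square distribution with $(h_N - 1)(k_N - 1)$ degrees of freedom.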
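The estimation step can be sketched in the same way. The Python fragment below is a minimal illustration, not the authors' implementation: it pools the counts over groups to estimate the common category probabilities (the determinate part of Eq. 7), plugs them into the Pearson sum, and returns the $(h_N - 1)(k_N - 1)$ degrees of freedom. The names `pooled_probabilities` and `homogeneity_q` and the toy counts are hypothetical.

```python
from typing import List, Sequence, Tuple


def pooled_probabilities(counts: Sequence[Sequence[float]]) -> List[float]:
    """Estimate the common probability of each category by pooling
    the counts of all groups (determinate part of Eq. 7)."""
    grand_total = sum(sum(row) for row in counts)
    return [sum(column) / grand_total for column in zip(*counts)]


def homogeneity_q(counts: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Pearson sum with estimated probabilities, together with its
    (h - 1)(k - 1) degrees of freedom."""
    p_hat = pooled_probabilities(counts)
    q = 0.0
    for row in counts:
        n_j = sum(row)  # number of trials in group j
        for y, p in zip(row, p_hat):
            expected = n_j * p
            q += (y - expected) ** 2 / expected
    h, k = len(counts), len(counts[0])
    return q, (h - 1) * (k - 1)


# Hypothetical toy data: two groups, two categories.
toy_counts = [[10, 20], [20, 10]]
toy_q, toy_df = homogeneity_q(toy_counts)
print(f"Q = {toy_q:.4f} on {toy_df} degree(s) of freedom")
```

Multiplying the returned sum by $(1 + I_{Q_N})$ at the ends of the indeterminacy interval then gives the neutrosophic form of Eq. 9.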

Application

In this section, the application of the proposed test is given using DNA sequence data. The data relate to the cancer-related gene BRCA2. According to https://medlineplus.gov/genetics/gene/brca2/#:~:text=Mutations%20in%20the%20BRCA2%20gene,one%20generation%20to%20the%20next “Mutations in the BRCA2 gene are associated with an increased risk of breast cancer in both men and women, as well as several other types of cancer. These mutations are present in every cell in the body and can be passed from one generation to the next.” Following https://www.math.mcgill.ca/~dstephens/OldCourses/204-2007/Handouts/Math204-ChiSquareWithResults.pdf, the counts of the nucleotides (A, C, G, T) for two count groups are reported in Table 1. Note that, in Table 1, the data in “Count Group 1” are taken from the given reference, and the data in “Count Group 2” are generated by simulation. DNA sequencing is a complex process, and there may be uncertainty/indeterminacy in the counts; see Yurov, Vorsanova, & Iourov (2011). In the presence of uncertainty/indeterminacy in the counts, the proposed test can be applied more effectively than the existing test under classic statistics. Suppose that there is 5% uncertainty/indeterminacy in the counts of the nucleotides (A, C, G, T) in the DNA sequence of the cancer-related gene BRCA2. Based on the information and data given in Table 1, the proposed test statistic is calculated as follows:

$$\sum_{j=1}^{2} \sum_{i=1}^{4} \frac{(Y_{ijL} - n_{jL} \hat{p}_{ijL})^2}{n_{jL} \hat{p}_{ijL}} = 0.000365921 + 0.002051303 + \cdots + 0.000748132 = 0.00664$$

TABLE 1. The counts of nucleotide data.

Category         1        2        3        4        Total
Nucleotide       A        C        G        T
Count Group 1    38,514   24,631   25,685   38,249   127,079
Count Group 2    38,550   24,635   25,700   38,288   127,173

The statistic $Q_N \in [Q_L, Q_U]$ in neutrosophic form can be expressed as follows:

$$Q_N = 0.00664 + 0.00664 I_{Q_N}; \quad I_{Q_N} \in [0, 0.05]$$

The simplified form of the statistic can be written as

$$Q_N = (1 + 0.05) \times 0.00664 = 0.00697; \quad I_{Q_N} \in [0, 0.05]$$

The proposed test for the DNA count data is implemented in the following steps.

Step 1: State the null hypothesis $H_0$: the allocation of DNA counts is equally likely vs. the alternative hypothesis $H_1$: the allocation of DNA counts is unequal.

Step 2: The level of significance is $\alpha = 0.05$, and the tabulated value from Kanji (2006) is 9.35.

Step 3: Compute the value of the statistic $Q_N = 0.00697$ and compare it with the tabulated value.

Step 4: As the computed value of $Q_N$ is less than 9.35, $H_0$ is accepted.

Based on the analysis, it can be concluded that there is no evidence to suspect unequal allocation of counts of nucleotide (A, C, G, T).
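The calculation and the four steps above can be retraced with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: the category probabilities are estimated by pooling the two count groups, the same counts are used for the lower and upper parts, and the tabulated value 9.35 is taken from the text as given rather than computed. The variable names are chosen for this example only.

```python
# Table 1 counts: rows are count groups, columns are nucleotides A, C, G, T.
counts = [
    [38514, 24631, 25685, 38249],  # Count Group 1 (from the cited handout)
    [38550, 24635, 25700, 38288],  # Count Group 2 (simulated, per the text)
]
CRITICAL_VALUE = 9.35           # tabulated value quoted in Step 2
I_LOWER, I_UPPER = 0.0, 0.05    # 5% indeterminacy assumed in the text

# Pooled estimates of the four nucleotide probabilities under H0.
grand_total = sum(sum(row) for row in counts)
p_hat = [sum(column) / grand_total for column in zip(*counts)]

# Determinate part: Pearson sum over both groups and all four nucleotides.
q_det = 0.0
for row in counts:
    n_j = sum(row)
    for observed, p in zip(row, p_hat):
        expected = n_j * p
        q_det += (observed - expected) ** 2 / expected

# Neutrosophic interval [Q_L, Q_U] and the decision of Steps 3 and 4.
q_interval = ((1 + I_LOWER) * q_det, (1 + I_UPPER) * q_det)
reject_h0 = q_interval[1] > CRITICAL_VALUE

print(f"Determinate part: {q_det:.5f}")
print(f"Q_N interval: [{q_interval[0]:.5f}, {q_interval[1]:.5f}]")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Comparing the upper end of the interval with the tabulated value is the conservative choice here, since it is the largest value the statistic can take under the assumed indeterminacy.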

Simulation Study

A simulation study is performed to assess the effect of the indeterminacy $I_{Q_N}$ in the counts of the nucleotides (A, C, G, T) in the DNA sequence of the cancer-related gene BRCA2 on the statistic $Q_N$. To see this effect, various values of $I_{Q_N}$ are considered. Using the neutrosophic form obtained for the DNA count data, the values of the statistic $Q_N$ are shown in Table 2. From Table 2, it can be noted that, as the measure of indeterminacy $I_{Q_N}$ increases, the values of $Q_N$ also increase. The decision about $H_0$ at various values of $I_{Q_N}$ is also shown in Table 2. Although the values of the statistic $Q_N$ increase as $I_{Q_N}$ increases, the decision to accept $H_0$ does not change.

TABLE 2. The effect of indeterminacy on $Q_N$.

$I_{Q_N}$    $Q_N$                 Decision about $H_0$    $I_{Q_N}$   $Q_N$                 Decision about $H_0$
(0, 0)       (0.00664, 0.00664)    Do not reject H0        (0, 0.1)    (0.00664, 0.007304)   Do not reject H0
(0, 0.01)    (0.00664, 0.006706)   Do not reject H0        (0, 0.2)    (0.00664, 0.007968)   Do not reject H0
(0, 0.02)    (0.00664, 0.006773)   Do not reject H0        (0, 0.3)    (0.00664, 0.008632)   Do not reject H0
(0, 0.03)    (0.00664, 0.006839)   Do not reject H0        (0, 0.4)    (0.00664, 0.009296)   Do not reject H0
(0, 0.04)    (0.00664, 0.006906)   Do not reject H0        (0, 0.5)    (0.00664, 0.00996)    Do not reject H0
(0, 0.05)    (0.00664, 0.006972)   Do not reject H0        (0, 0.6)    (0.00664, 0.010624)   Do not reject H0
(0, 0.06)    (0.00664, 0.007038)   Do not reject H0        (0, 0.7)    (0.00664, 0.011288)   Do not reject H0
(0, 0.07)    (0.00664, 0.007105)   Do not reject H0        (0, 0.8)    (0.00664, 0.011952)   Do not reject H0
(0, 0.08)    (0.00664, 0.007171)   Do not reject H0        (0, 0.9)    (0.00664, 0.012616)   Do not reject H0
(0, 0.09)    (0.00664, 0.007238)   Do not reject H0        (0, 1)      (0.00664, 0.01328)    Do not reject H0
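The entries of Table 2 follow from scaling the determinate value 0.00664 by $(1 + I_{Q_U})$ for each upper bound of the indeterminacy interval. The sketch below reproduces that sweep in Python. It assumes the rounded determinate value and the tabulated value 9.35 quoted in the text, and the function name `sweep_indeterminacy` is hypothetical.

```python
from typing import List, Tuple

Q_DETERMINATE = 0.00664   # determinate part reported for the Table 1 data
CRITICAL_VALUE = 9.35     # tabulated value quoted in the application section


def sweep_indeterminacy(
    upper_bounds: List[float],
) -> List[Tuple[float, float, float, str]]:
    """For each upper bound I_U, return (I_U, Q_L, Q_U, decision),
    where Q_U = (1 + I_U) * Q_L and Q_L is the determinate part."""
    rows = []
    for i_upper in upper_bounds:
        q_upper = (1 + i_upper) * Q_DETERMINATE
        decision = ("Reject H0" if q_upper > CRITICAL_VALUE
                    else "Do not reject H0")
        rows.append((i_upper, Q_DETERMINATE, q_upper, decision))
    return rows


# Same grid as Table 2: 0.00 to 0.09 in steps of 0.01, then 0.1 to 1.0.
grid = [step / 100 for step in range(0, 10)] + [step / 10 for step in range(1, 11)]
table_2 = sweep_indeterminacy(grid)

for i_upper, q_lower, q_upper, decision in table_2:
    print(f"(0, {i_upper:g})  ({q_lower:g}, {q_upper:.6f})  {decision}")
```

Under these assumptions the decision would change only once $(1 + I_{Q_U}) \times 0.00664$ exceeded 9.35, that is, for an upper bound above roughly 1,407, which lies far outside the range considered in Table 2.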

Comparative Studies

The flexibility, deftness, and goodness of the proposed test relative to the tests under interval statistics, the fuzzy-based approach, and classic statistics are shown in this section. The efficiency of the proposed test is assessed in terms of the measure of indeterminacy, flexibility, deftness, and goodness. The neutrosophic form of the statistic $Q_N \in [Q_L, Q_U]$ is expressed as follows:

$$Q_N = 0.00664 + 0.00664 I_{Q_N}; \quad I_{Q_N} \in [0, 0.05]$$

The abovementioned neutrosophic form carries two types of information. The first part, $0.00664$, is the determinate part, and the second part, $0.00664 I_{Q_N}$, is the indeterminate part. The proposed statistic $Q_N \in [Q_L, Q_U]$ reduces to the test under classic statistics when $I_{Q_L} = 0$. The existing test under classic statistics therefore gives information only about the determinate part, whereas the proposed test additionally gives information about the indeterminacy. The proposed test is thus more flexible than the existing test under classic statistics.

Interval statistics utilize only the information given in the interval; in simple words, they capture the information between the interval limits. Comparing the proposed test with the test statistic under interval statistics, it can be seen that the proposed test is more explanatory, because the interval-based test gives no information about the measure of indeterminacy. The proposed test is therefore also more efficient than the test using the interval-based statistic.

A test statistic based on fuzzy logic provides measures of truth and falseness only. Neutrosophic statistics use set analysis and can be used for any type of set. The proposed statistic $Q_N \in [Q_L, Q_U]$ gives three types of information: the chance of accepting $H_0$ is 0.95 (a measure of truth), the chance of committing a type-I error is 0.05 (a measure of falseness), and the measure of indeterminacy associated with the test is 0.05. From the study, it is concluded that the proposed test is also a generalization of the test using fuzzy logic. Therefore, the proposed test is more informative than the three existing tests.

Concluding Remarks

A modification of the existing test for the equality of multinomial distributions under neutrosophic statistics is introduced in this paper. The proposed test is a generalization of several existing tests under interval statistics, fuzzy-based logic, and classic statistics. The modified test statistic is presented for use in the presence of indeterminacy. The simulation and comparative studies show that the proposed test is adequate and effective in the presence of uncertainty. The application of the proposed test to DNA count data also shows its efficiency. The proposed test can be applied for testing whether the allocation of counts is equally likely in medical science, engineering, and political science. More properties of the proposed test can be studied in future research. The proposed test using a double sampling scheme is another fruitful area for future research.

Acknowledgments

The authors are deeply thankful to the editor and reviewers for their valuable comments to improve the quality of the paper. The authors, therefore, thank DSR for their financial and technical support.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Author Contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

Funding

The paper was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia. The authors, therefore, thank DSR for their financial and technical support.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

  1. Albassam M., Khan N., Aslam M. (2021). Neutrosophic D’Agostino Test of Normality: An Application to Water Data. J. Mathematics 2021, 1. doi: 10.1155/2021/5582102
  2. Ali Z., Bhaskar S. (2016). Basic Statistical Tools in Research and Data Analysis. Indian J. Anaesth. 60 (9), 662. doi: 10.4103/0019-5049.190623
  3. Amirzadeh V., Mashinchi M., Yaghoobi M. A. (2008). Construction of Control Charts Using Fuzzy Multinomial Quality. J. Mathematics Stat. 4 (1), 26–31. doi: 10.3844/jmssp.2008.26.31
  4. Aslam M. (2021). Neutrosophic Statistical Test for Counts in Climatology. Scientific Rep. 11 (1), 1–5. doi: 10.1038/s41598-021-97344-x [Retracted]
  5. Buldyrev S. V., Dokholyan N. V., Goldberger A. L., Havlin S., Peng C. K., Stanley H. E., et al. (1998). Analysis of DNA Sequences Using Methods of Statistical Physics. Physica A: Stat. Mech. Its Appl. 249 (1-4), 430–438. doi: 10.1016/s0378-4371(97)00503-7
  6. Chafaï D., Concordet D. (2009). Confidence Regions for the Multinomial Parameter with Small Sample Size. J. Am. Stat. Assoc. 104 (487), 1071–1079. doi: 10.1198/jasa.2009.tm08152
  7. Chen J., Ye J., Du S. (2017a). Scale Effect and Anisotropy Analyzed for Neutrosophic Numbers of Rock Joint Roughness Coefficient Based on Neutrosophic Statistics. Symmetry 9 (10), 208. doi: 10.3390/sym9100208
  8. Chen J., Ye J., Du S., Yong R. (2017b). Expressions of Rock Joint Roughness Coefficient Using Neutrosophic Interval Statistical Numbers. Symmetry 9 (7), 123. doi: 10.3390/sym9070123
  9. Cohen A., Kolassa J., Sackrowitz H. (2006). A Test for Equality of Multinomial Distributions vs Increasing Convex Order Institute of Mathematical Statistics. Recent Dev. Nonparametric Inference Probab. 1, 156–163. doi: 10.1214/074921706000000662
  10. Goldman N. (1993a). Simple Diagnostic Statistical Tests of Models for DNA Substitution. J. Mol. Evol. 37 (6), 650–661. doi: 10.1007/BF00182751
  11. Goldman N. (1993b). Statistical Tests of Models of DNA Substitution. J. Mol. Evol. 36 (2), 182–198. doi: 10.1007/bf00166252
  12. Greenland S., Senn S. J., Rothman K. J., Carlin J. B., Poole C., Goodman S. N., et al. (2016). Statistical Tests, P Values, Confidence Intervals, and Power: a Guide to Misinterpretations. Eur. J. Epidemiol. 31 (4), 337–350. doi: 10.1007/s10654-016-0149-3
  13. Gunasekaran H., Ramalakshmi K., Rex Macedo Arokiaraj A., Deepa Kanmani S., Venkatesan C., Suresh Gnana Dhas C. J. C., et al. (2021). Analysis of DNA Sequence Classification Using CNN and Hybrid Models. Comput. Math. Methods Med. 2021, 1835056. doi: 10.1155/2021/1835056
  14. Halla-aho V., Lähdesmäki H. (2021). Probabilistic Modeling Methods for Cell-Free DNA Methylation Based Cancer Classification (bioRxiv Preprint). doi: 10.1101/2021.06.18.444402
  15. Hrafnkelsson B., Oddsson G., Unnthorsson R. (2016). A Method for Estimating Annual Energy Production Using Monte Carlo Wind Speed Simulation. Energies 9 (4), 286. doi: 10.3390/en9040286
  16. Kanji G. K. (2006). 100 Statistical Tests. United Kingdom: Sheffield Hallam University.
  17. Keinduangjun J., Piamsa-nga P., Poovorawan Y. (2005). “DNA Sequence Identification by Statistics-Based Models,” in Paper Presented at the International Conference on Fuzzy Systems and Knowledge Discovery. 1.
  18. Kugiumtzis D., Provata A. (2004). Statistical Analysis of Gene and Intergenic DNA Sequences. Physica A: Stat. Mech. Its Appl. 342 (3-4), 623–638. doi: 10.1016/j.physa.2004.05.070
  19. Mashuri M., Ahsan M. (2018). Perfomance Fuzzy Multinomial Control Chart. Paper Presented at the Journal of Physics: Conference Series.
  20. Mollan K. R., Trumble I. M., Reifeis S. A., Ferrer O., Bay C. P., Baldoni P. L., et al. (2019). Exact Power of the Rank-Sum Test for a Continuous Variable. arXiv Preprint arXiv:1901.04597. Available at: https://arxiv.org/abs/1901.04597.
  21. Pai S. S., Mathew A. R., Anindya R. (2021). A Comparative Analysis of Computational Tools for the Prediction of Epigenetic DNA Methylation from Long-Read Sequencing Data. doi: 10.1101/2021.04.24.441281
  22. Rodriguez B. A., Frankhouser D., Murphy M., Trimarchi M., Tam H.-H., Curfman J. (2012). Methods for High-Throughput MethylCap-Seq Data Analysis. BMC Genomics 13 (6), 1–11. doi: 10.1186/1471-2164-13-s6-s14
  23. Sherwani R. A. K., Shakeel H., Saleem M., Awan W. B., Aslam M., Farooq M. (2021). A New Neutrosophic Sign Test: An Application to COVID-19 Data. PloS One 16 (8), e0255671. doi: 10.1371/journal.pone.0255671
  24. Shin D., Yamamoto Y., Brady M. P., Lee S., Haynes J. A. (2019). Modern Data Analytics Approach to Predict Creep of High-Temperature Alloys. Acta Materialia 168, 321–330. doi: 10.1016/j.actamat.2019.02.017
  25. Smarandache F. (2013). Introduction to Neutrosophic Measure, Neutrosophic Integral, and Neutrosophic Probability: Sitech – Education.
  26. Smarandache F. (2014). Introduction to Neutrosophic Statistics, Sitech and Education Publisher, Craiova. Romania-Educational Publ. Columbus, Ohio USA 123, 1.
  27. Turner D. P., Deng H., Houle T. T. (2020). Statistical Hypothesis Testing: Overview and Application. Headache: J. Head Face Pain 60 (2), 302–308. doi: 10.1111/head.13706
  28. Viertl R. (2006). Univariate Statistical Analysis with Fuzzy Data. Comput. Stat. Data Anal. 51 (1), 133–147. doi: 10.1016/j.csda.2006.04.002
  29. Yao Y., Jin Z., Lee J. H. (2018). An Improved Statistical Model for Taxonomic Assignment of Metagenomics. BMC Genet. 19 (1), 98–11. doi: 10.1186/s12863-018-0680-1
  30. Yoshida K., Kobayashi M., Futagami K., Fujikoshi Y. (1999). “Statistical Analysis of DNA Sequencing Data (1): Accuracy Test of DNA Data by Partial Re-Sequencing,” in Paper Presented at the Nucleic Acids Symposium Series, 1.
  31. Yurov Y. B., Vorsanova S. G., Iourov I. Y. (2011). The DNA Replication Stress Hypothesis of Alzheimer's Disease. Scientific World J. 11, 2602–2612. doi: 10.1100/2011/625690
