Abstract
Applying interval-valued data and methods, researchers have made solid accomplishments in information processing and uncertainty management. Although interval-valued statistics and probability are available for interval-valued data, current inferential decision making schemes mostly rely on point-valued statistical and probabilistic measures. To enable direct applications of these point-valued schemes to interval-valued datasets, we present point-valued variational statistics, probability, and entropy for interval-valued datasets. Related algorithms are reported with illustrative examples.
Keywords: Interval-valued dataset, Point-valued variational statistics, Probability, Information entropy
Introduction
Why Do We Study Interval-Valued Datasets?
Statistical and probabilistic measures play a very important role in processing data and managing uncertainty. In the literature, these measures are mostly point-valued and applied to point-valued datasets. While a point-valued datum intends, in theory, to record a snapshot of an event instantaneously, it is often imprecise in the real world due to systematic and random errors. Applying interval-valued data to encapsulate variations and uncertainty, researchers have developed interval methods for knowledge processing. With data aggregation strategies [1, 5, 21], and others, we are able to reduce large point-valued datasets into smaller interval-valued ones for efficient data management and processing. By doing so, researchers are able to focus more on qualitative properties and ignore insignificant quantitative differences.
Studying interval-valued data, Gioia and Lauro developed interval-valued statistics [4] in 2005. Lodwick and Jamison discussed interval-valued probability [17] in the analysis of problems containing a mixture of possibilistic, probabilistic, and interval uncertainty in 2008. Billard and Diday reported regression analysis of interval-valued data in [2]. Huynh et al. established a justification of decision making under interval uncertainty [13]. Works on applications of interval-valued data in knowledge processing include [3, 8, 16, 19, 20, 22], and many more. Applying interval-valued data to stock market forecasting, Hu and He first reported astonishing quality improvements in [9]. Specifically, compared against the commonly used point-valued confidence interval predictions, the interval approaches increased the average accuracy ratio of annual stock market forecasts from 12.6% to 64.19%, and reduced the absolute mean error from 72.35% to 5.17% [9]. Additional results on stock market forecasts, reported in [6, 7, 10] and others, have verified the advantages of using interval-valued data. The paper [12], published in the same volume as this one, further validates the advantages from the perspective of information theory.
Using interval-valued data can significantly improve efficiency and effectiveness in information processing and uncertainty management. Therefore, we need to study interval-valued datasets.
The Objective of this Study
As a matter of fact, the powerful inferential decision making schemes in the current literature mostly use point-valued statistical and probabilistic measures, not interval-valued ones [4, 17]. To enable direct applications of these schemes and theories to the analysis of interval-valued datasets, we need to supply point-valued statistics and probability for interval-valued datasets. Therefore, the primary objective of this work is to establish and to calculate such point-valued measures for interval-valued datasets.
To make this paper easy to read, it includes brief introductions to the necessary background. It also provides easy-to-follow illustrative examples for the novel concepts and algorithms, in addition to pseudo-code. Numerical results of these examples were obtained with a recent version of Python 3. However, readers may use any preferred general purpose programming language to verify the results.
Basic Concepts and Notations
Prior to our discussion, let us first clarify some basic concepts and notations related to intervals in this paper. An interval is a connected subset of $\mathbb{R}$. We denote an interval-valued object with a boldfaced letter to distinguish it from a point-valued one. We further specify the greatest lower bound and least upper bound of an interval object with an underline and an overline of the same letter, not boldfaced, respectively. For example, while $a$ is a real, the boldfaced letter $\mathbf{a}$ denotes an interval with greatest lower bound $\underline{a}$ and least upper bound $\overline{a}$. That is, $\mathbf{a} = [\underline{a}, \overline{a}]$. The absolute value of $\mathbf{a}$, defined as $|\mathbf{a}| = \overline{a} - \underline{a}$, is also called the length (or norm) of $\mathbf{a}$. This is the greatest distance between any two numbers in $\mathbf{a}$.
The midpoint and radius of an interval $\mathbf{a}$ are defined as $\mathrm{mid}(\mathbf{a}) = (\underline{a} + \overline{a})/2$ and $\mathrm{rad}(\mathbf{a}) = (\overline{a} - \underline{a})/2$, respectively. Because the midpoint and radius of an interval $\mathbf{a}$ are point-valued, we simply denote them as $\mathrm{mid}(a)$ and $\mathrm{rad}(a)$ without boldfacing the letter $a$. We call $[\underline{a}, \overline{a}]$ the endpoint (or min-max) representation of $\mathbf{a}$. We can specify an interval $\mathbf{a}$ with $\mathrm{mid}(a)$ and $\mathrm{rad}(a)$ too. This is because $\underline{a} = \mathrm{mid}(a) - \mathrm{rad}(a)$ and $\overline{a} = \mathrm{mid}(a) + \mathrm{rad}(a)$. In the rest of this paper, we use both the min-max and mid-rad representations for an interval-valued object.
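The two representations convert into each other mechanically. The following Python sketch (the helper names are ours, not from the paper) models an interval as a (lo, hi) tuple:

```python
# Minimal helpers for the min-max and mid-rad representations of an interval.
# An interval a = [lo, hi] is modeled as a (lo, hi) tuple of floats.

def mid(a):
    """Midpoint of interval a = (lo, hi): (lo + hi) / 2."""
    return (a[0] + a[1]) / 2

def rad(a):
    """Radius of interval a = (lo, hi): (hi - lo) / 2."""
    return (a[1] - a[0]) / 2

def from_mid_rad(m, r):
    """Recover the min-max form: [m - r, m + r]."""
    return (m - r, m + r)

a = (2.0, 3.0)
assert mid(a) == 2.5 and rad(a) == 0.5
assert from_mid_rad(mid(a), rad(a)) == a
```

Either representation determines the other, so later sketches use whichever form is more convenient.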
While we use a boldfaced lowercase letter to indicate an interval, we denote an interval-valued dataset, i.e., a collection of real intervals, with a boldfaced uppercase letter. For instance, $\mathbf{X} = \{\mathbf{x}_i = [\underline{x}_i, \overline{x}_i] : 1 \le i \le n\}$ is an interval-valued dataset. The sets $\underline{X} = \{\underline{x}_i\}$ and $\overline{X} = \{\overline{x}_i\}$ are the left- and right-end sets of $\mathbf{X}$, respectively. Although items in a set are not ordered, the $\underline{x}_i$ and $\overline{x}_i$ are related to the same interval $\mathbf{x}_i$. For convenience, we denote both $\underline{X}$ and $\overline{X}$ as ordered tuples. They are the left- and right-endpoints of $\mathbf{X}$. That is, $\underline{X} = (\underline{x}_1, \underline{x}_2, \ldots, \underline{x}_n)$ and $\overline{X} = (\overline{x}_1, \overline{x}_2, \ldots, \overline{x}_n)$. Similarly, the midpoint and radius of $\mathbf{X}$ are point-valued tuples. They are $\mathrm{mid}(X) = (\mathrm{mid}(x_1), \ldots, \mathrm{mid}(x_n))$ and $\mathrm{rad}(X) = (\mathrm{rad}(x_1), \ldots, \mathrm{rad}(x_n))$, respectively.
Example 1
Provided an interval-valued sample dataset $\mathbf{X}_0 = \{[2, 3], [2.5, 7], \ldots\}$, its left-endpoint is $\underline{X_0} = (2, 2.5, \ldots)$, and its right-endpoint is $\overline{X_0} = (3, 7, \ldots)$. The midpoint of $\mathbf{X}_0$ is $\mathrm{mid}(X_0) = (2.5, 4.75, \ldots)$, and the radius is $\mathrm{rad}(X_0) = (0.5, 2.25, \ldots)$.
We use this sample dataset $\mathbf{X}_0$ in the rest of this paper to illustrate concepts and algorithms, for its simplicity.
In the rest of this paper, we discuss statistics of an interval-valued dataset in Sect. 2; define point-valued probability distributions for an interval-valued dataset in Sect. 3; introduce point-valued information entropy in Sect. 4; and summarize the main results and future work in Sect. 5.
Descriptive Statistics of an Interval-Valued Dataset
We introduce positional statistics for an interval-valued dataset first, and then discuss its point-valued variance and standard deviation.
Positional Statistics of an Interval-Valued Dataset X
The left- and right-endpoints, midpoint, and radius $\underline{X}$, $\overline{X}$, $\mathrm{mid}(X)$, and $\mathrm{rad}(X)$ are among the positional statistics of an interval-valued dataset $\mathbf{X}$, as presented in Example 1. The mean of $\mathbf{X}$, denoted as $\mathbf{E}(\mathbf{X})$, is the arithmetic average of $\mathbf{X}$. Because $\mathbf{a} + \mathbf{b} = [\underline{a} + \underline{b},\ \overline{a} + \overline{b}]$ in interval arithmetic [18], we have

$$\mathbf{E}(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i = \left[\frac{1}{n}\sum_{i=1}^{n}\underline{x}_i,\ \frac{1}{n}\sum_{i=1}^{n}\overline{x}_i\right]. \qquad (1)$$
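Equation (1) can be evaluated on the endpoints directly, without any interval library. A minimal Python sketch, using a small hypothetical sample (not the paper's dataset):

```python
def interval_mean(X):
    """Mean of an interval-valued dataset per Eq. (1):
    the interval of the averaged left and right endpoints."""
    n = len(X)
    lo = sum(a for a, _ in X) / n
    hi = sum(b for _, b in X) / n
    return (lo, hi)

# Hypothetical sample data, for illustration only:
X = [(2.0, 3.0), (2.5, 7.0), (4.0, 6.0), (5.0, 8.0)]
print(interval_mean(X))  # (3.375, 6.0)
```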
We now define a few more observational statistics for $\mathbf{X}$.
Definition 1
Let $\mathbf{X} = \{\mathbf{x}_i : 1 \le i \le n\}$ be an interval-valued dataset. Then:
1. The envelope of $\mathbf{X}$ is the interval $\mathrm{env}(\mathbf{X}) = [\min_i \underline{x}_i,\ \max_i \overline{x}_i]$;
2. The core of $\mathbf{X}$ is the interval $\mathrm{core}(\mathbf{X}) = \bigcap_{1 \le i \le n} \mathbf{x}_i$; and
3. The mode of $\mathbf{X}$ is a tuple, $\mathrm{mode}(\mathbf{X}) = (\mathbf{x}_{mode}, k)$, where $\mathbf{x}_{mode} = \bigcap_{i \in I}\mathbf{x}_i \ne \emptyset$, $I$ is a cardinality-$k$ subset of $\{1, 2, \ldots, n\}$, and for any $J \subseteq \{1, 2, \ldots, n\}$, if $\bigcap_{j \in J}\mathbf{x}_j \ne \emptyset$ then $|J| \le k$.
In other words, $I$ is a subset of $\{1, 2, \ldots, n\}$, and $\mathbf{x}_{mode}$ is a subset of every $\mathbf{x}_i$ with $i \in I$. Furthermore, $\mathrm{mode}(\mathbf{X})$ is an ordered tuple, in which $\mathbf{x}_{mode}$ is the non-empty intersection of the $\mathbf{x}_i$ for all $i \in I$, such that the cardinality of $I$ is the greatest. For a given $\mathbf{X}$, its mode may not be unique, because there may be multiple cardinality-$k$ subsets of $\{1, 2, \ldots, n\}$ satisfying the nonempty intersection requirement.
Corollary 1
Let $\mathbf{X} = \{\mathbf{x}_i : 1 \le i \le n\}$ be an interval-valued dataset. Then:
1. For all $i$, $\mathbf{x}_i \subseteq \mathrm{env}(\mathbf{X})$;
2. The core of $\mathbf{X}$ is not empty if and only if $\max_i \underline{x}_i \le \min_i \overline{x}_i$; and
3. The mode of $\mathbf{X}$ is $(\mathrm{core}(\mathbf{X}), n)$ if and only if $\mathrm{core}(\mathbf{X}) \ne \emptyset$.
Corollary 1 is straightforward.
Instead of providing a proof, we provide the mean, envelope, core, and mode of the sample dataset $\mathbf{X}_0$. In addition to its endpoints, midpoint, and radius presented in Example 1, we have its mean $\mathbf{E}(\mathbf{X}_0)$ from (1); its envelope $\mathrm{env}(\mathbf{X}_0)$; an empty core, because the greatest left endpoint of $\mathbf{X}_0$ is greater than its least right endpoint; and $\mathrm{mode}(\mathbf{X}_0) = ([2.5, 3], 4)$. Figure 1 illustrates the sample dataset $\mathbf{X}_0$. From it, one may visualize $\mathrm{env}(\mathbf{X}_0)$ and $\mathrm{mode}(\mathbf{X}_0)$ by imagining a vertical line, like the y-axis, continuously moving from left to right. The first and last points at which the line touches any $\mathbf{x}_i$ determine the envelope $\mathrm{env}(\mathbf{X}_0)$. The line touches at most four intervals, for all $x$ in $[2.5, 3]$. Hence, the mode is $([2.5, 3], 4)$.
Fig. 1. The sample interval-valued dataset $\mathbf{X}_0$.
While finding the envelope, core, and mean of $\mathbf{X}$ is straightforward, determining the mode of $\mathbf{X}$ involves the $2n$ numbers in $\underline{X}$ and $\overline{X}$, which divide $\mathrm{env}(\mathbf{X})$ into $2n - 1$ sub-intervals in general (though some of them may be degenerated as points). Each of these $2n - 1$ sub-intervals can be a candidate for the nonempty intersection part of the mode. Any $\mathbf{x}_i \in \mathbf{X}$ may cover some of these sub-intervals (candidates) consecutively. For each of these candidates, we accumulate its occurrences in each $\mathbf{x}_i$. The mode(s) of $\mathbf{X}$ is (are) the candidate(s) with the (same) highest occurrence. As a special case, if $\mathrm{core}(\mathbf{X})$ is not empty, then $\mathrm{mode}(\mathbf{X}) = (\mathrm{core}(\mathbf{X}), n)$. We summarize the above as an algorithm.
Algorithm 1 is $O(n^2)$. This is because, for each interval $\mathbf{x}_i$, updating the counts of the up to $2n - 1$ candidates it covers takes $O(n)$.
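The candidate-sweeping idea behind Algorithm 1 can be sketched in Python as follows (the function name and the sample data are ours, hypothetical; the candidates are the sub-intervals between consecutive sorted endpoints):

```python
def mode(X):
    """Mode of an interval-valued dataset per Definition 1: sweep the
    sub-intervals induced by the 2n endpoints and keep those covered by
    the most intervals. Returns (list_of_candidate_intervals, k)."""
    pts = sorted({p for ab in X for p in ab})
    best, best_k = [], 0
    # Each pair of consecutive endpoints is a candidate intersection.
    for lo, hi in zip(pts, pts[1:]):
        k = sum(1 for a, b in X if a <= lo and hi <= b)
        if k > best_k:
            best, best_k = [(lo, hi)], k
        elif k == best_k:
            best.append((lo, hi))
    return best, best_k

# Hypothetical sample data, for illustration only:
X = [(2.0, 3.0), (2.5, 7.0), (4.0, 6.0), (5.0, 8.0)]
print(mode(X))  # ([(5.0, 6.0)], 3)
```

Adjacent winning candidates sharing the same count could be merged into a single interval; this sketch reports them separately.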
Point-Valued Variational Statistics of an Interval-Valued Dataset
In the literature, the variance of a point-valued dataset X is defined as
![]() |
2 |
in which, the term
is the distance between
and
, which is the mean of X.
To use (2) to define a variance for an interval-valued $\mathbf{X}$, we need a notion of point-valued distance between two intervals, $\mathbf{x}_i$ and the interval $\mathbf{E}(\mathbf{X})$. Can we simply use $|\mathbf{a} - \mathbf{b}|$, the absolute value of the difference between two intervals $\mathbf{a}$ and $\mathbf{b}$, as their distance? Unfortunately, it does not work.
In interval arithmetic [18], the difference between two intervals $\mathbf{a}$ and $\mathbf{b}$ is defined as follows:

$$\mathbf{a} - \mathbf{b} = [\underline{a} - \overline{b},\ \overline{a} - \underline{b}]. \qquad (3)$$

Equation (3) ensures $0 \in \mathbf{a} - \mathbf{a}$. However, it also implies that the largest absolute value in $\mathbf{a} - \mathbf{b}$, i.e., $\max\{|\underline{a} - \overline{b}|, |\overline{a} - \underline{b}|\}$, is the maximum distance between a point in $\mathbf{a}$ and a point in $\mathbf{b}$.
Mathematically, a distance between two nonempty sets $A$ and $B$ is usually defined as the minimum distance between $a \in A$ and $b \in B$, not the maximum. Hence, we need to define a notion of distance between two intervals.
Definition 2
Let $\mathbf{a}$ and $\mathbf{b}$ be two nonempty intervals. The distance between $\mathbf{a}$ and $\mathbf{b}$ is defined as

$$\mathrm{dist}(\mathbf{a}, \mathbf{b}) = |\mathrm{mid}(a) - \mathrm{mid}(b)| + |\mathrm{rad}(a) - \mathrm{rad}(b)|. \qquad (4)$$

Definition 2 satisfies all mathematical requirements for a distance: $\mathrm{dist}(\mathbf{a}, \mathbf{b}) = 0$ if and only if $\mathbf{a} = \mathbf{b}$; $\mathrm{dist}(\mathbf{a}, \mathbf{b}) = \mathrm{dist}(\mathbf{b}, \mathbf{a})$; and for any nonempty intervals $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$, $\mathrm{dist}(\mathbf{a}, \mathbf{b}) \le \mathrm{dist}(\mathbf{a}, \mathbf{c}) + \mathrm{dist}(\mathbf{c}, \mathbf{b})$. Definition 2 is in fact an extension of the distance between two reals, because the radius of a real is always zero and the midpoint of a real is itself.
Replacing $|x_i - E(X)|$ in Equation (2) with $\mathrm{dist}(\mathbf{x}_i, \mathbf{E}(\mathbf{X}))$ defined in (4), we have the point-valued variance of $\mathbf{X}$ as follows:

$$\sigma^2(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^{n}\left(|\mathrm{mid}(x_i) - \mathrm{mid}(E(\mathbf{X}))| + |\mathrm{rad}(x_i) - \mathrm{rad}(E(\mathbf{X}))|\right)^2.$$

The expression above has three terms. All of them involve $\mathrm{mid}(x_i) - \mathrm{mid}(E(\mathbf{X}))$ and $\mathrm{rad}(x_i) - \mathrm{rad}(E(\mathbf{X}))$. Since $\mathrm{mid}(E(\mathbf{X})) = E(\mathrm{mid}(X))$ and $\mathrm{rad}(E(\mathbf{X})) = E(\mathrm{rad}(X))$, the first term in the expression above is $\sigma^2(\mathrm{mid}(X))$ according to (2). Similarly, the second term is $\sigma^2(\mathrm{rad}(X))$.
The third term is related to the absolute covariance between $\mathrm{mid}(X)$ and $\mathrm{rad}(X)$. Let $u_i = |\mathrm{mid}(x_i) - E(\mathrm{mid}(X))|$ and $v_i = |\mathrm{rad}(x_i) - E(\mathrm{rad}(X))|$; then we can rewrite the term as $\frac{2}{n}\sum_{i=1}^{n} u_i v_i$.
Summarizing the discussion above, we have the point-valued variance for an interval-valued dataset $\mathbf{X}$ as follows.
Definition 3
Let $\mathbf{X} = \{\mathbf{x}_i : 1 \le i \le n\}$ be an interval-valued dataset. Then the point-valued variance of $\mathbf{X}$ is

$$\sigma^2(\mathbf{X}) = \sigma^2(\mathrm{mid}(X)) + \sigma^2(\mathrm{rad}(X)) + \frac{2}{n}\sum_{i=1}^{n} u_i v_i. \qquad (5)$$

Because midpoints and radii of interval-valued objects are point-valued, the variance defined in (5) is also point-valued. Hence, we have the point-valued standard deviation of $\mathbf{X}$ as usual:

$$\sigma(\mathbf{X}) = \sqrt{\sigma^2(\mathbf{X})}. \qquad (6)$$

In evaluating (5) and (6), one does not need interval computing at all. For the sample dataset $\mathbf{X}_0$, we can obtain its point-valued variance and standard deviation directly from $\mathrm{mid}(X_0)$ and $\mathrm{rad}(X_0)$.
It is worthwhile to note that Eq. (5) is an extension of (2) and applicable to point-valued datasets too. This is because, for every $x_i$ in a point-valued $X$, $\mathrm{mid}(x_i) = x_i$ and $\mathrm{rad}(x_i) = 0$ always. Hence, $\sigma^2(\mathbf{X}) = \sigma^2(X)$ for a point-valued $X$.
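A direct evaluation of (5) and (6) needs only the midpoints and radii. A Python sketch with a hypothetical sample (not the paper's dataset):

```python
def point_variance(X):
    """Point-valued variance per Eq. (5): variance of the midpoints,
    plus variance of the radii, plus twice the mean product of the
    absolute midpoint and radius deviations."""
    n = len(X)
    mids = [(a + b) / 2 for a, b in X]
    rads = [(b - a) / 2 for a, b in X]
    m_bar = sum(mids) / n
    r_bar = sum(rads) / n
    u = [abs(m - m_bar) for m in mids]
    v = [abs(r - r_bar) for r in rads]
    return (sum(ui * ui for ui in u) / n          # sigma^2(mid(X))
            + sum(vi * vi for vi in v) / n        # sigma^2(rad(X))
            + 2 * sum(ui * vi for ui, vi in zip(u, v)) / n)

# Hypothetical sample data, for illustration only:
X = [(2.0, 3.0), (2.5, 7.0), (4.0, 6.0), (5.0, 8.0)]
sigma2 = point_variance(X)
sigma = sigma2 ** 0.5  # standard deviation, Eq. (6)
print(sigma2)  # 3.59765625
```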
Probability Distributions of an Interval-Valued Population
An interval-valued dataset $\mathbf{X}$ can be viewed as a sample of an interval-valued population. In this section, we study practical ways to find probability distributions for an interval-valued dataset $\mathbf{X}$. Our discussion addresses two different cases: one assumes distribution information for each $\mathbf{x}_i \in \mathbf{X}$; the other does not.
On the Probability Distribution of X with Distribution Information for Each $\mathbf{x}_i$
Our discussion involves the concept of a probability distribution over an interval. Let us very briefly review the literature first.
A function $f(x)$ is a probability density function (pdf) of a random variable $x$ on the interval $[a, b]$ if and only if $f(x) \ge 0$ and $\int_a^b f(x)\,dx = 1$. Well-known pdfs in the literature include the uniform distribution: $f(x) = \frac{1}{b-a}$ for $x \in [a, b]$; the normal distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$; and the beta distribution: $f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$, where $x \in [0, 1]$, both parameters $\alpha$ and $\beta$ are positive, and $\Gamma$ is the gamma function. There are software tools available to fit point-valued sample data, i.e., to computationally determine the parameter values in a chosen type of distribution. For instance, the Python scipy.stats module can find the optimal $\mu$ and $\sigma$ to fit a point-valued dataset with a normal distribution, and/or $\alpha$ and $\beta$ with a beta distribution.
It is safe to assume the availability of a pdf for each $\mathbf{x}_i \in \mathbf{X}$, both theoretically and computationally. In practice, an interval $\mathbf{x}_i$ is often obtained through aggregating observed points. For instance, in [9] and [11], min-max and confidence intervals are applied to aggregate points into intervals, respectively. If an interval is provided directly, one can always pick points from the interval and fit these points with a selected probability distribution computationally. Hereafter, we denote the pdf of $\mathbf{x}_i$ as $f_i(x)$.
We now define a notion of a pdf for an interval-valued dataset $\mathbf{X}$.
Definition 4
A function $f(x)$ is called a probability density function of an interval-valued dataset $\mathbf{X}$ if and only if $f(x)$ satisfies all of the conditions:

$$f(x) \ge 0, \quad f(x) = 0 \text{ for } x \notin \mathrm{env}(\mathbf{X}), \quad \text{and} \quad \int_{\mathrm{env}(\mathbf{X})} f(x)\,dx = 1. \qquad (7)$$

The theorem below provides a practical way to calculate a pdf for $\mathbf{X}$.
Theorem 1
Let $\mathbf{X} = \{\mathbf{x}_i : 1 \le i \le n\}$ be an interval-valued dataset, and let $f_i(x)$ be the pdf of $\mathbf{x}_i$, provided $f_i(x) = 0$ for $x \notin \mathbf{x}_i$. Then,

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) \qquad (8)$$

is a pdf of $\mathbf{X}$.
Proof
Because $\int_{\mathbf{x}_i} f_i(x)\,dx = 1$ and $f_i(x) = 0$ outside $\mathbf{x}_i$, we have $\int_{\mathrm{env}(\mathbf{X})} f(x)\,dx = \frac{1}{n}\sum_{i=1}^{n}\int_{\mathbf{x}_i} f_i(x)\,dx = 1$. In addition, because $f_i(x) \ge 0$ for all $i$, we have $f(x) \ge 0$. Equation (7) is satisfied. Hence, the $f(x)$ is a pdf of $\mathbf{X}$. □
Equation (8) actually provides a practical way of calculating the pdf of $\mathbf{X}$. Provided an $f_i(x)$ for each $\mathbf{x}_i \in \mathbf{X}$, we have the algorithm in pseudo-code below:
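Under the uniform distribution assumption used in Example 2 below, the accumulation of Eq. (8) can be sketched in Python. Each piece between consecutive sorted endpoints carries a constant density, so the result is a stair function (the function name and the sample data are ours, hypothetical; degenerate intervals are assumed away):

```python
def stair_pdf(X):
    """Eq. (8) with a uniform distribution on each interval:
    f(x) = (1/n) * sum of the uniform densities 1/|x_i| over the x_i
    containing x. Returns (piece, constant_density) pairs."""
    n = len(X)
    pts = sorted({p for ab in X for p in ab})
    pieces = []
    for lo, hi in zip(pts, pts[1:]):
        m = (lo + hi) / 2  # any point strictly inside the piece
        density = sum(1 / (b - a) for a, b in X if a <= m <= b) / n
        pieces.append(((lo, hi), density))
    return pieces

# Hypothetical sample data, for illustration only:
X = [(2.0, 3.0), (2.5, 7.0), (4.0, 6.0), (5.0, 8.0)]
pieces = stair_pdf(X)
# The stair function integrates to 1 over the envelope, as Eq. (7) requires:
total = sum(d * (hi - lo) for (lo, hi), d in pieces)
```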
Example 2
Find a pdf for the sample dataset $\mathbf{X}_0 = \{[2, 3], [2.5, 7], \ldots\}$. For simplicity, we assume a uniform distribution for each of the $\mathbf{x}_i$'s, i.e.,

$$f_i(x) = \begin{cases} \frac{1}{|\mathbf{x}_i|} & \text{if } x \in \mathbf{x}_i, \\ 0 & \text{otherwise.} \end{cases}$$

Applying Algorithm 2, we obtain a pdf $f(x)$ of $\mathbf{X}_0$ as in (9).
The $f(x)$ in the example is a stair function. This is because of the uniform distribution assumption on each $\mathbf{x}_i$.
Here are a few additional notes on finding a pdf for $\mathbf{X}$ with Algorithm 2.
If assuming a uniform distribution, how do we handle an $\mathbf{x}_i$ such that $\underline{x}_i = \overline{x}_i$? First of all, an interval element $\mathbf{x}_i$ is usually not degenerated as a constant. Even if there is an $i$ such that $\underline{x}_i = \overline{x}_i$, we can always assign an arbitrary non-negative $f_i$ value at that point. This does not impact the calculation of probability in integrating the $f(x)$ function.
Algorithm 2 assumes $x \in \mathrm{env}(\mathbf{X})$. If that is not the case, the $2n$ numbers in $\underline{X}$ and $\overline{X}$ divide $(-\infty, \infty)$ into $2n + 1$ sub-intervals: the two unbounded ones outside $\mathrm{env}(\mathbf{X})$ together with the $2n - 1$ sub-intervals in $\mathrm{env}(\mathbf{X})$. Therefore, the accumulation loop in Algorithm 2 should run through all of the $2n + 1$ sub-intervals, and then normalize the counts by dividing by $n$.
Another implicit assumption of Theorem 1 is that all $\mathbf{x}_i \in \mathbf{X}$ are equally weighted. However, that is not necessary. If needed, one may place a positive weight $w_i$ on each of the $\mathbf{x}_i$'s, as stated in Corollary 2.
Corollary 2
Let $\mathbf{X} = \{\mathbf{x}_i : 1 \le i \le n\}$ be an interval-valued dataset and $f_i(x)$ be the pdf of $\mathbf{x}_i$; then the function

$$f(x) = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i f_i(x), \quad w_i > 0, \qquad (10)$$

is a pdf of $\mathbf{X}$.
A proof of Corollary 2 is straightforward too. We have successfully applied the Corollary in computationally studying the stock market [12].
Probability Distribution of an Interval-Valued X Without Distribution Information for Any $\mathbf{x}_i$
It is not necessary to assume a probability distribution for each $\mathbf{x}_i \in \mathbf{X}$ to find a pdf of $\mathbf{X}$. An interval $\mathbf{x}$ is determined by its midpoint and radius. Let $u = \mathrm{mid}(\mathbf{x})$ and $v = \mathrm{rad}(\mathbf{x})$ be two point-valued random variables. Then, the joint pdf of $(u, v)$ is a non-negative function $f(u, v)$ such that $\iint f(u, v)\,du\,dv = 1$. If we assume a normal distribution for both $u$ and $v$, then $f(u, v)$ is a bivariate normal distribution [25]. The pdf of a bivariate normal distribution is:

$$f(u, v) = \frac{1}{2\pi\sigma_u\sigma_v\sqrt{1-\rho^2}}\,\exp\!\left(-\frac{z}{2(1-\rho^2)}\right), \qquad (11)$$

where $z = \frac{(u-\mu_u)^2}{\sigma_u^2} - \frac{2\rho(u-\mu_u)(v-\mu_v)}{\sigma_u\sigma_v} + \frac{(v-\mu_v)^2}{\sigma_v^2}$, and $\rho$ is the normalized correlation between $u$ and $v$, i.e., the ratio of their covariance and the product of $\sigma_u$ and $\sigma_v$. Applying the pdf, we are able to estimate the probability over a region $R$ of the $(u, v)$-plane as

$$P(R) = \iint_R f(u, v)\,du\,dv. \qquad (12)$$
To calculate the probability of an interval $\mathbf{x}$, whose midpoint and radius are $u = \mathrm{mid}(\mathbf{x})$ and $v = \mathrm{rad}(\mathbf{x})$, we need a marginal pdf for either $u$ or $v$. If we fix $u = \mathrm{mid}(\mathbf{x})$, then the marginal pdf of $v$ follows a single-variable normal distribution. Thus,

$$f(v \mid u) = \frac{1}{\sigma_{v|u}\sqrt{2\pi}}\,\exp\!\left(-\frac{(v-\mu_{v|u})^2}{2\sigma_{v|u}^2}\right), \qquad (13)$$

with $\mu_{v|u} = \mu_v + \rho\frac{\sigma_v}{\sigma_u}(u - \mu_u)$ and $\sigma_{v|u} = \sigma_v\sqrt{1-\rho^2}$, and the probability of $\mathbf{x}$ is obtained by integrating (13) over the radii of interest, e.g.,

$$P(\mathbf{x}) \approx \int_{0}^{\mathrm{rad}(\mathbf{x})} f(v \mid u = \mathrm{mid}(\mathbf{x}))\,dv. \qquad (14)$$
An interval-valued dataset $\mathbf{X}$ provides us its $\mathrm{mid}(X)$ and $\mathrm{rad}(X)$. They are point-valued sample sets of $u$ and $v$, respectively. Their means, standard deviations, and covariance can be calculated as usual to estimate the $\mu_u$, $\mu_v$, $\sigma_u$, $\sigma_v$, and $\rho$ in (11). For instance, from the sample $\mathbf{X}_0$, we can obtain its $\mu_u$, $\mu_v$, $\sigma_u$, $\sigma_v$, and $\rho$ from $\mathrm{mid}(X_0)$ and $\mathrm{rad}(X_0)$ this way. Furthermore, using $u = \mathrm{mid}(\mathbf{x})$ and these estimates in (13), we can estimate the probability of an arbitrary interval $\mathbf{x}$ with (14).
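The parameter estimation and Eqs. (13)-(14) can be sketched in Python with only the standard library. The integration limits in prob_interval follow our reading of (14) (from 0 up to rad(x)); all names and sample data below are ours, hypothetical:

```python
from math import erf, sqrt

def fit_mid_rad(X):
    """Estimate mu_u, mu_v, sigma_u, sigma_v, rho of Eq. (11) from the
    midpoint and radius samples of an interval-valued dataset."""
    n = len(X)
    u = [(a + b) / 2 for a, b in X]   # midpoints
    v = [(b - a) / 2 for a, b in X]   # radii
    mu_u, mu_v = sum(u) / n, sum(v) / n
    s_u = sqrt(sum((x - mu_u) ** 2 for x in u) / n)
    s_v = sqrt(sum((x - mu_v) ** 2 for x in v) / n)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / n
    return mu_u, mu_v, s_u, s_v, cov / (s_u * s_v)

def prob_interval(x, params):
    """Estimate P(x) per Eqs. (13)-(14): the conditional normal of v
    given u = mid(x), integrated from 0 to rad(x) via the normal CDF."""
    mu_u, mu_v, s_u, s_v, rho = params
    m, r = (x[0] + x[1]) / 2, (x[1] - x[0]) / 2
    mu_c = mu_v + rho * (s_v / s_u) * (m - mu_u)   # conditional mean
    s_c = s_v * sqrt(1 - rho ** 2)                 # conditional std
    cdf = lambda t: 0.5 * (1 + erf((t - mu_c) / (s_c * sqrt(2))))
    return cdf(r) - cdf(0.0)

# Hypothetical sample data, for illustration only:
X = [(2.0, 3.0), (2.5, 7.0), (4.0, 6.0), (5.0, 8.0)]
print(prob_interval((2.0, 3.0), fit_mid_rad(X)))
```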
So far, we have established practical ways to calculate a point-valued variance, standard deviation, and probability distribution for an interval-valued dataset $\mathbf{X}$. With them, we are able to directly apply commonly available inferential decision making schemes to interval-valued datasets.
Information Entropy of Interval-Valued Datasets
While it is beyond the scope of this paper to discuss specific applications of inferential statistics on an interval-valued dataset, we are interested in measuring the amount of information in an interval-valued dataset. Information entropy is the average rate at which information is produced by a stochastic source of data [24]. Shannon introduced the concept of entropy in his seminal paper "A Mathematical Theory of Communication" [23]. The measure of information entropy associated with the possible data values is:

$$H = -\sum_{j} P(p_j)\,\log_2 P(p_j), \qquad (15)$$

where $P(p_j)$ is the probability of the $j$-th possible value $p_j$.
An interval-valued dataset $\mathbf{X} = \{\mathbf{x}_i : 1 \le i \le n\}$ divides the real axis into at most $2n + 1$ sub-intervals. Using $P$ to denote the partition and $p_j$ to specify its $j$-th element, we have $\bigcup_j p_j = (-\infty, \infty)$. As illustrated in Example 2, we can apply Algorithm 2 to find the pdf $f(x)$ of $\mathbf{X}$. Then, the probability of each $p_j$ is available as $P(p_j) = \int_{p_j} f(x)\,dx$. Hence, we can apply (15) to calculate the entropy of an interval-valued dataset $\mathbf{X}$. For the reader's convenience, we summarize the steps of finding the entropy of $\mathbf{X}$ as an algorithm below.
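Combining Algorithm 2's stair pdf with Eq. (15) gives a sketch of Algorithm 3 under the uniform distribution assumption (the function name and sample data are ours, hypothetical):

```python
from math import log2

def entropy(X):
    """Entropy of an interval-valued dataset per Eq. (15), assuming a
    uniform distribution on each interval (as in Example 2): partition
    the envelope at the 2n endpoints, take each piece's probability
    from the stair pdf, and accumulate -p * log2(p)."""
    n = len(X)
    pts = sorted({p for ab in X for p in ab})
    h = 0.0
    for lo, hi in zip(pts, pts[1:]):
        m = (lo + hi) / 2
        density = sum(1 / (b - a) for a, b in X if a <= m <= b) / n
        p = density * (hi - lo)   # probability of this sub-interval
        if p > 0:
            h -= p * log2(p)
    return h

# Hypothetical sample data, for illustration only:
X = [(2.0, 3.0), (2.5, 7.0), (4.0, 6.0), (5.0, 8.0)]
print(entropy(X))
```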
The example below finds the entropy of the sample dataset $\mathbf{X}_0$ under the same uniform distribution assumption as in Example 2.
Example 3
Equation (9) in Example 2 provides the pdf of $\mathbf{X}_0$. Applying it, we obtain the probability $P(p_j)$ of each sub-interval $p_j$ in the partition, as in (16). The entropy of $\mathbf{X}_0$ then follows from (15).
Algorithm 3 provides a much needed tool for studying the point-valued information entropy of an interval-valued dataset. Applying it, we have investigated the entropies of the real-world financial dataset used in the studies of stock market forecasts [6, 7] and [9], from the perspective of information theory. The results are reported in [12]. They not only reveal the deep reason for the significant quality improvements reported before, but also validate the concepts and algorithms presented in this paper as a successful application.
Summary and Future Work
Recent advances have shown that using interval-valued data can significantly improve the quality and efficiency of information processing and uncertainty management. This work establishes much needed concepts of point-valued variational statistics, probability, and entropy for interval-valued datasets. Furthermore, this paper contains practical algorithms to find these point-valued measures. It provides additional theoretical foundations for applying point-valued methods in analyzing interval-valued datasets.
These point-valued measures enable us to directly apply currently available, powerful point-valued statistical, probabilistic, and information-theoretic results to interval-valued datasets. Applying these measures in various applications is a high priority of our future work. In fact, using this work as the theoretical foundation, we have successfully analyzed the entropies of the real-world financial dataset related to the stock market forecasting mentioned in the introduction; the obtained results are reported in [12], published in the same volume as this paper. On the theoretical side, future work includes extending the concepts in this paper from one-dimensional to multi-dimensional interval-valued datasets.
Contributor Information
Marie-Jeanne Lesot, Email: marie-jeanne.lesot@lip6.fr.
Susana Vieira, Email: susana.vieira@tecnico.ulisboa.pt.
Marek Z. Reformat, Email: marek.reformat@ualberta.ca
João Paulo Carvalho, Email: joao.carvalho@inesc-id.pt.
Anna Wilbik, Email: a.m.wilbik@tue.nl.
Bernadette Bouchon-Meunier, Email: bernadette.bouchon-meunier@lip6.fr.
Ronald R. Yager, Email: yager@panix.com
Chenyi Hu, Email: chu@uca.edu.
References
- 1.Bentkowska U. New types of aggregation functions for interval-valued fuzzy setting and preservation of pos-B and nec-B-transitivity in decision making problems. Inf. Sci. 2018;424(C):385–399. doi: 10.1016/j.ins.2017.10.025. [DOI] [Google Scholar]
- 2.Billard L, Diday E. Regression analysis for interval-valued data. In: Kiers HAL, Rasson JP, Groenen PJF, Schader M, editors. Data Analysis, Classification, and Related Methods. Heidelberg: Springer; 2000. [Google Scholar]
- 3.Dai J, Wang W, Mi J. Uncertainty measurement for interval-valued information systems. Inf. Sci. 2013;251:63–78. doi: 10.1016/j.ins.2013.06.047. [DOI] [Google Scholar]
- 4.Gioia F, Lauro C. Basic statistical methods for interval data. Statistica Applicata. 2005;17(1):75–104. [Google Scholar]
- 5.Grabisch M, Marichal J, Mesiar R, Pap E. Aggregation Functions. New York: Cambridge University Press; 2009. [Google Scholar]
- 6.He L, Hu C. Midpoint method and accuracy of variability forecasting. J. Empir. Econ. 2009;38:705–715. doi: 10.1007/s00181-009-0286-6. [DOI] [Google Scholar]
- 7.He L, Hu C. Impacts of interval computing on stock market forecasting. J. Comput. Econ. 2009;33(3):263–276. doi: 10.1007/s10614-008-9159-x. [DOI] [Google Scholar]
- 8.Hu C, et al. Knowledge Processing with Interval and Soft Computing. London: Springer; 2008. [Google Scholar]
- 9.Hu C, He L. An application of interval methods to stock market forecasting. J. Reliable Comput. 2007;13:423–434. doi: 10.1007/s11155-007-9039-4. [DOI] [Google Scholar]
- 10.Hu, C.: Using interval function approximation to estimate uncertainty. In: Interval/Probabilistic Uncertainty and Non-Classical Logics, pp. 341–352 (2008). 10.1007/978-3-540-77664-2_26
- 11.Hu C. A note on probabilistic confidence of the stock market ILS interval forecasts. J. Risk Finance. 2010;11(4):410–415. doi: 10.1108/15265941011071539. [DOI] [Google Scholar]
- 12.Hu, C., and Hu, Z.: A computational study on the entropy of interval-valued datasets from the stock market. In: Lesot, M.-J., et al. (eds.) The Proceedings of the 18th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2020), IPMU 2020, CCIS, vol. 1239, pp. 422–435. Springer (2020)
- 13.Huynh, V., Nakamori, Y., Hu, C., Kreinovich, V.: On decision making under interval uncertainty: a new justification of Hurwicz optimism-pessimism approach and its use in group decision making. In: 39th International Symposium on Multiple-Valued Logic, pp. 214–220 (2009)
- 14.IEEE Standard for Interval Arithmetic. IEEE Standards Association (2015). https://standards.ieee.org/standard/1788-2015.html
- 15.IEEE Standard for Interval Arithmetic (Simplified). IEEE Standards Association (2018). https://standards.ieee.org/standard/1788_1-2017.html
- 16.de Korvin A, Hu C, Chen P. Generating and applying rules for interval valued fuzzy observations. In: Yang ZR, Yin H, Everson RM, editors. Intelligent Data Engineering and Automated Learning – IDEAL 2004; Heidelberg: Springer; 2004. pp. 279–284. [Google Scholar]
- 17.Lodwick W-A, Jamison K-D. Interval-valued probability in the analysis of problems containing a mixture of possibilistic, probabilistic, and interval uncertainty. Fuzzy Sets Syst. 2008;159(21):2845–2858. doi: 10.1016/j.fss.2008.03.013. [DOI] [Google Scholar]
- 18.Moore RE. Methods and Applications of Interval Analysis. Philadelphia: SIAM Studies in Applied Mathematics; 1979. [Google Scholar]
- 19.Marupally, P., Paruchuri, V., Hu, C.: Bandwidth variability prediction with rolling interval least squares (RILS). In: Proceedings of the 50th ACM SE Conference, Tuscaloosa, AL, USA, 29–31 March 2012, pp. 209–213. ACM (2012). 10.1145/2184512.2184562
- 20.Nordin, B., Hu, C., Chen, B., Sheng, V.S.: Interval-valued centroids in K-means algorithms. In: Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, pp. 478–481. IEEE (2012). 10.1109/ICMLA.2012.87
- 21.Pękala B. Uncertainty Data in Interval-Valued Fuzzy Set Theory: Properties, Algorithms and Applications. 1. Cham: Springer; 2018. [Google Scholar]
- 22.Rhodes, C., Lemon, J., Hu, C.: An interval-radial algorithm for hierarchical clustering analysis. In: 14th IEEE International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, pp. 849–856. IEEE (2015)
- 23.Shannon C-E. A mathematical theory of communication. Bell Syst. Tech. J. 1948;27:379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x. [DOI] [Google Scholar]
- 24.Wikipedia: Information entropy. https://en.wikipedia.org/wiki/Entropy_(information_theory)
- 25.Wolfram MathWorld. Bivariate normal distribution. http://mathworld.wolfram.com/BivariateNormalDistribution.html