MethodsX. 2025 Feb 5;14:103202. doi: 10.1016/j.mex.2025.103202

A novel hybrid CLARA and fuzzy time series Markov chain model for predicting air pollution in Jakarta

Nurtiti Sunusi a, Ankaz As Sikib a, Sumanta Pasari b
PMCID: PMC12370153  PMID: 40852567

Abstract

Air pollution poses a significant challenge to public health and the global environment. The Industrial Revolution, advancing technology and society, led to elevated air pollution levels, contributing to acid rain, smog, ozone depletion, and global warming. Poor air quality increases risks of respiratory inflammation, tuberculosis, asthma, chronic obstructive pulmonary disease (COPD), pneumoconiosis, and lung cancer.

In this context, developing reliable air pollution forecasting models is imperative for guiding effective mitigation strategies and policy interventions. This study presents a daily air pollution prediction model focusing on Jakarta's sulfur dioxide (SO₂) and carbon monoxide (CO) levels, leveraging a hybrid methodology that integrates Clustering Large Applications (CLARA) with the Fuzzy Time Series Markov Chain (FTSMC) approach.

The analysis revealed five distinct clusters, with medoid selection refined iteratively to ensure stabilization. A 5 × 5 Markov transition probability matrix was subsequently constructed for modeling the data. Predicted values for SO₂ and CO in Jakarta using the CLARA-FTSMC hybrid method showed strong alignment with the actual data. Forecasting accuracy results for SO₂ and CO in Jakarta, based on Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), showed excellent performance, underscoring the efficacy of the CLARA-FTSMC hybrid approach in predicting air pollution levels.

  • The CLARA-FTSMC hybrid method demonstrates high effectiveness in analyzing large datasets, addressing the limitations of previous hybrid clustering fuzzy time series methods.

  • The number of fuzzy time series partitions is optimally determined based on clustering results obtained through the gap statistic approach, ensuring robust partitioning.

  • The forecasting accuracy of the CLARA-FTSMC hybrid method, evaluated using MAE and RMSE, showed excellent performance in predicting daily air pollution levels of SO₂ and CO in Jakarta.

Keywords: Air pollution, CLARA, Clustering, Forecasting, Hybrid clustering fuzzy time series

Method name: Hybrid Clustering Large Applications and Fuzzy Time Series Markov Chain

Specifications table

Subject area: Mathematics and Statistics
More specific subject area: Statistics; Hybrid Clustering Fuzzy Time Series; Air Pollution
Name of your method: Hybrid Clustering Large Applications and Fuzzy Time Series Markov Chain
Name and reference of original method: Finding Groups in Data: An Introduction to Cluster Analysis (1991), Gentle JE, Kaufman L, Rousseeuw PJ, Biometrics, Vol. 47, 788 p.
A fuzzy time series-Markov chain model with an application to forecast the exchange rate between the Taiwan and US dollar (2012), Tsaur RC, Int J Innov Comput Inf Control, 8(7 B):4931–42.
Resource availability: The data utilized in this study were sourced from the official website of the DKI Jakarta Environment Agency (https://lingkunganhidup.jakarta.go.id/). The dataset comprises daily air pollution standard index records spanning from January 2021 to August 31, 2024, ensuring comprehensive coverage for the analysis period.

Background

Air pollution remains one of the most pressing challenges for public health and the global environment. Since the Industrial Revolution, significant advancements in technology, energy, and societal development have provided immense benefits to humanity. However, these advancements have also resulted in severe environmental consequences, particularly the escalation of air pollution. Air pollution contributes to numerous environmental issues, including acid rain, smog, ozone depletion, and global warming [1]. Moreover, poor air quality is strongly associated with various health risks, such as respiratory inflammation, tuberculosis, asthma, chronic obstructive pulmonary disease (COPD), pneumoconiosis, and lung cancer [2,3].

Air pollution is defined as the alteration of air composition due to the presence of hazardous substances, including particulate matter (PM), sulfur dioxide (SO₂), carbon monoxide (CO), nitrogen dioxide (NO₂), and heavy metals. According to the World Air Quality Report (2023), Indonesia ranks first as the most polluted country in Southeast Asia [4]. As of June 23, 2024, Jakarta recorded the second-highest air pollution levels globally, with an Air Quality Index (AQI) of 160 [5]. The primary drivers of air pollution in Jakarta include industrialization, fossil fuel combustion, mining activities, and the annual 10% increase in private vehicle ownership [6,7].

Although the Jakarta government has implemented various policies, such as promoting public transportation, deploying air quality task forces, and using parking fees as a disincentive, air quality continues to deteriorate. Therefore, developing accurate air pollution forecasting models is increasingly critical to support the formulation of effective mitigation policies.

One interesting approach in time series analysis is the use of fuzzy logic-based methods. Fuzzy logic has been shown to be effective in solving various practical problems, including the forecasting of time series data [8]. It can incorporate the uncertainty in time series data, which often arises from variations and external factors that classical analysis methods struggle to capture [9].

Fuzzy time series (FTS) was introduced by Song and Chissom [10] to predict enrollments at the University of Alabama. Since then, various FTS methods have been developed, such as the weighted [11], Chen [12], Markov [13], and multi-attribute [14] variants. One of the recent methods that has received attention in fuzzy-based time series analysis is the fuzzy time series Markov chain (FTSMC). FTSMC is an approach that combines fuzzy logic concepts with Markov chain models to forecast future values based on fuzzy partitions of time series data. The partitioning allows various states to occur in the time series, whereas the fuzzy concept allows us to measure the degree of membership of each state in each partition.

According to the studies in [15,16], FTSMC outperforms other FTS methods in terms of the MSE and MAPE metrics. However, FTS has some open issues: there is no definite formula for determining the number of partitions or the length of the intervals [17]. The relationship between the number of partitions and the interval length has been addressed in previous research [15,18]. In fact, the number of partitions and the length of the interval strongly affect the formation of the fuzzy logical relationship (FLR), resulting in differences in the accuracy of the forecasting results.

Therefore, the selection of the optimal number of partitions is an interesting problem that needs to be discussed. Some previous studies have incorporated clustering methods to optimize the partitions in the FTS method [19–22]. However, in determining the optimal partitioning, the k-means and k-medoids methods still have shortcomings, as they are less effective in analyzing large data compared to improved methods such as clustering large applications (CLARA).

CLARA is robust to large amounts of data and can cope with outliers [20]. Thus, incorporating CLARA in the FTSMC analysis stage results in a flexible method for forecasting large amounts of daily air pollution data. This research aims to develop a prediction model for daily sulfur dioxide (SO₂) and carbon monoxide (CO) air pollution in Jakarta, using a hybrid approach of Clustering Large Applications (CLARA) and Fuzzy Time Series Markov Chain (FTSMC). The model is expected to provide more accurate projections to support strategic decision-making in urban air pollution mitigation.

Method details

Fuzzy Time Series

The Fuzzy Time Series (FTS) method typically utilizes historical data in linguistic form [21]. The FTS process consists of defining the universe of discourse $U$, partitioning $U$ into several intervals, fuzzification, establishing fuzzy relationships, and defuzzification.

Definition 1

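Let $U=\{u_1,u_2,u_3,\dots,u_n\}$ be the universe of discourse, where $u_i$ $(i=1,\dots,n)$ represents the possible linguistic values within $U$. The linguistic fuzzy variable $A_i$ of $U$ is defined as:

$$A_i=\frac{f_{A_i}(u_1)}{u_1}+\frac{f_{A_i}(u_2)}{u_2}+\dots+\frac{f_{A_i}(u_n)}{u_n} \tag{1}$$

where $f_{A_i}$ is the membership function of the fuzzy set $A_i$, $f_{A_i}:U\to[0,1]$, $f_{A_i}(u_r)\in[0,1]$, and $1\le r\le n$.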

Definition 2

Let $Y(t)$ $(t=1,2,\dots,n)$ be a real-valued time series, where $Y(t)$ is defined over the fuzzy sets $f_i(t)$, $i=1,2,3,\dots,n$. Then $F(t)$ represents the fuzzy time series of $Y(t)$.

Definition 3

If $Y(t)=A_j$ is caused by $Y(t-1)=A_i$, then the fuzzy logical relationship (FLR) is expressed as $A_i\to A_j$.

Definition 4

If an FLR originates from state $A_2$ and transitions to other states $A_j$ $(j=1,2,3,\dots,n)$, such as $A_2\to A_3$, $A_2\to A_2$, $A_2\to A_1$, the FLRs are grouped into a fuzzy logical relationship group (FLRG) as follows:

$$A_2\to A_1,A_2,A_3 \tag{2}$$
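
Definitions 3 and 4 can be illustrated with a minimal Python sketch. The state sequence below is hypothetical (it is not taken from the Jakarta data), and the helper names `build_flr` and `build_flrg` are introduced here only for illustration.

```python
from collections import Counter, defaultdict

def build_flr(states):
    """Pair each state with its successor: one FLR (A_i -> A_j) per time step."""
    return list(zip(states[:-1], states[1:]))

def build_flrg(flrs):
    """Group FLRs by their originating state and count each destination state."""
    groups = defaultdict(Counter)
    for current, nxt in flrs:
        groups[current][nxt] += 1
    return groups

# Hypothetical fuzzified series; the integers 1..3 stand for A1..A3.
states = [2, 3, 2, 2, 1, 2]
flrs = build_flr(states)
flrg = build_flrg(flrs)

# Destinations reachable from A2, i.e. the right-hand side of Eq. (2).
print(sorted(flrg[2]))
```

For this toy sequence, state $A_2$ transitions to $A_1$, $A_2$, and $A_3$, which is the grouping written in Eq. (2). Keeping the counts rather than only the set of destinations is a design choice that lets the same structure feed the Markov transition probabilities later on.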

Fuzzification transforms numerical data into linguistic values, forming the FLR. This step requires determining the upper and lower bounds using the following equations [23,24]:

$$ub_i=\frac{\text{cluster center}_i+\text{cluster center}_{i+1}}{2} \tag{3}$$

Here $i=1,2,\dots,k$; $ub_i$ is the upper boundary of the $i$-th interval, which also serves as the lower boundary $lb_{i+1}$ of the next interval. For the first and last clusters, where no prior or subsequent centers exist, the lower bound $lb_1$ and upper bound $ub_k$ are computed as:

$$ub_k=\text{cluster center}_k+\left|\max(\text{data})-\text{cluster center}_k\right| \tag{4}$$
$$lb_1=\text{cluster center}_1-\left|\text{cluster center}_1-\min(\text{data})\right| \tag{5}$$

Fuzzy time series Markov chain

The transition probability matrix in the Markov chain is constructed as a $(p\times p)$ matrix, where $p$ represents the number of fuzzy sets [22]. The equation to determine the transition probability between states is as follows:

$$P_{ij}=\frac{r_{ij}}{r_i} \tag{6}$$

Where:

$P_{ij}$: Transition probability from state $A_i$ to $A_j$

$r_{ij}$: Number of transitions from state $A_i$ to $A_j$

$r_i$: Total number of data points in state $A_i$

The transition probability matrix $P$ can be expressed as:

$$P=\begin{bmatrix}P_{11}&P_{12}&\cdots&P_{1p}\\P_{21}&P_{22}&\cdots&P_{2p}\\\vdots&\vdots&\ddots&\vdots\\P_{p1}&P_{p2}&\cdots&P_{pp}\end{bmatrix}$$
  • 1) The initial forecast values are determined using the following rules:

Rule 1. If the fuzzy set $A_i$ has no fuzzy logical relationship (FLR), $(A_i\to\varnothing)$, and $Y(t-1)$ at time $t-1$ falls into $A_i$, then the forecast value $F(t)$ is $m_i$, where $m_i$ is the midpoint of interval $u_i$.
Rule 2. If the fuzzy logical relationship group (FLRG) of $A_i$ is a one-to-one relationship $(A_i\to A_q)$, and $Y(t-1)$ at time $t-1$ falls into state $A_i$, then the forecast value $F(t)$ is $m_q$, where $m_q$ is the midpoint of $u_q$ in the FLRG formed at time $t-1$.
Rule 3. If the FLRG of $A_i$ is a one-to-many relationship $(A_i\to A_1,A_2,A_3,\dots,A_q)$, and $Y(t-1)$ at time $t-1$ falls into state $A_i$, the forecast $F(t)$ is calculated as:
$$F(t)=m_1P_{i1}+m_2P_{i2}+\dots+m_{i-1}P_{i(i-1)}+Y(t-1)P_{ii}+m_{i+1}P_{i(i+1)}+\dots+m_nP_{in} \tag{7}$$
where $m_1,m_2,\dots,m_n$ are the midpoints of $u_1,u_2,\dots,u_n$, and $m_i$ is replaced with $Y(t-1)$ for state $A_i$ to improve forecasting accuracy.
  • 2) To improve the forecast accuracy, an adjustment is made by adding the difference between the actual value $Y(t)$ and the previous value, as follows:
    $$\hat{F}(t+1)=F(t+1)+\mathrm{diff}(Y(t)) \tag{8}$$

$\mathrm{diff}(Y(t))$ is the difference between the actual value $Y(t)$ at time $t$ and the previous actual value $Y(t-1)$:

$$\mathrm{diff}(Y(t))=\begin{cases}0,&\text{if } t=1\\Y(t)-Y(t-1),&\text{if } t\ge 2\end{cases} \tag{9}$$

where:

$Y(t)$ : Actual data at period $t$
$F(t)$ : Initial forecast result at period $t$
$\hat{F}(t)$ : Adjusted forecast result at period $t$
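
The three forecasting rules and the difference term of Eq. (9) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the midpoints and transition matrix are the CO values reported later in this article (decimal commas converted to points), and the function names `initial_forecast` and `diff` are hypothetical.

```python
# Assumed inputs: CO interval midpoints and CO transition matrix as reported
# later in this article (Table 10 and the matrix W_CO).
MIDPOINTS = [6.5, 10.75, 13.75, 18.5, 38.0]
W_CO = [
    [0.620, 0.190, 0.131, 0.050, 0.009],
    [0.217, 0.385, 0.332, 0.057, 0.008],
    [0.067, 0.232, 0.416, 0.211, 0.075],
    [0.020, 0.070, 0.303, 0.369, 0.238],
    [0.000, 0.016, 0.138, 0.198, 0.648],
]

def initial_forecast(y_prev, state, matrix=W_CO, midpoints=MIDPOINTS):
    """Initial forecast from the previous value y_prev lying in state (0-based)."""
    row = matrix[state]
    targets = [j for j, p in enumerate(row) if p > 0]
    if not targets:
        # Rule 1: no outgoing relationship, use the midpoint of the state itself.
        return midpoints[state]
    if len(targets) == 1:
        # Rule 2: one-to-one relationship, use the midpoint of the target state.
        return midpoints[targets[0]]
    # Rule 3, Eq. (7): weighted midpoints, with the midpoint of the current
    # state replaced by the previous observation.
    values = list(midpoints)
    values[state] = y_prev
    return sum(p * v for p, v in zip(row, values))

def diff(series, t):
    """Eq. (9) with t counted from 1: zero at t = 1, else Y(t) - Y(t-1)."""
    return 0.0 if t == 1 else series[t - 1] - series[t - 2]

print(round(initial_forecast(6, 0), 2))
print(round(initial_forecast(7, 0), 2))
```

With a previous CO value of 6 in state $A_1$, the sketch evaluates Eq. (7) to about 8.83, and with a previous value of 7 to about 9.45, in line with the first initial forecasts listed for CO in Table 14.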

Euclidean distance

Euclidean distance is a method for calculating the distance between points in Euclidean space, which is subsequently used to group these points into clusters based on their proximity [25].

$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_k-y_i)^2},\quad k=1,2,\dots,c \tag{10}$$

where:

$d(x,y)$: Euclidean distance between $x_k$ and $y_i$

$x_k$: the $k$-th cluster center value

$y_i$: the $i$-th actual data value $(i=1,2,\dots,n)$
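
As a small worked example, the sketch below computes the distance from one observation to each of five cluster centers and picks the nearest one. It assumes each observation is treated as an (SO₂, CO) pair; the medoids and the observation are the values reported later in Tables 5 and 6, and the helper name `euclidean` is introduced here for illustration.

```python
import math

# Initial medoids as (SO2, CO) pairs from Table 5, and the first
# observation (01/01/2021) from Table 6.
medoids = [(18, 9), (30, 11), (44, 12), (46, 20), (50, 22)]
observation = (29, 6)

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# One cost per medoid; the smallest cost is the "Proximity" column.
costs = [euclidean(observation, m) for m in medoids]
proximity = min(costs)
nearest_cluster = costs.index(proximity) + 1  # 1-based cluster label

print([round(c, 2) for c in costs], nearest_cluster)
```

The rounded costs are 11.40, 5.10, 16.16, 22.02, and 26.40, so the observation is assigned to cluster 2 with a proximity of 5.10, matching the first row of Table 6.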

Gap statistics

The Gap Statistics method is used to determine the optimal number of clusters. It achieves this by comparing the intra-cluster variation of the actual data with the expected variation from randomly generated data. The Gap Statistics method demonstrates higher accuracy when integrated with the CLARA algorithm, which efficiently incorporates large datasets through sampling. The Gap Statistic value is calculated using the following equation:

$$\mathrm{Gap}(k)=\frac{1}{B}\sum_{b=1}^{B}\log(W_{kb})-\log(W_k) \tag{11}$$

where:

$\mathrm{Gap}(k)$ : The gap statistic for a candidate number of clusters $k$
$B$ : The number of bootstrap samples used in the gap statistic method
$W_{kb}$ : The within-cluster dispersion for $k$ clusters in the $b$-th bootstrap sample
$W_k$ : The within-cluster dispersion for $k$ clusters in the original dataset
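
Eq. (11) reduces to a few lines of Python once the dispersions are available. The sketch below only evaluates the formula; computing $W_k$ and $W_{kb}$ requires running the clustering itself, which is omitted here. The dispersion values are hypothetical and the function name `gap_statistic` is an assumption made for illustration.

```python
import math

def gap_statistic(w_k, w_k_reference):
    """Eq. (11): mean log-dispersion over the B reference samples minus
    the log-dispersion of the observed data, for a single value of k."""
    b = len(w_k_reference)
    return sum(math.log(w) for w in w_k_reference) / b - math.log(w_k)

# Hypothetical dispersions for one k: observed W_k and B = 4 reference values.
print(gap_statistic(120.0, [300.0, 280.0, 310.0, 295.0]))
```

A larger gap means the observed data are clustered more tightly than the reference data for that $k$, which is why the statistic is evaluated across a range of candidate values before choosing the number of clusters.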

Clustering large applications

Clustering Large Applications (CLARA) utilizes medoids as cluster centers to group large-scale data and is robust against outliers [26]. CLARA divides large datasets into smaller subsets while ensuring optimal medoid selection. The sample size for each subset is determined using the following equation:

min(40+(2×K)) (12)

where:

K : Number of clusters.
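
The sample size rule of Eq. (12) and the drawing of one subsample can be sketched as follows. This is an illustration under stated assumptions: capping the size at the number of observations $n$ is an assumption for small datasets, the seed is arbitrary, and the helper names are hypothetical. The drawn indices will therefore not match the samples listed in this article.

```python
import random

def clara_sample_size(k, n):
    """Sample size 40 + 2K from Eq. (12), capped at n observations (assumed cap)."""
    return min(n, 40 + 2 * k)

def draw_sample(n, k, seed=0):
    """Draw sorted 1-based row indices for one CLARA subsample."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n + 1), clara_sample_size(k, n)))

# 1339 daily records and five clusters, as in this study.
sample = draw_sample(n=1339, k=5)
print(len(sample))
```

With $K=5$ the subsample contains 50 rows, so the medoid search runs on 50 observations instead of all 1339, which is what makes CLARA practical for large datasets.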

The fundamental principle of the partition around medoids (PAM) algorithm is to minimize the dissimilarity between objects within a cluster by iteratively swapping the medoid and non-medoid objects until convergence [27]. Typically, the process of finding a new medoid is repeated to achieve the best medoid with the smallest total distance representing the cluster. The formula for evaluating the medoid swap is:

$$S=\text{total Euclidean distance of the new medoid}-\text{total Euclidean distance of the old medoid} \tag{13}$$

Where:

If $S<0$, the medoid swap is repeated
If $S>0$, the iteration stops

Accuracy of the forecasting model

The accuracy of a forecasting model improves as the error value decreases. A lower error indicates higher accuracy and vice versa [28,29]. The formula to measure the accuracy of time series analysis results is as follows (Table 1):

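$$\mathrm{MAE}=\frac{1}{n}\sum_{t=1}^{n}\left|y_t-\hat{F}_t\right| \tag{14}$$
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t-\hat{F}_t\right)^2} \tag{15}$$

where $y_t$ is the actual value, $\hat{F}_t$ is the adjusted forecast, and $n$ is the number of forecasted periods.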

Table 1.

Model forecast accuracy criteria.

Accuracy value     Criterion
≤ 10               Excellent
10 < Value ≤ 20    Good
20 < Value ≤ 50    Fair
> 50               Poor
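
The two error measures and the Table 1 criteria can be written as a short Python sketch. The three-point series below is hypothetical and is not the article's data; the function names are introduced only for illustration.

```python
import math

def mae(actual, forecast):
    """Mean absolute error, Eq. (14)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean square error, Eq. (15)."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def criterion(value):
    """Map an error value to the label given in Table 1."""
    if value <= 10:
        return "Excellent"
    if value <= 20:
        return "Good"
    if value <= 50:
        return "Fair"
    return "Poor"

# Hypothetical actual values and forecasts.
actual = [10.0, 12.0, 9.0]
forecast = [11.0, 10.0, 9.0]
print(mae(actual, forecast), rmse(actual, forecast), criterion(mae(actual, forecast)))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and reacts more strongly to a few large misses, which is one reason to report both.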

Method validation

In forecasting sulfur dioxide and carbon monoxide air pollution in Jakarta using the hybrid CLARA and FTSMC, we utilized secondary data, specifically the daily air quality index from January 2021 to August 31, 2024. The dataset includes two dependent variables. The details of these variables are provided in Table 2.

Table 2.

Variable description.

Variable Description
y1 Daily sulfur dioxide air pollution data
y2 Daily carbon monoxide air pollution data

Descriptive analytics

Table 3.

Descriptive Analysis of SO₂ and CO Pollution in Jakarta from January 1, 2021 to August 31, 2024.

Minimum Maximum Mean
SO2 8 112 36,27
CO 3 55 15,12

Selection of optimal cluster number

Gap Statistic is particularly effective in the CLARA algorithm, which works with large datasets. It determines the optimal number of clusters by comparing the clustering results with expectations derived from random data, ensuring more accurate clustering outcomes (Fig. 1).

Fig. 1.

Determination of the optimal number of clusters using Gap Statistics, with a maximum of 15 clusters and bootstrapping of 100 iterations.

Clustering large applications analysis

Based on Eq. (12), the number of samples drawn from the actual data in the process of selecting the optimal medoid is determined as follows (Tables 4 and 6–9):

$$40+(2\times 5)=50$$

Table 4.

Samples of the CLARA algorithm in selecting the initial medoids, where each sample represents a data sequence from the actual SO2 and CO air pollution datasets.

[1] 19 21 48 77 92 125 150 178 219 270
[11] 274 287 290 352 364 368 369 382 412 418
[21] 465 529 578 712 766 802 809 863 867 899
[31] 918 928 968 995 1022 1026 1028 1064 1104 1116
[41] 1142 1153 1162 1175 1196 1208 1220 1233 1234 1332

Table 6.

Distance calculation between objects and initial medoids for SO₂ and CO air pollution data.

N Date SO₂ CO Cost1 Cost2 Cost3 Cost4 Cost5 Proximity
1 01/01/2021 29 6 11,40 5,10 16,16 22,02 26,40 5,10
2 02/01/2021 27 7 9,22 5,00 17,72 23,02 27,46 5,00
3 03/01/2021 25 7 7,28 6,40 19,65 24,70 29,15 6,40
4 04/01/2021 24 4 7,81 9,22 21,54 27,20 31,62 7,81
⋮
1336 28/08/2024 14 24 15,52 20,62 32,31 32,25 36,06 15,52
1337 29/08/2024 14 26 17,46 21,93 33,11 32,56 36,22 17,46
1338 30/08/2024 15 28 19,24 21,67 33,12 32,02 35,51 19,24
1339 31/08/2024 14 25 16,49 21,26 32,70 32,39 36,12 16,49
Cost Total 8.850,34

Table 7.

Samples of CLARA algorithm for new medoid selection, where each sample represents a data sequence on actual air pollution data for SO₂ and CO.

[1] 44 80 86 87 89 100 110 119 141 186
[11] 188 224 235 335 343 377 396 397 420 434
[21] 450 463 503 533 566 574 576 595 666 698
[31] 737 779 809 866 867 892 936 943 1032 1075
[41] 1102 1115 1121 1129 1178 1180 1205 1254 1332 1336

Table 8.

New medoids based on actual SO₂ and CO air pollution data samples.

Cluster  SO₂ medoid  CO medoid  Label
1        14          11         Very Low
2        25          12         Low
3        42          15         Moderate
4        49          24         High
5        51          25         Very High

Table 9.

Calculation of object distances to new medoids for SO2 and CO air pollution data.

N Date SO₂ CO Cost1 Cost2 Cost3 Cost4 Cost5 Proximity
1 01/01/2021 29 6 15,81 7,21 15,81 26,91 29,07 7,21
2 02/01/2021 27 7 13,60 5,39 17,00 27,80 30,00 5,39
3 03/01/2021 25 7 11,70 5,00 18,79 29,41 31,62 5,00
4 04/01/2021 24 4 12,21 8,06 21,10 32,02 34,21 8,06
⋮
1336 28/08/2024 14 24 13,00 16,28 29,41 35,00 37,01 13,00
1337 29/08/2024 14 26 15,00 17,80 30,08 35,06 37,01 15,00
1338 30/08/2024 15 28 17,03 18,87 29,97 34,23 36,12 17,03
1339 31/08/2024 14 25 14,00 17,03 29,73 35,01 37,00 14,00
Cost Total 9.531,82

Selection of the new medoid

Comparison of initial and new medoid total Euclidean distance

The total Euclidean distance of the initial medoid is 8.850,34, while the total Euclidean distance of the new medoid is 9.531,82. Applying Eq. (13) gives:

$$S=9.531,82-8.850,34=681,48$$

Since $S>0$, the iteration stops, and the initial medoid becomes the final medoid for the air pollution data of SO₂ and CO in Jakarta.
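
The stopping decision above can be reproduced with a minimal Python sketch of Eq. (13), using the two cost totals from Tables 6 and 9 with decimal commas converted to points. The function name `swap_cost` is hypothetical.

```python
def swap_cost(total_new, total_old):
    """Eq. (13): total distance under the new medoids minus that under the old ones."""
    return total_new - total_old

# Cost totals from Table 9 (new medoids) and Table 6 (initial medoids).
s = swap_cost(9531.82, 8850.34)

# A positive S means the new medoids fit worse, so the old ones are kept.
keep_old_medoids = s > 0
print(round(s, 2), keep_old_medoids)
```

Since the new medoids increase the total distance by 681.48, the swap is rejected and the initial medoids are retained.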

Medoid intervals

The medoid results from Table 5 serve as the basis for forming the intervals of the universe of discourse $U$ using Eqs. (3)–(5) (Table 10).

Table 5.

Initial medoids based on the sample data of SO₂ and CO air pollution.

Cluster  SO₂ medoid  CO medoid  Label
1        18          9          Very Low
2        30          11         Low
3        44          12         Moderate
4        46          20         High
5        50          22         Very High

Table 10.

Medoid intervals based on medoids in Table 5 for SO₂ and CO air pollution data.

Interval (SO₂) Midpoint (SO₂) Interval (CO) Midpoint (CO)
u1=[08,0;24,0) m1= 16,0 u1=[03,0;10,0) m1= 6,5
u2=[24,0;37,0) m2= 30,5 u2=[10,0;11,5) m2= 10,75
u3=[37,0;45,0) m3= 41,0 u3=[11,5;16,0) m3= 13,75
u4=[45,0;71,0) m4= 58,0 u4=[16,0;21,0) m4= 18,5
u5=[71,0;112,0) m5= 91,5 u5=[21,0;55,0) m5= 38,0
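
The construction of the intervals from the medoids can be sketched in Python for the CO series. This is a minimal illustration under stated assumptions: the CO medoids come from Table 5, the data range (3 to 55) from Table 3, and the helper names `interval_bounds` and `fuzzify` are introduced here. Values at the upper end of the range are assigned to the last interval.

```python
from bisect import bisect_right

def interval_bounds(centers, data_min, data_max):
    """Interval boundaries from sorted cluster centers using Eqs. (3)-(5)."""
    lower = centers[0] - abs(centers[0] - data_min)                # Eq. (5)
    inner = [(a + b) / 2 for a, b in zip(centers, centers[1:])]    # Eq. (3)
    upper = centers[-1] + abs(data_max - centers[-1])              # Eq. (4)
    return [lower] + inner + [upper]

def fuzzify(value, bounds):
    """Return the 1-based index of the interval [lb, ub) that contains value."""
    idx = bisect_right(bounds, value)
    # Clamp so that the minimum and maximum fall into the first and last interval.
    return min(max(idx, 1), len(bounds) - 1)

co_bounds = interval_bounds([9, 11, 12, 20, 22], data_min=3, data_max=55)
co_midpoints = [(lo + hi) / 2 for lo, hi in zip(co_bounds, co_bounds[1:])]

print(co_bounds)
print(co_midpoints)
print(fuzzify(6, co_bounds), fuzzify(25, co_bounds))
```

The sketch yields the CO boundaries 3, 10, 11.5, 16, 21, and 55 and the midpoints 6.5, 10.75, 13.75, 18.5, and 38, as listed for CO in Table 10. A CO value of 6 is mapped to $A_1$ and a value of 25 to $A_5$.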

Fuzzy time series Markov chain

Fuzzification, fuzzy logic relationship (FLR), and fuzzy logic relationship group (FLRG)

A fuzzy logical relationship (FLR) captures the relationship between consecutive fuzzy sets in the time series, as shown in Table 11. An FLRG accumulates the FLRs that originate from the same fuzzy set, which helps reveal historical data patterns for forecasting purposes, as shown in Table 12.

Table 11.

Fuzzification and FLR results for SO₂ and CO air pollution data.

N     Date        SO₂  Fuzzification  FLR      CO  Fuzzification  FLR
1     01/01/2021  29   A2             -        6   A1             -
2     02/01/2021  27   A2             A2→A2    7   A1             A1→A1
3     03/01/2021  25   A2             A2→A2    7   A1             A1→A1
4     04/01/2021  24   A2             A2→A2    4   A1             A1→A1
⋮
1336  28/08/2024  14   A1             A1→A1    24  A5             A5→A5
1337  29/08/2024  14   A1             A1→A1    26  A5             A5→A5
1338  30/08/2024  15   A1             A1→A1    28  A5             A5→A5
1339  31/08/2024  14   A1             A1→A1    25  A5             A5→A5

Table 12.

FLRG results for SO₂ and CO air pollution data.

K  FLRG (SO₂)                                    FLRG (CO)
1  A1 → (276)A1, (27)A2                          A1 → (137)A1, (42)A2, (29)A3, (11)A4, (2)A5
2  A2 → (27)A1, (218)A2, (11)A3, (5)A4           A2 → (53)A1, (94)A2, (81)A3, (14)A4, (2)A5
3  A3 → (1)A1, (13)A2, (217)A3, (49)A4           A3 → (25)A1, (87)A2, (156)A3, (79)A4, (28)A5
4  A4 → (2)A2, (52)A3, (438)A4, (1)A5            A4 → (5)A1, (17)A2, (74)A3, (90)A4, (58)A5
5  A5 → (1)A4                                    A5 → (4)A2, (35)A3, (50)A4, (164)A5

Transition probability matrix Markov

Based on the FLRG in Table 12, the next step is to form a 5 × 5 transition probability matrix using Eq. (6), where $W_{SO_2}$ is the transition probability matrix for sulfur dioxide and $W_{CO}$ is the transition probability matrix for carbon monoxide.

$$W_{SO_2}=\begin{bmatrix}0,911&0,089&0,000&0,000&0,000\\0,103&0,835&0,042&0,019&0,000\\0,004&0,046&0,775&0,175&0,000\\0,000&0,004&0,105&0,888&0,002\\0,000&0,000&0,000&1,000&0,000\end{bmatrix}$$

$$W_{CO}=\begin{bmatrix}0,620&0,190&0,131&0,050&0,009\\0,217&0,385&0,332&0,057&0,008\\0,067&0,232&0,416&0,211&0,075\\0,020&0,070&0,303&0,369&0,238\\0,000&0,016&0,138&0,198&0,648\end{bmatrix}$$
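
The step from FLRG counts to transition probabilities can be checked with a short Python sketch. It uses the CO counts from Table 12 and assumes that $r_i$ in Eq. (6) is the total number of transitions leaving state $A_i$; the function name `transition_matrix` is hypothetical.

```python
# FLRG counts for CO from Table 12; row i holds the number of transitions
# from A_i to A_1..A_5 (zero where no transition was observed).
co_counts = [
    [137, 42, 29, 11, 2],
    [53, 94, 81, 14, 2],
    [25, 87, 156, 79, 28],
    [5, 17, 74, 90, 58],
    [0, 4, 35, 50, 164],
]

def transition_matrix(counts):
    """Row-normalise transition counts: P_ij = r_ij / r_i, Eq. (6)."""
    matrix = []
    for row in counts:
        total = sum(row)
        # A state with no outgoing transitions keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

w_co = transition_matrix(co_counts)
for row in w_co:
    print(["%.3f" % p for p in row])
```

Rounded to three decimals, the rows reproduce the CO matrix above, for example 137/221 ≈ 0.620 for the transition from $A_1$ to $A_1$ and 164/253 ≈ 0.648 for the transition from $A_5$ to $A_5$.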

Hybrid CLARA FTSMC forecasting results for SO₂ and CO air pollution

Defuzzification is performed in two stages: initial forecasting and adjustment of the forecast. The results of the defuzzification process are shown in Tables 13 and 14.

Table 13.

Hybrid CLARA FTSMC forecasting results for SO₂ air pollution in Jakarta.

N Date SO₂ Fuzzification Initial Forecast(Ft) Final Forecast(Ft^)
1 01/01/2021 29 A2
2 02/01/2021 27 A2 56,45 49,45
3 03/01/2021 25 A2 45,53 57,53
4 04/01/2021 24 A2 58,89 48,89
⋮
1337 29/08/2024 14 A1 15,47 15,47
1338 30/08/2024 15 A1 15,47 15,47
1339 31/08/2024 14 A1 15,47 16,47
1340 01/09/2024 - - - 16,47

Table 14.

Forecast results for CO air pollution in Jakarta using Hybrid CLARA FTSMC.

N Date CO Fuzzification Initial Forecast(Ft) Final Forecast(Ft^)
1 01/01/2021 6 A1
2 02/01/2021 7 A1 8,83 9,83
3 03/01/2021 7 A1 9,45 9,45
4 04/01/2021 4 A1 9,45 6,45
⋮
1337 29/08/2024 26 A5 21,29 23,29
1338 30/08/2024 28 A5 22,58 24,58
1339 31/08/2024 25 A5 23,88 20,88
1340 01/09/2024 - - - 20,88

The SO₂ values predicted by the CLARA-FTSMC hybrid method agree well with the actual data. The predicted value for September 1, 2024 was 16,47, while the actual value reported by the DKI Jakarta Environment Agency for the same day was 12. In addition, the forecasting accuracy assessed from the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) is classified as excellent (Fig. 2).

Fig. 2.

Graph of actual data and forecasted SO₂ values using Hybrid CLARA FTSMC.

The CO values predicted by the CLARA-FTSMC hybrid method agree well with the actual data. The predicted value for September 1, 2024 was 20,88, while the actual value reported by the DKI Jakarta Environment Agency for the same day was 22. In addition, the forecasting accuracy assessed from the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) is classified as excellent (Fig. 3 and Table 15).

Fig. 3.

Graph of actual data and forecasted CO values using Hybrid CLARA FTSMC.

Table 15.

Forecast model accuracy for SO₂ and CO air pollution in Jakarta using Hybrid CLARA FTSMC.

SO₂
CO
MAE RMSE MAE RMSE
1,19 1,63 3,17 4,66

Conclusion

Based on the descriptive analysis of daily air pollution data for SO₂ and CO in Jakarta (as shown in Table 3), the SO₂ data from January 1, 2021 to August 31, 2024 recorded an average value of 36,27, with a minimum value of 8 and a maximum of 112. Similarly, the CO data showed an average value of 15,12, with a minimum value of 3 and a maximum of 55.

This research utilizes a hybrid methodology that combines Clustering Large Applications (CLARA) and Fuzzy Time Series Markov Chain (FTSMC). The gap statistic suggests five as the optimal number of clusters, with medoid selection completed at the initial medoids. This process generated a 5 × 5 Markov transition probability matrix to model the data.

Predicted values for SO₂ and CO using the CLARA-FTSMC hybrid method showed strong alignment with the actual data. For September 1, 2024, the predicted value of SO₂ was 16,47, while the actual value reported by the DKI Jakarta Environment Agency was 12. Similarly, for CO, the predicted value was 20,88, compared to the actual value of 22. In addition, the forecasting accuracy, evaluated by Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), was classified as excellent.

Limitations

  • 1. Model prediction accuracy is assessed using only Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).

  • 2. This research focuses on a single area in Indonesia, specifically air pollution forecasting in Jakarta, which limits the generalizability of the results to other areas that may have different environmental conditions.

  • 3. The forecasting approach relies primarily on the dependent variables and has not incorporated independent variables to identify factors that can influence the results. Including such variables could improve the accuracy and robustness of the predictions.

Ethics statements

The data utilized in this study were sourced from the official website of the DKI Jakarta Environment Agency (https://lingkunganhidup.jakarta.go.id/). The dataset comprises daily air pollution standard index records spanning from January 2021 to August 31, 2024.

CRediT authorship contribution statement

Nurtiti Sunusi: Conceptualization, Methodology, Software, Writing – original draft, Visualization. Ankaz As Sikib: Conceptualization, Methodology, Writing – review & editing, Validation, Supervision. Sumanta Pasari: Conceptualization, Methodology, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Footnotes

Related research article: None

For a published article: None

Data availability

Data will be made available on request.

References

  • 1. Kingsy G.R., Manimegalai R., Geetha D.M.S., Rajathi S., Usha K., Raabiathul B.N. Proceedings of the IEEE Region 10 Annual International Conference/TENCON. 2017. Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data; pp. 1945–1949. August 2006.
  • 2. Simkovich S.M., Goodman D., Roa C., Crocker M.E., Gianella G.E., Kirenga B.J., et al. The health and social implications of household air pollution and respiratory diseases. NPJ Prim. Care Respir. Med. 2019;29(1):1–17. doi: 10.1038/s41533-019-0126-x.
  • 3. Wu S., Ni Y., Li H., Pan L., Yang D., Baccarelli A.A., et al. Short-term exposure to high ambient air pollution increases airway inflammation and respiratory symptoms in chronic obstructive pulmonary disease patients in Beijing, China. Environ. Int. 2016;94:76–82. doi: 10.1016/j.envint.2016.05.004.
  • 4. IQAir. IQAir; 2023. World Air Quality Report 2023; pp. 1–45. https://www.iqair.com/id/world-most-polluted-countries
  • 5. IQAir. Ranking of the most polluted big cities directly [Internet]. 2024. Available from: https://www.iqair.com/id/world-air-quality-ranking
  • 6. Maung T.Z., Bishop J.E., Holt E., Turner A.M., Pfrang C. Indoor air pollution and the health of vulnerable groups: a systematic review focused on particulate matter (PM), volatile organic compounds (VOCs) and their effects on children and people with pre-existing lung disease. Int. J. Environ. Res. Public Health. 2022;19(14). doi: 10.3390/ijerph19148752.
  • 7. Muttaqin M.Z., Herwangi Y., Susetyo C., Sefrus T., Subair M. Public transport performance based on the potential demand and service area (case study: Jakarta Public Transport). Daengku J. Humanit. Soc. Sci. Innov. 2021;1(1):1–7.
  • 8. Zhang R., Ashuri B., Deng Y. A novel method for forecasting time series based on fuzzy logic and visibility graph. Adv. Data Anal. Classif. 2017;11(4):759–783.
  • 9. Cheng S.H., Chen S.M., Jian W.S. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics SMC 2015. 2016. A novel fuzzy time series forecasting method based on fuzzy logical relationships and similarity measures; pp. 2250–2254.
  • 10. Song Q., Chissom B.S. Forecasting enrollments with fuzzy time series–part I. Fuzzy Sets Syst. 1993;54(1):1–9.
  • 11. Yu H.K. Weighted fuzzy time series models for TAIEX forecasting. Phys. A Stat. Mech. Appl. 2005;349(3–4):609–624.
  • 12. Chen S.M. Forecasting enrollments based on fuzzy time series. Fuzzy Sets Syst. 2006;4287:324–336. LNCS.
  • 13. Sullivan J., Woodall W.H. A comparison of fuzzy forecasting and Markov modeling. Fuzzy Sets Syst. 1994;64(3):279–293.
  • 14. Cheng C.H., Cheng G.W., Wang J.W. Multi-attribute fuzzy time series method based on fuzzy clustering. Expert Syst. Appl. 2008;34(2):1235–1242.
  • 15. Alyousifi Y., Othman M., Sokkalingam R., Faye I., Silva P.C.L. Predicting daily air pollution index based on fuzzy time series Markov chain model. Symmetry (Basel). 2020;12(2):1–18.
  • 16. Ramadani K., Devianto D. 2020. The forecasting model of bitcoin price with fuzzy time series Markov chain and Chen logical method; p. 2296. (November)
  • 17. Zaenurrohman H.S., Udjiani T. Fuzzy time series Markov Chain and Fuzzy time series Chen & Hsu for forecasting. J. Phys. Conf. Ser. 2021;1943(1).
  • 18. Mubarrok M.N., Nuryanto U.W., Fika R., Adi P., Tanati A.E. Fuzzy time series Markov chain for Rice production forecasting. Bp. Int. Res. Crit. Inst. J. 2022;5(3):27148–27154.
  • 19. Vovan T., LT. A fuzzy time series model based on improved fuzzy function and cluster analysis problem. Commun. Math. Stat. 2022;10(1):51–66. doi: 10.1007/s40304-019-00203-5.
  • 20. Gentle J.E., Kaufman L., Rousseeuw P.J. Finding groups in data: an introduction to cluster analysis. Biometrics. 1991;47:788.
  • 21. Efendi R., Ismail Z., Deris M.M. A new linguistic out-sample approach of fuzzy time series for daily forecasting of Malaysian electricity load demand. Appl. Soft Comput. J. 2015;28:422–430. doi: 10.1016/j.asoc.2014.11.043.
  • 22. Li N., Kolmanovsky I., Girard A., Filev D. Fuzzy encoded Markov chains: overview, observer theory, and applications. IEEE Trans. Syst. Man Cybern. Syst. 2021;51(1):116–130.
  • 23. Dewi D.A., Surono S., Thinakaran R., Nurraihan A. Hybrid fuzzy K-medoids and cat and mouse-based optimizer for Markov Weighted fuzzy Time Series. Symmetry (Basel). 2023;15(8).
  • 24. Surono S., Goh K.W., Onn C.W., Nurraihan A., Siregar N.S., Saeid A.B., et al. Optimization of Markov weighted fuzzy time series forecasting using genetic algorithm (GA) and particle swarm optimization (PSO). Emerg. Sci. J. 2022;6(6):1375–1393.
  • 25. Alguliyev R.M., Aliguliyev R.M., Sukhostat L.V. Weighted consensus clustering and its application to big data. Expert Syst. Appl. 2020;150. doi: 10.1016/j.eswa.2020.113294.
  • 26. Gupta T., Panda S.P. Proceedings of the International Conference on Machine Learning, Big Data, Cloud and Parallel Computing. 2019. Clustering validation of CLARA and K-means using silhouette DUNN measures on iris dataset; pp. 10–13. Trends, Prespectives Prospect Com 2019.
  • 27. Arora P., Deepali V.S. Analysis of K-means and K-medoids algorithm for big data. Phys. Procedia. 2016;78(December 2015):507–512. doi: 10.1016/j.procs.2016.02.095.
  • 28. Hodson T.O. Root-mean-square error (RMSE) or mean absolute error (MAE): when to use them or not. Geosci. Model Dev. 2022;15(14):5481–5487.
  • 29. Sunusi N. Bias of automatic weather parameter measurement in monsoon area, a case study in Makassar coast. 2022;10(June):1–15.

Articles from MethodsX are provided here courtesy of Elsevier
