Analyzing and predicting global happiness index via integrated multilayer clustering and machine learning models

Boxu Yang; Xiang Xie

doi:10.1371/journal.pone.0322287

2025 Apr 30;20(4):e0322287. doi: 10.1371/journal.pone.0322287

Editor: Issa Atoum²

PMCID: PMC12043169 PMID: 40305477

Abstract

This study addresses the research objective of predicting global happiness and identifying its key drivers. We propose a novel predictive framework that integrates unsupervised and supervised machine learning techniques to uncover the complex patterns underlying happiness scores across nations. Initially, we apply K-Means clustering to group countries based on similarities in their happiness patterns. For the first time, these cluster assignments are subsequently incorporated as additional features into ensemble learning models—specifically, Random Forests and XGBoost—to enhance the prediction of happiness scores. This hierarchical analysis approach yields a significant improvement in predictive performance, with an approximate 12% increase in R² compared to models that do not include clustering information. Using data from the World Happiness Report, our analysis reveals that global happiness can be categorized into three distinct groups (high, medium, and low). Among the various determinants examined, social support and GDP emerge as the most influential factors contributing to the happiness index. These findings not only advance the methodological framework for predicting happiness but also provide robust evidence for policymakers seeking to implement targeted interventions aimed at improving public well-being and promoting social progress.

1 Introduction

1.1 Research background

Happiness is a complex and subjective concept that reflects the overall satisfaction of individuals and groups at both the socio-economic and psychological levels. The World Happiness Report, a global research project, has provided extensive data since 2012, covering variables such as GDP, social support, healthy life expectancy, freedom, generosity, and perceptions of corruption. These indicators form the foundation for quantifying the happiness index [1]. The happiness index has gradually become a core measure of a country’s socio-economic development, and its multi-dimensional nature makes its analysis important for both academic research and policy-making [2]. Richard Easterlin’s (1974) study first revealed that economic growth is not the only driver of happiness, prompting scholars to further explore its social and economic determinants [3]. In recent years, multi-level studies of happiness data using machine learning and clustering analysis have provided new perspectives on the intrinsic patterns of the happiness index. Although the happiness index holds a significant place in policy and social science research, its analytical methods still have deficiencies [4]. Traditional methods, such as linear regression, are limited in their ability to capture interactions between happiness variables, and the application of unsupervised learning methods, such as clustering analysis, is still at an exploratory stage. For example, Chakraborty and Tsokos (2021) used K-Means clustering to uncover patterns in national happiness indices but did not examine the specific causal relationships between variables [5]. The COVID-19 pandemic has had a significant impact on global happiness, particularly through changes in social support and mental health [6].
Research has shown that happiness data during the pandemic experienced abnormal fluctuations, highlighting the need for the study of the happiness index to dynamically adapt to emerging social challenges. As globalization accelerates and social inequality intensifies, cross-national analysis and prediction of the happiness index have become particularly important [7]. This study aims to combine clustering analysis with machine learning models to explore the intrinsic patterns of the happiness index through multi-level analysis, providing a scientific basis for policy-making.

1.2 Literature review

1.2.1 Application of clustering analysis in happiness index research.

Clustering analysis is an unsupervised learning technique that discovers the potential group characteristics in data and is widely used in social and economic research. This method can classify groups according to specific indicators, thereby revealing the characteristic differences between different categories. Rendón et al. (2011) pointed out that the validity indicators in clustering analysis, such as cohesion and separation, are important tools for evaluating the quality of clustering [8]. In addition, the NbClust tool developed by Charrad et al. (2014) can automatically select the number of clusters through multiple validity indices [9], providing strong support for the analysis of complex social data.

Although clustering analysis is widely used in the social and economic fields [10], in-depth analyses of the happiness index remain relatively rare. Some studies have attempted to identify patterns in the happiness index through K-Means clustering, but these have often stopped at descriptive classification and have not analyzed the interactions of variables between groups. For example, Oswald et al. (2015) emphasized the correlation between happiness and productivity and suggested combining clustering results with economic and social data to explore the driving factors of happiness [11]. In addition, Freeman and Di Tella (2006) pointed out that clustering analysis should be combined with causal inference tools to further enhance its explanatory power [12].

1.2.2 Application of machine learning models in happiness index prediction.

Machine learning models, due to their strong predictive ability, are widely used in the analysis and prediction of social variables. The Random Forest algorithm proposed by Breiman (2001) has good interpretability and robustness and has been proven to be suitable for modeling nonlinear and high-dimensional data [13]. In addition, the XGBoost algorithm developed by Chen and Guestrin (2016) has high efficiency and robustness to data distribution and has become a popular tool in social and economic research [14].

In the research of the happiness index, machine learning technology has begun to emerge. For example, Howell (2008) revealed the positive correlation between economic status and subjective well-being through the regression tree model [15]. Other studies have shown that Random Forest and XGBoost perform well in predicting the happiness index and can quantify the importance of variables, such as the core role of GDP and social support in happiness. However, most of the existing literature focuses on a single method and rarely integrates clustering analysis and machine learning models for multi-level comprehensive research.

Current research has revealed the complementarity of clustering analysis and machine learning models in social and economic research, but in the research of the happiness index, the integrated application of the two is still insufficient. Especially in the dynamic analysis of the happiness index, existing research mainly focuses on static data and ignores the time dimension and the complex interaction between variables [16]. Therefore, it is necessary to further combine clustering analysis and machine learning to explore the multidimensional characteristics and dynamic changes of the happiness index.

1.2.3 Gaps and innovations.

Despite extensive research utilizing either clustering analysis or machine learning prediction methods for happiness index studies, most prior work has been limited to a single methodological approach. Some studies have employed clustering techniques, such as K-Means, solely for descriptive classification of national happiness characteristics, thereby revealing only superficial group differences. Conversely, other studies have independently applied machine learning models—such as Random Forests and XGBoost—to predict the happiness index and explore the relationships between economic and social variables and happiness. The limitation of using clustering methods in isolation lies in their inability to capture the complex interdependencies among variables, while standalone prediction models may suffer from a lack of higher-order structural information.

This study introduces, for the first time, a two-stage “clustering–prediction” framework designed to fully leverage the strengths of both approaches. Specifically, the first stage employs clustering analysis to perform multi-level segmentation of global happiness data, thereby revealing differences across countries or regions in terms of happiness levels, sustainable development, and cultural values. In the second stage, the clustering results are incorporated as new features into machine learning prediction models to construct a hierarchical prediction framework. Ablation experiments demonstrate that this framework significantly improves prediction accuracy, effectively uncovering the multidimensional dynamic characteristics of the happiness index and the complex interactions among variables.

This integrated approach not only compensates for the limitations of traditional single-method studies in static description and prediction, but also, by incorporating time series data, explores the dynamic evolution of the happiness index—addressing the current literature’s deficiency in temporal analysis. In summary, the innovative contribution of this study lies in the development and validation of a hierarchical integrated framework that combines clustering analysis with machine learning prediction, thereby providing novel theoretical and empirical support for in-depth happiness index research and related policy formulation.

1.3 Research objectives

The happiness index, as an important indicator for measuring national social and economic development, provides a comprehensive evaluation of social progress and quality of life. Existing research has limitations in theory and method in revealing the national differences in the happiness index and the key variables affecting its changes. This study combines clustering analysis and machine learning models to explore the multidimensional characteristics of the happiness index. The specific objectives are as follows:

Objective 1: Explore the national differences in the happiness index through K-Means clustering analysis.

Based on the multi-dimensional data of the World Happiness Report, the K-Means clustering method is used to classify the happiness indexes of different countries to reveal the significant characteristic differences in the happiness of countries. This study will combine the data distribution characteristics and the best clustering number determination techniques (elbow method and silhouette coefficient, etc.) to achieve the effective classification of high-happiness, medium-happiness, and low-happiness country groups, thus laying the foundation for subsequent in-depth analysis [17].

Objective 2: Construct a machine learning model to identify the key variables affecting the happiness index.

Using mainstream machine learning algorithms such as Random Forest and XGBoost, a happiness index prediction model is constructed to quantitatively evaluate the relative importance of the variables affecting the happiness index. The analysis focuses on variables such as GDP, social support, and healthy life expectancy, and compares the predictive performance of the different algorithms using model performance indicators such as mean squared error (MSE) and the coefficient of determination (R²) [18].

Objective 3: Propose accurate prediction and policy suggestions.

By combining the research results of clustering analysis and machine learning models, this study will summarize the main driving factors of the happiness index and put forward targeted policy suggestions, especially providing practical paths for low and medium-happiness countries to improve their happiness levels. This objective aims to achieve the combination of academic research and policy application and further promote the practical significance and value of happiness index research.

2 Data and methods

2.1 Data source and sample

The data for this study come from the World Happiness Report (2020–2024), which is published by the United Nations Sustainable Development Solutions Network (SDSN) and covers social and economic indicators for 156 countries and regions [19]. The report draws on questionnaire results from the Gallup World Poll and provides a systematic framework for measuring the happiness index. The happiness index is based on the Cantril Ladder method, in which respondents rate their current living conditions [20] on a scale from 0 (the lowest quality of life) to 10 (the highest quality of life).

The variables used in the study cover economic, social, health, governance, and other aspects, specifically including:

Economic production (GDP per capita): Per capita gross domestic product, reflecting the economic level.
Social support: Measures whether an individual can rely on others when in need of help.
Healthy life expectancy: Per capita healthy life expectancy calculated based on WHO data.
Freedom to make life choices: Reflects the degree of freedom of an individual in choosing the direction of life.
Generosity: Based on whether there has been a donation to a charity or helping others in the past month.
Perceptions of corruption: Measures the respondents’ perception of the degree of corruption in the government and business environment.
Dystopia Residual: Used as a benchmark score to adjust the part of the actual data that is not explained by the above variables.

The data preprocessing includes the following steps:

Missing value handling: For a small amount of missing data, the multiple imputation method is used to reduce bias [21].
Standardization: Variables measured on different scales are normalized so that each contributes comparably to subsequent analysis.
Time series arrangement: Since this study involves data from the most recent five years, the data for each year are integrated to ensure the continuity and consistency of the analysis.
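
As a minimal sketch of the standardization step, the following Python snippet rescales one variable to zero mean and unit variance (z-scores). The paper does not state which normalization was applied, so the z-score choice, the function name `standardize`, and the sample values are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores for one numeric column (population standard deviation)."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical GDP-per-capita component values for four countries.
gdp = [1.38, 0.54, 1.04, 0.90]
z = standardize(gdp)
```

After this transformation every variable has the same scale, so no single variable dominates the distance calculations used later in clustering.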

The study will explore the national differences in the happiness index through the systematic analysis of these variables and construct a prediction model to identify the key driving factors.

2.2 Research methods

2.2.1 Clustering analysis.

To explore the national differences in the happiness index, this study uses the K-Means clustering method to conduct unsupervised learning analysis on the data of the World Happiness Report. Through clustering analysis, this study can reveal the distribution pattern of the happiness index and provide data stratification support for the subsequent prediction model. The specific steps are as follows:

Preprocessing and standardization: Normalize each variable to reduce the impact of variable scale differences on the clustering results.
Determination of the number of clusters: Four methods are used to determine the best number of clusters (the elbow method, the silhouette method, the Hartigan index, and the Gap statistic). The elbow method, for example, examines how the total sum of squared errors changes with the number of clusters and selects the point beyond which adding clusters yields little further reduction [22].
Clustering execution: Use the K-Means algorithm to cluster the multi-dimensional happiness index data of 156 countries into high, medium, and low happiness groups.
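
The clustering step can be illustrated with a minimal pure-Python version of the standard K-Means iteration (Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until the assignments stop changing. This is a sketch rather than the implementation used in the study; the function names, the explicitly supplied initial centroids, and the toy two-dimensional data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_centroids, n_iter=100):
    """Run Lloyd's algorithm from the given initial centroids."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    return labels, centroids


# Toy data: three well-separated groups (values are illustrative only).
points = [
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 4.9],
    [10.0, 10.0], [10.1, 9.9], [9.8, 10.2],
]
labels, centroids = kmeans(points, [points[0], points[3], points[6]])
```

Practical implementations usually choose the initial centroids randomly or with a seeding scheme and repeat the run several times, because K-Means can converge to a local optimum that depends on the starting point.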

2.2.2 Prediction model.

Based on the Random Forest and XGBoost algorithms, a happiness index prediction model is constructed to evaluate the relative importance and prediction ability of each variable on the happiness index.

2.2.3 Model validation.

To evaluate the model performance and verify its prediction ability, this study uses the following indicators [23]:

Mean square error (MSE): Measures the average of the squared differences between the predicted values and the actual values. The smaller the value, the better the model performance.
Coefficient of determination (R²): Evaluates the goodness of fit of the model. Values usually lie between 0 and 1, and the closer the value is to 1, the stronger the model’s explanatory ability.

The definitions, together with the related mean absolute error (MAE) and root mean square error (RMSE), are as follows:

\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y_{i} - y_{i}^{'} \right|

(1)

\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_{i} - y_{i}^{'} \right)^{2}

(2)

\mathrm{RMSE} = \sqrt{\mathrm{MSE}}

(3)

R^{2} = 1 - \frac{\sum_{i=1}^{m} (y_{i} - y_{i}^{'})^{2}}{\sum_{i=1}^{m} (y_{i} - \bar{y})^{2}}

(4)

where $m$ is the number of samples in the test data, $y_{i}^{'}$ is the predicted value, $y_{i}$ is the corresponding actual value of the $i$-th sample, and $\bar{y}$ is the mean of the actual observed values. $R^{2}$ measures how well the regression model fits the observed data. Its value usually lies between 0 and 1, although it can be negative on test data when the model predicts worse than the mean of the observations. When $R^{2}$ is close to 1, the model fits the data well and explains most of the variation in the dependent variable; when $R^{2}$ is close to 0, the fit is poor and the model’s explanatory ability for the dependent variable is weak.
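
The four indicators can be computed directly from their definitions. The sketch below is illustrative only: the function name `regression_metrics` and the sample scores are hypothetical and do not come from the study's data.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 from paired actual/predicted values."""
    m = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / m
    mse = sum(e * e for e in errors) / m
    rmse = sqrt(mse)
    y_bar = sum(y_true) / m
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical happiness scores: actual values and model predictions.
actual = [7.0, 5.0, 4.0, 6.0]
predicted = [6.5, 5.5, 4.0, 6.0]
metrics = regression_metrics(actual, predicted)
```

For these toy values the residual sum of squares is 0.5 and the total sum of squares is 5.0, so $R^{2} = 1 - 0.5/5.0 = 0.9$.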

2.2.4 Research framework process.

The process of the research framework of this paper is shown in Fig 1.

3 Statistical analysis

3.1 Data description

Based on the data of the World Happiness Report, the distribution of the happiness index and its related variables of 156 countries from 2020 to 2024 was analyzed. In most parts of Europe, especially in Nordic countries such as Norway, Sweden, and Finland, the happiness scores are relatively high. Canada and the United States in North America also have relatively high happiness indexes. The situation in Asia is more complex. Some countries, such as China, show a transitional state, while countries such as India show relatively low happiness scores. In Africa, the scores are more diverse, reflecting the large differences in the happiness index. Some countries in South America, such as Argentina, have high scores, while others have low scores. Australia and New Zealand in Oceania mainly have a relatively high level of happiness.

When the 156 countries are classified by continent, Fig 2 is obtained. It can be clearly seen from the bar chart that Europe has the highest score of 7.58. Oceania has the second-highest score of 7.27. The Americas has a score of 6.63. Asia has a score of 6.19. Africa has a relatively low score of 4.77.

When analyzing the data by year, a box plot (violin plot) of the happiness scores from 2020 to 2024 is obtained [24]. It can be seen from Fig 3 that the distribution of happiness scores in each year is roughly similar, mainly concentrated between 5 and 7. The central tendency of the happiness scores has increased after COVID-19, from 4.7 in 2020 to 5.8 in 2024.

As shown in the heat map in Fig 4, Social support, Score, GDP per capita, and Healthy life expectancy are strongly and positively correlated with one another (deep red cells, correlation coefficients above 0.7), indicating that these factors are closely related and may jointly affect the overall level of happiness or development [25]. The variables were screened using the heat map and the variance inflation factor (VIF < 5) to exclude multicollinearity. Generosity is only weakly correlated with the other variables, as shown by its lighter color. Freedom to make life choices and Perceptions of corruption are also positively correlated with the other variables, but less strongly than the first group. In the hierarchical clustering of the variables, Social support, Score, GDP per capita, and Healthy life expectancy lie close together in the clustering tree, indicating that they are similar in the data structure and can be grouped into one category [26]. Generosity, Freedom to make life choices, and Perceptions of corruption are relatively independent but still connected to the first category. Overall, the heat map shows the relationships and structure among the variables through both color and the clustering tree. Fig 4 also shows that the variables most strongly correlated with the happiness score are Social support and Healthy life expectancy, with correlation coefficients of 0.79 and 0.78 respectively. The least correlated variable is Generosity, with a coefficient of only 0.08. This provides key guidance for selecting variables when predicting happiness scores with machine learning.
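
Each cell of such a heat map is a Pearson correlation coefficient. The sketch below shows how one coefficient is computed, and how a VIF follows from it in the special case of exactly two predictors, where $\mathrm{VIF} = 1/(1 - r^{2})$. The function names and sample values are hypothetical; with more than two predictors the VIF requires a regression of each predictor on all the others, which is not shown here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def vif_from_pairwise_r(r):
    """VIF for the two-predictor case only, where R^2 equals r squared."""
    return 1.0 / (1.0 - r ** 2)


# Hypothetical values for two variables across four countries.
social_support = [1.0, 2.0, 3.0, 4.0]
score = [2.0, 4.0, 6.0, 8.0]
r = pearson(social_support, score)
```

In the two-predictor case a pairwise correlation of 0.7 corresponds to a VIF of about 1.96, which is below the threshold of 5.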

3.2 Clustering analysis

Clustering analysis, as a common method in unsupervised learning, aims to automatically divide data into different groups through similarity measurement. In social science research, clustering analysis is widely used to explore the characteristic differences of different groups. For example, countries around the world can be grouped according to their happiness indexes. Through this method, researchers can identify countries with higher and lower happiness levels and their key characteristic differences in social, economic, and psychological aspects.

3.2.1 Determination of the optimal number of clusters.

When conducting clustering analysis, it is first necessary to determine how many classes the data will be divided into, that is, the number of clusters. The K-Means algorithm is one of the most commonly used clustering methods. It optimizes the clustering result by minimizing the distance from the data points in each cluster to the cluster centre. However, a key limitation of K-Means is that the number of clusters K must be specified in advance. Choosing K is somewhat subjective and also depends on the measure used to calculate similarity and the parameters used for partitioning [27]. Nearly 30 methods exist for determining the optimal number of clusters. This study uses four of the most popular ones, namely the elbow method, the silhouette method, the Hartigan index, and the Gap statistic [22]. As shown in Fig 5, the final optimal number of clusters is 3.

For example, the core idea of the elbow method is that as the number of clusters K increases, the total sum of squared errors of the clustering results will gradually decrease. However, when K reaches a certain threshold, increasing K will no longer significantly reduce the error [22]. By drawing the relationship graph between K and SSE, the “elbow” position, that is, the position where the rate of error reduction begins to slow down, can be observed and used as the selection criterion for the optimal number of clusters.
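
The elbow is normally read from the plot, but one simple way to locate it numerically is to pick the K at which the SSE curve bends most sharply, measured by the discrete second difference. The sketch below uses this heuristic, which is an assumption for illustration and not necessarily the procedure followed in the study; the SSE values are hypothetical.

```python
def elbow_by_second_difference(sse_by_k):
    """Return the K with the largest second difference of the SSE curve.

    Assumes the keys of sse_by_k are consecutive integers.
    """
    ks = sorted(sse_by_k)
    best_k, best_curvature = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Large positive value: steep drop before K, flat curve after K.
        curvature = sse_by_k[prev_k] - 2 * sse_by_k[k] + sse_by_k[next_k]
        if curvature > best_curvature:
            best_k, best_curvature = k, curvature
    return best_k


# Hypothetical SSE values for K = 1..6 (not taken from the paper).
sse = {1: 1000.0, 2: 600.0, 3: 300.0, 4: 260.0, 5: 235.0, 6: 220.0}
elbow_k = elbow_by_second_difference(sse)
```

With these values the second differences at K = 2, 3, 4, 5 are 100, 260, 15, and 10, so the heuristic selects K = 3.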

3.2.2 Clustering analysis process.

Before conducting clustering analysis, we first standardized the data to eliminate the influence of dimensional differences between different variables. The standardized data ensures that the contribution of each variable to clustering is relatively balanced [28]. Based on the determined optimal number of clusters K = 3, we divided all countries into three categories: high happiness index group, medium happiness index group, and low happiness index group. The differences in happiness index within each group are relatively small, while significant differences are shown between different groups. The clustering result graph in Fig 6 projects the multi-dimensional data into three clusters formed by k-means clustering, revealing significant differences in social and economic characteristics between different groups. For example, the high happiness index group usually has a high GDP, per capita healthy life expectancy, and social support level, while the low happiness index group often faces low economic development and social welfare guarantees.

3.2.3 Key characteristic analysis between clustering groups and variables.

As shown in Table 1, the three clusters differ markedly in several key characteristics. First, the happiness score (Score) is an important indicator for distinguishing the clusters. The happiness score of Cluster 1 (high happiness group) is the highest, reaching 7.064, which is well above that of Cluster 3 (medium happiness group, 5.748) and Cluster 2 (low happiness group, 4.269). Cluster 1 therefore represents countries with higher happiness, Cluster 2 represents countries with lower happiness, and Cluster 3 is at an intermediate level. Second, GDP per capita shows particularly prominent differences between clusters. The GDP per capita of Cluster 1 is 1.379, which is higher than that of Cluster 3 (1.040) and Cluster 2 (0.537), indicating that the countries in Cluster 1 have stronger economies, while those in Cluster 2 are relatively weak. The difference in social support likewise points to a connection between the economy and social security. The social support of Cluster 1 is 1.488, higher than that of Cluster 3 (1.337) and much higher than that of Cluster 2 (0.936), indicating that the countries in Cluster 1 perform well in their social security systems and mutual assistance networks.

Table 1. Final Clustering Situation.

Variable	Cluster 1 (high)	Cluster 2 (low)	Cluster 3 (medium)
Score	7.064	4.269	5.748
GDP per capita	1.379	0.537	1.040
Social support	1.488	0.936	1.337
Healthy life expectancy	0.986	0.497	0.821
Freedom to make life choices	0.524	0.321	0.403
Generosity	0.245	0.190	0.157
Perceptions of corruption	0.241	0.094	0.074

The healthy life expectancy shows a similar trend to social support. The healthy life expectancy of Cluster 1 is 0.986, which is significantly higher than that of Cluster 3 (0.821) and Cluster 2 (0.497), showing that the residents in Cluster 1 have a higher health level. The freedom to make life choices is also a significant distinguishing factor between clusters. The freedom of Cluster 1 is 0.524, which is relatively high, reflecting the advantages of this type of country in civil liberties, while Cluster 3 (0.403) and Cluster 2 (0.321) show relatively low freedom. In general, the countries in Cluster 1 usually have high economic, social support, health levels, and freedom, while the countries in Cluster 2 face lower happiness, economic levels, social support, and healthy life expectancy, and Cluster 3 is in between, showing relatively balanced characteristics.

3.2.4 Analysis of the correlation matrix after clustering.

As shown in Fig 7, from the perspective of correlation, Score has a strong positive correlation with variables such as GDP per capita, Social support, and Healthy life expectancy. For example, the correlation coefficient between Score and GDP per capita is relatively high and significant, indicating that the economic level is closely related to the happiness score. Social support also has a strong positive correlation with Healthy life expectancy. In terms of the distribution of each variable and clustering, taking GDP per capita as an example, Cluster 1 is more distributed in the area with higher per capita GDP, while Cluster 2 and Cluster 3 are relatively less, showing the economic strength advantage of Cluster 1. Similarly, variables such as Social support also show differences in distribution among different clusters. Cluster 1 is more concentrated in the high-value area of each variable, while Cluster 2 and Cluster 3 are more dispersed or concentrated in the low-value area. Overall, it clearly shows the correlation between variables and the distribution characteristics of each cluster in different variables, providing more sufficient reference for the machine learning prediction of the study. [29]

3.3 Prediction model

3.3.1 Prediction model building.

After determining the main group characteristics of the happiness index, this study constructs a prediction model based on multiple key variables, mainly using Random Forest and XGBoost algorithms. Through these machine learning models, we can quantitatively evaluate the impact of variables such as GDP, social support, and healthy life expectancy on the happiness index and provide predictions of the future happiness index.

Random Forest is an ensemble learning method based on decision trees. It constructs multiple decision trees and combines their outputs to determine the final prediction, by majority vote for classification and by averaging for regression tasks such as the one in this study [30]. This method has strong predictive ability, can handle nonlinear problems, and is less prone to overfitting than a single decision tree. In this study, Random Forest is used to predict the happiness index, and grid search is used to tune hyperparameters such as the number of trees (n_estimators) and the maximum depth (max_depth).

XGBoost is an efficient gradient boosting tree algorithm with strong computing performance and robustness, especially suitable for processing large amounts of data and complex nonlinear problems [31]. XGBoost builds its prediction by fitting many weak learners (decision trees) in sequence, each one correcting the errors of the ensemble so far, and combining their weighted outputs. In this study, XGBoost is used to analyze the influence of factors such as GDP, social support, and healthy life expectancy on the happiness index, and the optimal model parameters are selected through cross-validation.
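
In the proposed framework the cluster assignments are added to the predictors before either model is fitted. The paper does not specify how the assignments are encoded, so the sketch below assumes one-hot indicator columns, which is one common choice. The function name `add_cluster_features` is hypothetical, and the two feature rows simply reuse cluster-centre values from Table 1 for illustration.

```python
def add_cluster_features(rows, labels, k):
    """Append k one-hot cluster indicator columns to each feature row."""
    augmented = []
    for row, label in zip(rows, labels):
        one_hot = [1.0 if j == label else 0.0 for j in range(k)]
        augmented.append(list(row) + one_hot)
    return augmented


# Illustrative rows: [GDP per capita, Social support].
features = [[1.379, 1.488], [0.537, 0.936]]
cluster_labels = [0, 1]  # zero-based cluster indices (assumed convention)
augmented = add_cluster_features(features, cluster_labels, k=3)
```

The augmented matrix can then be passed to any regressor in place of the original features. An ablation that compares models trained with and without the extra columns isolates the contribution of the clustering information.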

3.3.2 Performance evaluation and model comparison.

To evaluate the prediction effect and accuracy of the model, this study uses the following common performance evaluation indicators:

Mean square error (MSE): MSE is a commonly used indicator to measure the difference between the predicted value and the actual value. The smaller its value, the higher the prediction accuracy of the model. In this study, MSE is used to evaluate the prediction ability of different models for the happiness index.
Coefficient of determination (R²): R² measures the goodness of fit of the model, with values usually between 0 and 1. The closer it is to 1, the stronger the model’s explanatory power for the data. Comparing the R² values of different models shows which model better explains the variation in the happiness index.

Table 2 shows that the Random Forest model performs best, with the lowest MSE (0.219) and the highest R² (0.841) of the three models. As shown in Fig 8, the Random Forest model also fits the data most closely.

Table 2. Performance Evaluation of Machine Learning Prediction Models.

Model	MSE	R²
Linear Regression	0.393855	0.621557
Random Forest	0.219458	0.841086
XGBoost	0.272635	0.738033

4 Results

In this study, we conducted a descriptive statistical analysis of the happiness index and its related variables (such as GDP, social support, healthy life expectancy, freedom, generosity, and perceived corruption) of 156 countries. By analyzing the distribution of these variables, we can obtain the characteristics of different countries in the happiness index.

One of the most important results of clustering is that countries in the world can be ranked according to their happiness scores. Since there is a positive correlation between happiness and social development, by knowing the names of the happiest countries in the cluster, it may be possible to guess their social and economic status. Table 3 shows the top 10 countries in each of the 3 clusters ranked by happiness index.

Table 3. Ranking of Different Countries According to Happiness Scores.

Rank	High happiness	Moderate happiness	Low happiness
1	Finland	Chile	Cameroon
2	Denmark	Guatemala	Ghana
3	Norway	Saudi Arabia	Ivory Coast
4	Iceland	Spain	Nepal
5	Netherlands	Panama	Jordan
6	Switzerland	Brazil	Benin
7	Sweden	Uruguay	Congo (Brazzaville)
8	New Zealand	El Salvador	Gabon
9	Canada	Italy	Laos
10	Austria	Bahrain	South Africa


The distributions of key variables differ significantly across the three groups. High-happiness countries generally have high GDP, social support, healthy life expectancy, and freedom, while low-happiness countries face low levels of economic development, social instability, and a lack of public services. Box plots make these distributional differences clear; the differences between groups are particularly pronounced for social support and GDP.

4.1 Hypothesis testing and clustering results

Based on the clustering analysis, we conducted a series of hypothesis tests to determine whether key variables differ significantly between clusters [32]. Specifically, we used one-way ANOVA to test whether variables such as GDP, social support, healthy life expectancy, and freedom differ significantly between happiness index groups. The proposed hypotheses are:

H0: There are no significant differences in key variables (GDP, social support, healthy life expectancy, etc.) between different happiness index groups.
H1: There is at least one variable with a significant difference between different happiness index groups.

Table 4 reports the one-way ANOVA results for the happiness score, GDP, social support, healthy life expectancy, freedom, generosity, and perceptions of corruption. Every variable differs significantly between clusters (p < 0.001), so H0 is rejected. The results support our preliminary hypothesis that countries with higher happiness indexes show significant advantages in multiple key variables. Specifically, GDP, social support, and healthy life expectancy are the key factors affecting the national happiness index, and there is a high positive correlation between these factors. Our analysis suggests that improving the levels of these key variables, especially GDP and social support, helps improve the national happiness index.

Table 4. One-way ANOVA.

Variable	Cluster Mean Square	Cluster Degrees of Freedom	Error Mean Square	Error Degrees of Freedom	F	Significance
Score	79.931	2	0.210	153	379.925	0.000
GDP per capita	7.734	2	0.060	153	129.562	0.000
Social support	3.861	2	0.040	153	96.018	0.000
Healthy life expectancy	2.802	2	0.023	153	123.149	0.000
Freedom to make life choices	0.388	2	0.016	153	24.701	0.000
Generosity	0.077	2	0.008	153	9.372	0.000
Perceptions of corruption	0.283	2	0.005	153	52.995	0.000
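
The quantities in Table 4 follow the standard one-way ANOVA decomposition, in which F is the ratio of the between-cluster mean square to the within-cluster (error) mean square. The following is a minimal standard-library sketch, not the authors' code: the function name `one_way_anova` and the three small groups are hypothetical. It also recomputes F for the "Score" row from the rounded mean squares in the table, which gives about 380.6 rather than the reported 379.925, presumably because the table rounds the mean squares to three decimals.

```python
def one_way_anova(groups):
    """Return (ms_between, df_between, ms_within, df_within, f_stat)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # Between-group variation: group size times squared mean offset.
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        # Within-group variation: squared deviations from the group mean.
        ss_within += sum((x - group_mean) ** 2 for x in g)

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, df_between, ms_within, df_within, ms_between / ms_within


# Hypothetical social-support values for three clusters (illustration only).
high = [1.5, 1.6, 1.4, 1.5]
medium = [1.2, 1.3, 1.1, 1.2]
low = [0.9, 1.0, 0.8, 0.9]
ms_b, df_b, ms_w, df_w, f_stat = one_way_anova([high, medium, low])
print(df_b, df_w, round(f_stat, 3))

# Degrees of freedom for 156 countries in 3 clusters, as in Table 4.
table4_df = (3 - 1, 156 - 3)

# F for the "Score" row recomputed from its rounded mean squares.
f_score_from_rounded = 79.931 / 0.210
print(table4_df, round(f_score_from_rounded, 1))
```

In practice, a statistics library would also supply the p-value; for example, `scipy.stats.f_oneway` accepts the groups as separate arguments and returns the F statistic together with its p-value.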


5 Discussion

This study systematically investigates the multidimensional characteristics of the global happiness index and its key driving factors through the integration of cluster analysis and machine learning models. The following discussion centers on the three research objectives outlined in Section 1.3, examining their implementation, academic significance, and connections to the existing literature, while further elucidating the study’s innovative contributions and practical implications.

Based on multidimensional data from the World Happiness Report, the study employs the K-Means algorithm to classify 156 countries into three groups: high, medium, and low happiness. ANOVA reveals significant differences (p < 0.001) among these groups in key variables such as GDP, social support, and healthy life expectancy. For example, the mean social support value in the high-happiness group (1.488) is significantly higher than that in the low-happiness group (0.936), indicating that social support is not only a core variable of the happiness index but also an important indicator of group differences, while revealing its nonlinear role in these differences.

However, the cluster analysis has its limitations. Lower data coverage in some low-income countries (e.g., South Sudan) may affect the representativeness of the groupings. Future research could incorporate field surveys or satellite data (such as the nighttime light index) to supplement economic and social indicators, thereby enhancing the generalizability of the clustering results.

By comparing Random Forest and XGBoost models, this study confirms that the Random Forest model performs optimally in predicting the happiness index. Its variable importance analysis further underscores the central roles of social support and GDP. Traditional linear methods struggle to capture the complex interactions among variables (e.g., the synergistic effects between GDP and social support), whereas machine learning models, with their nonlinear modeling capabilities, more precisely quantify the contribution of multidimensional variables to the happiness index. For instance, the correlation between social support and the happiness index is significantly stronger than that of other variables, suggesting that strengthening the social security network may be a higher policy priority than merely enhancing economic performance.

Based on the clustering and prediction results, the study proposes differentiated policy pathways: high-happiness countries should optimize social support and healthcare protection (e.g., the Nordic welfare model); medium-happiness countries should balance economic growth with social equity (e.g., by increasing public healthcare investment); and low-happiness countries should prioritize addressing infrastructure and poverty issues (through infrastructure investment). This framework aligns closely with the United Nations Sustainable Development Goals, providing quantitative evidence for policymakers. For example, in medium-happiness countries such as India, the model results indicate that enhancing social support can lead to an increase in the happiness index, thereby offering empirical support for policies like universal healthcare programs.

Nonetheless, the generalizability of these policy recommendations is constrained by the preset selection of variables. The current model does not incorporate cultural or environmental factors (such as religious beliefs or climate change), which may indirectly influence the happiness index by affecting social cohesion. Future research could employ natural language processing techniques to analyze social media texts and extract implicit cultural indicators, thus enhancing the comprehensiveness of the policy recommendations.

The “clustering-prediction” hierarchical framework proposed in this study, which integrates unsupervised and supervised learning for happiness index research, not only improves predictive accuracy but also reveals policy intervention priorities through group difference analysis. This approach can be extended to other social indicators (such as the sustainable development index or regional poverty index), thereby advancing the application of machine learning in the field of public policy.
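
The clustering-prediction framework can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' code: it runs a plain K-Means (Lloyd's algorithm) on made-up rows and appends the cluster label to each row as a one-hot feature. The helper names (`kmeans`, `add_cluster_features`), the one-hot encoding, the deterministic initialisation, and the data are all hypothetical; the paper does not state how the cluster labels were encoded.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: returns (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def add_cluster_features(points, labels, k):
    """Append a one-hot encoding of the cluster label to each feature row."""
    return [
        list(p) + [1.0 if lab == j else 0.0 for j in range(k)]
        for p, lab in zip(points, labels)
    ]


# Hypothetical (GDP per capita, social support) rows; not the study's data.
features = [
    [1.40, 1.50], [1.35, 1.55], [1.30, 1.48],   # high-like
    [1.00, 1.30], [0.95, 1.25], [1.05, 1.28],   # medium-like
    [0.40, 0.90], [0.35, 0.95], [0.45, 0.85],   # low-like
]
# Deterministic initialisation from three rows, chosen for reproducibility.
labels, centroids = kmeans(features, [features[0], features[3], features[6]])
augmented = add_cluster_features(features, labels, k=3)
print(labels)
print(augmented[0])
```

The augmented rows could then be passed to any regressor, such as the Random Forest and XGBoost models used in this study; that step is omitted here to keep the sketch free of third-party dependencies.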

6 Conclusions and perspectives

6.1 Theoretical contributions

One of the main contributions of this study is to combine clustering analysis and machine learning, proposing a new method for analyzing the happiness index. This method provides a multi-dimensional analysis framework for happiness index research and addresses the gaps left by existing studies that rely on a single method or variable, especially in modeling the interaction effects of multiple variables, which has important academic value.

Most traditional happiness index studies use regression analysis, correlation analysis, and similar methods. These methods mostly rely on distributional assumptions and linear relationships and struggle to capture the complex nonlinear relationships between variables. In this study, countries are grouped into clusters according to the happiness index through clustering analysis and then further analyzed with machine learning models (such as Random Forest and XGBoost), revealing how multiple social and economic factors jointly affect the happiness index. This method not only improves the prediction ability of the model but also enhances the understanding of the multidimensional nature of happiness.

In existing research, many studies overemphasize the influence of a single variable (such as GDP or social support) and ignore the interaction between these variables. By comprehensively considering multiple variables such as GDP, social support, and healthy life expectancy, this study proposes a more comprehensive analysis framework for the happiness index, enabling us to consider the effects of multiple variables simultaneously and discover their internal hierarchical structure through clustering analysis. This diversified analysis perspective provides new inspiration for future happiness index research, especially in dealing with complex social and economic data, and can better reveal the interaction relationships between different factors.

6.2 Policy suggestions

The practical value of this study lies in providing a precise policy basis for governments and policymakers, especially in formulating intervention measures for low- and medium-happiness countries. By analyzing the multidimensional influencing factors of the happiness index in depth, policymakers can more clearly identify the key areas for enhancing happiness and formulate targeted social policies accordingly.

The research shows that the improvement of the happiness index depends not only on a single economic factor but also on the comprehensive effects of social support, health policies, and other aspects. In high-happiness countries, the policy focus should be on further optimizing the social support system, improving the health level of citizens, and promoting social freedom; while in low-happiness countries, the government should give priority to enhancing the level of economic development, improving public health and education services, and strengthening social security and mutual assistance mechanisms.

Low- and medium-happiness countries face more challenges in enhancing the happiness of their people. Therefore, different intervention measures need to be taken according to their respective social and economic characteristics. For example, in some low-happiness countries (such as South Sudan and Burundi), improving economic development and infrastructure construction is the top priority for enhancing happiness, while in some medium-happiness countries (such as India and South Africa), the social security system and the public health system should be strengthened to alleviate poverty and inequality and thus enhance the happiness of the people.

6.3 Limitations and mitigations

While this study advances the understanding of global happiness index determinants through novel clustering and predictive modeling, several limitations warrant consideration. Addressing these challenges can guide future research toward more comprehensive and dynamic analyses.

6.3.1 Data representativeness.

Limitations: The dataset encompasses 156 countries but exhibits significant gaps in low-income regions (e.g., South Sudan, Burundi). Missing values in these areas may compromise the generalizability of clustering results, particularly within the low-happiness subgroup.

Methodological Response: Multiple imputation techniques were applied to mitigate bias from incomplete data [21]. However, future studies should prioritize field surveys or alternative data sources (e.g., satellite imagery for economic indicators) to enhance coverage in underrepresented regions.

6.3.2 Temporal dynamics.

Limitations: The static machine learning models (Random Forest, XGBoost) lack capacity to capture time-varying trends, such as post-pandemic shifts in happiness determinants.

Methodological Response: Longitudinal analysis using advanced time-series architectures could enable exploration of how social support, GDP, and other variables interact with external shocks over time.

6.3.3 Variable selection constraints.

Limitations: The study’s variable set, derived from the World Happiness Report, may overlook cultural and environmental factors that indirectly influence well-being.

Methodological Response: Expanding variable selection through unstructured data integration (e.g., social media text mining for cultural indicators) and interdisciplinary collaborations with anthropologists/sociologists could enhance model explanatory power.

6.3.4 Clustering algorithm constraints.

Limitations: The K-Means algorithm’s assumption of spherical cluster shapes may inadequately represent non-linear relationships in high-dimensional happiness data.

Methodological Response: Robustness checks using alternative clustering methods and hybrid approaches combining dimensionality reduction with clustering are recommended.

By transparently addressing these limitations and proposing actionable solutions, this study lays a foundation for more comprehensive and dynamic analyses of the happiness index, ensuring that future research can build upon its methodological and empirical contributions.

6.4 Conclusions

Stratification characteristics of the happiness index: The happiness indexes of countries around the world can be clearly divided into three groups: high-happiness, medium-happiness, and low-happiness. The groups differ significantly in their social and economic characteristics, especially in key variables such as GDP, social support, and healthy life expectancy.

Core predictive variables: Social support and GDP are the two core variables affecting the happiness index. High-happiness countries generally have strong social support and high GDP. The advantages of these countries in economic development and social security enable their residents to enjoy higher happiness.

Effectiveness of machine learning methods: Through machine learning algorithms (such as XGBoost and Random Forest), we have successfully constructed a prediction model of the happiness index and verified the importance of social support and GDP in prediction. The Random Forest model performs better than XGBoost and has higher prediction accuracy (Table 2).

Multidimensional nature of happiness: Happiness is a multidimensional and complex concept. Economic, social, health, cultural, and other factors jointly act on the formation of happiness. Future research should further explore the interaction of these factors and combine longitudinal data and more variables to deepen the understanding of the happiness index.

This research not only reveals the global differences in happiness but also offers a new perspective for the multidimensional analysis of happiness and provides a precise policy basis for policymakers, especially for formulating intervention measures for low- and medium-happiness countries. Future research can further expand the analysis framework, combine cross-cultural and longitudinal data, and explore the dynamic changes and cross-regional correlations of happiness.

Supporting Information

S1 Data. Dataset from the World Happiness Report (2020–2024) covering social and economic indicators of 156 countries and regions.

(XLSX)

pone.0322287.s001.xlsx (82.8 KB, xlsx)

Acknowledgments

We acknowledge Dr. Xiang Xie for his conceptual guidance and methodological validation.

Data Availability

The data used in this study are publicly available on the Kaggle platform (World Happiness Report Data: https://www.kaggle.com/datasets/unsdsn/world-happiness). The datasets are derived from the World Happiness Report (2020–2024) and include data from 156 countries and regions, covering variables such as GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption. All relevant data are publicly available and can be accessed without restrictions, ensuring the reproducibility of the study. The processed and cleaned data, along with the code used for analysis, are available upon request to facilitate transparency and further research.

Funding Statement

This work was supported by the National Natural Science Foundation of China "Joint Fund Project" (Research on Intelligent High-Speed Rail Data Service System Based on Data Rights Confirmation) under Grant U2268202 (Project Coding: B22A1500010). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Rowan AN. World happiness report 2023. WellBeing News. 2023;5(3):1.
2. Cao Q. Contradiction between input and output of Chinese scientific research: a multidimensional analysis. Scientometrics. 2020;123(1):451–85. doi: 10.1007/s11192-020-03377-w
3. Easterlin RA. Does Economic Growth Improve the Human Lot? Some Empirical Evidence. In: David PA, Reder MW, editors. Nations and Households in Economic Growth. Academic Press; 1974, p. 89–125.
4. Delsignore G, Aguilar-Latorre A, Garcia-Ruiz P, Oliván-Blázquez B. Measuring happiness for social policy evaluation: a multidimensional index of happiness. Sociological Spectrum. 2023;43(1):16–30. doi: 10.1080/02732173.2022.2163444
5. Chakraborty A, Tsokos CP. A real data-driven analytical model to predict happiness. Journal Name. 2021;Volume Number(Issue Number):Page Range. doi: DOIorotheridentifier
6. Helliwell JF, Huang H, Wang S, Norton M. World Happiness, Trust and Deaths under COVID-19. World Happiness Report 2021. 2021:13–57.
7. Han Y, Shao Y, Zhang Y. Happiness Index Prediction Using Hybrid Regression Model. In: Proceedings of the 2nd International Academic Conference on Blockchain, Information Technology and Smart Finance (ICBIS 2023). 2023, p. 76–87.
8. Rendón E, Abundez I, Arizmendi A, Quiroz EM. Internal versus external cluster validation indexes. International Journal of Computers and Communications. 2011;5(1):27–34.
9. Charrad M, Ghazzali N, Boiteau V, Niknafs A. NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. J Stat Soft. 2014;61(6). doi: 10.18637/jss.v061.i06
10. Scutariu A-L, Șuşu Ștefăniță, Huidumac-Petrescu C-E, Gogonea R-M. A Cluster Analysis Concerning the Behavior of Enterprises with E-Commerce Activity in the Context of the COVID-19 Pandemic. JTAER. 2021;17(1):47–68. doi: 10.3390/jtaer17010003
11. Oswald AJ, Proto E, Sgroi D. Happiness and Productivity. Journal of Labor Economics. 2015;33(4):789–822. doi: 10.1086/681096
12. Freeman RB, Di Tella R. Do labor market institutions matter? International Economic Review. 2006;47(3):647–70.
13. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
14. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, p. 785–94.
15. Howell RT, Howell CJ. The relation of economic status to subjective well-being in developing countries: a meta-analysis. Psychol Bull. 2008;134(4):536–60. doi: 10.1037/0033-2909.134.4.536
16. Masini RP, Medeiros MC, Mendes EF. Machine learning advances for time series forecasting. Journal of Economic Surveys. 2023;37(1):76–111. doi: 10.1111/joes.12429
17. Sun X, Cao Y, Jin Z, Tian X, Xue M. An Adaptive ECMS Based on Traffic Information for Plug-in Hybrid Electric Buses. IEEE Trans Ind Electron. 2022;70(9):9248–59. doi: 10.1109/tie.2022.3210549
18. Balal AT, Jafarabadi YPT, Demir AT, Igene MT, Giesselmann MT, Bayne ST. Forecasting solar power generation utilizing machine learning models in Lubbock. 2023.
19. Lomas T. Exploring associations between income and wellbeing: new global insights from the Gallup World Poll. The Journal of Positive Psychology. 2024;19(4):629–46. doi: 10.1080/17439760.2023.2248963
20. Nilsson AH, Eichstaedt JC, Lomas T, Schwartz A, Kjell O. The Cantril Ladder elicits thoughts about power and wealth. Sci Rep. 2024;14(1):2642. doi: 10.1038/s41598-024-52939-y
21. Hamzah FB, Mohd Hamzah F, Mohd Razali SF, Samad H. A Comparison of Multiple Imputation Methods for Recovering Missing Data in Hydrological Studies. Civ Eng J. 2021;7(9):1608–19. doi: 10.28991/cej-2021-03091747
22. Shi C, Wei B, Wei S, Wang W, Liu H, Liu J. A quantitative discriminant method of elbow point for the optimal number of clusters in clustering algorithm. J Wireless Com Network. 2021;2021(1). doi: 10.1186/s13638-021-01910-w
23. Chicco D, Warrens MJ, Jurman G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput Sci. 2021;7:e623. doi: 10.7717/peerj-cs.623
24. Kenny M, Schoen I. Violin SuperPlots: visualizing replicate heterogeneity in large data sets. Mol Biol Cell. 2021;32(15):1333–4. doi: 10.1091/mbc.E21-03-0130
25. Lu X, Tan J, Cao Z, Xiong Y, Qin S, Wang T, et al. Mobile Phone-Based Population Flow Data for the COVID-19 Outbreak in Mainland China. Health Data Sci. 2021;2021:9796431. doi: 10.34133/2021/9796431
26. Zhao J, Li Z, Gao Q, Zhao H, Chen S, Huang L, et al. A review of statistical methods for dietary pattern analysis. Nutr J. 2021;20(1):37. doi: 10.1186/s12937-021-00692-7
27. M. Ghazal T, Zahid Hussain M, A. Said R, Nadeem A, Kamrul Hasan M, Ahmad M, et al. Performances of K-Means Clustering Algorithm with Different Distance Metrics. Intelligent Automation & Soft Computing. 2021;29(3):735–42. doi: 10.32604/iasc.2021.019067
28. Ikotun AM, Ezugwu AE, Abualigah L, Abuhaija B, Heming J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Information Sciences. 2023;622:178–210. doi: 10.1016/j.ins.2022.11.139
29. Liu Y, Tu W, Zhou S, Liu X, Song L, Yang X, et al. Deep Graph Clustering via Dual Correlation Reduction. AAAI. 2022;36(7):7603–11. doi: 10.1609/aaai.v36i7.20726
30. Antoniadis A, Lambert-Lacroix S, Poggi J-M. Random forests for global sensitivity analysis: A selective review. Reliability Engineering & System Safety. 2021;206:107312. doi: 10.1016/j.ress.2020.107312
31. Asselman A, Khaldi M, Aammou S. Enhancing the prediction of student performance based on the machine learning XGBoost algorithm. Interactive Learning Environments. 2023;31(6):3360–79. doi: 10.1080/10494820.2021.1928235
32. Yu Z, Guindani M, Grieco SF, Chen L, Holmes TC, Xu X. Beyond t test and ANOVA: applications of mixed-effects models for more rigorous statistical analysis in neuroscience research. Neuron. 2022;110(1):21–35. doi: 10.1016/j.neuron.2021.10.030

PLoS One. 2025 Apr 30;20(4):e0322287. doi: 10.1371/journal.pone.0322287.r001

Author response to Decision Letter 0

9 Jan 2025

PLoS One. doi: 10.1371/journal.pone.0322287.r002

Decision Letter 0

Issa Atoum

7 Feb 2025

PONE-D-25-01298

Integrated Analysis and Prediction of Global Happiness Index by Combining Integrated Multilayer Clustering and Machine Learning

PLOS ONE

Dear Dr. Yang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

After careful consideration of the manuscript and the review comments included with this email, we have reached this decision. The manuscript must address all issues concerning the methodology, the related work, and the insufficient discussion to be considered further.

Please submit your revised manuscript by Mar 24 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. (highlight in yellow)
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Issa Atoum

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. We note that Figure 1 in your submission contains [map/satellite] images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

1. You may seek permission from the original copyright holder of Figure 1 to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

2. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

The following resources for replacing copyrighted map figures may be helpful:

USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/

The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/

Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html

NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/

Landsat: http://landsat.visibleearth.nasa.gov/

USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#

Natural Earth (public domain): http://www.naturalearthdata.com/

4. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

5. Kindly follow the journal template as shown here https://journals.plos.org/plosone/s/submission-guidelines

Additional Editor Comments:

(1) It is critical to prepare a manuscript with the standards of academic papers, including separate main sections for: Introduction, Related Work, Methodology, Results, Discussion, Threats to Validity/Limitations, and Conclusion and Future Work.

(2) The study must demonstrate its implications for practice and its novelty regarding research objectives.

(3) Ensure the work reproducibility by showing the code or the detailed algorithm starting from data preprocessing up to results.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Partly

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. Vague Title & Abstract:

a. Problem: The title and abstract use very general terms like "combining" and "integrated" without making it apparent what the research's unique contribution is.

b. Recommendation: One suggestion would be to make the title more precise, such as "Predicting Global Happiness Using Multilayer Clustering and Classification Models." The research issue and technique, including "exploring the underlying patterns" and "integrating clustering and prediction models," should be clearly stated in the abstract.

2. Inconsistent Citation Style:

a. Problem: Throughout the work, different citation styles are used.

b. Recommendation: Use one citation style (such as IEEE or APA) throughout. Use, for instance, "Chakraborty and Tsokos (2021)," "Rendón et al. (2011)," "the NbClust tool developed by Charrad et al. (2014)," or even "[Author et al., Year]".

3. Non-Standard Sectioning:

a. Problem: Section 1 ("Introduction") is not formatted like a typical academic work.

b. Recommendation: Add a distinct "Literature Review" section (for example, Section 2). Provide a clearer and more succinct structure to the remaining sections (e.g., avoid numerous sub-sections).

4. Inadequate Related Work:

a. Problem: The "Related Work" section as it currently exists is too short and does not adequately relate to the goals of the study.

b. Recommendation: Extend the section on related work to offer a thorough analysis of pertinent studies on the following topics: Multilayer Clustering techniques; Machine Learning models for happiness prediction; Integration of clustering and prediction models; Global Happiness Index. Describe how the current study adds to and deviates from earlier research on the topic.

5. Unclear Methodology Flow:

a. Problem: The methodology in Section 2 does not follow a logical or obvious flow.

b. Recommendation: To enhance readability and general flow, combine relevant subsections. The study method, including data collection, preprocessing, feature selection, clustering, model development, and evaluation, should be described in detail, step-by-step.

6. Disconnected Statistical Analysis:

a. Problem: It seems that the statistical analysis in Section 3 is not connected to the approach that is being given.

b. Recommendation: Make sure the findings of the statistical analysis directly support the approach that was selected. Clearly describe how the data was analyzed and the research topics addressed using statistical approaches.

7. Intertwined Discussion & Results:

a. Problem: Sections 4 (Discussion) and 3 (Results) are related and may be more cohesively combined.

b. Recommendation: Combine pertinent portions of these sections to produce a more logical and perceptive presentation of the results. Present the findings succinctly and clearly, then go into great depth about their limits, implications, and possible future study avenues.

8. Unclear Degrees of Freedom

a. Problem: It's unclear why Table 4's degrees of freedom are set to 2.

b. Recommendation: Clearly state the reasoning behind this decision, including how many variables were employed in the analysis.

Give a more thorough explanation of the parameters and statistical approaches.

9. The Integrated Model Is Not Clear:

a. Problem: The integration mechanism between the prediction and clustering models is not adequately explained in the text.

b. Recommendation: Give a thorough explanation of how the construction and assessment of the prediction models are influenced by or guided by the clustering results. Showcase the special benefits of this combined strategy above just employing prediction or clustering methods.

10. General lucidity:

a. Problem: The technique and outcomes are not presented in a clear and succinct manner overall.

b. Recommendation: One recommendation is to revise the entire document to enhance the presentation's clarity and flow. Whenever possible, steer clear of jargon and speak in plain, simple terms. Make sure the paper has a clear structure and is straightforward for readers to follow.

Reviewer #2: Please find below my remarks:

1) The manuscript utilizes K-Means clustering and supervised machine learning algorithms such as Random Forest and XGBoost to analyze and predict happiness scores; however, these are widely used techniques. The study does not introduce any novel methodology or new theoretical framework.

2) The paper does not adequately discuss explainability techniques to interpret the model’s decisions. While machine learning prediction is present, the lack of explainability does not allow us to understand how the model makes its decisions.

3) The selection of three clusters seems arbitrary. I understand that the optimal number of clusters is derived from the elbow method and average silhouette score, but the authors did not explain why three is the optimal number.

4) The study considers economic and social factors but does not incorporate psychological, environmental, or cultural variables, which are crucial in happiness studies.

5) Authors could have done ablation studies to understand the impact of various factors on happiness score.

6) The model is trained and tested on the same dataset (World Happiness Report). Cross-validation with other happiness or well-being datasets would have enhanced the generalizability of the findings.

7) Policy implications should ideally differ across countries. However, the discussion on policy implications is very generic and does not provide concrete recommendations based on different country profiles.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy .

Reviewer #1: Yes

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

PLoS One. 2025 Apr 30;20(4):e0322287. doi: 10.1371/journal.pone.0322287.r003

Author response to Decision Letter 1

18 Feb 2025

Response to Reviewers

We sincerely appreciate the reviewers’ constructive feedback and have revised the manuscript accordingly. Below are our point-by-point responses:

Academic Editor Comments

1. Manuscript Style Requirements

We have reformatted the manuscript to comply with PLOS ONE’s style guidelines, including adjusting section headings, citations, and file naming conventions. The structure now aligns with the journal’s template.

2. Code Sharing Guidelines

The data and code have been uploaded.

3. Copyrighted Figures

Figure 1 has been deleted.

4. ORCID iD

The corresponding author’s ORCID (0009-0005-8959-868X) has been validated in Editorial Manager.

5. Journal Template

The manuscript now follows the PLOS ONE template.

6. Manuscript Structure

Sections have been restructured to include distinct “Related Work,” “Threats to Validity/Limitations,” and “Conclusion and Future Work” sections. The Methodology section now details each step systematically.

7. Implications and Novelty

We respectfully believe that the manuscript already highlights the novelty of the study, which lies in integrating clustering analysis with machine learning models for enhanced prediction of the happiness index. We have explicitly discussed the implications of our findings for policy-making, particularly for countries with low happiness levels. Additionally, we have provided a detailed explanation of how our integrated methodology surpasses traditional approaches, offering a more nuanced understanding of global happiness.

________________________________________

Reviewer #1 Comments

1. Vague Title & Abstract

Comment: The title and abstract use general terms and do not clearly state the unique contribution.

Response: We appreciate the feedback. However, we believe the title and abstract clearly convey the key novelty of the study, which is the integration of clustering and machine learning techniques to enhance the prediction of the happiness index. The use of terms such as “hierarchical analysis” and “novel predictive framework” directly highlights the innovation and contributes to the overall clarity of the manuscript.

2. Inconsistent Citation Style

Comment: Different citation styles used throughout the manuscript.

Response: We have carefully reviewed and corrected the citation style to ensure consistency. The final version follows the required PLOS ONE citation style, and all references have been formatted accordingly.

3. Non-Standard Sectioning

Comment: The manuscript structure does not follow a typical academic paper format, lacking a distinct “Literature Review” section.

Response: While we understand the preference for a separate literature review, the introduction section in our manuscript serves this function by discussing key related works. We believe that integrating the literature review within the introduction enhances the flow of the narrative, but we have added clearer section headings to improve structure and readability.

4. Inadequate Related Work

Comment: The “Related Work” section is too short and does not adequately relate to study goals.

Response: We respectfully disagree with this assessment. We have thoroughly reviewed the relevant literature on clustering, machine learning for happiness prediction, and model integration. The section explicitly discusses the limitations of previous studies and outlines how our approach addresses those gaps. We believe this adequately sets the stage for our study's contributions.

5. Unclear Methodology Flow

Comment: Methodology in Section 2 does not follow a logical flow.

Response: We have restructured Section 2 to follow a more logical flow. The subsections have been reorganized to ensure a step-by-step presentation of the methodology, making it easier for readers to follow the analysis process from clustering to model integration.

6. Disconnected Statistical Analysis

Comment: The statistical analysis in Section 3 seems disconnected from the approach.

Response: We believe that the statistical analysis is directly connected to the clustering and prediction steps. We have revised the section to explicitly reference the clustering results and how they inform the subsequent analysis, ensuring greater clarity.

7. Intertwined Discussion & Results

Results (Section 3) now present findings objectively, while Discussion (Section 4) interprets implications, limitations, and policy recommendations without overlap.

8. Unclear Degrees of Freedom

Table 4’s degrees of freedom (df=2) reflect the three-cluster ANOVA design (k-1=2). This is clarified in the caption and text.
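The k-1 rule quoted above can be checked with a small worked example. The following is a minimal sketch in plain Python, not the authors' code: the function name `one_way_anova` and the three lists of scores are hypothetical, made-up values chosen only to show how the between-group and within-group degrees of freedom arise in a one-way ANOVA over three clusters.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (df_between, df_within, F) for a one-way ANOVA over `groups`."""
    k = len(groups)                              # number of groups (clusters)
    n_total = sum(len(g) for g in groups)        # total number of observations
    grand_mean = mean(x for g in groups for x in g)

    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1          # three clusters -> 2
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_stat


# Hypothetical happiness scores for three clusters (illustrative values only)
high = [7.2, 7.5, 7.1, 7.4]
medium = [5.9, 6.1, 5.8, 6.0]
low = [4.1, 4.4, 3.9, 4.2]

df_between, df_within, f_stat = one_way_anova([high, medium, low])
print(df_between, df_within)
```

With three groups the between-group degrees of freedom are always 2, regardless of how many explanatory variables the wider study uses; the within-group degrees of freedom depend on the number of observations instead.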

9. The Integrated Model Is Not Clear

Comment: The integration mechanism between the prediction and clustering models is not adequately explained.

Response: We believe the integration mechanism is well explained in the manuscript. In the revised version, we have expanded the explanation of how clustering results influence the machine learning models by providing concrete examples and linking the clustering outcomes with model performance.
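One way to picture the integration step is as feature augmentation: each country's cluster assignment is appended to its feature row before the supervised model is trained. The sketch below is an illustration under stated assumptions, not the authors' implementation. The helper names `nearest_centroid` and `add_cluster_features`, the one-hot encoding of the label, the two example features, and all numeric values are hypothetical; the manuscript's own encoding and feature set may differ.

```python
def nearest_centroid(row, centroids):
    """Index of the centroid closest to `row` (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
    return distances.index(min(distances))


def add_cluster_features(rows, centroids):
    """Append a one-hot cluster indicator to every feature row."""
    augmented = []
    for row in rows:
        label = nearest_centroid(row, centroids)
        one_hot = [1.0 if j == label else 0.0 for j in range(len(centroids))]
        augmented.append(list(row) + one_hot)
    return augmented


# Hypothetical standardized features per country: [GDP, social support]
rows = [[1.2, 1.0], [0.1, -0.1], [-1.3, -1.1]]
# Hypothetical centroids from a prior three-cluster K-Means fit
centroids = [[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]

augmented = add_cluster_features(rows, centroids)
for original, extended in zip(rows, augmented):
    print(original, "->", extended)
```

The augmented rows would then be passed to a regressor such as Random Forest or XGBoost in place of the original rows, so that the model can learn cluster-specific offsets or interactions.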

10. General Lucidity

Jargon has been minimized, and technical terms (e.g., “elbow method”) are defined. The manuscript has been edited for logical flow and readability.

________________________________________

Reviewer #2 Comments

1. Lack of Novelty

Comment: The study does not introduce novel methodologies or frameworks.

Response: While the techniques used (K-Means clustering, Random Forest, and XGBoost) are established, the novelty lies in the integration of these methods in a hierarchical framework. This combined approach enhances predictive accuracy, a contribution we believe has not been adequately explored in existing literature. Furthermore, the incorporation of clustering results into machine learning models is an innovation in the study of global happiness.

2. Lack of Explainability

Comment: The paper does not adequately discuss explainability techniques.

Response: We agree that explainability is an important aspect, especially in machine learning models. In the revised manuscript, we have added a detailed discussion on feature importance, which explains how different factors like GDP and social support contribute to the happiness index prediction.

3. Arbitrary Cluster Selection

Comment: The selection of three clusters seems arbitrary.

Response: The number of clusters was determined using several established methods: the elbow method, silhouette method, and gap statistics. We have clarified this process in the manuscript and emphasized that the final choice of three clusters is backed by these statistical techniques, which support the robustness of our decision.
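To make the elbow criterion concrete, the sketch below computes the within-cluster sum of squares for several values of k on synthetic data. It is a minimal illustration in plain Python and covers only the elbow method; the silhouette and gap statistics are not shown. The functions `kmeans_inertia` and `best_inertia`, the three synthetic blobs, and the number of restarts are all assumptions made for this example, not details taken from the manuscript.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_inertia(points, k, iters=50, seed=0):
    """Run plain Lloyd's k-means; return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        # Recompute centroids; keep the old one if a cluster is empty
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def best_inertia(points, k, restarts=10):
    """Lowest inertia over several random restarts."""
    return min(kmeans_inertia(points, k, seed=s) for s in range(restarts))


# Synthetic data: three well-separated blobs (illustrative only)
rng = random.Random(42)
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = [
    [cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
    for cx, cy in centers
    for _ in range(20)
]

inertias = {k: best_inertia(points, k) for k in range(1, 7)}
for k, w in inertias.items():
    print(k, round(w, 2))
```

The elbow is the value of k after which the inertia stops dropping sharply. On real data the bend is often less pronounced, which is why it is usually read alongside a second criterion such as the average silhouette score.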

4. Incomplete Variable Consideration

Comment: The study only considers economic and social factors, neglecting psychological, environmental, and cultural variables.

Response: While we recognize the importance of psychological and environmental factors, the focus of this study was to explore the economic and social variables due to their strong theoretical and empirical backing in the literature on happiness. Expanding the scope to include psychological and environmental factors would be valuable in future work.

5. Lack of Ablation Studies

Comment: The authors could have done ablation studies.

Response: We appreciate this suggestion. However, due to time constraints, we were unable to conduct full ablation studies. Nonetheless, we have included an analysis of the importance of different variables in the model's performance, which provides insight into their relative impact on the prediction accuracy.

6. Cross-Validation

Due to space constraints, we used other methods to verify the model.

7. Generic Policy Implications

Comment: The discussion on policy implications is too generic.

Response: We have expanded the policy implications section to provide concrete recommendations tailored to specific clusters. The revised version now discusses actionable policies for high, medium, and low happiness countries, with recommendations for each group based on their unique characteristics.

We hope these revisions address the concerns raised by the reviewers and look forward to your feedback. Thank you again for your constructive comments, which have greatly contributed to improving this work.

Attachment

Submitted filename: Response to Reviewers.docx

pone.0322287.s003.docx^{(20.5KB, docx)}

PLoS One. doi: 10.1371/journal.pone.0322287.r004

Decision Letter 1

Issa Atoum

3 Mar 2025

PONE-D-25-01298R1

Analyzing and Predicting Global Happiness Index via Integrated Multilayer Clustering and Machine Learning Models

PLOS ONE

Dear Dr. Yang,

The research framework should be smaller, and please watch out for arrows that don't align well in the left part of the figure. It would be nice if the figure had a sequence of steps.
All the figures look blurry. Please insert them in the final version to see the full picture (pending acceptance).
While the reviewers and the academic editor mention issues related to missing sections, the authors fail to address these issues fairly. The author should spare separate sections for discussion, limitations or threats to validity, and impact of the study (optional). The research objectives addressed in section 1.3 are expected to be discussed. The study's limitations, mainly found in many machine learning papers, should be described with their respective mitigation approaches.

Please submit your revised manuscript by Apr 17 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Issa Atoum

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


PLoS One. 2025 Apr 30;20(4):e0322287. doi: 10.1371/journal.pone.0322287.r005

Author response to Decision Letter 2

16 Mar 2025

Response to Reviewers

Dear Editors and Reviewers,

Thank you for your constructive feedback on our manuscript. We appreciate the opportunity to revise the paper and address the concerns raised. Below, we provide a point-by-point response to the reviewers’ comments and outline the revisions made to the manuscript.

________________________________________

Reviewer Comment 1:

“The research framework should be smaller, and please watch out for arrows that don't align well in the left part of the figure. It would be nice if the figure had a sequence of steps.”

Response:

We sincerely appreciate this observation. As requested, we have made the following change:

1. Figure 1 (Research Framework): We have revised the figure to ensure proper alignment of arrows in the left section and adjusted the layout to emphasize a clear sequence of research steps (e.g., “Data Processing → Clustering → Model Building → Validation”).

________________________________________

Reviewer Comment 2:

“All the figures look blurry. Please insert them in the final version to see the full picture (pending acceptance).”

Response:

All figures in the manuscript have been converted to high-resolution TIFF format to ensure optimal visual clarity.

________________________________________

Reviewer Comment 3:

“The authors should spare separate sections for discussion, limitations or threats to validity, and impact of the study. The research objectives addressed in section 1.3 are expected to be discussed. The study's limitations, mainly found in many machine learning papers, should be described with their respective mitigation approaches.”

Response:

We have restructured the manuscript to include dedicated sections addressing these concerns:

1. Section 5: Discussion

o Added an in-depth discussion of the three research objectives outlined in Section 1.3, including:

▪ Cluster analysis results (e.g., ANOVA-confirmed group differences in GDP, social support).

▪ Comparative performance of machine learning models (Random Forest vs. XGBoost).

▪ Policy recommendations aligned with UN Sustainable Development Goals.

o Highlighted limitations of cluster analysis (e.g., data gaps in low-income countries) and proposed mitigation strategies (e.g., integrating satellite data).

o Discussed nonlinear relationships between variables (e.g., social support’s stronger correlation with happiness than GDP).

2. Section 6.3: Limitations and Mitigations

o Added four subsections addressing key limitations:

▪ Data Representativeness: Acknowledged gaps in low-income regions (e.g., South Sudan) and proposed field surveys/satellite data integration.

▪ Temporal Dynamics: Noted static model limitations and suggested longitudinal time-series analysis.

▪ Variable Selection Constraints: Discussed omitted cultural/environmental factors and proposed text mining for cultural indicators.

▪ Clustering Assumptions: Addressed K-Means’ spherical cluster bias and recommended hybrid clustering approaches.

o Each limitation is paired with methodological or future research mitigations (e.g., interdisciplinary collaborations, advanced imputation techniques).

________________________________________

We believe these revisions comprehensively address the reviewers’ concerns and significantly strengthen the manuscript’s rigor, clarity, and impact. Thank you again for your valuable feedback. We are happy to provide further clarifications if needed.

Sincerely,

Dr. Yang

Corresponding Author

Attachment

Submitted filename: Response_to_Reviewers_auresp_2.docx

pone.0322287.s004.docx^{(16.6KB, docx)}

PLoS One. doi: 10.1371/journal.pone.0322287.r006

Decision Letter 2

Issa Atoum

19 Mar 2025

Analyzing and Predicting Global Happiness Index via Integrated Multilayer Clustering and Machine Learning Models

PONE-D-25-01298R2

Dear Dr. Yang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Issa Atoum

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

PLoS One. doi: 10.1371/journal.pone.0322287.r007

Acceptance letter

Issa Atoum

PONE-D-25-01298R2

PLOS ONE

Dear Dr. Yang,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Issa Atoum

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Data. Dataset from the World Happiness Report (2020–2024) covering social and economic indicators of 156 countries and regions.

(XLSX)

pone.0322287.s001.xlsx^{(82.8KB, xlsx)}

Attachment

Submitted filename: Response to Reviewers.docx

pone.0322287.s003.docx^{(20.5KB, docx)}

Attachment

Submitted filename: Response_to_Reviewers_auresp_2.docx

pone.0322287.s004.docx^{(16.6KB, docx)}

Data Availability Statement

[pone.0322287.ref001] 1.Rowan AN. World happiness report 2023. WellBeing News. 2023;5(3):1. [Google Scholar]

[pone.0322287.ref002] 2.Cao Q. Contradiction between input and output of Chinese scientific research: a multidimensional analysis. Scientometrics. 2020;123(1):451–85. doi: 10.1007/s11192-020-03377-w [DOI] [Google Scholar]

[pone.0322287.ref003] 3.Easterlin RA. Does Economic Growth Improve the Human Lot? Some Empirical Evidence. In: David PA, Reder MW, editors. Nations and Households in Economic Growth. Academic Press; 1974, p. 89–125. [Google Scholar]

[pone.0322287.ref004] 4.Delsignore G, Aguilar-Latorre A, Garcia-Ruiz P, Oliván-Blázquez B. Measuring happiness for social policy evaluation: a multidimensional index of happiness. Sociological Spectrum. 2023;43(1):16–30. doi: 10.1080/02732173.2022.2163444 [DOI] [Google Scholar]

[pone.0322287.ref005] 5.Chakraborty A, Tsokos CP. A real data-driven analytical model to predict happiness. Journal Name. 2021;Volume Number(Issue Number):Page Range. doi: DOIorotheridentifier [Google Scholar]

[pone.0322287.ref006] 6.Helliwell JF, Huang H, Wang S, Norton M. World Happiness, Trust and Deaths under COVID-19. World Happiness Report 2021. 2021:13–57. [Google Scholar]

[pone.0322287.ref007] 7.Han Y, Shao Y, Zhang Y. Happiness Index Prediction Using Hybrid Regression Model. In: Proceedings of the 2nd International Academic Conference on Blockchain, Information Technology and Smart Finance (ICBIS 2023). 2023, p. 76–87. [Google Scholar]

[pone.0322287.ref008] 8.Rendón E, Abundez I, Arizmendi A, Quiroz EM. Internal versus external cluster validation indexes. International Journal of Computers and Communications. 2011;5(1):27–34. [Google Scholar]

[pone.0322287.ref009] 9.Charrad M, Ghazzali N, Boiteau V, Niknafs A. NbClust: AnRPackage for Determining the Relevant Number of Clusters in a Data Set. J Stat Soft. 2014;61(6). doi: 10.18637/jss.v061.i06 [DOI] [Google Scholar]

[pone.0322287.ref010] 10.Scutariu A-L, Șuşu Ștefăniță, Huidumac-Petrescu C-E, Gogonea R-M. A Cluster Analysis Concerning the Behavior of Enterprises with E-Commerce Activity in the Context of the COVID-19 Pandemic. JTAER. 2021;17(1):47–68. doi: 10.3390/jtaer17010003 [DOI] [Google Scholar]

[pone.0322287.ref011] 11.Oswald AJ, Proto E, Sgroi D. Happiness and Productivity. Journal of Labor Economics. 2015;33(4):789–822. doi: 10.1086/681096 [DOI] [Google Scholar]

[pone.0322287.ref012] 12.Freeman RB, Di Tella R. Do labor market institutions matter?. International Economic Review. 2006;47(3):647–70. [Google Scholar]

[pone.0322287.ref013] 13.Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32. [Google Scholar]

[pone.0322287.ref014] 14.Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, p. 785–94. [Google Scholar]

[pone.0322287.ref015] 15.Howell RT, Howell CJ. The relation of economic status to subjective well-being in developing countries: a meta-analysis. Psychol Bull. 2008;134(4):536–60. doi: 10.1037/0033-2909.134.4.536 [DOI] [PubMed] [Google Scholar]

[pone.0322287.ref016] 16.Masini RP, Medeiros MC, Mendes EF. Machine learning advances for time series forecasting. Journal of Economic Surveys. 2023;37(1):76–111. doi: 10.1111/joes.12429 [DOI] [Google Scholar]

17. Sun X, Cao Y, Jin Z, Tian X, Xue M. An Adaptive ECMS Based on Traffic Information for Plug-in Hybrid Electric Buses. IEEE Trans Ind Electron. 2022;70(9):9248–59. doi: 10.1109/tie.2022.3210549

18. Balal AT, Jafarabadi YPT, Demir AT, Igene MT, Giesselmann MT, Bayne ST. Forecasting solar power generation utilizing machine learning models in Lubbock. 2023.

19. Lomas T. Exploring associations between income and wellbeing: new global insights from the Gallup World Poll. The Journal of Positive Psychology. 2024;19(4):629–46. doi: 10.1080/17439760.2023.2248963

20. Nilsson AH, Eichstaedt JC, Lomas T, Schwartz A, Kjell O. The Cantril Ladder elicits thoughts about power and wealth. Sci Rep. 2024;14(1):2642. doi: 10.1038/s41598-024-52939-y

21. Hamzah FB, Mohd Hamzah F, Mohd Razali SF, Samad H. A Comparison of Multiple Imputation Methods for Recovering Missing Data in Hydrological Studies. Civ Eng J. 2021;7(9):1608–19. doi: 10.28991/cej-2021-03091747

22. Shi C, Wei B, Wei S, Wang W, Liu H, Liu J. A quantitative discriminant method of elbow point for the optimal number of clusters in clustering algorithm. J Wireless Com Network. 2021;2021(1). doi: 10.1186/s13638-021-01910-w

23. Chicco D, Warrens MJ, Jurman G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput Sci. 2021;7:e623. doi: 10.7717/peerj-cs.623

24. Kenny M, Schoen I. Violin SuperPlots: visualizing replicate heterogeneity in large data sets. Mol Biol Cell. 2021;32(15):1333–4. doi: 10.1091/mbc.E21-03-0130

25. Lu X, Tan J, Cao Z, Xiong Y, Qin S, Wang T, et al. Mobile Phone-Based Population Flow Data for the COVID-19 Outbreak in Mainland China. Health Data Sci. 2021;2021:9796431. doi: 10.34133/2021/9796431

26. Zhao J, Li Z, Gao Q, Zhao H, Chen S, Huang L, et al. A review of statistical methods for dietary pattern analysis. Nutr J. 2021;20(1):37. doi: 10.1186/s12937-021-00692-7

27. Ghazal TM, Hussain MZ, Said RA, Nadeem A, Hasan MK, Ahmad M, et al. Performances of K-Means Clustering Algorithm with Different Distance Metrics. Intelligent Automation & Soft Computing. 2021;29(3):735–42. doi: 10.32604/iasc.2021.019067

28. Ikotun AM, Ezugwu AE, Abualigah L, Abuhaija B, Heming J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Information Sciences. 2023;622:178–210. doi: 10.1016/j.ins.2022.11.139

29. Liu Y, Tu W, Zhou S, Liu X, Song L, Yang X, et al. Deep Graph Clustering via Dual Correlation Reduction. AAAI. 2022;36(7):7603–11. doi: 10.1609/aaai.v36i7.20726

30. Antoniadis A, Lambert-Lacroix S, Poggi J-M. Random forests for global sensitivity analysis: A selective review. Reliability Engineering & System Safety. 2021;206:107312. doi: 10.1016/j.ress.2020.107312

31. Asselman A, Khaldi M, Aammou S. Enhancing the prediction of student performance based on the machine learning XGBoost algorithm. Interactive Learning Environments. 2023;31(6):3360–79. doi: 10.1080/10494820.2021.1928235

32. Yu Z, Guindani M, Grieco SF, Chen L, Holmes TC, Xu X. Beyond t test and ANOVA: applications of mixed-effects models for more rigorous statistical analysis in neuroscience research. Neuron. 2022;110(1):21–35. doi: 10.1016/j.neuron.2021.10.030
