Abstract
Predicting the total number of children ever born in a country is a key component of the proper implementation of economic growth policy. Here, performance metrics were used to identify the models that appropriately describe the factors affecting the number of children ever born. Four partitions of the data (60% training and 40% validation, 70% training and 30% validation, 80% training and 20% validation, and 90% training and 10% validation) were compared to examine the behaviour of three models (Poisson regression, Negative Binomial regression and Generalized Poisson regression), with RMSE, R2, MAE and MSE as performance metrics. Although all three models had almost identical performance evaluation metrics, the Poisson regression was chosen as the most appropriate model because it is the simplest.
Keywords: Poisson regression, Negative Binomial regression, Generalized Poisson regression, Performance evaluation metrics
Specifications Table
Subject | Statistics, Demography |
Specific subject area | Statistics |
Type of data | The raw data are available in SPSS format (.sav). The analyzed data in this article are provided in tables and figures |
How data were acquired | Secondary data were obtained from the Nigeria Demographic and Health Survey (NDHS), covering all the regions |
Data format | Secondary: the NDHS data are refined from primary data that were collected and collated |
Parameters for data collection | The data are secondary data from the Nigeria Demographic and Health Survey, covering all regions |
Data source/Location | Primary data source: http://www.dhsprogram.com/data/dataset_admin/login_main.cfm Abuja, Nigeria |
Data accessibility | The data can be downloaded as an Excel file (.xlsx) from the supplementary material |
Related research article | Jecinta U Ibeji, Delia North, Temesgen Zewotir, Lateef Amusa. Modelling Fertility levels in Nigeria using Generalized Poisson regression-based Approach, Scientific African. https://doi.org/10.1016/j.sciaf.2020.e00494 |
Value of the Data
- The dataset gives information about the number of total children ever born in Nigeria, which is a key component for proper implementation of economic growth policy.
- Analysis of this dataset provides insight into the appropriate model describing the factors that affect children ever born in Nigeria.
- The dataset could be used to create integrated support tools for the government, health policymakers and international agencies concerned with fertility-associated problems.
- The information in this dataset will be valuable in planning and evaluation of fertility policies in Nigeria.
1. Data Description
The Nigeria Demographic and Health Survey (NDHS) 2013 was implemented by the National Population Commission, the agency charged with collecting and collating demographic data. In 2013, data on fertility levels, marriage and fertility preference were collected. The target group was women aged 15 to 49 years in randomly selected households across Nigeria. Of the 30977 households selected, 30878 women of childbearing age were interviewed. Children ever born are children born alive to married women aged 15 years and above. The data contain information on key indicators for urban and rural areas in Nigeria, the six geo-political zones, the 36 states and the Federal Capital Territory. The data on childbearing patterns were collected in different forms. First, each woman was asked the number of daughters and sons living with her, the number born alive who later died, and the number living elsewhere. In addition, a complete history of all of each woman's children was recorded, including the name, sex, month and year of birth, age, and survival of each child. Data were also collected for women who had ever been pregnant.
The secondary data, containing total children ever born together with the independent variables, were partitioned into training and validation samples of different percentages to study the performance of the three models using the parameter estimates, as seen in Tables 1–4.
Table 1. Performance evaluation metrics for the 60%:40% (training:validation) partition.
Model | Sample | MAE | MSE | RMSE | R2 |
---|---|---|---|---|---|
Poisson | Training | 1.613814 | 4.313774 | 2.076963 | 0.3624504 |
Poisson | Validation | 1.600686 | 4.262919 | 2.064684 | 0.3604352 |
Negative Binomial | Training | 1.613813 | 4.313784 | 2.076965 | 0.3624491 |
Negative Binomial | Validation | 1.600686 | 4.26293 | 2.064686 | 0.3604339 |
Generalized Poisson | Training | 1.613700 | 4.315765 | 2.077442 | 0.3622371 |
Generalized Poisson | Validation | 1.600644 | 4.264933 | 2.065171 | 0.3602402 |
Table 4. Performance evaluation metrics for the 90%:10% (training:validation) partition.
Model | Sample | MAE | MSE | RMSE | R2 |
---|---|---|---|---|---|
Poisson | Training | 1.609629 | 4.299476 | 2.975103 | 0.3615511 |
Poisson | Validation | 1.603307 | 4.213045 | 2.999453 | 0.3655643 |
Negative Binomial | Training | 1.609629 | 4.299486 | 2.975114 | 0.3615499 |
Negative Binomial | Validation | 1.603307 | 4.213054 | 2.999464 | 0.3655631 |
Generalized Poisson | Training | 1.609559 | 4.301766 | 2.977169 | 0.3613071 |
Generalized Poisson | Validation | 1.603503 | 4.216727 | 3.001452 | 0.3650468 |
Tables 1–4 show the predictive statistics of the dataset, while the inferential statistics of this dataset were discussed in our previous publication [1]. Table 1 contains a summary comparison of Poisson regression, Negative Binomial regression and Generalized Poisson regression using the 60%:40% partition, while Tables 2, 3 and 4 contain the 70%:30%, 80%:20% and 90%:10% partitions, respectively. All the variables used here can be seen in Table S1 in the supplementary information.
Table 2. Performance evaluation metrics for the 70%:30% (training:validation) partition.
Model | Sample | MAE | MSE | RMSE | R2 |
---|---|---|---|---|---|
Poisson | Training | 1.605426 | 4.273806 | 2.067319 | 0.3588404 |
Poisson | Validation | 1.616214 | 4.344664 | 2.084386 | 0.366524 |
Negative Binomial | Training | 1.605426 | 4.273815 | 2.067321 | 0.3588393 |
Negative Binomial | Validation | 1.616214 | 4.344675 | 2.084388 | 0.3665226 |
Generalized Poisson | Training | 1.605347 | 4.276152 | 2.067886 | 0.3585859 |
Generalized Poisson | Validation | 1.616177 | 4.347424 | 2.085048 | 0.3661882 |
Table 3. Performance evaluation metrics for the 80%:20% (training:validation) partition.
Model | Sample | MAE | MSE | RMSE | R2 |
---|---|---|---|---|---|
Poisson | Training | 1.611256 | 4.305445 | 2.074957 | 0.3598963 |
Poisson | Validation | 1.594595 | 4.218068 | 2.053794 | 0.3722824 |
Negative Binomial | Training | 1.611255 | 4.305455 | 2.074959 | 0.359895 |
Negative Binomial | Validation | 1.594595 | 4.218074 | 2.053795 | 0.3722818 |
Generalized Poisson | Training | 1.611217 | 4.307683 | 2.075496 | 0.3596542 |
Generalized Poisson | Validation | 1.594352 | 4.218566 | 2.053915 | 0.3722467 |
Based on the mean absolute error and root mean square error of the Poisson, Negative Binomial and Generalized Poisson regression models, the error for the training sample is, in most partitions, slightly higher than that for the validation sample [2], [3]. Tables 1–4 identify Poisson as the most appropriate predictive model for the validation samples.
In the predictive modelling, comparing the root mean square error, mean squared error, R-squared and mean absolute error for the training and validation samples of each model showed that all three models had almost identical performance evaluation metrics. The Poisson regression was therefore chosen as the most appropriate because it is the simplest model. This is important because it balances goodness of fit with simplicity while still predicting the probability of the outcome. Complex models adapt their shape to fit the data, but the additional parameter may not represent anything useful.
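The parsimony argument can be illustrated with a short Python sketch. It uses the validation-sample RMSE values from Table 1 and picks, among models whose RMSE lies within a tolerance of the best one, the model with the fewest extra parameters. The function name `pick_parsimonious` and the tolerance of 0.001 are assumptions made for illustration only; the source does not state a numerical threshold.

```python
# Validation-sample RMSE from Table 1 (60%:40% partition).
validation_rmse = {
    "Poisson": 2.064684,
    "Negative Binomial": 2.064686,
    "Generalized Poisson": 2.065171,
}

# Dispersion parameters estimated in addition to the regression coefficients.
extra_params = {
    "Poisson": 0,
    "Negative Binomial": 1,
    "Generalized Poisson": 1,
}


def pick_parsimonious(rmse, n_extra, tol=1e-3):
    """Among models within `tol` of the lowest RMSE, return the simplest one.

    `tol` is a hypothetical threshold, not a value taken from the article.
    """
    best = min(rmse.values())
    candidates = [name for name, value in rmse.items() if value - best <= tol]
    # Fewest extra parameters first; ties are broken by the lower RMSE.
    return min(candidates, key=lambda name: (n_extra[name], rmse[name]))


chosen = pick_parsimonious(validation_rmse, extra_params)
print(chosen)  # Poisson
```

With these inputs all three models fall within the tolerance, so the choice is decided by the parameter count rather than by the error metrics.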
2. Experimental Design, Materials and Methods
In this work, secondary data were obtained from the Nigeria Demographic and Health Survey 2013, covering all the regions and containing all the analyzed primary data. The secondary data were filtered, and the variables of interest were chosen. One major issue in fitting a model is how well it performs when applied to new data. To address this, the data need to be partitioned into a training set, which is used to create the model; a validation set, which is used to evaluate the model's performance; and a test set, which is used to assess how well the algorithm was trained using the training dataset. Using SAS version 9.4, partitions of 60% training and 40% validation, 70% training and 30% validation, 80% training and 20% validation, and 90% training and 10% validation were compared to examine the behaviour of the three models (Poisson regression, Negative Binomial regression, and Generalized Poisson regression).
Furthermore, the variation in the training performance evaluation metrics under each partition was examined as follows. First, the model was fitted on the training dataset using a supervised learning method. The current model was then run on the training dataset, and its output was compared with the target for each input vector in the training dataset. Based on this comparison and the specific learning algorithm being used, the models' parameters were adjusted; variable selection and parameter estimation can be included in the model fitting [4]. Subsequently, the fitted model was used to predict the responses in the validation dataset. While the model's hyperparameters are being tuned, the validation dataset provides an unbiased evaluation of a model fit on the training dataset [5].
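The partition-and-validate procedure can be sketched as follows. The analysis in this article was run in SAS 9.4; the Python code below is only an illustrative stand-in. It assumes synthetic count data with a single hypothetical covariate `x`, fits a Poisson regression by Newton-Raphson on the training sample, and reports the RMSE on both samples for each of the four partitions. All function names, the seed values and the simulated coefficients (0.5 and 0.8) are assumptions and do not come from the NDHS data.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def split(rows, train_frac, seed=0):
    """Shuffle a copy of the rows and cut it into training and validation samples."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def fit_poisson(rows, iters=25):
    """Newton-Raphson fit of log(mu) = b0 + b1 * x for (x, y) pairs."""
    b0 = math.log(sum(y for _, y in rows) / len(rows))
    b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in rows:
            mu = math.exp(b0 + b1 * x)
            g0 += y - mu          # score for the intercept
            g1 += (y - mu) * x    # score for the slope
            h00 += mu             # information matrix entries
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def rmse(rows, b0, b1):
    """Root mean square error of the fitted means on the given rows."""
    squared = [(y - math.exp(b0 + b1 * x)) ** 2 for x, y in rows]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic stand-in data (hypothetical): counts with log-mean 0.5 + 0.8 * x.
rng = random.Random(42)
data = []
for _ in range(2000):
    x = rng.uniform(0.0, 2.0)
    data.append((x, poisson_draw(rng, math.exp(0.5 + 0.8 * x))))

results = {}
for frac in (0.6, 0.7, 0.8, 0.9):
    train, valid = split(data, frac)
    b0, b1 = fit_poisson(train)
    results[frac] = (len(train), len(valid), b0, b1,
                     rmse(train, b0, b1), rmse(valid, b0, b1))
    print(f"{frac:.0%} training: RMSE train={results[frac][4]:.4f}, "
          f"validation={results[frac][5]:.4f}")
```

The same loop could be repeated for the Negative Binomial and Generalized Poisson models; only the Poisson case is shown to keep the sketch short.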
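The mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE) and coefficient of determination (R2) are the performance evaluation metrics used. The formulas are presented below.
Root Mean Square Error (RMSE) is given as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
Mean Absolute Error (MAE) is given as:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$
Mean squared error (MSE) is given as:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
Where N is the total number of observations, $y_i$ is the observed value and $\hat{y}_i$ is the predicted value for observation $i$.
Coefficient of determination (R2):
$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$
where $\bar{y}$ is the mean of the observed values.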
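As a worked illustration, the four metrics can be computed with a few lines of Python. This is a minimal sketch that assumes the usual definitions (MAE and MSE as averages of absolute and squared errors, RMSE as the square root of MSE, and R2 as one minus the ratio of the residual to the total sum of squares). The toy `observed` and `predicted` values are hypothetical and are not taken from the NDHS data.

```python
import math


def mae(y, yhat):
    """Mean absolute error: average of |y_i - yhat_i|."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)


def mse(y, yhat):
    """Mean squared error: average of (y_i - yhat_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)


def rmse(y, yhat):
    """Root mean square error: square root of the MSE."""
    return math.sqrt(mse(y, yhat))


def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot


# Hypothetical toy values: observed and predicted numbers of children.
observed = [0, 2, 3, 5]
predicted = [1, 2, 2, 4]

print(mae(observed, predicted))   # 0.75
print(mse(observed, predicted))   # 0.75
print(rmse(observed, predicted))  # about 0.866
print(r2(observed, predicted))    # about 0.769
```

Here the errors are -1, 0, 1 and 1, so MAE and MSE both equal 3/4, and R2 equals 1 - 3/13 because the total sum of squares around the mean of 2.5 is 13.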
CRediT Author Statement
Jecinta U. Ibeji: Conceptualization, Methodology, Data curation, Writing - Original draft preparation; Temesgen Zewotir: Conceptualization and Supervision; Delia North: Reviewing and Editing; Lateef Amusa: Visualization, Investigation, Data search and Editing.
Declaration of Competing Interest
The authors declare that there is no conflict of interest.
Acknowledgement
The authors are grateful to the University of KwaZulu-Natal, Durban for operational and infrastructural support and to DHS for granting the authors access to the NDHS 2013 data.
Funding
This work was not funded.
Footnotes
Supplementary material associated with this article can be found in the online version at doi:10.1016/j.dib.2021.107083.
References
- 1. Ibeji J.U., Zewotir T., North D., Amusa L. Modelling fertility levels in Nigeria using Generalized Poisson regression-based approach. Scientific African. 2020;9:e00494.
- 2. Aertsen W., Kint V., Van Orshoven J., Özkan K., Muys B. Comparison and ranking of different modelling techniques for prediction of site index in Mediterranean mountain forests. Ecol. Model. 2010;221(8):1119–1130.
- 3. Onoro-Rubio D., López-Sastre R.J. Towards perspective-free object counting with deep learning. In: European Conference on Computer Vision. Springer; 2016. pp. 615–629.
- 4. Chicco D. Ten quick tips for machine learning in computational biology. BioData Mining. 2017;10(1):35. doi: 10.1186/s13040-017-0155-3.
- 5. James G., Witten D., Hastie T., Tibshirani R. An Introduction to Statistical Learning. Springer; 2013.