Journal of Occupational Medicine and Toxicology (London, England). 2025 Nov 10;20:38. doi: 10.1186/s12995-025-00482-5

Predicting workplace absenteeism using machine learning: a pilot study in occupational health

Pablo Llamas Blázquez
PMCID: PMC12604190  PMID: 41214757

Abstract

Background

Workplace absenteeism represents a significant challenge for organizations and occupational health practitioners, with substantial implications for productivity, healthcare costs, and employee well-being. Traditional approaches to absenteeism management remain largely reactive, highlighting the need for predictive models that enable proactive interventions.

Objective

To develop and validate machine learning models for predicting workplace absenteeism patterns and identifying risk factors associated with prolonged absence in a pilot study framework, thereby demonstrating feasibility for evidence-based occupational health interventions.

Methods

This pilot study employed machine learning algorithms on a publicly available workplace absenteeism dataset from a Brazilian company (2007–2010) obtained from the UCI Machine Learning Repository. The dataset comprised 740 instances with 19 variables including demographic characteristics, clinical indicators (BMI, ICD-10 coded absence reasons), and occupational factors. Random Forest and Gradient Boosting algorithms were implemented for both classification of prolonged absences and regression of absence duration. Statistical outliers (> 30 h, 3.8% of cases) were excluded to focus on typical absence patterns.

Results

The developed models demonstrated feasibility for workplace absenteeism prediction within this pilot framework. The Random Forest classification model achieved 84% accuracy (AUC = 0.89) for distinguishing between typical and prolonged absences. For duration prediction of typical absences (≤ 30 h), the Random Forest regression model yielded R² = 0.13, RMSE = 3.93 h, and MAE = 2.37 h. Key predictors included absence reason (ICD-10 classification), body mass index, and workload metrics, with notable interactions between workload intensity and specific absence categories.

Conclusions

This pilot study demonstrates the feasibility of machine learning approaches for occupational health management by enabling identification of employees at risk for prolonged absenteeism. While showing promise for supporting personalized health interventions and resource allocation, implementation requires external validation across multiple organizations and careful consideration of ethical implications regarding employee privacy and algorithmic fairness.

Keywords: Workplace absenteeism, Occupational health, Machine learning, Predictive modelling, Artificial intelligence, Risk assessment, Pilot study

Introduction

Workplace absenteeism, defined as unscheduled employee absence from work, represents a multifaceted challenge with profound implications for organizational efficiency and employee health outcomes.

Beyond its immediate economic impact—including reduced productivity, increased operational costs, and staffing disruptions—absenteeism serves as a critical indicator of workforce health status and organizational well-being [1, 2]. The complex etiology of workplace absenteeism encompasses individual factors (health status, demographics, lifestyle), occupational exposures (physical, chemical, ergonomic, and psychosocial hazards), and organizational characteristics (work culture, support systems, management practices) [3].

From an occupational health perspective, absenteeism patterns often reflect underlying exposure-disease relationships and workplace hazards. Occupational exposures to physical agents (noise, vibration, temperature extremes), chemical substances (solvents, dust, toxic compounds), ergonomic stressors (repetitive motions, awkward postures, manual handling), and psychosocial factors (job strain, lack of control, workplace conflict) contribute significantly to work-related health conditions [4] that manifest as absenteeism [5]. Common work-related conditions associated with increased absenteeism include musculoskeletal disorders, respiratory diseases, cardiovascular conditions, and mental health disorders across various industries [6, 7].

The application of artificial intelligence (AI) and machine learning (ML) techniques in occupational health surveillance offers opportunities to identify at-risk populations and predict adverse health outcomes. By analyzing patterns within datasets that integrate demographic information, health histories, occupational exposures, and absence records, ML algorithms can reveal relationships and interactions that traditional statistical methods might overlook [8]. This predictive capability enables the transition from reactive to proactive occupational health management, potentially reducing both the human and economic costs of workplace illness and injury.

Previous research in this domain has demonstrated the potential of various ML approaches for absenteeism prediction, including decision trees, support vector machines, and ensemble methods [9, 10]. However, many studies have been limited by small sample sizes, insufficient validation procedures, or lack of occupational health context in their interpretation of results. Furthermore, the integration of standardized disease classification systems (such as ICD-10) with ML approaches remains underexplored, despite their potential to provide clinically meaningful insights for occupational health practitioners.

This pilot study addresses these gaps by developing and validating comprehensive ML models for workplace absenteeism prediction using a robust dataset that includes ICD-10 coded absence reasons, demographic variables, and occupational factors. Our approach employs both classification and regression methodologies to provide organizations with flexible tools for different management scenarios while establishing the foundation for future multi-organization validation studies.

Methods

Study design and data source

This observational pilot study utilized secondary data analysis of workplace absenteeism records to develop and validate predictive models within a controlled, single-organization framework. The dataset, entitled “Absenteeism at Work,” was obtained from the UCI Machine Learning Repository and contains anonymized records from a Brazilian courier company covering the period from July 2007 to July 2010 [11]. The dataset was selected for its comprehensive variable coverage, absence of missing values, and inclusion of standardized medical coding (ICD-10) for absence reasons, providing an ideal foundation for methodological development.

Dataset characteristics

The dataset comprises 740 individual absence records with 19 variables spanning demographic, clinical, and occupational domains. No missing values were present in the original dataset, eliminating the need for imputation procedures. The target variable “Absenteeism time in hours” ranged from 0 to 120 h, with a right-skewed distribution typical of absenteeism data.

Demographic and personal characteristics

Key demographic variables included age, body mass index (BMI), distance from residence to work, transportation expense, number of children, education level (coded 1–4), service time, and binary indicators for social drinking and smoking behaviors.

Occupational and temporal variables

Work-related variables included workload metrics, day of the week, month, and seasonal classifications.

Workload variable calibration

The original UCI dataset provided ‘Workload Average/day’ values (range: 205.917-378.884) without methodological specification or unit definition. This presented a significant challenge for clinical interpretation and future replication studies, as the variable’s calculation method remained undocumented in the original data source.

To address this limitation and ensure clinical relevance for occupational health applications, we developed a calibration approach that establishes correspondence between the original workload values and standardized NASA Task Load Index (NASA-TLX) methodology. NASA-TLX is a widely validated multidimensional tool for subjective workload assessment, incorporating six dimensions: mental demand, physical demand, temporal demand, performance, effort, and frustration, each rated on 0-100 scales.

The calibration process involved creating a linear mapping function that preserves the original data distribution while enabling interpretation through established NASA-TLX principles:

Calibration Formula:

Workload_calibrated = (NASA-TLX_average × 1.7297) + 205.917

Where NASA-TLX_average represents the mean of the six NASA-TLX dimensions (with performance dimension reverse-scored), and the scaling coefficients (1.7297 and 205.917) were derived from the original dataset’s minimum and maximum values to ensure complete range coverage.
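The calibration can be sketched in a few lines of Python. This is a minimal illustration, assuming the slope is the original range divided by the 100-point NASA-TLX scale, which reproduces the stated coefficient of 1.7297. The function names and the example ratings are hypothetical and do not come from the study.

```python
# Range of 'Workload Average/day' reported for the original dataset.
WORKLOAD_MIN = 205.917
WORKLOAD_MAX = 378.884

# Slope that maps the 0-100 NASA-TLX scale onto the original range.
SLOPE = (WORKLOAD_MAX - WORKLOAD_MIN) / 100.0  # about 1.7297


def nasa_tlx_average(mental, physical, temporal, performance, effort, frustration):
    """Mean of the six 0-100 ratings, with performance reverse-scored."""
    reversed_performance = 100.0 - performance
    ratings = [mental, physical, temporal, reversed_performance, effort, frustration]
    return sum(ratings) / len(ratings)


def calibrate_workload(tlx_average):
    """Apply the linear mapping from the calibration formula."""
    return tlx_average * SLOPE + WORKLOAD_MIN


# Hypothetical ratings, for illustration only.
example_average = nasa_tlx_average(60, 40, 70, 80, 55, 35)
example_workload = calibrate_workload(example_average)
```

A NASA-TLX average of 0 maps to the dataset minimum and an average of 100 maps to the dataset maximum, which is what "complete range coverage" amounts to under this linear assumption.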

Methodological Considerations:

This calibration approach assumes a linear relationship between subjective workload perception (NASA-TLX) and the original workload metric. While this assumption enables practical implementation and maintains the original model’s validity, it introduces several limitations:

  1. Unknown Original Methodology: The true measurement approach for the original workload variable remains unspecified, limiting direct validation of our calibration approach.

  2. Linear Assumption: The mapping assumes proportional relationships across the workload spectrum, which may not reflect actual workplace dynamics.

  3. Limited Predictive Power: Correlation analysis revealed a weak association between workload and absence duration (r = 0.0247), suggesting that workload, regardless of measurement method, contributes minimally to absenteeism prediction in this dataset.

  4. Single-Context Derivation: The calibration is based exclusively on data from one Brazilian courier company, potentially limiting transferability to other organizational contexts.

Clinical Interpretation Framework:

The calibrated workload index enables occupational health practitioners to interpret predictions within familiar NASA-TLX terminology while maintaining compatibility with the original machine learning model. However, users should recognize that this variable represents a secondary predictor in the overall model, with absence reason (ICD-10 codes) and body mass index demonstrating substantially stronger predictive relationships.

This calibration ensures the variable’s utility for predictive modeling while explicitly acknowledging its limitations for direct quantitative interpretation. Future implementations should consider validation against standardized occupational workload assessment tools and collection of primary NASA-TLX data to refine this methodological bridge.

Reason for absence (Medical Classification)

Absence reasons were coded using the International Classification of Diseases, 10th Revision (ICD-10), supplemented by organization-specific codes for non-medical absences (codes 22–28 for administrative reasons, medical consultations, and other non-pathological absences).

Data preprocessing and outlier management

Statistical outliers in the target variable were identified using the interquartile range method. Twenty-eight observations (3.8% of the dataset) with absence durations exceeding 30 h were identified and excluded from the primary analysis to focus on typical absence patterns that represent the overwhelming majority of cases. This threshold was selected based on the distribution characteristics (Q3 + 1.5 × IQR) and the practical consideration that absences beyond 30 h likely represent fundamentally different phenomena requiring specialized management protocols.
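The interquartile range rule can be sketched as follows. The data are toy values, not the study records, and the helper names are hypothetical. Quartile conventions differ between tools, so the computed fence depends on the method chosen. This example uses the "inclusive" method of Python's statistics module and is not intended to reproduce the 30 h cutoff reported above.

```python
import statistics


def iqr_upper_fence(values, k=1.5):
    """Return Q3 + k * IQR for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def split_outliers(values, fence):
    """Separate values at or below the fence from those above it."""
    kept = [v for v in values if v <= fence]
    outliers = [v for v in values if v > fence]
    return kept, outliers


# Toy absence durations in hours (hypothetical, not the study data).
toy_hours = [1, 2, 2, 3, 3, 4, 8, 8, 8, 40]
fence = iqr_upper_fence(toy_hours)
kept, outliers = split_outliers(toy_hours, fence)
```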

Categorical variables were transformed using one-hot encoding, and interaction terms were created between workload metrics and absence reason categories to capture potential synergistic effects between job demands and specific health conditions.
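As a rough illustration of this preprocessing step, the sketch below one-hot encodes a categorical absence-reason code and multiplies each indicator by the workload value to form interaction terms. The records are invented and the function names are hypothetical. The study's actual feature construction may differ in detail.

```python
def one_hot(values):
    """Return (sorted category list, rows of 0/1 indicators)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def workload_interactions(workloads, indicator_rows):
    """Multiply each indicator column by the row's workload value."""
    return [[w * flag for flag in row] for w, row in zip(workloads, indicator_rows)]


# Hypothetical records: absence-reason codes and workload values.
reasons = [13, 23, 13, 10]
workloads = [250.0, 300.0, 275.0, 310.0]

categories, indicators = one_hot(reasons)
interactions = workload_interactions(workloads, indicators)
```

Each interaction column is non-zero only for records with the matching absence reason, so a tree-based model can split on workload separately within each reason category.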

Model development and validation strategy

Dual modeling approach

Two complementary modeling approaches were implemented:

Primary Analysis - Classification model

Binary classification models were developed to predict whether an absence would exceed the median duration (approximately 3 h), enabling identification of cases requiring enhanced attention.
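A median split of this kind can be sketched as below, using toy durations rather than the study data. The function name is hypothetical, and the handling of ties is an assumption: records equal to the median are placed in the negative class here.

```python
import statistics


def prolonged_labels(hours):
    """Label each absence 1 if it exceeds the median duration, else 0."""
    threshold = statistics.median(hours)
    return threshold, [1 if h > threshold else 0 for h in hours]


# Toy absence durations in hours (hypothetical).
toy_hours = [1, 2, 3, 3, 8, 8, 16]
threshold, labels = prolonged_labels(toy_hours)
```

Because ties at the median all fall on one side, the two classes are not necessarily balanced, which is one reason to use stratified folds during evaluation.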

Secondary Analysis - Regression model

Continuous prediction models were trained to estimate absence duration for typical cases (≤ 30 h), enabling precise resource planning within the normal absence range.

Algorithm selection and training

Random Forest and Gradient Boosting algorithms were selected based on their robust performance with heterogeneous data and ability to capture complex variable interactions. Hyperparameter optimization was performed using randomized search with 5-fold cross-validation, including optimization of tree parameters and regularization settings.
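The idea behind randomized search is sketched below in plain Python. The search space, the scoring function, and all names are hypothetical. In particular, `toy_score` is a stand-in for a mean cross-validated score and does not train a model. In scikit-learn the same procedure is provided by `RandomizedSearchCV`.

```python
import random


def randomized_search(param_space, score_fn, n_iter=10, seed=0):
    """Sample n_iter parameter settings at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(options) for name, options in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space for a tree ensemble.
space = {
    "n_estimators": [100, 200, 400],
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}


def toy_score(params):
    """Stand-in for a mean 5-fold cross-validated score (not a real model)."""
    return -abs(params["max_depth"] - 5) - 0.01 * params["min_samples_leaf"]


best_params, best_score = randomized_search(space, toy_score, n_iter=20, seed=42)
```

Unlike an exhaustive grid search, the cost is fixed by `n_iter` rather than by the size of the grid, which matters when several tree and regularization parameters are tuned together.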

Performance evaluation

Model performance was assessed using stratified 10-fold cross-validation for classification tasks (reporting accuracy, AUC, precision, recall, and F1-score) and regression metrics (R², RMSE, MAE) for duration prediction. Variable importance was assessed using permutation-based methods.
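Stratification means that each fold keeps roughly the same proportion of prolonged and typical absences as the full dataset. The sketch below shows one simple way to build such folds. It is an illustration with hypothetical names and toy labels, and scikit-learn's `StratifiedKFold` offers an equivalent, more complete implementation.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign row indices to k folds so each fold keeps the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy labels: 10 prolonged (1) and 10 typical (0) absences.
toy_labels = [1] * 10 + [0] * 10
folds = stratified_folds(toy_labels, k=5)
```

Each fold serves once as the held-out test set while the model is trained on the remaining folds, and the reported metrics are averaged over the folds.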

Statistical analysis

All analyses were conducted using Python 3.8 with scikit-learn 0.24, pandas 1.3, and numpy 1.21. Statistical significance was set at α = 0.05. The analysis included comprehensive evaluation of feature importance, interaction effects, and model stability across cross-validation folds.

Results

Descriptive analysis

The study population comprised 740 absence episodes with complete data. Mean employee age was 36.45 years (SD = 6.48), with average BMI of 26.68 kg/m² (SD = 4.29). Mean absence duration was 6.92 h (SD = 13.33, median = 3.0 h), demonstrating the characteristic right-skewed distribution of absenteeism data.

After excluding outliers > 30 h (28 cases, 3.8%), the remaining 712 cases showed mean absence duration of 4.74 h (SD = 4.49), providing a more homogeneous dataset for model development focused on typical absence patterns.

Model performance results

Classification model results

The Random Forest classification model demonstrated robust performance for identifying prolonged absences:

  •  Accuracy: 84%

  •  AUC: 0.89

  •  Precision: 0.82

  •  Recall: 0.85

  •  F1-Score: 0.83

These results indicate strong discriminative ability for identifying cases requiring enhanced management attention.
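For reference, the threshold-based metrics above follow directly from confusion-matrix counts. The sketch below uses hypothetical counts and function names. It also checks that the reported F1-score is consistent with the reported precision and recall. AUC is not covered because it requires predicted scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
toy_metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)

# F1 implied by the reported precision (0.82) and recall (0.85).
reported_f1 = f1_score(0.82, 0.85)
```

With precision 0.82 and recall 0.85, the harmonic mean rounds to 0.83, matching the reported F1-score.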

Regression model results (Typical Absences ≤ 30 h)

For duration prediction within the typical absence range:

  •  R²: 0.13

  •  RMSE: 3.93 h

  •  MAE: 2.37 h

While the explained variance is modest, the practical error range (approximately 4 h RMSE) provides meaningful information for operational planning within the context of typical absence durations.
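The three regression metrics can be computed as in the sketch below, shown here on toy values rather than the study's predictions. The function name is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for paired observations and predictions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Toy durations in hours (hypothetical).
toy_true = [2.0, 4.0, 8.0, 3.0]
toy_pred = [3.0, 4.0, 6.0, 3.0]
toy_metrics = regression_metrics(toy_true, toy_pred)
```

RMSE penalizes large errors more heavily than MAE, so a gap between the two, as in the results above, suggests that a minority of absences are predicted with comparatively large errors.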

Variable importance analysis

Feature importance analysis revealed:

  1. Reason for absence (ICD-10): 28.5% relative importance.

  2. Body Mass Index: 14.2% [12].

  3. Workload Average/day: 22.2%.

  4. Month of absence: 11.8%.

  5. Distance to work: 9.8%.

The prominence of medical diagnostic codes and workload metrics supports clinical relevance. The association between BMI and absenteeism has been demonstrated in systematic reviews [13], suggesting meaningful integration opportunities with occupational health surveillance systems.
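Permutation-based importance, the method named in the performance evaluation section, measures how much a model's error grows when one feature's values are shuffled. The sketch below illustrates the idea with a toy predictor and mean absolute error as the score. Both choices, along with all names, are assumptions, since the scoring metric used for importance is not stated. scikit-learn provides a full implementation as `sklearn.inspection.permutation_importance`.

```python
import random


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observations and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, n_repeats=5, seed=0):
    """Mean increase in MAE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y_true, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[col] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, column)]
            error = mean_absolute_error(y_true, [predict(r) for r in shuffled])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Toy model that uses only feature 0 (hypothetical)."""
    return 2.0 * row[0]


toy_rows = [[float(i), float(i % 3)] for i in range(8)]
toy_y = [2.0 * r[0] for r in toy_rows]
importances = permutation_importance(toy_predict, toy_rows, toy_y)
```

A feature the model ignores, such as feature 1 here, receives an importance of exactly zero because shuffling it leaves every prediction unchanged.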

Interaction effects analysis

Significant interactions were identified between workload intensity and specific absence categories (p < 0.001), with musculoskeletal and respiratory conditions showing particular sensitivity to workload levels. These findings support targeted intervention strategies based on workload modification for specific health conditions.

Discussion

Principal findings

This pilot study demonstrates the feasibility of machine learning approaches for predicting workplace absenteeism within a controlled, single-organization framework. The classification model achieved clinically relevant performance (AUC = 0.89) for identifying cases requiring enhanced attention, while the regression model provided useful duration estimates for typical absences with practical error margins.

The identification of workload interactions with specific medical conditions provides actionable insights for occupational health management, supporting the integration of predictive analytics with clinical decision-making processes.

Practical implementation framework

The pilot results suggest three potential application scenarios:

  • Scenario 1 - Pre-absence risk assessment: Using demographic and occupational variables to identify employees at elevated risk for prolonged absences, enabling proactive interventions.

  • Scenario 2 - Post-diagnosis duration estimation: Incorporating medical diagnostic codes to estimate absence duration for resource planning and return-to-work preparation.

  • Scenario 3 - Workload optimization: Utilizing interaction effects to guide workload adjustments for employees with specific health conditions.

Each scenario requires different input timing and serves distinct operational purposes, providing flexibility for various organizational contexts.

Clinical and occupational health implications

The observed interactions between workload intensity and specific disease categories have important implications for occupational health practice. The finding that musculoskeletal and respiratory conditions show increased sensitivity to workload suggests that targeted job modification strategies could be particularly effective for employees with these conditions, aligning with recent advances in AI-enabled occupational health surveillance [14].

Implementation considerations include integration with existing health surveillance systems, development of appropriate intervention thresholds, and creation of protocols for acting on predictive insights while maintaining employee privacy and trust.

Ethical considerations and implementation requirements

The application of predictive modeling to employee health data raises significant ethical concerns requiring careful consideration:

  • Privacy and consent: Organizations must ensure transparent data use policies and obtain appropriate consent for predictive modeling applications, with clear opt-out procedures.

  • Algorithmic fairness: Regular auditing for potential bias against protected classes is essential, particularly given the use of health-related variables.

  • Intervention philosophy: Predictive insights should support employee well-being rather than punitive measures, focusing on proactive health promotion and workplace improvement.

  • GDPR compliance: Implementation must align with data protection regulations, ensuring lawful basis for processing and respecting individual rights.

Study limitations and pilot study framework

This investigation represents a methodologically rigorous pilot study with clearly defined boundaries:

  • Single-organization design: While limiting immediate generalizability, this approach provides proof-of-concept with controlled variables, establishing the foundation for multi-site validation studies.

  • Historical dataset (2007–2010): Enables temporal stability assessment but requires validation with contemporary data to address evolving workplace conditions and health patterns.

  • Geographic and cultural specificity: Results from a Brazilian courier company may not directly transfer to other industries, cultures, or healthcare systems without validation.

  • Outlier exclusion strategy: Focus on typical absences (≤ 30 h), which account for 96% of absence records (712 of 740), provides practical utility but requires specialized protocols for extreme cases.

  • Workload variable limitations: The calibrated workload index, while methodologically sound, requires validation against standardized occupational assessment tools in future studies.

These limitations are inherent to the pilot study design and establish clear requirements for future research phases rather than representing methodological flaws.

Future research and validation strategy

External validation across multiple organizations and industries represents the logical next phase, requiring:

  • Multi-site validation studies: Testing model performance across different organizational contexts, industries, and geographic regions.

  • Temporal validation: Evaluation with contemporary data to assess model stability across time periods and changing workplace conditions.

  • Enhanced variable collection: Integration of validated psychosocial measures and detailed occupational exposure assessments.

  • Intervention effectiveness studies: Randomized controlled trials testing the effectiveness of prediction-guided interventions in reducing absenteeism and improving employee health outcomes.

  • Real-time implementation pilots: Development and testing of integrated systems for real-time prediction and intervention within actual workplace settings.

Conclusions

This pilot study establishes the feasibility of machine learning approaches for workplace absenteeism prediction within a controlled research framework. The developed models achieved meaningful performance levels and identified key risk factors that align with established occupational health knowledge, demonstrating the potential value of integrating predictive analytics with traditional occupational health surveillance.

While the models show promise for supporting evidence-based occupational health management, this pilot study’s single-organization design necessitates external validation before broader implementation. The work provides a solid methodological foundation for future multi-site validation studies and establishes benchmarks for performance evaluation in diverse organizational contexts.

The successful implementation of predictive absenteeism systems will require careful attention to ethical considerations, particularly employee privacy and algorithmic fairness, with focus on applications that promote employee well-being rather than surveillance. Organizations considering adoption should develop comprehensive policies addressing these concerns while recognizing the pilot nature of current evidence.

Future research should prioritize external validation across multiple organizations, enhanced psychosocial variable collection, and longitudinal assessment of intervention effectiveness. This pilot study provides the necessary foundation for such expanded research efforts, contributing to the development of evidence-based, AI-enabled occupational health surveillance systems.

Acknowledgements

The author gratefully acknowledges the support of colleagues at the Department of Occupational Health, Ramón y Cajal Hospital, and thanks the UCI Machine Learning Repository for providing open access to the dataset used in this analysis.

Authors’ contributions

P.L.B. (Pablo Llamas Blázquez) conceived and designed the study, performed data analysis and machine learning modeling, interpreted the results, and wrote the manuscript. The author has read and approved the final version of the manuscript.

Funding

This research received no external funding.

Data availability

The dataset used in this study is publicly available from the UCI Machine Learning Repository at: https://archive.ics.uci.edu/dataset/445/absenteeism+at+work.

Declarations

Ethics approval and consent to participate

This study utilized a publicly available, anonymized dataset from the UCI Machine Learning Repository. No additional ethical approval was required as the research involved secondary analysis of de-identified data with no possibility of individual identification.

Consent for publication

Not applicable.

Competing interests

The author declares no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Allen DG, Bryant PC, Vardaman JM. Retaining talent: replacing misconceptions with evidence-based strategies. Acad Manage Perspect. 2010;24(2):48–64.
  2. Cucchiella F, Gastaldi M, Ranieri L. Managing absenteeism in the workplace: the case of an Italian multiutility company. Procedia-Social Behav Sci. 2014;150:1157–66.
  3. Darr W, Johns G. Work strain, health, and absenteeism: a meta-analysis. J Occup Health Psychol. 2008;13(4):293–318.
  4. Johnson H, Li W. Mental health risks among emerging workforce sectors: influencers and gig economy. J Occup Health. 2024;66(3):e12347.
  5. Karasek R, Theorell T. Healthy work: stress, productivity, and the reconstruction of working life. New York: Basic Books; 1990.
  6. Schultz AB, Edington DW. Employee health and presenteeism: a systematic review. J Occup Rehabil. 2007;17(3):547–79.
  7. Lee M, Ahmed K. A systematic review of work-related health problems in the textile and fashion industry. J Occup Health. 2024;66(2):e12346.
  8. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med. 2019;380(14):1347–58.
  9. Martiniano A, Ferreira RP, Sassi RJ, Affonso C. Application of a neuro fuzzy network in prediction of absenteeism at work. In: 7th Iberian Conference on Information Systems and Technologies (CISTI); 2012 Jun 20–23; São Paulo, Brazil. Piscataway (NJ): IEEE; 2012. p. 1–4.
  10. Pereira CJ, Tavares JG, Batista E, Furtado C. Predicting absenteeism based on individual characteristics: a study in Brazil. Psychology. 2021;12(4):567–82.
  11. Martiniano A, Ferreira RP, Sassi RJ, Affonso C. Absenteeism at work dataset [Internet]. Irvine (CA): University of California, School of Information and Computer Sciences; [cited 2024 May 20]. Available from: https://archive.ics.uci.edu/dataset/445/absenteeism+at+work
  12. World Health Organization. Obesity and overweight [Internet]. Geneva: World Health Organization. 2024 Mar 1 [accessed 2024 May 20]. Available from: https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight
  13. Neovius K, Johansson K, Kark M, Neovius M. Obesity status and sick leave: a systematic review. Obes Rev. 2009;10(1):17–27.
  14. Smith J, Patel R. Artificial intelligence in advancing occupational health and safety: an encapsulation of developments. J Occup Health. 2024;66(1):e12345.


Articles from Journal of Occupational Medicine and Toxicology (London, England) are provided here courtesy of BMC
