Data-efficient prediction in tableting using word embeddings and empirically-guided neural networks

Najeeb Abdelrahman; Stefan Klinken-Uth

doi:10.1016/j.ijpx.2025.100458

2025 Dec 3;11:100458. doi: 10.1016/j.ijpx.2025.100458


PMCID: PMC12828739 PMID: 41583060

Abstract

The development of robust oral tablet formulations remains time-consuming, often limited by scarce data and the difficulty of incorporating categorical formulation variables into predictive models. Traditional regression methods are interpretable but struggle with nonlinear interactions, whereas modern machine learning approaches offer higher predictive power at the expense of transparency. In this study, we present a neural network framework that employs word embedding layers to represent categorical formulation factors, such as active pharmaceutical ingredients (APIs), as trainable semantic vectors. These embeddings are integrated with empirically-guided output functions and a deep ensemble strategy to predict tablet quality attributes, including tensile strength and density as well as ejection force, and dosing height, based solely on formulation composition, compression pressure, and tablet weight. The model achieved predictive accuracy comparable to or exceeding classical regression while reliably avoiding physically implausible outputs. Analysis of the learned embedding vectors revealed meaningful clustering of APIs, enabling transfer learning across materials and robust predictions even for APIs with few or no training data. Furthermore, information gain analysis demonstrated that low-concentration formulations can substantially enhance predictive accuracy, supporting more material-efficient experimental designs. These results highlight embedding-based, empirically-guided neural networks as explainable and practical tools that could accelerate pharmaceutical formulation development in the future.

Keywords: Word embeddings, Data-efficient modeling, Empirically-guided learning, Explainable artificial intelligence, Formulation development, Tablet formulation

Highlights

• Word embeddings enable neural networks to encode categorical formulation variables (APIs).
• Empirically-guided output functions ensure physically consistent prediction of tablet properties.
• Predicts tensile strength, density, ejection force, and dosing height from minimal inputs.
• Shapley analysis reveals that low-API formulations provide high informational value.

1. Introduction

Tablets remain a common and cost-effective dosage form for oral drug delivery. However, the development of robust formulations continues to present significant challenges, including poor bioavailability of the active pharmaceutical ingredient (API), insufficient mechanical strength, undesired drug release profiles, sticking or picking phenomena, excessive ejection forces, poor flowability of the starting materials, and variability in tablet mass and content uniformity. Accelerating formulation research and development is particularly important, as it determines the market entry of new drug substances during their patent-protected period. Moreover, it enables patients to gain earlier access to novel therapeutic options.

The growing availability of formulation and process data, together with advances in predictive modeling, is transforming pharmaceutical research into a more data-driven discipline (Kostewicz et al., 2014; Wang et al., 2021). When effectively leveraged, such tools promise to substantially reduce development costs, accelerate technology transfer and scale-up, and ultimately enhance product quality and manufacturing robustness (Hole et al., 2021). The methods employed in the realm of oral drug formulation development range from conventional statistical approaches, such as linear regression, principal component analysis (Basim et al., 2019; Berkenkemper et al., 2023b; Haware et al., 2009), and partial least squares regression (Dai et al., 2019), to more advanced machine learning techniques, including tree-based models (Hayashi et al., 2021; Meynard et al., 2022) and neural networks (Bao et al., 2023; Bounab et al., 2025). Modern machine learning algorithms, particularly neural networks, enable the modeling of more complex, nonlinear data relationships and exhibit superior predictive performance. However, this often comes at the expense of interpretability (Montavon et al., 2018), giving rise to a persistent trade-off between accuracy and transparency in regulated pharmaceutical environments (Vamathevan et al., 2019).

To mitigate the problem of limited interpretability in machine learning model predictions, a variety of techniques have been developed under the umbrella term explainable artificial intelligence (XAI). While these approaches do not render the information flow within a network fully transparent, they allow for an interpretable understanding of the model's decision-making process. The use of XAI has already been explored in several studies within the pharmaceutical domain. For instance, Shapley estimates have been employed to quantify the contribution of individual input factors to the model's predictions (Shahab et al., 2025). Furthermore, the implementation of multi-head attention layers, the integration of large language models to generate explanatory text accompanying each prediction (Bounab et al., 2025), and sensitivity analysis (Honti et al., 2024) represent additional strategies to enhance interpretability. Moreover, neural architectures can be designed to derive tabletability profiles from time series data, thereby increasing confidence in the prediction of individual tensile strength values (De Bisshop and Klinken, 2023). Additional examples of machine learning applied to tablet formulation development can be found in the recent literature. For instance, Grigoryan et al. demonstrated the use of a machine learning–based algorithm to predict formulation stability solely from chemical descriptors of the active substances (Grigoryan et al., 2025). However, the publication provides neither methodological detail nor details on the underlying dataset.

Further work in the context of oral solid dosage form development and manufacturing includes studies on the prediction of sticking tendencies based on chemical descriptors via support vector machines (Ramahi et al., 2024), detection of tablet defects via neural networks (Kim and Han, 2025; Ma et al., 2020) or decision trees (Meynard et al., 2022), prediction of critical quality attributes (CQAs) and intermediate CQAs in a continuous manufacturing line via forest-based learning (Deebes et al., 2025), PAT applications (Nagy et al., 2022), tablet disintegration prediction via Bayesian regression (Ghazwani and Hani, 2025) and random forests (Gupta et al., 2024), and prediction of dissolution profiles via convolutional (Galata et al., 2023) and recurrent networks (Li et al., 2025). These application examples underscore both the growing interest in machine-learning–driven formulation research and the breadth of potential use cases, spanning a wide range of model types and architectures.

Another key limitation arises from data availability. Many reported models depend on extensive material characterization or process analytical technologies, which, although scientifically valuable, are often impractical in industrial settings due to cost and time constraints. For instance, some models predicting tablet properties from powder mixtures are based on investigations of the corresponding pure components or selected compositions within the mixture space (Berkenkemper et al., 2023a; Corrigan et al., 2024; Michrafy et al., 2007; Tait et al., 2024; Valekar and Buckner, 2025). These models provide valuable physical insight but frequently require compound-specific calibration and are often limited in their ability to capture complex multi-component interactions or to transfer knowledge across different formulations. Consequently, there is a clear need for predictive frameworks that balance generalization and accuracy with practical feasibility by relying on easily accessible formulation and process parameters. Several machine learning approaches have been proposed in this context, some of which introduce specialized architectures tailored to the specific characteristics of pharmaceutical tablet formulation datasets (Bounab et al., 2025; De Bisshop and Klinken, 2023). The selection of an appropriate modeling strategy is ultimately dictated by the characteristics of the available input data and the nature of the target variables to be predicted. In this context, compositional data of formulations represent a distinct and particularly challenging category.

Information on pharmaceutical formulation compositions is difficult to incorporate into predictive models for several reasons. The number of components, their proportions, and their types can vary substantially between formulations. In particular, the categorical nature of the component types poses a major challenge, as such nominal-scale information cannot be directly processed by most machine learning algorithms. To address this, two standard approaches have typically been employed. First, surrogate descriptors representing physicochemical or structural properties of the formulation components (e.g., solubility or molecular descriptors) can be used. Second, one-hot encoding can be applied, where each component type is assigned a separate column in the model input matrix, with the corresponding proportion as the feature value. Both strategies, however, exhibit inherent limitations: numerical proxies can introduce bias, while one-hot encoding leads to sparse, high-dimensional feature spaces that impede model generalization. Consequently, categorical formulation information, despite its critical relevance for CQAs, likely remains underexploited in current pharmaceutical data-driven modeling approaches.

To close this gap, we propose the use of word embedding layers, an approach originating from natural language processing, to represent categorical formulation and process variables in neural networks. In this framework, embeddings transform symbolic inputs such as API identity into low-dimensional, trainable vectors that capture semantic relationships in a continuous space. Applied to pharmaceutical data, embeddings may represent APIs, excipients, or process factors as latent vectors learned directly from the dataset. This strategy offers two key advantages: (i) improved predictive performance without requiring additional material information or characterization, and (ii) enhanced interpretability in the context of XAI, since analysis of the embedding space can reveal domain-relevant relationships among categorical factors. Despite these benefits, embedding layers also entail several well-recognized challenges. First, embeddings trained on small or highly imbalanced datasets may fail to capture meaningful relational structure, particularly for rarely occurring APIs, excipients, or process conditions. Second, if the dimensionality of the embedding space is set too high, the model may no longer learn semantic relationships but instead approximate a high-dimensional one-hot encoding. Third, embeddings cannot represent truly novel or unseen categories at inference time, as no latent vector can be learned for factors absent from the training data. Finally, although embeddings improve interpretability relative to one-hot encodings, the latent dimensions themselves lack explicit semantic labels, meaning that downstream XAI analyses still require careful contextual interpretation.

As a case study in oral dosage form development, we developed a neural network that predicts tablet properties, including tensile strength, tablet density, ejection force, and dosing height, using only formulation composition, compression pressure, and tablet weight as inputs. These inputs were selected to mimic early-stage development conditions, where data are scarce and analytical resources limited. Ejection force serves as a process constraint for certain formulations, while dosing height prediction facilitates rapid adjustments in development and manufacturing environments. To enhance robustness, the architecture integrates empirically-guided output functions and deep ensemble learning (Mohammed and Kora, 2023). Analogous to the concept of physical guidance (Karniadakis et al., 2021), empirical guidance is achieved by embedding (semi-) empirical models as output functions within the neural network, thereby stabilizing its predictions and ensuring consistency with established physical relationships. This approach effectively shifts the model toward a more interpretable grey-box paradigm, an intermediate framework that combines the flexibility of data-driven (black-box) methods with the transparency and mechanistic grounding of physics-based (white-box) models. Such hybrid modeling strategies have already been adopted in pharmaceutical process control, where interpretability and regulatory compliance are of paramount importance (Tölle et al., 2025). Benchmarking against classical linear regression models trained on substance-specific subsets allows for a direct comparison with established white-box approaches. To further elucidate the model's internal representations, the learned embeddings are analyzed to assess their pharmaceutical relevance and interpretability, providing insights into how categorical factors are encoded in the latent space.

In addition, Shapley value estimations obtained via Monte Carlo simulations are employed to quantify the information content of individual data points within the dataset (Jia et al., 2019). This analysis establishes, for the first time, a quantitative framework to assess the informational value of experimental investigations relative to API consumption, thereby supporting more efficient experimental design and resource allocation in formulation development.

To the authors' knowledge, this study is the first to incorporate word embeddings for categorical variable representation in pharmaceutical machine learning. By enabling transfer learning across APIs and providing interpretable insights into categorical effects, this embedding-based, empirically-guided neural network framework aims to offer a practical and explainable approach to accelerate data-driven formulation and process design, bridging the gap between innovation and industrial applicability.

2. Materials and methods

2.1. Materials

The materials employed are listed in Table 1. Here, the term API refers not only to drug substances with actual pharmacological activity, but also to surrogate compounds that serve solely to introduce diversity in physicochemical properties and process behavior within the database. IBU_G represents a pre-granulated grade of ibuprofen, specifically designed for direct compression and containing additional excipients. The material Ibuprofen 50 was included in the study twice (IBU_P and IBU_{P rep.}), with both datasets analyzed separately.

Table 1.

Excipients and APIs used in tablet formulations.

Brand name	Chemical Identity	Purpose in the formulation	Abbreviation
–	Acetylsalicylic acid	API	ASA
Kollidon® CL-F	Crospovidone	Disintegrant	–
DI-CAFOS A 12	Dibasic calcium phosphate anhydrous	API surrogate	DCPA_A12
–	Efavirenz	API	EV
Ibuprofen 50	Ibuprofen	API	IBU_P and IBU_{P rep.}
Ibuprofen DC85 W	Ibuprofen	API	IBU_G
FlowLac® 100	α-lactose monohydrate	Filler	–
–	Lopinavir	API	LPV
Magnesium stearate Pharma VEG	Magnesium stearate	Lubricant	–
–	Metformin	API	MFM
VIVAPUR® 101	Microcrystalline cellulose	Binder	–
–	Paracetamol	API	AAP
AEROSIL® 150 V	Silicon dioxide	Glidant	–
–	Sodium benzoate	API	SoBe

2.2. Experimental procedure

The selection of experimental points was determined using Halton sequences (Halton, 1960). This approach ensures a uniform, space-filling distribution of data points and prevents sparsely populated regions, which are often encountered in full factorial or optimal designs. In this process, the proportions of the API, binder, disintegrant, and the tablet mass were varied. For each API or API surrogate, a separate Halton sequence was generated to ensure the distinctiveness of the different experimental series. The powder blends were prepared using a laboratory high-shear mixer operated at 400 rpm for 5 min. The lubricant was incorporated in an additional mixing step at the same speed for 30 s. In addition to the API blends, a placebo system was prepared analogously to the API formulations. In the placebo formulation, the API fraction specified by the Halton sequence was replaced by filler, keeping binder and disintegrant ratios identical to the API blends. For poorly flowing materials, 0.5 % colloidal silicon dioxide was added during the first mixing step. This adjustment was required only for AAP to ensure adequate processability in the compaction simulator.
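As a rough illustration of how such space-filling points can be generated, the following sketch computes a Halton sequence from radical inverses and rescales it to factor ranges. The factor names and bounds are hypothetical placeholders and not the ranges used in this study; the choice of prime bases is likewise an assumption.

```python
def radical_inverse(index: int, base: int) -> float:
    """Van der Corput radical inverse of a positive integer in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_points(n_points: int, bases=(2, 3, 5, 7)):
    """First n_points of the Halton sequence in the unit hypercube (index starts at 1)."""
    return [[radical_inverse(i, b) for b in bases] for i in range(1, n_points + 1)]


# Hypothetical factor ranges, assumed for illustration only.
FACTOR_RANGES = {
    "api_fraction": (0.05, 0.60),
    "binder_fraction": (0.10, 0.30),
    "disintegrant_fraction": (0.00, 0.05),
    "tablet_mass_mg": (150.0, 300.0),
}


def scale_to_ranges(unit_points, ranges=FACTOR_RANGES):
    """Map unit-cube points linearly onto the factor ranges."""
    names = list(ranges)
    return [
        {name: ranges[name][0] + u * (ranges[name][1] - ranges[name][0])
         for name, u in zip(names, point)}
        for point in unit_points
    ]


design = scale_to_ranges(halton_points(8))
```

Each entry of `design` is one candidate formulation; in this sketch the filler fraction would simply take up the remainder of the composition.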

Subsequently, the blends were compressed into tablets using a compaction simulator (STYL'One Evolution, Medelpharm, France). 8 mm EU-B punches with a V-shaped profile were employed at a punch speed of 30 mm·s⁻¹. Die filling was performed manually. For each prepared blend, tablets were compressed at three different compaction pressures, ranging approximately from 50 to 300 MPa. The compaction was carried out in a climate-controlled environment at 21 °C and 45 % r.h. Prior to further analysis, the tablets were stored under the same controlled conditions for approx. 24 h. Tablet characterization was performed using an automated tablet tester (SmartTest50, Sotax AG, Switzerland) on a sample of 10 tablets. Individual tablets were not measured in a fixed sequence, precluding one-to-one matching between STYL'One and SmartTest data. This simplified handling and accelerated production. Consequently, the data are applied as mean values in modeling. The tensile strength was calculated from the measured crushing force according to the equation by Fell and Newton (Fell and Newton, 1970; Hertz, 1882). For this calculation, the diameter of the die was used instead of the diameter measured by the SmartTest. The tablet density was calculated similarly based on the measured tablet height and mass via SmartTest and the diameter of the equipped tooling.
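The two derived quantities can be sketched as follows, assuming a flat-faced cylindrical tablet geometry and the 8 mm tooling diameter stated above. The function names and the example values are illustrative only and do not correspond to measurements from the study.

```python
import math

DIE_DIAMETER_MM = 8.0  # tooling diameter stated in the text


def tensile_strength_mpa(crushing_force_n, thickness_mm, diameter_mm=DIE_DIAMETER_MM):
    """Diametral tensile strength 2F / (pi * D * t); N and mm give MPa."""
    return 2.0 * crushing_force_n / (math.pi * diameter_mm * thickness_mm)


def tablet_density_g_cm3(mass_mg, height_mm, diameter_mm=DIE_DIAMETER_MM):
    """Apparent density of a cylinder; mg per mm^3 is numerically equal to g per cm^3."""
    volume_mm3 = math.pi * (diameter_mm / 2.0) ** 2 * height_mm
    return mass_mg / volume_mm3


# Illustrative values (assumed, not measured data).
example_strength = tensile_strength_mpa(crushing_force_n=100.0, thickness_mm=3.0)
example_density = tablet_density_g_cm3(mass_mg=200.0, height_mm=3.0)
```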

2.3. Database pre-evaluation, curation and presentation

The batch data were exported using the analysis software of the compaction simulator. The datasets from the compaction simulator, the SmartTest, and the disintegration test were combined using a Python script. Data were flagged as erroneous when tablet counts from SmartTest and STYL'One did not match or when calculated density < 0.6 g cm⁻³; such data were excluded from analysis. For modeling, the mean values of compaction pressure, tablet mass, and tensile strength were considered for each batch.

To increase the degrees of freedom of the regression models used for assessing tabletability within the database, and thereby improve data resolution, the individual tablet measurements were considered. It was assumed that the applied compaction pressures could be associated with the corresponding tensile strength values by pairing them in sorted order: within each batch of ten tablets, the highest compaction pressure was assigned to the tablet exhibiting the highest tensile strength, and the remaining values were matched accordingly. The minor error introduced by this approximation was accepted to facilitate visualization, as it does not influence the subsequent evaluation of the model. However, this assumption must be taken into account when interpreting the resulting graphs. As described in Section 2.2, only the batch means were used for training the neural networks. The paired data were then fitted using the Sun equation for tabletability (Equation I) (Vreeman and Sun, 2022).

$$σ = \left| σ_{\max} \, e^{α W\left(- e^{- \frac{P}{β} - 1}\right)} \right| \tag{1}$$

where $W$ denotes the Lambert W function.

The non-linear regression was performed in SciPy (v1.15.2) via optimize.curve_fit, using the trust region reflective (TRF) algorithm with up to 10⁴ iterations. The initial guesses were $σ_{\max} = 1$ MPa, $α = 1$, and $β = 500$ MPa. All coefficients were constrained to positive values.
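The rank-based pairing and the subsequent fit could look roughly like the sketch below. It runs on synthetic batches, not on data from the study, and the mapping of the stated iteration count to `max_nfev` is an assumption, as are the helper names.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import lambertw


def sun_tabletability(pressure, sigma_max, alpha, beta):
    """Equation (1) of the text, evaluated on the principal Lambert W branch."""
    w = lambertw(-np.exp(-np.asarray(pressure, dtype=float) / beta - 1.0)).real
    return np.abs(sigma_max * np.exp(alpha * w))


def pair_in_sorted_order(pressures, strengths):
    """Rank-match one batch: highest pressure with highest tensile strength."""
    return np.sort(pressures), np.sort(strengths)


# Synthetic batches of ten tablets each, for illustration only.
rng = np.random.default_rng(0)
paired_p, paired_s = [], []
for nominal in (60.0, 150.0, 280.0):
    p = nominal + rng.uniform(-10.0, 10.0, size=10)
    s = sun_tabletability(p, 3.0, 2.0, 150.0) * (1.0 + 0.02 * rng.standard_normal(10))
    # Shuffle strengths to mimic the lost one-to-one tablet matching.
    p_sorted, s_sorted = pair_in_sorted_order(p, rng.permutation(s))
    paired_p.append(p_sorted)
    paired_s.append(s_sorted)
pressure = np.concatenate(paired_p)
strength = np.concatenate(paired_s)

popt, _ = curve_fit(
    sun_tabletability, pressure, strength,
    p0=(1.0, 1.0, 500.0),   # initial guesses stated in the text
    bounds=(0.0, np.inf),   # positivity constraint
    method="trf",
    max_nfev=10_000,        # assumed reading of the stated 10^4 iterations
)
```

The fitted coefficients in `popt` are ordered as $σ_{\max}$, $α$, $β$. With only three pressure levels the individual coefficients can be strongly correlated, so the fitted curve is usually more informative than any single coefficient.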

2.4. Network design

The data processing path for deep learning and the architecture of the artificial neural network are depicted in Fig. 1. (See Fig. 2.)

2.4.1. Input branches

The network architecture comprises three distinct input layers. The first input branch receives a vector of formulation and process-related features. These features include tablet mass, API fraction, binder content, and disintegrant content, together with additional process descriptors, normalized to the range [−1,1]. This input further includes linear interaction terms and second-order polynomial features, akin to a traditional MLR model. The second input branch encodes the categorical material identifier, corresponding to the API type. Each API is mapped to a unique integer, starting at zero. The third input branch receives the applied compression pressure as a scalar value. This input is used both in its unscaled form (for the empirically-guided output functions) and in a normalized form (divided by 300) for improved numerical stability in the ejection force output branch.
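A minimal sketch of how the first and third inputs might be assembled is given below. The scaling ranges are hypothetical, the additional process descriptors are omitted, and building the interaction and squared terms from the already scaled main effects is an assumption of this sketch.

```python
from itertools import combinations

# Hypothetical scaling ranges (assumed; not the ranges of the study).
FEATURE_RANGES = {
    "tablet_mass_mg": (150.0, 300.0),
    "api_fraction": (0.0, 0.6),
    "binder_fraction": (0.1, 0.3),
    "disintegrant_fraction": (0.0, 0.05),
}
PRESSURE_DIVISOR = 300.0  # scaling of the compression pressure stated in the text


def scale_to_pm1(value, lower, upper):
    """Linearly map [lower, upper] onto [-1, 1]."""
    return 2.0 * (value - lower) / (upper - lower) - 1.0


def build_feature_vector(raw):
    """Main effects, pairwise interactions, and squared terms (MLR-like expansion)."""
    main = [scale_to_pm1(raw[name], *FEATURE_RANGES[name]) for name in FEATURE_RANGES]
    interactions = [a * b for a, b in combinations(main, 2)]
    squares = [a * a for a in main]
    return main + interactions + squares


example = {
    "tablet_mass_mg": 225.0,
    "api_fraction": 0.6,
    "binder_fraction": 0.1,
    "disintegrant_fraction": 0.025,
}
features = build_feature_vector(example)
scaled_pressure = 150.0 / PRESSURE_DIVISOR
```

For four main effects this expansion yields 4 + 6 + 4 = 14 entries; the actual input width of the published network may differ.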

2.4.2. Embedding

To incorporate API-specific effects, an embedding layer maps the categorical API identifiers into a four-dimensional trainable vector space. The low dimensionality of the embedding space reflects the limited number of APIs in the dataset and mitigates the risk of overparameterization. To ensure a well-defined alignment in the subsequent spherical Procrustes analysis, the embedding dimensionality was chosen smaller than the number of API variables, thereby reducing the degrees of rotational freedom. An excessively high number of embedding dimensions may otherwise lead the model to encode mere categorical distinctions rather than meaningful semantic relationships between materials.

The embeddings are initialized with a normal distribution ( $μ = 0, σ = 0.1$ ), constrained by a MaxNorm (Srebro et al., 2004) of 1.5, and regularized using L2 regularization (Krogh and Hertz, 1991) ( $λ = 10^{- 6}$ ) to increase the stability of the learned vectors. The embedding of the placebo batch is hard-anchored to the origin of the embedding space, meaning that its location remains fixed during training. This design aims to ensure that the placebo formulation functions as a stable baseline and serves as a constant reference point within the embedding space.
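The constraints on the embedding table can be illustrated with a small NumPy sketch. It is not the Keras implementation used in the study: the index assignment is hypothetical, the max-norm is applied per embedding vector as an assumption, and the placebo anchoring is emulated by resetting its row to zero after every update.

```python
import numpy as np

# Hypothetical index assignment; only the abbreviations are taken from Table 1.
API_INDEX = {"placebo": 0, "ASA": 1, "AAP": 2, "MFM": 3}
PLACEBO_INDEX = API_INDEX["placebo"]
EMBED_DIM, MAX_NORM = 4, 1.5


def init_embeddings(n_categories, dim, rng):
    """Normal initialization (mean 0, std 0.1) with the placebo fixed at the origin."""
    table = rng.normal(0.0, 0.1, size=(n_categories, dim))
    table[PLACEBO_INDEX] = 0.0
    return table


def apply_constraints(table):
    """Rescale rows whose norm exceeds MAX_NORM, then re-anchor the placebo row."""
    norms = np.linalg.norm(table, axis=1, keepdims=True)
    scale = np.minimum(1.0, MAX_NORM / np.maximum(norms, 1e-12))
    constrained = table * scale
    constrained[PLACEBO_INDEX] = 0.0
    return constrained


rng = np.random.default_rng(0)
table = init_embeddings(len(API_INDEX), EMBED_DIM, rng)
table[API_INDEX["MFM"]] = 5.0            # simulate an oversized update
table = apply_constraints(table)
asa_vector = table[API_INDEX["ASA"]]      # embedding lookup by integer index
```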

2.4.3. Feature-wise linear modulation

The flattened embeddings are subsequently used to parameterize a feature-wise linear modulation (FiLM) operation (Perez et al., 2018). In the FiLM block, two dense layers without bias predict scaling factors ( $γ$ ) and shifts ( $β$ ) using the linear activation function. An orthogonal column regularizer (Brock et al., 2017) is applied to the $γ$ -predicting layer to enforce orthogonality of its weight vectors, thereby enhancing the diversity of material embeddings. The weight of the regularizer is 10⁻⁷. The FiLM block modulates the raw process features directly. To ensure that the features pass through the FiLM layer unchanged at the start of training, the multiplicative factor within the FiLM layer was implemented as $1 + γ$ .
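The modulation itself reduces to a few lines. The NumPy sketch below uses random stand-in weights and hypothetical shapes; it only illustrates the arithmetic and leaves out the orthogonality regularizer.

```python
import numpy as np


def film(features, embedding, w_gamma, w_beta):
    """FiLM with bias-free linear layers: (1 + gamma) * features + beta."""
    gamma = embedding @ w_gamma   # scaling factors predicted from the embedding
    beta = embedding @ w_beta     # shifts predicted from the embedding
    return (1.0 + gamma) * features + beta


rng = np.random.default_rng(0)
n_features, embed_dim = 14, 4                     # hypothetical sizes
w_gamma = rng.normal(0.0, 0.05, size=(embed_dim, n_features))
w_beta = rng.normal(0.0, 0.05, size=(embed_dim, n_features))

features = rng.uniform(-1.0, 1.0, size=(2, n_features))
api_embedding = rng.normal(0.0, 0.1, size=(2, embed_dim))
placebo_embedding = np.zeros((2, embed_dim))      # placebo anchored at the origin

modulated_api = film(features, api_embedding, w_gamma, w_beta)
modulated_placebo = film(features, placebo_embedding, w_gamma, w_beta)
```

Because both layers lack a bias term, an embedding at the origin gives $γ = β = 0$ in this sketch, so the placebo features pass through unmodulated.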

The FiLM-modulated features are subsequently processed by a dense layer with 32 hidden units, LeakyReLU activation (Maas, 2013; Xu et al., 2015) ( $α = 0.01$ ), no bias term, and L2 regularization with a weight of 10⁻⁶. To facilitate information flow from the embedding branch, the flattened embeddings are also processed in parallel through a dense layer with similar settings but only 8 hidden units. The outputs of this embedding branch and the FiLM-modulated branch are concatenated, yielding the final trunk representation of the network.

2.4.4. Output heads

The distinct prediction branches emerge as separate heads from a shared base trunk of the network architecture, following a common multitask learning design (Caruana, 1997). Tablet density and tensile strength are predicted via empirically-guided output functions (Willard et al., 2020). Specifically, the rearranged Sun compressibility equation (Sun, 2004; Tran and Klinken-Uth, 2025) (Equation II) and the tabletability equation (Vreeman and Sun, 2022) (Equation I) are employed.

$$\hat{ρ}_{tab} = \left| ρ_{sun} \left( ε_{c} \, W\left(- e^{\frac{- C P - ε_{c}}{ε_{c}}}\right) + 1 \right) \right| \tag{2}$$

where $W$ is the Lambert W function, as in Equation I.

The coefficients for these functions are parameterized by dedicated dense layers without regularization. For the tabletability function, the three coefficients are predicted using a dense layer with a Softplus activation function. The compressibility function coefficients are obtained through two dense layers: one predicting $C$ and $ρ_{sun}$ using a Softplus activation function, and another predicting $ε_{c}$ via a Sigmoid activation function, reflecting its restricted range between 0 and 1. In all three dense layers, the bias term is omitted to enforce positive outputs for the coefficients of the empirically-guided functions. To ensure numerical stability and consistency across parameters of different scales, the predicted coefficients are scaled by predefined multiplicative constants: $σ_{\max}$: 10, $α$: 10, $β$: 10², $ρ_{sun}$: 1, $C$: 10⁻³, and $ε_{c}$: 1. The selection of the scaling values was based on regression results obtained from previously published datasets (De Bisshop and Klinken, 2023; Tran and Klinken-Uth, 2025). The scaling strategy maintains raw coefficients within comparable numerical ranges and facilitates stable gradient propagation during backpropagation. The unscaled compression pressure is then used as input to the Sun equations to predict tensile strength and tablet density.

The ejection force is predicted using two sequential dense layers with 16 and 8 nodes, employing a LeakyReLU activation function ( $α = 0.01$ ) with bias terms and without regularization. The input to the first layer is a concatenation of the second FiLM-modulated layer and the scaled compression pressure. The final prediction is produced by a dense layer with a Softplus activation function and no bias term. To maintain output magnitudes within a moderate range and stabilize training, the Softplus output is scaled by a fixed constant of 10³. The prediction of the dosing height is performed using an unregularized dense layer with 8 output units, a bias term, and a LeakyReLU activation function ( $α = 0.01$ ). This layer receives only the FiLM-modulated data as input, as compression pressure is not considered relevant for this target. The output is then passed to a single Softplus layer without a bias term, which produces the final prediction. The final output is scaled by a fixed factor of 10. The scaling of dosing height and ejection force was implemented to stabilize the gradients within the network.

2.4.5. Training

The final network comprised 2036 trainable parameters and was implemented in TensorFlow (v2.13.0) and Keras (v2.13.1). Owing to its small size, the network trains considerably faster compared with larger models. To ensure equal weighting of the four target variables in the loss function, a customized mean squared error (MSE) loss was defined. Prior to loss computation, targets $y$ and predictions $\hat{y}$ were scaled to the [0,1] range using min–max normalization based on their respective ranges of $y$ within the database. This normalization balances the residual contributions to the total loss. Additionally, the loss function includes a masking mechanism that excludes missing response values from the MSE calculation. Instead of averaging the loss over the total number of samples in a batch, it is averaged only over valid (non-missing) entries, thereby preventing distortion from sparsely populated batches. Although no missing values occurred in this dataset, the implemented masking allows handling of incomplete response data in future studies.
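The loss described above can be sketched in NumPy as follows. Encoding missing responses as NaN is an assumption of this sketch, and the example ranges and values are made up.

```python
import numpy as np


def masked_scaled_mse(y_true, y_pred, y_min, y_max):
    """MSE on min-max scaled targets, averaged over non-missing entries only."""
    span = y_max - y_min
    scaled_true = (y_true - y_min) / span
    scaled_pred = (y_pred - y_min) / span
    mask = ~np.isnan(y_true)                       # NaN marks a missing response (assumed)
    squared = np.where(mask, (scaled_true - scaled_pred) ** 2, 0.0)
    n_valid = mask.sum()
    return squared.sum() / n_valid if n_valid else 0.0


# Two samples, two targets; the second target of the first sample is missing.
y_true = np.array([[1.0, np.nan], [3.0, 10.0]])
y_pred = np.array([[1.0, 5.0], [2.0, 10.0]])
y_min = np.array([0.0, 0.0])
y_max = np.array([4.0, 20.0])

loss = masked_scaled_mse(y_true, y_pred, y_min, y_max)
```

Only the first target of the second sample contributes an error here, (1/4)² = 0.0625, which is divided by the three valid entries rather than by all four.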

Training was performed using the AdamW (Loshchilov and Hutter, 2019) optimizer with a weight decay of 10⁻⁴ and an initial learning rate of 10⁻². The learning rate was scheduled according to a cosine annealing (Loshchilov and Hutter, 2017) scheme with linear warmup over the first 10 epochs, gradually decaying from 10⁻² to a minimum of 10⁻³ across the training horizon. A maximum of 10³ epochs was allowed, with early stopping based on the validation loss using a patience of 10² epochs and a minimum improvement threshold of 10⁻⁵. In case of early stopping, the model weights corresponding to the best validation performance were restored. Training was carried out with a batch size of 64.
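One possible form of such a schedule is sketched below. The warmup starting value and the exact shape of the decay are assumptions, since only the warmup length and the upper and lower learning rates are stated.

```python
import math


def learning_rate(epoch, total_epochs=1000, warmup_epochs=10, lr_max=1e-2, lr_min=1e-3):
    """Linear warmup to lr_max, followed by cosine decay to lr_min at the last epoch."""
    if epoch < warmup_epochs:
        return lr_max * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - 1 - warmup_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(epoch) for epoch in range(1000)]
```

With early stopping, training typically ends before the final epoch, so the minimum learning rate is only reached in runs that use the full training horizon.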

2.4.6. Ensemble

To enhance predictive performance and model robustness, a deep ensemble (Lakshminarayanan et al., 2017) comprising 15 independently trained models was implemented using a fixed train-test split. Each ensemble member employed an additional internal train-validation split to promote generalization. Each ensemble member was subsequently trained on its respective sample and evaluated against the corresponding validation set. Unless otherwise stated, model weights were initialized using the default layer initializations provided by TensorFlow and Keras. Ensemble predictions were aggregated as the median across all valid individual model outputs, excluding any NaN values. This ensemble strategy not only enhances predictive accuracy and reliability of uncertainty estimates but also enables parallelized training, thereby substantially reducing computational burden. Moreover, by establishing a robust and generalizable modeling framework, the use of deep ensembles provides a suitable basis for integration into automated experimental design strategies, such as Bayesian optimization (Lim et al., 2021). On average, training a single network for 10³ epochs required approximately 14 s on a 12th Gen Intel® Core™ i9-12900K CPU (3.20 GHz).
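The aggregation step can be illustrated with a short sketch in plain Python. The three-member ensemble and its values are made up for brevity.

```python
import math
from statistics import median


def ensemble_median(member_predictions):
    """Median across ensemble members for each sample, ignoring NaN outputs."""
    aggregated = []
    for values in zip(*member_predictions):
        valid = [v for v in values if not math.isnan(v)]
        aggregated.append(median(valid) if valid else math.nan)
    return aggregated


# Rows are ensemble members, columns are samples (illustrative values).
member_predictions = [
    [1.0, 2.0],
    [3.0, math.nan],   # this member produced an invalid output for the second sample
    [2.0, 4.0],
]
prediction = ensemble_median(member_predictions)
```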

2.5. Architecture design and hyperparameter optimization

The model architecture was developed in parallel with the construction of the database. Model architecture and hyperparameters were adjusted manually using internal train–validation splits; no automated optimization was applied. Once the final model architecture and hyperparameters were established, they remained unchanged throughout the remainder of the study. Subsequent to this, data were collected for Efavirenz, Lopinavir, and a reproduced batch of Ibuprofen 50. Importantly, the final database and the definitive correlations between the APIs were not part of the model development phase.

2.6. Evaluation of the word embedding vectors

To evaluate whether the learned embeddings capture meaningful material relations, the trained embedding matrices were extracted and aligned for comparison. If n_API denotes the number of encoded APIs and n_ed the dimensionality of the embedding space, this yields n_API embedding vectors in an n_ed-dimensional space. The relational information between individual APIs is encoded in the geometric arrangement of these vectors, including their relative orientations and inter-vector distances. When multiple networks are trained independently, the resulting embedding spaces may be arbitrarily rotated, reflected, translated, and/or scaled with respect to one another. Such invariances complicate direct comparison of embeddings across networks. Therefore, to assess the reproducibility of the embeddings or to compute an average embedding across a deep ensemble, the embeddings must be aligned to a common coordinate system.

2.6.1. Procrustes analysis

The required transformation depends on the type of comparison to be performed. If the objective is to evaluate the reproducibility of the exact positions of each embedding vector, an orthogonal Procrustes analysis can be employed to correct for rotations and reflections without altering scale. In contrast, when, as in the present case, the absolute positions are of lesser relevance and the primary interest lies in the relative relationships between embedding vectors within a single model run, a scaled orthogonal Procrustes approach is more appropriate, as it also incorporates centering and rescaling prior to rotation.

It must be noted that n_ed should not exceed n_API, since otherwise free rotations of the embedding vectors could map them to arbitrary positions, rendering the alignment meaningless. At the same time, the special role of the placebo in the dataset requires particular consideration. The placebo serves as a natural anchor point in the embedding space, representing the absence of an API. To ensure interpretability across models, the alignment was therefore constrained to keep the placebo vector at the origin, and all Procrustes analyses were conducted under this constraint.

For each run $k$, the reference embedding matrix $E^{ref} \in ℝ^{n \times d}$ and the target embedding $E^{(k)} \in ℝ^{n \times d}$ (with $n$ = n_API and $d$ = n_ed) were translated such that the placebo vector was located at zero. Alignment was then achieved by solving the orthogonal Procrustes problem:

$$A = (X^{(k)})^{T} Y = U S V^{T}, \qquad R = U V^{T} \tag{3}$$

where $Y$ and $X^{(k)}$ denote the translated reference and target embeddings, respectively, and $R$ is the optimal orthogonal transformation derived from the singular value decomposition of $A$. If reflections were disallowed, $R$ was adjusted to enforce $\det (R) = 1$. When scaling was applied, a uniform scalar $s$ was estimated to minimize the Frobenius norm of the difference between the aligned and reference embeddings:

$$s = \frac{\operatorname{tr}\left((X^{(k)} R)^{T} Y\right)}{\|X^{(k)}\|_{F}^{2}} \tag{4}$$

The aligned embedding was then obtained as $Z^{(k)} = s X^{(k)} R$. This procedure ensured comparability between embedding spaces while preserving the fixed origin of the placebo.

In each deep ensemble, the embedding generated by the first network served as the reference. All subsequent embeddings within the ensemble were aligned to this reference representation. The ensemble embedding was obtained by taking the element-wise median across all aligned embeddings. A comparable strategy was employed for inter-model comparisons: the median embedding obtained from the first deep ensemble was designated as the reference, and all other embeddings were aligned accordingly.
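The alignment of Eqs. 3 and 4 and the subsequent median aggregation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `align_to_reference` and `ensemble_embedding`, the argument `placebo_idx`, and the way a reflection is removed (flipping the last left singular vector) are choices made here for illustration only.

```python
import numpy as np

def align_to_reference(E_ref, E_k, placebo_idx=0, allow_reflection=True, scale=True):
    """Align E_k to E_ref by (scaled) orthogonal Procrustes, placebo pinned at the origin."""
    # Translate both embeddings so that the placebo vector sits at zero.
    Y = E_ref - E_ref[placebo_idx]
    X = E_k - E_k[placebo_idx]
    # Eq. 3: the SVD of X^T Y yields the optimal orthogonal map R = U V^T.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    if not allow_reflection and np.linalg.det(U @ Vt) < 0:
        # Assumed convention: flip the last singular direction to enforce det(R) = +1.
        U[:, -1] *= -1
    R = U @ Vt
    # Eq. 4: uniform scale that minimises ||s X R - Y||_F.
    s = 1.0
    if scale:
        s = np.trace((X @ R).T @ Y) / np.linalg.norm(X, "fro") ** 2
    return s * (X @ R)

def ensemble_embedding(embeddings, placebo_idx=0):
    """Element-wise median of all embeddings after alignment to the first one."""
    ref = embeddings[0]
    aligned = [ref - ref[placebo_idx]]
    aligned += [align_to_reference(ref, E, placebo_idx) for E in embeddings[1:]]
    return np.median(np.stack(aligned), axis=0)
```

Because only rotation, reflection, and a single scale factor are fitted after pinning the placebo, the relative geometry of the API vectors within each run is left unchanged by the alignment.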

However, if the embedding space is insufficiently regularized, the alignment procedure may introduce correspondences that do not reflect a systematically consistent semantic representation. To check whether alignment preserves meaningful relationships rather than introducing spurious correspondences, we compared trained models with a control condition where API labels were randomly permuted. This allows us to distinguish true semantic structure from artifacts of alignment.

2.6.2. Decomposition

To investigate the distribution and reproducibility of the Procrustes-aligned embeddings, PCA was performed using the decomposition.PCA class from Scikit-Learn (v1.7.1). Prior to the PCA, the input was MinMax-scaled to the range [−1, 1]. As the alignment of the embedding vectors could not be sufficiently captured by a projection onto the first two principal components, Uniform Manifold Approximation and Projection (UMAP) (McInnes and Healy, 2018) was employed to generate a more informative two-dimensional embedding. UMAP has previously been demonstrated to be effective in the context of pharmaceutical data mapping (Ramahi et al., 2024). The UMAP implementation from the umap-learn Python package (v0.5.9.post2) was applied to input scaled in the same way as for the PCA. Based on the pronounced separation observed in preliminary analyses, parameters were selected to balance local neighborhood preservation with global structural representation. Specifically, the number of nearest neighbors was set to 10, the minimum inter-point distance to 10⁻¹, the spread parameter to 1, and the repulsion strength to 1, promoting moderate spatial dispersion of clusters without excessive fragmentation. The cosine metric was used as the distance measure, the embedding dimensionality was fixed to two components, and the random state was set to 113 to ensure reproducibility. All other parameters were kept at their default values.
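The settings described above translate into the following sketch. The array `Z` is random placeholder data standing in for the aligned embedding vectors, and the helper `umap_projection` is a hypothetical wrapper that imports umap-learn only when called; only the parameter values are taken from the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Placeholder input: rows stand in for aligned embedding vectors (hypothetical values).
rng = np.random.default_rng(113)
Z = rng.normal(size=(60, 8))

# MinMax scaling of every embedding dimension to [-1, 1].
Z_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(Z)

# Projection onto the first two principal components and their explained variance.
pca = PCA(n_components=2)
scores = pca.fit_transform(Z_scaled)
explained = pca.explained_variance_ratio_.sum()

# UMAP settings quoted in the text; all other parameters stay at the library defaults.
umap_kwargs = dict(
    n_neighbors=10,
    min_dist=0.1,
    spread=1.0,
    repulsion_strength=1.0,
    metric="cosine",
    n_components=2,
    random_state=113,
)

def umap_projection(X):
    """Two-dimensional UMAP embedding of X (requires the umap-learn package)."""
    import umap  # imported lazily so the PCA part runs without umap-learn
    return umap.UMAP(**umap_kwargs).fit_transform(X)
```

Checking `explained` before interpreting a score plot indicates how much structure a two-component PCA view can show at all, which is the motivation for the additional UMAP projection.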

2.7. Cross-validation based evaluation of the network performance

To establish widely accepted white-box benchmark models for evaluating the performance of the neural network, MLR models were developed. The data corresponding to each API were treated separately, resulting in individual MLR models for each compound. As input features, linear terms, quadratic terms, and linear interaction terms of the formulation and process parameters were used after scaling all factors to a range between −1 and 1. Regression of the MLR models was conducted using the least-squares solver linalg.lstsq implemented in NumPy (v1.26.4).
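A minimal sketch of such a benchmark model is given below. The helper names (`scale_to_unit_range`, `design_matrix`, `fit_mlr`) and the inclusion of an intercept column are assumptions made for illustration; the backward elimination step is not shown.

```python
import numpy as np
from itertools import combinations

def scale_to_unit_range(X):
    """Scale every factor (column) linearly to the range [-1, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0

def design_matrix(Xs):
    """Columns: intercept (assumed), linear, quadratic, and two-factor interaction terms."""
    n, p = Xs.shape
    cols = [np.ones(n)]
    cols += [Xs[:, j] for j in range(p)]
    cols += [Xs[:, j] ** 2 for j in range(p)]
    cols += [Xs[:, i] * Xs[:, j] for i, j in combinations(range(p), 2)]
    return np.column_stack(cols)

def fit_mlr(X, y):
    """Ordinary least-squares fit of the full quadratic model via np.linalg.lstsq."""
    A = design_matrix(scale_to_unit_range(X))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef
```

For p factors this full model has 1 + 2p + p(p − 1)/2 coefficients, which illustrates why term selection matters when only a few dozen averaged data points are available per API.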

To enable a robust comparison between the MLR approach and the neural network, Monte Carlo cross-validation without replacement was performed on the dataset. Specifically, the data were randomly split 10² times into training, validation, and test sets, with 81 % of the data allocated to training, 9 % to validation, and 10 % to testing. To avoid data leakage, test sets contained only original datapoints and excluded samples generated by the data augmentation strategy described in Section 2.4. The seeds for the random splits were set as the integer values from 1 to 10². For each split, a deep ensemble was trained. The distribution of predictive performance across these models was then analyzed. To assess the impact of the word embedding on the predictive performance of the network, the same procedure was repeated using identical random seeds after permuting the assignment of each training datapoint to its corresponding API. This strategy preserved the network architecture while effectively eliminating any information content from the embedding, thereby enabling a controlled comparison of networks trained with and without informative embeddings. A comparable procedure was applied to the MLR models. For each API, the corresponding data were also randomly partitioned 10² times into training, validation, and test sets using the same proportions as for the neural networks. On each validation set, an MLR model was optimized via backward elimination, minimizing the MSE to ensure comparability with the neural network training. A separate MLR model was generated for each response variable. The predictive performance on the respective test sets was then compared directly to that of the neural networks.
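The splitting scheme and the label-permutation control can be sketched as follows. The function names and the use of NumPy's `default_rng` are assumptions, since the text specifies only the seeds (1 to 100) and the proportions; the sketch operates on original datapoints only and does not show how augmented samples are added to the training set.

```python
import numpy as np

def monte_carlo_split(n_samples, seed, frac_test=0.10, frac_val=0.09):
    """One random 81/9/10 split into train, validation, and test indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = round(frac_test * n_samples)
    n_val = round(frac_val * n_samples)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

def permute_labels(api_labels, seed):
    """Control condition: shuffle the API label attached to each training datapoint."""
    rng = np.random.default_rng(seed)
    return rng.permutation(api_labels)

# One split per seed from 1 to 100, here for 336 averaged datapoints.
splits = [monte_carlo_split(336, seed) for seed in range(1, 101)]
```

Reusing the same seed for the correctly labeled and the permuted run keeps the data partition identical, so differences in test error can be attributed to the information carried by the API labels.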

2.8. Investigation of the number needed for information

One of the proposed key advantages of the neural network approach with word embeddings lies in its ability to predict combinations of material factors for substances with few or no experimental data points included in the training set. This allows the generation of meaningful expectations even in the absence of direct measurements. Once a single data point for a new API has been measured, it can be incorporated into the dataset, thereby improving prediction accuracy for further combinations. This capability results from the knowledge transfer facilitated by the embedded representations.

To evaluate this approach, the following procedure was applied. For each experiment, the data of one API were withheld, and a deep ensemble was trained for 10³ epochs on the remaining APIs using a validation split of 10 %. Subsequently, copies of the pretrained ensemble were fine-tuned for 10² epochs with a defined fraction of the withheld API's data. For the assignment to training and test sets, the data obtained at the three compression pressures of a given blend were treated as a single unit. The rationale is that compressing an already prepared blend at different pressures requires comparatively little additional experimental effort. In contrast, the major experimental workload and the increased API consumption result from the preparation of multiple distinct blends. Consequently, for each API, 11 possible blend settings were available, each of which could be assigned either to the training or to the test set. Within the training and test sets, however, the three measurements of a blend entered as individual data points.

The fraction of included data points was systematically increased up to a maximum of 7 points. 10 % of the respective API's data were reserved as validation set. This adaptive learning strategy was intended to increase the relative importance of the isolated API during training. Model performance was then assessed by predicting three randomly selected data points of the isolated API that had not been part of the training set. The test set was kept constant across the different fractions to ensure comparability. The entire train/validation/test procedure was repeated multiple times to obtain a robust distribution of performance estimates. Between repetitions, the test sets were randomly alternated. Finally, the relationship between predictive accuracy and the number of isolated API data points incorporated into the training set was systematically analyzed. As an additional benchmark, model performance was evaluated on a network that had never been trained with the withheld API, providing a measure of generalization.

In the analysis of the results, two aspects were considered. First, the change in predictive accuracy for the test data was quantified as RMSE with increasing numbers of training data points from the respective API. Second, the influence of the API content in the mixture on the marginal contribution of additional data points was investigated. For this purpose, Shapley values of the data points were estimated via Monte Carlo simulation. Since the number of possible combinations of all test data points for a given API increases factorially (Eq. 5), an exhaustive brute-force evaluation was computationally infeasible.

$$n_{combinations} = \frac{11!}{(11 - 7)!} = 1\,663\,200 \tag{5}$$
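Eq. 5 counts the ordered selections of 7 out of 11 blends, which can be verified directly with the standard library (the variable names are illustrative):

```python
import math

n_blends, n_train = 11, 7

# Number of ordered selections of n_train blends out of n_blends (Eq. 5).
n_ordered = math.perm(n_blends, n_train)

# Same quantity written out as 11! / (11 - 7)!.
n_check = math.factorial(n_blends) // math.factorial(n_blends - n_train)
```

Each ordered selection would require its own sequence of fine-tuning runs, which makes clear why a sampling-based estimate is used instead of full enumeration.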

Instead, 200 random subsets were sampled per API, and the marginal contributions of each point were approximated by the resulting change in root mean-squared error ($ΔRMSE$). Three major statistical error sources must be considered in this procedure.

First, the very first training point of the withheld API is expected to exert the strongest influence on prediction performance, with contributions diminishing for subsequent points. Consequently, the frequency with which a given point appears in early training positions strongly affects the simulation outcome. Similar considerations apply, albeit to a lesser extent, to later positions in the sequence. Thus, the number of Monte Carlo simulations is critical to ensure that the probability of each point occurring in each position converges toward a uniform distribution.

Second, the selection of the test set was randomized in every training iteration. This reduces systematic bias, since final results represent aggregates over many distinct test sets. However, the differences between test sets introduce additional variability, which can be mitigated by increasing the total number of simulations.

Third, the training outcome of each individual neural network is inherently stochastic. This effect is also averaged out when a sufficiently large number of simulation runs is conducted.

Among these three sources of variability, the first is likely the most critical. To mitigate its influence, the number of simulation iterations was chosen to be sufficiently high to ensure reliable convergence of the estimates. It was assumed that each data point should appear approximately 100 times in the training data to ensure a near-uniform distribution across all possible positions. Given that each data point has a probability of $7 / 11$ of being included in the training set in a single simulation run, the total number of iterations for one API was therefore targeted at about $100 \times (11 / 7) \approx 160$, ensuring sufficient coverage of all positions.
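The sampling scheme can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `rmse_fn` is a hypothetical callback that would wrap the fine-tuning and evaluation of the pretrained ensemble, `toy_rmse` is a stand-in used only to make the sketch runnable, and the sign convention (a positive value denotes a reduction in RMSE) is chosen here for readability.

```python
import numpy as np

def monte_carlo_delta_rmse(rmse_fn, n_points=11, n_train=7, n_test=3, n_iter=160, seed=0):
    """Fill D[i, m, t] with the RMSE reduction when point i enters at position m in run t."""
    rng = np.random.default_rng(seed)
    D = np.full((n_points, n_train, n_iter), np.nan)  # NaN marks unobserved combinations
    for t in range(n_iter):
        order = rng.permutation(n_points)
        test = order[:n_test]                         # test points are redrawn in every run
        train_order = order[n_test:n_test + n_train]  # random inclusion order
        previous = rmse_fn([], test)                  # benchmark without data of the withheld API
        for m, i in enumerate(train_order):
            current = rmse_fn(train_order[:m + 1], test)
            D[i, m, t] = previous - current           # assumed sign: positive = error reduced
            previous = current
    return D

def toy_rmse(train_points, test_points):
    """Hypothetical stand-in for fine-tuning plus evaluation: error shrinks with more data."""
    return 1.0 / (1 + len(train_points))

D = monte_carlo_delta_rmse(toy_rmse)
```

In every run a point occupies at most one inclusion position, so most entries of `D` stay unobserved. This is why the number of runs has to be large enough for each point to visit each position several times.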

For each data point $i$ of the respective API and its possible inclusion position $m \in \{1, \dots, 7\}$, the resulting $ΔRMSE$ values from all repetitions $t$ were stored in a tensor $D_{i, m, t}$. To obtain an order-agnostic estimate of the information value, we averaged the $ΔRMSE$ across all inclusion positions, considering only observed positions. To quantify uncertainty, we used a Bayesian bootstrap implemented via Gamma(1,1) resampling weights applied to the iteration axis of $D_{i, m, t}$. For each point, 2000 bootstrap replicates were generated, from which the arithmetic mean and the 2.5th and 97.5th percentiles were computed. These intervals represent the variability of the estimated information gain over all random test-set and training-order configurations, marginalizing out order effects.
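A possible reading of this procedure for a single data point is sketched below. The text does not specify how the position average and the resampling weights are combined, so the order of operations used here (weighted mean per position, then an unweighted mean over the observed positions) is an assumption, as is the function name `bayesian_bootstrap_information`.

```python
import numpy as np

def bayesian_bootstrap_information(D_i, n_boot=2000, seed=0):
    """D_i: array (n_positions, n_iter) of delta-RMSE values for one point, NaN if unobserved."""
    rng = np.random.default_rng(seed)
    observed = np.isfinite(D_i)
    values = np.where(observed, D_i, 0.0)
    # Gamma(1, 1) weights on the iteration axis; after normalisation they are Dirichlet weights.
    w = rng.gamma(shape=1.0, scale=1.0, size=(n_boot, D_i.shape[1]))
    weighted_sum = w @ values.T                     # (n_boot, n_positions)
    weight_mass = w @ observed.T.astype(float)      # weights of the observed iterations only
    has_obs = observed.any(axis=1)                  # skip positions that were never observed
    position_means = weighted_sum[:, has_obs] / weight_mass[:, has_obs]
    estimates = position_means.mean(axis=1)         # order-agnostic average over positions
    lower, upper = np.percentile(estimates, [2.5, 97.5])
    return estimates.mean(), lower, upper
```

Unlike a classical bootstrap, the continuous weights never drop an observed iteration entirely, which is convenient when some positions have been visited only a handful of times.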

3. Results and discussion

3.1. Properties of the generated database

The curated dataset comprises measurements from 3355 individual tablets. These correspond to 336 unique combinations of factor levels, for most of which measurements from 10 tablets are available. For most APIs investigated, 33 averaged data points were obtained, while EV showed the lowest data density with only 23 data points (IQR = 4). The maximum API content in the blends was approximately 0.80 for ASA and DCPA_A12, 0.56 for EV, and around 0.65 for the other APIs. The minimum content ranged from 0.10 to 0.16, depending on the blend system. The distributions of the resulting tablet masses as well as the four target variables are shown in Fig. 2 a). Overall, the values are largely evenly distributed across the subsets of the different APIs. As expected, the density distribution of DCPA_A12-containing tablets is shifted considerably upward compared to the other formulations. AAP shows the highest ejection forces within the data. DCPA_A12 and IBU_G show slight deviations in dosing height, which can be attributed to their higher solid densities. For DCPA_A12 this likely reflects the effect of its high intrinsic crystal density, whereas for the pre-granulated IBU_G the effect is related to an increased granule packing density, as indicated by the absence of a corresponding shift in tablet density. Tablet masses fall within 200 to 300 mg. ASA shows slightly lower masses, likely due to a manufacturing calibration drift.

Fig. 2 b) depicts the tabletability profiles separated by API content. For each case, the curves of the blend with the highest and the lowest API concentration are shown. The individual data points represent single-tablet values derived from the assignment procedure described in Section 2.3. All investigated blends exhibit tabletability profiles consistent with expected trends. The database contains APIs that either reduce or increase the tensile strength at a given compaction pressure when present at higher concentrations. Since all blends are based on the same fundamental formulation, the differences in tabletability profiles are less pronounced than would be expected in a comparison of pure materials.

3.2. Training history

A representative training history in Fig. 3 indicates stable convergence of the deep ensemble model. As expected from the chosen cosine learning rate schedule, the RMSE decreases sharply during the initial epochs. Despite the learning rate decay, the learning rate remains relatively high at 10⁻³. The RMSE curves of each individual submodel within the ensemble exhibit an almost monotonic decline for both training and validation data. There are no indications of divergence or oscillation, highlighting the stability of the training process. The minimum validation RMSE is reached after approximately 200–500 epochs. In all iterations, training is terminated early by the early-stopping mechanism, well before reaching the maximum number of epochs (10³). Importantly, none of the submodels show abrupt increases in validation error at later epochs, further supporting the overall stability of the training process. Nevertheless, the application of early stopping remains advisable for practical reasons of computational efficiency. The observed step change in the median RMSE of the ensemble can be attributed to the exclusion of submodels from the median calculation once they are terminated.

3.3. Reproducibility of word embeddings

Fig. 4 a) and b) present the results of the PCA of the embedding vectors following Procrustes alignment. Subplot a) shows the score plot of the embeddings obtained with correct assignment of the API names to the corresponding experiments, whereas subplot b) depicts the embeddings after permutation of the API names. Embedding vectors cluster distinctly with correct labels, whereas clustering is absent after label permutation. This supports the interpretation that the embeddings capture material-specific structure that contributes to prediction. Moreover, the pronounced difference between a) and b) confirms that the applied Procrustes analysis provides sufficient regularization of rotations and scalings, thereby preventing arbitrary alignments of the embedding vectors.

Because the first two PCA components explain only about 75 % of the variance, PCA underrepresents cluster separation. For instance, the overlap between LPV and EV clusters is clearly visible in a). Subplots c) and d) show the UMAP representations of the embeddings, analogous to a) and b). Here, the superior cluster separation in c) becomes apparent. In addition, UMAP provides a more faithful depiction of the global cluster structure compared to PCA. The embeddings of DCPA_A12 and IBU_G appear in close spatial proximity, consistent with both exhibiting an increased granule mass for different reasons, as discussed above. Further, distinct clusters are observed for AAP, MFM and IBU_{P rep.} as well as for IBU_P and ASA and lastly for LPV, EV, and SoBe. Importantly, all API embeddings remain clearly distinguishable from placebo.

The two datasets of IBU_P and IBU_{P rep.} can be clearly distinguished, yet they remain in close spatial proximity within the clustering, highlighting their overall similarity. However, such a pronounced divergence was not anticipated. Given that the stability of the embeddings is evident from the distinct clusters of the different cross-validation models, the observed differences between IBU_P and IBU_{P rep.} are unlikely to originate from the model itself. Fig. 5 depicts the tabletability profiles of the three datapoints located closest in terms of Euclidean distance to IBU_P and IBU_{P rep.}, analogous to Fig. 2 b). Due to the differences in the respective design spaces resulting from the API-specific Halton design, identical points cannot be compared directly. Therefore, datapoints with minimal Euclidean distance provide the most appropriate basis for comparison between the datasets of the two APIs.
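Selecting the closest formulation pairs between two design spaces can be sketched as follows. The function `closest_pairs` and the assumption that each row holds comparably scaled formulation factors (for example API, binder, and disintegrant content) are illustrative and not taken from the study.

```python
import numpy as np

def closest_pairs(F_a, F_b, k=3):
    """Return the k index pairs (i, j) with the smallest Euclidean distance between rows."""
    # Pairwise distance matrix between all formulations of dataset A and dataset B.
    dist = np.linalg.norm(F_a[:, None, :] - F_b[None, :, :], axis=-1)
    flat = np.argsort(dist, axis=None)[:k]          # k smallest entries of the matrix
    pairs = np.column_stack(np.unravel_index(flat, dist.shape))
    return pairs, dist[pairs[:, 0], pairs[:, 1]]
```

If the factors span very different numerical ranges, scaling them before computing distances prevents a single factor from dominating the selection.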

Fig. 5 — Tabletability profiles of the three closest formulations in terms of Euclidean distance between IBU_P and IBU_{P rep.} The respective contents of API, binder, and disintegrant are provided in the plot legends.

It becomes apparent that IBU_{P rep.} systematically lies below IBU_P. This effect is most plausibly explained by differences in powder handling after lubricant addition. The difference is noteworthy because it means that the initial expectation that the two batches, IBU_P and IBU_{P rep.}, would exhibit a high degree of similarity must be revised. However, this deviation does not pose a limitation for the presented network. On the contrary, it highlights the ability of the embedding-based model to capture subtle batch-to-batch variations. Consequently, shifts in embeddings between different batches or lots of starting materials can be exploited to identify and interpret process changes or deviations in product CQAs. By generating a new embedding for each incoming batch of a material, a characteristic distribution of embeddings would gradually emerge across multiple batches. Monitoring these distributions would enable the detection of shifts indicative of alterations affecting at least one input or output parameter. The stability of the embeddings for individual materials is already demonstrated by the cross-validation results, where the embeddings of each API form distinct and compact clusters. This indicates that APIs are embedded in a comparable manner, even when different proportions of their data are included during training.

Future studies must demonstrate whether such networks are, in principle, capable of identifying the underlying causes of CQA variations in products. This could be achieved by examining which of the newly generated embeddings deviates from previously established embeddings of the same substance. However, such an investigation would require that all formulation components be embedded within the same representation space. To enable this, further studies and the development of databases encompassing a broader diversity of formulation compositions will be necessary.

3.4. Deep ensemble efficiency

Fig. 6 illustrates the evolution of the RMSE distributions as a function of the number of submodels included in the deep ensemble. Predictions made by a single model not only exhibit substantially higher variance but also tend to result in higher mean RMSE values compared to those generated by ensembles of increasing size. The integration of additional submodels systematically improves predictive performance, with average RMSE reductions between approximately 22 % (tensile strength) and 39 % (tablet density), depending on the response variable. A performance plateau is typically reached after incorporating around five to ten ensemble members, beyond which further improvements become marginal.

In addition to reducing the mean error, ensemble averaging decreases the variance of the RMSE across repeated data splits, indicating a pronounced increase in model robustness. This effect is particularly evident in the case of ejection force and dosing height, where both the mean and spread of the RMSE decline markedly with ensemble size. Since each violin plot summarizes performance across 100 different Monte Carlo cross validation train-test splits, the observed trends also demonstrate the enhanced generalizability of ensemble-based predictions across varying data partitions.

3.5. ANN vs. MLR comparison

Fig. 7 presents the performance of the neural network in the Monte Carlo cross-validation compared to classical MLR models on the test dataset. The predictive accuracy of the neural network is comparable to that of the MLR models trained specifically on material-dependent subsets. Particularly noteworthy is the benefit that the empirically-guided output layer brings to the neural network predictions, especially for values close to zero. MLR permits negative values for inherently non-negative properties, whereas the empirically-guided network constrains outputs to physically plausible ranges. This advantage is most evident in the predictions of tensile strength and ejection force.

The distributions of the R² and RMSE values of the models are summarized in Table 2. In addition, the figure includes the performance of a control network trained on permuted drug labels, which shows a substantial loss of predictive accuracy compared to the correctly labeled neural network, thereby confirming the relevance of the semantic information captured by the drug labels. When comparing the R² and RMSE distributions of the MLR models and the neural network, it becomes evident that the latter achieves superior predictive performance for both tensile strength and tablet density. The first quartiles of the neural network predictions fall within the range of the third quartiles of the MLR models, indicating a shift toward lower error.

Table 2.

Model performance expressed as the first, second, and third quartile of R² and RMSE for each response variable.

Model	R² ($σ$)	R² ($ρ$)	R² ($EF$)	R² ($DH$)	RMSE ($σ$) / MPa	RMSE ($ρ$) / g cm⁻³	RMSE ($EF$) / N	RMSE ($DH$) / mm
MLR	0.919, 0.958, 0.973	0.958, 0.977, 0.987	0.797, 0.926, 0.964	0.990, 0.996, 0.998	0.173, 0.216, 0.272	0.018, 0.023, 0.028	25.421, 31.217, 47.489	0.121, 0.167, 0.255
ANN	0.967, 0.975, 0.982	0.981, 0.986, 0.990	0.912, 0.930, 0.954	0.991, 0.995, 0.997	0.140, 0.168, 0.199	0.016, 0.018, 0.020	26.488, 30.484, 37.965	0.168, 0.203, 0.267
ANN_perm	0.647, 0.715, 0.779	0.582, 0.685, 0.746	0.454, 0.539, 0.598	0.527, 0.652, 0.746	0.485, 0.562, 0.614	0.071, 0.085, 0.098	74.372, 98.842, 117.466	1.506, 1.681, 2.025


Importantly, the RMSE values of the neural network across all four response variables are within ranges considered acceptable for pharmaceutical manufacturing. For tensile strength, where typical target values for tablet formulations lie between 1 and 2 MPa (Leane et al., 2015; Sun et al., 2009), the neural network achieved mean deviations of 0.15–0.20 MPa, corresponding to approximately 7.5–20 % of the target values. This error is lower than the standard deviation of tensile strength observed across many batches in the dataset, indicating sufficient predictive precision. Moreover, it is comparable to, or even lower than, the error levels reported for established models describing tablet tensile strength (Bounab et al., 2025; Khalid et al., 2015; Tait et al., 2024). The 99th percentile of the absolute residuals is 0.6 MPa, underscoring the uniformity and overall robustness of the model predictions. A similar conclusion can be drawn for tablet density: RMSE values of 0.016–0.020 g cm⁻³ correspond to prediction errors of 0.8–2 % relative to the measured densities of the tablets, which is likewise within the range reported for published models (Anuschek et al., 2023; Hayashi et al., 2019; Puckhaber et al., 2022). Assuming literature-reported correlations between disintegration time and variations in the solid fraction within the tablet (Bi et al., 1999; Quodbach and Kleinebudde, 2015), it can be inferred that density deviations within the prediction error (corresponding to equivalent deviations in solid fraction) are expected to exert only a negligible influence on disintegration time. For density, the 99th percentile of the absolute residuals is 0.05 g cm⁻³, which likewise lies within a range that can be considered acceptable for predictive modeling.

Ejection force, which for many formulations remains below 1 kN (Sun, 2015; Uzondu et al., 2018), was also predicted with acceptable accuracy, with a mean RMSE below 40 N and a 99th percentile of the absolute residuals of 140 N. For dosing height, deviations of 0.8–2.7 % were observed. Assuming a homogeneous bulk density during die filling, this deviation corresponds directly to systematic errors in the target tablet mass. According to Ph. Eur. 2.9.40 (European Pharmacopoeia, 2025), systematic deviations above 1.5 % are included in the content uniformity calculation. Consequently, a systematic deviation of 2.7 % would still allow the batch to comply with pharmacopoeial requirements at the first testing level, provided that the relative standard deviation, arising from mixture or die-filling inhomogeneities, remains below 5.75 % within the batch. It is worth noting that the 99th percentile of the absolute relative residuals for the dosing height is 8.0 %. Considering the limits defined in the pharmacopoeial monograph Uniformity of dosage units, such a deviation still yields an acceptance value of 13.7, below the threshold of 15, when the standard deviation of the individual contents is approximately 3 %. Such variability, or even smaller deviations, are within the range observed for commercial products (Belew et al., 2019; Vandy et al., 2024). Even under these conditions, the batch would comply with the pharmacopoeial requirements, further underscoring the adequacy of the model's predictive performance.
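The two compliance figures quoted above (a standard deviation limit of 5.75 % and an acceptance value of 13.7) can be reproduced with a short calculation. The sketch assumes the usual form of the acceptance value, AV = |M − X̄| + k·s, with k = 2.4 for ten units, a reference value M confined to 98.5–101.5 % of the label claim, and a limit of 15.0 at the first testing level. These constants are stated here as assumptions and should be checked against the current text of the general chapter; the function name is illustrative.

```python
def acceptance_value(mean_content, sd, k=2.4, lower=98.5, upper=101.5):
    """AV = |M - mean| + k * s, with the reference value M clipped to [lower, upper] (assumed)."""
    M = min(max(mean_content, lower), upper)
    return abs(M - mean_content) + k * sd

# Systematic deviation of 2.7 % combined with a standard deviation of 5.75 %.
av_systematic = acceptance_value(100.0 - 2.7, 5.75)   # (98.5 - 97.3) + 2.4 * 5.75

# Deviation of 8.0 % (99th percentile) combined with a standard deviation of 3 %.
av_extreme = acceptance_value(100.0 - 8.0, 3.0)       # (98.5 - 92.0) + 2.4 * 3.0
```

Under these assumptions the first case sits exactly at the limit of 15, and the second case gives 13.7, matching the values in the text.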

Taken together, these results demonstrate that the neural network not only outperforms material-specific MLR models in terms of predictive robustness but also achieves an accuracy that can be considered sufficient for practical application in pharmaceutical tablet development and manufacturing.

3.6. Effect of training point count on prediction error

Fig. 8 illustrates the prediction error, expressed as RMSE on the test dataset, as a function of the number of available material-specific training data points. As a benchmark, models without any training points of the respective API were used. The reduction in prediction error with increasing material-specific data availability is clearly dependent on both the material and the response variable. It must be noted, however, that the distributions of prediction errors do not allow for a distinction between contributions originating from the training sets and those arising from variations in the test sets. Consequently, it is not entirely evident whether the maximum or the median of these distributions should be considered the more reliable parameter. Nevertheless, since for almost all models the median prediction error obtained after inclusion of the maximum number of data points yields values comparable to those derived from cross-validation across the entire dataset, the median is regarded as a representative parameter.

Material-response combinations with high anticipated initial errors are of particular interest. DCPA_A12, characterized by its high particle and bulk density and producing the densest tablets within the dataset, exemplifies this behavior. In the absence of any DCPA_A12-specific training data, prediction errors for tablet density are markedly elevated. The inclusion of a single DCPA_A12 data point reduces the RMSE significantly by approximately 0.25 g cm⁻³. With further additions, the error decreases toward the level observed for other APIs but does not reach the median prediction error across APIs, remaining at approximately 0.1 g cm⁻³ (Table 2). For the prediction of tablet density in other APIs, reductions in error are far less pronounced. For example, the RMSE for SoBe decreases by only ∼0.08 g cm⁻³ to 0.17 g cm⁻³ after inclusion of a single data point. This is likely due to the relatively narrow and homogeneous distribution of tablet densities across most APIs, with little variation in their central tendency (Fig. 2 a).

In contrast, a greater number of APIs exhibit reductions in prediction error for tensile strength. Notable decreases are observed upon inclusion of the first one or two data points, particularly for DCPA_A12 (1.2 to 0.3 MPa), IBU_G (1.2 to 0.5 MPa), IBU_P (0.9 to 0.4 MPa), IBU_{P rep.} (1.3 to 0.6 MPa), MFM (1.2 to 0.6 MPa), and AAP (1.5 to 0.5 MPa). Beyond these first inclusions, prediction errors converge toward the median values reported in Table 2. Importantly, after as few as two included data points, the median prediction error across APIs reaches ∼0.5 MPa, which can be considered borderline to sufficiently accurate for pharmaceutical development purposes.

For ejection force, AAP stands out, as expected. As shown in Fig. 2, AAP exhibits the highest values of this parameter within the dataset. Consequently, prediction errors are particularly high in the absence of AAP-specific data or when only a small number of points are included. Even with increasing numbers of AAP data points, the errors remain elevated, reaching a minimum RMSE of ∼140 N. For other APIs, the errors decline as expected toward values close to the medians listed in Table 2.

The advantage of the model in achieving satisfactory predictions from only one or a few experiments becomes apparent from these results. It should be emphasized, however, that a more diverse and extensive database would likely enhance the transfer of information between materials, thereby improving the efficiency of data point utilization.

3.7. Influence of information shares on the predictability

The assessment of the information content of individual data points is enabled through the estimation of Shapley values using a Monte Carlo approach. These values represent an estimate of the average contribution of each data point to the model prediction. In the context of formulation development for new drug substances, the reduction of API consumption is of particular interest, including the API consumed in experiments for predictive modeling. For this reason, the evaluation of Shapley values as a function of the API content in the tablets is highly relevant. If the Shapley values of data points with high API concentrations are comparable to those with low concentrations, it becomes possible to estimate tablet properties while consuming less API. Accordingly, blends with reduced API content can be prepared and compressed, and the resulting data can already provide sufficient predictive insight into the behavior of tablets with higher API levels when combined with neural network models.

The results of this analysis are presented in Fig. 9. Distinct patterns emerge when comparing the Shapley values across different APIs and responses. The majority of data points exhibit positive Shapley estimates, indicating that these points contribute positively to the model's predictions, while the overall noise in their information content appears to be low.

As expected, APIs that fall largely within the range of values and correlations already well represented in the training data exhibit very low Shapley values over the full concentration range. This is exemplified by the comparison of Fig. 8 and Fig. 9, particularly in the case of tablet density predictions. For most materials, accurate predictions can be obtained with only a small number of data points, resulting in Shapley values close to zero. In contrast, DCPA_A12, EV and SoBe demonstrate markedly different behavior. Here, a trend toward higher information content on the tablet density at increasing API concentrations is observed.

For tensile strength, three distinct groups of behaviors can be identified. The first group, including DCPA_A12, AAP, and MFM, shows a pronounced increase in information content with higher API levels, suggesting that formulations at elevated concentrations provide additional unique information to the model. The second group, comprising EV, LNV, and SoBe, exhibits consistently low Shapley estimates across all concentrations, indicating that the inclusion of additional data points from these APIs does not further improve the model. This is consistent with the clustering observed in the embedding space, where these APIs form a tightly connected cluster. It can be inferred that incorporating data from two of these materials already provides sufficient information for the model to accurately predict the tensile strength behavior of the third. The third group (ASA, IBU_G, IBU_P, and IBU_{P rep.}) shows constant or even decreasing Shapley values with increasing API concentration. This finding is unexpected and suggests that, for certain APIs, experiments at lower concentration ranges may already capture the relevant relationships needed to describe the entire concentration range.

In the case of ejection force, AAP is particularly noteworthy. It exhibits the highest ejection force values in the dataset, and its data points consistently show comparatively high Shapley estimates without a discernible trend with respect to API concentration. However, substantial scatter in the values indicates variability in the information contribution of individual points. Expanding the dataset to include additional materials with similarly elevated ejection forces would likely provide valuable context and enhance the interpretability of the AAP results. Other materials, such as ASA, DCPA_A12, and to a lesser extent LNV, display slightly increased Shapley estimates at higher API concentrations with respect to ejection force, whereas no clear trend is observed for the remaining materials.

For the dosing height, EV stands out, displaying the most pronounced positive trend in Shapley values with increasing API content. Overall, the analysis reveals that the intuitive expectation, that higher API concentrations generally provide greater informational gain, is often valid. However, there are clear cases in which the API content seems not to substantially influence the information contribution of a data point. The example of EV, LNV, and SoBe in relation to tensile strength indicates that when similar materials are already represented in the dataset, the API content becomes less critical for model improvement. This finding implies that for neural networks trained on large and diverse datasets, the API content of new experimental blends might be reduced without significant loss of informational value, thereby enabling more resource-efficient experimentation.
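Whether the information content of a data point rises with API content can be checked per API and response with a rank correlation. The following sketch computes Spearman's coefficient in plain Python. The API contents and Shapley estimates are made-up illustrative numbers, not values from Fig. 9, and the function names are hypothetical.

```python
from statistics import fmean


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (undefined if either input is constant)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = fmean(rx), fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Illustrative values only: API content in % and Shapley estimates.
api_content = [5, 10, 20, 40, 60]
shapley_rising = [0.01, 0.02, 0.05, 0.08, 0.12]
shapley_falling = [0.09, 0.07, 0.04, 0.03, 0.01]

rho_rising = spearman(api_content, shapley_rising)
rho_falling = spearman(api_content, shapley_falling)
```

A coefficient near +1 would correspond to the behavior described for EV and dosing height, and a coefficient near zero or below to cases in which low-concentration blends are already as informative as high-concentration ones.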

3.8. Possible future application domains in an industrial context

The proposed modeling approach may contribute to the development of models that lend themselves to a broad range of industrial applications across the formulation development and pharmaceutical manufacturing pipeline. First, embedding-based predictive models could help accelerate formulation selection during early development. Because the networks can generalize across APIs through latent semantic relationships, information about correlations in manufacturing processes and material behavior may be transferred between different experiments, potentially reducing experimental effort. This could be particularly advantageous in discovery and preclinical phases, where API availability is limited and material costs are high. The demonstrated ability of low-API blends of specific APIs to contribute to model accuracy suggests that development teams might obtain predictive insights while consuming less material, thereby reducing both costs and environmental burden. By quantifying the informational value of individual data points via Shapley analysis, development teams may be able to prioritize experiments that meaningfully enhance predictive performance.

In process development, the model may provide rapid, data-efficient estimations of key tablet quality attributes and required process settings directly from formulation composition and compaction conditions. This could allow development scientists to explore broad design spaces prior to extensive experimentation and thereby guide the selection of suitable excipient ratios or compaction regimes. The empirically guided output heads might help ensure that predictions remain physically plausible, reducing the likelihood of exploring infeasible or unsafe operating regions. When combined with calibrated uncertainty estimates from deep ensembles, the framework could also be integrated into Bayesian optimization loops or adaptive DoE strategies (Sano et al., 2020), supporting automated data acquisition and more efficient exploration of complex formulation landscapes.
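How ensemble uncertainty could feed an optimization loop can be sketched as follows. The example is a simplification under stated assumptions: the ensemble members are hypothetical linear surrogates rather than trained networks, the uncertainty is taken as the spread between member predictions, and the upper-confidence-bound rule with its weight `kappa` is one common acquisition function, not a choice documented in this study.

```python
from statistics import fmean, pstdev

# Hypothetical ensemble: each member maps compression pressure (MPa)
# to a predicted tensile strength (MPa). Coefficients are made up.
coefficients = [(0.010, 0.20), (0.012, 0.10), (0.009, 0.30), (0.011, 0.15)]
ensemble = [lambda p, a=a, b=b: a * p + b for a, b in coefficients]


def predict(members, x):
    """Return ensemble mean and between-member standard deviation at x."""
    predictions = [member(x) for member in members]
    return fmean(predictions), pstdev(predictions)


def upper_confidence_bound(members, x, kappa=2.0):
    """Acquisition value: favors high predictions and high disagreement."""
    mean, std = predict(members, x)
    return mean + kappa * std


# Candidate compression pressures for the next experiment (illustrative).
candidates = [50, 100, 150, 200, 250]
next_experiment = max(
    candidates, key=lambda x: upper_confidence_bound(ensemble, x)
)
```

After the selected experiment is performed, its result would be added to the training data and the ensemble retrained before the next candidate is chosen. A larger `kappa` shifts the selection toward regions where the members disagree.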

The model may further offer value for batch quality evaluation and change management. Because different batches or lots of the same raw material can be represented through distinct embeddings, the framework could potentially detect subtle shifts in material behavior, such as changes in powder flow, compressibility, or lubrication sensitivity. Comparisons of embedding distributions across batches may provide a principled, data-driven means to monitor raw-material variability and identify atypical batches before they propagate downstream. Such functionality could support root-cause analyses and reinforce overall control strategies in continuous or batch manufacturing environments. Moreover, embeddings might also encode differences between operators, equipment types, or production sites, offering a scalable approach for evaluating process transfer and ensuring consistency across manufacturing settings.
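A comparison of batch embeddings could, for example, rely on the cosine distance between a new batch vector and the centroid of previously accepted batches. The sketch below assumes three-dimensional embeddings, made-up vectors and lot names, and an arbitrary threshold of 0.2. None of these values originate from the present model.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of equally long vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def flag_atypical(reference, candidates, threshold=0.2):
    """Return candidate batches whose distance to the reference centroid exceeds the threshold."""
    center = centroid(list(reference.values()))
    flagged = {}
    for name, vector in candidates.items():
        distance = cosine_distance(vector, center)
        if distance > threshold:
            flagged[name] = distance
    return flagged


# Hypothetical learned embeddings of raw-material lots.
reference_batches = {
    "lot_A": [0.8, 0.1, -0.3],
    "lot_B": [0.7, 0.2, -0.2],
    "lot_C": [0.9, 0.0, -0.3],
}
new_batches = {
    "lot_D": [0.8, 0.1, -0.2],
    "lot_E": [-0.1, 0.9, 0.4],
}

flagged = flag_atypical(reference_batches, new_batches)
```

In this example only `lot_E` is flagged. In practice, the threshold would have to be derived from the observed spread of embeddings among acceptable batches, and embeddings from independently trained models are not directly comparable.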

4. Conclusion

This study shows that word-embedding neural networks can substantially enhance process understanding and predictive performance in pharmaceutical tableting. By learning conceptual relationships among categorical formulation variables, the model transforms nominal factors into structured, continuous representations that can be exploited by downstream predictive architectures. The combination of embedding layers with empirically-guided output functions and a deep-ensemble strategy resulted in a robust framework capable of accurately predicting multiple tablet quality attributes and process parameters across different APIs. Analysis of the learned embedding space revealed interpretable relationships among materials, linking API characteristics to predicted performance and thus contributing to the broader advancement of explainable artificial intelligence in formulation science.

Information-content analyses further demonstrated that blends with very low API content still contribute meaningfully to model training, underscoring the potential of embedding-based approaches for material-efficient experimental design, an aspect of particular relevance during early development when API availability is limited. Beyond the present case study, the framework can be extended in the future to encode additional categorical dimensions, such as equipment type, process line, or manufacturing site, thereby supporting technology transfer and scale-up efforts in modern pharmaceutical manufacturing.

Despite these promising results, important limitations and challenges remain. Embedding models learn relational structure from the available data, which means that insufficient representation of certain materials, APIs, or processing conditions may bias the learned embedding space. The approach also presumes that categorical factors exhibit latent continuity, an assumption that may not hold for all formulation variables, particularly when data coverage for specific categories is sparse or unbalanced. Moreover, the stability of embeddings across independently trained models, datasets, or manufacturing sites requires further investigation, particularly when embeddings are to be reused or transferred. In practical industrial settings, reliable application of such models will additionally depend on robust data-streaming infrastructures, harmonized pre-processing pipelines, and the deeper integration of explainable AI tools capable of interrogating both model outputs and embedding geometry.

Overall, while broader validation across more diverse datasets and operational environments is needed, this work demonstrates that embedding-based, empirically-guided neural networks offer a viable route toward data-efficient, interpretable, and operationally relevant modeling. As such, they represent a promising component of the digitalisation strategy for future pharmaceutical development and manufacturing.

Declaration of generative AI use

During the preparation of this work the authors used ChatGPT-5 (OpenAI) to assist with internal review and language refinement. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.

CRediT authorship contribution statement

Najeeb Abdelrahman: Writing – review & editing, Investigation, Data curation. Stefan Klinken-Uth: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization.

Funding

This research was supported by the German Research Foundation (DFG), SPP 2364, grant number 504702251.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:

Stefan Klinken-Uth reports financial support was provided by German Research Foundation. If there are other authors, they declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors gratefully acknowledge Jörg Breitkreutz for his unwavering support.

Data availability

Data will be made available on request.

References

Anuschek M., Skelbæk-Pedersen A.L., Kvistgaard Vilhelmsen T., Skibsted E., Zeitler J.A., Rantanen J. Terahertz time-domain spectroscopy for the investigation of tablets prepared from roller compacted granules. Int. J. Pharm. 2023;642. doi: 10.1016/j.ijpharm.2023.123165.
Bao Z., Bufton J., Hickman R.J., Aspuru-Guzik A., Bannigan P., Allen C. Revolutionizing drug formulation development: the increasing impact of machine learning. Adv. Drug Deliv. Rev. 2023;202. doi: 10.1016/j.addr.2023.115108.
Basim P., Haware R.V., Dave R.H. Tablet capping predictions of model materials using multivariate approach. Int. J. Pharm. 2019;569. doi: 10.1016/j.ijpharm.2019.118548.
Belew S., Suleman S., Mohammed T., Mekonnen Y., Duguma M., Teshome H., Bayisa B., Wynendaele E., D’Hondt M., Duchateau L., De Spiegeleer B. Quality of fixed dose artemether/lumefantrine products in Jimma Zone, Ethiopia. Malar. J. 2019;18:236. doi: 10.1186/s12936-019-2872-1.
Berkenkemper S., Klinken S., Kleinebudde P. Investigating compressibility descriptors for binary mixtures of different deformation behavior. Powder Technol. 2023;424.
Berkenkemper S., Klinken S., Kleinebudde P. Multivariate data analysis to evaluate commonly used compression descriptors. Int. J. Pharm. 2023;637. doi: 10.1016/j.ijpharm.2023.122890.
Bi Y.X., Sunada H., Yonezawa Y., Danjo K. Evaluation of rapidly disintegrating tablets prepared by a direct compression method. Drug Dev. Ind. Pharm. 1999;25:571–581. doi: 10.1081/ddc-100102211.
Bounab Y., Antikainen O., Sivén M., Juppo A. Advancing direct tablet compression with AI: a multi-task framework for quality control, batch acceptance, and causal analysis. Eur. J. Pharm. Sci. 2025;212. doi: 10.1016/j.ejps.2025.107142.
Brock A., Lim T., Ritchie J.M., Weston N. 2017. Neural Photo Editing With Introspective Adversarial Networks. 5th ICLR.
Caruana R. Multitask Learning. Mach. Learn. 1997;28:41–75.
Corrigan J., Li F., Dawson N., Reynolds G., Bellinghausen S., Zomer S., Litster J. An interaction-based mixing model for predicting porosity and tensile strength of directly compressed ternary blends of pharmaceutical powders. Int. J. Pharm. 2024;664. doi: 10.1016/j.ijpharm.2024.124587.
Dai S., Xu B., Zhang Z., Yu J., Wang F., Shi X., Qiao Y. A compression behavior classification system of pharmaceutical powders for accelerating direct compression tablet formulation design. Int. J. Pharm. 2019;572. doi: 10.1016/j.ijpharm.2019.118742.
De Bisshop J., Klinken S. Prediction of the tensile strength of tablets using LSTM networks on compression profiles. Int. J. Pharm. 2023;645. doi: 10.1016/j.ijpharm.2023.123280.
Deebes M., Mahfouf M., Omar C., Islam S., Morgan B. A plant wide modelling framework for the multistage processes of the continuous manufacturing of pharmaceutical tablets. J. Pharm. Innov. 2025;20:115.
European Pharmacopoeia 12th ed., 2.9.40 Uniformity of Dosage Units 2025.
Fell J.T., Newton J.M. Determination of tablet strength by the diametral-compression test. J. Pharm. Sci. 1970;59:688–691. doi: 10.1002/jps.2600590523.
Galata D.L., Zsiros B., Knyihár G., Péterfi O., Mészáros L.A., Ronkay F., Nagy B., Szabó E., Nagy Z.K., Farkas A. Convolutional neural network-based evaluation of chemical maps obtained by fast Raman imaging for prediction of tablet dissolution profiles. Int. J. Pharm. 2023;640. doi: 10.1016/j.ijpharm.2023.123001.
Ghazwani M., Hani U. Data driven analysis of tablet design via machine learning for evaluation of impact of formulations properties on the disintegration time. Ain Shams Eng. J. 2025;16.
Grigoryan A., Helfrich S., Lequeux V., Lapras B., Marchand C., Merienne C., Bruno F., Mazet R., Pirot F. Smart formulation: AI-driven web platform for optimization and stability prediction of compounded pharmaceuticals using KNIME. Pharmaceuticals. 2025;18:1240. doi: 10.3390/ph18081240.
Gupta D., Biswas A.A., Chand Sahu R., Arora S., Kumar D., Agrawal A.K. Advancing pharmaceutical Intelligence via computationally Prognosticating the in-vitro parameters of fast disintegration tablets using Machine Learning models. Eur. J. Pharm. Biopharm. 2024;204. doi: 10.1016/j.ejpb.2024.114508.
Halton J.H. On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numer. Math. 1960;2:84–90.
Haware R.V., Tho I., Bauer-Brandl A. Multivariate analysis of relationships between material properties, process parameters and tablet tensile strength for α-lactose monohydrates. Eur. J. Pharm. Biopharm. 2009;73:424–431. doi: 10.1016/j.ejpb.2009.08.005.
Hayashi Y., Marumo Y., Takahashi T., Nakano Y., Kosugi A., Kumada S., Hirai D., Takayama K., Onuki Y. In silico predictions of tablet density using a quantitative structure–property relationship model. Int. J. Pharm. 2019;558:351–356. doi: 10.1016/j.ijpharm.2018.12.087.
Hayashi Y., Nakano Y., Marumo Y., Kumada S., Okada K., Onuki Y. Application of machine learning to a material library for modeling of relationships between material properties and tablet properties. Int. J. Pharm. 2021;609. doi: 10.1016/j.ijpharm.2021.121158.
Hertz H. Über die Berührung fester elastischer Körper. J. für Reine Angew. Math. 1882;92:156–171.
Hole G., Hole A.S., McFalone-Shaw I. Digitalization in pharmaceutical industry: what to focus on under the digital implementation process? Int. J. Pharm.:X. 2021;3. doi: 10.1016/j.ijpx.2021.100095.
Honti B., Farkas A., Nagy Z.K., Pataki H., Nagy B. Explainable deep recurrent neural networks for the batch analysis of a pharmaceutical tableting process in the spirit of Pharma 4.0. Int. J. Pharm. 2024;662. doi: 10.1016/j.ijpharm.2024.124509.
Jia R., Dao D., Wang B., Hubis F.A., Hynes N., Gürel N.M., Li B., Zhang C., Song D.X., Spanos C.J. 2019. Towards Efficient Data Valuation Based on the Shapley Value. arXiv abs/1902.10275.
Karniadakis G.E., Kevrekidis I.G., Lu L., Perdikaris P., Wang S., Yang L. Physics-informed machine learning. Nat. Rev. Phys. 2021;3:422–440.
Khalid M.H., Tuszynski P.K., Szlek J., Jachowicz R., Mendyk A. 2015. From black-box to transparent computational intelligence models: A pharmaceutical case study. 13th International Conference on Frontiers of Information Technology (FIT).
Kim S.H., Han S.H. Development of an intelligent tablet press machine for the in-line detection of defective tablets using machine learning and deep learning models. Pharmaceutics. 2025;17:406. doi: 10.3390/pharmaceutics17040406.
Kostewicz E.S., Aarons L., Bergstrand M., Bolger M.B., Galetin A., Hatley O., Jamei M., Lloyd R., Pepin X., Rostami-Hodjegan A., Sjögren E., Tannergren C., Turner D.B., Wagner C., Weitschies W., Dressman J. PBPK models for the prediction of in vivo performance of oral dosage forms. Eur. J. Pharm. Sci. 2014;57:300–321. doi: 10.1016/j.ejps.2013.09.008.
Krogh A., Hertz J.A. 1991. A Simple Weight Decay Can Improve Generalization. 4th NeurIPS.
Lakshminarayanan B., Pritzel A., Blundell C. 2017. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. 31st NeurIPS.
Leane M., Pitt K., Reynolds G., Anwar J., Charlton S., Crean A., Creekmore R., Davies C., DeBeer T., De-Matas M., Djemai A., Douroumis D., Gaisford S., Gamble J., Stone E.H., Kavanagh A., Khimyak Y., Kleinebudde P., Moreton C., Paudel A., Storey R., Toschkoff G., Vyas K. A proposal for a drug product manufacturing classification system (MCS) for oral solid dosage forms. Pharm. Dev. Technol. 2015;20:12–21. doi: 10.3109/10837450.2014.954728.
Li Y., Veettil S.R.U., Pham T., An L., Mohan S., Foti C. Modeling and predicting tablet dissolution slowdown using an acceleration factor approach and constrained neural network. J. Pharm. Sci. 2025;114. doi: 10.1016/j.xphs.2025.104015.
Lim Y.-F., Ng C.K., Vaitesswar U.S., Hippalgaonkar K. Extrapolative Bayesian optimization with Gaussian process and neural network ensemble surrogate models. Adv. Intell. Syst. 2021;3:2100101.
Loshchilov I., Hutter F. 2017. SGDR: Stochastic Gradient Descent With Warm Restarts. 5th ICLR.
Loshchilov I., Hutter F. 2019. Decoupled Weight Decay Regularization. 7th ICLR.
Ma X., Kittikunakorn N., Sorman B., Xi H., Chen A., Marsh M., Mongeau A., Piché N., Williams R.O., Skomski D. Application of deep learning convolutional neural networks for internal tablet defect detection: high accuracy, throughput, and adaptability. J. Pharm. Sci. 2020;109:1547–1557. doi: 10.1016/j.xphs.2020.01.014.
Maas A.L. 2013. Rectifier Nonlinearities Improve Neural Network Acoustic Models. 30th ICML.
McInnes L., Healy J. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv.
Meynard J., Amado-Becker F., Tchoreloff P., Mazel V. On the complexity of predicting tablet capping. Int. J. Pharm. 2022;623. doi: 10.1016/j.ijpharm.2022.121949.
Michrafy A., Michrafy M., Kadiri M.S., Dodds J.A. Predictions of tensile strength of binary tablets using linear and power law mixing rules. Int. J. Pharm. 2007;333:118–126. doi: 10.1016/j.ijpharm.2006.10.008.
Mohammed A., Kora R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ. Comput. Inform. Sci. 2023;35:757–774.
Montavon G., Samek W., Müller K.R. Methods for interpreting and understanding deep neural networks. Digit. Signal Process.: Rev. J. 2018;73:1–15.
Nagy B., Galata D.L., Farkas A., Nagy Z.K. Application of artificial neural networks in the process analytical technology of pharmaceutical manufacturing—a review. AAPS J. 2022;24:74. doi: 10.1208/s12248-022-00706-0.
Perez E., Strub F., De Vries H., Dumoulin V., Courville A. 2018. FiLM: Visual Reasoning With a General Conditioning Layer. 32nd AAAI.
Puckhaber D., Finke J.H., David S., Serratoni M., Zafar U., John E., Juhnke M., Kwade A. Prediction of the impact of lubrication on tablet compactibility. Int. J. Pharm. 2022;617. doi: 10.1016/j.ijpharm.2022.121557.
Quodbach J., Kleinebudde P. Performance of tablet disintegrants: Impact of storage conditions and relative tablet density. Pharm. Dev. Technol. 2015;20:762–768. doi: 10.3109/10837450.2014.920357.
Ramahi A.D.A., Shinde V.V., Pearce T.C., Sinka I.C. Virtual screening of drug materials for pharmaceutical tablet manufacturability with reference to sticking. Int. J. Pharm. 2024;667. doi: 10.1016/j.ijpharm.2024.124722.
Sano S., Kadowaki T., Tsuda K., Kimura S. Application of Bayesian optimization for pharmaceutical product development. J. Pharm. Innov. 2020;15:333–343.
Shahab M., Sixon D., Pierce J.A., Yang M.-H., Gonzalez M., Nagy Z.K., Reklaitis G.V. Enhanced ribbon quality in roller compaction process by mitigating splitting through a machine-learning framework. Int. J. Pharm. 2025;685. doi: 10.1016/j.ijpharm.2025.126174.
Srebro N., Rennie J.D.M., Jaakkola T.S. 2004. Maximum-Margin Matrix Factorization. 17th NeurIPS.
Sun C. A novel method for deriving true density of pharmaceutical solids including hydrates and water-containing powders. J. Pharm. Sci. 2004;93:646–653. doi: 10.1002/jps.10595.
Sun C.C. Dependence of ejection force on tableting speed—a compaction simulation study. Powder Technol. 2015;279:123–126.
Sun C.C., Hou H., Gao P., Ma C., Medina C., Alvarez F.J. Development of a high drug load tablet formulation based on assessment of powder manufacturability: moving towards quality by design. J. Pharm. Sci. 2009;98:239–247. doi: 10.1002/jps.21422.
Tait T., Salehian M., Aroniada M., Shier A.P., Elkes R., Robertson J., Markl D. Empirical Model Variability: developing a new global optimisation approach to populate compression and compaction mixture rules. Int. J. Pharm. 2024;662. doi: 10.1016/j.ijpharm.2024.124475.
Tölle S.R., Klinken-Uth S., Elkhashap A., Delvos A., Breitkreutz J., Vallery H., Stemmler S. Combined temperature-moisture gray-box model for horizontal fluidized bed drying of pharmaceutical granules. IFAC-PapersOnLine. 2025;59:1089–1094.
Tran A.T., Klinken-Uth S. Influences of variations of the amount of compressed material on compressibility, tabletability and compactability. J. Pharm. Sci. 2025;114. doi: 10.1016/j.xphs.2025.103831.
Uzondu B., Leung L.Y., Mao C., Yang C.-Y. A mechanistic study on tablet ejection force and its sensitivity to lubrication for pharmaceutical powders. Int. J. Pharm. 2018;543:234–244. doi: 10.1016/j.ijpharm.2018.03.064.
Valekar P., Buckner I.S. Comparing the applicability of mixing rules to predict the tensile strength of compacted mixtures. J. Pharm. Sci. 2025;104010. doi: 10.1016/j.xphs.2025.104010.
Vamathevan J., Clark D., Czodrowski P., Dunham I., Ferran E., Lee G., Li B., Madabhushi A., Shah P., Spitzer M., Zhao S. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 2019;18:463–477. doi: 10.1038/s41573-019-0024-5.
Vandy A., Conteh E., Lahai M., Kolipha-Kamara M., Marah M., Marah F., Suma K.M., Mattia S.C., Tucker K.D.S., Wray V.S.E., Koroma A., Lebbie A.U. Physicochemical quality assessment of various brands of paracetamol tablets sold in Freetown Municipality. Heliyon. 2024;10. doi: 10.1016/j.heliyon.2024.e25502.
Vreeman G., Sun C.C. A powder tabletability equation. Powder Technol. 2022;408.
Wang W., Ye Z., Gao H., Ouyang D. Computational pharmaceutics - a new paradigm of drug delivery. J. Control. Release. 2021;338:119–136. doi: 10.1016/j.jconrel.2021.08.030.
Willard J.D., Jia X., Xu S., Steinbach M.S., Kumar V. 2020. Integrating Physics-Based Modeling With Machine Learning: A Survey. arXiv.
Xu B., Wang N., Chen T., Li M. 2015. Empirical Evaluation of Rectified Activations in Convolutional Network. arXiv.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Data will be made available on request.

[bb0005] Anuschek M., Skelbæk-Pedersen A.L., Kvistgaard Vilhelmsen T., Skibsted E., Zeitler J.A., Rantanen J. Terahertz time-domain spectroscopy for the investigation of tablets prepared from roller compacted granules. Int. J. Pharm. 2023;642 doi: 10.1016/j.ijpharm.2023.123165. [DOI] [PubMed] [Google Scholar]

[bb0010] Bao Z., Bufton J., Hickman R.J., Aspuru-Guzik A., Bannigan P., Allen C. Revolutionizing drug formulation development: the increasing impact of machine learning. Adv. Drug Deliv. Rev. 2023;202 doi: 10.1016/j.addr.2023.115108. [DOI] [PubMed] [Google Scholar]

[bb0015] Basim P., Haware R.V., Dave R.H. Tablet capping predictions of model materials using multivariate approach. Int. J. Pharm. 2019;569 doi: 10.1016/j.ijpharm.2019.118548. [DOI] [PubMed] [Google Scholar]

[bb0020] Belew S., Suleman S., Mohammed T., Mekonnen Y., Duguma M., Teshome H., Bayisa B., Wynendaele E., D’Hondt M., Duchateau L., De Spiegeleer B. Quality of fixed dose artemether/lumefantrine products in Jimma Zone, Ethiopia. Malar. J. 2019;18:236. doi: 10.1186/s12936-019-2872-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bb0025] Berkenkemper S., Klinken S., Kleinebudde P. Investigating compressibility descriptors for binary mixtures of different deformation behavior. Powder Technol. 2023;424 [Google Scholar]

[bb0030] Berkenkemper S., Klinken S., Kleinebudde P. Multivariate data analysis to evaluate commonly used compression descriptors. Int. J. Pharm. 2023;637 doi: 10.1016/j.ijpharm.2023.122890. [DOI] [PubMed] [Google Scholar]

[bb0035] Bi Y.X., Sunada H., Yonezawa Y., Danjo K. Evaluation of rapidly disintegrating tablets prepared by a direct compression method. Drug Dev. Ind. Pharm. 1999;25:571–581. doi: 10.1081/ddc-100102211. [DOI] [PubMed] [Google Scholar]

[bb0040] Bounab Y., Antikainen O., Sivén M., Juppo A. Advancing direct tablet compression with AI: a multi-task framework for quality control, batch acceptance, and causal analysis. Eur. J. Pharm. Sci. 2025;212 doi: 10.1016/j.ejps.2025.107142. [DOI] [PubMed] [Google Scholar]

[bb0045] Brock A., Lim T., Ritchie J.M., Weston N. 2017. Neural Photo Editing With Introspective Adversarial Networks. 5th ICLR. [Google Scholar]

[bb0050] Caruana R. Multitask Learning. Mach. Learn. 1997;28:41–75. [Google Scholar]

[bb0055] Corrigan J., Li F., Dawson N., Reynolds G., Bellinghausen S., Zomer S., Litster J. An interaction-based mixing model for predicting porosity and tensile strength of directly compressed ternary blends of pharmaceutical powders. Int. J. Pharm. 2024;664 doi: 10.1016/j.ijpharm.2024.124587. [DOI] [PubMed] [Google Scholar]

[bb0060] Dai S., Xu B., Zhang Z., Yu J., Wang F., Shi X., Qiao Y. A compression behavior classification system of pharmaceutical powders for accelerating direct compression tablet formulation design. Int. J. Pharm. 2019;572 doi: 10.1016/j.ijpharm.2019.118742. [DOI] [PubMed] [Google Scholar]

[bb0065] De Bisshop J., Klinken S. Prediction of the tensile strength of tablets using LSTM networks on compression profiles. Int. J. Pharm. 2023;645 doi: 10.1016/j.ijpharm.2023.123280. [DOI] [PubMed] [Google Scholar]

[bb0070] Deebes M., Mahfouf M., Omar C., Islam S., Morgan B. A plant wide modelling framework for the multistage processes of the continuous manufacturing of pharmaceutical tablets. J. Pharm. Innov. 2025;20:115. [Google Scholar]

[bb0075] European Pharmacopoeia 12th ed., 2.9.40 Uniformity of Dosage Units 2025.

[bb0080] Fell J.T., Newton J.M. Determination of tablet strength by the diametral-compression test. J. Pharm. Sci. 1970;59:688–691. doi: 10.1002/jps.2600590523. [DOI] [PubMed] [Google Scholar]

[bb0085] Galata D.L., Zsiros B., Knyihár G., Péterfi O., Mészáros L.A., Ronkay F., Nagy B., Szabó E., Nagy Z.K., Farkas A. Convolutional neural network-based evaluation of chemical maps obtained by fast Raman imaging for prediction of tablet dissolution profiles. Int. J. Pharm. 2023;640 doi: 10.1016/j.ijpharm.2023.123001. [DOI] [PubMed] [Google Scholar]

[bb0090] Ghazwani M., Hani U. Data driven analysis of tablet design via machine learning for evaluation of impact of formulations properties on the disintegration time. Ain Shams Eng. J. 2025;16 [Google Scholar]

[bb0095] Grigoryan A., Helfrich S., Lequeux V., Lapras B., Marchand C., Merienne C., Bruno F., Mazet R., Pirot F. Smart formulation: AI-driven web platform for optimization and stability prediction of compounded pharmaceuticals using KNIME. Pharmaceuticals. 2025;18:1240. doi: 10.3390/ph18081240. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bb0100] Gupta D., Biswas A.A., Chand Sahu R., Arora S., Kumar D., Agrawal A.K. Advancing pharmaceutical Intelligence via computationally Prognosticating the in-vitro parameters of fast disintegration tablets using Machine Learning models. Eur. J. Pharmacol. 2024;204 doi: 10.1016/j.ejpb.2024.114508. [DOI] [PubMed] [Google Scholar]

[bb0105] Halton J.H. On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numer. Math. 1960;2:84–90. [Google Scholar]

[bb0110] Haware R.V., Tho I., Bauer-Brandl A. Multivariate analysis of relationships between material properties, process parameters and tablet tensile strength for α-lactose monohydrates. Eur. J. Pharmacol. 2009;73:424–431. doi: 10.1016/j.ejpb.2009.08.005. [DOI] [PubMed] [Google Scholar]

[bb0115] Hayashi Y., Marumo Y., Takahashi T., Nakano Y., Kosugi A., Kumada S., Hirai D., Takayama K., Onuki Y. In silico predictions of tablet density using a quantitative structure–property relationship model. Int. J. Pharm. 2019;558:351–356. doi: 10.1016/j.ijpharm.2018.12.087.

[bb0120] Hayashi Y., Nakano Y., Marumo Y., Kumada S., Okada K., Onuki Y. Application of machine learning to a material library for modeling of relationships between material properties and tablet properties. Int. J. Pharm. 2021;609. doi: 10.1016/j.ijpharm.2021.121158.

[bb0125] Hertz H. Über die Berührung fester elastischer Körper. J. für Reine Angew. Math. 1882;92:156–171.

[bb0130] Hole G., Hole A.S., McFalone-Shaw I. Digitalization in pharmaceutical industry: what to focus on under the digital implementation process? Int. J. Pharm. X. 2021;3. doi: 10.1016/j.ijpx.2021.100095.

[bb0135] Honti B., Farkas A., Nagy Z.K., Pataki H., Nagy B. Explainable deep recurrent neural networks for the batch analysis of a pharmaceutical tableting process in the spirit of Pharma 4.0. Int. J. Pharm. 2024;662. doi: 10.1016/j.ijpharm.2024.124509.

[bb0140] Jia R., Dao D., Wang B., Hubis F.A., Hynes N., Gürel N.M., Li B., Zhang C., Song D.X., Spanos C.J. 2019. Towards Efficient Data Valuation Based on the Shapley Value. arXiv:1902.10275.

[bb0145] Karniadakis G.E., Kevrekidis I.G., Lu L., Perdikaris P., Wang S., Yang L. Physics-informed machine learning. Nat. Rev. Phys. 2021;3:422–440.

[bb0150] Khalid M.H., Tuszynski P.K., Szlek J., Jachowicz R., Mendyk A. From black-box to transparent computational intelligence models: A pharmaceutical case study. 13th International Conference on Frontiers of Information Technology (FIT). 2015.

[bb0155] Kim S.H., Han S.H. Development of an intelligent tablet press machine for the in-line detection of defective tablets using machine learning and deep learning models. Pharmaceutics. 2025;17:406. doi: 10.3390/pharmaceutics17040406.

[bb0160] Kostewicz E.S., Aarons L., Bergstrand M., Bolger M.B., Galetin A., Hatley O., Jamei M., Lloyd R., Pepin X., Rostami-Hodjegan A., Sjögren E., Tannergren C., Turner D.B., Wagner C., Weitschies W., Dressman J. PBPK models for the prediction of in vivo performance of oral dosage forms. Eur. J. Pharm. Sci. 2014;57:300–321. doi: 10.1016/j.ejps.2013.09.008.

[bb0165] Krogh A., Hertz J.A. 1991. A Simple Weight Decay Can Improve Generalization. 4th NeurIPS.

[bb0170] Lakshminarayanan B., Pritzel A., Blundell C. 2017. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. 31st NeurIPS.

[bb0175] Leane M., Pitt K., Reynolds G., Anwar J., Charlton S., Crean A., Creekmore R., Davies C., DeBeer T., De-Matas M., Djemai A., Douroumis D., Gaisford S., Gamble J., Stone E.H., Kavanagh A., Khimyak Y., Kleinebudde P., Moreton C., Paudel A., Storey R., Toschkoff G., Vyas K. A proposal for a drug product manufacturing classification system (MCS) for oral solid dosage forms. Pharm. Dev. Technol. 2015;20:12–21. doi: 10.3109/10837450.2014.954728.

[bb0180] Li Y., Veettil S.R.U., Pham T., An L., Mohan S., Foti C. Modeling and predicting tablet dissolution slowdown using an acceleration factor approach and constrained neural network. J. Pharm. Sci. 2025;114. doi: 10.1016/j.xphs.2025.104015.

[bb0185] Lim Y.-F., Ng C.K., Vaitesswar U.S., Hippalgaonkar K. Extrapolative Bayesian optimization with Gaussian process and neural network ensemble surrogate models. Adv. Intell. Syst. 2021;3:2100101.

[bb0190] Loshchilov I., Hutter F. 2017. SGDR: Stochastic Gradient Descent With Warm Restarts. 5th ICLR.

[bb0195] Loshchilov I., Hutter F. 2019. Decoupled Weight Decay Regularization. 7th ICLR.

[bb0200] Ma X., Kittikunakorn N., Sorman B., Xi H., Chen A., Marsh M., Mongeau A., Piché N., Williams R.O., Skomski D. Application of deep learning convolutional neural networks for internal tablet defect detection: high accuracy, throughput, and adaptability. J. Pharm. Sci. 2020;109:1547–1557. doi: 10.1016/j.xphs.2020.01.014.

[bb0205] Maas A.L. 2013. Rectifier Nonlinearities Improve Neural Network Acoustic Models. 30th ICML.

[bb0210] McInnes L., Healy J. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv.

[bb0215] Meynard J., Amado-Becker F., Tchoreloff P., Mazel V. On the complexity of predicting tablet capping. Int. J. Pharm. 2022;623. doi: 10.1016/j.ijpharm.2022.121949.

[bb0220] Michrafy A., Michrafy M., Kadiri M.S., Dodds J.A. Predictions of tensile strength of binary tablets using linear and power law mixing rules. Int. J. Pharm. 2007;333:118–126. doi: 10.1016/j.ijpharm.2006.10.008.

[bb0225] Mohammed A., Kora R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ. Comput. Inform. Sci. 2023;35:757–774.

[bb0230] Montavon G., Samek W., Müller K.R. Methods for interpreting and understanding deep neural networks. Digit. Signal Process.: Rev. J. 2018;73:1–15.

[bb0235] Nagy B., Galata D.L., Farkas A., Nagy Z.K. Application of artificial neural networks in the process analytical technology of pharmaceutical manufacturing—a review. AAPS J. 2022;24:74. doi: 10.1208/s12248-022-00706-0.

[bb0240] Perez E., Strub F., De Vries H., Dumoulin V., Courville A. 2018. FiLM: Visual Reasoning With a General Conditioning Layer. 32nd AAAI.

[bb0245] Puckhaber D., Finke J.H., David S., Serratoni M., Zafar U., John E., Juhnke M., Kwade A. Prediction of the impact of lubrication on tablet compactibility. Int. J. Pharm. 2022;617. doi: 10.1016/j.ijpharm.2022.121557.

[bb0250] Quodbach J., Kleinebudde P. Performance of tablet disintegrants: Impact of storage conditions and relative tablet density. Pharm. Dev. Technol. 2015;20:762–768. doi: 10.3109/10837450.2014.920357.

[bb0255] Ramahi A.D.A., Shinde V.V., Pearce T.C., Sinka I.C. Virtual screening of drug materials for pharmaceutical tablet manufacturability with reference to sticking. Int. J. Pharm. 2024;667. doi: 10.1016/j.ijpharm.2024.124722.

[bb0260] Sano S., Kadowaki T., Tsuda K., Kimura S. Application of Bayesian optimization for pharmaceutical product development. J. Pharm. Innov. 2020;15:333–343.

[bb0265] Shahab M., Sixon D., Pierce J.A., Yang M.-H., Gonzalez M., Nagy Z.K., Reklaitis G.V. Enhanced ribbon quality in roller compaction process by mitigating splitting through a machine-learning framework. Int. J. Pharm. 2025;685. doi: 10.1016/j.ijpharm.2025.126174.

[bb0270] Srebro N., Rennie J.D.M., Jaakkola T.S. 2004. Maximum-Margin Matrix Factorization. 17th NeurIPS.

[bb0275] Sun C. A novel method for deriving true density of pharmaceutical solids including hydrates and water-containing powders. J. Pharm. Sci. 2004;93:646–653. doi: 10.1002/jps.10595.

[bb0280] Sun C.C. Dependence of ejection force on tableting speed—a compaction simulation study. Powder Technol. 2015;279:123–126.

[bb0285] Sun C.C., Hou H., Gao P., Ma C., Medina C., Alvarez F.J. Development of a high drug load tablet formulation based on assessment of powder manufacturability: moving towards quality by design. J. Pharm. Sci. 2009;98:239–247. doi: 10.1002/jps.21422.

[bb0290] Tait T., Salehian M., Aroniada M., Shier A.P., Elkes R., Robertson J., Markl D. Empirical Model Variability: developing a new global optimisation approach to populate compression and compaction mixture rules. Int. J. Pharm. 2024;662. doi: 10.1016/j.ijpharm.2024.124475.

[bb0295] Tölle S.R., Klinken-Uth S., Elkhashap A., Delvos A., Breitkreutz J., Vallery H., Stemmler S. Combined temperature-moisture gray-box model for horizontal fluidized bed drying of pharmaceutical granules. IFAC-PapersOnLine. 2025;59:1089–1094.

[bb0300] Tran A.T., Klinken-Uth S. Influences of variations of the amount of compressed material on compressibility, tabletability and compactability. J. Pharm. Sci. 2025;114. doi: 10.1016/j.xphs.2025.103831.

[bb0305] Uzondu B., Leung L.Y., Mao C., Yang C.-Y. A mechanistic study on tablet ejection force and its sensitivity to lubrication for pharmaceutical powders. Int. J. Pharm. 2018;543:234–244. doi: 10.1016/j.ijpharm.2018.03.064.

[bb0310] Valekar P., Buckner I.S. Comparing the applicability of mixing rules to predict the tensile strength of compacted mixtures. J. Pharm. Sci. 2025;104010. doi: 10.1016/j.xphs.2025.104010.

[bb0315] Vamathevan J., Clark D., Czodrowski P., Dunham I., Ferran E., Lee G., Li B., Madabhushi A., Shah P., Spitzer M., Zhao S. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 2019;18:463–477. doi: 10.1038/s41573-019-0024-5.

[bb0320] Vandy A., Conteh E., Lahai M., Kolipha-Kamara M., Marah M., Marah F., Suma K.M., Mattia S.C., Tucker K.D.S., Wray V.S.E., Koroma A., Lebbie A.U. Physicochemical quality assessment of various brands of paracetamol tablets sold in Freetown Municipality. Heliyon. 2024;10. doi: 10.1016/j.heliyon.2024.e25502.

[bb0325] Vreeman G., Sun C.C. A powder tabletability equation. Powder Technol. 2022;408.

[bb0330] Wang W., Ye Z., Gao H., Ouyang D. Computational pharmaceutics - a new paradigm of drug delivery. J. Control. Release. 2021;338:119–136. doi: 10.1016/j.jconrel.2021.08.030.

[bb0335] Willard J.D., Jia X., Xu S., Steinbach M.S., Kumar V. 2020. Integrating Physics-Based Modeling With Machine Learning: A Survey. arXiv.

[bb0340] Xu B., Wang N., Chen T., Li M. 2015. Empirical Evaluation of Rectified Activations in Convolutional Network. arXiv.
