Iterative Regression of Corrective Baselines (IRCB): A New Model for Quantitative Spectroscopy

Matthew Glace; Roudabeh S Moazeni-Pourasil; Daniel W Cook; Thomas D Roper

doi:10.1021/acs.jcim.4c00359

. 2024 Jun 19;64(13):5006–5015. doi: 10.1021/acs.jcim.4c00359


PMCID: PMC11234360 PMID: 38897609

Abstract


In this work, a new model with broad utility for quantitative spectroscopy development is reported. A primary objective of this work is to create a novel modeling procedure that may allow for higher automation of the model development process. The fundamental concept is simple yet powerful even for complex spectra and is employed with no additional preprocessing. This approach is applicable for several types of spectroscopic data to develop regression models that have similar or greater quality than the current methods. The key modeling steps are a matrix transformation and subsequent feature selection process that are collectively referred to as iterative regression of corrective baselines (IRCB). The transformed matrix (X_transform) is a linearized form of the original X data set. Features from X_transform that are predictive of Y can be ranked and selected by ordinary least-squares regression. The best features (rows of X_transform) are linear depictions of Y that can be utilized to develop regression models with several machine learning models. The IRCB workflow is first detailed by using a case study of Fourier transform infrared (FTIR) spectroscopy for prepared solutions of a three-component mixture. Next, IRCB is applied and compared to benchmark results for the 2006 “Chimiométrie” near-infrared spectroscopy (NIR) soil composition challenge and Raman measurements of a simulated nuclear waste slurry.

Spectroscopic instrumentation, when combined with chemometric or machine learning models, becomes a very effective tool for process analytical chemistry (PAC).^1,2 These techniques are nondestructive and can be employed in-line, or online, to monitor processes in real time.³ Raman and infrared spectroscopies have been applied for an increasing number of use cases. Among others, these applications of spectroscopy include food,⁴⁻⁶ pharmaceuticals,⁷⁻¹⁰ cosmetics,¹¹ tobacco,¹² and nuclear waste.^13,14 In 2004, the US Food and Drug Administration (FDA) and the International Council for Harmonization (ICH) established an initiative to apply process analytical technologies (PAT), including spectroscopic PAC, for manufacturing quality assurance.^15,16 Many studies have reported on the use of spectroscopic analyzers, such as infrared (IR) and Raman, to monitor various stages of pharmaceutical manufacturing.^8,9,17−22 Spectroscopic PAT is useful both for real-time release and model predictive control.^7,9,20,22 The linking of PAT to continuous manufacturing for real-time optimization and control using artificial intelligence was referred to by Price et al. as the “holy grail”.⁷

Because spectrometers do not provide physical separation between the measured compounds, the resulting measurement is the combined molecular fingerprint for all of the compounds within the mixture, providing data-rich but highly complex spectra.^23,24 Partial least-squares regression (PLS-R) and principal component analysis (PCA) have typically been utilized to deconvolute and model the resulting spectra.²⁵⁻²⁷ The development and implementation of quantitative models from the spectra has historically been a challenging task.²⁸⁻³⁰ Data treatment, known as preprocessing, is also typically required for complex mixtures to correct for nonlinearity and to focus the model on the analyte of interest. Significant efforts have been made in the development of new preprocessing techniques to improve the capabilities of spectroscopic PAC to model more complex data, such as crude reaction mixtures. As such, new types of data processing are frequently reported, some of which rely on iterative approaches or neural networks for preprocessing optimization.³¹⁻³⁹ Although artificial intelligence has previously been applied for preprocessing treatments, few examples for end-to-end automated quantitative model development have been attempted.⁴⁰ Automated end-to-end quantitative model development may provide significant advantages for the generalizable accuracy and repeatability of chemometric models.

In this work, we introduce a new model for quantitative spectroscopy development termed iterative regression of corrective baselines (IRCB). The proposed approach is simple, intuitive, and highly automated; yet it can provide valuable spectra insights and be used to generate predictions that may outperform existing PLS-R models. The approach is based in statistics and does not rely on any spectral interpretation from molecular structure or carry forward previous knowledge. IRCB is utilized in tandem with several supervised machine learning models such as ensemble linear regression (ELR), random forest (RF) from scikit-learn,⁴¹ and extreme gradient boosting⁴² (XGB) to complete the model construction. While the IRCB model itself may be conceptualized as a preprocessing step for machine learning, it is used without any additional preprocessing. In summary, IRCB is an expansive matrix transformation that effectively generates many linear predictors from the original data.

We hypothesize that IRCB can improve the automatability and interpretability of regression model development for many types of spectroscopic analytical techniques. The employed computational approach can be beneficial to identify the spectral regions of high selectivity and result in more consistent results across different model developers. Herein, the effectiveness of the IRCB model is assessed with several diverse spectroscopic PAC case studies. The IRCB workflow is first detailed using a Fourier transform infrared (FTIR) spectroscopy case study for prepared solutions of propofol and two structurally related impurities.⁴³ Next, IRCB is applied, and the statistical results are compared for two additional previously benchmarked case studies. Case study 2 is near-infrared (NIR) measurements of crude soil samples,⁴⁴ and case study 3 is Raman spectroscopy quantification of solids in a slurry of simulated nuclear waste.¹⁴ The additional insights from the novel matrix transformation are also discussed.

Experimental Section

Iterative Regression of Corrective Baselines (IRCB)

The IRCB model is detailed in Figure 1, where “n” is the number of spectra and “p” is the number of data points in each spectrum. In addition to the calibration matrices of X (spectra) and Y (concentrations), the choice of a final machine learning predictive model is also required for regression model development. No preprocessing or structural information about the spectra is required. The necessary model parameters (in the form of baseline indices) are passed from the IRCB procedure to the trained model for application on X_test. All test samples, even within k-folds, are excluded for the entirety of model development. The X_transform columns remain sample-specific, whereas the number of rows in X_transform along the expanded axis is a function of the original number of data points “p” in the spectra. The operation from X_calibration to X_transform results in a unit change of the matrix elements from arbitrary units (a.u.) to (a.u.)². The generation of X_transform is facilitated by the application of an iterative baseline correction and a subsequent area summation. Each X_transform entry stores the area between a spectrum and a “baseline”. Every unique baseline is a line segment with end points at two specific locations along the spectrum.⁸ The position of the line segment end points is row-specific within X_transform and sufficient to describe the application of a unique baseline to all samples in X_calibration and X_test.

Procedure for model fitting with the IRCB.

For each row of X_transform, a unique pair of start and stop data point locations will be applied to generate a linear baseline for each spectrum of X_calibration individually. Because each baseline and spectra contain discrete data, a uniformly scaled area between them can be calculated using a trapezoidal summation applying a length of one between data points.⁸ For all data sets with equidistant spacing between observations, the spacing is an arbitrary scale factor, so it can be removed for computational efficiency. Therefore, the simplest area summation procedure is taking the sum of the matrix that results from subtracting the baseline from the spectral response. Accordingly, the “inverted” areas (above the spectra and beneath the corrective baseline) are considered negative during the area approximation.
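The area calculation for a single baseline and a single spectrum can be sketched as follows. This is a minimal illustration that assumes unit spacing between data points; the function name `baseline_area` and the toy spectra are hypothetical and are not taken from the authors' code. Because the spectrum and the baseline coincide at both end points, the trapezoidal sum with unit spacing reduces to the plain sum of the point-wise differences.

```python
def baseline_area(spectrum, start, stop):
    """Signed area between a spectrum and the straight line joining
    spectrum[start] and spectrum[stop], assuming unit point spacing."""
    y0, y1 = spectrum[start], spectrum[stop]
    span = stop - start
    area = 0.0
    for i in range(start, stop + 1):
        # Value of the linear baseline at data point i
        baseline = y0 + (y1 - y0) * (i - start) / span
        # Points below the baseline contribute negative ("inverted") area
        area += spectrum[i] - baseline
    return area


print(baseline_area([0.0, 1.0, 0.0], 0, 2))  # peak above the baseline
print(baseline_area([2.0, 0.0, 2.0], 0, 2))  # "inverted" region, negative area
print(baseline_area([1.0, 3.0, 5.0], 0, 2))  # purely linear segment, no area
```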

If the baseline connects two adjacent data points, then the area between the baseline and the response will be zero. Therefore, only baselines spanning at least three data points result in a summed area that is nonzero and are useful for the next step of the operation. The number of potentially useful linear baselines, or the maximum number of rows contained in X_transform, is defined as $\sum_{a=1}^{p-2} a = (p-1)(p-2)/2$, where p is the number of data points in the spectra and “a” is an arbitrary counter variable. The baseline generation algorithm is a comprehensive approach that considers every possible linear connection of two data points that can result in a nonzero area between the linear baseline and the spectra within the range of the two data points.
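The row count can be checked numerically. The sketch below enumerates every (start, stop) index pair that spans at least three data points and compares the count with the closed form (p − 1)(p − 2)/2, which equals the sum of the integers from 1 to p − 2. The function names are illustrative assumptions. For p = 1798, the number of data points per spectrum in case study 1, the result matches the 1,613,706 rows reported for X_transform.

```python
from itertools import combinations


def n_baselines(p):
    """Number of (start, stop) pairs that span at least three data points."""
    return (p - 1) * (p - 2) // 2


def baseline_pairs(p):
    """Yield every (start, stop) index pair with stop - start >= 2."""
    for start, stop in combinations(range(p), 2):
        if stop - start >= 2:
            yield start, stop


# Brute-force enumeration agrees with the closed form and the summation
assert len(list(baseline_pairs(6))) == n_baselines(6) == sum(range(1, 5))
print(n_baselines(1798))
```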

Because of the expansive nature of the iterative baseline correction, X_transform is significantly larger than the original X data set. A procedure to select the most useful features (rows) of X_transform is next employed. Each row within X_transform is assigned a coefficient of determination (R²) for Y_calibration and sorted row-wise by the R² value assigned. Because of this sorting, the highest rated features of X_transform are linear depictions of Y. After sorting, an arbitrary number of the top features (highest R²) within X_transform are carried forward into the new matrix X_features. For the case studies described in this work, generally around 2% of the highest rated features of X_transform were selected for X_features, although using fewer may also result in an adequate prediction, as dictated by the complexity of the system. For any model, the start and stop baseline locations from the calibration set are indexed to generate an equivalent X_features matrix (same number of rows) for the test set(s). The Python code to produce and sort the X_transform matrix from the initial spectra is detailed in the Supporting Information and is available from the GitHub linked in the Data and Software Availability section. For readability, the most fundamental version of the baseline correction operation is shown in the Supporting Information, although a much faster multicore version is available from the GitHub. For the regression operation, a less resource-intensive multithreading approach was employed to significantly enhance computational feasibility, as shown in the Supporting Information.
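The ranking step can be illustrated with a small, single-threaded sketch. It is not the authors' implementation (which is in the Supporting Information and on their GitHub); the function names, the toy spectra, and the `top_fraction` parameter are assumptions made for illustration. R² is computed here as the squared Pearson correlation, which equals the coefficient of determination of a one-variable ordinary least-squares fit with an intercept.

```python
from statistics import mean


def baseline_area(spectrum, start, stop):
    """Signed area between the spectrum and a linear baseline (unit spacing)."""
    y0, y1 = spectrum[start], spectrum[stop]
    return sum(spectrum[i] - (y0 + (y1 - y0) * (i - start) / (stop - start))
               for i in range(start, stop + 1))


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R² of a one-variable OLS fit."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant row carries no information about y
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def rank_baselines(spectra, y, top_fraction=0.02):
    """Return the top-ranked (r2, start, stop) tuples, best first."""
    p = len(spectra[0])
    scored = []
    for start in range(p - 2):
        for stop in range(start + 2, p):
            row = [baseline_area(s, start, stop) for s in spectra]
            scored.append((r_squared(row, y), start, stop))
    scored.sort(key=lambda t: t[0], reverse=True)
    keep = max(1, round(top_fraction * len(scored)))
    return scored[:keep]


# Toy data (hypothetical): a sloping background plus a one-point "peak"
# at index 2 whose height scales with the concentration y.
y = [1.0, 2.0, 3.0, 4.0]
offsets = [1.0, 2.0, 0.5, 1.5]
slopes = [0.5, -0.25, 0.0, 1.0]
spectra = [[o + m * i + (c if i == 2 else 0.0) for i in range(6)]
           for o, m, c in zip(offsets, slopes, y)]
top = rank_baselines(spectra, y, top_fraction=0.2)
for r2, start, stop in top:
    print(f"baseline {start}-{stop}: R2 = {r2:.3f}")
```

The retained (start, stop) pairs are the only parameters that need to be stored; applying the same pairs to test spectra yields an X_features matrix with the same number of rows.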

In a classical spectroscopic interpretation, the simplest corrective baseline (row of X_transform) captures a selective analyte peak which correlates to concentration as dictated by the Beer–Lambert law. An example of a highly rated baseline is shown in Figure 2. The area between the spectral response and the baseline, which spans from 1079 to 1087 cm^–1, is selective for the targeted component concentration (Y_calibration) as demonstrated by a high R² value. This indicates that this feature of X_transform will be useful for making predictions about Y and is likely to be included in X_features. The matrix X_features is useful as input for a variety of predictive machine learning models, including ensemble linear regression (ELR), random forest (RF), and extreme gradient boosting (XGB).

Sample baseline correction and R² assignment for a highly effective baseline (1079–1087 cm^–1).

The predictions produced by the various machine learning models may vary in effectiveness based on the complexity of the spectra and size of the calibration data. For simple cases, such as a three-component mixture of pure compounds, ELR was the most effective and interpretable. For the ELR approach, each X_features element (baseline) was used as a simple linear regression model and the predictions of numerous linear regression models were averaged for the test set prediction. The exact number of ensembled regression models included was determined by minimizing the error of the cross-validation prediction of the calibration set. This ELR strategy was specifically developed for compatibility with IRCB to select the most appropriate number of baselines to include in the final predictions, up to the maximum threshold set by the user. For the more complicated examples, RF and XGB were required to produce accurate and robust test set predictions. However, for the RF and XGB models, the percentage of the sorted X_transform that is included in X_features must be manually specified by the user and is used directly without the potential reduction that is possible in the ELR case.
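One way the ELR idea might be realized is sketched below: each ranked feature row gets its own simple linear regression, the first k predictions are averaged, and k is chosen by minimizing the cross-validated RMSE on the calibration set. This is a simplified illustration under stated assumptions, not the authors' code: the function names, the interleaved fold assignment, the choice to regress concentration on area, and the toy data are all hypothetical, and the feature ranking is taken as given rather than repeated inside each fold.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def cv_rmse(rows, y, k, n_folds=5):
    """Cross-validated RMSE when the first k ranked rows are ensembled."""
    n = len(y)
    squared_error = 0.0
    for fold in range(n_folds):
        held_out = set(range(fold, n, n_folds))
        train = [i for i in range(n) if i not in held_out]
        # One simple linear model per feature row, fitted on training samples
        models = [fit_line([row[i] for i in train], [y[i] for i in train])
                  for row in rows[:k]]
        for i in held_out:
            # Ensemble prediction = average of the k single-feature predictions
            pred = mean([s * row[i] + b for (s, b), row in zip(models, rows)])
            squared_error += (pred - y[i]) ** 2
    return (squared_error / n) ** 0.5


def choose_ensemble_size(rows, y, max_k, n_folds=5):
    """Pick the number of ensembled models that minimizes the CV error."""
    errors = {k: cv_rmse(rows, y, k, n_folds) for k in range(1, max_k + 1)}
    best_k = min(errors, key=errors.get)
    return best_k, errors[best_k]


# Toy feature rows (hypothetical), already sorted from best to worst
y = [float(v) for v in range(1, 11)]
rows = [
    [2.0 * v for v in y],                                # perfectly linear in y
    [v + d for v, d in zip(y, [0.5, -0.5] * 5)],         # linear plus noise
    [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0],  # unrelated to y
]
best_k, best_err = choose_ensemble_size(rows, y, max_k=3)
print(best_k, best_err)
```

In this toy case the noisy and unrelated rows only degrade the average, so the search keeps a single model; with real spectra, averaging several good baselines may reduce the variance of the prediction.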

Materials

Materials were used as received from vendors. 2,6-Diisopropylphenol (100%) and 2-isopropyl phenol (98%) were procured from Chem-Impex International, Inc. 4-Hydroxy-3,5-diisopropylbenzoic acid (98%) was procured from Combi-Blocks. The 4-hydroxy-3,5-diisopropylbenzoic acid used for spectroscopic measurements was synthesized in-house and purified by recrystallization from heptane. Acetonitrile (HPLC grade) was procured from Sigma-Aldrich.

Sample Preparation—Case Study 1

Each stock solution of the analyte was generated by the dissolution of the purified compounds into acetonitrile. A total of 14 calibration and 10 test samples were prepared in the range of 5.0–45.0 mg mL^–1 for each analyte. For reference concentrations, each sample was analyzed by HPLC in duplicate after dilution of the FTIR samples into the linear range of the HPLC calibration. The concentrations as determined by HPLC were considered the true analyte concentration (Table S1).

High-Performance Liquid Chromatography

For each analyte (Propofol [1], 2-IP [2], HDIPBA [3]), a 6-point calibration curve, from 0.05 to 1.00 mg mL^–1, was prepared in acetonitrile. A minimum coefficient of determination (R² = 0.99) was enforced for all HPLC calibrations. Samples were analyzed in duplicate. Full details of the HPLC method are outlined in the Supporting Information.

Fourier Transform Infrared Spectroscopy

A ReactIR 15 instrument was equipped with a DS Micro Flow Cell. The detector was chilled for at least 2 h with liquid N₂ prior to analysis. FTIR samples were maintained at room temperature prior to and during analysis. Samples were analyzed by manual injection into the Micro Flow Cell DS DiComp. A 30 s scan time was selected, and three spectra were collected for all calibration and test samples.

Results and Discussion

A primary aim of the methods employed in this study was to model the target Y from the multicomponent calibration set X without reference spectra, manual peak identification, or inferred structural knowledge. To demonstrate the wide versatility and usefulness of IRCB, three case studies were investigated and reported. An outline of all three case studies is given in Table 1. The three case studies utilized different instruments (FTIR, NIR, Raman) and were applied to physically different sample types (Solution, Solids, Slurry). Similarly, X_features was used as an input for three different machine learning models to generate the final Y prediction. In case study 1, the solution concentration of three different pharmaceutically relevant analytes was predicted using FTIR. In case study 2, IRCB was applied to a previously reported data set that utilized NIR to measure solid soil samples. Lastly, in case study 3, IRCB was tested to predict the concentrations of five solids within a complex slurry using Raman spectroscopy. For case studies 2 and 3, the modeling results may be compared to the previously published benchmarks.^14,44

Table 1. Outline of the Case Studies.

case study	description	num. models	instrument	type	model
1	propofol	3	FTIR	solution	ELR
2	soil	3	NIR	solids	RF
3	nuclear waste	5	Raman	slurry	XGB


Case Study 1: FTIR for Three-Component Mixture

In the first case study, the model development procedure for a three-compound mixture of propofol and two structurally similar impurities is detailed. The three chemical structures from the solution are shown in Figure 3. These three compounds were chosen based on their structural similarity. Propofol has no unique functional selectivity when compared to those of 4-hydroxy-3,5-diisopropylbenzoic acid (HDIPBA) and 2-isopropyl phenol (2-IP) collectively.

Chemical structures for (1) 2,6-diisopropylphenol (Propofol), (2) 4-hydroxy-3,5-diisopropylbenzoic acid (HDIPBA), and (3) 2-isopropyl phenol (2-IP).

X_calibration contained 42 calibration spectra (14 independent samples × 3 replicates) and 1798 data points for each spectrum. The samples all contained each of the three analytes at concentrations between 5.0 and 45.0 mg mL^–1. The matrix transformation was applied to create X_transform of size 1,613,706 × 42. Although three different analyte models were developed, X_transform is a comprehensive matrix of X_calibration that is generic to all three substrates. An X_features matrix must be generated for each substrate independently using the generic X_transform and a substrate-specific Y_calibration. The top four baselines of X_features are described for each of the three substrates in Table 2. Highly selective baselines were discovered for each analyte as demonstrated by several R² values >0.99. The upper and lower limits describe the end point locations of the baselines contained within X_features. Notably, Table 2 describes only the four most selective baselines for each of the three substrates, but X_features contains many additional baselines for each analyte with a gradually decreasing R² value for each entry. A full description of each baseline in the X_features matrices is available for this case study and the others in the Supporting Information.

Table 2. Best Baselines for Case Study 1.

start (cm^–1)	stop (cm^–1)	R²	slope (A.U.² mg^–1)	intercept (A.U.²)
a. Propofol
1443	1650	0.998	–3.50 × 10⁰¹	–2.95 × 10⁰²
1201	1214	0.998	8.03 × 10⁰²	–4.10 × 10⁰⁰
930	975	0.998	–4.07 × 10⁰²	–1.03 × 10⁰²
1443	1648	0.998	–3.57 × 10⁰¹	–2.96 × 10⁰²
b. HDIPBA
1655	1778	0.999	2.73 × 10⁰¹	2.04 × 10⁰⁰
1655	1782	0.999	2.73 × 10⁰¹	2.22 × 10⁰⁰
1655	1780	0.999	2.73 × 10⁰¹	1.87 × 10⁰⁰
1648	1788	0.999	2.65 × 10⁰¹	4.46 × 10⁰⁰
c. 2-IP
1079	1087	0.999	1.19 × 10⁰³	1.05 × 10⁰⁰
831	1225	0.999	–6.16 × 10⁰⁰	4.40 × 10⁰¹
1081	1085	0.999	7.96 × 10⁰³	4.85 × 10^–01
1497	1517	0.999	4.24 × 10⁰²	5.12 × 10⁰⁰


In Figure 4A–C, the best eight baselines were plotted onto the calibration sets for propofol, HDIPBA, and 2-IP, respectively. Each of the baselines in Table 2 (and four more for each analyte) is overlaid to the spectra of Figure 4A–C in a unique color. Several of the baselines selected by the IRCB are directly indicative of functional selectivity. In 2-IP, aromatic C–H bending near 1500 and 1080 cm^–1 was captured. The carboxylic acid functional group near 1725 cm^–1 for HDIPBA was similarly selected. Additionally, several nonintuitive baselines are shown to be quantitative concentration indicators.

Plot of top baselines in case study 1 for (A) propofol, (B) HDIPBA, and (C) 2-IP.

Although propofol lacks functional group selectivity, the resulting X_features elements from several baselines were still highly correlated with Y. The selection of the best baselines by the IRCB model is a comprehensive approach that does not require any structural knowledge or manual interpretation because every possible linear baseline is tested. Those selected by the protocol are unbiased by a portion of the spectra that the operator may be predisposed to believe is the most selective. The selected baselines are considered the best only because when applied to the calibration set, they result in areas that have the highest correlation to the targeted Y.

In some cases, the best baseline regions for different analytes may cross or entirely overlap. For example, the selective 2-IP baseline from 1499 to 1517 cm^–1 is entirely contained within the selective propofol baseline from 1445 to 1562 cm^–1. The baseline discovery process may be enhanced by the area summation feature that allows “inverted” portions to be considered negative. For example, the second highest rated 2-IP baseline from 831 to 1225 cm^–1 has significant spectral responses both above and below the corrective baseline. However, the sum of positive and negative areas balances to give a linear response in the overall area of the baseline to the 2-IP concentration. In some instances, the applied baseline resulted in a net negative area of the spectrum being captured. It can be seen in the best propofol baselines that some selective regions (e.g., 1443–1650 cm^–1) are entirely “inverted”. In terms of classical spectroscopic interpretation, this result was initially perplexing. However, every typically “clean” baseline for a unique selective peak is contained within X_transform, and the complex solutions appearing within X_features demonstrated higher correlation with Y than the simpler ones that may be easier to select manually.

The selection and application of a machine learning predictive model are required to convert X_features into a Y prediction for the calibration and test sets. For case study 1, ensemble linear regression (ELR) was applied to create the regression model. Using ELR, each row in X_features served as a linear regression model that was averaged into the final prediction for Y. The number of linear regression models to average for the final ELR prediction was determined by minimizing the 5-fold cross-validation error of the calibration set. The number of ensemble regressions included was 98, 29, and 8 for propofol, HDIPBA, and 2-IP, respectively. For computational efficiency, the maximum percentage of X_transform that may be included in X_features must be manually specified. However, with the ELR approach, the number of features selected can be automatically reduced below the user-specified maximum threshold. For the later RF and XGB models, the user-specified threshold percentage was used directly without the potential reduction.

For this case study, the IRCB-ELR model was applied to a selection of 30 spectra (10 samples × 3 scans) for an external test set evaluation. For the test set, X_features for each analyte was generated by using the corresponding baseline indices from the calibration set. The statistical results for each of the three analytes are outlined in Table 3a. The model provided an excellent fit for each of the three target analytes as indicated by a test set R² of >0.99 and a test RMSE of <0.50 mg mL^–1. The results indicate that the baselines selected by assigning an R² value to the calibration set samples were effective for predicting the concentration of the test set. The results show that the IRCB model is effective because the baselines selected by the model are directly useful to generate a quantitative test set prediction. The modeling results are plotted in Figure 5A–C for propofol, HDIPBA, and 2-IP, respectively.
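For reference, the two statistics quoted for the test set can be computed as in the sketch below. It assumes the conventional definitions (root-mean-square error, and R² as one minus the ratio of residual to total sum of squares); the source does not state which exact formulas were used, and the concentrations shown are hypothetical values, not entries from Table S1.

```python
def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted values."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical reference and predicted concentrations in mg/mL
reference = [5.0, 15.0, 25.0, 35.0, 45.0]
predicted = [5.5, 14.5, 25.5, 34.5, 45.5]
print(rmse(reference, predicted), r2_score(reference, predicted))
```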

Table 3. Statistical Results for (a) Case Study 1 and (b) Case Study 2^a.

a. Statistical results for case study 1. Propofol system using ensemble linear regression (ELR) for FTIR
	RMSE				R²
component	units	calibration	C.V.	test	calibration	C.V.	test
propofol	mg mL^–1	0.189	0.199	0.452	1.000	1.000	0.999
HDIPBA	mg mL^–1	0.210	0.248	0.457	1.000	0.999	0.998
2-IP	mg mL^–1	0.207	0.238	0.323	1.000	1.000	0.999

b. Statistical results for case study 2. Crude soil samples using random forest (RF) with NIR
	RMSE				R²
component	units	calibration	C.V.	test	calibration	C.V.	test
nitrogen	g kg^–1	0.185	0.582	0.657	0.977	0.756	0.766
carbon	percent	0.324	0.903	0.527	0.970	0.762	0.880
CEC	meq 100 g^–1	1.597	6.643	3.739	0.944	0.000	0.715


CEC = Cation exchange capacity, HDIPBA = 4-hydroxy-3,5-diisopropylbenzoic acid, 2-IP = 2-isopropyl phenol, C.V. = Cross-Validation.

Case study 1 regression models plotted for (A) propofol, (B) HDIPBA, and (C) 2-IP.

Case Study 2: “Chimiométrie 2006” Soil Quantification with NIR Spectroscopy

Crude samples can introduce significant spectral complexity as compared to prepared solutions with limited interfering analytes. As such, for case study 2, IRCB was next tested using the “Chimiométrie 2006 Conference” soil quantification challenge.⁴⁴ This data set contains NIR measurement of 618 soil samples and offline measurements for total nitrogen (g kg^–1 dry soil), carbon percentage in dry soil (carbon, %), and cation exchange capacity (CEC, meq 100 g^–1 of dry soil).⁴⁴ The external test set was utilized as outlined by the conference guidelines.⁴⁴

Although this data set was collected over a multiday period with likely instrument and environment variation, no preprocessing was performed prior to the IRCB deployment. The baseline correction was applied to generate X_transform from the raw data. Next, iterative regression was performed on X_transform to generate the three X_features matrices for the three Y_calibration matrices (nitrogen, carbon, and CEC). For this case study, each X_features matrix retained the top 2% of X_transform that was most selective for the respective Y_calibration. As previously mentioned, the percentage of X_transform that is included in X_features must be manually specified for the IRCB-RF model. The effect of this percentage for case study 2 is shown in Figure S2. Generally, the performance began to plateau at around 2% inclusion.

The best baseline for each component is plotted in Figure 6A–C for nitrogen, carbon, and CEC, respectively. The full X_features for each component is available in the Supporting Information. The top baselines for the three targeted Y_calibration matrices in case study 2 were significantly less predictive than those reported for the simpler system in case study 1. The top R² values for the best individual baselines to Y_calibration were 0.701, 0.667, and 0.509 for nitrogen, carbon, and CEC, respectively. These R² values represent the coefficient of determination for the top row of each X_features matrix with respect to its respective Y_calibration. One clear advantage of the outlined approach is the ability to determine which portion of the spectra is most correlated to the Y of interest. For example, cation exchange capacity (CEC) does not directly correspond to a known functional group, but using the IRCB model, it can be observed that 1698–1720 cm^–1 is a region of key interest (Figure 6C). The best nitrogen baseline (Figure 6A) was complex, as it contained both positive and negative regions that summed to give a linear (R² = 0.701) response.

Plot of top baselines in case study 2 for (A) nitrogen, (B) carbon, and (C) CEC.

Given that the best baselines were not immediately highly effective regression models for the target analytes, it was hypothesized that a nonlinear machine learning model would be useful to generate more robust predictions. As such, random forest (RF) machine learning was selected to generate regression predictions from the X_features matrices. The X_features matrices for the calibration set were used to train RF models from scikit-learn.⁴¹ The statistical results for the case study 2 IRCB-RF regression models are outlined in Table 3b. The model predictions and true values are plotted in Figure 7 for both the calibration and test sets with the default RF hyperparameters, as shown in Table S2. The IRCB-RF NIR soil composition test set prediction statistical parameters ranked highly among the six previously reported models, and a statistical comparison of the benchmarked test set is shown in Table S3.⁴⁴ Although the test set fitting for the “CEC” model was considered satisfactory compared to other analyses of this data set, the cross-validation error was quite high. It is hypothesized that this is due to outliers within Y_calibration for this data set. The cross-validation statistics for other models were not previously reported.⁴⁴

Case study 2 regression models plotted for (A) nitrogen, (B) carbon, and (C) CEC. Random forest (RF) model with default hyperparameters as described in Table S2.

Although the test set predictions were extremely competitive with other approaches (Table S3), overfitting of the RF model remained problematic, as indicated by the difference in RMSE values between the calibration and test sets (Table 3b). In the next model development iteration, the RF hyperparameters were tuned to address overfitting by minimizing the RMSE of 5-fold cross-validation of the calibration set. An exhaustive grid search approach was utilized for the RF hyperparameter design space shown in Table S2. The design space of the grid search was made to be more conservative than the default hyperparameters by implementing limitations, such as increasing the minimum samples per leaf and per split. The resulting models and the selected best hyperparameters are shown in Table S4. This approach did reduce the absolute difference in the RMSE between the test and calibration sets as compared with default RF hyperparameter values but typically did not benefit the statistical metrics of the test set. As an exception, the CEC test set R² was improved from 0.715 (Table 3b) to 0.746 (Table S4).

In summary, the original X is transformed using IRCB into a novel matrix form that is a more linearized depiction of the original data. After the best X_features values are identified using IRCB, the RF model is then able to weight linear predictors and secondary interactions within X_features. The overall regression prediction of the RF model surpasses the predictive capacity of any one baseline region. For example, in the carbon model, the best-fitted baseline to Y_calibration had R² = 0.667, but by using the best 2% of baselines (X_features) with RF, the overall prediction for an unknown test set reached R² = 0.880 (Table 3b). This indicates that significant predictive power is being generated within X_transform and effectively extracted into X_features using the outlined iterative regression sorting procedure. Overall, the baselines that are highly correlated to the calibration set are predictive of test set concentrations. It is shown within the results of case study 2 that IRCB can find selective responses even for complex mixtures with overlapping signals. This is further demonstrated by the containment of the best carbon and CEC baselines within the best nitrogen baseline (Figure 6).

Case Study 3: Raman Spectroscopy for Dense Slurry Solid Quantification

In case study 3, the IRCB approach was applied to another previously published data set, which utilized Raman spectroscopy to measure a dense multicomponent slurry designed to simulate nuclear waste.¹⁴ The objective of the model reported is to predict the solid concentration for each of the five analytes (kyanite, wollastonite, olivine, silica, and zircon) using 66 spectra of the slurry. This data set provides additional complications compared to the previous two as the Raman probe is exposed to both solids and liquids. Moreover, the composition of the solids exposed to the probe may fluctuate as the slurry is stirred. In the original work,¹⁴ five solid analytes were modeled using partial least-squares regression (PLS-R) and assessed using leave-one-out cross-validation (LOOCV). For our approach, a 10-fold cross-validation was applied to develop and assess the model of each analyte.

First, the generic X_transform was generated and then X_features were generated for each analyte within each cross-validation fold. Once again, X_features consisted of the top 2% of X_transform for each analyte within each fold. For fold-1, the top baseline for each component is plotted in Figure 8. Although the results are shown from the first fold, the order and statistics of the top baselines varied only slightly across the k-folds. It can be observed in Figure 8 that the most selective baseline for each analyte is far more complex than a single analyte peak. For almost all of the samples, there are portions of the spectral response both above and below the baselines that sum to increase the linearity of the response. Given this complexity, the manual identification of selective regions is extremely challenging or entirely impossible. This evidence helps to support our hypothesis that IRCB is beneficial for creating a standard automatable approach.

Plot of top baselines in case study 3 for (A) kyanite, (B) wollastonite, (C) olivine, (D) silica, and (E) zircon.

A machine learning package, XGBoost (XGB), which stands for extreme gradient boosting, was used to develop the final regression model for each fold separately using the X_features matrices. XGB is a powerful machine learning model that uses gradient-boosted decision trees to solve many types of supervised regression and classification problems.⁴² Although RF may also be suitable for this case study, XGB was chosen to further demonstrate the versatility of IRCB with several machine learning packages. For each IRCB-XGB model (specific to the analyte and fold), XGB hyperparameters were tuned using a randomized grid search approach to minimize the cross-validation prediction error. The hyperparameter grid is shown in Table S2. The statistical metrics of the test set prediction are shown and compared with the previously reported PLS-R results in Table 4. The PLS-R model from the previous work utilized 10 principal components and a Savitzky–Golay filter.¹⁴

Table 4. Comparison of IRCB-XGB and SVG-PLS-R.

iterative regression of corrective baselines (IRCB) and extreme gradient boosting (XGB)
test set metrics	kyanite	wollastonite	olivine	silica	zircon
coefficient of determination (R²)	0.901	0.849	0.614	0.855	0.916
mean absolute error (g kg-solvent^–1)	6.03	9.2	5.8	17.66	2.12
root-mean-squared error (g kg-solvent^–1)	7.69	11.53	7.7	21.59	3.12
mean percent error	17.9	29.6	39.3	21.7	15.8

partial least-squares regression (PLS-R)
test set metrics	kyanite	wollastonite	olivine	silica	zircon
coefficient of determination (R²)	0.932	0.912	0.527	0.885	0.837
mean absolute error (g kg-solvent^–1)	4.57	5.66	5.2	13.53	2.52
root-mean-squared error (g kg-solvent^–1)	6.04	7.88	6.95	17.29	3.68
mean percent error	16.5	16.7	39.4	18.2	21.4
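For readers who wish to reproduce the four metrics reported in Table 4, a minimal sketch is given here. The definition of mean percent error as the mean absolute error relative to the reference value, and the exclusion of zero-valued references from that average, are assumptions, since the exact formula is not stated in this section.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R^2, MAE, RMSE, and mean percent error for one analyte."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    # Assumption: percent error is |residual| / |reference|, averaged
    # over samples with a nonzero reference value.
    nonzero = y_true != 0
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(resid)),
        "rmse": np.sqrt(np.mean(resid ** 2)),
        "mpe": 100.0 * np.mean(np.abs(resid[nonzero]) / np.abs(y_true[nonzero])),
    }
```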


As shown in Table 4, IRCB-XGB was generally comparable with PLS-R for the reported statistical metrics. The five IRCB-XGB models were compared with the previously reported PLS-R models using an elliptical joint confidence region (EJCR) test,^45,46 and the results are shown in Figure S4. The IRCB-XGB model outperformed the PLS-R model for zircon and underperformed it for wollastonite. The statistics and EJCR test for kyanite, olivine, and silica indicate slightly better performance for the PLS-R model. Overall, the statistical results for the case study 3 model indicate that using IRCB-XGB is effective for the development of regression models from complex data sets. It is plausible that the combination of IRCB and nonlinear machine learning models may require more samples for a robust calibration as compared to PLS-R. This is consistent with the clear outperformance of IRCB-RF in case study 2 (Table S3), with several hundred calibration spectra, but slightly worse overall performance in case study 3, with only around 60 training spectra for each k-fold. Alternatively, the complexity of the data or systematic error (Figure S4) may impact the comparative performance.
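For reference, the EJCR test is commonly written in the following standard form. The notation is introduced here for illustration and is not taken from this work. Predicted values are regressed on the reference values x_i by ordinary least squares, giving an intercept b_0, a slope b_1, and a residual variance s² with n − 2 degrees of freedom. The joint confidence region for the true intercept β_0 and slope β_1 is then bounded by the ellipse:

```latex
n(\beta_0 - b_0)^2
  + 2\Bigl(\sum_{i=1}^{n} x_i\Bigr)(\beta_0 - b_0)(\beta_1 - b_1)
  + \Bigl(\sum_{i=1}^{n} x_i^2\Bigr)(\beta_1 - b_1)^2
  = 2\, s^2\, F_{\alpha;\,2,\,n-2}
```

A model is regarded as free of significant bias at the chosen confidence level when the ideal point (β_0, β_1) = (0, 1) falls inside the ellipse. A smaller ellipse indicates higher precision.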

IRCB Compared to Other Preprocessing Methods

The concept of the linear corrective baseline was reported in our previous work for the purpose of finding an optimal regression from two overlapping peaks to develop a Raman spectroscopy model.⁸ The primary limitation of the previous approach was that only a small portion of the spectra and the single best baseline were used for a linear regression model. The previous approach did not facilitate its application to complex systems, where numerous baselines coupled with machine learning models are required to make an effective prediction on the test set. For most data sets, it is difficult to identify the region that contains the best baseline, and a single baseline is insufficient to develop a robust prediction.
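A single linear corrective baseline can be sketched as follows. The sketch assumes that the feature value is the signed sum of the spectrum minus a straight line drawn between the two endpoints of a spectral window. The function name, the index-based window, and the use of a plain sum rather than a wavenumber-weighted integral are illustrative assumptions and not necessarily the authors' exact formulation.

```python
import numpy as np

def baseline_area(spectrum, start, end):
    """Signed area between a spectrum and a linear baseline.

    The baseline joins spectrum[start] and spectrum[end] (inclusive
    indices). Portions above the line add to the area and portions
    below it subtract from it.
    """
    segment = np.asarray(spectrum[start:end + 1], dtype=float)
    line = np.linspace(segment[0], segment[-1], segment.size)
    return float(np.sum(segment - line))

# A triangular peak on a sloped background: the linear baseline removes
# the background exactly, leaving only the peak area.
peak = np.array([0, 0, 0, 1, 2, 3, 2, 1, 0, 0, 0], dtype=float)
spectrum = 0.5 * np.arange(11) + peak
area = baseline_area(spectrum, 2, 8)
```

Evaluating such an area for many (start, end) pairs across every spectrum would produce a matrix of candidate features of the kind IRCB ranks against Y.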

Furthermore, IRCB can be seen as an effective preprocessing tool for enhancing the RF and XGB models. IRCB was compared with other preprocessing approaches including no preprocessing (none), Savitzky–Golay (SVG), multiplicative scatter correction (MSC), and second derivative filtering for RF (case study 2) and XGB models (case study 3). For comparison, all models were run with the default hyperparameters as shown in Table S2. The IRCB-RF and IRCB-XGB models outperformed all of the RF and XGB models that were developed with other preprocessing methods. A statistical comparison between the IRCB–machine learning models and the machine learning models with other preprocessing methods is shown in the Supporting Information Tables S5 and S6 for case study 2 and case study 3, respectively. For many of the case study 3 models, XGB showed low predictive power without the prior application of IRCB (Table S6). For example, comparing IRCB-XGB with the next best XGB modeling result, it is shown that the test set R² was improved from 0.389 (SVG) to 0.909 (IRCB) for zircon and 0.232 (none) to 0.838 (IRCB) for silica (Table S6). For case study 2 (Table S5), the largest improvement was for the carbon model, where the test set R² was increased from 0.639 (RF with no processing) to 0.880 (IRCB-RF).
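As a point of comparison, one of the benchmark treatments, multiplicative scatter correction, can be written in a few lines. This is a generic textbook form of MSC that uses the mean spectrum as the reference; the settings used for the comparison in Tables S5 and S6 are not given in this section, so the details here are assumptions.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction.

    Each spectrum (row) is regressed on a reference spectrum, and the
    fitted offset and slope are removed: corrected = (s - a) / b.
    """
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        ref = spectra.mean(axis=0)
    else:
        ref = np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        # polyfit with degree 1 returns [slope, intercept].
        slope, intercept = np.polyfit(ref, s, 1)
        corrected[i] = (s - intercept) / slope
    return corrected
```

MSC removes additive and multiplicative differences between whole spectra, whereas IRCB builds many local baseline-corrected features, which may explain why the two behave differently as inputs to tree-based learners.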

Overall, the results of the three case studies indicate that IRCB is a highly automatable and effective model for producing linear predictors of Y. However, certain challenges persist in the end-to-end automation of this model development process. Although IRCB may be utilized to automatically select the spectral regions with the highest importance for Y, the developer is still required to select the machine learning model (ELR, RF, XGB), determine the threshold percentage of X_transform to be included in X_features, and in some cases, manually tune the machine learning hyperparameters to avoid overfitting.

Conclusions

We have introduced a new framework for the development of spectroscopic models that can, in some instances, outperform the existing methodologies. Generally, the matrix transformation employed within IRCB is both an effective preprocessing strategy for machine learning and a highly versatile model for generating linear features from continuous data. IRCB as a preprocessing treatment can significantly improve the application of nonlinear machine learning models RF and XGB. The efficacy of using the resulting areas from thousands of corrective baselines to improve the regression prediction with several different machine learning models further indicates the broad utility of IRCB. By applying IRCB, the optimal baseline regions can be directly mapped and identified even by a nonexpert or in instances when the physical structure of the target is unknown. For simple systems, the IRCB can capture clear molecular selectivity based on classical spectroscopic interpretation. However, the spectral regions selected by IRCB are frequently nonintuitive and difficult to manually identify for complex mixtures. The selection of certain baseline regions may provide insights into spectral interpretability that were previously difficult to identify.

The development of a feature linearization and extraction technique that does not rely upon user experience represents a key milestone toward the automation of chemometric regression analysis. The removal of variable preprocessing requirements can significantly lower the barrier of entry to model development. The proposed model may help to democratize access to the development process, as facilitated by a more structured and scientific approach. Ongoing efforts to extend IRCB to classification problems are primarily focused on new functions for selecting the most relevant features from X_transform. Furthermore, it is important to investigate strategies that optimize the threshold percentage of features that are included in the regression model, determine which machine learning predictive model is most appropriate, and automatically tune the machine learning hyperparameters to avoid overfitting.

Acknowledgments

This work was supported by the Virginia Innovation Partnership Corporation (CCF23-0230-HE). The authors acknowledge Dr. Martha Grover for useful guidance and discussions.

Data Availability Statement

Data analysis was performed in Python (v 3.9). The Python modules we developed to run these case studies, a complete list of package dependencies, and the case study raw data (.xlsx format) are available from https://github.com/mglacier/IRCB. FTIR spectra were collected in ICIR (v 7.1, Mettler Toledo).

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.4c00359.

Key Python functions; HPLC method information (Figure S1 and Table S1); case study 2 feature inclusion (Figure S2); case study 3 regression plots (Figure S3); EJCR test (Figure S4); hyperparameter information (Tables S2 and S4); case study 2 benchmark (Table S3); and preprocessing comparison (Tables S5 and S6) (PDF)
Every baseline in the X_features for each analyte for the three case studies (XLSX)

Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. M.G.: conceptualization, methodology, writing—original draft; R.M.-P.: methodology, conceptualization; D.W.C.: writing—review and editing; T.D.R.: supervision, writing—review and editing.

The authors declare the following competing financial interest(s): Virginia Commonwealth University has filed a provisional patent for some commercial uses relating to the model.

Supplementary Material

ci4c00359_si_001.pdf (741.9 KB, PDF)

ci4c00359_si_002.xlsx (5.4 MB, XLSX)

References

Workman J.; Lavine B.; Chrisman R.; Koch M. Process Analytical Chemistry. Anal. Chem. 2011, 83, 4557–4578. 10.1021/ac200974w.
Mazivila S. J.; Santos J. L. M. A Review on Multivariate Curve Resolution Applied to Spectroscopic and Chromatographic Data Acquired during the Real-Time Monitoring of Evolving Multi-Component Processes: From Process Analytical Chemistry (PAC) to Process Analytical Technology (PAT). TrAC Trends Anal. Chem. 2022, 157, 116698. 10.1016/j.trac.2022.116698.
Callis J. B.; Illman D. L.; Kowalski B. R. Process Analytical Chemistry. Anal. Chem. 1987, 59, 624A–637A. 10.1021/ac00136a723.
Pérez-Beltrán C. H.; Jiménez-Carvelo A. M.; Torrente-López A.; Navas N. A.; Cuadros-Rodríguez L. QbD/PAT—State of the Art of Multivariate Methodologies in Food and Food-Related Biotech Industries. Food Eng. Rev. 2023, 15, 24–40. 10.1007/s12393-022-09324-0.
Kharbach M.; Mansouri M. A.; Taabouz M.; Yu H. Current Application of Advancing Spectroscopy Techniques in Food Analysis: Data Handling with Chemometric Approaches. Foods 2023, 12, 2753. 10.3390/foods12142753.
Biancolillo A.; Marini F.; Ruckebusch C.; Vitale R. Chemometric Strategies for Spectroscopy-Based Food Authentication. Appl. Sci. 2020, 10, 6544. 10.3390/app10186544.
Price G. A.; Mallik D.; Organ M. G. Process Analytical Tools for Flow Analysis: A Perspective. J. Flow Chem. 2017, 7, 82–86. 10.1556/1846.2017.00032.
Glace M.; Wu W.; Kraus H.; Acevedo D.; Roper T. D.; Mohammad A. The Development of a Continuous Synthesis for Carbamazepine Using Validated In-Line Raman Spectroscopy and Kinetic Modelling for Disturbance Simulation. React. Chem. Eng. 2023, 8, 1032–1042. 10.1039/D2RE00476C.
Sacher S.; Poms J.; Rehrl J.; Khinast J. G. PAT Implementation for Advanced Process Control in Solid Dosage Manufacturing – A Practical Guide. Int. J. Pharm. 2022, 613, 121408. 10.1016/j.ijpharm.2021.121408.
Miyai Y.; Formosa A.; Armstrong C.; Marquardt B.; Rogers L.; Roper T. PAT Implementation on a Mobile Continuous Pharmaceutical Manufacturing System: Real-Time Process Monitoring with In-Line FTIR and Raman Spectroscopy. Org. Process Res. Dev. 2021, 25, 2707–2717. 10.1021/acs.oprd.1c00299.
Arora T.; Verma R.; Kumar R.; Chauhan R.; Kumar B.; Sharma V. Chemometrics Based ATR-FTIR Spectroscopy Method for Rapid and Non-Destructive Discrimination between Eyeliner and Mascara Traces. Microchem. J. 2021, 164, 106080. 10.1016/j.microc.2021.106080.
Liang Y.; Zhao L.; Guo J.; Wang H.; Liu S.; Wang L.; Chen L.; Chen M.; Zhang N.; Liu H.; Nie C. Just-in-Time Learning-Integrated Partial Least-Squares Strategy for Accurately Predicting 71 Chemical Constituents in Chinese Tobacco by Near-Infrared Spectroscopy. ACS Omega 2022, 7, 38650–38659. 10.1021/acsomega.2c04139.
Kocevska S.; Maggioni G. M.; Crouse S. H.; Prasad R.; Rousseau R. W.; Grover M. A. Effect of Ion Interactions on the Raman Spectrum of NO3–: Toward Monitoring of Low-Activity Nuclear Waste at Hanford. Chem. Eng. Res. Des. 2022, 181, 173–194. 10.1016/j.cherd.2022.03.002.
Prasad R.; Crouse S. H.; Rousseau R. W.; Grover M. A. Quantifying Dense Multicomponent Slurries with In-Line ATR-FTIR and Raman Spectroscopies: A Hanford Case Study. Ind. Eng. Chem. Res. 2023, 62, 15962–15973. 10.1021/acs.iecr.3c01249.
Continuous Manufacturing of Drug Substances and Drug Products (Q13). ICH 2021.
Health and Human Services (HHS); FDA; CDER; CVM; ORA. Guidance for Industry PAT — A Framework for Innovative Pharmaceutical Development, Manufacturing, and Quality Assurance, 2004.
Acevedo D.; Yang X.; Mohammad A.; Pavurala N.; Wu W. L.; O’Connor T. F.; Nagy Z. K.; Cruz C. N. Raman Spectroscopy for Monitoring the Continuous Crystallization of Carbamazepine. Org. Process Res. Dev. 2018, 22, 156–165. 10.1021/acs.oprd.7b00322.
Cervera-Padrell A. E.; Nielsen J. P.; Pedersen M. P.; Christensen K. M.; Mortensen A. R.; Skovby T.; Dam-Johansen K.; Kiil S.; Gernaey K. V. Monitoring and Control of a Continuous Grignard Reaction for the Synthesis of an Active Pharmaceutical Ingredient Intermediate Using Inline NIR Spectroscopy. Org. Process Res. Dev. 2012, 16, 901–914. 10.1021/op2002563.
Glace M.; Kraus H.; Wu W.; Acevedo D.; Liu D.; Roper T. D.; Mohammad A. Impurity Profiling for a Scalable Continuous Synthesis and Crystallization of Carbamazepine Drug Substance. Org. Process Res. Dev. 2024, 28, 2013–2027. 10.1021/acs.oprd.4c00081.
Singh R.; Sahay A.; Karry K. M.; Muzzio F.; Ierapetritou M.; Ramachandran R. Implementation of an Advanced Hybrid MPC-PID Control System Using PAT Tools into a Direct Compaction Continuous Pharmaceutical Tablet Manufacturing Pilot Plant. Int. J. Pharm. 2014, 473, 38–54. 10.1016/j.ijpharm.2014.06.045.
Sagmeister P.; Lebl R.; Castillo I.; Rehrl J.; Kruisz J.; Sipek M.; Horn M.; Sacher S.; Cantillo D.; Williams J. D.; Kappe C. O. Advanced Real-Time Process Analytics for Multistep Synthesis in Continuous Flow*. Angew. Chem., Int. Ed. 2021, 60, 8139–8148. 10.1002/anie.202016007.
Kim E. J.; Kim J. H.; Kim M. S.; Jeong S. H.; Choi D. H. Process Analytical Technology Tools for Monitoring Pharmaceutical Unit Operations: A Control Strategy for Continuous Process Verification. Pharmaceutics 2021, 13, 919. 10.3390/pharmaceutics13060919.
Eisen K.; Eifert T.; Herwig C.; Maiwald M. Current and Future Requirements to Industrial Analytical Infrastructure—Part 1: Process Analytical Laboratories. Anal. Bioanal. Chem. 2020, 412, 2027–2035. 10.1007/s00216-020-02420-2.
Eifert T.; Eisen K.; Maiwald M.; Herwig C. Current and Future Requirements to Industrial Analytical Infrastructure—Part 2: Smart Sensors. Anal. Bioanal. Chem. 2020, 412, 2037–2045. 10.1007/s00216-020-02421-1.
Andersson M. A Comparison of Nine PLS1 Algorithms. J. Chemom. 2009, 23, 518–529. 10.1002/cem.1248.
Ma X.; Sun X.; Wang H.; Wang Y.; Chen D.; Li Q. Raman Spectroscopy for Pharmaceutical Quantitative Analysis by Low-Rank Estimation. Front. Chem. 2018, 6, 400. 10.3389/fchem.2018.00400.
Gemperline P. J.; Long J. R.; Gregoriou V. G. Nonlinear Multivariate Calibration Using Principal Components Regression and Artificial Neural Networks. Anal. Chem. 1991, 63, 1149–1153. 10.1021/ac00020a022.
Bocklitz T.; Walter A.; Hartmann K.; Rösch P.; Popp J. How to Pre-Process Raman Spectra for Reliable and Stable Models? Anal. Chim. Acta 2011, 704, 47–56. 10.1016/j.aca.2011.06.043.
Smith J. P.; Holahan E. C.; Smith F. C.; Marrero V.; Booksh K. S. A Novel Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) Methodology for Application in Hyperspectral Raman Imaging Analysis. Analyst 2019, 144, 5425–5438. 10.1039/C9AN00787C.
Jirasek A.; Schulze G.; Yu M. M. L.; Blades M. W.; Turner R. F. B. Accuracy and Precision of Manual Baseline Determination. Appl. Spectrosc. 2004, 58, 1488–1499. 10.1366/0003702042641236.
Chi M.; Han X.; Xu Y.; Wang Y.; Shu F.; Zhou W.; Wu Y. An Improved Background-Correction Algorithm for Raman Spectroscopy Based on the Wavelet Transform. Appl. Spectrosc. 2019, 73, 78–87. 10.1177/0003702818805116.
Yu H. G.; Park D. J.; Chang D. E.; Nam H. An Effective Baseline Correction Algorithm Using Broad Gaussian Vectors for Chemical Agent Detection with Known Raman Signature Spectra. Sensors 2021, 21, 8260. 10.3390/s21248260.
Guo S.; Bocklitz T.; Popp J. Optimization of Raman-Spectrum Baseline Correction in Biological Application. Analyst 2016, 141, 2396–2404. 10.1039/C6AN00041J.
Hu H.; Bai J.; Xia G.; Zhang W.; Ma Y. Improved Baseline Correction Method Based on Polynomial Fitting for Raman Spectroscopy. Photonic Sens. 2018, 8, 332–340. 10.1007/s13320-018-0512-y.
Cai Y.; Yang C.; Xu D.; Gui W. Baseline Correction for Raman Spectra Using Penalized Spline Smoothing Based on Vector Transformation. Anal. Methods 2018, 10, 3525–3533. 10.1039/C8AY00914G.
Liu X.; Zhang Z.; Sousa P. F. M.; Chen C.; Ouyang M.; Wei Y.; Liang Y.; Chen Y.; Zhang C. Selective Iteratively Reweighted Quantile Regression for Baseline Correction. Anal. Bioanal. Chem. 2014, 406, 1985–1998. 10.1007/s00216-013-7610-x.
Chen L.; Wu Y.; Li T.; Chen Z. Collaborative Penalized Least Squares for Background Correction of Multiple Raman Spectra. J. Anal. Methods Chem. 2018, 2018, 9031356. 10.1155/2018/9031356.
Nagy B.; Galata D. L.; Farkas A.; Nagy Z. K. Application of Artificial Neural Networks in the Process Analytical Technology of Pharmaceutical Manufacturing—a Review. AAPS J. 2022, 24, 74. 10.1208/s12248-022-00706-0.
Barton B.; Thomson J.; Diz E. L.; Portela R. Chemometrics for Raman Spectroscopy Harmonization. Appl. Spectrosc. 2022, 76, 1021–1041. 10.1177/00037028221094070.
Guo S.; Popp J.; Bocklitz T. Chemometric Analysis in Raman Spectroscopy from Experimental Design to Machine Learning–Based Modeling. Nat. Protoc. 2021, 16, 5426–5459. 10.1038/s41596-021-00620-3.
Buitinck L.; Louppe G.; Blondel M.; Pedregosa F.; Mueller A.; Grisel O.; Niculae V.; Prettenhofer P.; Gramfort A.; Grobler J.; Layton R.; Vanderplas J.; Joly A.; Holt B.; Varoquaux G. API Design for Machine Learning Software: Experiences from the Scikit-Learn Project, 2013. arXiv:1309.0238. arXiv.org e-Print archive. https://arxiv.org/abs/1309.0238.
Chen T.; Guestrin C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery, 2016; pp 785–794.
Vinet L.; Di Marco L.; Kairouz V.; Charette A. B. Process Intensive Synthesis of Propofol Enabled by Continuous Flow Chemistry. Org. Process Res. Dev. 2022, 26, 2330–2336. 10.1021/acs.oprd.1c00416.
Pierna J. A. F.; Dardenne P. Soil Parameter Quantification by NIRS as a Chemometric Challenge at “Chimiométrie 2006”. Chemom. Intell. Lab. Syst. 2008, 91, 94–98. 10.1016/j.chemolab.2007.06.007.
Olivieri A. C. Practical Guidelines for Reporting Results in Single- and Multi-Component Analytical Calibration: A Tutorial. Anal. Chim. Acta 2015, 868, 10–22. 10.1016/j.aca.2015.01.017.
Mazivila S. J.; Lombardi J. M.; Páscoa R. N. M. J.; Bortolato S. A.; Leitão J. M. M.; da Silva J. C. G. E. Three-Way Calibration Using PARAFAC and MCR-ALS with Previous Synchronization of Second-Order Chromatographic Data Through a New Functional Alignment of Pure Vectors for the Quantification in the Presence of Retention Time Shifts in Peak Position and Shape. Anal. Chim. Acta 2021, 1146, 98–108. 10.1016/j.aca.2020.12.033.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

ci4c00359_si_001.pdf^{(741.9KB, pdf)}

ci4c00359_si_002.xlsx^{(5.4MB, xlsx)}

Data Availability Statement

[ref1] Workman J.; Lavine B.; Chrisman R.; Koch M. Process Analytical Chemistry. Anal. Chem. 2011, 83, 4557–4578. 10.1021/ac200974w. [DOI] [PubMed] [Google Scholar]

[ref2] Mazivila S. J.; Santos J. L. M. A Review on Multivariate Curve Resolution Applied to Spectroscopic and Chromatographic Data Acquired during the Real-Time Monitoring of Evolving Multi-Component Processes: From Process Analytical Chemistry (PAC) to Process Analytical Technology (PAT). TrAC Trends Anal. Chem. 2022, 157, 116698 10.1016/j.trac.2022.116698. [DOI] [Google Scholar]

[ref3] Callis J. B.; Illman D. L.; Kowalski B. R. Process Analytical Chemistry. Anal. Chem. 1987, 59, 624A–637A. 10.1021/ac00136a723. [DOI] [Google Scholar]

[ref4] Pérez-Beltrán C. H.; Jiménez-Carvelo A. M.; Torrente-López A.; Navas N. A.; Cuadros-Rodríguez L. QbD/PAT—State of the Art of Multivariate Methodologies in Food and Food-Related Biotech Industries. Food Eng. Rev. 2023, 15, 24–40. 10.1007/s12393-022-09324-0. [DOI] [Google Scholar]

[ref5] Kharbach M.; Mansouri M. A.; Taabouz M.; Yu H. Current Application of Advancing Spectroscopy Techniques in Food Analysis: Data Handling with Chemometric Approaches. Foods 2023, 12, 2753 10.3390/foods12142753. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref6] Biancolillo A.; Marini F.; Ruckebusch C.; Vitale R. Chemometric Strategies for Spectroscopy-Based Food Authentication. Appl. Sci. 2020, 10, 6544 10.3390/app10186544. [DOI] [Google Scholar]

[ref7] Price G. A.; Mallik D.; Organ M. G. Process Analytical Tools for Flow Analysis: A Perspective. J. Flow Chem. 2017, 7, 82–86. 10.1556/1846.2017.00032. [DOI] [Google Scholar]

[ref8] Glace M.; Wu W.; Kraus H.; Acevedo D.; Roper T. D.; Mohammad A. The Development of a Continuous Synthesis for Carbamazepine Using Validated In-Line Raman Spectroscopy and Kinetic Modelling for Disturbance Simulation. React. Chem. Eng. 2023, 8, 1032–1042. 10.1039/D2RE00476C. [DOI] [Google Scholar]

[ref9] Sacher S.; Poms J.; Rehrl J.; Khinast J. G. PAT Implementation for Advanced Process Control in Solid Dosage Manufacturing – A Practical Guide. Int. J. Pharm. 2022, 613, 121408 10.1016/j.ijpharm.2021.121408. [DOI] [PubMed] [Google Scholar]

[ref10] Miyai Y.; Formosa A.; Armstrong C.; Marquardt B.; Rogers L.; Roper T. PAT Implementation on a Mobile Continuous Pharmaceutical Manufacturing System: Real-Time Process Monitoring with In-Line FTIR and Raman Spectroscopy. Org. Process Res. Dev. 2021, 25, 2707–2717. 10.1021/acs.oprd.1c00299. [DOI] [Google Scholar]

[ref11] Arora T.; Verma R.; Kumar R.; Chauhan R.; Kumar B.; Sharma V. Chemometrics Based ATR-FTIR Spectroscopy Method for Rapid and Non-Destructive Discrimination between Eyeliner and Mascara Traces. Microchem. J. 2021, 164, 106080 10.1016/j.microc.2021.106080. [DOI] [Google Scholar]

[ref12] Liang Y.; Zhao L.; Guo J.; Wang H.; Liu S.; Wang L.; Chen L.; Chen M.; Zhang N.; Liu H.; Nie C. Just-in-Time Learning-Integrated Partial Least-Squares Strategy for Accurately Predicting 71 Chemical Constituents in Chinese Tobacco by Near-Infrared Spectroscopy. ACS Omega 2022, 7, 38650–38659. 10.1021/acsomega.2c04139. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref13] Kocevska S.; Maggioni G. M.; Crouse S. H.; Prasad R.; Rousseau R. W.; Grover M. A. Effect of Ion Interactions on the Raman Spectrum of NO3–: Toward Monitoring of Low-Activity Nuclear Waste at Hanford. Chem. Eng. Res. Des. 2022, 181, 173–194. 10.1016/j.cherd.2022.03.002. [DOI] [Google Scholar]

[ref14] Prasad R.; Crouse S. H.; Rousseau R. W.; Grover M. A. Quantifying Dense Multicomponent Slurries with In-Line ATR-FTIR and Raman Spectroscopies: A Hanford Case Study. Ind. Eng. Chem. Res. 2023, 62, 15962–15973. 10.1021/acs.iecr.3c01249. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref15] Continuous Manufacturing of Drug Substances and Drug Products (Q13). ICH 2021..

[ref16] Health and Human Services (HHS); FDA; CDER; CVM; ORA . Guidance for Industry PAT — A Framework for Innovative Pharmaceutical Development, Manufacturing, and Quality Assurance, 2004.

[ref17] Acevedo D.; Yang X.; Mohammad A.; Pavurala N.; Wu W. L.; O’Connor T. F.; Nagy Z. K.; Cruz C. N. Raman Spectroscopy for Monitoring the Continuous Crystallization of Carbamazepine. Org. Process Res. Dev. 2018, 22, 156–165. 10.1021/acs.oprd.7b00322. [DOI] [Google Scholar]

[ref18] Cervera-Padrell A. E.; Nielsen J. P.; Pedersen M. P.; Christensen K. M.; Mortensen A. R.; Skovby T.; Dam-Johansen K.; Kiil S.; Gernaey K. V. Monitoring and Control of a Continuous Grignard Reaction for the Synthesis of an Active Pharmaceutical Ingredient Intermediate Using Inline NIR Spectroscopy. Org. Process Res. Dev. 2012, 16, 901–914. 10.1021/op2002563. [DOI] [Google Scholar]

[ref19] Glace M.; Kraus H.; Wu W.; Acevedo D.; Liu D.; Roper T. D.; Mohammad A. Impurity Profiling for a Scalable Continuous Synthesis and Crystallization of Carbamazepine Drug Substance. Org. Process Res. Dev. 2024, 28, 2013–2027. 10.1021/acs.oprd.4c00081. [DOI] [Google Scholar]

[ref20] Singh R.; Sahay A.; Karry K. M.; Muzzio F.; Ierapetritou M.; Ramachandran R. Implementation of an Advanced Hybrid MPC-PID Control System Using PAT Tools into a Direct Compaction Continuous Pharmaceutical Tablet Manufacturing Pilot Plant. Int. J. Pharm. 2014, 473, 38–54. 10.1016/j.ijpharm.2014.06.045. [DOI] [PubMed] [Google Scholar]

[ref21] Sagmeister P.; Lebl R.; Castillo I.; Rehrl J.; Kruisz J.; Sipek M.; Horn M.; Sacher S.; Cantillo D.; Williams J. D.; Kappe C. O. Advanced Real-Time Process Analytics for Multistep Synthesis in Continuous Flow*. Angew. Chem., Int. Ed. 2021, 60, 8139–8148. 10.1002/anie.202016007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref22] Kim E. J.; Kim J. H.; Kim M. S.; Jeong S. H.; Choi D. H. Process Analytical Technology Tools for Monitoring Pharmaceutical Unit Operations: A Control Strategy for Continuous Process Verification. Pharmaceutics 2021, 13, 919 10.3390/pharmaceutics13060919. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref23] Eisen K.; Eifert T.; Herwig C.; Maiwald M. Current and Future Requirements to Industrial Analytical Infrastructure—Part 1: Process Analytical Laboratories. Anal. Bioanal. Chem. 2020, 412, 2027–2035. 10.1007/s00216-020-02420-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref24] Eifert T.; Eisen K.; Maiwald M.; Herwig C. Current and Future Requirements to Industrial Analytical Infrastructure—Part 2: Smart Sensors. Anal. Bioanal. Chem. 2020, 412, 2037–2045. 10.1007/s00216-020-02421-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref25] Andersson M. A Comparison of Nine PLS1 Algorithms. J. Chemom. 2009, 23, 518–529. 10.1002/cem.1248. [DOI] [Google Scholar]

[ref26] Ma X.; Sun X.; Wang H.; Wang Y.; Chen D.; Li Q. Raman Spectroscopy for Pharmaceutical Quantitative Analysis by Low-Rank Estimation. Front. Chem. 2018, 6, 400 10.3389/fchem.2018.00400. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref27] Gemperline P. J.; Long J. R.; Gregoriou V. G. Nonlinear Multivariate Calibration Using Principal Components Regression and Artificial Neural Networks. Anal. Chem. 1991, 63, 1149–1153. 10.1021/ac00020a022. [DOI] [Google Scholar]

[ref28] Bocklitz T.; Walter A.; Hartmann K.; Rösch P.; Popp J. How to Pre-Process Raman Spectra for Reliable and Stable Models?. Anal. Chim. Acta 2011, 704, 47–56. 10.1016/j.aca.2011.06.043. [DOI] [PubMed] [Google Scholar]

[ref29] Smith J. P.; Holahan E. C.; Smith F. C.; Marrero V.; Booksh K. S. A Novel Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) Methodology for Application in Hyperspectral Raman Imaging Analysis. Analyst 2019, 144, 5425–5438. 10.1039/C9AN00787C. [DOI] [PubMed] [Google Scholar]

[ref30] Jirasek A.; Schulze G.; Yu M. M. L.; Blades M. W.; Turner R. F. B. Accuracy and Precision of Manual Baseline Determination. Appl. Spectrosc. 2004, 58, 1488–1499. 10.1366/0003702042641236. [DOI] [PubMed] [Google Scholar]

[ref31] Chi M.; Han X.; Xu Y.; Wang Y.; Shu F.; Zhou W.; Wu Y. An Improved Background-Correction Algorithm for Raman Spectroscopy Based on the Wavelet Transform. Appl. Spectrosc. 2019, 73, 78–87. 10.1177/0003702818805116. [DOI] [PubMed] [Google Scholar]

[ref32] Yu H. G.; Park D. J.; Chang D. E.; Nam H. An Effective Baseline Correction Algorithm Using Broad Gaussian Vectors for Chemical Agent Detection with Known Raman Signature Spectra. Sensors 2021, 21, 8260 10.3390/s21248260. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref33] Guo S.; Bocklitz T.; Popp J. Optimization of Raman-Spectrum Baseline Correction in Biological Application. Analyst 2016, 141, 2396–2404. 10.1039/C6AN00041J. [DOI] [PubMed] [Google Scholar]

[ref34] Hu H.; Bai J.; Xia G.; Zhang W.; Ma Y. Improved Baseline Correction Method Based on Polynomial Fitting for Raman Spectroscopy. Photonic Sens. 2018, 8, 332–340. 10.1007/s13320-018-0512-y. [DOI] [Google Scholar]

[ref35] Cai Y.; Yang C.; Xu D.; Gui W. Baseline Correction for Raman Spectra Using Penalized Spline Smoothing Based on Vector Transformation. Anal. Methods 2018, 10, 3525–3533. 10.1039/C8AY00914G. [DOI] [Google Scholar]

[ref36] Liu X.; Zhang Z.; Sousa P. F. M.; Chen C.; Ouyang M.; Wei Y.; Liang Y.; Chen Y.; Zhang C. Selective Iteratively Reweighted Quantile Regression for Baseline Correction. Anal. Bioanal. Chem. 2014, 406, 1985–1998. 10.1007/s00216-013-7610-x. [DOI] [PubMed] [Google Scholar]

[ref37] Chen L.; Wu Y.; Li T.; Chen Z. Collaborative Penalized Least Squares for Background Correction of Multiple Raman Spectra. J. Anal. Methods Chem. 2018, 2018, 9031356. 10.1155/2018/9031356.

[ref38] Nagy B.; Galata D. L.; Farkas A.; Nagy Z. K. Application of Artificial Neural Networks in the Process Analytical Technology of Pharmaceutical Manufacturing: A Review. AAPS J. 2022, 24, 74. 10.1208/s12248-022-00706-0.

[ref39] Barton B.; Thomson J.; Diz E. L.; Portela R. Chemometrics for Raman Spectroscopy Harmonization. Appl. Spectrosc. 2022, 76, 1021–1041. 10.1177/00037028221094070.

[ref40] Guo S.; Popp J.; Bocklitz T. Chemometric Analysis in Raman Spectroscopy from Experimental Design to Machine Learning–Based Modeling. Nat. Protoc. 2021, 16, 5426–5459. 10.1038/s41596-021-00620-3.

[ref41] Buitinck L.; Louppe G.; Blondel M.; Pedregosa F.; Mueller A.; Grisel O.; Niculae V.; Prettenhofer P.; Gramfort A.; Grobler J.; Layton R.; Vanderplas J.; Joly A.; Holt B.; Varoquaux G. API Design for Machine Learning Software: Experiences from the Scikit-Learn Project, 2013. arXiv:1309.0238. arXiv.org e-Print archive. https://arxiv.org/abs/1309.0238.

[ref42] Chen T.; Guestrin C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery, 2016; pp 785–794.

[ref43] Vinet L.; Di Marco L.; Kairouz V.; Charette A. B. Process Intensive Synthesis of Propofol Enabled by Continuous Flow Chemistry. Org. Process Res. Dev. 2022, 26, 2330–2336. 10.1021/acs.oprd.1c00416.

[ref44] Pierna J. A. F.; Dardenne P. Soil Parameter Quantification by NIRS as a Chemometric Challenge at “Chimiométrie 2006”. Chemom. Intell. Lab. Syst. 2008, 91, 94–98. 10.1016/j.chemolab.2007.06.007.

[ref45] Olivieri A. C. Practical Guidelines for Reporting Results in Single- and Multi-Component Analytical Calibration: A Tutorial. Anal. Chim. Acta 2015, 868, 10–22. 10.1016/j.aca.2015.01.017.

[ref46] Mazivila S. J.; Lombardi J. M.; Páscoa R. N. M. J.; Bortolato S. A.; Leitão J. M. M.; da Silva J. C. G. E. Three-Way Calibration Using PARAFAC and MCR-ALS with Previous Synchronization of Second-Order Chromatographic Data Through a New Functional Alignment of Pure Vectors for the Quantification in the Presence of Retention Time Shifts in Peak Position and Shape. Anal. Chim. Acta 2021, 1146, 98–108. 10.1016/j.aca.2020.12.033.


Table 3. Statistical Results for (a) Case Study 1 and (b) Case Study 2^a.