Abstract
Rapid, non-invasive monitoring technologies represent an advancement beyond labor-intensive assays for assessing biomass and bioactives in microalgal production. This study established an integrated Raman spectroscopy and machine learning framework for quantification of Phaeodactylum tricornutum biomass and fucoxanthin yield. The workflow combined spectral processing with multiple feature selection strategies and machine learning algorithms. Deep networks demonstrated superior biomass prediction accuracy on full-spectrum inputs (coefficient of determination R2 = 0.968; root mean squared error RMSE = 0.045 g L−1; relative prediction deviation RPD = 5.598), while kernel methods achieved robust fucoxanthin quantification through sparse, regularized features (R2 = 0.949, RMSE = 1.112 mg L−1, RPD = 4.408). Direct application of laboratory-trained models to pilot-scale photobioreactors was inaccurate due to domain shifts. Transfer learning effectively mitigated these shifts, restoring accuracy with minimal calibration effort (10–20%). This work presents an industrially viable analytical tool for quality control and process optimization of microalgae-derived bioproducts.
Keywords: Microalgae, Biomass, Fucoxanthin, Raman spectroscopy, Machine learning, Model, Transfer learning
Highlights
• Raman spectroscopy monitoring of P. tricornutum biomass and fucoxanthin.
• Structured analytical workflow generated robust models and reproducibility.
• Biomass and fucoxanthin models exhibited distinct feature sensitivity and complexity.
• Transfer learning restored pilot-scale performance with minimal calibration size.
1. Introduction
Diatoms, among the most productive photosynthetic eukaryotic microalgae on Earth, contribute approximately 40% to marine primary production and account for an estimated 20% of global carbon fixation (Field et al., 1998). Within this group, Phaeodactylum tricornutum has emerged as a well-established model species owing to its advanced genetic toolkit and its capacity to function as a versatile photosynthetic cell factory for the biosynthesis of chrysolaminarin, long-chain polyunsaturated fatty acids, and carotenoids. Of particular industrial interest is fucoxanthin, a diatom-dominant carotenoid with documented bioactivities, including regulation of body weight, antioxidant capacity, attenuation of non-alcoholic fatty liver disease (NAFLD) progression, and anti-atherosclerotic effects. These attributes have attracted the exploration of fucoxanthin as both a functional food ingredient and a nutraceutical component (Lourenço-Lopes et al., 2021). Despite this potential, the transition from laboratory-scale cultivation to industrial production remains constrained by challenges in process scalability, especially the ability to maintain consistent product quality and operational efficiency. A major bottleneck is the lack of monitoring technologies capable of rapidly and accurately assessing biomass and bioproducts.
Biomass and intracellular analyte assessments traditionally require laborious, multi-step workflows. Chromatography delivers high sensitivity and chemical specificity for pigments, monosaccharides and lipids, yet requires cell harvesting, lyophilization, solvent extraction and instrumental analysis. The quantification of microalgal biomass is equally cumbersome, which limits the suitability of both approaches for high-frequency process monitoring. This methodological gap has driven the growing interest in optical approaches for rapid and non-destructive profiling. For example, reflectance spectroscopy has enabled fast quantification of Haematococcus pluvialis biofilm traits (Morgado et al., 2024), and two-dimensional fluorescence spectroscopy has established robust models for P. tricornutum biomass and fucoxanthin dynamics (Reynolds-Brandao et al., 2025). Raman spectroscopy (RS) detects molecular vibrations through inelastic scattering of monochromatic light, generating chemical fingerprints that reflect composition and conformational state, and has been widely applied in microbial fermentation process analytical technology (PAT) (Dzurendova et al., 2023). In microalgal research, RS has been employed for metabolite identification (chlorophylls, carotenoids, and fucoxanthin-chlorophyll protein complexes), contamination monitoring, and taxonomic and physiological-state discrimination (Adejimi et al., 2022; Kaczor & Baranska, 2011; Lieutaud et al., 2019; Premvardhan et al., 2010). Furthermore, chemometrics-based RS approaches have enabled quantitative estimation of lipids and carbohydrates via partial least squares regression (PLSR) or principal component analysis (PCA), often demonstrated at the single-cell level (Li et al., 2019; Wang et al., 2023).
However, RS performance degrades in heterogeneous populations due to spectral variability, instrument drift, strong fluorescence backgrounds, and nonlinear optical effects, which conventional chemometric methods cannot fully compensate for (Moudrikova et al., 2017).
Machine learning (ML) and deep learning (DL) frameworks map high-dimensional inputs to quantitative outputs and have been widely applied to predict microalgal productivity from images, environmental variables, or cultivation parameters. Representative algorithms include random forest (RF) (Meenatchi Sundaram et al., 2024), eXtreme Gradient Boosting (XGBoost) (Li et al., 2025), long short-term memory networks (LSTM) (Yeh et al., 2023), and convolutional neural networks (CNN) (Peng et al., 2024). However, the reliance on domain-specific inputs limits their applicability for multiplexed analyte detection. In this context, RS with ML/DL offers a complementary analytical approach that combines chemically informative spectra with data-driven modeling. Nevertheless, practical deployment in industrial photobioreactors (PBRs) faces two key fundamental challenges: the scarcity of large, well-labeled spectral datasets at industrial scale, and pronounced domain shifts between controlled laboratory conditions and industrial environments arising from abiotic fluctuations, matrix effects, and contamination. Transfer learning (TL) enables the transfer of knowledge from data-rich source domains to data-limited target domains, enhancing model performance and reducing reliance on large labeled datasets, which is critical for the industrial-scale application of spectroscopic technology (Li et al., 2021). TL has proven effective for mitigating domain shift across instruments and scenarios in areas such as food quality control, soil analysis, and pollutant tracing (Hossen et al., 2025). However, the applicability of TL to RS or other spectroscopy-based monitoring in microalgal biotechnology has not yet been validated.
This study presents an integrated RS-ML/DL-TL framework for rapid and label-free quantification of biomass and fucoxanthin yield in P. tricornutum. The workflow integrated spectral preprocessing and feature selection and included systematic benchmarking of ML and DL approaches. The framework explicitly adopted target-dependent modeling strategies to accommodate chemical and biological differences among analytes and incorporated transfer learning to enable model calibration across the domain shift between laboratory and pilot-scale systems. This framework provides a structured methodological basis for scalable RS-based monitoring in microalgal systems.
2. Materials and methods
2.1. Standard preparation
A fucoxanthin stock solution was prepared by dissolving fucoxanthin (≥95%, HPLC grade, Sigma-Aldrich) in 75% ethanol. The solution was ultrasonicated at room temperature for 10 min to ensure complete dissolution. To keep the sample near room temperature during sonication, intermittent sonication (1 min on/1 min off) was used and the sample temperature was monitored in real time. The solution was then serially diluted with ethanol to produce a concentration series ranging from 0 to 200 ppm. All standards were stored in amber vials at 4 °C to prevent photodegradation prior to spectral measurements.
2.2. Microalgae and culture conditions
The wild-type diatom P. tricornutum UTEX640 (Culture Collection of Algae, University of Texas at Austin, USA) was cultivated in a modified f/2 medium with urea as the nitrogen source under reduced salinity. Dataset 1, used for model training and validation, was collected from laboratory-scale column photobioreactors (PBRs; 5 cm inner diameter × 90 cm height, 1.5 L working volume) under controlled conditions: temperature maintained at 20 °C, continuous aeration with CO₂-enriched air (1.5%, v/v), and multispectral LED illumination (white 400–700 nm; red 630–660 nm) at three irradiance levels (9 W, 15 W, 30 W), corresponding to 15%, 25%, and 50% of maximum power output, respectively. Dataset 2, which served as an external test set, comprised atmospheric and room-temperature plasma (ARTP)-mutagenized P. tricornutum mutant libraries, cultured under identical conditions in column PBRs.
Dataset 3 was obtained from in-house designed pilot-scale PBRs (200 L working volume) configured for industrial P. tricornutum cultivation. These systems featured a submersible LED lighting array and a closed-loop water-cooling system maintaining temperature at 20 ± 1 °C. Cultures were aerated with compressed air containing 1% CO₂.
Samples were collected daily from each cultivation batch in both laboratory and pilot-scale PBRs to establish the datasets required for spectral analysis and model development.
2.3. Reference growth and fucoxanthin quantification
Growth determination: microalgal biomass was measured as cell dry weight (dwt, g L−1). Culture aliquots of known volume (V, 10 to 20 mL) were filtered through pre-weighed 0.45 μm GF/C membrane filters (Whatman, UK; filter mass dwt0). Filters were washed twice with 20 mL of 0.50 M NH4HCO3 solution, dried at 105 °C overnight, and then re-weighed (dwt1). dwt was calculated as Eq. (1):
$$\mathrm{dwt} = \frac{\mathrm{dwt}_1 - \mathrm{dwt}_0}{V} \tag{1}$$
Fucoxanthin yield (fuco, mg L−1) determination: Approximately 10 mg of freeze-dried algal powder was placed into a capped microtube. Then 300 mg of glass beads (Sigma, G8772, 0.4–0.6 mm) and 1 mL of HPLC-grade methanol were added. Cells were disrupted using a tissue homogenizer at 6 m s−1 for 80 s, followed by centrifugation to collect the supernatant into a 10 mL amber volumetric flask. The residual cell debris was resuspended in 1 mL of HPLC-grade methanol and centrifuged to obtain additional supernatant. The extraction cycle was repeated until the pellet became milky white. The combined extracts were adjusted to 10 mL, mixed thoroughly, and filtered through a 0.22 μm organic-phase membrane filter into sample vials. Fucoxanthin content was determined using a Waters 2695 HPLC system (Waters, Milford, MA, USA) equipped with a photodiode array (PDA) detector and a C18 reverse-phase column. Fuco was calculated as the product of dwt and fucoxanthin content.
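As a worked illustration of the two reference calculations, the minimal sketch below computes dwt from the filter masses and filtered volume, and then fuco as the product of dwt and fucoxanthin content. The function names, the unit choices (masses in g, volume in mL, content in mg per g dry weight) and the numerical values are assumptions made for illustration and are not taken from the study.

```python
def dry_weight_g_per_l(dwt0_g: float, dwt1_g: float, volume_ml: float) -> float:
    """Cell dry weight (g/L) from filter mass before (dwt0) and after (dwt1) drying."""
    if volume_ml <= 0:
        raise ValueError("filtered volume must be positive")
    return (dwt1_g - dwt0_g) / (volume_ml / 1000.0)


def fuco_yield_mg_per_l(dwt_g_per_l: float, content_mg_per_g: float) -> float:
    """Volumetric fucoxanthin yield (mg/L) = biomass (g/L) x content (mg/g)."""
    return dwt_g_per_l * content_mg_per_g


# Hypothetical example values, not measurements from this study.
dwt = dry_weight_g_per_l(dwt0_g=0.1000, dwt1_g=0.1250, volume_ml=20.0)
fuco = fuco_yield_mg_per_l(dwt, content_mg_per_g=10.0)
print(f"dwt = {dwt:.3f} g/L, fuco = {fuco:.2f} mg/L")
```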
2.4. Spectra acquisition and pre-processing
All spectra were collected offline using a Subphotonics Raman Analyzer (Shenzhen, China) equipped with an 830 nm near-infrared excitation laser. The laser was operated at 200 mW to minimize fluorescence interference while maintaining an optimal signal-to-noise ratio (SNR). For each sample, ten consecutive 2-s integrations were averaged into a single full-length spectrum of 1024 variables. Aliquots (3–5 mL) of P. tricornutum cultures from PBRs were transferred to quartz cuvettes and measured in a light-isolated chamber. An immersion probe was inserted vertically to ensure a consistent focal depth across measurements. The system was calibrated with an ethanol standard prior to analysis.
A structured preprocessing workflow was applied to eliminate non-informative components while retaining essential biochemical features. Noise reduction involved (1) median filtering with a 7-point sliding window to suppress high-frequency noise, followed by (2) Savitzky-Golay smoothing (15-point window, second-order polynomial) to enhance SNR without distorting peak morphology. Baseline correction was performed using asymmetric least squares (ALS) optimization (Eq. 2):
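$$\min_{z}\; \sum_{i} w_i \,(y_i - z_i)^2 + \lambda \,\lVert D z \rVert^2, \qquad w_i = \begin{cases} p, & y_i > z_i \\ 1 - p, & y_i \le z_i \end{cases} \tag{2}$$
where y is the measured spectrum, z is the fitted baseline, λ = 10⁵ determined baseline smoothness, p = 0.05 defined the asymmetric weighting to favor peak preservation over background fitting, and 20 iterations ensured convergence. The second-derivative regularization matrix D enforced baseline continuity. Following baseline correction, spectra were truncated to the 388.87–1747.22 cm−1 Raman shift region, yielding 537 variables for subsequent analysis.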
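The iterative reweighting behind ALS baseline correction can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes the standard asymmetric least squares formulation, reads the reported smoothness parameter as λ = 10^5, and uses dense NumPy matrices for brevity (sparse solvers are the usual choice for long spectra). The function name `als_baseline` and the synthetic spectrum are hypothetical.

```python
import numpy as np


def als_baseline(y, lam=1e5, p=0.05, n_iter=20):
    """Asymmetric least squares baseline (dense-matrix sketch for short spectra)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-order difference operator D with shape (n - 2, n).
    D = np.diff(np.eye(n), n=2, axis=0)
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    z = y.copy()
    for _ in range(n_iter):
        # Solve (W + lam * D^T D) z = W y for the current weights.
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        # Points above the baseline (peaks) receive the small weight p.
        w = np.where(y > z, p, 1.0 - p)
    return z


# Synthetic demonstration (hypothetical data): linear background plus one band.
x = np.linspace(0.0, 1.0, 201)
true_baseline = 5.0 + 3.0 * x
peak = 10.0 * np.exp(-0.5 * ((x - 0.5) / 0.02) ** 2)
spectrum = true_baseline + peak
baseline = als_baseline(spectrum)
corrected = spectrum - baseline
```

Because points above the current baseline are down-weighted by p, the fit settles under the Raman bands rather than through them, which is why a small p favors peak preservation.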
2.5. Post-processing and feature selection algorithms
Post-processing transforms were applied to generate alternative spectral representations that address residual intensity-scaling and heteroscedastic variance remaining after preprocessing, enable fair comparisons across feature-selection strategies and model classes, and support testing of model robustness. The transforms applied were first-derivative conversion, standard normal variate (SNV) and vector normalization (VN). First-derivative conversion emphasizes local peak morphology and aids resolution of overlapping bands; SNV mitigates multiplicative scattering and stabilizes variance across wavelengths; VN scales spectra to a common intensity to reduce total-intensity differences.
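The three post-processing transforms can be illustrated with a few lines of NumPy. The sketch below assumes spectra are stored row-wise (samples × wavenumbers) and uses a simple finite-difference derivative; the study does not state how its derivative was computed, so that choice, the function names and the random test matrix are all assumptions.

```python
import numpy as np


def first_derivative(spectra):
    """Finite-difference first derivative along the Raman-shift axis."""
    return np.gradient(spectra, axis=1)


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def vector_normalize(spectra):
    """Scale each spectrum to unit Euclidean (L2) norm."""
    norm = np.linalg.norm(spectra, axis=1, keepdims=True)
    return spectra / norm


# Hypothetical data: 4 spectra with 537 variables each.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=1.0, size=(4, 537))
X_d1, X_snv, X_vn = first_derivative(X), snv(X), vector_normalize(X)
```

SNV is invariant to a positive gain and a constant offset applied to a whole spectrum, which is the property that makes it useful against multiplicative scattering.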
Feature selection was implemented as a critical step that may improve machine learning performance by reducing overfitting, enhancing predictive accuracy, and lowering computational cost. Different selection methods impose distinct inductive biases and tend to favor distributed spectral patterns, linear contributions, compact nonredundant clusters, or sparse stable predictors. Four complementary strategies were evaluated to determine which types of spectral representation best suit each target. This ensemble approach also facilitates interpretation and practical deployment by identifying compact, physically meaningful variable subsets when appropriate while retaining full-spectrum baselines for high-capacity learners. The four strategies were as follows:
(i) Full-spectrum, in which all spectral variables were retained to serve as a baseline for performance comparison;
(ii) Variable importance in projection (VIP), calculated from PLSR, with variables retained only if VIP ≥ 1.0 to ensure selection of the most influential features;
(iii) Genetic algorithm (GA), an evolutionary optimization approach that iteratively selects compact, non-redundant clusters of variables through simulating natural selection processes of mutation, crossover, and survival. The number of selected variables was limited to 200;
(iv) Least absolute shrinkage and selection operator with elastic net regularization (LASSO–EN), a hybrid L1/L2 penalty method that enforces sparsity while mitigating multicollinearity, enabling stable selection of correlated yet relevant features.
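To make the VIP criterion in strategy (ii) concrete, the sketch below fits a single-response PLS model with the NIPALS algorithm and derives VIP scores from the weight vectors and the response variance captured by each latent variable. This is one common formulation and is offered as an assumption; the study does not specify its VIP implementation. The function name, the number of components and the synthetic data (two informative variables at hypothetical indices 5 and 12) are illustrative only.

```python
import numpy as np


def pls1_vip(X, y, n_components):
    """VIP scores from a PLS1 model fitted with the NIPALS algorithm."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n_vars = X.shape[1]
    weights = np.zeros((n_vars, n_components))
    explained = np.zeros(n_components)  # response sum of squares per component
    for a in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)          # unit-norm weight vector
        t = X @ w                       # scores
        tt = t @ t
        p = X.T @ t / tt                # X loadings
        q = (y @ t) / tt                # y loading
        X = X - np.outer(t, p)          # deflate predictors
        y = y - q * t                   # deflate response
        weights[:, a] = w
        explained[a] = q**2 * tt
    return np.sqrt(n_vars * (weights**2 @ explained) / explained.sum())


# Synthetic demonstration: only variables 5 and 12 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = 3.0 * X[:, 5] + 3.0 * X[:, 12] + 0.1 * rng.normal(size=60)
vip = pls1_vip(X, y, n_components=3)
selected = np.flatnonzero(vip >= 1.0)   # retain variables with VIP >= 1.0
```

With unit-norm weight vectors the squared VIP scores average to one across variables, which is the usual rationale for the VIP ≥ 1.0 threshold.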
2.6. Modeling algorithms
A suite of machine learning algorithms, including PLSR, support vector machine regression (SVM), and PCA-SVM, together with the neural network-based deep learning architectures CNN and multilayer perceptron (MLP), was selected. These were combined with the post-processing and feature selection strategies to encompass a range of complexity levels and modeling characteristics, which enables a systematic evaluation of performance on targets (dwt and fuco) that differ in biological origin and statistical behavior. Low-complexity linear methods address largely linear, collinear spectral signals. Medium-capacity learners handle moderate nonlinearity. Sparse, regularized models produce compact predictor sets that reduce overfitting and simplify deployment. High-capacity architectures provide greater representational flexibility for modeling complex spectral–response relationships.
PLSR is a linear method that projects predictor and response variables into a shared latent space by maximizing covariance, thereby reducing multicollinearity and yielding stable predictions (Wold et al., 2001). SVM employed a radial basis function (RBF) kernel to map spectral inputs into a high-dimensional feature space, using an epsilon-insensitive loss function to capture nonlinear relationships while constraining model complexity (Yu et al., 2025). PCA-SVM first performed PCA to reduce data dimensionality and remove inter-variable correlations, followed by SVM training to improve computational efficiency and reduce overfitting risk.
The CNN architecture consisted of stacked one-dimensional convolutional layers, rectified linear unit (ReLU) activation, and max pooling, interleaved with dropout regularization and concluded with fully connected layers, enabling hierarchical extraction of localized spectral motifs relevant to regression (Ouyang et al., 2025).
The MLP regressor consisted of a cascade of fully connected layers, each followed by batch normalization and ReLU activation, with dropout applied for regularization and a final linear readout producing a scalar prediction. This architecture facilitates learning of global relationships across spectral channels and transforms raw spectra into a nonlinear representation suitable for continuous-value prediction (Xiao et al., 2025).
All models were trained and validated on Dataset 1 using five-fold cross-validation (5-fold CV) (Fig. S1a): samples were randomly split into five approximately equal-sized folds, each fold was held out once for validation so that after five iterations every sample had served as a validation case while the model was trained on the remaining four. Dataset 2, an independent external test set, was withheld during model development and used exclusively for final evaluation to provide an unbiased assessment of robustness and transferability. Hyperparameters were optimized individually for each algorithm via grid search within the CV framework (search spaces detailed in Table S1). For deep-learning models (CNN and MLP), a randomized hyperparameter search with 200 iterations was used to identify optimal configurations.
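The five-fold partitioning described above can be sketched in plain Python. The helper name `five_fold_indices` and the random seed are assumptions for illustration; only the sample count (251, the size of Dataset 1) and the number of folds come from the text.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    # Interleaved slicing gives folds whose sizes differ by at most one.
    folds = [order[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


splits = list(five_fold_indices(251))  # 251 = size of Dataset 1
for fold_no, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"fold {fold_no}: {len(train_idx)} training, {len(val_idx)} validation")
```

Hyperparameter candidates would then be scored by averaging the validation error over the five splits, with Dataset 2 kept entirely outside this loop.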
2.7. Transfer learning
Labeling constraints in pilot-scale photobioreactor spectral modeling, together with domain shifts arising from scale-dependent biophysical heterogeneities, motivated the design of two transfer-learning (TL) strategies that leverage a laboratory source domain with sufficient labeled samples to enable efficient industrial-scale prediction.
Strategy 1 (fine-tuning, FT) (Fig. S1b): fine-tuning was used to preserve generalizable low-level spectral representations learned from the well-labeled laboratory source domain while adapting higher-order mappings to the target domain (Yosinski et al., 2014). Specifically, convolutional filters in the lower layers of the pretrained CNN were frozen (excluded from gradient updates) to retain stable representations of fundamental Raman vibrational patterns and peak morphology. Training was restricted to the higher-level parameters of the network, namely the fully connected layers and the final output layer, allowing these weights to recalibrate the mapping from extracted spectral features to continuous target values and accommodate systematic shifts in peak position and relative intensity between source and target spectra. This selective-update scheme reduces the risk of overfitting when target labels are scarce and facilitates efficient domain adaptation by combining preserved spectral priors with targeted adjustment of the predictive mapping.
Strategy 2 (instance transfer, IT) (Fig. S1c): instance-transfer prioritized source-domain spectra that were most similar to the target-domain migration samples. Similarity was quantified with a k-nearest-neighbors search (k = 5) in the Euclidean metric: for each source sample, distances to its five nearest target samples were computed and averaged to yield a single similarity score. Source samples were ranked by this mean distance, and the top 50% most similar source instances (SIMILARITY_RATIO = 0.5) were retained. The selected source samples were combined with the target-domain migration samples to build the training set; models were trained using the hyperparameters inherited from the source-domain optimization. This instance-selection procedure limits negative transfer by excluding dissimilar source cases and focuses model training on source examples that are most relevant to the target conditions (Liu et al., 2025).
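The instance-selection step of Strategy 2 can be sketched as below: each source spectrum is scored by its mean Euclidean distance to its k nearest target samples, and the closest fraction is kept. The function name, the truncation used to turn the ratio into a count, and the two-cluster synthetic data are assumptions for illustration; `k = 5` and the 0.5 ratio follow the text.

```python
import numpy as np


def select_similar_source(source, target, k=5, similarity_ratio=0.5):
    """Return indices of the source spectra closest to the target samples."""
    # Pairwise Euclidean distances, shape (n_source, n_target).
    diff = source[:, None, :] - target[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=2))
    k = min(k, target.shape[0])
    # Mean distance from each source sample to its k nearest target samples.
    score = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    n_keep = max(1, int(similarity_ratio * source.shape[0]))
    return np.argsort(score)[:n_keep]


# Hypothetical data: half of the source domain resembles the target domain.
rng = np.random.default_rng(0)
source = np.vstack([rng.normal(0.0, 1.0, size=(20, 10)),
                    rng.normal(5.0, 1.0, size=(20, 10))])
target_migration = rng.normal(5.0, 1.0, size=(8, 10))
selected = select_similar_source(source, target_migration)
# Training set = retained source instances + target-domain migration samples.
train_X = np.vstack([source[selected], target_migration])
```

In this toy case the retained instances are exactly the source samples drawn around the target distribution, which is the behavior intended to limit negative transfer.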
For evaluation, target-domain samples were partitioned via 5-fold CV, with 80% allocated to the migration pool for adaptation and the remaining 20% reserved as an unseen test set. Within the migration pool, incremental subsets corresponding to fractions of 0.1, 0.2, …, 1.0 of the pool were sampled and used for both FT and IT. In parallel, two baseline scenarios were implemented: a zero-transfer baseline, in which source-domain models were applied directly to target-domain data without adaptation, and a single-domain baseline, in which models were trained exclusively on target-domain samples.
2.8. Criteria for evaluation and model interpretability
Predictive performance was evaluated using four metrics to assess regression accuracy. The coefficient of determination (R2) quantifies the proportion of variance in dwt and fuco explained by the model. Root mean squared error (RMSE) expresses absolute prediction deviations in physiologically relevant units. Mean squared error (MSE) served as the loss function during training; it penalizes large deviations quadratically to promote stable convergence. Relative prediction deviation (RPD), defined as the ratio of the observed standard deviation to the RMSE, considers the prediction error and the variability of the observed values, offering a more objective and readily comparable model validity indicator across validations. A well-performing model is expected to exhibit high R2 and RPD values, while minimizing RMSE.
Four reliability categories were defined to assess the quality of the regression models (Shi et al., 2024). Models with RPD < 2.0, 2.0 ≤ RPD < 3.0, 3.0 ≤ RPD < 4.0, and RPD ≥ 4.0 were classified as non-predictive (NP), marginally predictive (MP), predictive (P), and excellent predictive (EP), respectively. In bioreactor monitoring for microalgal production, an RPD ≥ 3 indicates that the model's prediction error is sufficiently low relative to the natural variability of targets, facilitating reliable process control and basic decision-making. An RPD ≥ 4 signifies a higher level of predictive excellence, where the reduced error supports more precise optimization and quality assurance in production operations, as commonly required for minimizing variability in bioprocess outcomes.
R2, RMSE, MSE, and RPD were calculated as Eqs. (3), (4), (5), (6), respectively:
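$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{3}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{4}$$
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{5}$$
$$\mathrm{RPD} = \frac{\sigma_{\mathrm{observed}}}{\mathrm{RMSE}} \tag{6}$$
where yi, ŷi, and ȳ denote measured values, predicted values, and the mean of the measured values, respectively, n is the number of samples, and σobserved represents the standard deviation of reference measurements.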
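The four metrics and the RPD-based reliability classes can be computed as in the following sketch. It assumes the sample standard deviation (n − 1 denominator) for σobserved, which the text does not specify; the function names and the five-point example are hypothetical.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MSE and RPD for paired measured/predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    mse = ss_res / n
    rmse = math.sqrt(mse)
    rpd = statistics.stdev(y_true) / rmse  # sample SD is an assumption
    return {"R2": 1.0 - ss_res / ss_tot, "RMSE": rmse, "MSE": mse, "RPD": rpd}


def reliability_class(rpd):
    """Map RPD to the four reliability categories used in this section."""
    if rpd < 2.0:
        return "NP"   # non-predictive
    if rpd < 3.0:
        return "MP"   # marginally predictive
    if rpd < 4.0:
        return "P"    # predictive
    return "EP"       # excellent predictive


# Hypothetical measured and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.9, 5.1]
metrics = regression_metrics(y_true, y_pred)
label = reliability_class(metrics["RPD"])
```

Applied to the RPD values reported in the abstract (5.598 for biomass, 4.408 for fucoxanthin), `reliability_class` returns "EP" in both cases.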
Based on cooperative game theory, SHAP (SHapley Additive exPlanations) values quantify the contribution of each feature to individual predictions, thereby enhancing the model interpretability. In this study, mean absolute SHAP values were computed across samples using the Python SHAP library to identify the most influential features. More details are available in Zhong et al. (2024).
2.9. Technical implementation
The computational framework was implemented in Python 3.11.8 with PyTorch (ML/DL/TL architectures), Scikit-learn (data processing), and Matplotlib (visualization). PyCharm was used to manage the pipelines, including version control, debugging, and visualization. All computations ran on a workstation with a 12th Gen Intel Core i7-12700 CPU and an NVIDIA GeForce RTX 4090 GPU (CUDA 12.1 drivers).
3. Results and discussion
3.1. Spectra analysis
The Raman spectra of fucoxanthin standards dissolved in 75% ethanol displayed two prominent bands at 1531 cm−1 and 1160 cm−1, whose intensities increased linearly with fucoxanthin concentration (Figs. 1a-b). A weaker band in the 1000–1023 cm−1 region became detectable only at higher concentrations. These spectral features correspond to the primary vibrational modes of fucoxanthin: C=C stretching (ν1), C–C stretching (ν2), and C–CH₃ deformation (ν3) (Nekvapil et al., 2022), confirming their suitability as spectroscopic markers.
Fig. 1.
Raman characterization of fucoxanthin/carotenoids in Phaeodactylum tricornutum. (a) Raman spectra of fucoxanthin standards in 75% ethanol showing characteristic peaks (ν1, ν2, and ν3). (b) Intensity response of the principal fucoxanthin features (ν1 and ν2) in ethanol across the tested standard series. (c) Spectral preprocessing workflow. (d) Bright-field and chlorophyll autofluorescence micrographs and macroscopic culture images illustrating samples from low to high cell densities (scale bar = 10 μm, Olympus BX53).
Spectra acquired from P. tricornutum cultures under 830 nm excitation exhibited pronounced fluorescence backgrounds (Fig. 1c), primarily attributable to intracellular chlorophyll autofluorescence (as red emission in the micrographs, Fig. 1d). To improve spectral clarity and SNR, a two-step preprocessing workflow was applied consisting of smoothing and baseline fitting and subtraction (Fig. 1c). Importantly, the preprocessed spectra were used for all subsequent quantitative modeling, whereas the original spectra were retained for exploratory multivariate analyses (e.g., PCA) to reveal the unprocessed global variance structure. The principal bands observed in cell spectra (ν1 ∼ 1524 cm−1, ν2 ∼ 1160 cm−1, ν3 ∼ 1015 cm−1) reflect composite signals from coexisting carotenoids, including fucoxanthin, diadinoxanthin and diatoxanthin. Because these pigments share similar polyene chains and terminal rings, their vibrational modes overlap substantially in vivo and cannot be resolved individually, producing a consolidated triplet-band profile (Pinto et al., 2021). Of note, the quantitative models infer fuco from multivariate spectral correlations calibrated to HPLC reference values, rather than by isolating fucoxanthin-specific Raman bands.
Compared with the ethanol standards, the ν1 band in cellular spectra was shifted by approximately 7 cm−1 toward lower wavenumbers. This shift is consistent with a change in effective conjugation length upon incorporation of fucoxanthin into fucoxanthin-chlorophyll a/c protein complexes, as well as with pigment-protein interactions such as hydrogen bonding and local dielectric effects that modulate carotenoid vibrational energies (Premvardhan et al., 2010). Additional bands assigned to other cellular components are summarized in Table 1: ∼1440 cm−1 (CH2 and CH3 deformations from saturated fatty acids), ∼1340 cm−1 (CH2 deformation from carbohydrates and C–N/C–H deformations from chlorophyll), ∼1264 cm−1 (=CH deformation from unsaturated fatty acids), ∼1038 cm−1 (C–O–H deformation and C–O–C stretching from polysaccharides), and ∼ 904 cm−1 (phosphodiester and deoxyribose vibrations). Bands at lower wavenumbers could not be resolved owing to a broad interference peak from the sapphire probe lens and the spectral resolution limits of the detector.
Table 1.
Identification of band assignments of Raman spectra for P. tricornutum culture (Grace et al., 2022; Wood et al., 2005).
| Raman shift (cm−1) | Band assignments |
|---|---|
| ∼1524 | C=C stretches from carotenoids |
| ∼1440 | CH2 and CH3 deformation from saturated fatty acids |
| ∼1340 | CH2 deformation from carbohydrates; C–N stretches and C–H deformation from chlorophyll |
| ∼1264 | =CH deformation from unsaturated fatty acids |
| ∼1160 | C–C stretches from carotenoids; C–C, C–O ring asymmetric stretching from carbohydrates |
| ∼1038 | C–O–H deformation, C–O–C stretching of polysaccharides from carbohydrates |
| ∼1017 | C–CH3 deformations from carotenoids |
| ∼904 | Phosphodiester group, deoxyribose; C–O–C skeletal mode from monosaccharides and disaccharides |
3.2. Datasets
An overview of all datasets is provided in Table 2 and Fig. 2. Dataset 1, the largest and most diverse dataset in terms of target variable coverage, was used for model training and validation. This dataset consists of 251 samples collected from P. tricornutum cultures grown in laboratory-scale column PBRs under six distinct light regimes, specifically 15%, 25%, and 50% of the total input power from either white or red LEDs. Biomass accumulation (dwt) ranged from 0.145 to 3.100 g L−1, while fuco ranged from 1.919 to 30.999 mg L−1 (Table 2). The spectral, growth and fucoxanthin profiles across light regimes are shown in Figs. 2a-c.
Table 2.
Summary of datasets used for model training/validation, testing, and transfer learning, with statistical information on dwt (g L−1) and fuco (mg L−1) measurements.
| Dataset | Stage | Analyte | Size | Minimum | Maximum | Mean | Standard deviation |
|---|---|---|---|---|---|---|---|
| Dataset 1 | Training/Validation | dwt | 251 | 0.145 | 3.100 | 1.248 | 0.827 |
| Dataset 1 | Training/Validation | fuco | 251 | 1.919 | 30.999 | 11.811 | 7.731 |
| Dataset 2 | Testing | dwt | 81 | 0.500 | 1.383 | 0.897 | 0.253 |
| Dataset 2 | Testing | fuco | 81 | 5.030 | 23.803 | 12.753 | 4.901 |
| Dataset 3 | Transfer learning | dwt | 82 | 0.050 | 0.630 | 0.271 | 0.152 |
| Dataset 3 | Transfer learning | fuco | 82 | 0.618 | 6.455 | 3.087 | 1.739 |
Fig. 2.
Dataset visualizations. Boxplots summarizing measured biomass (dwt, a) and fucoxanthin production (fuco, b) across the light regimes for Dataset 1. (c) Raman spectra from Dataset 1 for P. tricornutum cultures grown under white (W) or red (R) LEDs at 15%, 25% and 50% of total input power (labeled W15, W25, W50, R15, R25, R50). Mean traces are shown with shaded areas indicating within-group variability. (d) Principal component analysis (PCA) of day-10 spectra across multiple batches, with score scatter indicating group separation and corresponding loading plot highlighting spectral features that discriminate light regimes. (e) Heatmap of Pearson correlation coefficients (Pearson's r) between each spectral wavenumber and the two target variables. The calibration relationship between the intensity at the wavenumber exhibiting the maximal Pearson's r (from panel e) and dwt (f) or fuco (g); 95% CI: 95% confidence intervals. (h) Original and preprocessed spectra from Datasets 1–3 (mean traces with shaded variability). Spectra in panels (c), (d), and (h) are plotted with arbitrary vertical offsets applied solely for visualization clarity to separate overlapping traces. Numeric y-axis tick marks are omitted to avoid misinterpretation of absolute intensities. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Mean dwt indicated the fastest growth under 50% white light and the slowest under 15% red light; however, analysis of variance (ANOVA) revealed no statistically significant differences among regimes (Fig. 2a). By contrast, fuco was significantly highest under 15% red light and lowest under 50% white light, suggesting that higher photon flux favored biomass accumulation, whereas lower red-LED flux promoted fucoxanthin biosynthesis (Fig. 2b). Spectral differences across samples were most evident within 950–1600 cm−1 (Fig. 2c). As expected, carotenoid-associated bands (ν1, ν2, ν3) were strongest under 15% red illumination and weakest under 50% white illumination, consistent with measured fuco. Conversely, spectral features attributable to biomass-related macromolecules (lipids, carbohydrates, proteins; Table 1) remained visually less distinct, complicating direct inference of dwt from single spectra. Multivariate exploration by PCA on day-10 raw spectra revealed separation by light regime (Fig. 2d). PC1 (99.0% of the variance) primarily reflected differences in fluorescence baseline, whereas PC2 (1.0% of the variance) captured chemically relevant differences associated with carotenoid-related Raman bands and was primarily responsible for discrimination among illumination conditions. As a result, chemically meaningful variation was largely confined to PC2 rather than PC1.
The Pearson correlation coefficient (Pearson's r) was used to quantify the strength and direction of linear association between variables (Gao et al., 2022). Pearson's r ranges from −1 to 1, where values of 1 and −1 denote perfect positive and negative linear correlation, respectively, and r = 0 denotes no linear association. Correlation coefficients were computed for each Raman shift and plotted in Fig. 2e. The strongest single-wavenumber association with dwt occurred at 1254.30 cm−1 (Pearson's r = 0.778, p < 0.001), for which a univariate linear regression yielded R2 = 0.608 (Fig. 2f), indicating a moderate relationship. Fuco exhibited substantially stronger and more extended correlations, peaking at 1537.70 cm−1 (ν1 carotenoid band; Pearson's r = 0.969, p < 0.001; univariate R2 = 0.934, Fig. 2g). Despite these strong local correlations, especially for fuco, univariate calibration performed poorly on the independent test set (Dataset 2, Fig. S2). This loss of accuracy likely reflects subtle shifts between datasets (baseline offsets, instrument variability, and matrix effects), which undermine the transferability of univariate calibrations. Although external standards or internal references can partially correct such shifts, their use compromises the speed and label-free nature of the method. Therefore, operationally robust quantification requires appropriate spectral transformations, informed variable selection, and ML/DL models capable of capturing multivariate and nonlinear relationships, rather than reliance on univariate regressions. Moreover, these results indicate that, compared with dwt, developing quantitative models for fuco is likely more straightforward owing to its stronger and more localized spectral signatures.
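The per-wavenumber correlation screen described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used in the study; the function names and the toy intensities are hypothetical, and only the three wavenumber labels echo values mentioned in the text.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def best_wavenumber(spectra, wavenumbers, target):
    """Return (wavenumber, r) for the variable with the largest |r|.

    spectra: list of samples, each a list of intensities (one per wavenumber).
    """
    scores = []
    for j, wn in enumerate(wavenumbers):
        column = [sample[j] for sample in spectra]
        scores.append((wn, pearson_r(column, target)))
    return max(scores, key=lambda item: abs(item[1]))

# Toy data: hypothetical intensities, not measurements from the study.
wavenumbers = [1157.0, 1254.3, 1537.7]
spectra = [
    [0.9, 1.0, 2.0],
    [1.1, 1.4, 4.1],
    [1.0, 1.3, 6.0],
    [0.8, 1.9, 7.9],
]
fuco = [5.0, 10.0, 15.0, 20.0]

wn, r = best_wavenumber(spectra, wavenumbers, fuco)
print(f"strongest association at {wn} cm-1")
```

A screen of this kind only ranks single variables on the calibration data; as noted above, a high in-sample correlation does not guarantee that the corresponding univariate calibration transfers to an independent dataset.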
Dataset 2 comprises 81 ARTP-mutagenized P. tricornutum samples grown under 15% red LEDs (dwt: 0.500–1.383 g L−1; fuco: 5.030–23.803 mg L−1) and served as an independent external test set for model robustness assessment. Dataset 3 includes 82 samples from a 200-L pilot-scale PBR, where both dwt (0.050–0.630 g L−1) and fuco (0.618–6.455 mg L−1) were substantially lower than in laboratory cultures. The raw and preprocessed spectra for all three datasets are presented in Fig. 2h. Collectively, these datasets cover diverse illumination regimes, genetic backgrounds and cultivation scales, thereby providing a rigorous basis for training, validating, and externally testing ML/DL regression models under conditions relevant to real-world bioprocessing.
3.3. ML and DL models for dwt and fuco in laboratory
3.3.1. Pipeline
To rigorously evaluate the robustness and generalizability of models predicting dwt and fuco in P. tricornutum cultures, a comprehensive analytical pipeline was established. The framework integrated spectral processing methods, feature selection algorithms, and optimized model types through hyperparameter tuning. Model performance was assessed using 5-fold CV and independent external testing, with R2, RMSE, and RPD as primary evaluation metrics.
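As a rough illustration of the three evaluation metrics, the sketch below computes R2, RMSE, and RPD from reference and predicted values. It assumes RPD is the standard deviation of the reference values divided by RMSE, with the sample (n − 1) standard deviation; the exact convention is defined in the Methods and may differ. The dot flags follow the RPD thresholds given in the captions of Figs. 3 and 4; the function names are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R2, RMSE, RPD) for paired reference and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    sd = math.sqrt(ss_tot / (n - 1))  # assumption: sample standard deviation
    return r2, rmse, sd / rmse

def rpd_flag(rpd):
    """Dot markers used in the heatmaps: one for 3.0 <= RPD < 4.0, two for RPD >= 4.0."""
    if rpd >= 4.0:
        return "··"
    if rpd >= 3.0:
        return "·"
    return ""

# Toy values (hypothetical, not from the study).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
r2, rmse, rpd = regression_metrics(y_true, y_pred)
```

Because RPD scales RMSE by the spread of the reference values, the same absolute error yields a lower RPD on a dataset with a narrower concentration range.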
3.3.2. Processed spectra and selected features
Spectral preprocessing results and feature-selection outcomes are reported in the Supplementary materials and summarized in Table 3. First-derivative transformation enhanced subtle local variations but also amplified high-frequency noise (Fig. S3b). VN and SNV corrected multiplicative and additive effects, improving comparability among samples while preserving peak morphology (Figs. S3c–d). Among selection methods, VIP (threshold ≥ 1) generally returned the largest subsets (178–256 variables for dwt; 240–361 for fuco), producing densely clustered features across the spectral range. GA enforced moderate sparsity (177–193 features for dwt; 163–180 for fuco), yielding feature sets that were evenly distributed and formed contiguous segments. In contrast, LASSO-EN imposed the strongest sparsity, selecting at most 67 features for dwt and 37 for fuco.
Table 3.
Number of selected variables for dwt and fuco models after different processing and feature selection methods. “Raw” denotes spectra subjected only to the common preprocessing pipeline (smoothing and baseline correction) with no further post-processing; VN: Vector normalization; SNV: Standard normal variate; VIP: Variable importance in projection; GA: Genetic algorithm; LASSO-EN: Least absolute shrinkage and selection operator with elastic net.
| Target variable | Post-processing | Full spectrum | VIP | GA | LASSO-EN |
|---|---|---|---|---|---|
| dwt | Raw | 537 | 237 | 189 | 14 |
| | 1st derivative | 536 | 256 | 177 | 67 |
| | VN | 537 | 231 | 188 | 26 |
| | SNV | 537 | 178 | 193 | 29 |
| fuco | Raw | 537 | 295 | 180 | 37 |
| | 1st derivative | 536 | 240 | 163 | 19 |
| | VN | 537 | 331 | 169 | 20 |
| | SNV | 537 | 361 | 176 | 22 |
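The three post-processing transforms can be sketched as follows. This is a simplified illustration, not the study's implementation: vector normalization is taken here as division by the Euclidean norm, and the first derivative as a simple finite difference, which is an assumption (it is at least consistent with the first-derivative spectra in Table 3 having one variable fewer, 536 versus 537). The function names are hypothetical.

```python
import math

def snv(spectrum):
    """Standard normal variate: centre each spectrum and scale by its own SD."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / sd for v in spectrum]

def vector_normalize(spectrum):
    """Vector normalization: scale each spectrum to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in spectrum))
    return [v / norm for v in spectrum]

def first_derivative(spectrum):
    """Finite-difference first derivative; the output is one point shorter."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

# Toy spectrum (hypothetical intensities).
raw = [1.0, 2.0, 4.0, 3.0]
print(snv(raw))
print(vector_normalize(raw))
print(first_derivative(raw))
```

SNV removes a per-spectrum offset and scale, so a spectrum and a rescaled, shifted copy of it map to the same output; vector normalization as written removes only the scale.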
3.3.3. dwt models
A total of eighty distinct dwt regression pipelines were evaluated, each assessed using 5-fold CV, producing 400 sub-models. Fig. 3a presents the aggregated CV performance alongside the external test set results, and Table 4 summarizes the detailed distribution of pipeline outcomes on the test set. Across all pipelines, internal validation performance was consistently strong, with R2 values ranging from 0.909 to 0.991, RMSE from 0.088 to 0.249 g L−1, and RPD from 3.323 to 10.563. Notably, even linear PLSR models achieved high validation accuracy, indicating that laboratory-scale datasets supported effective in-domain calibration. However, external test performance revealed substantial variability, with more than half of the pipelines yielding RPD values below 3.0 (Table 4). Deep learning architectures (CNN and MLP) delivered superior external performance. More than half of their pipelines attained RPD ≥ 3.0 and thus met the predictive (P) criterion, and 17 (CNN) and 9 (MLP) pipelines reached the excellent predictive (EP) threshold. These results highlight deep models' capacity to capture complex spectral-dwt relationships that linear methods (PLSR) often fail to generalize across subtle dataset shifts. In addition, kernel-based SVM and PCA-SVM pipelines also produced several models that met the P-class or higher predictive standard.
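The pipeline counts follow directly from the factorial design: five model types, four post-processing options, and four feature-selection options give 80 pipelines, and 5-fold CV gives 400 sub-models. The short enumeration below is only an illustration of that bookkeeping; the list names are hypothetical labels taken from Tables 3 and 4.

```python
from itertools import product

MODELS = ["PLSR", "SVM", "PCA-SVM", "CNN", "MLP"]
POST_PROCESSING = ["Raw", "1st derivative", "VN", "SNV"]
FEATURE_SELECTION = ["Full spectrum", "VIP", "GA", "LASSO-EN"]
N_FOLDS = 5

# Every combination of model, post-processing, and feature selection.
pipelines = list(product(MODELS, POST_PROCESSING, FEATURE_SELECTION))

# One sub-model per pipeline and CV fold.
sub_models = [
    (model, post, selection, fold)
    for (model, post, selection) in pipelines
    for fold in range(1, N_FOLDS + 1)
]

print(len(pipelines), len(sub_models))
```

The same design implies that each modeling class contributes 80 sub-models and each post-processing or feature-selection option contributes 100, which is a useful check on the row totals in Table 4.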
Fig. 3.
Comprehensive evaluation of model robustness and generalizability across diverse analytical pipelines for predicting dwt of P. tricornutum culture. (a) Heatmap of validation (Val) and external test-set performance (Test 1–5, Mean) for all pipeline combinations. RPD (relative prediction deviation) thresholds: one red dot (·) represents 3.0 ≤ RPD < 4.0; two red dots (··) represent RPD ≥ 4.0. (b) Validation set predictions of the top dwt pipeline, with the regression line (solid grey) showing the fitted trend between predicted and measured values, and the ideal line (dashed blue) indicating perfect prediction (slope = 1). (c) External test set performance of the optimal dwt model. Absolute residual distributions (top subpanels) with kernel density estimation (KDE) curves and a reference line at residual zero (red). Performance metrics (R2, RMSE, RPD, n) are annotated for each model. (d) SHAP-based interpretation of the best dwt model across all samples, ranking the top 20 influential features. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Table 4.
Distribution of dwt pipelines employing different modeling algorithms, spectral preprocessing methods, and feature-selection approaches across predictive performance classes (n). Each cell reports the number of pipelines (n) assigned to the specified predictive class (NP/MP/P/EP). Criteria for class assignment are described in the Methods.
| dwt pipelines | | NP (n) | MP (n) | P (n) | EP (n) |
|---|---|---|---|---|---|
| Total | | 67 | 183 | 122 | 28 |
| Modeling class | PLSR | 40 | 40 | 0 | 0 |
| | SVM | 1 | 57 | 21 | 1 |
| | PCA-SVM | 3 | 43 | 33 | 1 |
| | CNN | 21 | 16 | 26 | 17 |
| | MLP | 2 | 27 | 42 | 9 |
| Feature selection | Full spectrum | 17 | 45 | 26 | 12 |
| | VIP | 16 | 43 | 39 | 2 |
| | GA | 17 | 49 | 22 | 12 |
| | LASSO-EN | 17 | 46 | 35 | 2 |
| Spectra post-processing | Raw | 10 | 51 | 37 | 2 |
| | 1st derivative | 14 | 69 | 16 | 1 |
| | VN | 21 | 35 | 32 | 12 |
| | SNV | 22 | 28 | 37 | 13 |
Spectral processing emerged as a critical determinant of model performance. Pipelines utilizing VN or SNV accounted for the majority of top-performing predictions (Table 4), although these transformations were also associated with a higher incidence of non-predictive (NP) outcomes, as PLSR occasionally failed, likely due to sensitivity to the transformed variance structure (Fig. 3a). The effect of feature-selection algorithms on model performance was inconsistent across pipelines. In some instances, specific selection methods enhanced external robustness for kernel-based models. Models trained on the full spectrum, however, produced better predictions than many pipelines using filtered variables, suggesting that dimensionality reduction can inadvertently discard variables required to capture higher-order dependencies between spectral features and dwt. The two highest-ranking pipelines were CNN-SNV-full-spectrum (fold 2) and CNN-VN-full-spectrum (fold 5). These results agree with previous reports on the intrinsic variable-selection and redundancy-suppression capabilities of CNNs (Zhong et al., 2024). GA-based selection, which yielded contiguous and broadly distributed feature subsets, also produced EP-class performance in 12 pipelines, comparable to full-spectrum models.
Fig. 3b and c present the combined 5-fold CV performance and the best single-fold test results for the top-performing pipeline. The regression line represents the fitted relationship between predicted and observed values, and the ideal line denotes perfect prediction; the closer the points lie to the ideal line, the higher the predictive accuracy. For this model, combined 5-fold CV yielded R2 = 0.982, RMSE = 0.111 g L−1, and RPD = 7.446, while external test performance reached R2 = 0.968, RMSE = 0.045 g L−1, and RPD = 5.598. Predicted values closely followed the ideal line, and residuals were symmetrically distributed around zero, with an approximately normal distribution. The absence of systematic bias across the measurement range supports the reliability and stability of the model.
Model explainability using SHAP identified clustered variables near ∼1340 cm−1 (1346.22, 1348.61, 1343.84, 1339.06, and 1341.45 cm−1) and around the ∼1157 cm−1 band (1184.83, 1179.81, 1174.79, 1159.67, and 1169.77 cm−1) among the top ten features by mean absolute SHAP value, with intensity increases at these bands contributing positively to dwt prediction (Fig. 3d). Additional contributions were observed at ∼1532 cm−1. These features are consistent with carotenoid- and carbohydrate-associated vibrations, indicating that pigment- and carbohydrate-related signals underpin robust regression performance. SHAP analysis suggests that the model does not directly capture mass-specific signals of bulk constituents (proteins, lipids and carbohydrates). Raman sensitivity differs across molecular classes, and carotenoids, which contain extended conjugated systems and may exhibit resonance enhancement, generate relatively strong Raman scattering and therefore contribute disproportionately to spectral variance (Jehlička et al., 2014). In microbial growth systems, concentrations of multiple cellular components frequently covary during biomass accumulation, creating collinearity among spectral features (Müller et al., 2023). Consequently, regression models tend to capture multivariate compositional and physiological signatures that track biomass through informative spectral proxies rather than by direct mass-specific measurements. In other words, SHAP quantifies feature contributions to model outputs, so high importance values should be read as indicators of model reliance rather than as evidence of causal, mass-specific measurement.
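Ranking features by mean absolute SHAP value, as used for the top-feature lists above, amounts to averaging the magnitude of each feature's per-sample contributions. The sketch below assumes a precomputed matrix of SHAP values (samples × features) from any explainer; the function name, feature labels, and numbers are hypothetical and serve only to show the ranking step.

```python
def rank_by_mean_abs_shap(shap_values, feature_names, top_k=20):
    """Rank features by the mean absolute SHAP value across samples.

    shap_values: list of rows (one per sample), each with one value per feature.
    """
    n_samples = len(shap_values)
    importance = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shap_values) / n_samples
        importance.append((name, mean_abs))
    importance.sort(key=lambda item: item[1], reverse=True)
    return importance[:top_k]

# Toy SHAP matrix (hypothetical values for three features and three samples).
feature_names = ["band_A", "band_B", "band_C"]
shap_values = [
    [0.30, -0.10, 0.02],
    [-0.20, 0.15, -0.01],
    [0.25, -0.05, 0.03],
]
ranked = rank_by_mean_abs_shap(shap_values, feature_names)
```

Taking the absolute value before averaging means the ranking reflects how strongly the model relies on a feature, not the direction of its effect; the sign of the contribution has to be read from the per-sample values.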
3.3.4. fuco models
Model outcomes for fuco differed from those observed for dwt in both model complexity and algorithmic preference. Across the same analytical pipelines, validation metrics for fuco ranged from R2 = 0.903 to 0.979, RMSE = 1.118 to 2.411 mg L−1, and RPD = 3.209 to 6.921 (Fig. 4a), indicating consistently strong in-domain performance. On the external test set, 160 pipelines were classified as MP, 206 as P, and 34 as EP (Table 5). Notably, several PLSR pipelines also reached P-class performance on the external test set (Fig. 4a). Among EP-class models, PCA-SVM was most frequent (20 pipelines), followed by SVM (9), CNN (3), and MLP (2).
Fig. 4.
Comprehensive evaluation of model robustness and generalizability across diverse analytical pipelines for predicting fuco of P. tricornutum culture. (a) Heatmap of validation and external test-set performance for all pipeline combinations. RPD thresholds: one red dot (·) represents 3.0 ≤ RPD < 4.0; two red dots (··) represent RPD ≥ 4.0. (b) Validation set performance of the top fuco pipeline. (c) External test set performance of the optimal fuco model. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Table 5.
Distribution of fuco pipelines employing different modeling algorithms, spectral preprocessing methods, and feature-selection approaches across predictive performance classes (n). Each cell reports the number of pipelines (n) assigned to the specified predictive class (NP/MP/P/EP). Criteria for class assignment are described in the Methods.
| fuco pipelines | | NP (n) | MP (n) | P (n) | EP (n) |
|---|---|---|---|---|---|
| Total | | 0 | 160 | 206 | 34 |
| Modeling class | PLSR | 0 | 67 | 13 | 0 |
| | SVM | 0 | 4 | 67 | 9 |
| | PCA-SVM | 0 | 25 | 35 | 20 |
| | CNN | 0 | 22 | 55 | 3 |
| | MLP | 0 | 42 | 36 | 2 |
| Feature selection | Full spectrum | 0 | 48 | 48 | 4 |
| | VIP | 0 | 45 | 50 | 5 |
| | GA | 0 | 37 | 55 | 8 |
| | LASSO-EN | 0 | 30 | 53 | 17 |
| Spectra post-processing | Raw | 0 | 41 | 59 | 0 |
| | 1st derivative | 0 | 60 | 40 | 0 |
| | VN | 0 | 34 | 55 | 11 |
| | SNV | 0 | 25 | 52 | 23 |
As with dwt, spectral processing remained essential for robust fuco prediction, with SNV transformation proving particularly effective across multiple pipelines. Among feature-selection approaches, LASSO-EN contributed the largest number of EP pipelines (LASSO-EN: 17; GA: 8; VIP: 5; full spectrum: 4). The best-performing pipeline (PCA-SVM-SNV-LASSO-EN) yielded validation metrics of R2 = 0.972, RMSE = 1.303 mg L−1, and RPD = 5.939, with corresponding external test performance of R2 = 0.949, RMSE = 1.112 mg L−1, and RPD = 4.408 (Fig. 4b–c). These results indicate that compact and salient spectral representations improved robustness and transferability for kernel-based ML models of fuco. PCA-based dimensionality reduction further reduced model complexity and, in many pipelines, improved external performance. In contrast to dwt modeling, deep networks were not consistently superior for fuco prediction and were occasionally prone to overfitting given the available samples and features.
Model interpretability also differed by algorithm. PCA-based models project the feature space onto orthogonal components, which precludes direct attribution to the original wavenumbers. By enforcing sparsity and stabilizing correlated predictors through combined L₁ (sparsity-inducing) and L₂ (correlation-stabilizing) regularization, LASSO-EN generated a compact set of plausible variables (22 variables, Table 3) that helped explain the behavior and robustness of the fuco models.
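For reference, a combined L₁/L₂ penalty of this kind is commonly written as the elastic-net objective below. This is the standard textbook parametrization and not necessarily the exact form or software settings used in this study; the symbols X (matrix of preprocessed spectra), y (measured fuco), β (regression coefficients), n (number of samples), λ (overall penalty strength) and α (L₁/L₂ mixing weight) are notation introduced here for illustration.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; \lambda \left( \alpha\,\lVert \beta \rVert_1
  \;+\; \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right),
\qquad \lambda \ge 0,\quad 0 \le \alpha \le 1 .
```

Under this formulation, wavenumbers whose coefficients are shrunk exactly to zero are dropped from the selected set; α = 1 recovers the pure LASSO penalty and α = 0 a ridge penalty, with intermediate values trading sparsity against stability among correlated neighbouring wavenumbers.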
Synthesis. The empirical results support a practical, target-driven modeling strategy. Accurate prediction of dwt required models capable of capturing high-dimensional, distributed spectral information and complex dependencies; accordingly, deep-learning architectures trained on full-spectrum inputs provided the strongest external performance. By contrast, fuco exhibited strong, localized spectral signatures and thus lower intrinsic modeling complexity; kernel-based methods applied to sparse, regularized feature sets therefore attained high accuracy at markedly lower computational cost. This distinction parallels previous observations that pigment-dominated targets produce clearer, lower-complexity spectral signatures, whereas cell-concentration targets reflect broader compositional variability (Reynolds-Brandao et al., 2025). These findings emphasize the importance of systematically evaluating combinations of spectral processing, feature selection, and algorithmic choices, and of adopting data-partitioning strategies that explicitly assess transferability.
3.4. Boosting scalability: transfer learning from laboratory to pilot-scale PBRs
Average spectral profiles acquired from laboratory column PBRs and pilot-scale PBRs were broadly similar in overall shape, but the two datasets displayed pronounced systematic differences (Fig. 2h). PCA confirmed clear separation between the two datasets, with PC1 and PC2 explaining 97.7% and 1.4% of the total variance, respectively (Fig. 5a). These spectral differences in biomolecular bands reflect distinct physiological states. Laboratory cultures spanned a wide yield distribution, whereas pilot-scale cultures exhibited a restricted dwt range and lower fuco. Such shifts are consistent with scale-dependent effects, including increased light attenuation, intensified self-shading, and reduced mixing and mass-transfer efficiency in larger PBRs. A regression model trained on dataset 1 (laboratory source domain) was directly applied to pilot-scale PBR spectra (target domain) without adaptation (Fig. 5b–c). This zero-transfer approach produced marked performance loss and a systematic positive prediction bias relative to reference measurements, highlighting the necessity of domain adaptation and illustrating the sample-efficiency trade-offs that must be addressed for industrial deployment.
Fig. 5.
(a) PCA analysis of spectra from column and pilot-scale PBRs, showing distinct separation between the datasets. (b) dwt prediction using the source domain model applied directly to the target domain, with the inset showing results after transfer learning using 10% of the target domain. (c) fuco prediction using the source domain model applied directly to the target domain, with the inset showing results after transfer learning using 10% of the target domain. (d) and (e) performance of dwt prediction with varying migration sample ratios in both single-domain and TL settings. (f) and (g) performance of fuco prediction with varying migration sample ratios in both single-domain and TL settings.
TL was therefore employed to explicitly mitigate distributional shifts between domains. The target-domain dataset was partitioned into a migration pool (80%) and a test set (20%), and 5-fold CV was used for robust evaluation. An initial experiment employed only 10% of the migration pool for adaptation (Fig. 5b–c). Fine-tuning of a CNN model for dwt and instance-transfer reweighting with PCA-SVM for fuco produced substantial performance recovery on pilot-scale spectra. The aggregated post-transfer metrics were R2 = 0.894, RMSE = 0.049 g L−1, and RPD = 3.067 for dwt, and R2 = 0.904, RMSE = 0.538 mg L−1, and RPD = 3.236 for fuco.
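The partitioning described above can be sketched as follows. This is an illustrative outline only: the random shuffling, the rounding of subset sizes, and the function name are assumptions, and the study's actual fold assignment may differ. The sample count of 82 corresponds to Dataset 3.

```python
import random

def split_target_domain(n_samples, pool_fraction=0.8, migration_fraction=0.1, seed=0):
    """Split target-domain sample indices into a migration pool and a test set,
    then draw a fraction of the pool as the adaptation (migration) subset."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_pool = round(n_samples * pool_fraction)  # assumption: round to nearest
    pool, test = indices[:n_pool], indices[n_pool:]
    n_migration = max(1, round(len(pool) * migration_fraction))
    migration = pool[:n_migration]
    return migration, pool, test

# Dataset 3 contains 82 pilot-scale samples.
migration, pool, test = split_target_domain(82)
print(len(pool), len(test), len(migration))
```

With these assumptions, a 10% migration fraction corresponds to only a handful of labeled pilot-scale samples, which is why the test set must be kept strictly separate from the pool when reporting post-transfer metrics.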
A single-domain baseline was also evaluated, in which models were trained exclusively on target-domain samples without access to source-domain data. The effect of migration fraction on TL performance was further examined by varying the proportion of the migration pool from 20% to 100%. Under single-domain training, dwt predictions were poor when only 10% of target samples were available (RPD = 1.487), improved to acceptable levels at 20% (RPD = 3.165), and thereafter plateaued with occasional fluctuations at higher fractions (Fig. 5d). In contrast, TL fine-tuning produced accurate dwt predictions even at low migration fractions and yielded steady gains after inclusion of 20% migration samples, with RPD increasing from 3.760 at 20% to 4.315 at 100% (Fig. 5e). For fuco, single-domain models produced RPD < 3 when fewer than 40% of migration samples were available (Fig. 5f), whereas instance-transfer methods achieved stable and substantially improved performance with only a 20% migration fraction, producing RPD values up to approximately 4.50 and attaining excellent predictive performance at higher fractions (Fig. 5g).
To evaluate practical applicability, models were adapted using three pilot-scale batches (20% of the pilot-scale samples) and subsequently applied to predict twelve independent, previously unseen batches collected outside Dataset 3 (80%). The resulting dwt and fuco time series predictions are presented in Fig. S4. Despite minor deviations, predictions across all batches remained within ranges deemed acceptable for process monitoring. Taken together, these results provide evidence that sample-efficient transfer learning, including fine tuning of pretrained CNNs and instance reweighting for classical learners, restores predictive performance for monitoring dwt and fuco in P. tricornutum during the laboratory-to-pilot transition for the tested scenarios.
3.5. Applications and perspective
The RS-ML-TL framework developed in this study is designed for PAT applications that require rapid, non-destructive monitoring. The practical choice between periodic offline sampling and continuous in-line monitoring depends on monitoring objectives, acceptable latency, and resource constraints. Offline sampling offers cost-effective, high-specificity measurements suitable for lower-frequency monitoring, whereas integrated in-line sensors enable higher-frequency control at greater capital and maintenance cost. In photoautotrophic microalgal cultivation, growth and pigment composition are constrained by photon flux and therefore typically evolve on a timescale of 1–2 days under steady-state operation. Consequently, periodic offline sampling, as commonly employed in the majority of prior studies (Table 6), represents a pragmatic monitoring approach for routine process control in many settings.
Table 6.
Spectroscopic quantitative methods for monitoring biomass and biochemical compounds of microalgae in bioreactors.
| Spectroscopic technique | Microalga | Method/Algorithm | Analyte | Performance metrics | On-line/Off-line | Bioreactor | Reference |
|---|---|---|---|---|---|---|---|
| Raman spectroscopy | P. tricornutum | ML/DL-TL | Biomass | R2 = 0.968, RMSE = 0.045 g L−1, RPD = 5.598 | Off-line | Column PBRs and pilot-scale PBRs (200 L) | This study |
| | | | Fucoxanthin yield | R2 = 0.949, RMSE = 1.112 mg L−1, RPD = 4.408 | | | |
| Reflectance spectroscopy | Haematococcus pluvialis | Linear regression | Biomass density | R2 = 0.94, RMSE = 0.21 ln g m−2, RPD = 4.1 | On-line | Rotating biofilms | Morgado et al. (2024) |
| | | | Astaxanthin density | R2 = 0.92, RMSE = 0.45 ln g m−2, RPD = 3.7 | | | |
| | | | Chlorophyll density | R2 = 0.88, RMSE = 0.29 ln g m−2, RPD = 2.4 | | | |
| Absorbance and 2D-fluorescence spectroscopy | P. tricornutum | ML/DL | Cell number | R2 = 0.98, RMSE = 2.41 × 10^6 cells/mL | Off-line | 2 L Schott flasks | Reynolds-Brandao et al. (2025) |
| | | | Fucoxanthin yield | R2 = 0.91, RMSE = 2.05 g L−1 | | | |
| Fourier transform Raman spectroscopy | Schizochytrium sp. | PLSR | Biomass | R2 = 0.94, RMSE = 0.67 g L−1 | On-line | Fermenters (2.0 L) | Dzurendova et al. (2023) |
| | | | Lipids | R2 = 0.85, RMSE = 1.07 g L−1 | | | |
| | | | Carotenoids | R2 = 0.91, RMSE = 0.65 ppm | | | |
| Raman micro-spectroscopy | Scenedesmus obliquus | PLSR | Lipid | R2 = 0.925, RMSE = 0.019%, RPD = 3.553 | Off-line | N/A | Li et al. (2019) |
| Single-cell Raman spectroscopy | Cyclotella cryptica | PLSR | Polysaccharide | R2 = 0.949 | Off-line | N/A | Wang et al. (2023) |
| | | | Total lipid | R2 = 0.904 | | | |
| | | | Protein | R2 = 0.801 | | | |
| | | | Chlorophyll a | R2 = 0.917 | | | |
| Fourier transform infrared spectroscopy | Nannochloropsis oceanica | PLSR | Biomass | R2 = 0.900, RMSE = 0.253 g L−1 | Off-line | Column PBRs (1 L) | Zhang et al. (2022) |
| | | | Total lipid | R2 = 0.947, RMSE = 0.120 g L−1 | | | |
| | | | Palmitic acid C16:0 | R2 = 0.960, RMSE = 0.050 g L−1 | | | |
| | | | Palmitoleic acid C16:1 | R2 = 0.948, RMSE = 0.032 g L−1 | | | |
| | | | Eicosatetraenoic acid C20:4 | R2 = 0.880, RMSE = 0.004 g L−1 | | | |
| | | | Eicosapentaenoic acid C20:5 | R2 = 0.903, RMSE = 0.009 g L−1 | | | |
| Fluorescence spectroscopy | Tisochrysis lutea, P. tricornutum | PLSR | Biomass | R2 = 0.93, RMSE = 0.052 log10 g L−1 | Off-line | Panel PBRs (38 L) | Gao et al. (2021) |
| | | | Fucoxanthin content | R2 = 0.77, RMSE = 0.071 mg g−1 | | | |
| Near-infrared spectroscopy (transmission) | Spirulina sp. | PLSR | Carotenoid yield | R2 = 0.96, RMSE = 0.24 mg L−1, RPD = 3.44 | Off-line | Erlenmeyer flasks (1 L) | Shao et al. (2015) |
ML, machine learning algorithms; DL, deep learning algorithms; TL, transfer learning; R2, coefficient of determination; RMSE, root mean squared error; RPD, relative prediction deviation; PBRs, photobioreactors; PLSR: partial least squares regression. N/A: Not applicable.
Practical deployment must also address operational and data-centric limitations. Strong background fluorescence from contaminants, dense cultures or dissolved organics can obscure Raman marker bands and reduce the signal-to-noise ratio. Although longer excitation wavelengths, time-gated detection and fluorescence-suppression algorithms can mitigate these effects, their implementation requires careful balancing between sensitivity and system complexity (Park et al., 2023). Beyond these operational challenges, earlier approaches, such as those reliant on conventional chemometrics or single-cell analyses, offer limited scalability and limited capacity to manage domain shifts. The RS-ML-TL framework addresses these limitations by leveraging cross-domain knowledge transfer. Nonetheless, deployments should incorporate ongoing domain-adaptation measures, including periodic migration pools, targeted resampling and model recalibration, and should implement routine quality-control procedures to detect model drift and trigger recalibration (Guo et al., 2021).
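One simple form such a quality-control check could take is sketched below: a small set of periodically collected reference measurements is compared with model predictions, and recalibration is flagged when RPD falls below a chosen threshold. This is a hypothetical procedure, not one evaluated in this study; the function name, the default threshold of 3.0 (borrowed from the lower RPD threshold marked in Figs. 3 and 4), and the toy values are assumptions.

```python
import math

def needs_recalibration(reference, predicted, rpd_threshold=3.0):
    """Flag recalibration when RPD on recent check samples drops below a threshold.

    RPD is taken here as the sample SD of the reference values divided by RMSE.
    """
    n = len(reference)
    mean_ref = sum(reference) / n
    rmse = math.sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n)
    sd = math.sqrt(sum((r - mean_ref) ** 2 for r in reference) / (n - 1))
    if rmse == 0:
        return False  # perfect agreement, nothing to flag
    return sd / rmse < rpd_threshold

# Toy check samples in g L-1 (hypothetical values).
reference = [0.1, 0.2, 0.3, 0.4, 0.5]
stable = [0.11, 0.19, 0.31, 0.39, 0.50]   # small random errors
drifted = [r + 0.1 for r in reference]    # systematic positive bias

print(needs_recalibration(reference, stable))
print(needs_recalibration(reference, drifted))
```

Because RPD depends on the spread of the reference values, a check set drawn from a narrow concentration range can trigger the flag even when absolute errors are small, so in practice the check samples would need to span the operating range.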
Previous spectroscopic studies have revealed considerable diversity in microalgal species, optical techniques, and target analytes, often achieving commendable accuracy in controlled settings. Future work should therefore extend RS-ML-TL across diverse microalgal species, strains and geometries, and investigate multimodal sensor fusion that integrates RS with complementary measurements, including fluorescence spectroscopy, VIS-NIR reflectance, imaging or hyperspectral data, and process telemetry (irradiance, temperature, dissolved oxygen). Such approaches have the potential to further enhance robustness, resilience to domain shifts, and transferability in real-world bioprocesses (Lines et al., 2020; Reynolds-Brandao et al., 2025).
4. Conclusion
Two principal contributions are reported. First, systematic benchmarking of ML and DL pipelines revealed a target-dependent modeling principle: full-spectrum convolutional networks achieved the highest accuracy for dwt, whereas kernel-based methods applied to compact, regularized feature sets yielded more robust fuco predictions. Second, TL strategies recovered pilot-scale predictive performance with modest calibration effort: fine-tuning pretrained deep models and instance-transfer reweighting restored accuracy using on the order of 10–20% labeled pilot samples. Additionally, sample-efficient domain adaptation holds the potential to lower the calibration burden associated with deploying Raman-based process analytical technologies in comparable production settings.
CRediT authorship contribution statement
Jijian Long: Writing – review & editing, Writing – original draft, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Song Zou: Resources. Wen Liu: Investigation. Mingyang Ma: Resources, Project administration. Qin Zhang: Investigation. Yini Chen: Investigation. Yadong Chu: Resources, Project administration. Zhiqiang Guo: Methodology, Formal analysis. Yinlan Ruan: Conceptualization. Danxiang Han: Writing – review & editing, Resources, Methodology, Funding acquisition, Conceptualization. Qiang Hu: Writing – review & editing, Writing – original draft, Supervision, Project administration, Funding acquisition, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This research was financially supported by the National Key Research and Development Programs of China (Grant number 2024YFA0919700, and grant number 2021YFA0909700).
Footnotes
Supplementary data to this article can be found online at https://doi.org/10.1016/j.fochx.2026.103682.
Contributor Information
Danxiang Han, Email: danxiang.han@ihb.ac.cn.
Qiang Hu, Email: huqiang@suat-sz.edu.cn.
Appendix A. Supplementary data
Supplementary material
Data availability
Data will be made available on request.
References
- Adejimi O.E., Ignat T., Sadhasivam G., Zakin V., Schmilovitch Z.E., Shapiro O.H. Low-resolution Raman spectroscopy for the detection of contaminant species in algal bioreactors. Science of the Total Environment. 2022;809 doi: 10.1016/j.scitotenv.2021.151138. [DOI] [PubMed] [Google Scholar]
- Dzurendova S., Olsen P.M., Byrtusová D., Tafintseva V., Shapaval V., Horn S.J.…Zimmermann B. Raman spectroscopy online monitoring of biomass production, intracellular metabolites and carbon substrates during submerged fermentation of oleaginous and carotenogenic microorganisms. Microbial Cell Factories. 2023;22(1) doi: 10.1186/s12934-023-02268-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Field C.B., Behrenfeld M.J., Randerson J.T., Falkowski P. Primary production of the biosphere: Integrating terrestrial and oceanic components. Science. 1998;281(5374):237–240. doi: 10.1126/science.281.5374.237. [DOI] [PubMed] [Google Scholar]
- Gao F., Sá M., Teles I., Wijffels R.H., Barbosa M.J. Production and monitoring of biomass and fucoxanthin with brown microalgae under outdoor conditions. Biotechnology and Bioengineering. 2021;118(3):1355–1365. doi: 10.1002/bit.27657. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gao W., Zhou L., Liu S., Guan Y., Gao H., Hu J. Machine learning algorithms for rapid estimation of holocellulose content of poplar clones based on Raman spectroscopy. Carbohydrate Polymers. 2022;292 doi: 10.1016/j.carbpol.2022.119635. [DOI] [PubMed] [Google Scholar]
- Grace C.E.E., Mary M.B., Vaidyanathan S., Srisudha S. Response to nutrient variation on lipid productivity in green microalgae captured using second derivative FTIR and Raman spectroscopy. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy. 2022;270 doi: 10.1016/j.saa.2021.120830. [DOI] [PubMed] [Google Scholar]
- Guo S., Popp J., Bocklitz T. Chemometric analysis in Raman spectroscopy from experimental design to machine learning–based modeling. Nature Protocols. 2021;16(12):5426–5459. doi: 10.1038/s41596-021-00620-3. [DOI] [PubMed] [Google Scholar]
- Hossen M.I., Awrangjeb M., Pan S., Mamun A.A. Transfer learning in agriculture: A review. Artificial Intelligence Review. 2025;58(4):97. [Google Scholar]
- Jehlička J., Edwards H.G., Oren A. Raman spectroscopy of microbial pigments. Applied and Environmental Microbiology. 2014;80(11):3286–3295. doi: 10.1128/AEM.00699-14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kaczor A., Baranska M. Structural changes of carotenoid astaxanthin in a single algal cell monitored in situ by Raman spectroscopy. Analytical Chemistry. 2011;83(20):7763–7770. doi: 10.1021/ac201302f. [DOI] [PubMed] [Google Scholar]
- Li H., Chen L., Zhang F., Cai Z. Graph-learning-based machine learning improves prediction and cultivation of commercial-grade marine microalgae Porphyridium. Bioresource Technology. 2025;416 doi: 10.1016/j.biortech.2024.131728. [DOI] [PubMed] [Google Scholar]
- Li X., Li Z., Yang X., He Y. Boosting the generalization ability of Vis-NIR-spectroscopy-based regression models through dimension reduction and transfer learning. Computers and Electronics in Agriculture. 2021;186.
- Li X., Sha J., Chu B., Wei Y., Huang W., Zhou H.…He Y. Quantitative visualization of intracellular lipids concentration in a microalgae cell based on Raman micro-spectroscopy coupled with chemometrics. Sensors and Actuators B: Chemical. 2019;292:7–15.
- Lieutaud C., Assaf A., Gonçalves O., Wielgosz-Collin G., Thouand G. Fast non-invasive monitoring of microalgal physiological stage in photobioreactors through Raman spectroscopy. Algal Research. 2019;42.
- Lines A.M., Hall G.B., Asmussen S., Allred J., Sinkov S., Heller F.…Bryan S.A. Sensor fusion: Comprehensive real-time, on-line monitoring for process control via visible, near-infrared, and Raman spectroscopy. ACS Sensors. 2020;5(8):2467–2475. doi: 10.1021/acssensors.0c00659.
- Liu H., Chen F., Zhang L., Meng D., Sun H. Improvement method for tea leaf moisture content prediction using VIS-NIR spectrum based on transfer learning. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy. 2025;343:126571. doi: 10.1016/j.saa.2025.126571.
- Lourenço-Lopes C., Fraga-Corral M., Jimenez-Lopez C., Carpena M., Pereira A., Garcia-Oliveira P.…Simal-Gandara J. Biological action mechanisms of fucoxanthin extracted from algae for application in food and cosmetic industries. Trends in Food Science & Technology. 2021;117:163–181.
- Meenatchi Sundaram K., Sravan Kumar S., Deshpande A., Chinnadurai S., Rajendran K. Machine learning assisted image analysis for microalgae prediction. ACS ES&T Engineering. 2024;5(2):541–550.
- Morgado D., Fanesi A., Martin T., Tebbani S., Bernard O., Lopes F. Non-destructive monitoring of microalgae biofilms. Bioresource Technology. 2024;398. doi: 10.1016/j.biortech.2024.130520.
- Moudrikova S., Sadowsky A., Metzger S., Nedbal L., Mettler-Altmann T., Mojzes P. Quantification of polyphosphate in microalgae by Raman microscopy and by a reference enzymatic assay. Analytical Chemistry. 2017;89(22):12006–12013. doi: 10.1021/acs.analchem.7b02393.
- Müller D.H., Flake C., Brands T., Koß H.J. Bioprocess in-line monitoring using Raman spectroscopy and indirect hard modeling (IHM): A simple calibration yields a robust model. Biotechnology and Bioengineering. 2023;120(7):1857–1868. doi: 10.1002/bit.28424.
- Nekvapil F., Brezestean I., Lazar G., Firta C., Pinzaru S.C. Resonance Raman and SERRS of fucoxanthin: Prospects for carotenoid quantification in live diatom cells. Journal of Molecular Structure. 2022;1250.
- Ouyang Q., Fan Z., Chang H., Shoaib M., Chen Q. Analyzing TVB-N in snakehead by Bayesian-optimized 1D-CNN using molecular vibrational spectroscopic techniques: Near-infrared and Raman spectroscopy. Food Chemistry. 2025;464. doi: 10.1016/j.foodchem.2024.141701.
- Park M., Somborn A., Schlehuber D., Keuter V., Deerberg G. Raman spectroscopy in crop quality assessment: Focusing on sensing secondary metabolites: A review. Horticulture Research. 2023;10(5):uhad074. doi: 10.1093/hr/uhad074.
- Peng Y., Yao S., Li A., Xiong F., Sun G., Li Z.…Peng F. Investigating quantitative approach for microalgal biomass using deep convolutional neural networks and image recognition. Bioresource Technology. 2024;403. doi: 10.1016/j.biortech.2024.130889.
- Pinto R., Vilarinho R., Carvalho A.P., Moreira J.A., Guimarães L., Oliva-Teles L. Raman spectroscopy applied to diatoms (microalgae, Bacillariophyta): Prospective use in the environmental diagnosis of freshwater ecosystems. Water Research. 2021;198. doi: 10.1016/j.watres.2021.117102.
- Premvardhan L., Robert B., Beer A., Büchel C. Pigment organization in fucoxanthin chlorophyll a/c2 proteins (FCP) based on resonance Raman spectroscopy and sequence analysis. Biochimica et Biophysica Acta (BBA)-Bioenergetics. 2010;1797(9):1647–1656. doi: 10.1016/j.bbabio.2010.05.002.
- Reynolds-Brandao P., Quintas-Nunes F., Bertrand C.D., Martins R.M., Crespo M.T., Galinha C.F., Nascimento F.X. Integration of spectroscopic techniques and machine learning for optimizing Phaeodactylum tricornutum cell and fucoxanthin productivity. Bioresource Technology. 2025;418. doi: 10.1016/j.biortech.2024.131988.
- Shao Y., Pan J., Zhang C., Jiang L., He Y. Detection in situ of carotenoid in microalgae by transmission spectroscopy. Computers and Electronics in Agriculture. 2015;112:121–127.
- Shi T.-F., Pan T.-T., Lu P. Rapid and simultaneous determination of mixed pesticide residues in apple using SERS coupled with multivariate analysis. Food Chemistry: X. 2024;24. doi: 10.1016/j.fochx.2024.101954.
- Wang X., He Y., Zhou Y., Zhu B., Xu J., Pan K., Li Y. An attempt to simultaneously quantify the polysaccharide, total lipid, protein and pigment in single Cyclotella cryptica cell by Raman spectroscopy. Biotechnology for Biofuels and Bioproducts. 2023;16(1):63. doi: 10.1186/s13068-023-02314-2.
- Wold S., Sjöström M., Eriksson L. PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems. 2001;58(2):109–130.
- Wood B.R., Heraud P., Stojkovic S., Morrison D., Beardall J., McNaughton D. A portable Raman acoustic levitation spectroscopic system for the identification and environmental monitoring of algal cells. Analytical Chemistry. 2005;77(15):4955–4961. doi: 10.1021/ac050281z.
- Xiao T., Xie C., Yang L., He X., Wang L., Zhang D.…Dong J. A general deep learning model for predicting and classifying pea protein content via visible and near-infrared spectroscopy. Food Chemistry. 2025;478. doi: 10.1016/j.foodchem.2025.143617.
- Yeh Y.-C., Syed T., Brinitzer G., Frick K., Schmid-Staiger U., Haasdonk B.…Urbas L. Improving microalgae growth modeling of outdoor cultivation with light history data using machine learning models: A comparative study. Bioresource Technology. 2023;390. doi: 10.1016/j.biortech.2023.129882.
- Yosinski J., Clune J., Bengio Y., Lipson H. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems. 2014;27.
- Yu Y., Chai Y., Yan Y., Li Z., Huang Y., Chen L., Dong H. Near-infrared spectroscopy combined with support vector machine for the identification of Tartary buckwheat (Fagopyrum tataricum (L.) Gaertn) adulteration using wavelength selection algorithms. Food Chemistry. 2025;463. doi: 10.1016/j.foodchem.2024.141548.
- Zhang D., Li Q., Yan C., Cong W. Determination of intracellular lipid and main fatty acids of Nannochloropsis oceanica by ATR-FTIR spectroscopy. Journal of Applied Phycology. 2022;34(1):343–352.
- Zhong L., Guo X., Ding M., Ye Y., Jiang Y., Zhu Q., Li J. SHAP values accurately explain the difference in modeling accuracy of convolution neural network between soil full-spectrum and feature-spectrum. Computers and Electronics in Agriculture. 2024;217.
Data Availability Statement
Data will be made available on request.






