Predicting the low-level and extremely low-threshold compounds in Baijiu: uniform manifold approximation and projection

Yintao Jia; Yue Qiu; Qi Deng; Ying Han; Baoguo Sun; Rong Liu; Pan Zhen; Wenxian Li; Wei Dong; Xiaotao Sun; Xiao Yang; Fan Cui

doi:10.1016/j.fochx.2025.102645

2025 Jun 10;29:102645. doi: 10.1016/j.fochx.2025.102645

PMCID: PMC12192529 PMID: 40567578

Abstract

In the flavor analysis of Baijiu, identifying compounds that exhibit high flavor dilution factors but give minimal or undetectable response signals in analytical measurements remains a significant challenge. To elucidate the correlation between the structure of odorants and their odors in Baijiu, a “compound-aroma” database consisting of 646 compounds and their known associated odors (70 aroma descriptors) was compiled. Each compound was encoded as a 1024-bit molecular fingerprint and analyzed by K-means and uniform manifold approximation and projection (UMAP). Moreover, by calculating the presence or absence of odor-related molecular substructures, associations between odor notes and chemical groups of the aroma compounds in Baijiu were revealed, and an indicative model diagram was constructed. Finally, a strong roasted odor of Baijiu, which appeared at a retention index of 1324–1334 with no distinct response signal, was identified as 2,5-dimethylpyrazine or 2,6-dimethylpyrazine using this model.

Keywords: Baijiu, UMAP, Structure-odor correlations, Molecular fingerprints, Compounds prediction

Highlights

• “Aroma-compound” database containing 646 compounds and 70 descriptors was compiled.
• “Aroma-structure” model for Baijiu was constructed by UMAP and K-means.
• Strong roasted odor of Baijiu, without distinct signal, was identified as 2,5- or 2,6-dimethylpyrazine.

1. Introduction

Baijiu, one of the most representative fermented foods in China, is a distilled alcoholic beverage composed mainly of ethanol and water (together accounting for over 98 %) and a complex mixture of aroma-active compounds, including esters, aldehydes, acids, and other volatiles (Wang, Liu, et al., 2024). Unique solid-state fermentation and distillation processes are fundamental in determining its organoleptic characteristics (Wu et al., 2024). Owing to its intricate flavor attributes, more than 12 types of Baijiu have been established, and over 2700 compounds have been detected (Wang, Tang, et al., 2024). Meanwhile, molecular sensory science has been systematically developed for the analysis of the key aroma compounds in Baijiu; it includes isolation of volatile compounds, qualitative analysis, quantitative analysis, odor activity value (OAV) calculations, and aroma recombination and omission experiments (Dong et al., 2024). With this approach, 386 aroma-active compounds have been identified and characterized (Wang, Tang, et al., 2024). However, accurately identifying compounds present at low concentrations with extremely low aroma thresholds remains difficult. In particular, in gas chromatography-olfactometry (GC-O) or gas chromatography-olfactometry-mass spectrometry (GC-O-MS) analyses, for regions where compounds are detectable only by olfaction and lack corresponding chromatographic or mass spectrometric signals, researchers have been compelled to rely on empirical inference to hypothesize the chemical composition. For example, Duan et al. reported the identification of six key sulfur compounds in sauce-aroma Baijiu, starting with an empirical “guess” based on long-term experimental experience (Duan et al., 2024). Sha et al. identified sulfur-containing compounds in roasted sesame-flavored Baijiu that exhibited high flavor dilution factors but did not show prominent peaks in MS; the identities were initially “guessed” through data comparison and then validated using a pulsed flame photometric detector (Sha et al., 2017). Despite these empirical “guesses” in odorant recognition, how aroma profiles relate to molecular structures remains elusive. Therefore, developing a model based on the correlation between aroma descriptors and molecular structures, one that enables scientific prediction of the chemical composition of unknown flavor regions rather than reliance on empirical inference, has long been a significant challenge in Baijiu flavor science.

In recent years, machine learning has been used effectively in the analysis and prediction of food flavor (Zeng et al., 2023). Flavor prediction relies on models built from the molecular structures of compounds with known flavor profiles, which establish a correlation between the aroma and the structure of the compounds. Dutta et al. predicted the taste of 3468 compounds, reporting that umami molecules are rich in nitrogen-containing functional groups, while bitter and sweet molecules have similar structures and correlate significantly with functional groups such as aldehydes, ketones, and carboxylic acids (Dutta et al., 2023). That study also developed a classification model that achieved favorable classification performance. The predictive performance of such models depends on a dataset containing a large number of valid molecules. However, the large volume of high-dimensional data generated during model construction complicates data analysis. Dimensionality reduction, an unsupervised learning technique, offers a way to overcome this problem. It embeds high-dimensional data into a meaningful low-dimensional feature space while retaining important information and facilitating visualization. It can also be used to discover latent structures in large datasets, which aligns closely with our research objectives. Existing dimensionality-reduction algorithms can be divided into two main categories. The first comprises linear techniques, which focus on preserving the distance structure of the high-dimensional space and include principal component analysis (PCA) (Ehiro, 2023), multidimensional scaling (MDS) (Herrera-Rocha et al., 2024), and non-negative matrix factorization (Zeng et al., 2024). The second comprises non-linear techniques, which can learn the local or global structure of non-linear manifolds; well-known algorithms include t-distributed stochastic neighbor embedding (t-SNE) (Marx, 2024), locally linear embedding (Roweis & Saul, 2000), and self-organizing maps (Wen et al., 2024). A more recent algorithm, uniform manifold approximation and projection (UMAP), has been reported to preserve as much of the local structure and more of the global structure of the data than other currently available techniques (Becht et al., 2019). For instance, Rugard et al. established effective correlations between the aroma profiles and chemical structures of compounds in food matrices using UMAP combined with K-means and agglomerative hierarchical clustering (AHC), and found that “woody” and “spicy” odors were linked to allylic and bicyclic structures (Rugard et al., 2021). However, to the best of our knowledge, the application of UMAP in Baijiu flavor research to predict the correlation between aroma and structure has not yet been reported.

In the present study, we propose a novel approach to investigate the correlation between molecular structure and aroma profile in Baijiu, consisting of the following steps: (i) compilation of aroma compounds and their corresponding descriptors from the publicly accessible aroma database “Flavornet and Human Odor Space” and from the literature indexed in “Web of Science” and “China National Knowledge Infrastructure (CNKI)”; (ii) calculation and encoding of molecular structures using KNIME software; (iii) application of UMAP in conjunction with K-means clustering to analyze the correlation between aroma, molecular structure, and compounds in Baijiu, followed by the construction of a predictive model; (iv) use of the predictive model to identify compounds in actual Baijiu samples that are detectable only by olfaction and lack corresponding chromatographic or mass spectrometric signals.

2. Materials and methods

2.1. Data collection

The odor-compound dataset compiled in this study includes “aroma compounds from Flavornet and Human Odor Space” (Flavornet Home Page) as well as “aroma compounds with OAV higher than 1 in Baijiu, collected from CNKI and Web of Science”. Because matrix effects and interactions in food may alter odors, the odor profiles of compounds with an OAV > 1 in Baijiu were used to refine and supplement the descriptors collected from websites, with the aim of establishing a more accurate relationship between “odor” and “structure”. Some of the collected odor descriptors were vague or inconsistent, so a manual screening process was applied. Specifically, descriptors that indicate degree or serve as modifiers (e.g., “pleasant,” “warm,” etc.) were excluded, and descriptors with identical meanings were standardized to ensure consistency. For example, “apple-like” was simplified to “apple” and “grassy” was changed to “grass.” Ultimately, only aroma descriptors occurring more than five times were retained (Tromelin et al., 2018). This study differs from previous research on odor-structure correlations in that the aroma descriptions of compounds were revised based on the aromas of substances with OAV greater than 1 in Baijiu. The database construction process is illustrated in Fig. 1.

Fig. 1 — Construction process of the “aroma-compound” database.
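The descriptor screening described in Section 2.1 can be illustrated with a minimal Python sketch. The modifier set and synonym map below contain only the examples named in the text; the full lists used in the study are not given, so the names `MODIFIERS`, `SYNONYMS`, `clean_descriptors`, and `filter_by_frequency` are hypothetical and the toy data are invented for illustration.

```python
from collections import Counter

# Only the examples named in the text; the complete lists are an assumption.
MODIFIERS = {"pleasant", "warm"}
SYNONYMS = {"apple-like": "apple", "grassy": "grass"}


def clean_descriptors(raw):
    """Lower-case, drop modifier terms, map synonyms, and de-duplicate in order."""
    cleaned = []
    for term in raw:
        term = term.strip().lower()
        if not term or term in MODIFIERS:
            continue
        term = SYNONYMS.get(term, term)
        if term not in cleaned:
            cleaned.append(term)
    return cleaned


def filter_by_frequency(compound_descriptors, min_count=6):
    """Keep only descriptors that occur at least `min_count` times.

    The default of 6 corresponds to "more than five times".
    """
    cleaned = {name: clean_descriptors(d) for name, d in compound_descriptors.items()}
    counts = Counter(term for terms in cleaned.values() for term in terms)
    kept = {term for term, n in counts.items() if n >= min_count}
    return {name: [t for t in terms if t in kept] for name, terms in cleaned.items()}


# Toy data (invented); a lower threshold is used so the effect is visible.
toy = {
    "compound_a": ["Apple-like", "pleasant", "sweet"],
    "compound_b": ["apple", "grassy"],
    "compound_c": ["grass", "warm"],
}
result = filter_by_frequency(toy, min_count=2)
print(result)
```

With the toy data, “sweet” is dropped because it occurs only once, while “apple” and “grass” survive after synonym mapping.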

2.2. Computation and encoding of molecular structures by KNIME

Molecular structure calculations and encoding were performed using KNIME software (v 5.1.1), an open-source workflow platform that offers a wide range of functionalities, an active cheminformatics community, and a variety of available plugins (Beisken et al., 2013). The Simplified Molecular-Input Line-Entry System (SMILES) strings of the compounds were used as inputs to the workflow, and various open-source plugins were employed to calculate the chemical features of the molecules (Sharma et al., 2021). Specifically, the molecular structures of the compounds were computed using the “RDKit from molecule” plugin, followed by the identification and analysis of functional groups. The “RDKit Fingerprint” plugin was used to encode the compounds as extended-connectivity fingerprints (ECFP). ECFP represents the presence or absence of substructures within a specified radius around each atom of a small molecule and encodes them as binary data, where 1 indicates presence and 0 indicates absence (Capecchi et al., 2020). The following parameters were configured: radius = 2, which yields extended-connectivity fingerprints with a diameter of 4 (ECFP4), and number of bits = 1024, the suitability of which has been demonstrated in prior studies (Probst & Reymond, 2020; Rugard et al., 2021).

2.3. Dimension reduction from the 1024-bit fingerprints

To facilitate visualization and obtain low-dimensional representations, four dimensionality reduction techniques, namely PCA, MDS, t-SNE, and UMAP, were applied to the 1024-bit encoded molecular structures. PCA, MDS, and t-SNE were carried out in R (v 4.3.1), whereas UMAP was implemented in a Jupyter Notebook environment using the umap-learn library in Python. PCA was performed using the PCA function of the FactoMineR package in R, and two principal components were extracted to enable intuitive visualization of the data in a two-dimensional space. MDS was implemented using the stats package in R. In contrast to PCA, MDS focuses on preserving the similarities or distances between the original data points, from which a low-dimensional representation is constructed. To this end, the dist function was first used to calculate the Euclidean distances between data points, yielding a distance matrix. The cmdscale function was then applied to this distance matrix, mapping the high-dimensional data into a lower-dimensional space while preserving the relative positions of the data points with reasonable accuracy. The t-SNE algorithm was implemented using the Rtsne package. During computation, the Kullback-Leibler (KL) divergence between the similarity distributions of the data points in the original space and in the low-dimensional space is minimized. To give the algorithm sufficient time to converge to a stable low-dimensional representation, the maximum number of iterations was set to 1000. The theta and perplexity parameters were set to 0.5 and 216, respectively; these influence the execution of the algorithm and the shape of the final clusters. The ability of UMAP to preserve both local and global structure depends on the tuning of its hyperparameters, which play a crucial role in determining the quality of the resulting embedding. The NumPy, Pandas, and UMAP libraries were imported to set up the analysis, the dataset was loaded, and the UMAP model was initialized. Specifically, the number of neighbors (n_neighbors) = 15 means that each data point considers its 15 nearest neighbors when local neighborhoods are constructed. The minimum distance (min_dist) = 0.1 sets the minimum distance between data points in the low-dimensional space, which prevents points from losing separability through overcrowding and keeps them distinguishable after dimensionality reduction. Moreover, because molecular fingerprint data are binary, metric = “jaccard” was chosen as the distance metric between points, as it reflects the similarity among binary data.
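To make the choice of metric concrete, the following minimal sketch computes the Jaccard distance between two binary fingerprints in plain Python. It illustrates the distance definition only, not the umap-learn internals; the function name `jaccard_distance` and the 8-bit toy fingerprints are assumptions for illustration (the fingerprints in this study have 1024 bits).

```python
def jaccard_distance(fp_a, fp_b):
    """Jaccard distance between two equal-length binary fingerprints (0/1 sequences).

    distance = 1 - |bits set in both| / |bits set in either|
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Two all-zero fingerprints share no set bits; treat them as identical.
    return 0.0 if either == 0 else 1.0 - both / either


# Toy 8-bit fingerprints (invented for illustration).
fp_x = [1, 1, 0, 1, 0, 0, 0, 0]
fp_y = [1, 0, 0, 1, 1, 0, 0, 0]
print(jaccard_distance(fp_x, fp_y))  # 0.5
```

Here two bits are set in both fingerprints and four are set in at least one, so the distance is 1 - 2/4 = 0.5. Bits that are 0 in both molecules do not contribute, which suits sparse fingerprints in which most bits are unset.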

2.4. Clustering and visualization

To analyze the dimensionality-reduced data more comprehensively, K-means clustering was used to group structurally similar compounds (Rugard et al., 2021). This method helps identify underlying patterns that are not immediately apparent in the raw, high-dimensional feature space. The clustering results were then visualized to highlight the groupings and structures within the data. Clustering was implemented in R using the K-means algorithm, with the “cluster” and “factoextra” packages used for clustering and visualization, respectively. In this approach, a distance matrix is produced by calculating the Euclidean distances between observations. Before K-means clustering was performed, the optimal number of clusters was determined using the Kelley penalty function, which balances the number of clusters against the model fit; the number of clusters giving the minimum penalty score was selected (Li et al., 2023).
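The clustering in this study was run in R; the Python sketch below is only an illustration of the K-means procedure (Lloyd's algorithm) on two-dimensional embedding coordinates. The function names, the seeded random initialization, and the toy points are assumptions, and the selection of the number of clusters by the penalty function is not reproduced here.

```python
import random


def _sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: _sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable, so the run has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids


# Toy 2-D coordinates (invented) forming two well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels)
```

The cluster indices themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.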

2.5. Establishment of the “aroma-structure” model

To elucidate the correlation between the aroma and structure of the compounds, a meticulous analysis was carried out on the primary components within each cluster after the clustering. The analysis involves three pivotal steps: (1) Initial assessment of the aroma distribution patterns within the clusters, (2) Detailed exploration of the structural features inherent in the clusters, (3) Establishment of a correlation between aroma and structure through the application of rigorous statistical analysis methodologies.

The analysis of clustered aroma components focused on the representative aromas within each cluster, examined from two perspectives. First, the distribution of a specific aroma across the clusters was calculated relative to its total occurrence in the entire database (A%). Second, for an individual cluster, the percentage of compounds carrying a particular aroma was calculated relative to the total number of elements (aroma compounds) in that cluster (B%). The following equations were used (Rugard et al., 2021):

$$A\% = \frac{C}{D} \times 100\% \tag{1}$$

$$B\% = \frac{E}{F} \times 100\% \tag{2}$$

where C is the number of occurrences of the aroma within the cluster, D is the total number of occurrences of the aroma in the dataset, E is the number of occurrences of the aroma within the cluster (E = C), and F is the total number of elements (compounds) in the cluster.

For example, applying UMAP dimensionality reduction together with K-means clustering partitioned the dataset into four distinct clusters. Cluster C1 contained 256 odor compounds, and the aroma “fruit” occurred 108 times within C1 and 166 times across the entire dataset. Using Eqs. (1) and (2), A and B are calculated as follows:

$$A\% = \frac{108}{166} \times 100\% = 65.06\% \tag{3}$$

$$B\% = \frac{108}{256} \times 100\% = 42.19\% \tag{4}$$

Thus, 65.06 % of the “fruit” occurrences in the dataset were located in C1, and compounds with a fruity aroma accounted for 42.19 % of the compounds in C1. The distribution of structures within the different clusters was calculated in the same manner.
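The two ratios can be checked with a short Python sketch that follows the definitions of C, D, E, and F in Eqs. (1) and (2): A% divides the in-cluster count by the dataset-wide count of the aroma, and B% divides it by the cluster size. The function name `aroma_shares` is hypothetical; the input numbers are those given for “fruit” in cluster C1.

```python
def aroma_shares(count_in_cluster, total_in_dataset, cluster_size):
    """Return (A%, B%) following the definitions in Eqs. (1) and (2).

    A% = C / D: share of all occurrences of the aroma that fall in the cluster.
    B% = E / F: share of the cluster's compounds that carry the aroma.
    """
    a_pct = 100.0 * count_in_cluster / total_in_dataset
    b_pct = 100.0 * count_in_cluster / cluster_size
    return a_pct, b_pct


# "fruit" in cluster C1: 108 occurrences in C1, 166 in the dataset, 256 compounds in C1
a_pct, b_pct = aroma_shares(108, 166, 256)
print(f"A% = {a_pct:.2f}, B% = {b_pct:.2f}")  # A% = 65.06, B% = 42.19
```

A high A% indicates that an aroma is concentrated in one cluster, whereas a high B% indicates that the aroma is typical of the compounds in that cluster; the two need not coincide.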

2.6. Application to actual Baijiu samples

2.6.1. Sampling and sample preparation

Light-aroma type Baijiu (LAB), provided by XingHuaCun Fenjiu Group Co., Ltd. (Shanxi, China), was used in this study. Baijiu samples were diluted with ultrapure water to 15 % ethanol (v/v) and saturated with NaCl. The diluted sample was then transferred to a separatory funnel and extracted three times with CH₂Cl₂. The organic phases were combined and dried overnight over anhydrous Na₂SO₄. The extract was concentrated to 500 μL under a gentle stream of nitrogen and subsequently analyzed by GC-O-MS.
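The dilution to 15 % ethanol (v/v) can be estimated from the relation C1 × V1 = C2 × V2. The sketch below assumes additive volumes (ethanol-water volume contraction is ignored); the sample volume and starting strength in the example are hypothetical, since neither is stated in the text, and the function name `water_to_add` is illustrative.

```python
def water_to_add(sample_ml, abv_start, abv_target=15.0):
    """Volume of water (mL) needed to dilute `sample_ml` from abv_start to abv_target.

    Assumes additive volumes (C1 * V1 = C2 * V2); real ethanol-water mixtures
    contract slightly, so this is only a first estimate.
    """
    if not 0 < abv_target <= abv_start:
        raise ValueError("target strength must be positive and not exceed the start")
    final_ml = sample_ml * abv_start / abv_target
    return final_ml - sample_ml


# Hypothetical example: 30 mL of a 45 % (v/v) sample diluted to 15 % (v/v)
print(water_to_add(30, 45))  # 60.0
```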

2.6.2. GC-O-MS analysis

The instrument parameters for GC-O-MS were adapted from a previous report with modifications (Dong et al., 2018). GC-O-MS analyses were performed using an Agilent 7890B gas chromatograph equipped with an Agilent 5977A mass selective detector (MSD) and a sniff port (ODP3, Gerstel, Germany). The samples were analyzed on a DB-FFAP column (60 m × 0.25 mm i.d., 0.25 μm film thickness; J&W Scientific, USA). The olfactory port temperature was set to 250 °C, the ion source temperature to 230 °C, and the transfer line temperature to 250 °C. Mass spectra were acquired in electron ionization (EI) mode at 70 eV. The injection volume was 1 μL, and helium was used as the carrier gas. Aroma compounds were identified in full scan mode over a mass range of 35 to 400 amu. The oven temperature for the DB-FFAP column was increased from 40 °C to 50 °C at 10 °C/min (held for 6 min), then from 50 °C to 80 °C at 3 °C/min (held for 6 min), and finally at 5 °C/min to 235 °C (held for 10 min).
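As a quick check of the oven program, the total run time can be computed from the ramp rates and hold times. The sketch below assumes that there is no initial hold at 40 °C, which the text does not specify; the function name `oven_program_minutes` is hypothetical.

```python
def oven_program_minutes(start_temp, steps, initial_hold=0.0):
    """Total run time (min) of a temperature program.

    `steps` is a list of (rate_c_per_min, target_temp_c, hold_min) tuples.
    """
    total = initial_hold
    temp = start_temp
    for rate, target, hold in steps:
        total += abs(target - temp) / rate + hold  # ramp time plus hold time
        temp = target
    return total


# Program as read from the text; no initial hold at 40 °C is assumed.
program = [(10, 50, 6), (3, 80, 6), (5, 235, 10)]
total_min = oven_program_minutes(40, program)
print(total_min)  # 64.0
```

The three segments contribute 1 + 6, 10 + 6, and 31 + 10 min, respectively, giving 64 min under this assumption.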

Olfactometric analysis was performed by three trained panelists (two females and one male) who were part of the Beijing Key Laboratory of Flavor Chemistry at Beijing Technology and Business University. During GC analysis, each panelist independently evaluated the same sample using a sniff port. The aroma attributes, retention time, and intensity of the odor types present in the sample extract were recorded. To ensure accuracy and minimize the risk of overlooking or misidentifying individual odor-active compounds, only odorants detected by a minimum of two assessors were documented. Each panelist sniffed each extract in triplicate.
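The rule that only odorants detected by at least two assessors were documented can be sketched as follows. How odor events were matched across panelists is not described, so the sketch assumes that each event has already been reduced to a common key; the panelist IDs, retention indices, and descriptors in the example are hypothetical.

```python
def consensus_odorants(detections, min_assessors=2):
    """Keep odor events reported by at least `min_assessors` panelists.

    `detections` maps a panelist ID to the set of odor events that panelist
    perceived; an event is any hashable key, here (retention index, descriptor).
    """
    counts = {}
    for events in detections.values():
        for event in set(events):  # count each event once per panelist
            counts[event] = counts.get(event, 0) + 1
    return sorted(e for e, n in counts.items() if n >= min_assessors)


# Hypothetical records from three panelists.
panel = {
    "P1": {(1329, "roast"), (1450, "fruit")},
    "P2": {(1329, "roast")},
    "P3": {(1450, "fruit"), (1600, "sweet")},
}
print(consensus_odorants(panel))  # [(1329, 'roast'), (1450, 'fruit')]
```

In practice, retention indices reported by different panelists would need to be matched within a tolerance before counting, which this sketch leaves out.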

3. Results and discussion

3.1. Data sets

The construction of the “odor-compound” database is crucial for the subsequent analysis of the “aroma-structure” correlation. In this study, a total of 646 aroma compounds and 70 aroma descriptors were compiled, with 530 compounds from the open-access aroma database and 149 compounds in Baijiu with OAV > 1. Table 1 presents the compounds with OAV greater than 1 in Baijiu, along with their descriptors. The number of compounds far exceeds the number of aroma descriptors, indicating that the same aroma can be generated by different compounds; a single compound may also carry multiple aroma descriptors. Fig. S1A shows that most compounds have 1–3 descriptive terms. Notably, four compounds, “butyric acid,” “2-undecanone,” “2-methyl-1-propanol,” and “guaiacol,” have the highest number of descriptive terms, up to nine. In Fig. S1B, the horizontal axis represents the frequency of the specific descriptive terms, and the vertical axis indicates the number of compounds that generate a specific number of these terms. As shown in Table S1, the frequency of aroma descriptive terms ranges from 5 to 166, and the three descriptors “fruit,” “flower,” and “sweet” occur most frequently. These are also important aroma attributes in the flavor profile of Baijiu, especially in light-aroma Baijiu, where floral and fruity notes are the main flavor characteristics (Li et al., 2023; Sun et al., 2022).

Table 1.

Significant aromatic compounds in Baijiu with OAV > 1.

No.	CAS	Aroma compounds	Descriptor	OAV	SMILES
1	109-60-4	propyl acetate	fruit	2	CCCOC(=O)C
2	25415-67-2	ethyl 4-methylvalerate	fruit	59–872	CCOC(=O)CCC(C)C
3	105-79-3	isobutyl hexanoate	fruit	1–2	CCCCCC(=O)OCC(C)C
4	687-47-8	ethyl L (-)-lactate	fruit	6.2	CCOC(=O)C(C)O
5	3842-03-3	Isovaleraldehyde diethyl acetal, (1,1-diethoxy-3-methylbutane)	fruit	2	CCOC(CC(C)C)OCC
6	51115-64-1	2-methylbutyl butyrate	fruit, flower	0.7–43	CCCC(=O)OCC(C)CC
7	106-27-4	isoamyl butyrate	fruit, flower	1–120	CCCC(=O)OCCC(C)C
8	142-92-7	hexyl acetate	fruit, flower	1–4	CCCCCCOC(=O)C
9	6290-37-5	2-phenylethyl hexanoate	fruit, flower	2–19	CCCCCC(=O)OCCC1=CC=CC=C1
10	23726-91-2	damascone	apple^✳, flower, fruit	4–257.1	CC=CC(=O)C1=C(CCCC1(C)C)C
11	103-82-2	phenylacetic acid	flower^✳, honey, fruit^✳	3	C1=CC=C(C=C1)CC(=O)O
12	111-87-5	1-octanol	fruit, flower, grass	0–2.2	CCCCCCCCO
13	106-30-9	ethyl heptanoate	fruit, pineapple, flower	1–43.8	CCCCCCC(=O)OCC
14	78-70-6	linalool	flower, fruit^✳, lavender^✳, acid	13.4	CC(=CCCC(C)(C=C)O)C
15	122-97-4	3-phenyl-1-propanol	anise^✳, cinnamon^✳, fruit, flower	4.1	C1=CC=C(C=C1)CCCO
16	23696-85-7	β-damascenone	fruit, flower, sweet, honey	164–829	CC=CC(=O)C1=C(C=CCC1(C)C)C
17	123-92-2	isoamyl acetate	banana, fruit, flower, sweet	7–1000	CC(C)CCOC(=O)C
18	101-97-3	ethyl phenylacetate	sweet, fruit, flower, rose, honey	1–4.1	CCOC(=O)CC1=CC=CC=C1
19	105-54-4	ethyl butyrate	fruit, apple, pineapple, flower, sweet	7–1000	CCCC(=O)OCC
20	111-27-3	1-hexanol	resin, flower, green, fruit, grass, nut	1–10	CCCCCCO
21	103-45-7	phenethyl acetate	honey, rose, tobacco^✳, fruit^✳, flower^✳, sweet^✳	1–8.6	CC(=O)OCCC1=CC=CC=C1
22	141-78-6	ethyl acetate	pineapple, fruit, apple, flower, sweet, faint scent	2–99.3	CCOC(=O)C
23	123-66-0	ethyl hexanoate	apple, fruit, banana, flower, sweet, alcohol, cellar	10–197.4	CCCCCC(=O)OCC
24	123-51-3	3-methyl-1-butanol	burnt^✳, malt, whiskey^✳, fruit, flower, sweet, nail polish, mildew, empyreumatique, bitter	1–11	CC(C)CCO
25	106-33-2	ethyl laurate	leaf^✳, fruit, flower, sweet, faint scent, acid, walnut	1–8	CCCCCCCCCCCC(=O)OCC
26	100-52-7	benzaldehyde	almond, caramel, fruit, flower, nut, cherry	0–5	C1=CC=C(C=C1)C=O
27	110-38-3	ethyl caprate	grape, fruit, pineapple, pine, wax, flower, sweet	1–6.3	CCCCCCCCCC(=O)OCC
28	2021-28-5	ethyl 3-phenylpropionate	flower, fruit, pineapple, rose, sweet, honey, alcohol	1–108.8	CCOC(=O)CCC1=CC=CC=C1
29	71-23-8	1-propanol	fruit, flower, yeast, alcohol, grass, green, solvent, bitter	2–14.2	CCCO
30	106-32-1	ethyl caprylate	fat^✳, fruit, pear, litchi, sesame, almond, oil, balsamic, flower, alcohol	4–1000	CCCCCCCC(=O)OCC
31	66840-71-9	DMST	fruit, sweet	7285.2	CC1=CC=C(C=C1)NS(=O)(=O)N(C)C
32	97-62-1	ethyl isobutyrate	rubber, fruit, sweet, yeast, pungent	1–65.7	CCOC(=O)C(C)C
33	539-82-2	ethyl valerate	fruit, apple, strawberry, sweet, yeast, mold culture	0.1–1000	CCCCC(=O)OCC
34	124-07-2	octanoic acid	sweet^✳, cheese, sweat, fruit, oil, balsamic, acid, rancid	1–1.2	CCCCCCCC(=O)O
35	108-64-5	ethyl isovalerate	fruit, apple, pineapple, sweet, banana, alcohol, water smell	10.9–1000	CCOC(=O)CC(C)C
36	112-12-9	2-undecanone	fresh^✳, green^✳, orange^✳, fruit, sweet, oil, balsamic, cream, citrus	2.9	CCCCCCCCCC(=O)C
37	78-83-1	2-methyl-1-propanol	bitter, solvent, wine, fruit, sweet, alcohol, pine, malt, mildew, rubber	1–17.6	CC(C)CO
38	539-88-8	ethyl levulinate	fruit, apple	2–22	CCOC(=O)CCC(=O)C
39	6378-65-0	hexyl hexanoate	apple^✳, peach^✳, fruit	3–34	CCCCCCOC(=O)CCCCC
40	110-19-0	isobutyl acetate	apple^✳, banana^✳, fruit, rum	1.5–2.3	CC(C)COC(=O)C
41	7452-79-1	ethyl 2-methylbutyrate	apple^✳, fruit, pineapple, sweet	5–1000	CCC(C)C(=O)OCC
42	513-85-9	2,3-butanediol	fruit, onion^✳	4–6	CC(C(C)O)O
43	111-35-3	3-ethoxy-1-propanol	fruit^✳, alcohol	20.6	CCOCCCO
44	543-49-7	2-heptanol	mushroom^✳, fruit	4	CCCCCC(C)O
45	7789-92-6	1,1,3-triethoxypropane	fruit, earth, green, vegetable, mushroom	0–2.6	CCOCCC(OCC)OCC
46	107-87-9	2-pentanone	ether^✳, fruit	2	CCCC(=O)C
47	75-07-0	acetaldehyde	ether^✳, malt^✳, pungent^✳, fruit, grass, cream, aldehyde, bran	9.5–84	CC=O
48	97-64-3	ethyl lactate	fruit, grass	1–10	CCOC(=O)C(C)O
49	105-57-7	acetal	cream^✳, fruit, grass, vegetable	24–1000	CCOC(C)OCC
50	590-86-3	3-methylbutanal	fruit, grass, green, malt, cocoa	2.3–10,000	CC(C)CC=O
51	3777-69-3	2-pentylfuran	butter^✳, green, bean^✳, fruit, grass, cream, earth, vegetable	3	CCCCCC1=CC=CO1
52	51755-83-0	3-mercapto-1-hexanol	sulfur, fruit	1443–10,058	CCCC(CCO)S
53	78-92-2	2-butanol	wine^✳, fruit, alcohol, nut, yeast, malt, solvent	1.9–1000	CCC(C)O
54	628-63-7	amyl acetate	fruit, banana	1.2	CCCCCOC(=O)C
55	105-37-3	ethyl propionate	fruit, banana, nail polish	1–34.1	CCC(=O)OCC
56	71-36-3	1-butanol	fruit, medicine, banana, alcohol, oil, balsamic, pungent, solvent	1–10	CCCCO
57	628-97-7	palmitic acid ethyl ester	fruit, cream	2–53	CCCCCCCCCCCCCCCC(=O)OCC
58	103-36-6	ethyl cinnamate	cinnamon^✳, honey^✳, fruit	20.3–28.0	CCOC(=O)C=CC1=CC=CC=C1
59	103-52-6	phenethyl butyrate	flower, rose	314	CCCC(=O)OCCC1=CC=CC=C1
60	689-67-8	6,10-dimethyl-5,9-undecadien-2-one	flower, sweet	2	CC(=CCCC(=CCCC(=O)C)C)C
61	111-13-7	2-octanone	flower, soap	1–10	CCCCCCC(=O)C
62	821-55-6	2-nonanone	flower, cream	1–12.3	CCCCCCCC(=O)C
63	23726-93-4	beta-damascenone	apple, rose^✳, honey, flower	93	CC=CC(=O)C1=C(C=CCC1(C)C)C
64	121-33-5	vanillin	flower, sweet, vanilla	0–10	COC1=C(C=CC(=C1)C=O)O
65	79-77-6	β-ionone	seaweed^✳, flower, raspberry^✳, violet^✳	1–3.5	CC1=C(C(CCC1)(C)C)C=CC(=O)C
66	143-08-8	1-nonanol	fat^✳, green^✳, flower, oil, grass, citrus	3.8–10	CCCCCCCCCO
67	104-61-0	gamma-nonanolactone	coconut, peach^✳, flower, sweet, cream, nut	0–27.5	CCCCCC1CCC(=O)O1
68	122-78-1	phenylacetaldehyde	sweet, hawthorne^✳, honey, flower, rose, sweet	1–19	C1=CC=C(C=C1)CC=O
69	60-12-8	phenethyl alcohol	honey, lilac^✳, spice^✳, flower, rose, sweet, Chinese rose	1–2.7	C1=CC=C(C=C1)CCO
70	96-48-0	gamma butyrolactone	sweet, caramel	1–18	C1CC(=O)OC1
71	4466-24-4	2-butylfuran	empyreumatique, sweet	464–1167	CCCCC1=CC=CO1
72	13529-27-6	2-furaldehyde diethyl acetal	sweet, pine, spice	7	CCOC(C1=CC=CO1)OCC
73	513-86-0	acetoin	butter, cream, sweet, acid	1–195	CC(C(=O)C)O
74	620-02-0	5-methyl furfural	almond^✳, caramel, caramel^✳, sweet	2.2	CC1=CC=C(O1)C=O
75	98-00-0	furfuryl alcohol	burnt^✳, sweet, caramel, empyreumatique, alcohol	1–44	C1=COC(=C1)CO
76	109-52-4	valeric acid	sweet, cream, acid, rancid, sweat, cellar, earth	1–49.9	CCCCC(=O)O
77	2785-89-9	4-ethyl-2-methoxyphenol	sweet, spice, clove, medicine, smoke, pungent	1–10.8	CCC1=CC(=C(C=C1)O)OC
78	79-31-2	isobutyric acid	sweet^✳, butter^✳, cheese, rancid, acid, sweat, pungent	0–10	CC(C)C(=O)O
79	98-01-1	furfural	sweet, almond, bread, pine, nut, almond, burnt odor, potato, caramel, bran, roast, aging aroma	1–11.2	C1=COC(=C1)C=O
80	2463-53-8	2-nonenal	paper^✳, green	255	CCCCCCC=CC=O
81	2198-61-0	isoamyl hexanoate	grass, green	1–71	CCCCCC(=O)OCCC(C)C
82	557-48-2	(E, Z)-2,6-nonadienal	cucumber, green, wax^✳	10–66	CCC=CCCC=CC=O
83	66-25-1	hexanal	pine, grass, green	0–133	CCCCCC=O
84	78-84-2	isobutyraldehyde	green^✳, pungent^✳, grass, malt	1.2	CC(C)C=O
85	18829-56-6	trans-nonenal	cucumber^✳, fat^✳, green^✳, oil, balsamic	3–22	CCCCCCC=CC=O
86	124-13-0	Octanal	fat, green, lemon, soap, honey, orange	130	CCCCCCCC=O
87	124-19-6	1-nonanal	fat^✳, oil, balsamic, grass, soap, wax	0–354	CCCCCCCCC=O
88	111-71-7	heptaldehyde	citrus^✳, cocoa^✳, fat^✳, medicine^✳, rancid^✳, grass	66	CCCCCCC=O
89	107-92-6	butyric acid	cheese, fat, rancid, sweat, oil, balsamic, acid, cellar, earth, pungent	1–56.0	CCCC(=O)O
90	97-53-0	eugenol	spice, smoke	21	COC1=C(C=CC(=C1)CC=C)O
91	28664-35-9	4,5-dimethyl-3-hydroxy-2,5-dihydrofuran-2-one	cotton candy^✳, maple^✳, spice, caramel, herb	5	CC1C(=C(C(=O)O1)O)C
92	90-05-1	guaiacol	oil, balsamic, pine, spice, clove, medicine, grain, smoke, pungent	1–10	COC1=CC=CC=C1O
93	76-49-3	bornyl acetate	pine, herb	230	CC(=O)OC1CC2CCC1(C2(C)C)C
94	3913-81-3	3-heptylacrolein	oil, balsamic	34	CCCCCCCC=CC=O
95	2548-87-0	(E)-2-octenal	oil, balsamic	15–11,515	CCCCCC=CC=O
96	107-03-9	1-propanethiol	boiled egg, garlic	25–26	CCCS
97	111-14-8	heptanoic acid	oil, balsamic, sweat	1–2	CCCCCCC(=O)O
98	25152-84-5	trans, trans-2,4-decadien-1-al	oil, balsamic, chicken, cucumber	3391	CCCCCC=CC=CC=O
99	64-19-7	acetic acid	acid, oil, balsamic, rancid, pungent	1–7.5	CC(=O)O
100	112-31-2	decyl aldehyde	orange, soap^✳, tallow^✳, oil, balsamic	3.6–10	CCCCCCCCCC=O
101	503-74-2	Isovaleric acid	acid, rancid, sweat, oil, balsamic, cream, pungent, sauce	1–10	CC(C)CC(=O)O
102	123-07-9	4-ethylphenol	smoke, animal, pungent	1–4	CCC1=CC=C(C=C1)O
103	3658-80-8	dimethyl trisulfide	cabbage, fish^✳, sulfur, pungent, onion, ether, pickles, gas odor, cabbage	1–138.7	CSSSC
104	1124-11-4	Tetramethylpyrazine	nut	2	CC1=C(N=C(C(=N1)C)C)C
105	16630-66-3	methyl(methylthio)acetate	nut, potato	15–23	COC(=O)CSC
106	1534-08-3	(s)-methylthioacetate	nut, potato	9	CC(=O)SC
107	13925-03-6	2-ethyl-6-methylpyrazine	roast, nut, potato	23–1923	CCC1=NC(=CN=C1)C
108	108-50-9	2,6-dimethylpyrazine	roast, beef^✳, nut	2–15	CC1=CN=CC(=N1)C
109	124-06-1	ethyl myristate	ether^✳, yeast, coconut, sauce	0.1–9.7	CCCCCCCCCCCCCC(=O)OCC
110	14667-55-1	trimethyl-pyrazine	roast, nut, peanuts, peppers, coffee	2–225	CC1=CN=C(C(=N1)C)C
111	100-53-8	benzyl mercaptan	roast	487–12,538	C1=CC=C(C=C1)CS
112	13678-68-7	furfuryl thioacetate	roast, sulfur	3–22	CC(=O)SCC1=CC=CO1
113	5405-41-4	ethyl 3-hydroxybutyrate	marshmallow^✳, alcohol, roast, grass, solvent	6.6	CCOC(=O)CC(C)O
114	98-02-2	2-furylmethanethiol	roast, sesame	135	C1=COC(=C1)CS
115	136954-20-6	3-mercaptohexyl acetate	sulfur	327–377	CCCC(CCOC(=O)C)S
116	74-93-1	methyl mercaptan	gasoline^✳, garlic^✳, sulfur, cabbage, rubber	273	CS
117	624-92-0	dimethyl disulfide	rancid^✳, cabbage^✳, onion, aging aroma, sulfur	4–5	CSSC
118	106-36-5	propyl propionate	pineapple^✳, solvent	2	CCCOC(=O)CC
119	16423-19-1	(+/−)-geosmin	earth	10	CC1CCCC2(C1(CCCC2)O)C
120	19700-21-1	geosmin	beet^✳, earth	6–64	CC1CCCC2(C1(CCCC2)O)C
121	3391-86-4	1-octen-3-ol	mushroom, grass, earth, grain	1–1000	CCCCCC(C C)O
122	96-76-4	2,4-di-tert-butylphenol	smoke, lemon	2–6	CC(C)(C)C1 CC( C(C C1)O)C(C)(C)C
123	108-95-2	phenol	phenolic resin, smoke	2	C1 CC C(C C1)O
124	7786-61-0	4-hydroxy-3-methoxystyrene	curry^✳, clove^✳, smoke	3.6	COC1 C(C CC( C1)C C)O
125	93-51-6	2-methoxy-4-methylphenol	pine, smoke, phenol, medicine, sauce	1–9.3	CC1 CC( C(C C1)O)OC
126	106-44-5	p-cresol	medicine^✳, phenol^✳, smoke, medicine, animal, cellar, earth, stable	1–1.9	CC1 CC C(C C1)O
127	28588-74-1	2-methyl-3-furanthiol	medicine, empyreumatique	33–135	CC1 C(C CO1)S
128	79-09-4	propionic acid	acid, rancid, sweat	1–2.5	CCC( O)O
129	646-07-1	4-methylvaleric acid	acid, sweat, rancid	4.9	CC(C)CCC( O)O
130	142-62-1	hexanoic acid	cream, acid, rancid, sweat, animal	1–6	CCCCCC( O)O
131	57-55-6	1,2-propane diol	alcohol	10–100	CC(CO)O
132	10348-47-7	ethyl 2-hydroxy-4-methylvalerate	fresh	1–16	CCOC( O)C(CC(C)C)O
133	124-76-5	isoborneol	camphor	3	CC1(C2CCC1(C(C2)O)C)C
134	83-34-1	3-methylindole	camphor^✳, fecal, cellar	8	CC1 CNC2 CC CC C12
135	75-18-3	dimethyl sulfide	cabbage, onion	4–14	CSC
136	489-41-8	(−)-globulol	pine	2	CC1CCC2C1C3C(C3(C)C)CCC2(C)O
137	75-08-1	ethanethiol	onion, rubber	94	CCS
138	30899-19-5	3-methylbutanol	nail polish, malt, rancid, mildew	1–3.7	CCCCCO
139	13184-86-6	4-[ethoxymethyl]-2-methoxyphenol	cocoa, vanilla	5	CCOCC1 CC( C(C C1)O)OC
140	3268-49-3	3-(methylthio)propionaldehyde	potato	10	CSCCC O
141	431-03-8	2,3-butanedione	butter	4–80	CC( O)C( O)C
142	96-17-3	2-methylbutyraldehyde	mildew	1–512	CCC(C)C O
143	505-10-2	3-methylthiopropanol	salty	1–35.8	CSCCCO
144	2639-63-6	hexyl butyrate	fruit	1–38	CCCCCCOC( O)CCC
145	138-86-3	limonene	fruit	1–6.4	CC1 CCC(CC1)C( C)C
146	91-20-3	naphthalene	green, pungent, empyreumatique	1.0	C1 CC C2C CC CC2 C1
147	2437-95-8	α-pinene	turpentine	61	CC1 CCC2CC1C2(C)C
148	123-35-3	β-myrcene	pine, grass	13	CC( CCCC( C)C C)C
149	515-13-9	β-elemene	fennel	6	CC( C)C1CCC(C(C1)C( C)C)(C)C C

The compounds listed in the table are significant aroma compounds in Baijiu with odor activity values (OAV) greater than 1, collected from various literature sources. The asterisk (✳) marks terms found exclusively in the “Flavornet Home Page” aroma database.

3.2. Dimension reduction, clustering and visualization of the data

The chemical structures of all 646 compounds were represented by 1024-bit molecular fingerprints encoded with the KNIME software. To visualize the data, K-means clustering was combined with four dimensionality reduction techniques (PCA, MDS, t-SNE, and UMAP), which project the data onto a two-dimensional coordinate system. The Kelley penalty function was used to determine the optimal number of clusters (Kelley et al., 1996). In Fig. S2, the Kelley penalty scores for all four dimensionality reduction methods first decline and then rise as the number of clusters increases; the number of clusters at the minimum of the curve is taken as optimal. The optimal number of clusters was five for t-SNE, PCA, and MDS, and four for UMAP. Fig. 2 displays the clustering results for the four dimensionality reduction methods.
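The selection of the cluster number described above can be sketched in a few lines. This is only an illustration, assuming the commonly cited form of the Kelley penalty, in which the average within-cluster spread is rescaled to the range [1, N − 1] and the number of clusters is then added. The function name `kelley_penalty`, the value of `n_items`, and the spread values are hypothetical and are not taken from the paper.

```python
def kelley_penalty(spreads, cluster_counts, n_items):
    """Return one penalty value per candidate clustering.

    spreads: average within-cluster spread for each candidate clustering
    cluster_counts: number of clusters in each candidate clustering
    n_items: N in the rescaling; spreads are mapped onto [1, N - 1]
    """
    lo, hi = min(spreads), max(spreads)
    if hi == lo:
        raise ValueError("spreads must not all be identical")
    scale = (n_items - 2) / (hi - lo)
    # Normalised spread plus the number of clusters.
    return [scale * (s - lo) + 1 + k for s, k in zip(spreads, cluster_counts)]


# Hypothetical spread values: compactness improves quickly up to k = 4
# and only marginally afterwards.
candidate_k = [2, 3, 4, 5, 6, 7, 8]
avg_spread = [10.0, 7.0, 4.0, 3.6, 3.3, 3.1, 3.0]

penalties = kelley_penalty(avg_spread, candidate_k, n_items=9)
best_k = candidate_k[penalties.index(min(penalties))]
print(best_k)
```

With these made-up values the penalty falls and then rises again, and the minimum marks the chosen number of clusters, which mirrors the trend described for Fig. S2.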

Fig. 2 — The clustering outcomes of two-dimensional data generated from various dimensionality reduction methods applied to compounds. Specifically, A, B, C, and D represent the results of combining PCA, MDS, t-SNE, and UMAP dimensionality reduction techniques with K-means clustering, respectively.

Based on the colour distribution within each cluster, the visualizations generated by PCA, MDS, and t-SNE showed neighboring clusters that were closely interconnected. By comparison, UMAP produced denser, more compact clusters and left more empty space between clusters (Fig. 2D), indicating better separation of the data. This observation is consistent with other reports (Kobak & Linderman, 2021; Probst & Reymond, 2020; Rabasovic et al., 2023). It can be attributed to the UMAP algorithm and its adjustable parameters: data points that are close together in the high-dimensional space are drawn even closer in two dimensions, which leaves space between distinct groups.

The most significant outcome of this study is the ability to group compounds with similar structures and aromas into the same cluster using effective dimensionality reduction and clustering methods. To evaluate how well each approach achieved this, we calculated the aroma distribution within the clusters it generated. A dimensionality reduction technique is considered to classify a specific aroma well if it groups more than 50 % of the compounds with that aroma into the same cluster (Rugard et al., 2021). We therefore counted, for each technique, the number of aromas that met this criterion; a higher count indicates better classification performance. As shown in Fig. S3, this number was 34 for both PCA and MDS, 37 for t-SNE, and 49 for UMAP, which indicates a clear advantage of UMAP in classification performance. Table S1 presents the aroma distribution within the clusters generated by combining UMAP with K-means clustering. All subsequent analyses were carried out on the UMAP results.
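The 50 % criterion can be expressed compactly in code. The sketch below is an illustration only: the function name `well_grouped_aromas` is hypothetical, the descriptor lists are abbreviated, and the cluster labels assigned to the five example compounds are invented for the example rather than taken from Table S1.

```python
from collections import Counter, defaultdict


def well_grouped_aromas(compound_aromas, compound_cluster, threshold=0.5):
    """Return aromas whose largest single-cluster share exceeds the threshold."""
    per_aroma = defaultdict(Counter)
    for compound, aromas in compound_aromas.items():
        for aroma in aromas:
            per_aroma[aroma][compound_cluster[compound]] += 1

    result = {}
    for aroma, counts in per_aroma.items():
        cluster, n = counts.most_common(1)[0]
        share = n / sum(counts.values())
        if share > threshold:
            result[aroma] = (cluster, share)
    return result


# Abbreviated descriptors; cluster labels are hypothetical.
compound_aromas = {
    "2,6-dimethylpyrazine": ["roast", "nut"],
    "trimethyl-pyrazine": ["roast", "nut"],
    "benzyl mercaptan": ["roast"],
    "hexyl butyrate": ["fruit"],
    "limonene": ["fruit"],
}
compound_cluster = {
    "2,6-dimethylpyrazine": "C2",
    "trimethyl-pyrazine": "C2",
    "benzyl mercaptan": "C1",
    "hexyl butyrate": "C1",
    "limonene": "C3",
}

grouped = well_grouped_aromas(compound_aromas, compound_cluster)
print(len(grouped), sorted(grouped))
```

The length of the returned dictionary corresponds to the count compared across PCA, MDS, t-SNE, and UMAP. In this toy data "fruit" is split evenly between two clusters, so it does not exceed the threshold.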

3.3. Analysis of the cluster constituents: Structure-odor correlations

3.3.1. Aroma distribution of compounds

The clustering results of the two-dimensional data generated by UMAP were analyzed in detail. As shown in Table S1, the individual clusters had well-defined aroma distributions. In terms of overall composition, “fruit” is the predominant aroma in cluster C1, accounting for 42.19 % of the compounds in C1, far more than any other aroma. In cluster C2, three aromas stand out: “flower,” “sweet,” and “roast,” which account for 21.64 %, 22.81 %, and 18.13 % of the compounds in C2, respectively. Although “fruit,” “flower,” “sweet,” “spice,” and “herb” are all present in cluster C3, none of them clearly dominates. “Green” and “fat” are the most representative aromas of cluster C4. A more detailed analysis of how each aroma is distributed across the clusters showed that cluster C1 has the greatest diversity of aroma types. The majority of fruit-related aromas fall in C1: 65.06 % of “fruit,” 90.91 % of “apple,” 90.91 % of “pineapple,” 100 % of “banana,” and 71.43 % of “lemon” aroma compounds. A substantial proportion of vegetable odors, including “onion,” “cabbage,” and “mushroom,” are also present. In addition, 56 % of the “sulfur” odor is assigned to C1, which agrees with the fact that onion and cabbage are typical sources of sulfur compounds (Sun et al., 2022). Most of the “alcohol” (78.57 %), “solvent” (88.89 %), “ether” (88.89 %), “wine” (85.71 %), and “yeast” (100 %) aromas are also located in C1, as is a large share of the “acid,” “sweat,” and “rancid” aromas, which are generally associated with acidic compounds (Dong et al., 2024).
Cluster C2 contains over 70 % of the “roast,” “smoke,” “medicine,” “phenol,” and “clove” aromas, as well as 68.75 % of the “honey” aroma. Table S1 shows that cluster C3 covers a relatively limited range of aroma types, with nearly half of all aromas not represented in this cluster at all. Notably, C3 contains a group of refreshing aromas, specifically 73.68 % of the “mint” aroma and 69.23 % of the “camphor” aroma, along with a substantial proportion of the “peach” and “turpentine” aromas. In cluster C4, as noted above, “green” and “fat” are the predominant aroma types, accounting for 41.33 % and 24.00 % of the aroma compounds in C4, respectively. C4 also contains 88.89 % of the “cucumber” aroma and 62.5 % of the “leaf” aroma, which shows that green plant odor is the main aroma type in C4, and 46.15 % of all “fat” odor compounds fall in this cluster. Overall, although most compounds sharing an aroma are grouped into the same cluster, some aromas are scattered across clusters without an obvious pattern. Because clustering is based on structural similarity and the same aroma can be produced by structurally different substances, compounds that share an odor but differ in structure are likely to be assigned to different clusters.

3.3.2. Calculation and statistics of compound structure

The structures of the 646 molecules were analyzed with the KNIME software, and 26 distinct functional groups were identified. To increase the reliability of the results, the twelve functional groups that were present in at least 5 % of the molecules in at least one of the four clusters were selected for examination (Rugard et al., 2021). The distribution of these functional groups within each cluster is illustrated in Fig. 3. Most functional groups showed a distinct distribution. Cluster C1 contained over 50 % of the ester groups, alcohol hydroxyl groups, and carboxyl groups, which together made up 73.73 % of the total number of compounds in C1. The vast majority of sulfur-containing, nitrogen-containing, oxygen-containing, and nitrogen- and sulfur-containing heterocycles were located in C2, together with several ether bonds and phenolic hydroxyl groups. Cyclic compounds dominated C2, accounting for 87.73 % of its compounds. C3 showed no clearly representative functional group compared with the other clusters. Aldehydes predominate in C4, which contains 67.07 % of all aldehyde groups. These results show that most functional groups of the same class are assigned to the same cluster and are well represented within it, which highlights the effectiveness of the dimensionality reduction and clustering techniques employed.
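The 5 % selection rule for functional groups can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `select_groups`, the cluster sizes, and the per-cluster counts are all hypothetical and serve only to show the filter; they are not the values behind Fig. 3.

```python
def select_groups(group_counts, cluster_sizes, min_share=0.05):
    """Keep groups found in at least `min_share` of the molecules of any cluster."""
    selected = []
    for group, per_cluster in group_counts.items():
        shares = (
            per_cluster.get(cluster, 0) / size
            for cluster, size in cluster_sizes.items()
        )
        if any(share >= min_share for share in shares):
            selected.append(group)
    return selected


# Hypothetical cluster sizes and counts of molecules carrying each group.
cluster_sizes = {"C1": 200, "C2": 100}
group_counts = {
    "ester": {"C1": 120, "C2": 10},
    "N-heterocycle": {"C1": 1, "C2": 30},
    "nitrile": {"C1": 2, "C2": 1},
}

kept = select_groups(group_counts, cluster_sizes)
print(kept)
```

A group that is rare overall can still be kept if it is concentrated in one cluster, which is the point of applying the threshold per cluster rather than to the whole database.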

Fig. 3 — Distribution of selected functional groups in the clusters: red denoting cluster C1, black representing cluster C2, blue for cluster C3, and yellow indicating cluster C4. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

3.3.3. Construction of an “aroma-structure” predictive model

Although the distribution and proportion of each aroma and each functional group within the clusters are now known, this information alone does not directly link an aroma to a functional group, so a more detailed analysis is required. We therefore focused on the cluster in which each aroma is primarily distributed and calculated, for each aroma, the proportion of its compounds that carry each functional group, taken as the probability that the aroma is produced by that group. Cluster C1 has the most abundant distribution of aroma compounds: of the 166 “fruit” aroma molecules in the database, 108 are located in C1. As shown in Tables S2–S5, more than one functional group can contribute to the “fruit” aroma. Nevertheless, ester groups account for 68.52 % of the “fruit” aroma, indicating a strong correlation between the “fruit” aroma and ester groups. Among the other fruit-like aromas, all “apple” and “pineapple” aromas in C1 derive exclusively from ester groups, and 88.89 % of the “banana” aroma is also attributed to esters. These findings underscore the crucial contribution of ester groups to the pleasant fruity aroma of Baijiu and are consistent with earlier reports (Fan & Qian, 2005; Jin et al., 2017; Song et al., 2020; Song et al., 2021). The “flower” aroma, one of the most common aromas in Baijiu, is concentrated mainly in clusters C1 and C2. In C1, 54.35 % of the “flower” aroma is generated by ester groups and 41.30 % is attributed to alcohol hydroxyl groups. C2 contains 37 “flower” aroma molecules, of which 15 possess ether groups, 11 carry ester groups, and 9 feature phenolic hydroxyl groups.
Consequently, the “flower” aroma is related to ester groups, hydroxyl groups, and ether bonds. The “rose” aroma likewise shows a strong association with ester groups and alcohol hydroxyl groups. “Sweet,” another prevalent and pleasant aroma in Baijiu, is well represented in cluster C1: 34 sweet-smelling compounds are located in C1, of which 61.76 % carry ester groups. Certain alcohol hydroxyl groups, aldehyde groups, ether bonds, and O-heterocycles are also involved in “sweet” aromas, indicating that several structural features contribute to this desirable note. The creamy scents “cream” and “butter” are largely attributed to ketone groups. Among the 11 “alcohol” aroma compounds in C1, 8 carry hydroxyl groups, indicating a significant correlation between hydroxyl groups and the “alcohol” aroma. As shown in Table S2, carboxyl groups are closely tied to the “acid,” “sweat,” and “rancid” aromas: 71.43 % of the “acid,” 83.33 % of the “sweat,” and 83.33 % of the “rancid” odors in cluster C1 arise from carboxyl groups. This can be attributed to the fact that acidic compounds in Baijiu impart undesirable odors once they reach certain concentrations, as reported in prior studies.

Table S6 shows that cluster C2 is dominated by cyclic compounds. Considered together with the aroma distribution, C2 is rich in roasted and nutty aromas such as “roast,” “nut,” and “almond.” In detail, 83.78 % of the “roast” aroma is located in C2, and over half of these aroma compounds bear nitrogen-containing heterocyclic rings. Furthermore, 84.62 % of the “nut” aroma in cluster C2 is attributed to nitrogen-containing heterocycles.
Consequently, roasted and nutty aromas are primarily associated with nitrogen-containing heterocycles. As observed from the table, 85.71 % of the “potato” aroma is also produced by nitrogen-containing heterocycles. Notably, these compounds have a “roast” aroma in addition to the “potato” aroma, which suggests that nitrogen-containing heterocycles most likely contribute the characteristic aroma of roasted potatoes rather than a purely “potato” scent. Nitrogenous compounds occur in Baijiu at low concentrations but have low odor thresholds, so they contribute significantly to its flavor. Zhu et al. performed GC-O analysis on Maotai Baijiu and found that pyrazines impart “roasted and nutty” aromas and are a crucial component of its flavor profile, which agrees with our findings (Zhu et al., 2020). We also observed that the “clove” aroma is closely related to ether bonds and phenolic hydroxyl groups. For the “smoke” and “medicine” aromas, phenolic hydroxyl groups are the primary factor, with certain ether bonds also contributing. The “phenol” aroma is attributed solely to phenolic hydroxyl groups. In cluster C2, 6 of the 11 “honey” aroma compounds carry ester groups; as a subset of the “sweet” aroma, “honey” thus shows a similar pattern. Furthermore, 76.47 % of the “spice” aroma in cluster C2 is generated by ether bonds. The “earth” and “caramel” aromas show certain patterns in clusters C2 and C3. Specifically, the “earth” aroma is largely associated with ether bonds, nitrogen-containing heterocycles, and alcoholic hydroxyl groups.
In contrast, the “caramel” aroma is primarily associated with oxygen-containing heterocycles, alcoholic hydroxyl groups, ketone groups, and ether bonds. Earthy flavor is considered a common off-odor in Baijiu and is more pronounced in strong-aroma Baijiu. Dong et al. showed for the first time that 3-methylindole is the key compound responsible for the “mud odor” by investigating the mud-odor substances in strong-aroma Baijiu (Dong et al., 2018). Indoles are an important class of nitrogen-containing heterocyclic compounds, so this report is consistent with our findings and supports the relevance of the present study.

Cluster C3 lacks a prominent representative aroma, yet 66.67 % of the “camphor” aroma shows some correlation with alcoholic hydroxyl groups. The “peach” aroma is predominantly distributed in cluster C3 and is strongly associated with ester groups, in line with the conclusions drawn for the fruity aromas in cluster C1.

Cluster C4 has distinct functional group characteristics despite containing a relatively small proportion of the aromas. Aldehydes are the primary functional groups in C4, and its aroma profile is dominated by green vegetal and fatty odors. In detail, 37.35 % of the “green” aroma is concentrated in this cluster, corresponding to 31 aroma compounds, of which 27 carry aldehyde groups. Similarly, 87.50 % of the “cucumber” and “grass” aromas within this cluster show the same association with aldehyde groups. These findings collectively indicate a significant correlation between the aroma of green plants and aldehyde groups. Sun et al. used comprehensive molecular sensory science techniques to demonstrate that acetaldehyde (grass) and 3-methylbutanal (grass and malt) are important aroma-active compounds in fresh Xiaoqu Baijiu (Sun et al., 2022). Furthermore, fatty odors such as “fat,” “oil,” and “balsamic” also show a strong correlation with aldehyde groups.
Based on the above analyses, a “structure-odor” network correlation model was established; the results are presented in Fig. 4.

Fig. 4 — Aroma-structure prediction model. Thicker lines indicate a stronger correlation between aroma and structure, while thinner lines represent a weaker correlation.
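The proportions that underlie links of this kind can be illustrated with the compound counts quoted in Section 3.3.3. The sketch below is not the authors' implementation: the function name `edge_weights` and the data layout are assumptions, and the resulting fractions are simple ratios of the quoted counts rather than values reported in the paper.

```python
# Counts quoted in the text: (aroma, cluster) -> (number of compounds with
# that aroma in the cluster, {functional group: number carrying the group}).
counts = {
    ("flower", "C2"): (37, {"ether": 15, "ester": 11, "phenolic hydroxyl": 9}),
    ("alcohol", "C1"): (11, {"hydroxyl": 8}),
    ("honey", "C2"): (11, {"ester": 6}),
    ("green", "C4"): (31, {"aldehyde": 27}),
}


def edge_weights(counts):
    """Return (aroma, group, cluster, share) tuples, strongest link first."""
    edges = []
    for (aroma, cluster), (total, groups) in counts.items():
        for group, n in groups.items():
            edges.append((aroma, group, cluster, n / total))
    return sorted(edges, key=lambda edge: edge[3], reverse=True)


edges = edge_weights(counts)
for aroma, group, cluster, share in edges:
    print(f"{aroma:8s} {group:18s} {cluster}  {share:.2%}")
```

A larger share would correspond to a thicker line in a diagram such as Fig. 4. Because one compound can carry several functional groups, or none of those listed, the shares for a single aroma need not sum to 100 %.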

3.4. Research on the practical applications of real baijiu

3.4.1. Discovery of the unknown substance with strong roasted aroma

GC-O-MS results indicated that most of the aroma regions corresponded to peaks in the chromatogram and could be identified accurately. However, in certain regions an aroma was detected but the substance responsible could not be determined. Notably, a strong “roasted aroma” was perceived at a retention index of 1324–1334, indicating a significant contribution to the aroma profile of the sample. As shown in the chromatogram in Fig. 5, however, there were almost no peaks in this aroma region, suggesting that the concentrations of the responsible substances were too low to be detected. To identify the compounds contributing to this prominent “roasted aroma” in the Baijiu sample, the unknown substance within this retention range was investigated using the “aroma-structure” model.

Fig. 5 — TIC chromatogram of the light-aroma types of Baijiu. Panel A shows the full chromatogram, while Panel B presents a partial chromatogram. The red-marked region in Panel B corresponds to the intense “roasted aroma” detected between retention index of 1324 and 1334. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

3.4.2. Identification of compound with strong roasted aroma in baijiu

The preceding analysis elucidated the correlation between aroma and structure. Fig. 4 reveals a significant correlation between “roasted aroma” and “nitrogen heterocycles.” Therefore, compounds in the “aroma-compound” database that have both a “roasted aroma” and a nitrogen heterocycle can be prioritized to narrow the search (Fig. 6 illustrates the complete identification process). In addition, the retention index range of the unknown substance can be calculated from the start and end times of the aroma region, while the retention indices of the candidate substances can be obtained from the NIST website (CAS Number Search, nist.gov). A given substance shows similar retention indices on the same chromatographic column. The retention index can therefore serve as a screening criterion: known substances whose retention indices are close to that of the unknown are retained, which significantly reduces the number of substances that need to be analyzed. The calculated retention index of the aroma region in the Baijiu sample is 1324–1334. Generally, under the same chromatographic conditions, the deviation of the measured retention index from the standard value should be within 10–25 index units (i.u.) (Zhu et al., 2020). By comparison and screening, three compounds that could contribute to the “roasted aroma” of the Baijiu were identified: 2-acetyl-1-pyrroline, 2,5-dimethylpyrazine, and 2,6-dimethylpyrazine.

Fig. 6 — Flowchart for the qualitative analysis of unknown compounds responsible for the “roasted aroma” in Baijiu samples.
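The screening steps can be sketched as a short script. It assumes the linear (temperature-programmed) retention index formula based on bracketing n-alkanes, which the text does not specify, and the retention times are hypothetical. The retention indices of the two dimethylpyrazines are the values reported in this section; every other retention index, the unnamed candidates X, Y, and Z, and the function names are placeholders for illustration.

```python
def linear_retention_index(t_x, t_n, t_n1, n):
    """Linear retention index from retention times.

    t_x: retention time of the point of interest
    t_n, t_n1: retention times of the n-alkanes with n and n + 1 carbons
               that elute just before and just after t_x
    """
    return 100 * (n + (t_x - t_n) / (t_n1 - t_n))


def screen(candidates, ri_window, tolerance):
    """Keep roast-smelling N-heterocycles whose RI lies near the window."""
    low, high = ri_window[0] - tolerance, ri_window[1] + tolerance
    return [
        c["name"]
        for c in candidates
        if "roast" in c["aromas"] and c["n_heterocycle"] and low <= c["ri"] <= high
    ]


# Hypothetical retention times (min) for the start of an aroma region
# bracketed by the C13 and C14 n-alkanes.
print(round(linear_retention_index(20.96, 20.0, 24.0, 13)))

candidates = [
    # RI is a placeholder, not a measured or library value.
    {"name": "2-acetyl-1-pyrroline", "aromas": {"roast"}, "n_heterocycle": True, "ri": 1335},
    # RI values reported in the text.
    {"name": "2,5-dimethylpyrazine", "aromas": {"roast"}, "n_heterocycle": True, "ri": 1330},
    {"name": "2,6-dimethylpyrazine", "aromas": {"roast", "nut"}, "n_heterocycle": True, "ri": 1336},
    # Hypothetical entries that each fail one criterion.
    {"name": "candidate X", "aromas": {"roast"}, "n_heterocycle": True, "ri": 1400},
    {"name": "candidate Y", "aromas": {"roast"}, "n_heterocycle": False, "ri": 1330},
    {"name": "candidate Z", "aromas": {"nut"}, "n_heterocycle": True, "ri": 1330},
]

shortlist = screen(candidates, ri_window=(1324, 1334), tolerance=25)
print(shortlist)
```

Applying the aroma, structure, and retention index filters together is what narrows the database to a handful of compounds that can then be checked against standards.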

It is important to note that various aromas can be classified as “roasted aromas,” such as “roasted bread” and “roasted nuts.” To identify the unknown compounds responsible for the “roasted aroma” in the Baijiu samples, standard solutions of the three candidate compounds were analyzed by GC-O-MS using the same procedure. 2-Acetyl-1-pyrroline emitted a distinct popcorn-like aroma which, although categorized as a “roasted aroma,” did not match the odor perceived in the Baijiu samples, whereas the other two compounds had aromas much closer to it. 2-Acetyl-1-pyrroline is a key aroma compound in rice with an exceptionally low odor threshold of 0.02–0.04 ng/L, which makes it easy to perceive (Cai et al., 2024; Huang et al., 2024). Nevertheless, the sniffing experiment clearly excluded it. Consequently, the unknown substance producing the “roasted aroma” can only be 2,5-dimethylpyrazine or 2,6-dimethylpyrazine, which are positional isomers. Their mass spectra were extremely similar, with almost identical fragment ions. The elution of the two compounds was examined further in conjunction with Fig. S4. The calculated retention index is 1330 for 2,5-dimethylpyrazine and 1336 for 2,6-dimethylpyrazine, so both fall within the normal fluctuation range. With the current analytical method, the two compounds cannot be distinguished, and a more selective approach is required to differentiate them.
Therefore, after a rigorous screening process, we attributed the “roasted aroma” within the retention index range of 1324–1334 to 2,5-dimethylpyrazine and/or 2,6-dimethylpyrazine. These two compounds have rarely been reported in light-aroma Baijiu (LAB). Future studies using aroma recombination and omission experiments could confirm whether they contribute significantly to the “roasted aroma” of LAB.

The “roasted aroma” in Baijiu mainly originates from Maillard reaction products formed during the brewing process and is most prominent in sauce-flavored Baijiu. Typical thermal processes in the production of sauce-flavored Baijiu include high-temperature fermentation with Qu (65–70 °C), grain stacking, and multiple cycles of grain fermentation (42–45 °C) and distillation (100–105 °C). These distinctive brewing techniques are responsible for the characteristic baked and roasted aromas of sauce-flavored Baijiu, which are considered crucial criteria in its sensory evaluation and quality assessment (Sha et al., 2017). In contrast, strong-aroma Baijiu is produced with medium-temperature Qu preparation (55–60 °C), followed by fermentation at 32–35 °C and distillation at 95–102 °C (Dong et al., 2019). Light-aroma Baijiu is produced with low-temperature Qu preparation (typically not exceeding 60 °C), combined with low-temperature fermentation and a shorter fermentation duration. To date, 2,5-dimethylpyrazine and 2,6-dimethylpyrazine have been widely identified in both sauce-aroma and strong-aroma Baijiu and have been confirmed as important aroma compounds in sauce-aroma Baijiu. Although LAB is fermented at lower temperatures, a study by Van-Diep Le et al. on the volatile compounds at various stages of Fen-Daqu production showed that the temperature can reach 60 °C during the high-temperature stage of Qu making, which promotes the Maillard reaction; the study identified six pyrazine compounds, including 3-dimethylpyrazine, 2-ethyl-6-methylpyrazine, trimethylpyrazine, and tetramethylpyrazine (Van-Diep et al., 2012). These compounds are transferred into the spirit during the subsequent fermentation and brewing processes and contribute to the final flavor profile.
Therefore, from the perspective of compound formation mechanisms, these findings support the possibility that 2,5-dimethylpyrazine and 2,6-dimethylpyrazine were present in the light-aroma Baijiu samples used in this study and also support the effectiveness of the model. A likely reason why no distinct signal was detected in our samples is that the concentrations were below the detection limits of conventional detectors. Their significant contribution to flavor nonetheless highlights the need for further research.

4. Conclusions

In summary, we developed an “aroma-compound” database for Baijiu that includes 646 compounds and 70 aroma descriptors. The compounds were encoded as 1024-bit molecular fingerprints and analyzed using UMAP dimensionality reduction and K-means clustering to establish correlations between aroma and chemical structure. This approach is particularly valuable for low-concentration, low-threshold compounds in Baijiu, which may produce strong aromas but lack distinct mass spectrometric or chromatographic signals, making accurate identification difficult. By applying the model to real Baijiu samples, we identified the unknown compound responsible for an intense “roasted aroma” as 2,5-dimethylpyrazine or 2,6-dimethylpyrazine, thereby validating the effectiveness of the constructed “aroma-structure” model. In future studies, more advanced machine learning algorithms will be employed to process larger datasets and offer deeper insights into the “aroma-structure” relationship.

CRediT authorship contribution statement

Yintao Jia: Writing – original draft, Methodology, Conceptualization. Yue Qiu: Writing – review & editing, Visualization. Qi Deng: Visualization, Formal analysis, Data curation. Ying Han: Supervision, Conceptualization. Baoguo Sun: Supervision. Rong Liu: Resources. Pan Zhen: Resources. Wenxian Li: Writing – review & editing. Wei Dong: Writing – review & editing, Visualization, Supervision. Xiaotao Sun: Writing – review & editing, Visualization, Supervision. Xiao Yang: Visualization. Fan Cui: Resources.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Ethical approval

All procedures for sensory evaluation were carried out in accordance with relevant laws and institutional guidelines and were approved by the Scientific Research Academic Committee of Beijing Technology and Business University.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Key Research and Development Program of China (2022YFD2101205), National Natural Science Foundation of China (32102122), and National Engineering Research Center of Solid-State Brewing of Luzhou Laojiao Distillery Co., Ltd.

Footnotes

Appendix A. Supplementary data to this article can be found online at https://doi.org/10.1016/j.fochx.2025.102645.

Contributor Information

Wei Dong, Email: 20200812@btbu.edu.cn.

Xiaotao Sun, Email: sxt_btbu66@163.com.

Appendix A. Supplementary data

Supplementary material

The distribution of aroma generated by UMAP and K-means across different clusters (Table S1); relationship of “aroma-structure” (Tables S2–S5); proportion of functional groups in different clusters (Table S6). The relationship between odor descriptors and compounds (Fig. S1); statistics of the number of aroma descriptors with A values greater than 50 % (Fig. S2); results of the Kelley penalty function (Fig. S3); chromatogram and mass spectrometry of the Baijiu sample and two compounds (Fig. S4).

mmc1.docx (1.5 MB, docx)

Data availability

Data will be made available on request.

References

Becht E., McInnes L., Healy J., Dutertre C.A., Kwok I.W.H., Ng L.G.…Newell E.W. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology. 2019;37(1):38. doi: 10.1038/nbt.4314.
Beisken S., Meinl T., Wiswedel B., De Figueiredo L.F., Berthold M., Steinbeck C. KNIME-CDK: Workflow-driven cheminformatics. BMC Bioinformatics. 2013;14. doi: 10.1186/1471-2105-14-257.
Cai Y., Pan X., Zhang D., Yuan L., Lao F., Wu J. The kinetic study of 2-acetyl-1-pyrroline accumulation in the model system: An insight into enhancing rice flavor through the Maillard reaction. Food Research International. 2024;191. doi: 10.1016/j.foodres.2024.114591.
Capecchi A., Probst D., Reymond J.L. One molecular fingerprint to rule them all: Drugs, biomolecules, and the metabolome. Journal of Cheminformatics. 2020;12(1). doi: 10.1186/s13321-020-00445-4.
Dong W., Dai X., Jia Y., Ye S., Shen C., Liu M., Lin F., Sun X., Xiong Y., Deng B. Association between baijiu chemistry and taste change: Constituents, sensory properties, and analytical approaches. Food Chemistry. 2024;437. doi: 10.1016/j.foodchem.2023.137826.
Dong W., Guo R., Liu M., Shen C., Sun X., Zhao M., Sun J., Li H., Zheng F., Huang M., Wu J. Characterization of key odorants causing the roasted and mud-like aromas in strong-aroma types of base baijiu. Food Research International. 2019;125. doi: 10.1016/j.foodres.2019.108546.
Dong W., Shi K., Liu M., Shen C., Li A., Sun X., Zhao M., Sun J., Li H., Zheng F., Huang M. Characterization of 3-methylindole as a source of a “mud”-like off-odor in strong-aroma types of base baijiu. Journal of Agricultural and Food Chemistry. 2018;66(48):12765–12772. doi: 10.1021/acs.jafc.8b04734.
Duan J., Cheng W., Lv S., Deng W., Hu X., Li H., Sun J., Zheng F., Sun B. Characterization of key aroma compounds in soy sauce flavor baijiu by molecular sensory science combined with aroma active compounds reverse verification method. Food Chemistry. 2024;443. doi: 10.1016/j.foodchem.2024.138487.
Dutta P., Jain D., Gupta R., Rai B. Classification of tastants: A deep learning based approach. Molecular Informatics. 2023;42(12). doi: 10.1002/minf.202300146.
Ehiro T. Feature importance-based interpretation of UMAP-visualized polymer space. Molecular Informatics. 2023;42(8–9). doi: 10.1002/minf.202300061.
Fan W.L., Qian M.C. Headspace solid phase microextraction and gas chromatography-olfactometry dilution analysis of young and aged Chinese “Yanghe Daqu” liquors. Journal of Agricultural and Food Chemistry. 2005;53(20):7931–7938. doi: 10.1021/jf051011k.
Herrera-Rocha F., Fernandez-Nino M., Duitama J., Cala M.P., Chica M.J., Wessjohann L.A.…Barrios A.F.G. FlavorMiner: A machine learning platform for extracting molecular flavor profiles from structural data. Journal of Cheminformatics. 2024;16(1). doi: 10.1186/s13321-024-00935-9.
Huang Y., Huang L., Cheng M., Li C., Zhou X., Ullah A., Sarfraz S., Khatab A., Xie G. Progresses in biosynthesis pathway, regulation mechanism and potential application of 2-acetyl-1-pyrroline in fragrant rice. Plant Physiology and Biochemistry. 2024;215. doi: 10.1016/j.plaphy.2024.109047.
Jin G., Zhu Y., Xu Y. Mystery behind Chinese liquor fermentation. Trends in Food Science & Technology. 2017;63:18–28. doi: 10.1016/j.tifs.2017.02.016.
Kelley L.A., Gardner S.P., Sutcliffe M.J. An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally related subfamilies. Protein Engineering. 1996;9(11):1063–1065. doi: 10.1093/protein/9.11.1063.
Kobak D., Linderman G.C. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology. 2021;39(2). doi: 10.1038/s41587-020-00809-z.
Li H., Zhang X., Gao X., Shi X., Chen S., Xu Y., Tang K. Comparison of the aroma-active compounds and sensory characteristics of different grades of light-flavor baijiu. Foods. 2023;12(6). doi: 10.3390/foods12061238.
Marx V. Seeing data as t-SNE and UMAP do. Nature Methods. 2024;21(6):930–933. doi: 10.1038/s41592-024-02301-x.
Probst D., Reymond J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. Journal of Cheminformatics. 2020;12(1). doi: 10.1186/s13321-020-0416-x.
Rabasovic M.S., Pavlovic D.M., Sevic D. Analysis of laser ablation spectral data using dimensionality reduction techniques: PCA, t-SNE and UMAP. Contributions of the Astronomical Observatory Skalnate Pleso. 2023;53(3):51–57. doi: 10.31577/caosp.2023.53.3.51.
Roweis S.T., Saul L.K. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290(5500):2323–2326. doi: 10.1126/science.290.5500.2323.
Rugard M., Jaylet T., Taboureau O., Tromelin A., Audouze K. Smell compounds classification using UMAP to increase knowledge of odors and molecular structures linkages. PLoS ONE. 2021;16(5). doi: 10.1371/journal.pone.0252486.
Sha S., Chen S., Qian M., Wang C., Xu Y. Characterization of the typical potent odorants in Chinese roasted sesame-like flavor type liquor by headspace solid phase microextraction-aroma extract dilution analysis, with special emphasis on sulfur-containing odorants. Journal of Agricultural and Food Chemistry. 2017;65(1):123–131. doi: 10.1021/acs.jafc.6b04242.
Sharma A., Kumar R., Ranjta S., Varadwaj P.K. SMILES to smell: Decoding the structure-odor relationship of chemical compounds using the deep neural network approach. Journal of Chemical Information and Modeling. 2021;61(2):676–688. doi: 10.1021/acs.jcim.0c01288.
Song X., Jing S., Zhu L., Ma C., Song T., Wu J., Zhao Q., Zheng F., Zhao M., Chen F. Untargeted and targeted metabolomics strategy for the classification of strong aroma-type baijiu (liquor) according to geographical origin using comprehensive two-dimensional gas chromatography-time-of-flight mass spectrometry. Food Chemistry. 2020;314. doi: 10.1016/j.foodchem.2019.126098.
Song X., Zhu L., Geng X., Li Q., Zheng F., Zhao Q., Ji J., Sun J., Li H., Wu J., Zhao M., Sun B. Analysis, occurrence, and potential sensory significance of tropical fruit aroma thiols, 3-mercaptohexanol and 4-methyl-4-mercapto-2-pentanone, in Chinese Baijiu. Food Chemistry. 2021;363. doi: 10.1016/j.foodchem.2021.130232.
Sun X., Qian Q., Xiong Y., Xie Q., Yue X., Liu J., Wei S., Yang Q. Characterization of the key aroma compounds in aged Chinese xiaoqu baijiu by means of the sensomics approach. Food Chemistry. 2022;384. doi: 10.1016/j.foodchem.2022.132452.
Tromelin A., Chabanet C., Audouze K., Koensgen F., Guichard E. Multivariate statistical analysis of a large odorants database aimed at revealing similarities and links between odorants and odors. Flavour and Fragrance Journal. 2018;33(1):106–126. doi: 10.1002/ffj.3430.
Van-Diep L., Zheng X.-W., Chen J.-Y., Han B.-Z. Characterization of volatile compounds in Fen-Daqu – a traditional Chinese liquor fermentation starter. Journal of the Institute of Brewing. 2012;118(1):107–113. doi: 10.1002/jib.8.
Wang G., Liu F., Pan F., Li H., Zheng F., Ye X., Sun B., Cheng H. Study on the interaction between polyol glycerol and flavor compounds of baijiu: A new perspective of influencing factors of baijiu flavor. Journal of Agricultural and Food Chemistry. 2024;72(48):26832–26845. doi: 10.1021/acs.jafc.4c05935.
Wang L., Tang P., Zhang P., Lu J., Chen Y., Xiao D., Guo X. Unraveling the aroma profiling of baijiu: Sensory characteristics of aroma compounds, analytical approaches, key odor-active compounds in different baijiu, and their synthesis mechanisms. Trends in Food Science & Technology. 2024;146. Article 104376. doi: 10.1016/j.tifs.2024.104376.
Wen H., Nan S., Zhang J., Lei Z., Shen W. Chemical space deconstruction-based dynamic model ensemble architecture for molecular property prediction. Chemical Engineering Science. 2024;295. doi: 10.1016/j.ces.2024.120118.
Wu M., Fan Y., Zhang J., Chen H., Wang S., Shen C., Fu H., She Y. A novel organic acids-targeted colorimetric sensor array for the rapid discrimination of origins of baijiu with three main aroma types. Food Chemistry. 2024;447. doi: 10.1016/j.foodchem.2024.138968.
Zeng S., Duan X., Bai J., Tao W., Hu K., Tang Y. Soft multiprototype clustering algorithm via two-layer semi-NMF. IEEE Transactions on Fuzzy Systems. 2024;32(4):1615–1629. doi: 10.1109/tfuzz.2023.3329108.
Zeng X., Cao R., Xi Y., Li X., Yu M., Zhao J., Cheng J., Li J. Food flavor analysis 4.0: A cross-domain application of machine learning. Trends in Food Science & Technology. 2023;138:116–125. doi: 10.1016/j.tifs.2023.06.011.
Zhu J., Niu Y., Xiao Z. Characterization of important sulfur and nitrogen compounds in Lang baijiu by application of gas chromatography-olfactometry, flame photometric detection, nitrogen phosphorus detector and odor activity value. Food Research International. 2020;131. doi: 10.1016/j.foodres.2020.109001.
