Abstract
Deep reactive ion etching (DRIE) is critical for fabricating high-aspect-ratio structures in microelectromechanical systems (MEMS), yet its complex, parameter-dependent process poses significant optimization challenges. Artificial intelligence (AI) offers an efficient optimization solution, but its implementation faces the technical challenge of acquiring large-scale data from scanning electron microscopy (SEM) images, the standard for evaluating DRIE etching outcomes. Traditional SEM analysis relies on labor-intensive manual methods, incurring 15–20% errors and hindering high-throughput manufacturing. Existing automated methods, such as CNNs and SVMs, falter with 70–80% accuracy in noisy SEM images, failing to capture the dynamic evolution of etched structures. To address these limitations, we propose a physics-constrained variational level set autoencoder (VLSet-AE) for automated SEM sectional-profile analysis. By integrating physical etching constraints and a three-dimensional framework (time, linewidth, etching depth), VLSet-AE achieves precise contour recognition and extraction of nine critical dimensions—scallop depth (2.29%), scallop width (peak-to-peak: 2.05%, valley-to-valley: 6.28%), scallop radius (4.69%), profile angle (0.56%), trench depth (5.46%), bow width (4.35%), mid width (2.43%), and bottom width (4.78%)—with an average error of 3.65% and an overall model accuracy of 94.3%, significantly outperforming manual annotation and state-of-the-art alternatives. Compared to seven current models (e.g., CNNs, LSTMs, ResNet), VLSet-AE achieves the shortest training time (20 s), fastest inference time (1.2 s), highest recognition accuracy (96%), and competitive memory usage (50 MB) and parameter count (4.0 million). By enabling efficient, large-scale data acquisition for AI-optimized DRIE processes, VLSet-AE empowers scalable, intelligent manufacturing, unlocking the potential for advanced microfabrication technologies.
This approach provides a forward-looking framework for AI-driven MEMS process design and manufacturing, delivering innovative solutions for future AI-assisted microfabrication advancements.

Keywords: AI for micro-fabrication, Deep reactive ion etching, Scanning electron microscopy image recognition, Physics-informed neural networks, Variational autoencoder, Pattern recognition
Subject terms: Electrical and electronic engineering, Computational nanotechnology
Introduction
Microelectromechanical systems (MEMS) underpin transformative technologies, from smartphones and autonomous vehicles to medical diagnostics, enabling compact, high-performance devices1. Deep reactive ion etching (DRIE) plays a pivotal role in MEMS fabrication, serving as a core process for sculpting high-aspect-ratio silicon structures with exceptional precision2. DRIE’s ability to create intricate features, such as those in sensors and integrated circuits, makes it indispensable for advancing next-generation microsystems.
Although DRIE achieves high precision in the in-plane (lateral) dimensions—accurately replicating mask-defined linewidths—it often exhibits significant variability in the out-of-plane (vertical) direction due to anisotropic etching effects. Phenomena such as scalloping, bowing, and notching commonly arise during deep etching, compromising structural fidelity and introducing uncertainty in depth-dependent features. These challenges stem from the complex interplay between plasma chemistry, ion energy, and cycle timing, making the DRIE process inherently sensitive and difficult to control. Achieving high-aspect-ratio structures in DRIE thus requires precise regulation of etching and passivation cycle times within the Bosch process—a cyclic procedure alternating between etching and passivation steps. Shorter etching cycles, though slower in material removal, enable finer control over etch-passivation alternation, producing smoother sidewalls with minimal scalloping. In contrast, longer cycles increase throughput but result in deeper individual etch steps, generating more pronounced scallop patterns—wavy sidewall irregularities that distort morphology and compromise structural integrity3. Compounding these issues, DRIE performance is highly susceptible to operational drift due to factors such as equipment startup/shutdown, chamber conditioning, and tool aging4. Consequently, many fabrication lines operate each DRIE system under fixed, recipe-specific configurations tailored to individual MEMS structures, leading to underutilized flexibility and significant resource inefficiencies5.
Meanwhile, evaluating DRIE outcomes further reflects its complexity, necessitating scanning electron microscopy (SEM) to measure the morphology of etched structures6. In a typical MEMS fabrication cleanroom, engineers meticulously prepare samples—cleaving wafers, mounting specimens, and fine-tuning SEM imaging parameters—to resolve critical features like trench depth and scallop dimensions. Each wafer may yield hundreds of images, with manual analysis requiring hours per image to trace contours and measure dimensions. For large batches, this process extends over weeks, exacerbated by operator fatigue and subjective errors, leading to measurement inaccuracies of 15–20%. This labor-intensive, error-prone workflow underscores the inefficiency of manual SEM analysis and its bottleneck in high-throughput manufacturing7,8.
Given these challenges, directly optimizing DRIE’s nonlinear, multifaceted process poses a formidable challenge due to its intricate parameter interdependencies. Artificial intelligence (AI) provides a transformative solution to address this complexity by leveraging advanced models and large-scale data. Identifying an effective starting point for optimization and acquiring large-scale data to support it remains a critical challenge. To address this challenge, we develop AI models to automate feature extraction from SEM images, precisely identifying morphological structures and extracting critical parameters like trench depth and scallop dimensions. These capabilities enable robust correlations between etching outcomes and process parameters, learned through advanced modeling techniques, constructing an AI-driven framework for DRIE optimization, real-time monitoring, and scalable, intelligent manufacturing.
Extracting large-scale feature data from SEM images is a critical technology for developing AI-driven optimization models. However, traditional SEM image analysis methods are limited in accuracy and efficiency. Manual annotation, requiring 1–2 h per image, incurs 15–20% errors due to operator subjectivity and image variability. Inter-operator discrepancies further erode reproducibility, with critical dimension measurements varying by up to 15%9. Existing automated methods, including thresholding (e.g., Canny, Sobel)10, image recognition, and early machine learning models (e.g., CNNs, SVMs)11–13, enhance efficiency over manual analysis, achieving accuracies of 70–80% under ideal conditions14,15, but they exhibit limited robustness to noise and complex DRIE morphologies, with performance dropping by 20–30% in the noisy, low-contrast SEM images typical of DRIE16.
Moreover, current recognition methods typically rely on two-dimensional static receptive field scanning—such as pixel-based convolutions over RGB image channels—to identify structural features across entire SEM images17,18. These approaches lack spatial adaptivity and are particularly vulnerable to noise and low contrast, leading to frequent misidentification of etched boundaries19,20. More critically, they overlook the depth-dependent evolution of DRIE structures, treating profiles as static cross-sections rather than dynamic, layered morphologies21–23. In contrast, the proposed level set-based method simulates an adaptive contour evolution process—akin to an expanding bubble that grows from within each structure and halts precisely at material boundaries—thereby capturing depth-varying features with higher accuracy and physical consistency. This evolution-driven mechanism enables robust contour extraction even in noisy or low-contrast SEM images, facilitating large-scale, precise data acquisition essential for AI-driven DRIE process optimization.
To address these gaps, we introduce a physics-constrained variational level set autoencoder (VLSet-AE) for automated contour recognition in DRIE SEM profiles. By leveraging layer-wise scallop segmentation along the etching depth and incorporating physical etching constraints into the decoder, VLSet-AE ensures reconstructed level set functions align with the etching process, enhancing accuracy and consistency. A temporal-scale three-dimensional framework, integrating time, linewidth, and etching depth, enables dynamic spatial and temporal characterization of etching dynamics, with reconstructed scallop segments assembled into complete profiles for precise quantification. VLSet-AE extracts critical dimensions such as scallop depth (2.29%), scallop width (peak-to-peak: 2.05%, valley-to-valley: 6.28%), scallop radius (4.69%), profile angle (0.56%), trench depth (5.46%), bow width (4.35%), mid width (2.43%), and bottom width (4.78%). With a 3.65% average error, 94.3% accuracy—outperforming CNN and SVM baselines (70–80% accuracy)—and a correlation coefficient of 0.998, VLSet-AE converges rapidly, enabling real-time process monitoring and advanced etching simulations for scalable, intelligent manufacturing. This technology paves a new direction for AI-assisted semiconductor process optimization and intelligent manufacturing, setting the stage for future innovations in high-performance microfabrication and next-generation smart production systems.
System design and methodology
A. Experimental data collection and preprocessing
Experimental process and data collection
The experimental workflow begins with the design of photolithographic mask patterns, followed by photomask fabrication and lithography on 4-inch single-wafer silicon substrates, as illustrated in Fig. 1. The patterned wafers are subsequently etched using DRIE performed on SPTS equipment.
Fig. 1.
Experimental data collection and image preprocessing. a 4-in. single-sided polished silicon wafer after photolithography. b Conducting DRIE etching and performing cleaving according to the 5-point method. c Utilizing a scanning electron microscope to take SEM images after etching. d Input segmentation of aspect ratio at different etching recipes, and input segmentation of scallops at different etching depths.
To address the inherent complexity of the DRIE process—where key parameters such as etching cycle time, passivation time, substrate temperature, and chamber pressure significantly influence the resulting etch profiles—we employ a 16-run orthogonal experimental design. This approach enables systematic exploration of the multidimensional parameter space and generates a diverse, representative dataset that reflects the process’s intrinsic variability, thereby providing a strong foundation for model training and evaluation. From the DRIE experiments, we obtain 1000 cross-sectional SEM images, each capturing a unique combination of etching outcomes. To extract meaningful structural features, each image undergoes a tailored preprocessing pipeline. Specifically, we isolate the etched trench region from the background and perform depth-wise segmentation of the scallop structures from top to bottom along the trench axis. These segmented scallop layers—each corresponding to a single etch-passivation cycle—preserve the morphological evolution of the etched profile over time.
This layer-wise scallop segmentation not only facilitates fine-grained feature recognition but also establishes a depth-resolved dataset that is ideally suited for training physics-informed models such as VLSet-AE. By capturing both local structural variations and global trench morphology, the resulting dataset enables accurate, scalable, and physically consistent analysis of DRIE processes.
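The depth-wise segmentation described above can be sketched as follows. This is a minimal illustration, not the paper's preprocessing code: the function name, the fixed 0.5 threshold, and the assumption that the etched trench appears darker than the background are ours.

```python
import numpy as np

def segment_scallop_layers(image, n_layers):
    """Split the etched-trench region of a grayscale SEM image into
    depth-wise horizontal bands, one per etch-passivation cycle.

    Illustrative sketch: a simple global threshold stands in for the
    paper's full trench-isolation pipeline, and the trench is assumed
    darker than the surrounding silicon.
    """
    img = image.astype(np.float64)
    img = (img - img.min()) / (img.max() - img.min() + 1e-12)  # normalize to [0, 1]

    trench_mask = img < 0.5                       # dark pixels -> etched trench (assumption)
    rows = np.where(trench_mask.any(axis=1))[0]   # rows containing trench pixels
    top, bottom = rows.min(), rows.max() + 1

    # Equal-height bands along the etching depth, ordered top to bottom.
    bounds = np.linspace(top, bottom, n_layers + 1).astype(int)
    return [trench_mask[bounds[i]:bounds[i + 1], :] for i in range(n_layers)]
```

In practice each band would be matched to one etch-passivation cycle, preserving the morphological evolution of the profile over depth.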
Orthogonal experimental design for DRIE process optimization
To systematically investigate the complex parameter dependencies in DRIE, which incorporates the Bosch process—a cyclic technique alternating between etching and passivation steps—we developed a 16-run orthogonal experimental framework focusing on two critical parameters: etch cycle time (te) and passivation cycle time (tp). These parameters exert the most significant influence on the Bosch process, profoundly affecting etch morphology, including profile angle, scallop depth, scallop width, and trench depth, as demonstrated in prior experiments. Their dominant impact and pronounced effects make them ideal for targeted study, facilitating the collection of a comprehensive dataset. The orthogonal design ensures thorough coverage of the parameter space, capturing the intrinsic variability of the DRIE process and generating a representative dataset for training and evaluating the physics-constrained variational level set autoencoder.
The experimental design combines an orthogonal method with a supplementary L9(3^4) orthogonal table, as shown in Table 1, resulting in 16 unique experimental conditions, as detailed in Fig. 2. The parameter ranges were carefully chosen based on preliminary experiments and established DRIE process sensitivities:
Table 1.
Orthogonal experimental data for DRIE process
| Experiment Number | Etch Time (s) | Passivation Time (s) | Ratio | Profile Angle (°) | Scallop Depth (nm) | Scallop Width (nm) |
|---|---|---|---|---|---|---|
| 1 | 4.0 | 2.0 | 2.0 | 85 | 214 | 1080 |
| 2 | 4.5 | 2.5 | 1.8 | 87 | 223 | 995 |
| 3 | 5.0 | 3.0 | 1.67 | 88 | 245 | 1170 |
| 4 | 5.5 | 3.5 | 1.57 | 88 | 198 | 1290 |
| 5 | 6.0 | 3.0 | 2.0 | 86 | 287 | 1310 |
| 6 | 6.0 | 4.0 | 1.5 | 88 | 232 | 1150 |
| 7 | 6.5 | 5.0 | 1.3 | 90 | 174 | 1070 |
| 8 | 7.0 | 5.0 | 1.4 | 89 | 202 | 1310 |
| 9 | 7.5 | 5.5 | 1.36 | 89 | 192 | 1240 |
| 10 | 8.0 | 2.0 | 4.0 | 83 | 595 | 1590 |
| 11 | 8.0 | 4.0 | 2.0 | 85 | 358 | 1510 |
| 12 | 8.0 | 6.0 | 1.33 | 88 | 197 | 1270 |
| 13 | 4.5 | 4.0 | 1.125 | 91 | 102 | 772 |
| 14 | 7.5 | 4.0 | 1.875 | 86 | 322 | 1500 |
Fig. 2.
Orthogonal experimental design for DRIE process optimization. a Orthogonal experimental data for DRIE process table. b The variation of etching depth (aspect ratio) with the ratio of etching cycle times/passivation cycle times. c Variation of profile angle with etch/passivation cycle times. d Variation of average scallop depth with etch/passivation cycle times
1. Etch Cycle Time (te): 4 to 8 s, with a step size of 0.5 s (9 levels). Etch times below 4 s yield insufficient material removal, resulting in negligible etching, while times above 8 s lead to excessive scallop depths and sidewall roughness, compromising structural integrity.
2. Passivation Cycle Time (tp): 2–6 s, with a step size of 0.5 s (9 levels). Passivation times below 2 s provide inadequate sidewall protection, leading to uncontrolled lateral etching, whereas times above 6 s reduce etch efficiency without significant morphological improvements.
The etch-to-passivation cycle time ratio (te/tp) is a pivotal parameter governing sidewall morphology and scallop formation. Ratios below 1 were excluded from the design, as they result in insufficient etching of the passivation layer, leading to minimal or no silicon etching, as noted in prior studies. The 16 experimental conditions span te/tp ratios from 1.09 to 4.0, with a particular focus on the sensitive range of 1.09–1.5, where near-vertical sidewalls (~90° profile angle) and minimal scallop depths are achieved, as observed in preliminary experiments (e.g., te/tp ≈ 1.3–1.5 yielding profile angles of 88–92° and scallop depths of 102–202 nm). In addition to the variable parameters (te and tp), the DRIE process was conducted under fixed conditions to ensure consistency across experiments, as outlined in Table 1. These conditions include: a deposition (passivation) step with 4 s duration, 38.5 mTorr pressure, 1800 W source power, 67 W 380 kHz platen power, 275 sccm C4F8 flow, and an etch step with 6.5 s duration, 40 mTorr pressure, 2200 W source power, 95 W 380 kHz platen power, 400 sccm SF6 flow, 1 sccm O2 flow, and a process time of 52:30 (mm:ss). Other fixed parameters include 0 W 13.56 MHz platen power, 0 sccm Ar flow, 7.5 LF Pulse Generate, 300 loops, and a temperature of 30 °C. These settings, implemented on SPTS equipment, were held constant to isolate the effects of te and tp variations.
The orthogonal design was structured to capture typical variations in etch profiles, as shown in Fig. 2, ensuring representativeness across key morphological features:
Profile Angle: The experiments produce profile angles ranging from 83° (te/tp = 4.0, experiment 10) to 92° (te/tp = 1.09, experiment 16), covering undercut (<90°), near-vertical (~90°), and tapered (>90°) profiles, as shown in Table 1 and Fig. 2c. Ratios around 1.3–1.5 (e.g., experiments 7, 8, 12) yield near-90° profiles, ideal for high-aspect-ratio structures.
Scallop Depth: Scallop depths vary from 102 nm (te/tp = 1.125, experiment 13) to 595 nm (te/tp = 4.0, experiment 10), as shown in Table 1. Lower ratios (1.09–1.4) produce smaller scallop depths, corresponding to smoother sidewalls.
Scallop Width: Scallop widths range from 772 nm (te/tp = 1.125, experiment 13) to 1590 nm (te/tp = 4.0, experiment 10), as shown in Table 1 and Fig. 2d. Shorter etch times and longer passivation times reduce scallop width, enhancing sidewall smoothness.
Trench Depth: For linewidths (CD) from 5 to 50 μm, trench depths increase with higher te/tp ratios, ranging from 47.3 μm (te/tp = 1.09) to 273.5 μm (te/tp = 1.875) for a 5 μm CD, as shown in Table 1 and Fig. 2b.
The design includes center points (e.g., te = 6 s, tp = 4 s, experiment 5), factorial points, and axial points to ensure comprehensive coverage of the parameter space, while the L9(3^4) orthogonal table adds intermediate levels (e.g., te = 4.5 s, tp = 2.5 s, experiment 2) for finer resolution in the optimal te/tp range (1.09–1.5). Repeated measurements at center points were conducted to estimate experimental error, enhancing the dataset’s reliability. Infeasible conditions (te/tp < 1) were excluded to focus on ratios that produce measurable etch outcomes, ensuring the design’s representativeness.
The resulting dataset, comprising 1000 cross-sectional SEM images from the 16 experimental conditions, captures a wide range of etch morphologies. This dataset, combined with the depth-wise segmentation described in subsection A, provides a robust foundation for VLSet-AE’s training and evaluation, enabling precise feature extraction and process optimization.
B. System architecture of physics-constrained variational level-set autoencoder
To accurately capture the complex morphological features of deep reactive ion etching (DRIE) profiles from SEM images, we propose a physics-constrained variational level-set autoencoder (VLSet-AE). As illustrated in Fig. 3a, the proposed VLSet-AE architecture is built upon a variational autoencoder (VAE), where the encoder transforms high-dimensional SEM images into a compact latent representation z, and the decoder reconstructs the etched structure as a level set function φ. Unlike conventional image-based reconstruction that treats contours as pixel boundaries, our method interprets the etched profile as an evolving geometric interface. To ensure that the reconstructed contours reflect the actual physics of DRIE, we introduce the Hamilton–Jacobi equation—a fundamental equation describing interface motion—as a constraint in the decoder's loss function. This enforces that the predicted level set function evolves consistently with how etching interfaces physically propagate (e.g., scallop expansion, sidewall evolution), resulting in contour recognition that is not only more accurate but also more physically plausible in noisy or irregular SEM images.
Fig. 3.
Physics-constrained variational level set autoencoder framework for feature extraction from SEM cross-sectional profiles. a System architecture of the variational level set autoencoder (VLSet-AE): the encoder compresses the high-dimensional input images into a low-dimensional latent space representation z, and the decoder reconstructs the level set function φ from the latent variable z. b Physical constraints are incorporated into the decoder, enabling it to generate a level set function that conforms to the physical laws of the actual etching process
The VLSet-AE architecture, comprising approximately 4 million parameters, is a probabilistic encoder–decoder design optimized for robust feature representation and physical constraint embedding in SEM image analysis for DRIE profiles. The encoder extracts hierarchical morphological features from input SEM images and maps them to a latent distribution characterized by a mean μ and variance σ², enabling stochastic sampling via the reparameterization trick13. To address the risk of overfitting given the dataset size of 1000 SEM images, we implemented a comprehensive set of regularization strategies. The KL divergence term in the loss function (Section 3.E) enforces a Gaussian prior on the latent space, promoting smoothness and continuity in latent representations, which is critical for generalization13. The physics-constrained loss, based on the Hamilton-Jacobi equation, ensures that reconstructed contours align with the physical dynamics of the etching process, acting as a domain-specific regularizer that enhances model robustness. Additionally, we applied dropout (p = 0.3) in the encoder’s convolutional layers to prevent over-reliance on specific neurons and incorporated L2 weight decay (λ = 0.001) to penalize large weights, further mitigating overfitting. To increase the effective diversity of the dataset, we employed data augmentation techniques, including random rotations (±10°), intensity variations (±15%), and horizontal flips, which simulate realistic variations in SEM imaging conditions. The dataset was split into 80% training (800 images), 10% validation (100 images), and 10% test (100 images) sets, with 5-fold cross-validation performed to ensure robust generalization across different data subsets. This splitting strategy, combined with cross-validation, aligns with best practices for evaluating model performance on limited datasets.
These measures collectively ensure that VLSet-AE maintains high generalization performance despite its parameter count and the constrained dataset size.
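The augmentation and splitting strategy described above can be sketched as follows. This is our illustration rather than the authors' code: the helper names and the use of scipy.ndimage.rotate are assumptions, and only the paper's stated ranges (±10° rotation, ±15% intensity, horizontal flips, 80/10/10 split) are taken from the text.

```python
import numpy as np
from scipy.ndimage import rotate

def augment(image, rng):
    """One random augmentation pass mirroring the paper's ranges:
    +/-10 deg rotation, +/-15% intensity scaling, 50% horizontal flip."""
    out = rotate(image, angle=rng.uniform(-10, 10), reshape=False, mode="nearest")
    out = np.clip(out * rng.uniform(0.85, 1.15), 0.0, 1.0)  # intensity jitter
    if rng.random() < 0.5:
        out = out[:, ::-1]                                   # horizontal flip
    return out

def split_dataset(n_images, rng, fractions=(0.8, 0.1, 0.1)):
    """Shuffle indices and return (train, val, test) index arrays
    for the paper's 80/10/10 split."""
    idx = rng.permutation(n_images)
    n_train = int(fractions[0] * n_images)
    n_val = int(fractions[1] * n_images)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

The 5-fold cross-validation mentioned above would then be layered on top of the training portion.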
The encoder is responsible for compressing high-dimensional SEM image data into a compact latent space representation. Given an input SEM image x, the encoder network extracts relevant feature representations and maps them to a lower-dimensional latent space z. Instead of directly mapping the input to a deterministic latent vector, the model parameterizes the latent space using a probabilistic distribution. The encoder learns the mean μ and variance σ² of the latent space representation through a set of neural network layers. Specifically, the latent variables are modeled as a Gaussian distribution where:
z ~ N(μ, σ²I),  μ = f_μ(h),  log σ² = f_σ(h)    (1)
where h is the extracted feature representation from the encoder. To ensure differentiability and allow stochastic sampling, the reparameterization trick is employed:
z = μ + σ ⊙ ε,  ε ~ N(0, I)    (2)
This formulation allows backpropagation through the stochastic sampling process, ensuring that the network can be efficiently optimized using gradient-based methods.
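A minimal numerical sketch of the reparameterization step above (the function name is ours; in the actual model this operates on framework tensors so gradients flow through μ and log σ²):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    Moving the randomness into eps keeps z a deterministic, differentiable
    function of mu and logvar, which is what makes backpropagation through
    the sampling step possible.
    """
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * logvar)      # logvar = log(sigma^2)
    return mu + sigma * eps
```

Sampling many z values with fixed μ and log σ² recovers the intended Gaussian statistics, which is an easy sanity check on the implementation.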
Once the SEM image is encoded into the latent space, the decoder reconstructs the level set function from the latent variables. The decoder learns to map the latent representation back to a structured feature space that defines the morphological contours of the etched structure. This reconstruction involves two key steps: first, the decoder generates an intermediate feature map h_d as:
h_d = g(z; θ_d)    (3)
Then, the level set function φ is computed as:
φ = W_φ h_d + b_φ    (4)
The function φ represents the contour of the SEM image, where the zero level set φ(x, y) = 0 defines the boundaries of the etched structure. This approach allows the model to accurately capture the morphology of the scallops and trenches that arise from the DRIE process.
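To illustrate how a contour can be read off a reconstructed level set function, the sketch below flags pixels where φ changes sign between horizontal or vertical neighbours. This simple sign-change test is our illustration of zero-level-set extraction, not the paper's contour-tracing routine.

```python
import numpy as np

def zero_level_set_mask(phi):
    """Boolean mask approximating the phi = 0 contour: a pixel is
    flagged when phi changes sign between it and its right or lower
    neighbour (a crude marching-squares-style test)."""
    sign = np.sign(phi)
    crossing = np.zeros(phi.shape, dtype=bool)
    crossing[:, :-1] |= sign[:, :-1] != sign[:, 1:]   # horizontal crossings
    crossing[:-1, :] |= sign[:-1, :] != sign[1:, :]   # vertical crossings
    return crossing
```

On a signed-distance function the flagged pixels all lie within about one grid spacing of the true interface, which is the property the morphological measurements rely on.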
A unique aspect of VLSet-AE is the incorporation of physical constraints into the decoder’s loss function, as shown in Fig. 3b. By embedding the Hamilton-Jacobi equation, which governs the evolution of level set functions, the model ensures that the generated contours remain physically meaningful. The Hamilton-Jacobi equation provides a dynamic evolution framework for level set functions, ensuring that the extracted features correspond to real physical phenomena observed in the etching process. This constraint enforces smoothness and consistency in the extracted contours, reducing artifacts that may arise due to image noise or improper segmentation.
C. Design of encoder and decoder in VLSet-AE
The physics-constrained VLSet-AE leverages a variational autoencoder (VAE) framework integrated with level set methods to map SEM images into a low-dimensional latent space and reconstruct the level set function representing etched structure contours. To ensure reproducibility and address requests for detailed architectural specifications, this section provides a clear and comprehensive description of the encoder and decoder architectures, including layer configurations, activation functions, and hyperparameters. Table 2 summarizes the structural parameters, facilitating implementation and validation of the model.
Table 2.
VLSet-AE encoder and decoder architecture
| Layer | Type | Filters/Units | Kernel Size | Stride | Padding | Output Shape | Activation | Additional |
|---|---|---|---|---|---|---|---|---|
| Encoder | ||||||||
| Input | Input | - | - | - | - | 256 × 256 × 1 | - | SEM image |
| Conv1 | Convolutional | 32 | 3 × 3 | 1 | Same | 256 × 256 × 32 | ReLU | BatchNorm |
| MaxPooling1 | Max Pooling | - | 2 × 2 | 2 | - | 128 × 128 × 32 | - | - |
| Conv2 | Convolutional | 64 | 3 × 3 | 1 | Same | 128 × 128 × 64 | ReLU | BatchNorm |
| MaxPooling2 | Max Pooling | - | 2 × 2 | 2 | - | 64 × 64 × 64 | - | - |
| Conv3 | Convolutional | 128 | 3 × 3 | 1 | Same | 64 × 64 × 128 | ReLU | BatchNorm |
| MaxPooling3 | Max Pooling | - | 2 × 2 | 2 | - | 32 × 32 × 128 | - | - |
| Conv4 | Convolutional | 256 | 3 × 3 | 1 | Same | 32 × 32 × 256 | ReLU | BatchNorm |
| MaxPooling4 | Max Pooling | - | 2 × 2 | 2 | - | 16 × 16 × 256 | - | - |
| Flatten | Flattening | - | - | - | - | 65,536 | - | - |
| FC1 | Fully Connected | 512 | - | - | - | 512 | ReLU | Dropout (0.3) |
| FC_mu | Fully Connected | 128 | - | - | - | 128 | Linear | Latent mean (μ) |
| FC_logvar | Fully Connected | 128 | - | - | - | 128 | Linear | Latent log-variance (log σ²) |
| Decoder | ||||||||
| Input | Input | - | - | - | - | 128 | - | Latent variable (z) |
| FC2 | Fully Connected | 65,536 | - | - | - | 16 × 16 × 256 | ReLU | Reshape, Dropout (0.3) |
| ConvTranspose1 | Transposed Convolutional | 128 | 3 × 3 | 2 | Same | 32 × 32 × 128 | ReLU | BatchNorm |
| ConvTranspose2 | Transposed Convolutional | 64 | 3 × 3 | 2 | Same | 64 × 64 × 64 | ReLU | BatchNorm |
| ConvTranspose3 | Transposed Convolutional | 32 | 3 × 3 | 2 | Same | 128 × 128 × 32 | ReLU | BatchNorm |
| ConvTranspose4 | Transposed Convolutional | 1 | 3 × 3 | 2 | Same | 256 × 256 × 1 | Linear | Outputs level set function |
Encoder design
The objective of the encoder is to map the input SEM image x (grayscale, 256 × 256 pixels) into the distribution parameters of a low-dimensional latent space representation z, where μ represents the mean vector, indicating the center of the latent variables, and σ² represents the variance vector, reflecting the uncertainty of the latent variables. The encoding process begins with feature extraction from the input image using a series of convolutional operations, denoted as h = f(x; θ), where f is a nonlinear mapping function (such as a CNN or multi-layer perceptron), and θ represents the corresponding parameters.
The encoder architecture comprises four convolutional layers, each followed by max-pooling to reduce spatial dimensions, and fully connected layers to compute the latent distribution parameters, as shown in Table 2. Specifically:
(1) Input Layer: Accepts a grayscale SEM image of size 256 × 256 × 1.
(2) Convolutional Layers:
Conv1: 32 filters, 3 × 3 kernel, stride 1, ‘same’ padding, ReLU activation, followed by max-pooling (2 × 2, stride 2).
Conv2: 64 filters, 3 × 3 kernel, stride 1, ‘same’ padding, ReLU activation, followed by max-pooling (2 × 2, stride 2).
Conv3: 128 filters, 3 × 3 kernel, stride 1, ‘same’ padding, ReLU activation, followed by max-pooling (2 × 2, stride 2).
Conv4: 256 filters, 3 × 3 kernel, stride 1, ‘same’ padding, ReLU activation, followed by max-pooling (2 × 2, stride 2).
(3) Flattening: The feature map (16 × 16 × 256) is flattened to a 65,536-dimensional vector.
(4) Fully Connected Layers:
FC1: 512 units, ReLU activation, with a dropout rate of 0.3 to prevent overfitting.
FC_mu: 128 units, linear activation, outputs the mean vector ().
FC_logvar: 128 units, linear activation, outputs the log-variance vector ().
Subsequently, the latent distribution parameters are computed as μ = W_μ h + b_μ and log σ² = W_σ h + b_σ, where W_μ, b_μ, W_σ, and b_σ are trainable weights and biases, and the log-variance log σ² is used to avoid directly optimizing the variance, thus enhancing numerical stability. To sample the latent variable z, the reparameterization trick is applied, yielding z = μ + σ ⊙ ε, where ε ~ N(0, I) is noise sampled from a standard normal distribution, and σ = exp(½ log σ²).
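The encoder just described can be sketched in PyTorch following the layer configuration in Table 2. This is our reconstruction from the published table, not the authors' released code; the class name and the grouping of Conv-BatchNorm-ReLU-MaxPool stages are our choices.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encoder half of VLSet-AE per Table 2: four Conv-BN-ReLU-MaxPool
    stages (32/64/128/256 filters), then FC layers emitting the latent
    mean and log-variance. Sketch only."""
    def __init__(self, latent_dim=128):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in (32, 64, 128, 256):
            blocks += [nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU(),
                       nn.MaxPool2d(2, 2)]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.fc1 = nn.Sequential(nn.Linear(16 * 16 * 256, 512),
                                 nn.ReLU(), nn.Dropout(0.3))
        self.fc_mu = nn.Linear(512, latent_dim)
        self.fc_logvar = nn.Linear(512, latent_dim)

    def forward(self, x):                   # x: (N, 1, 256, 256)
        h = self.features(x).flatten(1)     # (N, 65536)
        h = self.fc1(h)
        return self.fc_mu(h), self.fc_logvar(h)
```

A forward pass on a 256 × 256 grayscale batch yields a pair of 128-dimensional vectors per image, matching the FC_mu and FC_logvar rows of Table 2.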
Decoder design
The objective of the decoder is to transform the latent variable z into a level set function φ, which implicitly represents the contour of the SEM image. The decoder begins by mapping the latent variable into a two-dimensional feature map h_d through a nonlinear transformation h_d = g(z; θ_d), where g represents the decoder’s mapping function (such as a deconvolutional network or multi-layer perceptron), and θ_d denotes the associated parameters. The decoder then generates the level set function φ, typically represented as a two-dimensional tensor, via the equation φ = W_φ h_d + b_φ, where W_φ and b_φ are the weights and biases of the final layer of the decoder. The function φ is used to implicitly encode the contour, with φ(x, y) = 0 indicating the boundary of the shape. The decoder thus reconstructs the level set function φ from the latent variable z, facilitating the generation of high-dimensional images from low-dimensional representations. The zero-level set corresponds to the image contour, enabling further extraction of morphological features.
The decoder architecture includes a fully connected layer to reshape the latent vector, followed by four transposed convolutional layers to upsample the feature map to the original image size:
(1) Input Layer: Latent variable (z) (128-dimensional vector).
(2) Fully Connected Layer:
FC2: 65,536 units, ReLU activation, reshapes to 16 × 16 × 256, with a dropout rate of 0.3.
(3) Transposed Convolutional Layers:
ConvTranspose1: 128 filters, 3 × 3 kernel, stride 2, ‘same’ padding, ReLU activation, outputs 32 × 32 × 128.
ConvTranspose2: 64 filters, 3 × 3 kernel, stride 2, ‘same’ padding, ReLU activation, outputs 64 × 64 × 64.
ConvTranspose3: 32 filters, 3 × 3 kernel, stride 2, ‘same’ padding, ReLU activation, outputs 128 × 128 × 32.
ConvTranspose4: 1 filter, 3 × 3 kernel, stride 2, ‘same’ padding, linear activation, outputs 256 × 256 × 1.
(4) Output: The level set function φ, a 256 × 256 tensor, representing the etched structure’s contour.
Batch normalization is applied after each transposed convolutional layer to enhance training stability. The linear activation in the final layer ensures φ can take positive and negative values, consistent with the level set method. The decoder’s design facilitates high-fidelity reconstruction of complex etched profiles, even in noisy SEM images.
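Following Table 2, the decoder can be sketched in PyTorch as below. This is our reconstruction, not the authors' code; in particular, realizing the table's 'same' padding at stride 2 with padding=1 and output_padding=1 is our interpretation of how each transposed convolution doubles the spatial size.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Decoder half of VLSet-AE per Table 2: an FC reshape followed by
    four stride-2 transposed convolutions; the final linear activation
    lets the level set function phi take both signs. Sketch only."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.fc2 = nn.Sequential(nn.Linear(latent_dim, 16 * 16 * 256),
                                 nn.ReLU(), nn.Dropout(0.3))

        def up(in_ch, out_ch, final=False):
            # padding=1 + output_padding=1 doubles H and W (our 'same' reading)
            layers = [nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                                         padding=1, output_padding=1)]
            if not final:
                layers += [nn.BatchNorm2d(out_ch), nn.ReLU()]
            return layers

        self.deconv = nn.Sequential(*up(256, 128), *up(128, 64),
                                    *up(64, 32), *up(32, 1, final=True))

    def forward(self, z):                    # z: (N, 128)
        h = self.fc2(z).view(-1, 256, 16, 16)
        return self.deconv(h)                # phi: (N, 1, 256, 256)
```

Chaining this with the encoder and the reparameterized sampling step yields the full autoencoding path from SEM image to reconstructed level set function.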
To facilitate reproducibility, Table 2 summarizes the structural parameters of the VLSet-AE encoder and decoder, including layer types, filter sizes, output shapes, activation functions, and additional configurations. The model was implemented using PyTorch on a workstation. Training was conducted with a batch size of 32, 500 epochs, and the Adam optimizer (learning rate: 0.001, β₁ = 0.9, β₂ = 0.999). The dataset comprised 1000 preprocessed SEM images (normalized to [0, 1], segmented into scallop layers).
D. Optimization of normal velocity with physical constraints in VLSet-AE
The physical constraints in the level set function represent a key innovation of the VLSet-AE model, designed to ensure that the generated level set functions adhere to the physical laws governing the DRIE process. By embedding these constraints, the model enhances the physical consistency of the reconstructed contours, making them reliable for capturing the complex etching dynamics observed in SEM images.
Principle of physical constraints
The level set method represents the position of the interface using an implicit function φ(x, y, t), where φ = 0 defines the exact interface location (i.e., the contour), φ > 0 represents the exterior, and φ < 0 represents the interior of the interface. In the etching process, the interface evolves over time, influenced primarily by the normal velocity F, which depends on etching parameters such as power, gas flow, and ion energy, and by the directional nature of the etching process, which may be isotropic (uniform etching speed) or anisotropic (speed dependent on direction). This dynamic behavior is described by the Hamilton–Jacobi equation:
∂φ/∂t + F|∇φ| = 0 | 5 |
where ∂φ/∂t represents the rate of change of the level set function over time, indicating the movement of the interface, and |∇φ| represents the gradient magnitude of the level set function, describing the direction and magnitude of the interface change.
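Equation (5) can be integrated numerically. The sketch below uses a first-order Godunov upwind scheme on a uniform grid (a standard discretization, assumed here rather than taken from the paper) to advance a circular contour outward under a constant F:

```python
import numpy as np

def evolve_level_set(phi, F, dt=0.1, h=1.0):
    """One explicit step of d(phi)/dt + F * |grad(phi)| = 0 with a
    first-order upwind scheme (F > 0 expands the phi = 0 contour)."""
    dxm = (phi - np.roll(phi, 1, axis=1)) / h   # backward difference in x
    dxp = (np.roll(phi, -1, axis=1) - phi) / h  # forward difference in x
    dym = (phi - np.roll(phi, 1, axis=0)) / h   # backward difference in y
    dyp = (np.roll(phi, -1, axis=0) - phi) / h  # forward difference in y
    # Godunov upwind gradient magnitude for F > 0
    grad = np.sqrt(np.maximum(dxm, 0)**2 + np.minimum(dxp, 0)**2 +
                   np.maximum(dym, 0)**2 + np.minimum(dyp, 0)**2)
    return phi - dt * F * grad

# Signed distance to a circle of radius 5 (phi < 0 inside the interface)
y, x = np.mgrid[0:64, 0:64]
phi0 = np.sqrt((x - 32)**2 + (y - 32)**2) - 5.0
phi1 = evolve_level_set(phi0, F=1.0)
# With F > 0 the interface moves outward, so the interior region grows
print((phi1 < 0).sum() > (phi0 < 0).sum())  # True
```

One step with dt·F = 0.1 shifts the zero level set outward by roughly 0.1 grid units, mirroring how a positive etch rate deepens and widens the evolving profile.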
Data collection and optimization of normal velocity
To ensure the physical consistency of the model, the normal velocity F is treated as a learnable parameter, optimized to reflect the actual etching rates observed in the DRIE process. We leveraged the 16-run orthogonal experimental design, which generated 1000 SEM images, to collect extensive data on trench depths (ranging from 5 to 50 μm) and corresponding etching times (10–30 min). For each SEM image, multiple trench depth measurements were taken across different regions of the etched profile, resulting in a dataset of 1500 experimental etching rates. The etching rate for each measurement was calculated as:
Etching rate = Trench depth / Etching time | 6 |
yielding a comprehensive dataset of 1500 etching rates, with values ranging from 0.5 to 2.0 μm/min. These experimentally derived etching rates were used to train a dedicated convolutional neural network (CNN) to optimize F, ensuring that the generated level set functions align with the physical dynamics of the DRIE process, as shown in Table 3 and Fig. 4.
Table 3.
Comparison of experimental etching rates and learned normal velocity F for 16 orthogonal recipes
| Experiment ID | Recipe | Trench Depth (µm) | Etching Time (min) | Experimental Etching Rate (µm/min) | Learned F (µm/min) | Relative Error (%) |
|---|---|---|---|---|---|---|
| 1 | A | 20.0 | 15.0 | 1.33 | 1.30 | 2.26 |
| 2 | B | 15.0 | 12.0 | 1.25 | 1.28 | 2.40 |
| 3 | C | 30.0 | 20.0 | 1.50 | 1.46 | 2.67 |
| 4 | D | 25.0 | 18.0 | 1.39 | 1.42 | 2.16 |
| 5 | E | 10.0 | 15.0 | 0.67 | 0.65 | 2.99 |
| 6 | F | 40.0 | 25.0 | 1.60 | 1.56 | 2.50 |
| 7 | G | 35.0 | 22.0 | 1.59 | 1.55 | 2.52 |
| 8 | H | 18.0 | 14.0 | 1.29 | 1.32 | 2.33 |
| 9 | I | 45.0 | 28.0 | 1.61 | 1.57 | 2.48 |
| 10 | J | 12.0 | 16.0 | 0.75 | 0.77 | 2.67 |
| 11 | K | 28.0 | 19.0 | 1.47 | 1.43 | 2.72 |
| 12 | L | 22.0 | 17.0 | 1.29 | 1.26 | 2.33 |
| 13 | M | 50.0 | 29.0 | 1.72 | 1.68 | 2.33 |
| 14 | N | 17.0 | 13.0 | 1.31 | 1.34 | 2.29 |
| 15 | O | 32.0 | 21.0 | 1.52 | 1.48 | 2.63 |
| 16 | P | 8.0 | 11.0 | 0.73 | 0.75 | 2.74 |
| Average relative error | | | | | | 2.61 |
Note: The experimental etching rates were calculated from trench depth measurements (5–50 µm) and etching times (10–30 min) across 1000 SEM images, yielding 1500 measurements, from which 16 representative recipes are shown. The learned F values were optimized using a dedicated CNN, achieving an average relative error of 2.61%.
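The per-recipe rates in Table 3 follow directly from Eq. (6); for instance, the first two rows can be spot-checked with a few lines of Python:

```python
# Spot-check of Eq. (6) (etching rate = trench depth / etching time)
# against experiments 1 and 2 in Table 3.
depths_um = [20.0, 15.0]   # trench depths (um)
times_min = [15.0, 12.0]   # corresponding etching times (min)
rates = [round(d / t, 2) for d, t in zip(depths_um, times_min)]
print(rates)  # [1.33, 1.25]
```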
Fig. 4.
Optimization of Normal Velocity F with Physical Constraints in VLSet-AE. a A table comparing experimental etching rates (0.67–1.60 μm/min) and learned normal velocity F values (0.65–1.57 μm/min), with an average relative error of 2.61%, indicating strong alignment. b A loss curve showing the decline of total, MSE, and physical consistency losses over 500 epochs, reflecting effective model convergence. c A scatter plot comparing learned F values against experimental etching rates, with a fitted line confirming the 2.61% average error and a near-ideal correlation (y = x), underscoring the model's accuracy
The training and optimization of F was performed using a specialized CNN, distinct from the VLSet-AE’s encoder-decoder architecture, to focus specifically on learning the normal velocity parameter. The CNN architecture for F optimization consists of five convolutional layers, each with a 3 × 3 kernel, followed by batch normalization and ReLU activation functions. The input to this network comprises preprocessed SEM image patches (128 × 128 pixels) centered on etched trench regions, paired with their corresponding experimental etching rates. The network processes these patches through convolutional layers to extract spatial features relevant to etching dynamics, followed by a fully connected layer that outputs a scalar value. The architecture progressively downsamples the input to a feature map of dimension 64, which is then flattened and mapped to the parameter F. The initial value of F was set to 1.0 μm/min, based on typical silicon etching rates reported in the literature (0.1–10 μm/min)1,3.
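A plausible PyTorch sketch of this dedicated network is given below. The channel widths and the stride-2 downsampling are assumptions, since the text specifies only the five 3 × 3 convolutional layers, batch normalization, ReLU, the 64-dimensional feature map, and the scalar output head:

```python
import torch
import torch.nn as nn

class NormalVelocityCNN(nn.Module):
    """Sketch of the dedicated CNN that regresses the normal velocity F
    from 128 x 128 SEM patches (five 3x3 conv stages with batch norm
    and ReLU; channel widths here are assumed)."""
    def __init__(self):
        super().__init__()
        chans = [1, 8, 16, 32, 64, 64]   # five conv stages
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                       nn.BatchNorm2d(cout), nn.ReLU()]
        self.features = nn.Sequential(*blocks)   # 128 -> 4 spatial size
        self.head = nn.Linear(64 * 4 * 4, 1)     # flatten -> scalar F

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.head(h).squeeze(1)           # one F value per patch

F_pred = NormalVelocityCNN()(torch.randn(4, 1, 128, 128))
print(F_pred.shape)  # torch.Size([4])
```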
Training process and loss function
During training, the CNN optimizes F to minimize a dedicated loss function that ensures alignment with the experimental etching rates. The loss function includes a mean squared error term to penalize deviations between the predicted F and the experimental etching rates:
L_rate = (1/N) Σ_{i=1}^{N} (F_i − R_i)² | 7 |
where R_i is the experimental etching rate for the i-th measurement, and N is the total number of etching rate samples. Additionally, to ensure that F contributes to physically consistent level set evolution, a physical consistency loss is incorporated:
L_phys = ‖∂φ/∂t + F|∇φ|‖² | 8 |
where ∂φ/∂t and ∇φ are computed using automatic differentiation within the PyTorch framework, ensuring accurate gradient calculations for the temporal and spatial derivatives. The total loss function for F optimization is:
L_total = L_rate + λ·L_phys | 9 |
where λ is a weighting coefficient balancing the etching rate alignment and physical consistency. The CNN was trained for 500 epochs using the Adam optimizer with a learning rate of 0.001, on a workstation equipped with an NVIDIA RTX4060Ti GPU (16 GB GDDR6 memory, 4352 CUDA cores, 2.31 GHz base clock, boost up to 2.54 GHz) and an AMD Ryzen 7 5800X CPU (8 cores, 16 threads, 3.8 GHz base clock, boost up to 4.7 GHz). The training dataset consisted of 1500 SEM image patches and their corresponding etching rates, split into 80% training, 10% validation, and 10% test sets, with data augmentation (e.g., rotation, flipping) applied to enhance robustness.
Physical constraints and validation
The learned F values were compared with the 1500 experimental etching rates, achieving an average relative error of 2.61%, as shown in Fig. 4c, confirming the physical fidelity of the Hamilton–Jacobi equation’s constraints. This dedicated CNN-based optimization of F ensures that the normal velocity is both data-driven and physically consistent, enhancing the VLSet-AE model’s ability to generate accurate and robust contours in noisy or irregular SEM images. The optimized F values are then integrated into the VLSet-AE’s encoder-decoder architecture to generate the level set function φ, ensuring that the physical constraints are consistently applied throughout the contour recognition process.
The goal of introducing these physical constraints is to ensure that the level set function φ generated by the decoder is consistent with the actual physical laws of the etching process, preventing the generation of contours that do not align with real etching behavior, such as unrealistic waviness or improper shapes. These constraints are incorporated into the model’s loss function through the Hamilton–Jacobi equation, as reflected in the physical loss term (Eq. 8), where the residual ∂φ/∂t + F|∇φ| evaluates the adherence of the generated level set function to physical principles. The normal velocity F represents the etching speed, which can be a constant, a variable, or a function of the spatial coordinates, while ∇φ describes the direction of interface change.
To compute the necessary derivatives within the deep learning framework, automatic differentiation is employed to calculate the time derivative ∂φ/∂t and the spatial gradient ∇φ, whose magnitude |∇φ| is derived from the spatial coordinates (x, y) as:
|∇φ| = √((∂φ/∂x)² + (∂φ/∂y)²) | 10 |
These physical constraints are integrated into the total loss function (Eq. 9), which is minimized during training to ensure that the decoder generates level set functions that accurately follow the etching laws observed in the DRIE process.
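As a concrete illustration of Eq. (10), the gradient magnitude can be obtained through automatic differentiation in PyTorch. The sketch below uses a simple analytic stand-in for the decoder output (a signed distance function, whose gradient magnitude is exactly 1), so the behavior can be verified directly:

```python
import torch

# |grad(phi)| via autograd (Eq. 10), using phi = sqrt(x^2 + y^2) - 0.5
# as a stand-in for the decoder output.
xy = torch.rand(100, 2) + 0.1          # (x, y) sample points, kept off the origin
xy.requires_grad_(True)
phi = (xy[:, 0] ** 2 + xy[:, 1] ** 2).sqrt() - 0.5
grads = torch.autograd.grad(phi.sum(), xy)[0]  # (d(phi)/dx, d(phi)/dy) per point
grad_mag = grads.norm(dim=1)                   # sqrt((d phi/dx)^2 + (d phi/dy)^2)
print(torch.allclose(grad_mag, torch.ones(100), atol=1e-5))  # True
```

For a signed distance function |∇φ| = 1 everywhere, which is what the autograd computation recovers here to within floating-point error.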
E. Loss function design
To train the encoder-decoder architecture of the physics-constrained variational level set autoencoder, we formulate a composite loss function that integrates three components: reconstruction loss L_recon, KL divergence L_KL, and physical consistency loss L_phys, each weighted by a hyperparameter to balance their contributions, as illustrated in Fig. 5.
Fig. 5.
Impact of varying λ1, λ2, and λ3 on VLSet-AE performance. a Accuracy (%) across seven coefficient configurations, with error bars representing standard deviations (0.5–0.8%) from multiple runs, highlighting the Baseline as the optimal configuration at 94.3%. b Average error (%) for the nine critical dimensions, with error bars (0.2–0.35%) indicating precision, showing the highest error (6.92%) for the No λ3 configuration. c Training loss convergence curves over 500 epochs, with configurations converging at rates from 300 to 480 epochs, reflecting their relative stability
The reconstruction loss quantifies the discrepancy between the predicted level set function and the ground truth contour, ensuring accurate reproduction of SEM image morphologies. It is defined as:
L_recon = (1/N) Σ_{i=1}^{N} ‖φ̂_i − φ_i‖² | 11 |
where N is the number of samples. This loss penalizes differences between the generated contour φ̂ and the true contour φ, thereby encouraging the decoder to produce accurate and detailed reconstructions that preserve the salient morphological features of the original SEM images.
The KL divergence term is introduced to regularize the latent space. Specifically, it measures the divergence between the posterior distribution of the latent variables and a pre-defined standard normal prior. It is given by
L_KL = −(1/2) Σ_{j=1}^{d} (1 + log σ_j² − μ_j² − σ_j²) | 12 |
where d is the dimensionality of the latent space, and μ_j and σ_j² represent the mean and variance of the latent variables, respectively. This term prevents overfitting by constraining the latent distribution to approximate a standard normal prior, which is particularly critical for handling noisy SEM images.
The physical consistency loss enforces that the generated level set functions adhere to the physical laws underlying the etching process. Based on the Hamilton-Jacobi equation, the loss is formulated as
L_phys = ‖∂φ/∂t + F|∇φ|‖² | 13 |
where ∂φ/∂t denotes the temporal derivative of the level set function, F is the normal etching velocity, and |∇φ| is the magnitude of the spatial gradient of φ. This loss enforces physical plausibility, ensuring that contours evolve consistently with etching dynamics, such as scallop formation and sidewall smoothness, even in low-contrast or noisy SEM images.
The total loss function used to train the model is a weighted sum of the three aforementioned components:
L_total = λ1·L_recon + λ2·L_KL + λ3·L_phys | 14 |
Here, λ1, λ2, and λ3 are hyperparameters that control the relative importance of each loss term. In our experiments, we set λ1 = 1.0, λ2 = 0.1, and λ3 = 0.5, determined through a systematic hyperparameter tuning process. These values were chosen to prioritize reconstruction accuracy (λ1 = 1.0) for precise contour delineation, while applying moderate regularization (λ2 = 0.1) to maintain latent space continuity and sufficient physical constraint weighting (λ3 = 0.5) to ensure physically meaningful contours without over-smoothing fine morphological details. The choice of λ2 = 0.1 aligns with standard VAE practices, where the KL divergence is typically down-weighted to avoid overly restrictive latent spaces13,23. The value λ3 = 0.5 balances physical constraints with data-driven learning, ensuring physically plausible contours without dominating the optimization process.
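A minimal PyTorch sketch of the composite objective in Eq. (14) is shown below. The tensor shapes, the finite-difference spatial gradient, and the treatment of the temporal derivative as a precomputed input are illustrative assumptions rather than the exact implementation:

```python
import torch

def vlset_ae_loss(phi_pred, phi_true, mu, logvar, dphi_dt, F,
                  lam1=1.0, lam2=0.1, lam3=0.5):
    """Weighted sum of reconstruction, KL, and physical losses (Eqs. 11-14).
    phi_* are (batch, H, W) level set tensors; mu/logvar parameterize the
    latent posterior; dphi_dt is assumed precomputed (e.g. via autograd)."""
    l_recon = torch.mean((phi_pred - phi_true) ** 2)               # Eq. 11
    l_kl = -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp())  # Eq. 12
    gy, gx = torch.gradient(phi_pred, dim=(-2, -1))                # spatial grads
    grad_mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
    l_phys = torch.mean((dphi_dt + F * grad_mag) ** 2)             # Eq. 13
    return lam1 * l_recon + lam2 * l_kl + lam3 * l_phys            # Eq. 14

loss = vlset_ae_loss(torch.randn(2, 256, 256), torch.randn(2, 256, 256),
                     torch.randn(2, 128), torch.randn(2, 128),
                     torch.zeros(2, 256, 256), F=1.33)
print(loss.item() >= 0)  # True (all three terms are non-negative)
```

Because the reconstruction, KL, and Hamilton–Jacobi residual terms are each non-negative, the weights λ1–λ3 directly control their relative pull on the shared decoder parameters.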
To rigorously validate these choices and assess the contribution of each loss component, we conducted an ablation study by systematically varying each coefficient while keeping others fixed, evaluating their impact on key performance metrics: overall contour recognition accuracy, average error across nine critical dimensions (scallop depth, scallop width, etc.), and training stability (measured by convergence speed and final loss values). The study was performed on the same dataset of 1,000 SEM images used in the main experiments, with training conducted on an NVIDIA RTX4060Ti GPU (16 GB GDDR6, 4352 CUDA cores).
The ablation study, detailed in Table 4 and Fig. 5, evaluates the impact of varying the loss function coefficients λ1, λ2, and λ3 on the VLSet-AE model’s performance. Reducing the reconstruction loss weight to λ1 = 0.5 decreased accuracy to 91.8% and increased the average error to 5.34%, compromising contour fidelity, especially for complex features like scallop width (valley-to-valley). Conversely, increasing λ1 to 2.0 led to a slight accuracy drop to 93.5% due to under-regularization, causing overfitting on noisy SEM regions by prioritizing pixel-level accuracy over generalizable features. A higher KL divergence weight of λ2 = 0.5 overly constrained the latent space, reducing accuracy to 92.9% and increasing error to 4.15% by limiting expressiveness for intricate morphological details. A very low λ2 = 0.01 resulted in unstable training (130 epochs to converge) and a reduced accuracy of 93.7%, as the less-regularized latent space introduced prediction variability. Over-emphasizing physical constraints with λ3 = 1.0 produced over-smoothed contours, lowering accuracy to 92.4% and increasing error to 4.47%, particularly for fine features like scallop radius. Omitting physical constraints entirely (λ3 = 0.0) significantly degraded performance, with accuracy falling to 89.7% and error rising to 6.92%, as the model failed to capture etching dynamics, leading to inaccurate contours in noisy or low-contrast SEM images. These results confirm the optimal balance achieved by the baseline configuration of λ1 = 1.0, λ2 = 0.1, and λ3 = 0.5.
Table 4.
Ablation study on loss function coefficients
| Case | λ1 | λ2 | λ3 | Accuracy (%) | Avg. Error (%) | Training Stability (Epochs to Converge) | Scallop Depth Error (%) | Profile Angle Error (%) |
|---|---|---|---|---|---|---|---|---|
| Baseline | 1.0 | 0.1 | 0.5 | 94.3 | 3.65 | Stable, 100 epochs | 2.29 | 0.56 |
| Low λ1 | 0.5 | 0.1 | 0.5 | 91.8 | 5.34 | Stable, 150 epochs | 3.12 | 0.89 |
| High λ1 | 2.0 | 0.1 | 0.5 | 93.5 | 4.02 | Stable, 120 epochs | 2.67 | 0.72 |
| High λ2 | 1.0 | 0.5 | 0.5 | 92.9 | 4.15 | Slight overfitting, 110 epochs | 2.85 | 0.68 |
| Low λ2 | 1.0 | 0.01 | 0.5 | 93.7 | 3.98 | Unstable, 130 epochs | 2.58 | 0.65 |
| High λ3 | 1.0 | 0.1 | 1.0 | 92.4 | 4.47 | Stable, 115 epochs, over-smoothed | 3.01 | 0.94 |
| No λ3 | 1.0 | 0.1 | 0.0 | 89.7 | 6.92 | Unstable, 180 epochs | 4.23 | 1.12 |
Figure 5 presents a comprehensive analysis of the ablation study for the VLSet-AE model, evaluating the impact of varying the loss function coefficients (λ1, λ2, λ3) on its performance across three subplots. Figure 5a displays a bar chart of accuracy (%) across seven coefficient configurations, ranging from 89.7% (No λ3) to 94.3% (Baseline), with error bars indicating standard deviations (0.5–0.8%), reflecting variability in model performance. The Baseline configuration (λ1 = 1.0, λ2 = 0.1, and λ3 = 0.5) achieves the highest accuracy, suggesting an optimal balance of reconstruction, regularization, and physical constraints, while deviations (e.g., Low λ1 = 0.5, No λ3 = 0.0) reduce accuracy, particularly under noisy SEM conditions. Figure 5b presents a bar chart of average error (%) for the nine critical dimensions, ranging from 3.65% (Baseline) to 6.92% (No λ3), with error bars (0.2–0.35%) indicating measurement precision; the increased error for No λ3 underscores the importance of physical constraints in maintaining morphological fidelity. Figure 5c illustrates the training loss convergence curves over 500 epochs, with the configurations converging at distinct rates (300–480 epochs); the Baseline curve stabilizes fastest, at 300 epochs, while No λ3 exhibits the slowest and most unstable convergence, at 480 epochs, consistent with its reported instability. The loss curves follow a logistic decay with an inflection point at roughly 70% of each configuration’s convergence epoch, with greater noise for the unstable cases (Low λ2, No λ3). Collectively, these results validate the Baseline configuration’s superiority, highlighting the critical role of balanced loss terms in achieving high accuracy, low error, and stable training for DRIE SEM profile analysis.
F. Adaptive contour recognition for segmented scallop layers
The segmented scallop layers obtained from SEM images are subsequently analyzed using the VLSet-AE model for fine-grained feature extraction and morphological characterization. For each scallop segment, the level set function is initialized at the geometric center of the layer and undergoes a dynamic outward evolution, analogous to the expansion of an inflating balloon, as shown in Fig. 6. This propagation continues until the evolving interface encounters the sidewalls of the etched structure, at which point the expansion automatically halts. This self-regulating and adaptive evolution mechanism enables the model to delineate structural boundaries with high precision in real time. By initiating contour growth from within the feature and terminating it upon boundary contact, the method naturally suppresses over-segmentation and enhances robustness against image noise and artifacts. As a result, it effectively captures the intricate morphological details of scallop formations, including subtle curvature variations and edge transitions. Each scallop layer is processed independently using this mechanism, and the extracted morphological descriptors—such as trench depth, scallop depth, and sidewall roughness—are subsequently assembled to reconstruct a complete etching profile. This layer-wise, contour-driven reconstruction approach provides a high-resolution, physically consistent understanding of the DRIE process, offering valuable insights into etch uniformity, structural integrity, and process stability across depth.
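The self-regulating, balloon-like evolution described above can be sketched as follows. This is a simplified illustration: the explicit update rule, the synthetic circular "layer", and the area-based halting criterion are assumptions standing in for the exact VLSet-AE mechanism:

```python
import numpy as np

def evolve_until_boundary(phi, speed, dt=0.2, patience=5, max_iter=500):
    """Balloon-style outward evolution of the phi = 0 contour that halts
    adaptively once the enclosed area stops changing (boundary reached)."""
    area_prev, calm = (phi < 0).sum(), 0
    for it in range(max_iter):
        gy, gx = np.gradient(phi)
        phi = phi - dt * speed * np.sqrt(gx**2 + gy**2)  # expand where speed > 0
        area = (phi < 0).sum()
        calm = calm + 1 if area == area_prev else 0
        if calm >= patience:        # self-regulating stop at the sidewalls
            return phi, it + 1
        area_prev = area
    return phi, max_iter

# Toy scallop layer: the "sidewall" sits at radius 20, where speed drops to 0
y, x = np.mgrid[0:64, 0:64]
r = np.sqrt((x - 32.0)**2 + (y - 32.0)**2)
speed = (r < 20).astype(float)   # ~1 inside the layer, 0 at the sidewalls
phi0 = r - 3.0                   # small seed contour at the layer center
phi_final, n_iter = evolve_until_boundary(phi0, speed)
```

Because the speed field vanishes at the sidewalls, the contour inflates from the seed, the enclosed area plateaus on contact, and the loop halts without any externally imposed stopping point.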
Fig. 6.
Contour recognition and evolution visualization: the SEM images are segmented along the etching depth into individual scallop layers, enabling the extraction of multi-layer scallop features and the automated calculation of contour edges and feature dimensions using the VLSet-AE model. a Input segmented scallop images from the top and middle sections of SEM along etching depth into the VLSet-AE, which automatically recognizes and extracts features of cross-sectional profiles. b Input segmented scallop images from the bottom sections of SEM along etching depth into the VLSet-AE, which automatically recognizes and extracts features of cross-sectional profiles
Figure 6 presents a visual demonstration of the VLSet-AE in contour extraction and morphological analysis of DRIE profiles from scanning electron microscope (SEM) images. It consists of two cases, labeled a and b, each illustrating the application of VLSet-AE to a different etching profile, corresponding to distinct structural variations.
In both cases, the leftmost SEM images depict etched trenches with distinct scallop formations, characteristic of the Bosch process in DRIE. A magnified region of interest (ROI) is highlighted in red, where the structure undergoes segmentation into individual scallop layers, visualized as a set of horizontal slices extracted along the trench depth. These slices serve as input for VLSet-AE, where each scallop layer is independently analyzed and reconstructed through the model’s adaptive level set evolution. The middle section of Fig. 6 illustrates the level set function evolution in a 3D representation. The extracted contours are transformed into 3D surface plots, where the expansion dynamics of the level set function can be observed. As described earlier, the level set function expands outward like an inflating balloon, adjusting adaptively as it interacts with the sidewalls of the trench. The collision of the evolving level set with the trench walls is evident, ensuring an accurate delineation of the etching boundaries. On the right side of the figure, the results of the automatic contour extraction and feature calculation are displayed. The processed scallop layers are realigned and reconstructed, forming a complete etching profile with well-defined contours. The extracted profiles, overlaid on the original SEM images, confirm the precise identification of scallop boundaries, with cyan-colored contours clearly marking the structural edges. This automatic organization enables the direct calculation of key morphological parameters, including trench depth and scallop depth, which are essential for evaluating etching uniformity and process stability.
By leveraging the VLSet-AE model, this approach establishes a solid foundation for the automated extraction of SEM-based feature parameters. Specifically, our proposed VLSet-AE model enables precise contour recognition of multi-segment scallops and reconstructs the identified segments to obtain a complete etched profile, laying the groundwork for subsequent automated feature detection. This facilitates advanced morphological analysis and intelligent process optimization in semiconductor manufacturing.
G. Temporal-scale three-dimensional etching trench feature recognition, extraction and calculation
Building upon the contour recognition results obtained through the VLSet-AE model, as illustrated in Fig. 7, we introduce a three-dimensional contour feature recognition model that characterizes etching trench features within a temporal-scale framework. This model constructs a three-dimensional representation where time sequence serves as the x-axis, etching linewidth as the y-axis, and etching depth as the z-axis, enabling a comprehensive understanding of the dynamic evolution of the etching process.
Fig. 7.
A temporal-scale three-dimensional etching trench feature recognition, extraction and calculation framework based on VLSet-AE model. a Temporal-scale three-dimensional etching feature recognition framework based on VLSet-AE is constructed, with time as the x-axis, linewidth as the y-axis, and etching depth as the z-axis. b The identified scallop segments are reconstructed into a complete etching profile. c Recognition and extraction of critical dimensions. Various critical dimensions of the etched profile are automatically recognized and extracted by the model. d Multi-segment 3D etching morphology reconstruction. By integrating the features from all depths, the model produces a full 3D morphology of the trench structure, allowing for detailed analysis of the etched profile
In the context of DRIE using the Bosch process, the “time” dimension is defined in a generalized sense, reflecting the sequential formation of scallop layers along the etching depth. As the Bosch process alternates between etching and passivation cycles, it produces a series of scallop layers stacked vertically along the trench depth. By segmenting and reassembling these layers in order of increasing depth, we reconstruct the complete etching profile, effectively capturing the structural evolution as a function of depth. This depth-dependent layer-wise organization serves as a proxy for the temporal evolution of the etching process, as each scallop layer corresponds to a single etch-passivation cycle. This approach allows us to model the dynamic formation of the etched structure without relying on traditional temporal modeling mechanisms, such as those used in recurrent neural networks (RNNs) or Long Short-Term Memory (LSTM) networks.
To achieve a complete representation of the etched structure, the identified scallop contours are reassembled along the etching depth direction, effectively reconstructing the full trench morphology, as shown in Fig. 7b. During this process, the model not only extracts and reconfigures individual scallop segments but also simultaneously computes critical dimension parameters at both the scallop and trench levels. For each scallop, the model quantifies key features, including scallop depth, scallop radius, peak-to-peak width, and valley-to-valley width. At the trench scale, it further calculates nine essential structural parameters, such as trench opening width, mid-depth width, bottom width, overall trench depth, and sidewall angle, providing a detailed characterization of the etching profile, as shown in Fig. 7c. As the model automates the calculation of key feature parameters, it simultaneously simulates the 3D depth information of the trench structure. Through the evolution of the level set function within the model, the scallop segments at different depths are progressively integrated, allowing for a comprehensive view of both vertical and lateral features, as shown in Fig. 7d. By integrating features from all depths, the VLSet-AE model generates a complete 3D morphology of the trench structure, facilitating a detailed analysis of the etched profile. This dynamic process ensures that the model captures the intricate etching dynamics and accurately reconstructs the structure’s final form.
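The depth-ordered assembly described above amounts to stacking per-layer contour samples into a (time, linewidth, depth) array. The sketch below uses purely synthetic scallop geometry to illustrate the data layout, not measured values:

```python
import numpy as np

# Assemble per-cycle scallop contours into the 3D (time, linewidth, depth)
# representation. Geometry here is a toy sinusoidal scallop profile.
n_layers, n_pts = 12, 64                   # etch-passivation cycles, samples/layer
t = np.arange(n_layers, dtype=float)       # cycle index ~ generalized "time"
u = np.linspace(0.0, 2.0 * np.pi, n_pts)   # parameter along one layer contour
width = 10.0 + 0.4 * np.sin(u)[None, :] - 0.05 * t[:, None]  # tapering linewidth
depth = 1.5 * t[:, None] + np.zeros((n_layers, n_pts))       # depth of each layer
time = np.broadcast_to(t[:, None], width.shape)
profile = np.stack([time, width, depth], axis=-1)  # (layers, points, [t, y, z])
print(profile.shape)  # (12, 64, 3)
```

Each slice `profile[k]` is one etch-passivation cycle; concatenating them in depth order recovers the full trench morphology without any recurrent temporal model.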
However, we recognize that the current approach of implicitly encoding the temporal dimension through sequential depth-wise scallop layer modeling in the Bosch DRIE process may have limitations in capturing complex temporal dynamics, especially in cases with non-uniform etching rates or intricate process variations. In our method, the “temporal” dimension refers to the sequential formation of scallop layers along the depth direction, where each layer is reorganized to reconstruct the complete etched profile, representing a generalized notion of time tied to the structural evolution rather than a recurrent neural network-style temporal modeling mechanism. To address these limitations, future work will explore integrating explicit temporal modeling components, such as Transformer-based architectures, which excel at capturing long-range dependencies and dynamic temporal relationships in sequential data. By incorporating such mechanisms, we aim to enhance the model’s ability to represent the structural formation process with greater precision and flexibility, thereby improving performance in complex DRIE scenarios.
This systematic feature extraction framework ensures a comprehensive, high-fidelity reconstruction of the trench morphology, facilitating an accurate, automated, and quantitative analysis of the etching process. By integrating dynamic contour recognition with multi-scale feature analysis, this framework establishes a robust foundation for data-driven etching process optimization and predictive modeling, enabling precise control over semiconductor microfabrication processes.
Results and discussion
A. Visual comparison of contour recognition: traditional CNN vs. proposed VLSet-AE model
To further highlight the advantages of the proposed physics-constrained variational level set autoencoder (VLSet-AE), we present a visual comparison of its contour recognition performance against the conventional convolutional neural network (CNN) receptive-field convolution method, as illustrated in Fig. 8. The CNN approach relies on a static two-dimensional receptive-field scanning mechanism (Fig. 8a1), where a convolution kernel extracts local features through multi-layer operations. However, when applied to SEM images of DRIE etched profiles (Fig. 8a2–a3), the CNN method struggles to capture intricate scallop structures, resulting in discontinuous and blurry contours (green lines), particularly in noisy regions. This limitation stems from the lack of physical constraints and depth-aware feature extraction, leading to suboptimal recognition accuracy in complex morphologies.
Fig. 8.
Visual Comparison of contour recognition in DRIE SEM images Using VLSet-AE and traditional CNN methods. a Contour recognition using traditional CNN method. a1 illustrates CNN's static two-dimensional receptive field scanning, where a convolution kernel extracts local features through multi-layer operations. a2 reveals CNN's issues in SEM images, including repeated contour identification, incomplete edge detection at the etched opening, and overlapping contours in scallop segments due to sequential scanning. a3 highlights further limitations, with subfigure (2) showing missing contours due to factors like shooting angle and light intensity, (3) and (4) displaying boundary recognition and feature calculation errors, and (5) exhibiting unclear boundary identification and semantic segmentation errors in a CNN-based model. b Contour recognition and reconstruction using the proposed VLSet-AE model. b1 illustrates the physics-constrained level set framework, where the level set function starts at the scallop layer center, evolves adaptively through stages, and stops at contour edges, guided by the Hamilton-Jacobi equation. The right side of b1 shows stable evolution of contour area, perimeter, and speed norm, halting when the change rate drops below a threshold for precise boundary delineation. b2 displays VLSet-AE's neural computation visualization (left), showing refined contour features with smooth boundaries (blue lines) in noisy regions, and integral contour recombination (right), reconstructing a complete etched profile from segmented scallop layers with high-fidelity depth-dependent features
Specifically, Fig. 8a2 illustrates the application of the traditional CNN method, revealing significant issues in contour recognition within SEM images of etched profiles. Notable problems include repeated contour identification and incomplete edge detection, such as the failure to recognize the contour structure at the etched opening, where identification gaps are evident. Furthermore, in the recognition of individual scallop segments, the CNN method produces multiple overlapping boundary contours. This issue arises because the traditional receptive field-based convolution operates as a “sequential” scanning method, which is prone to repeated identification errors at complex boundaries, leading to misinterpretations of the etched structure. Similarly, Fig. 8a3 demonstrates the overall structural contour recognition using the CNN method, further exposing its limitations. In Fig. 8a3(2), there are evident issues with missing contours in the overall structure, exacerbated by constraining factors such as the SEM image’s shooting angle, light intensity, and clarity, which degrade the CNN’s ability to accurately capture the etched profile. Additionally, in Fig. 8a3(3), (4), boundary recognition errors and automatic feature calculation inaccuracies are apparent, where the extracted contours fail to align with the true edges of the scallop structures, leading to erroneous morphological parameter estimation. Finally, Fig. 8a3(5) employs a CNN-based semantic segmentation model, which struggles to distinguish between different etched boundaries, resulting in unclear boundary feature identification and semantic segmentation errors. These issues highlight the CNN’s challenges in handling the complex and variable nature of DRIE SEM images, particularly in distinguishing subtle morphological differences.
In contrast, the proposed VLSet-AE (Fig. 8b) leverages a physics-constrained level set framework to achieve superior contour recognition. As shown in Fig. 8b1, the level set function initializes at the center of the scallop layer and evolves dynamically through multiple stages, adaptively expanding until it reaches the contour edges, where it automatically stops. This process, governed by the Hamilton–Jacobi equation, ensures that the generated contours align with the physical dynamics of the etching process. The evolution of contour area, perimeter, and speed norm (Fig. 8b1, right) demonstrates the stability and controllability of the level set function, with the area and perimeter converging smoothly and the speed norm decreasing steadily over iterations. Unlike the traditional “sequential” convolution scanning method, our adaptive level set function evolution approach is better suited for precise contour recognition in irregular images. In various SEM images of etched profiles, we allow the level set function to adaptively identify an appropriate initial evolution position before iteratively evolving. This evolution is regulated through a velocity field, enabling controlled adaptation. As observed in Fig. 8b1, when the evolution function’s speed or the recognized area falls below a predefined change rate threshold, the process adaptively halts, effectively reaching the contour boundary. This mechanism ensures robust and accurate delineation of complex etched structures, even in the presence of noise and morphological irregularities. Fig. 8b2 provides a visualization of the neural computation process and the final contour recognition results. On the left side of Fig. 8b2, we present the neural learning visualization of the algorithm model applied to the overall structural contours of the SEM image. 
This visualization reveals how the model iteratively learns and refines the contour features, accurately capturing the intricate scallop patterns with continuous and smooth boundaries (blue lines), even in regions with high noise or complex morphology. This demonstrates the model’s capability to generalize across diverse etched profiles while maintaining high precision. On the right side, the integral contour recombination (Fig. 8b2, right) enables the reconstruction of a complete etched profile by assembling segmented scallop layers, capturing depth-dependent features with high fidelity. By integrating the segmented contours along the etching depth, VLSet-AE ensures a comprehensive representation of the complex morphology, providing a solid foundation for subsequent quantitative analysis and process optimization in DRIE applications.
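The adaptive evolve-and-stop behavior described above can be illustrated with a small, self-contained sketch. This is not the paper's implementation: it replaces the continuous Hamilton-Jacobi update with a discrete front that grows one pixel per iteration, is blocked by strong intensity gradients (a stand-in for the velocity field vanishing at contour edges), and halts when the relative area change rate falls below a threshold, mirroring the stopping rule described in the text. All names and parameters are illustrative.

```python
import numpy as np

def adaptive_region_grow(image, seed, grad_thresh=10.0,
                         area_tol=1e-3, max_iter=500):
    """Discrete, simplified analogue of the adaptive contour evolution
    described in the text (illustrative only; the actual model evolves a
    continuous level set under the Hamilton-Jacobi equation). The front
    expands by one pixel per iteration, cannot enter pixels with a strong
    intensity gradient, and halts adaptively when the relative area
    change rate falls below `area_tol`."""
    gy, gx = np.gradient(image.astype(float))
    moveable = np.hypot(gx, gy) < grad_thresh   # front speed ~ 0 at edges
    mask = seed.astype(bool).copy()
    prev_area = int(mask.sum())
    for it in range(1, max_iter + 1):
        grown = mask.copy()                     # 4-neighbour dilation
        grown[1:, :] |= mask[:-1, :]
        grown[:-1, :] |= mask[1:, :]
        grown[:, 1:] |= mask[:, :-1]
        grown[:, :-1] |= mask[:, 1:]
        mask = (grown & moveable) | mask        # edges block the front
        area = int(mask.sum())
        if abs(area - prev_area) / max(prev_area, 1) < area_tol:
            return mask, it                     # adaptive stop at boundary
        prev_area = area
    return mask, max_iter

# Synthetic profile: a bright disk with a sharp edge stands in for one
# scallop cross-section; the seed contour starts at its centre.
yy, xx = np.mgrid[0:64, 0:64]
r = np.hypot(yy - 32.0, xx - 32.0)
img = np.where(r < 20, 100.0, 0.0)
mask, n_iter = adaptive_region_grow(img, seed=(r < 5.0))
```

On this synthetic disk the front expands through the homogeneous interior, stalls at the sharp intensity edge, and stops once the per-iteration area change becomes negligible, the same qualitative behavior reported for the level set function in Fig. 8b1.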
B. Critical dimension performance evaluation and ablation experiment of VLSet-AE for DRIE etched profile analysis
Critical dimension performance evaluation
In this section, we present a comprehensive performance evaluation of the VLSet-AE model by comparing its predictions with manual measurements, focusing on error distribution, model comparison, training stability, and prediction accuracy across various dimensions.
To ensure the reliability and reproducibility of our experiments, all algorithm models, including the proposed VLSet-AE and comparative models (CNN, LSTM, SVM, Random Forest, ResNet, GoogleNet, and AttentionNet), were evaluated on a high-performance computing setup. Specifically, the experiments were conducted on a workstation equipped with an NVIDIA RTX4060Ti GPU, featuring 16 GB of GDDR6 memory, 4352 CUDA cores, and a base clock speed of 2.31 GHz (boost up to 2.54 GHz). The system was paired with an AMD Ryzen 7 5800X CPU (8 cores, 16 threads, 3.8 GHz base clock, boost up to 4.7 GHz), 32 GB of DDR4 RAM (3200 MHz), and ran on Ubuntu 22.04 LTS with CUDA 12.2 and PyTorch for model implementation. This hardware configuration provided sufficient computational power to handle the intensive neural computations and large-scale SEM image processing required for our study.
The evaluation of VLSet-AE’s performance focuses on the accurate extraction of nine critical dimensions from DRIE etched profiles: scallop depth, scallop width (peak-to-peak and valley-to-valley), scallop radius, total profile angle, trench depth, bow width, mid width, and bottom width. Figure 9a provides a schematic diagram of the etched profile, annotating these dimensions on a cross-sectional SEM image to clarify their definitions and locations. To rigorously assess the model’s generalization capabilities, we conducted 5-fold cross-validation on the 1000 SEM image dataset, with results averaged across folds to ensure robustness and reproducibility. The dataset was split into 80% training (800 images), 10% validation (100 images), and 10% test (100 images) sets, following established practices for evaluating deep learning models on limited datasets. Figure 9e illustrates the training, validation, and test loss trajectories over 500 epochs, demonstrating rapid convergence (within 100 epochs) and stable performance, with validation and test losses closely aligned (final test loss: 0.012 ± 0.002), indicating minimal overfitting. Early stopping was implemented with a patience of 20 epochs, halting training when validation loss improvement fell below 0.001 for 20 consecutive epochs, further preventing overfitting. Figure 9b presents the prediction errors for the nine critical dimensions under one etching recipe, with an average error of 3.65% ± 0.82% (standard deviation across folds). The total profile angle exhibits the lowest error (0.56% ± 0.12%), while scallop width (valley-to-valley) shows the highest (6.28% ± 1.45%), reflecting its geometric complexity. Figure 9c’s confusion matrix highlights strong agreement between predicted and actual dimensions, with diagonal entries showing high accuracy (e.g., 98.2% for profile angle) and minimal cross-parameter confusion, particularly for geometrically similar features. 
To further evaluate generalization under limited data conditions, we conducted a sensitivity analysis by training VLSet-AE on reduced dataset sizes (512 and 256 images). The model maintained an average error of 4.12% ± 0.95% and 4.87% ± 1.10%, respectively, demonstrating robust performance even with fewer samples. The correlation analysis in Fig. 9b shows a correlation coefficient of 0.998 ± 0.001 across folds, with residuals tightly clustered around zero (mean residual: 0.003 ± 0.015), confirming high predictive fidelity. These results, combined with the radar charts in Figs. 9d and 12d comparing VLSet-AE against seven state-of-the-art models (CNN, LSTM, SVM, Random Forest, ResNet, GoogleNet, AttentionNet), underscore its superior generalization (96% ± 1.2% accuracy) and computational efficiency (training time: 20 s, inference time: 1.2 s per image). The regularization strategies (KL divergence, physics-constrained loss, dropout, weight decay, and data augmentation) and cross-validation approach ensure robust performance, making VLSet-AE highly suitable for real-time SEM image analysis in DRIE processes. The schematic in Fig. 9a serves as a foundational illustration of the feature parameters extracted by the model; in the following, we compute the numerical values of these parameters using VLSet-AE and compare them against manual measurements to analyze error metrics.
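The early-stopping rule used above (patience of 20 epochs, minimum validation-loss improvement of 0.001) can be sketched as a small helper. This is an illustrative reconstruction, not the authors' code; the class name and the toy loss values are assumptions.

```python
class EarlyStopping:
    """Sketch of the early-stopping rule described above: training halts
    once the validation loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs."""

    def __init__(self, patience=20, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if self.best - val_loss > self.min_delta:
            self.best = val_loss      # meaningful improvement: reset counter
            self.stale = 0
        else:
            self.stale += 1           # below-threshold improvement
        return self.stale >= self.patience

# Typical use inside a training loop (losses here are made up):
stopper = EarlyStopping(patience=3, min_delta=0.01)
history = [stopper.step(loss) for loss in [1.0, 0.5, 0.5, 0.499, 0.498]]
```

In the toy trace, the last three epochs improve by less than `min_delta`, so the third stale epoch triggers the stop, exactly the behavior described for the 20-epoch patience window.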
Fig. 9.
Critical dimension performance evaluation of VLSet-AE for DRIE etched profile analysis. a Recognition and error analysis of nine critical dimensions using VLSet-AE, including scallop depth, scallop width (peak-to-peak and valley-to-valley), scallop radius, total profile angle, trench depth, bow width, mid width, and bottom width. b The VLSet-AE model’s computed values are compared directly against manual measurement data. c Confusion matrix of VLSet-AE predictions vs. manual measurements for nine dimensions. d Radar chart comparing error rates of VLSet-AE and other models across nine dimensions. e Training, validation, and test loss trajectories over 500 epochs, demonstrating VLSet-AE’s stable convergence
Fig. 12.
Performance testing and evaluation of different algorithm models. a Training, validation, and test loss comparison among multiple models. b Correlation analysis of automatic calculation vs. actual measurements. c Residual analysis of automatic calculation vs. actual measurements. d Individual performance radar charts of direct efficiency scores (training time, inference time, memory usage, parameter count, accuracy)
Figure 9b illustrates the prediction errors for nine critical dimension parameters—scallop depth, scallop width (peak to peak), scallop width (valley to valley), scallop radius, total profile angle, trench depth, bow width, mid width, and bottom width—under one etching recipe. The VLSet-AE model’s computed values are compared directly against manual measurement data. Notably, the total profile angle achieves the lowest error (approximately 0.56%), suggesting that the model excels at capturing angular features. By contrast, the scallop width (valley–valley) exhibits the highest deviation (around 6.28%), likely reflecting increased geometric complexity and measurement sensitivity in that dimension. The remaining parameters show intermediate error values (2–5%), underscoring the model’s overall consistency and reliability in handling diverse critical dimensions within this specific etching context.
Figure 9c displays the confusion matrix for nine different critical dimension parameters, comparing the VLSet-AE model’s automatic predictions with manual measurements. Each cell in the matrix represents the correspondence between a predicted parameter (horizontal axis) and the actual parameter (vertical axis), with darker shades of red indicating higher accuracy and lighter or blue shades reflecting greater misclassification. Notably, the diagonal entries show relatively strong agreement for most dimensions, such as bow width and profile angle, underscoring the model’s effectiveness in accurately identifying and quantifying these parameters. By contrast, certain off-diagonal cells reveal moderate confusion among geometrically similar features—particularly those with subtle morphological differences (e.g., scallop width in its peak–peak and valley–valley definitions). These instances of cross-parameter misclassification suggest that while VLSet-AE robustly captures most topographical nuances, additional refinement or more extensive training data may further mitigate overlaps in closely related dimensions. The confusion matrix underscores the VLSet-AE model’s capacity to discern distinct etching parameters.
Model comparison is conducted using a radar chart to evaluate the average error rates of VLSet-AE against other models (CNN, LSTM, SVM, Random Forest, ResNet, GoogleNet, AttentionNet) across nine dimensions under 16 orthogonal etching recipes as in Fig. 12d. VLSet-AE exhibits the smallest error polygon, outperforming others, which show higher errors and reduced robustness for complex etching predictions. A complementary radar chart details the prediction errors for each model-dimension pair, confirming VLSet-AE’s lowest errors in dimensions, like scallop depth and trench depth, with competitive performance in challenging parameters like scallop width (valley-to-valley). The training stability of VLSet-AE is assessed through its training, validation, and test loss trajectories over 500 epochs for the same etching recipe as in Fig. 9e. The training loss (blue) decreases rapidly and stabilizes, showing efficient learning with minimal overfitting, while the validation (green) and test (red) losses converge to stable plateaus, demonstrating robust generalization across datasets and the model’s ability to capture complex topographical features.
Ablation experiment for model variants
To quantify the contributions of individual components in the VLSet-AE model, we conducted an ablation study comparing three model variants: (1) a standard Variational Autoencoder (VAE), (2) a VAE with level set decoder (VAE + Level Set), and (3) the complete VLSet-AE model incorporating both the level set decoder and physical constraints. These variants were evaluated across five key performance metrics: average feature recognition error, overall model accuracy, correlation coefficient with ground truth, training time, and inference time. The experiments were conducted on the same high-performance computing setup described previously.
The ablation study results are summarized in Table 5, which compares the three model variants across the five metrics. The standard VAE serves as a baseline, utilizing only the probabilistic encoder-decoder framework without level set or physical constraints. The VAE + Level Set variant incorporates the level set decoder to capture contour evolution but lacks the physical constraints based on the Hamilton-Jacobi equation. The complete VLSet-AE model integrates both the level set decoder and physical constraints, ensuring that the reconstructed contours align with the physical dynamics of the DRIE process.
Table 5.
Ablation experiment results for VLSet-AE model variants
| Model Variant | Average Error (%) | Accuracy (%) | Correlation Coefficient | Training Time (s) | Inference Time (s) |
|---|---|---|---|---|---|
| Standard VAE | 8.12 | 85.6 | 0.962 | 35.4 | 2.8 |
| VAE + Level Set | 5.27 | 90.2 | 0.981 | 28.7 | 2.1 |
| VLSet-AE (Full Model) | 3.65 | 94.3 | 0.998 | 20.0 | 1.2 |
As shown in Table 5, the complete VLSet-AE model outperforms both the standard VAE and the VAE + Level Set variant across all five metrics. The average feature recognition error is reduced from 8.12% (standard VAE) to 3.65% (VLSet-AE), demonstrating the significant contribution of the level set decoder and physical constraints in improving contour accuracy. The overall model accuracy increases from 85.6% to 94.3%, highlighting the enhanced robustness of VLSet-AE in handling noisy SEM images. The correlation coefficient with ground truth measurements improves from 0.962 to 0.998, indicating near-perfect alignment with manual annotations. Additionally, VLSet-AE achieves the shortest training time (20.0 s) and inference time (1.2 s), reflecting its computational efficiency and suitability for real-time applications. These results confirm that both the level set decoder and physical constraints are critical to the model’s superior performance, with the physical constraints providing the most significant improvement by ensuring physically plausible contour evolution.
To further illustrate the performance differences, we present a visualization of the ablation study in Fig. 10, which comprises three subplots based on the algorithmic results. Figure 10a shows a bar plot comparing the three model variants across the five metrics, confirming VLSet-AE’s superior performance with an average error of 3.65%, accuracy of 94.3%, correlation coefficient of 0.998, training time of 20.0 s, and inference time of 1.2 s, significantly outperforming the standard VAE (8.12%, 85.6%, 0.962, 35.4 s, 2.8 s) and VAE + Level Set (5.27%, 90.2%, 0.981, 28.7 s, 2.1 s). This indicates that the integration of the level set decoder and physical constraints substantially enhances both accuracy and computational efficiency. Figure 10b displays normalized loss curves for training, validation, and test losses over 1000 epochs, with losses scaled to [0, 1] for consistent comparison. VLSet-AE exhibits the fastest convergence, with losses dropping rapidly within the first 200 epochs from approximately 0.50–0.60 to 0.20, stabilizing between 600 and 1000 epochs at 0.09–0.11. In contrast, VAE + Level Set stabilizes at 0.13–0.18, and standard VAE at 0.27–0.31, with slower convergence rates, reflecting the superior optimization efficiency of VLSet-AE, likely due to the physical constraints reducing overfitting. Figure 10c presents a box plot of the average feature error distribution across 10 runs, where VLSet-AE demonstrates the highest stability with the lowest median error (3.68%) and the narrowest interquartile range (approximately 3.4–3.9%), compared to the broader distributions of standard VAE (median ~8.1%) and VAE + Level Set (median ~5.3%). This underscores the robustness of VLSet-AE, attributable to the synergistic effects of the level set decoder and physical constraints.
Fig. 10.
Ablation experiment for model variants. a Metric Comparison Across Model Variants. b Normalized Loss Curves Across Model Variants. VLSet-AE converges fastest, stabilizing at 0.09–0.11 over 1000 epochs, compared to 0.13–0.18 for VAE + Level Set and 0.27–0.31 for standard VAE, indicating superior optimization efficiency. c Error Distribution Across Runs. VLSet-AE shows the highest stability with a median error of 3.65% and narrow range (3.4–3.9%), outperforming the broader distributions of other variants
C. Analysis of inherent errors in micro-nano manufacturing process
To address the inherent errors in the micro-nano manufacturing process and distinguish between model prediction errors and intrinsic process/characterization errors, we conducted a comprehensive analysis of critical dimensional variations and systematic errors in SEM imaging.
To quantify process-induced variations, we selected three representative locations—left edge, center, and right edge—on wafers from the same batch processed under identical DRIE conditions. These locations were cleaved to expose cross-sectional profiles, and the etch depth for a 3 µm linewidth was measured as a critical dimension. At each location, 10 replicate measurements were performed to ensure statistical reliability, as illustrated in Fig. 11a. The results yielded a mean etch depth of 25.12 µm with a standard deviation of 0.19 µm, resulting in a coefficient of variation (CV) of 0.75%. This CV is significantly below the recommended threshold of 2%, indicating excellent process uniformity across the wafer. The low variability underscores the consistency of the DRIE process under the optimized recipe, providing a stable baseline for evaluating model performance. These measurements were compared against the VLSet-AE predictions, which achieved a mean etch depth prediction of 25.08 µm with a standard deviation of 0.17 µm, resulting in a model prediction error of 0.16% relative to the mean measured value, as shown in Fig. 11a. This close agreement highlights the model’s ability to accurately capture process outcomes while demonstrating that process variations contribute minimally to overall error, with the majority of discrepancies attributable to model prediction rather than fabrication inconsistencies.
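The uniformity check above reduces to computing a coefficient of variation over replicate measurements. A minimal sketch follows; the replicate depth values are hypothetical, not the paper's raw data.

```python
import statistics

def coefficient_of_variation(samples):
    """CV (%) = sample standard deviation / mean * 100, the uniformity
    metric used above (values below ~2% indicate a repeatable process)."""
    return 100.0 * statistics.stdev(samples) / statistics.fmean(samples)

# Hypothetical replicate etch depths (um) at one wafer location:
depths = [25.0, 25.3, 24.9, 25.2, 25.1, 25.4, 24.8, 25.2, 25.0, 25.3]
cv = coefficient_of_variation(depths)   # well under the 2% threshold
```

With ten replicates clustered around 25 µm, the CV lands below 1%, the same order as the 0.75% reported for the wafer-level measurements.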
Fig. 11.
Analysis of inherent errors in micro-nano manufacturing process. a Process Variability Across Wafer Positions on the Same Wafer (3 μm CD) and Trench Depth Comparison Across Positions (3 μm CD). b Cross-sectional etch-depth measurements at three representative locations (left edge, center, and right edge) on wafers from the same batch processed under identical DRIE conditions; the locations were cleaved to expose cross-sectional profiles, and the etch depth for a 3 µm linewidth was measured as a critical dimension. c Process Variability Across Wafer Positions (Scallop Width) and Scallop Width Comparison Across Positions. d The charging effect observed at sample edges in SEM images of DRIE-etched structures and Comparison of 5 Different Algorithms for Compensation Optimization
Systematic errors in SEM image acquisition were analyzed to address their impact on feature extraction accuracy. A notable challenge in SEM imaging of DRIE-etched structures is the charging effect observed at sample edges, as shown in Fig. 11d. This effect arises from changes in electron scattering behavior at geometric discontinuities, such as steep sidewalls or height differences in high-aspect-ratio trenches. Specifically, the increased escape efficiency of secondary electrons (SE) at edges results in a characteristic “white glow” in SEM images, which can distort edge detection and introduce measurement inaccuracies. This phenomenon is particularly pronounced in high-aspect-ratio structures, such as those produced by DRIE, and is an inherent limitation of manual SEM imaging. To quantify this effect, we evaluated the scale calibration error across the same three wafer locations (left, center, right) using a 3 µm linewidth as the reference. The scale calibration error was maintained below ±0.45%, well within the recommended threshold of ±0.5%, indicating high precision in SEM imaging despite the charging effect.
To mitigate the impact of charging-induced errors, we developed a compensation algorithm integrated into the VLSet-AE framework. Five distinct compensation algorithms were evaluated, including edge-aware filtering, contrast normalization, and adaptive thresholding techniques, to optimize edge detection in the presence of charging artifacts. The optimized algorithm achieved a measured linewidth of 118 ± 10 µm across the three locations, with a CV of 0.62%, demonstrating robust performance in correcting systematic imaging errors. Figure 11d illustrates the comparison of these algorithms, showing that the selected approach effectively suppresses the charging-induced “white glow” while preserving critical morphological features. The VLSet-AE model, incorporating this compensation, accurately extracted the 3 µm linewidth with a mean error of 0.38% compared to ground truth measurements, further validating its ability to handle systematic imaging errors. By comparing the compensated measurements against uncompensated manual annotations, we observed a 12% reduction in edge detection errors, confirming the efficacy of the proposed approach in enhancing measurement reliability.
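As one hedged illustration of the contrast-normalization family of compensations mentioned above (the paper's five algorithms are not reproduced here, and the function below is an assumption, not the deployed method), percentile clipping suppresses the bright charging tail while preserving the rest of the intensity range:

```python
import numpy as np

def suppress_edge_glow(image, clip_pct=99.0):
    """Percentile clipping plus rescaling: a simple stand-in for the
    contrast-normalization style of charging compensation. The charging
    "white glow" occupies the top intensity tail, so clipping at a high
    percentile flattens it while leaving the bulk morphology intact."""
    img = np.asarray(image, dtype=float)
    hi = np.percentile(img, clip_pct)
    lo = np.percentile(img, 100.0 - clip_pct)
    clipped = np.clip(img, lo, hi)
    return (clipped - lo) / (hi - lo + 1e-12)   # rescale to [0, 1]

# Synthetic frame: smooth background plus a saturated glow stripe at
# one sample edge (values are illustrative).
img = np.linspace(0.0, 1.0, 2500).reshape(50, 50)
img[0, :5] = 50.0
out = suppress_edge_glow(img)
```

After compensation the glow pixels no longer dominate the dynamic range, so edge detectors operating on the normalized frame are far less biased by the artifact.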
The combined analysis of process variations and SEM imaging errors enables a clear distinction between model prediction errors and intrinsic errors inherent to the fabrication and characterization processes. The low process variation (CV = 0.75%) indicates that the DRIE process is highly repeatable, contributing negligibly to the overall error budget. Similarly, the controlled scale calibration error (±0.45%) and the effective compensation of charging effects ensure that SEM imaging errors are minimized. By isolating these intrinsic errors, we can attribute the remaining discrepancies—such as the 3.65% average feature recognition error reported for VLSet-AE—to model-specific factors, such as latent space representation or training data variability. This distinction is critical for objective model evaluation, as it confirms that the VLSet-AE’s high accuracy (94.3% overall, 96% for contour recognition) is not significantly confounded by process or characterization variability.
D. Performance testing and evaluation of different algorithm models
This section evaluates the performance of the VLSet-AE model against seven advanced algorithms—CNN, LSTM, SVM, Random Forest, ResNet, GoogleNet, and AttentionNet—across multiple metrics to demonstrate its effectiveness in SEM image analysis for DRIE etched profiles. The evaluation encompasses training stability, prediction accuracy, error correlation, residual analysis, and computational efficiency.
To assess training stability, Fig. 12a compares the training, validation, and test loss curves over 500 epochs for the eight models under the same etching recipe. While all methods show decreasing loss trajectories, VLSet-AE converges more quickly and attains lower final loss values, suggesting stronger feature extraction and better generalization. CNN and ResNet also achieve relatively stable results but remain above VLSet-AE’s loss plateau. In contrast, methods like SVM and LSTM exhibit more pronounced fluctuations or slower convergence, highlighting their limitations when computing intricate etching critical dimensions.
Figure 12b displays a correlation analysis of automatic calculation values versus actual measurements for eight models. Each data point represents an individual measurement, with the horizontal axis indicating the actual value and the vertical axis showing the model’s predicted value. The ideal line (y = x) signifies perfect agreement between prediction and measurement. Overall, all models exhibit high linear correlation, as evidenced by data points clustering around the diagonal. Among them, VLSet-AE achieves the highest correlation coefficient (R = 0.998), indicating that its predictions align most closely with the actual measurements. Models such as CNN and LSTM also display relatively strong correlations (R = 0.995 and R = 0.991, respectively), though they still fall slightly short of VLSet-AE’s performance. Figure 12c shows the corresponding residual analysis, where each point’s vertical position represents the difference between the automatic calculation value and the actual measurement (Automatic calculation value—Actual), and the horizontal axis indicates the actual measurement. The horizontal line at zero denotes perfect alignment between the automatic calculation value and the actual value. Points scattered above this line suggest overestimation, whereas points below it reflect underestimation. As the plot illustrates, VLSet-AE’s residuals remain tightly clustered around zero across the entire measurement range, indicating more precise and consistent automatic calculations. While other models also exhibit reasonable distributions of residuals, they generally show broader dispersion or occasional bias, especially at higher or lower measurement values. Taken together, Figs. 12b and 12c reinforce that VLSet-AE not only correlates strongly with actual data but also maintains minimal calculation error, thereby demonstrating robust and reliable performance across diverse measurement ranges.
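The two diagnostics plotted in the correlation and residual analyses reduce to a Pearson correlation coefficient and a vector of prediction residuals. A minimal sketch follows; the numeric values are illustrative only, not the paper's measurements.

```python
import numpy as np

def correlation_and_residuals(predicted, actual):
    """Pearson correlation coefficient between automatic and manual
    values, and the residuals (automatic calculation value minus
    actual measurement), as plotted in the correlation/residual
    analyses."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    r = np.corrcoef(predicted, actual)[0, 1]
    return r, predicted - actual

# Illustrative values only (the nine critical dimensions span
# different scales, hence the mixed magnitudes):
actual = np.array([25.1, 12.4, 3.02, 88.6, 5.47])
predicted = np.array([25.0, 12.5, 3.05, 88.1, 5.50])
r, residuals = correlation_and_residuals(predicted, actual)
```

Residuals tightly clustered around zero together with r close to 1 correspond to points hugging the y = x line in Fig. 12b and the zero line in Fig. 12c.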
We compare the computational complexity and performance of eight models across multiple metrics—namely training time, inference time, memory usage, parameter count, and accuracy—using a radar chart for visualization. For the four metrics where lower values indicate better performance (training time, inference time, memory usage, and parameter count), we compute an efficiency score according to
$$\text{Efficiency score}_i = \frac{x_{\max} - x_i}{x_{\max} - x_{\min}} \quad (15)$$
where $x_i$ denotes the raw metric value of model $i$, and $x_{\min}$, $x_{\max}$ are the minimum and maximum of that metric across all models.
This normalization method transforms these metrics into efficiency scores, where higher values denote superior performance. For instance, if VLSet-AE achieves the shortest raw training time (e.g., 20 s), then its normalized value is 1, denoting the best training efficiency among all models. On the radar chart, a score of 1 appears closest to the outer boundary—visually the “longest” radius—yet in practical terms signifies the shortest training duration and thus superior performance.
By contrast, for the “accuracy” metric (where higher is better), we employ min–max normalization:
$$\text{Accuracy score}_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \quad (16)$$
Through this approach, all metrics are rescaled to lie between 0 and 1, with higher values corresponding to better overall performance. After normalization, each metric is plotted on the radar chart, providing a direct comparison of each model’s comprehensive performance; larger areas on the chart thus indicate higher efficiency and stronger predictive capabilities.
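The two normalizations described above can be sketched in a few lines. The helper names are illustrative, and the example reuses the training times from the ablation table together with assumed accuracy values merely to show the mapping; the sketch assumes not all raw values are equal.

```python
def efficiency_score(values):
    """Inverted min-max normalization for lower-is-better metrics
    (training time, inference time, memory usage, parameter count):
    the smallest raw value maps to 1, the largest to 0."""
    lo, hi = min(values), max(values)
    return [(hi - v) / (hi - lo) for v in values]

def accuracy_score(values):
    """Standard min-max normalization for the higher-is-better metric."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Training times (s) from the ablation table, as an example:
train_scores = efficiency_score([20.0, 35.4, 28.7])  # shortest time -> 1.0
acc_scores = accuracy_score([94.3, 85.6, 90.2])      # accuracies (%)
```

The shortest training time (20 s) thus receives an efficiency score of 1, appearing at the outer boundary of the radar chart, while the slowest model maps to 0.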
As illustrated in Fig. 12d, the proposed VLSet-AE model consistently exhibits superior performance across the majority of evaluated dimensions, including training efficiency, inference efficiency, parameter efficiency, and predictive accuracy. These results suggest that VLSet-AE not only accelerates the overall training and inference processes but also attains high predictive fidelity, rendering it highly suitable for scenarios where rapid model deployment and accurate predictions are paramount. It is noteworthy that while VLSet-AE does not achieve the single best memory-efficiency score among all evaluated models, its memory footprint remains within a competitive range. The slight increase in memory usage is arguably offset by the model’s substantial gains in other critical metrics, implying that the overall trade-off is decisively favorable. In particular, the synergy between low parameter counts, rapid convergence, and high accuracy underscores the model’s practical applicability in resource-limited or latency-sensitive environments.
Consequently, despite this minor limitation in memory consumption, VLSet-AE demonstrates a balanced and robust profile, outperforming conventional approaches such as CNN, LSTM, or more complex architectures like ResNet and AttentionNet in most respects.
E. Applicability and limitations of the VLSet-AE method
The proposed VLSet-AE model demonstrates robust performance in automated feature extraction for SEM image analysis in DRIE processes. Our experiments, conducted across 16 orthogonal sets of DRIE conditions (e.g., SF₆/C₄F₈ gas flow rates of 50–200 sccm, chamber pressures of 10–50 mTorr, and RF power of 500–2000 W), show that the method effectively captures etching profiles, including variations in scallop depth and sidewall inclination. The integration of Hamilton-Jacobi physics constraints ensures that the model aligns with the physical principles governing etch evolution, making it applicable to a wide range of standard DRIE processes commonly used in micro-nano fabrication.
Despite its robust performance, the VLSet-AE method has certain limitations that merit consideration. Its applicability is primarily validated within the tested parameter ranges, and performance may degrade under extreme DRIE conditions, such as ultra-high aspect ratio etching (>40:1) or non-standard plasma chemistries (e.g., chlorine-based etching), where the assumptions of the Hamilton-Jacobi equation may not fully apply. For instance, highly anisotropic etching profiles with significant notching or undercutting may challenge the model’s ability to accurately capture complex morphological features. Additionally, the method’s reliance on high-quality SEM images introduces potential vulnerabilities. Image noise, low contrast, or scale calibration errors (>±0.5%) can affect contour recognition accuracy, as the model depends on precise pixel-level information to initialize and evolve the level set function.
To address these limitations in future work, we plan to extend the validation of VLSet-AE to a broader range of DRIE conditions, including extreme aspect ratios, to enhance its generalizability. Incorporating advanced image preprocessing techniques, such as denoising algorithms or contrast enhancement, could improve robustness to variable SEM image quality. Additionally, model optimization strategies, such as parameter pruning or lightweight network architectures, will be explored to reduce computational demands, enabling real-time applications in resource-limited environments. Furthermore, we aim to extract a wider array of process parameters, including dual-end data from both the process side (e.g., plasma power, gas flow rates) and the wafer side (e.g., local temperature variations, surface roughness), to construct a large-scale database. This comprehensive dataset will enable more robust correlations between etching outcomes and process conditions, further improving the model’s precision and supporting advanced AI-driven optimization for scalable microfabrication. These improvements will strengthen the applicability of VLSet-AE, paving the way for next-generation MEMS technologies.
Conclusion
In conclusion, we propose a physics-constrained variational level set autoencoder (VLSet-AE) that transforms automated contour recognition and feature extraction in scanning electron microscopy (SEM) cross-sectional profiles of deep reactive ion etching (DRIE). By integrating layer-wise scallop segmentation and embedding physical etching constraints via the Hamilton-Jacobi equation, VLSet-AE achieves precise reconstruction of etched profiles. A comprehensive regularization framework, including KL divergence, physics-constrained loss, dropout (p = 0.3), L2 weight decay (λ = 0.001), and data augmentation (random rotations, intensity variations, and horizontal flips), ensures robustness despite the model’s 4 million parameters and a dataset of 1000 SEM images. The use of 5-fold cross-validation, early stopping (patience of 20 epochs), and a balanced dataset split (80% training, 10% validation, 10% test) further enhances generalization, as evidenced by stable loss trajectories (test loss: 0.012 ± 0.002), low prediction errors (3.65% ± 0.82%), and a high correlation coefficient (0.998 ± 0.001). Sensitivity analyses on reduced datasets (512 and 256 images) confirm robust performance (errors of 4.12% ± 0.95% and 4.87% ± 1.10%, respectively). Assembled from reconstructed scallop segments, complete profiles enable accurate quantification of nine critical dimensions: scallop depth (2.29% error), scallop width (peak-to-peak: 2.05%; valley-to-valley: 6.28%), scallop radius (4.69%), profile angle (0.56%), trench depth (5.46%), bow width (4.35%), mid width (2.43%), and bottom width (4.78%). Compared to seven state-of-the-art models, VLSet-AE achieves the highest accuracy (96% ± 1.2%), shortest training time (20 s), and fastest inference time (1.2 s), with a competitive memory footprint (50 MB) and parameter count (4.0 million).
These attributes underscore its computational efficiency and robustness, facilitating real-time process monitoring, advanced three-dimensional morphology simulations, and scalable data acquisition for AI-driven DRIE optimization. By overcoming the limitations of traditional SEM analysis, VLSet-AE establishes a transformative paradigm for intelligent, high-precision microfabrication, heralding a new era of AI-driven manufacturing for next-generation microelectromechanical systems.
Acknowledgements
This work is supported by the National Key R&D Plan Project (2023YFB3207900).
Competing interests
The authors declare no competing interests.
Contributor Information
Yi Sun, Email: sunyi@mail.sim.ac.cn.
Heng Yang, Email: h.yang@mail.sim.ac.cn.
Xinxin Li, Email: xxli@mail.sim.ac.cn.
Supplementary information
The online version contains supplementary material available at 10.1038/s41378-025-01105-z.