Abstract
This study introduces a deep learning-based automatic evaluation method for analyzing the microstructure of steel in scanning electron microscopy (SEM) images, aiming to overcome the limitations of manual marking and subjective assessment by researchers. By leveraging an advanced computer vision algorithm, a rotated object detection model suited to long-range dendritic solidification structures named Tang Rui Detect (TRD), the method achieves efficient and accurate detection and quantification of microstructural features. This approach not only accelerates the training process but also simplifies loss function design, ultimately enabling a reliable evaluation of surface modifications in steel materials. The results demonstrate the method's potential for automating and improving the reliability of microstructural analysis in materials science.
Subject terms: Materials science, Computer science
Introduction
Microstructural characterization of an alloy reveals a complex mixture of phases and defects that can be used to evaluate the material's performance in a variety of research tasks 1. Extracting quantitative and meaningful statistical data from microstructural analysis is a critical step toward finding quantitative descriptors that represent the microstructural features of a given material, which remains an active research area in the field 2. Accurate quantitative analysis demands a precise and reproducible method to detect and mark the target objects of interest. One of the most commonly used and efficient tools for characterizing surface modification of steel materials is image analysis with scanning electron microscopy (SEM). SEM is an easier and faster technique for extracting microstructural information from steel than alternatives such as EBSD 3. SEM imaging with a fine electron beam captures the surface morphology and composition of samples 4. For instance, the microstructures this paper detects and quantifies can be used to evaluate the performance of a Y2O3 coating on stainless steel. Previously, however, the microstructure had to be marked manually, and its quantification could only be assessed subjectively, relying heavily on the judgment of experienced researchers 5.
Deep learning-based computer vision methods have been widely used in real-world image processing in recent years, and some detection algorithms have even surpassed human accuracy 6. Consequently, characterization imaging combined with modern computer vision algorithms has attracted attention in the field of material characterization, for example image denoising and enhancement to preprocess the representation image, and segmentation and classification to detect and separate each microstructure from the characterization image 4. Kazumasa Tsutsui et al. used machine learning algorithms to recognize low-carbon steel microstructures, achieving pleasing accuracies for several microstructure classes 3. Furthermore, material researchers have conducted many experiments in recent years with a variety of deep learning-based methods 2, most of which have shown the superiority of deep learning over classical computer vision techniques. For example, Muñoz-Rodenas et al. tackled an image-classification problem on low-carbon steel microstructures with both classical and deep learning models; the deep learning models GoogLeNet and ResNet50 achieved about 99% accuracy, leaving the utility of classical machine learning in a microstructure imaging context in question 7. The combination of deep learning-based computer vision and SEM image analysis has already made progress on classification tasks 8; in this paper, we build on it to create a standardized and automated method for extracting quantitative and meaningful statistical data from microstructural analysis.
The quantitative descriptor in our work is the target area occupation rate, so the size of a detection model's bounding box strongly influences the conclusions drawn. We therefore chose a rotated object detection model for our task. Rotated target marking methods have gone through several generations. The first idea for expressing a rotated target is simply to add an angle θ to the axis-aligned expression, giving (xc, yc, w, h, θ), where (xc, yc) is the coordinate of the center point, (w, h) are the width and height of the bounding box, and θ is the angle between the long edge and the horizontal. However, if we exchange the values of w and h and simultaneously add 2kπ + π/2 to θ, the new expression represents the same bounding box as the original. Because countless expressions can represent one bounding box, this phenomenon complicates loss function design and regression calculation, wasting considerable training time, a significant fault every researcher wants to avoid 9–14. Another common way to express a rotated bounding box is by the coordinates of its four vertices. The public dataset DOTA, created by Xia, G.-S. et al. 15, uses this representation for its targets. The issue here is that a bounding box can be written in at least four coordinate orders with different starting points, which can produce abnormal loss values and harm model training. We adopted a recent bounding box representation called the Tang Rui Detect model (TRD) 16, which achieves both faster training and simpler loss function design.
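The (w, h, θ) ambiguity described above is easy to check numerically: swapping w and h while rotating θ by π/2 reproduces exactly the same rectangle. A minimal sketch (the corner construction and example numbers are ours, for illustration only):

```python
import math

def corners(xc, yc, w, h, theta):
    """Return the sorted corner set of a rotated box (xc, yc, w, h, theta)."""
    ux, uy = math.cos(theta), math.sin(theta)      # unit vector along the w edge
    vx, vy = -math.sin(theta), math.cos(theta)     # unit vector along the h edge
    pts = []
    for sw in (-0.5, 0.5):
        for sh in (-0.5, 0.5):
            pts.append((round(xc + sw * w * ux + sh * h * vx, 6),
                        round(yc + sw * w * uy + sh * h * vy, 6)))
    return sorted(pts)

# Two different parameter tuples describe the identical rectangle:
a = corners(10, 20, 8, 4, 0.3)
b = corners(10, 20, 4, 8, 0.3 + math.pi / 2)
assert a == b  # same corner set -> the (xc, yc, w, h, theta) encoding is ambiguous
```

This is precisely the many-to-one mapping that inflates loss values during training and that the TRD representation is designed to remove.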
Materials and methods
The 14CrSiMnV steel powder (GL PTA Inc., Wuhan, China), containing 14.5 wt.% Cr, 1.2 wt.% Si, 0.5 wt.% Mn, 0.5 wt.% V, 0.3 wt.% C, 0.05 wt.% O, and Fe (bal.), was dried at 80 ℃ for 2 h in a drying chamber to remove absorbed water. Y2O3 nanoparticles, 30–50 nm in diameter, were synthesized by reverse precipitation from Y(NO3)3 and NH4HCO3. The 14CrSiMnV steel powder and Y2O3 nanoparticles were mixed in a planetary grinding machine at low revolution speed (≤ 200 rpm) for 40 min to prepare coating specimens for plasma-transferred arc alloy coating with five different amounts of Y2O3 nanoparticles: 0 wt.%, 0.2 wt.%, 0.4 wt.%, 0.6 wt.%, and 0.8 wt.%, labeled Y0, Y02, Y04, Y06, and Y08, respectively. A 200 × 100 × 10 mm3 Q235 steel plate was adopted as the substrate, with composition ≤ 0.17 wt.% C, ≤ 0.35 wt.% Si, ≤ 0.14 wt.% Mn, ≤ 0.035 wt.% P, ≤ 0.035 wt.% S, and Fe (bal.).
The cross-sectional morphology of the coating samples was characterized with an optical microscope (OM, Zeiss Imager A1m). The microstructure and composition of the Y2O3-coated 14CrSiMnV alloy steel were characterized with a field emission scanning electron microscope (FE-SEM, Zeiss Sigma500) equipped with an energy dispersive spectrometer (EDS). Figure 1 shows the crystallographic morphology along the cross-section of the coating specimens. Long-range dendritic solidification microstructures growing along certain angles can be observed in the unmodified coating; we marked and labeled them manually.
Fig. 1.
Marked microstructure targets on SEM images. a and c are ground truth, manually marked SEM images; b and d are SEM images marked automatically by our detection model. b is a positive result; d is a faulty result, showing that our model has limitations on low-magnification images. Since low-magnification images are rare (about 1 in 20), the influence of such faulty detections is limited.
Results and discussion
Neural network architecture
Sizes of metallographic phases vary from image to image with different magnification ratios. To better detect both large and small targets, a Feature Pyramid Network (FPN) 17 was combined with the detection backbone, which in our case is ResNet50 18.
The FPN applied in our work comprises three parts. The bottom-up part is the ordinary forward pass: the input image enters the backbone convolutional network, which extracts features stage by stage. We use the feature map at the end of each stage, where the features are strongest. Specifically, for our ResNet backbone, we take the outputs of the last three stages to build the feature pyramid.
The top-down part is mostly an upsampling process, during which information from the top-layer feature maps propagates to lower-layer feature maps, enriching their semantics. Technically, we used nearest-neighbor upsampling to enlarge the feature maps. It is one of the simplest interpolation methods: each newly generated pixel block is filled with its nearest pixel value, without further computation. Although this method introduces jagged artifacts into the generated feature maps, that shortcoming does not affect the detection of microstructures.
The third part of the FPN is the lateral connection. First, a 1×1 convolutional layer reduces the channel dimension of each selected stage's output so that it matches the corresponding map from the top-down pathway. These feature maps are then added to the corresponding upsampled maps. Finally, a 3×3 convolution appended to each merged map generates the final feature maps and reduces the aliasing effect of upsampling. Since the output features of every stage feed the same set of classifiers, this 3×3 convolution block has the same number of channels as the output features, which is 256 in our network.
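The three parts described above (1×1 lateral convolutions, nearest-neighbor top-down upsampling, 3×3 smoothing into 256 channels) can be sketched in PyTorch roughly as follows; the stage channel counts are assumed from a standard ResNet50, and the class name is ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN over three backbone stages (channel sizes are illustrative)."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions unify the channel dimensions
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions smooth the merged maps (reduce upsampling aliasing)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        c3, c4, c5 = feats  # fine-to-coarse stage outputs; c3 has the highest resolution
        p5 = self.lateral[2](c5)
        # top-down pathway: nearest-neighbour upsample, then add the lateral map
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return [self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)]

# Dummy maps shaped like ResNet50 stage outputs for a 256x256 input
c3 = torch.randn(1, 512, 32, 32)
c4 = torch.randn(1, 1024, 16, 16)
c5 = torch.randn(1, 2048, 8, 8)
p3, p4, p5 = SimpleFPN()((c3, c4, c5))
assert p3.shape == (1, 256, 32, 32) and p5.shape == (1, 256, 8, 8)
```

Every pyramid level ends up with the same 256 channels, so one shared detection head can run on all levels.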
As mentioned above, the backbone of our neural network is ResNet50. We ran experiments with ResNet variants from ResNet18 to ResNet152, which showed that ResNet50 produces the most suitable outputs among those versions. It is well known that the deeper a CNN is, the more information it can capture, but a plain deeper network yields lower accuracy on both training and test data because it can suffer from exploding or vanishing gradients. ResNet was designed to deal with these issues: its residual structure lets us exploit the information of a deeper network with less accuracy loss. The structure of ResNet50 and its advantages are illustrated in Fig. 2.
Fig. 2.
The backbone structure applied in our work. Boxes with a white background are tensors, with channels and height marked in parentheses; boxes with a colored background are sets of convolution or ReLU structures.
There are five progressive layers, called "stages", between the input and output of ResNet50. They can be described in terms of three building blocks. The first is a convolution core with parameters marking its shape, written CONV (n*n, C, S) and called a CONV block: n*n is the shape of the convolution kernel, C is the number of kernels, and S is the stride; the block also includes basic functions such as batch normalization and ReLU activation. The second building block is the bottleneck, composed mainly of CONV blocks. There are two kinds of bottleneck in our network, written BTNK1 (C, W, C1, S) and BTNK2 (C, W). The difference is that BTNK1 changes the channel number between input and output, while BTNK2 keeps it the same. Thus BTNK1 has four parameters, C, W, C1, and S, while BTNK2 needs only C and W; corresponding parameters have the same meaning. C and W describe the input tensor shape, C1 is the output channel number expected by the next stage, and S is the stride, which determines whether the input is downsampled. The third building block is the stage itself, a combination of BTNK1, BTNK2, and CONV blocks; the specific structure is shown in Fig. 2 above.
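A bottleneck of the kind just described (BTNK1-like with a projection shortcut when the shape changes, BTNK2-like with an identity shortcut otherwise) might look like this in PyTorch; the exact layer parameters are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, plus a shortcut.
    Roughly BTNK1 when in/out shapes differ (projection shortcut with stride S),
    and BTNK2 when they match (identity shortcut)."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        mid = c_out // 4
        self.body = nn.Sequential(
            nn.Conv2d(c_in, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out),
        )
        # projection shortcut only when the shape changes; identity otherwise
        self.shortcut = (nn.Identity() if c_in == c_out and stride == 1 else
                         nn.Sequential(nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(c_out)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # the residual addition lets gradients bypass the body, easing deep training
        return self.relu(self.body(x) + self.shortcut(x))

x = torch.randn(1, 256, 56, 56)
y = Bottleneck(256, 512, stride=2)(x)   # BTNK1-like: channels and spatial size change
assert y.shape == (1, 512, 28, 28)
```

The residual addition in `forward` is exactly why ResNet avoids the vanishing-gradient problem mentioned above: gradients can flow through the shortcut even when the body saturates.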
The expression of our rotated target marking method is (xc, yc, |u|, |v|, s, ρ), shown in Fig. 3, where (xc, yc) is the center point of the bounding box, i.e. the coordinate of C; (u, v) are the components of the vector CD along the X and Y axes; |u| and |v| are the absolute values of u and v; and s is an auxiliary value indicating whether u and v have the same sign.
Fig. 3.
Bounding box design. The black solid line is the rotated bounding box, X is the line-direction axis, Y is the column-direction axis, C is the center point of the box, D and E are vertices of the box, and P is the projection point of E on CD.
It is obvious that (u, v) and (−u, −v) in this expression represent the same bounding box, so it is whether u and v have the same sign that makes the difference. Let s = 1 denote that u and v have the same sign, and s = 0 the opposite situation. The position of P on the vector CD is represented by ρ, with ρ = |CP| / |CD|. Hence, with the auxiliary value s, our expression represents a unique bounding box.
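The role of the sign flag s can be illustrated in a few lines of Python (the helper `encode_uv` is ours, not from the paper): (u, v) and (−u, −v) collapse to the same (|u|, |v|, s) triple, while flipping only one sign flips s.

```python
def encode_uv(u, v):
    """Encode the vector (u, v) as (|u|, |v|, s), with s = 1 when u and v share a sign."""
    s = 1 if u * v >= 0 else 0
    return abs(u), abs(v), s

# (u, v) and (-u, -v) describe the same box, and indeed encode identically:
assert encode_uv(3.0, -2.0) == encode_uv(-3.0, 2.0) == (3.0, 2.0, 0)
# ...while a genuinely different orientation flips s:
assert encode_uv(3.0, 2.0) == (3.0, 2.0, 1)
```

This one-to-one mapping (up to the harmless vector negation) is what removes the periodic-angle ambiguity of the (w, h, θ) representation.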
Loss function design
Predicting bounding box with feature map.
When predicting a bounding box from the feature map, the losses of xc, yc, |u|, |v|, and ρ are calculated with the Smooth L1 regression loss, which measures the difference between values directly. The loss of s is handled as a classification: the model computes two scores with cross entropy, representing the probabilities that the vector components do or do not share a sign, and the higher one decides the value of s. When a bounding box sits in a position like Fig. 4, the solid-line box and the dashed-line box represent the same target with different values of s. Hence, the loss of s cannot represent the real difference between bounding boxes in this circumstance, and the situation worsens as the angle between the vector CD and the X axis gets smaller. We applied a balance weight to deal with this mismatch between position and value. First, we measure the angle between the vector CD and the X axis through |u| and |v| with the function below:
diff = ||u| − |v|| / max(|u|, |v|)    (1)
Fig. 4.

The special situation in which two similar bounding boxes have different values of s.
The range of diff is [0, 1]: it equals 0 when |u| and |v| are equal, and 1 when one of |u| and |v| is 0. The weight value w is then obtained by passing diff through a sigmoid-based mapping, so that the larger the difference between |u| and |v| (i.e., the closer diff gets to 1), the smaller the weight of the s loss. Finally, when calculating the difference between the predicted box and the real box, multiplying the loss of s by w represents the real difference more faithfully.
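A sketch of this balance weight, assuming the reconstructed form of Eq. (1); the exact sigmoid mapping used in the paper is not fully recoverable, so `s_loss_weight` below is an assumption that only preserves the stated behaviour (weight 1 when |u| = |v|, shrinking as diff approaches 1):

```python
import math

def diff(u_abs, v_abs):
    """Angle proxy from Eq. (1): 0 when |u| == |v|, 1 when one of them is 0."""
    return abs(u_abs - v_abs) / max(u_abs, v_abs)

def s_loss_weight(u_abs, v_abs):
    """Balance weight for the s loss (assumed mapping: 1 - sigmoid, rescaled).
    Equals 1.0 at diff = 0 and decreases monotonically as diff -> 1."""
    d = diff(u_abs, v_abs)
    return 2.0 * (1.0 - 1.0 / (1.0 + math.exp(-d)))

assert diff(5.0, 5.0) == 0.0 and diff(5.0, 0.0) == 1.0
assert s_loss_weight(5.0, 5.0) > s_loss_weight(5.0, 0.5)  # near-axis box gets less s weight
```

The effect is that near-axis-aligned boxes, where the two s values describe almost identical geometry, contribute little s loss to training.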
Regression parameter from predicted anchor box to target box by feature map.
In this case, the value of s can be assigned according to the anchor box, so s is fixed. The regression parameters from an anchor box to a target box are then defined by the formulas below.
tx = (xt − xa) / |ua|
ty = (yt − ya) / |va|
tu = ln(|ut| / |ua|)
tv = ln(|vt| / |va|)
tρ = ρt − ρa    (2)
where xt, yt, |ut|, |vt|, ρt are the values of the target bounding box; xa, ya, |ua|, |va|, ρa are the values of the anchor box; and tx, ty, tu, tv, tρ are the regression parameters that transform an anchor box into the corresponding target bounding box.
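Under this parameterization, the anchor-to-target encoding and its inverse can be sketched as follows; the formulas mirror standard box-regression practice and are an assumption where the original equations were lost in extraction:

```python
import math

def encode(anchor, target):
    """Regression parameters from anchor to target box: offsets normalised by the
    anchor's |u|, |v|; log-ratios for the sizes; plain difference for rho."""
    xa, ya, ua, va, ra = anchor
    xt, yt, ut, vt, rt = target
    return ((xt - xa) / ua, (yt - ya) / va,
            math.log(ut / ua), math.log(vt / va), rt - ra)

def decode(anchor, t):
    """Inverse transform: recover the target box from anchor + parameters."""
    xa, ya, ua, va, ra = anchor
    tx, ty, tu, tv, tr = t
    return (xa + tx * ua, ya + ty * va, ua * math.exp(tu), va * math.exp(tv), ra + tr)

anchor = (100.0, 100.0, 20.0, 10.0, 0.5)
target = (110.0, 95.0, 30.0, 8.0, 0.6)
decoded = decode(anchor, encode(anchor, target))
assert all(abs(a - b) < 1e-9 for a, b in zip(decoded, target))  # exact round trip
```

The log-ratio for the size terms keeps the loss scale-invariant, the usual reason for this choice in anchor-based detectors.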
Data set preparation
Our experiment uses 228 Y2O3-coating SEM images of 2580 × 1944 pixels, shown in Fig. 5. The data preparation module automatically segments them into 1024 × 1024 pixel tiles with an overlap of 200 pixels. The resulting SEM image dataset, together with the label files, was divided randomly into training, validation, and test sets at a ratio of 3:1:1, giving 1642 images for training the detection model, 547 images for validation, and 548 images for testing.
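One plausible sliding-window scheme matching these numbers (1024-pixel tiles, 200-pixel overlap); the rule of clamping the last tile to the image border is our assumption about the data preparation module:

```python
def tile_starts(length, tile=1024, overlap=200):
    """Start offsets for sliding-window tiles along one axis."""
    stride = tile - overlap                  # 824 px step between tile origins
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)         # final tile flush with the border
    return starts

def tiles(width=2580, height=1944, tile=1024, overlap=200):
    """Top-left corners of all tiles for one SEM image."""
    return [(x, y) for y in tile_starts(height, tile, overlap)
                   for x in tile_starts(width, tile, overlap)]

# A 2580x1944 SEM image yields a 3x3 grid of 1024x1024 tiles under this scheme
assert len(tiles()) == 9
```

Note that the clamped edge tiles overlap their neighbours by more than 200 px; any target split across tile borders is still fully contained in at least one tile thanks to the overlap.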
Fig. 5.
SEM data set exhibition. Part of our raw SEM images.
Experiment setup
Our model, built on a ResNet50 baseline, is implemented in Python with PyTorch on a Linux system and runs on a GPU, in our case an Nvidia RTX 4090; the CUDA version in our experiments is 11.8. The network was trained with the following parameters: learning rate 0.0001, 12 epochs, batch size 2.
The comparison group was pre-trained on the DOTA dataset on the same hardware. The labels of the SEM dataset were converted to DOTA-v1.0 style to meet the requirements of the comparison experiments. We applied nine different deep learning-based rotated target detection networks as the comparison group; they share the same backbone but use different detectors and bounding box representations. The success of these experiments demonstrates the utility of deep learning-based detection methods for the material microstructure detection task.
Discussion
Our evaluation has two parts. First, the performance of the proposed detector is compared with the comparison group. The results of the experiments on the DOTA and SEM datasets are shown in Table 1. We primarily evaluate detection mean Average Precision (mAP), as it is the standard metric for object detection 19. Second, we propose a quantitative descriptor to describe the performance of the Y2O3 coating, which is verified against the conclusions drawn from expert experience.
Table 1.
Applied models and detected result in DOTA and SEM dataset.
| Models | Time/4750 pics | mAP (DOTA) | mAP (SEM) |
|---|---|---|---|
| Rotated_faster_rcnn20 | 940 s | 83.55 | 51.93 |
| Rotated_retinanet_hbb21 | 1005 s | 78.16 | 33.87 |
| Rotated_retinanet_obb21 | 975 s | 77.50 | 20.23 |
| Gwd22 | 959 s | 79.18 | 35.50 |
| Kfiou23 | 989 s | 85.62 | 62.14 |
| Gliding_vertex24 | 984 s | 83.08 | 32.27 |
| Rotated_atss25 | 964 s | 80.53 | 32.42 |
| S2anet26 | 990 s | 83.32 | 44.18 |
| Rotated_fcos27 | 922 s | 82.52 | 0.03 |
| Our work | 930 s | 87.52 | 78.91 |
In our study, we aimed to automatically evaluate the influence of Y2O3 nanoparticles on the microstructure of 14CrSiMnV coatings fabricated by the plasma-transferred arc cladding process, using a deep learning-based target detection method. From the results of the detection model, the main analysis data of the dendritic grain zones in the SEM images can be calculated (Fig. 6). We define two mean occupation rates (MOR) that represent the performance of the Y2O3 coating, expounded by the formulas below:
MOR of the dendritic grain zone:
MOR_X = (1 / N_X) Σ(i = 1 to N_X) MOR_i    (3)
MOR of the dendritic grain zone on a single image:
MOR_i = Σ(j = 1 to n) r_j    (4)
where MOR_X represents the occupation rate of the dendritic grain zone for a given amount of Y2O3 coating, with X being 0, 02, 04, 06, or 08; N_X is the number of SEM images for that amount of Y2O3; r_j represents the occupation rate of one bounding box on a single image; and n is the number of bounding boxes on that image.
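The two MOR formulas translate directly into code; the per-box occupation rates below are hypothetical numbers for illustration, not measured data:

```python
def mor_single_image(box_rates):
    """MOR on a single image: sum of the per-box occupation rates (Eq. 4)."""
    return sum(box_rates)

def mor_specimen(per_image_rates):
    """MOR of a specimen: mean of the per-image MOR values over its N images (Eq. 3)."""
    return sum(per_image_rates) / len(per_image_rates)

# Hypothetical per-box occupation rates for three SEM images of one specimen
images = [[0.10, 0.05], [0.08], [0.12, 0.03, 0.02]]
per_image = [mor_single_image(b) for b in images]
assert abs(mor_specimen(per_image) - 0.4 / 3) < 1e-9
```

In practice, each per-box rate would come from a detected bounding box's area divided by the image area, so the whole descriptor is computed from the detector output alone.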
The results show that the dendritic grain zone has the lowest MOR when the Y2O3 nanoparticle content is 0.4 wt.%, and its average size on a single image is also the smallest (Fig. 6). This means that long-range dendritic solidification structures are reduced and equiaxed grains are enlarged in the specimen modified with this amount of Y2O3 coating, which corresponds perfectly with the judgment of human experts 5.
Fig. 6.
Data analyzing. Occupation rate was calculated with detected targets data.
As discussed above, with deep learning methods applied to SEM images, the performance of a new material can be evaluated extremely fast and more objectively, with a proper degree of confidence.
Conclusion
Our work demonstrates the feasibility of effective steel microstructure target detection by a deep learning-based computer vision method. We applied the Tang Rui rotated bounding box to mark critical microstructures in SEM images, achieving the highest accuracy with a top-3 evaluation speed among recent rotated target detection models. We created a quantitative descriptor, MOR, to represent the microstructural features of the Y2O3 steel coating, and demonstrated its effectiveness for evaluating the surface performance of the coating from SEM images.
With this rapid and accurate rotated target detection model, a surface-performance evaluation from an SEM image can be produced within 1 s. For a generative model, such a rapid evaluation method can accelerate its feedback loop. Furthermore, the combination of generation, detection, and evaluation models opens a path toward high-throughput material design.
Acknowledgements
This work was supported by the Jilin Provincial Science and Technology Innovation Center of Network Database Application Software (Grant No. YDZJ202302CXJD027), and Jilin Provincial Science and Technology Development Planned Project (Grant No.YDZJ202401621ZYTS). The authors are grateful to Baotou Research Institute of Rare Earths for the help with the raw SEM images in this paper.
Author contributions
Jianhong Zhao wrote the first draft of the manuscript and conducted the experiments. Yi Sui contributed to concepts, ideas as well as design and analysis of the experiments. All experiments and researching activities are under Huamin Yang’s guidance. All authors reviewed the manuscript – including edits and rewrites of passages.
Data availability
The SEM image datasets are available from the authors upon request.
Code availability
The relevant code in our analysis process is available at https://github.com/cancerhusky/TRD.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-024-85061-0.
References
- 1.Kamble, A., He, S., Howse, J. R., Ward, C. & Hamerton, I. Exploiting the use of deep learning techniques to identify phase separation in self-assembled microstructures with localized graphene domains in epoxy blends. Comput. Mater. Sci. 229, 112374 (2023).
- 2.Alrfou, K., Zhao, T. & Kordijazi, A. Deep learning methods for microstructural image analysis: The state-of-the-art and future perspectives. Integr. Mater. Manuf. Innov. 13, 703–731 (2024).
- 3.Tsutsui, K. et al. A methodology of steel microstructure recognition using SEM images by machine learning based on textural analysis. Mater. Today Commun. 25, 101514 (2020).
- 4.Cheng, D. et al. Computer vision analysis on material characterization images. Adv. Intell. Syst. 4, 2100158 (2022).
- 5.Yue, J. et al. Influence of Y nano-oxide and its secondary phase on microstructure, mechanical properties, and wear behavior of the stainless steel coatings fabricated by plasma transfer arc. Metals 12, 942 (2022).
- 6.Ahmed, S. F. et al. Deep learning modelling techniques: current progress, applications, advantages, and challenges. Artif. Intell. Rev. 56, 13521–13617 (2023).
- 7.Muñoz-Rodenas, J., García-Sevilla, F., Coello-Sobrino, J., Martínez-Martínez, A. & Miguel-Eguía, V. Effectiveness of machine-learning and deep-learning strategies for the classification of heat treatments applied to low-carbon steels based on microstructural analysis. Appl. Sci. 13, 3479 (2023).
- 8.Ge, M., Su, F., Zhao, Z. & Su, D. Deep learning analysis on microscopic imaging in materials science. Mater. Today Nano 11, 100087 (2020).
- 9.Li, C. et al. Learning object-wise semantic representation for detection in remote sensing imagery. In CVPR Workshops (2019).
- 10.Zhu, H. et al. Orientation robust object detection in aerial images using deep convolutional neural network. In 2015 IEEE International Conference on Image Processing (ICIP) 3735–3739 (2015).
- 11.Zhou, X. et al. EAST: An efficient and accurate scene text detector. Preprint at 10.48550/arXiv.1704.03155 (2017).
- 12.Jiang, Y. et al. R2CNN: Rotational region CNN for orientation robust scene text detection. Preprint at 10.48550/arXiv.1706.09579 (2017).
- 13.Tang, R., Song, W., Guan, X., Ge, H. & Kong, D. Dam Burst: A region-merging-based image segmentation method. Preprint at 10.48550/arXiv.2003.04797 (2020).
- 14.Yang, X. et al. SCRDet: Towards more robust detection for small, cluttered and rotated objects. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 8231–8240 (2019).
- 15.Xia, G.-S. et al. DOTA: A large-scale dataset for object detection in aerial images. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 3974–3983 (2018).
- 16.Wenlong, S., Rui, T., Kun, Y. & Hongjie, L. An annotation method for tilted rectangular range boxes and its application analysis in remote sensing object detection. J. China Inst. Water Resour. Hydropower Res. 19, 165–172 (2021).
- 17.Lin, T.-Y. et al. Feature pyramid networks for object detection. Preprint at 10.48550/arXiv.1612.03144 (2017).
- 18.Shafiq, M. & Gu, Z. Deep residual learning for image recognition: A survey. Appl. Sci. 12, 8972 (2022).
- 19.Padilla, R., Netto, S. L. & da Silva, E. A. B. A survey on performance metrics for object-detection algorithms. In 2020 International Conference on Systems, Signals and Image Processing (IWSSIP) 237–242 (2020).
- 20.Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2017).
- 21.Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42, 318–327 (2020).
- 22.Yang, X. et al. Rethinking rotated object detection with Gaussian Wasserstein distance loss. Preprint at 10.48550/arXiv.2101.11952 (2022).
- 23.Yang, X. et al. The KFIoU loss for rotated object detection. Preprint at 10.48550/arXiv.2201.12558 (2023).
- 24.Xu, Y. et al. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1452–1459 (2021).
- 25.Zhang, S., Chi, C., Yao, Y., Lei, Z. & Li, S. Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. Preprint at 10.48550/arXiv.1912.02424 (2020).
- 26.Han, J., Ding, J., Li, J. & Xia, G.-S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 60, 5602511 (2022).
- 27.Tian, Z., Shen, C., Chen, H. & He, T. FCOS: Fully convolutional one-stage object detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019).