Abstract
Donor-based qubits in silicon, manufactured using scanning tunneling microscope (STM) lithography, provide a promising route to realizing full-scale quantum computing architectures. This is due to the precision of donor placement, long coherence times, and scalability of the silicon material platform. The properties of multiatom quantum dot qubits, however, depend on the exact number and location of the donor atoms within the quantum dots. In this work, we develop machine learning techniques that allow accurate and real-time prediction of the donor number at the qubit site during STM patterning. Machine learning image recognition is used to determine the probability distribution of donor numbers at the qubit site directly from STM images during device manufacturing. Model accuracies in excess of 90% are consistently achieved by mitigating overfitting through reduced model complexity, image preprocessing, data augmentation, and examination of the intermediate layers of the convolutional neural networks. The results presented in this paper constitute an important milestone in automating the manufacture of atom-based qubits for computation and sensing applications.
Keywords: machine learning, silicon, phosphorus, STM lithography, quantum dots
Introduction
Single-spin qubits based on phosphorus donors in silicon are a promising platform for building a large-scale quantum processor with high-fidelity single-qubit gates1 and fast electron two-qubit gates.2 There are currently two approaches for fabricating nanoelectronic devices using phosphorus-doped silicon (Si:P). One is ion implantation3 and the other is scanning tunneling microscopy (STM) lithography based on a single atomic layer of hydrogen resist.4,5 The ion approach uses low-energy phosphorus implantation followed by complementary metal-oxide-semiconductor (CMOS) fabrication for readout and control structures near the silicon-oxide surface.6 The STM method, on the other hand, achieves atomic precision manufacturing of the donors by using an STM tip to selectively remove atomic hydrogen from an atomic-scale mask formed by a hydrogen-terminated silicon surface.7 The entire silicon surface is then dosed with phosphine (PH3) gas, which only adsorbs to the bare silicon surface where the hydrogen layer has been removed. The phosphorus donors are subsequently incorporated into the silicon crystal via a thermal anneal and overgrown with epitaxial silicon.8
With the demonstration of high-fidelity single1 and two-qubit gates2 in atom qubits in silicon, the basic building blocks for a universal quantum computer are now in place. Future work is focused on scalability and improved manufacturing performance.5,10−12 Several quantum computer architectures based on phosphorus atoms in silicon have been proposed for both ion-implanted11 and precision-placed STM donors.10 Both of these approaches require large homogeneous two-dimensional (2D) arrays of phosphorus donors. The homogeneity requirements of these proposed architectures necessitate strict control (≤1 nm) over the donor placement to allow controllable interqubit couplings.2 The STM approach, however, can be used to create multidonor quantum dots (that is, more than a single P donor) to confine a single electron. This allows better control over the coupling strength,13 enhanced addressability,14 and access to densely packed nuclear spin qubits,9 thus significantly relaxing manufacturing tolerances. A multidonor system does bring some complications. The incorporation pathway of a single donor is fairly straightforward.15 In a multidonor site, however, there are several reaction pathways for the phosphorus atoms to incorporate from the phosphine precursor gas, meaning the number of donors embedded at any given site can be hard to predict.16 To design and fabricate a quantum computer with hundreds or thousands of multidonor qubits of well-defined size, we therefore require an accessible method to identify the number of donors for a given lithographic area. One such method, which we investigate here, is the use of machine learning techniques for image recognition.17−20
Deep learning is a type of machine learning characterized by the successive application of layers of artificial neurons that perform information processing analogous to that of biological entities.21,22 Analogous to biological neurons, artificial neurons take a set of weighted inputs along with a bias term and perform a nonlinear mapping to a single output. The nonlinear function, termed the activation function, is responsible for mimicking the switching behavior found in biological systems. Successively connected layers of neurons can then be leveraged as a universal approximator23 that can be efficiently trained using conventional optimization methods such as gradient descent. Training, therefore, consists of minimizing the difference between the expected output and the current approximated output by modifying the weights and biases in each layer that parametrize the model. The difference measure is termed the loss function, the choice of which is generally problem dependent. This provides an algorithmic approach to solving many data science problems while bypassing the domain expertise usually required to handcraft feature extractors. The addition of more exotic structures, such as convolution and recurrent layers, has expanded the domain of such techniques with applications in natural language24 and signal processing.25
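To make these ingredients concrete, the following minimal NumPy sketch (illustrative only, not code from this work) implements a single artificial neuron and a small dense layer, using a sigmoid activation function:

```python
import numpy as np

def sigmoid(z):
    # Activation function that mimics the switching behavior of a biological neuron
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Weighted inputs plus a bias term, mapped through a nonlinear activation
    return sigmoid(np.dot(w, x) + b)

# A dense layer is simply a set of such neurons sharing the same input vector.
rng = np.random.default_rng(0)
x = rng.normal(size=4)        # 4 inputs
W = rng.normal(size=(3, 4))   # 3 neurons, each with 4 weights
b = np.zeros(3)               # one bias per neuron
layer_output = sigmoid(W @ x + b)
```

Training adjusts `W` and `b` by gradient descent on the loss function; stacking several such layers gives the universal approximator described above.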
Convolutional neural networks (CNNs) are a ubiquitous tool in computer vision.26 They provide an algorithmic solution to classification problems that are often, but not always, trivial for humans. Such a task is characterized as a mapping between an input image and a predefined set of classes that describe some aspect or feature contained within the image. The efficacy of these techniques has led to a proliferation of applications, of which breast cancer identification,27 autonomous vehicle navigation,28 and detection of extreme weather events29 are just a few. CNN-based models have also successfully been applied to semiconductor-based qubits with applications in optimizing device fabrication,30 automatic device tuning,31−33 Hamiltonian estimation,34 and single-shot state classification.35
A CNN is a feed-forward network that extracts image features by learning convolutional image kernels, similar to kernels applied by human experts such as edge detection or sharpening. These convolution layers are stacked to form a deep network that facilitates mapping between the image and what is called a “feature-space”. Following this, standard densely connected layers are generally attached to the output feature layer to perform the final classification step. Before the advent of deep learning techniques, image classification consisted of handcrafted filters and feature extraction methods. The distinct advantage of CNNs over these filter and feature extraction techniques is the ability to efficiently learn the intermediate feature space representation and the classification mapping simultaneously. This is done by adjusting the filters applied at each convolutional layer as well as the weights and biases of the final classification stage.22 As with standard feed-forward networks, CNNs are trained via gradient descent, where the loss is measured between the expected and predicted classes. An additional advantage is that the important features are learned directly from the data, which reduces the amount of a priori knowledge required. This is particularly useful when the methods to extract particular features are complex, such as with general image classifiers, or even unknown, as in the case of the present work.
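As an illustration of the filtering a convolutional layer performs, the sketch below (illustrative, using SciPy) applies a handcrafted edge-detection kernel of the kind a CNN would instead learn automatically from data:

```python
import numpy as np
from scipy.signal import convolve2d

# A handcrafted 3x3 edge-detection (Laplacian) kernel; during training a CNN
# learns kernels like this one directly from the data.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

image = np.random.rand(50, 50)  # placeholder standing in for a real STM scan
feature_map = convolve2d(image, edge_kernel, mode="same")
```

Each learned kernel produces one such feature map; pooling layers then reduce the dimensionality of the stacked maps before classification.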
In this paper, we propose a machine learning (ML) method for predicting donor numbers in multidonor quantum dot qubits based on image recognition from the lithographic STM images taken during device fabrication.
Previous work has examined the use of CNNs for identifying the donor position after incorporation and overgrowth with a thin silicon layer.20 In contrast, we use our ML model to predict the number of donors before incorporation. This is an important distinction as it allows for real-time correction of the lithographic site despite not predicting the exact intrasite configuration of the donors. Knowledge of the precise donor configuration can be gained via subsequent measurements during characterization of the quantum processor.36
To highlight the benefits of our ML-based approach, we first describe the standard state-of-the-art technique for estimating the number of donors in a given lithographic patch. Figure 1a shows an STM image of the silicon surface passivated with a monolayer of hydrogen that acts as a lithographic mask. The dark regions correspond to the hydrogen mask, while the bright areas indicate the dangling bonds exposed when an STM tip has been pulsed to create openings. It is these openings that allow subsequent incorporation of phosphorus donors using phosphine (PH3) as a precursor. The device presented in Figure 1a consists of four control gates (left, middle, right, and bottom); a single-electron transistor (SET) charge sensor that is tunnel coupled to source and drain; and two multidonor quantum dots that define the two qubits. Figure 1b shows a close-up STM image of the central part of the device with the two-qubit sites. The upper panels in Figure 1b show the exact size of the lithographic openings at left and right qubit sites, where the white lines indicate the dimers on the reconstructed silicon surface, and the bare silicon dimers are highlighted in blue. The size of the lithographic openings at the qubit sites can be used to estimate the number of donors that can be incorporated into the sites upon dosing with PH3 and annealing. For example, it has been shown that three consecutive dimers need to be desorbed in order to facilitate incorporation of a single donor.7,37 For the device shown in Figure 1b, we estimate that ≤2 (≤1) donors can be incorporated into the lithographic opening at the left (right) qubit site with seven (five) fully desorbed dimers (see Supporting Information, S1 for a detailed discussion on the incorporation chemistry). While this method provides a quick estimation for the likely donor number, it suffers from a high error rate due to the multiple chemical pathways that can occur for the phosphine gas to dissociate before P incorporates into the surface during the incorporation anneal.
Figure 1.

Comparing traditional STM identification of donor number with our machine learning approach. (a) STM image of a two-qubit silicon device in a hydrogen resist, taken before dosing with phosphine gas. The standard method for classifying the donor number is illustrated in panels (b) and (c): the likely donor number in each patch is estimated based on the size of the lithographic patch before dosing. The lower panel of (b) shows the central part of the device [dashed box in panel (a)], and the upper panel shows the close-up images of the two-qubit sites. The diagonal white lines running from top-right to bottom-left indicate the Si dimer rows, and the top-left to bottom-right lines correspond to the periodicity of individual Si dimers. The exposed Si surface is highlighted in blue, with ellipses corresponding to Si dimers and circles to single Si atoms (single dangling bonds). The donor number in a given lithographic patch can then be estimated based on the number of exposed dimers, as detailed in the main text. For the device shown, the predosed images suggest ≤2 P donors to be incorporated in the left dot and ≤1 in the right dot due to the known requirement of 6 bare silicon dangling bonds to incorporate a single donor.9 After dosing with phosphine gas, the height of the adsorbed features on a subsequent STM scan is used to determine the presence of PHx=1,2 species, which are then used to estimate the donor numbers as shown in panel (c). An alternative approach, illustrated in (d), uses a convolutional neural network (CNN) to classify the qubit sites. Convolutional layers learn filters that map the input images to a latent feature space, which is reduced in dimensionality by the intermediate pooling layers. The feature space is then classified using the subsequent dense layers that are separated by a dropout layer to mitigate overfitting. The loss function calculated between the model predictions, Ŷ, and the true labels, Y, provides feedback to adjust the trainable variables of the model.
Additional information about the exact donor numbers can be gained by taking further STM images after dosing the device with PH3. In Figure 1c, we show qubit sites imaged after dosing, where the changed appearance from predosing to postdosing indicates that a chemical reaction has occurred between the exposed silicon surface and the phosphine gas.38 By performing feature-height analysis,36 we can identify the PH2 and PH species adsorbed to the qubit sites as well as the dangling bonds (DB). The lower panel in Figure 1c shows the height profiles measured for DB (orange), PH2 (gray), and PH (purple), with the corresponding features indicated with the same colors in the STM images in Figure 1c. For the right qubit, we observe one PH2 and one PH, which will most likely result in a single P atom (from the PH species) being incorporated into the silicon crystal and a PH3 molecule desorbing from the surface. The left qubit site contains one extra PH2 compared to that of the right qubit. We estimate that either 1 or 2 P atoms can be incorporated within the left qubit patch. This analysis can be used to set an upper bound on the number of phosphorus atoms that will be incorporated into the lithographic region.
In Figure 1d we present an alternative ML-based approach for determining the donor numbers. In this approach, a CNN is trained to map STM images of the qubits to the known donor number determined via electrical characterization (see Supporting Information, S2). A given CNN may be trained on either the before-dosing or the after-dosing images to reduce the complexity of the task. Once the models are trained, they can be used to predict donor numbers for future devices during manufacture.
We used deep learning techniques to analyze over 40 quantum dots to determine the number of incorporated phosphorus donors. Specifically, we trained a CNN to extract and classify the features of the STM images of the lithographically defined donor sites to identify the relative likelihood of containing N-incorporated phosphorus donors. Our CNN models achieved classification accuracies in excess of 90%. Generally, such a small data set would not be considered when invoking neural networks or other ML approaches as problems with overfitting and model generalization will overshadow any likely gains. The size of the data set normally required depends on the complexity of the task with conventional general-purpose image classifiers using millions of examples.39 In this work, however, we leveraged the existing knowledge of the incorporation chemistry3−7 to simplify the problem, thus greatly reducing the data required. These results show that ML can be used to identify the multidonor quantum dots for future large-scale quantum processors based on phosphorus donors in isotopically purified silicon-28, using methods that will continue to learn and improve with the manufacture of further devices.
A CNN for generalized image recognition will typically contain a large number of convolutional and dense layers. For example, certain instances of the popular VGG19 model39 contain up to 16 convolutional layers alone. Incorporating such a deep model into the present work is not a tenable solution, as our limited data set would likely lead to overfitting and convergence failures. Additionally, this naive approach neglects simplifications such as the known location and size of the quantum dots, which can instead be exploited to reduce the complexity of the problem. To this end, we implemented a reduced CNN structure, shown in Figure 1d, consisting of only two convolutional layers. The convolutional layers learn filters that process features within the STM image, similar to handcrafted image kernels. The output of these stacked layers is the feature map, which is then fed into two subsequent densely connected layers. The densely connected layers are responsible for utilizing the feature space for classification. The final layer maps to the four classes corresponding to the predicted number of phosphorus donors (0–3). It is standard practice in ML to divide the data into both training and validation sets. The training data is used during the training process to optimize the model parameters. The validation data, in contrast, are never used for training and are instead used to monitor the accuracy of the model during training. Using validation data in this way weeds out cases of overfitting, where the model learns each of the images in the training set but is not capable of generalizing to the unseen validation data. Thus, we mitigate overfitting using early stopping, i.e., halting training when the validation accuracy starts to decrease (see Supporting Information, S3 for more information on the CNN model).
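A minimal sketch of such a reduced architecture, assuming a Keras/TensorFlow implementation, is given below. The two convolutional layers of 128 kernels of size 3 follow the description in Figure 3; the dense-layer width and dropout rate are illustrative placeholders (the exact hyperparameters are given in Supporting Information, S3):

```python
from tensorflow.keras import layers, models, callbacks

model = models.Sequential([
    layers.Input(shape=(50, 50, 1)),           # 50x50 grayscale STM crop
    layers.Conv2D(128, 3, activation="relu"),  # 128 kernels of size 3
    layers.MaxPooling2D(),                     # pooling reduces dimensionality
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),       # width is illustrative
    layers.Dropout(0.5),                       # rate is illustrative
    layers.Dense(4, activation="softmax"),     # donor-number classes 0-3
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping halts training when the validation accuracy stops improving.
early_stop = callbacks.EarlyStopping(monitor="val_accuracy", patience=10,
                                     restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stop])
```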
In the current work, we are restricted to a limited data set composed of 40 donor sites, which are sourced from 20 unique devices that may contain up to 3 donor sites. The small data set is a result of the long measurement process required to unequivocally determine the donor number through low-temperature electron spin resonance and charge measurements (see Supporting Information, S2 for the methods used to determine the donor number). Models with large data sets generally learn context directly from the training sets. In the case of a small data set, it is instead better to try to remove obstacles that may be faced by the model based on physical knowledge of the problem. This allows the small data set to be used more effectively by removing the need to learn information from context. In practice, what this means for our work is the application of image preprocessing and image augmentation. Via these two steps, ML became viable even for this small number of images, which would not generally be sufficient for an ML-based approach.
Due to the small data set, unsupervised models and transfer learning were initially attempted as these historically have better performance for small data. In this case, however, we find poor performance of such methods and instead opt to use a CNN structure (see Supporting Information, S4). Use of a CNN is advantageous in this case, as we wish to verify the model output physically by examining the intermediate kernels. Handcrafted feature extraction is not a tenable solution either, as incomplete knowledge of the surface chemistry does not permit a simple way to achieve this task.
The training images consist of STM scans taken before and after dosing the device with phosphine gas (see Figure 1b for an image taken before dosing). Each image was preprocessed to provide a consistent resolution by cropping the scan to a ≈ 30 nm region around the device of interest, containing the single electron transistor and qubit sites. A consistent resolution removes the need for the models to learn distance relationships contextually from the background dimer rows across different length scales. The images were then normalized to a consistent feature height using a plane-level process, where a plane fit is computed for all the STM image points and is subtracted from the data, taking care to consider multiple silicon lattice planes that may be present in the images.
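The plane-level step can be sketched as a least-squares plane fit; the simplified version below assumes a single lattice plane, whereas scans spanning several Si planes require the terraces to be handled separately:

```python
import numpy as np

def plane_level(image):
    """Fit a plane z = a*x + b*y + c to all STM image points by least
    squares and subtract it, normalizing the feature heights.
    Simplified sketch: assumes a single silicon lattice plane."""
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Design matrix for the plane fit
    A = np.column_stack([xx.ravel(), yy.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, image.ravel(), rcond=None)
    return image - (A @ coeffs).reshape(h, w)
```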
After this initial preprocessing, we performed data augmentation to extend the data set. A common problem among ML models is overfitting to the training data at the expense of generality. This is especially true for neural networks with small data sets as their high dimensionality affords them the ability to essentially “store” all the training data without learning meaningful features. Data augmentation extends the data set by performing transforms that do not alter the information that is to be learned from the data set. In our case, the features that we are trying to learn from the images are invariant under rotations and translations.
It is physically valid for the purposes of classification to treat each qubit site as independent; thus, we cropped each qubit site to a ≈ 6 nm bounding box. By cropping the sites with relatively tight bounds, we removed the need for the models to learn the qubit location contextually. Additionally, we were able to reduce the noise introduced from surface imperfections, such as neighboring dangling bonds and scan artifacts. Translation of the bounding box for augmentation is possible since we are cropping the qubit sites from the larger STM scan. We were therefore able to programmatically move the bounding box around the region of interest to produce augmented images. Adding these images to the training set forced the network to learn a more general relationship because the important features appeared in different locations within the image, while non-useful features, such as dangling bonds, would often leave the region of interest. We augmented our data set by performing random translations of the bounding box up to ±10 pixels in either dimension on the original images. Using this technique, we generated 100 unique augmented images per site. During each training step (corresponding to a gradient descent step), we sampled 20 images from this set to form the training set. Generating the transforms a priori reduces training time. While it is possible to rotate the images, we avoided this transform as the silicon dimer rows often already appear in different orientations across the training set, and no noticeable improvement was observed with this addition.
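An illustrative implementation of this translation augmentation is sketched below; the bounding-box size in pixels is an assumed stand-in for the ≈6 nm region, and the site is assumed to lie away from the scan edge:

```python
import numpy as np

def augment_site(scan, center, box=60, max_shift=10, n_aug=100, seed=0):
    """Crop a qubit site from the larger STM scan, randomly translating
    the bounding box by up to +/-max_shift pixels in each dimension to
    generate n_aug augmented images per site. `box` (the ~6 nm bounding
    box in pixels) is an assumed value."""
    rng = np.random.default_rng(seed)
    cy, cx = center
    crops = []
    for _ in range(n_aug):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        y0, x0 = cy - box // 2 + dy, cx - box // 2 + dx
        crops.append(scan[y0:y0 + box, x0:x0 + box])
    return np.stack(crops)

# At each training step, 20 images are sampled from this pregenerated set.
```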
After performing the augmentation, each image was scaled to 50 × 50 pixels as this was found to retain enough information while reducing the model size and training time. To reduce noise from the scan, a threshold operation was performed by shifting the before (after) image pixel intensity values by −20 (−30), clipping to an 8-bit integer (0–255), and renormalizing to the full extent of the 8-bit grayscale range. Examples of both the pre- and postdosing images for different donor numbers are presented in Figure 2a. All preprocessing and augmentation operations can be performed in less time than model inference, providing no barrier to real-time analysis.
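The threshold operation described above can be sketched as follows (an illustrative implementation):

```python
import numpy as np

def threshold_8bit(image, shift):
    """Shift pixel intensities (shift = -20 for predosing images, -30 for
    postdosing), clip to the 8-bit range, and renormalize to the full
    0-255 grayscale extent."""
    shifted = np.clip(image.astype(np.int32) + shift, 0, 255)
    lo, hi = int(shifted.min()), int(shifted.max())
    return ((shifted - lo) * 255 // max(hi - lo, 1)).astype(np.uint8)
```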
Figure 2.
CNN model performance and characteristics. (a) The labeled STM data was divided into training and validation sets that were then augmented with random translation operations. This increased the available data by 100× and helped to mitigate overfitting. The validation set was then divided again to produce a test set which was not used by the model except to evaluate performance after early stopping. The images shown in (a) correspond to 8 unique donor sites from different devices, demonstrating the observed large variation in features across donor numbers and before/after dosing. Independent models were trained on images of devices before and after dosing to produce two separate classification models. The confusion matrices, which indicate the rate and nature of misidentification, are given on the left of (b, c) for each respective model. The relative accuracy of each model is also given for the respective set. As well as the highest-performing model, we compared the accuracy of an ensemble of high-performing models on the test set. As a baseline, we compared the relative performance of the models to that of trained human experts as shown in the histograms of (b, c). Note that while the ground truth is identical for the test and validation sets, the input data is in fact distinct. The relative accuracies of each set are shown above the plot.
We separated a validation set from the training set via random assignment such that it contained one example from each donor number class, amounting to a 10% validation split. Often in ML, it is conventional to also divide a test set from the data that is used for neither training nor validation. The validation and test sets both allow an estimate of the model’s accuracy, but the test set is generally used as the final unbiased estimate after model tuning and selection. In the present work, we are unable to create a test set in the strict sense, as we have a limited number of devices. For example, the current labeled data only has two examples of N = 1-donor number sites, allowing one to be allocated to the training set and one to the validation set. To mitigate this problem, we formed a test set by splitting the augmented images from the validation set in half; i.e., 50 images per qubit site were assigned to the validation set and were used to monitor the validation accuracy at each training step. The remaining 50 images were devoted to the test set. This test set was used to verify that our early stopping had not biased our model and also aided in selecting performant models and estimating the model error.
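In code, the per-site split of the augmented validation images can be expressed as (illustrative sketch with placeholder data):

```python
import numpy as np

# 100 augmented crops per validation site (from the augmentation step);
# the first half is monitored during training, the second half is held
# out as a test set consulted only after early stopping.
site_images = np.random.rand(100, 50, 50)  # placeholder for real crops
validation_half, test_half = site_images[:50], site_images[50:]
```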
During benchmarking trials, we constructed and trained a number of CNN models to observe the repeatability of the training process. We observed that the training process is nontrivial and prone to a particular local minimum corresponding to guessing the donor number 2, as this accounts for approximately 40% of the training images. Using the models that exhibit the highest validation accuracy, we also formed an “expert” ensemble of the four most performant models. These models were trained on restricted subsets of the training data, which is a technique known as bagging. The predictions of the ensemble were then combined to provide an ensemble estimate of the donor numbers. We then benchmarked this ensemble on the test set to determine its efficacy in comparison to those of the best individual models.
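A standard way to combine the ensemble predictions is to average the per-class probabilities, as in the following sketch (the averaging rule is an assumption of this illustration):

```python
import numpy as np

def ensemble_predict(models, images):
    """Average the per-class probability outputs of several trained models
    (each trained on a restricted data subset, i.e., bagging) to obtain an
    ensemble distribution over the donor numbers 0-3."""
    probs = np.mean([m.predict(images, verbose=0) for m in models], axis=0)
    return probs, probs.argmax(axis=1)  # distribution and hard labels
```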
Results/Discussion
All training of the models was performed on a NVIDIA Tesla P100 PCIe 12GB GPU. By restricting the augmentation transforms to sampling from the set generated a priori, a given model could be trained on the order of 1–2 min. Using the STM images before dosing, we can consistently achieve validation accuracies on the order of 90% with the best-performing model exhibiting a validation and test accuracy of 90.9 and 83.6%, respectively. Inclusion of the ensemble of models increased the test accuracy by 7.7%, to a final value of 91.3%. Figure 2b shows the collated confusion matrices and accuracy histograms for each donor number before dosing. A perfect model should return a diagonal matrix as with the “truth” matrix, which shows the validation set ground truth. Errors as a result of the model’s failure to correctly identify a qubit site will appear in off-diagonal positions. This allows the accuracy of the model to be observed in more detail, providing information on potential biases in the training set. The histograms show the relative error for each method in accurately predicting a given donor number. A perfect model should have unity accuracy for each donor number. As expected, the 0- and 1-donor numbers have the highest variance due to the limited number of examples from which to infer. To accurately learn these donor numbers, the model must generalize from the limited examples as well as identify contextually the differences to the higher donor number examples. The lowest confusion is associated with the 2- and 3-donor numbers, which comprise the majority of the data set.
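The per-donor-number accuracies shown in the histograms follow directly from the confusion matrix; an illustrative computation with toy labels (not the actual validation results) is:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels only, standing in for real validation results.
y_true = np.array([0, 1, 2, 2, 3, 3, 2, 1])
y_pred = np.array([0, 1, 2, 2, 3, 2, 2, 1])

# Rows: true donor number (0-3); columns: predicted. A perfect model
# yields a diagonal matrix; misclassifications appear off-diagonal.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
```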
Higher accuracies were achieved for the postdosing model, with validation and test accuracies of 93.9 and 92.9%, respectively. As with the predosing models, inclusion of the ensemble method increased the test accuracy, to 97.4%. As shown in Figure 2c, the models exhibit the highest variance in the 2- and 3-donor numbers. We attribute this to larger donor patches having more features in the postdosing images, requiring that the models learn more possible configurations. Both the pre- and postdosing models were able to achieve ≥97% accuracy on their training data.
Due to the nature of the small data, it is difficult to place uncertainty estimates on the predictions provided by the CNNs. We attempt to quantify the distribution associated with classification via the confusion matrices presented in Figure 2b,c. This is a direct representation of the model uncertainty as a function of donor number on data unseen by the trained models.
In both cases, the CNN models were directly compared to the performance of 6 experts that have years of experience with STM device manufacturing. Here, the ensemble of experts contains a majority of the field’s expert knowledge base for this particular fabrication technique. In each case, the experts were given the same unlabeled images as the models and asked to assign a probability distribution to the unknown qubit similar to the CNN model output. Unlike the models, however, the experts were given higher-resolution images so as not to disadvantage the participants based on visual acuity alone. The experts were given devices from the entire data set to classify without any additional information from low-temperature measurements. As is evident from the histograms of Figure 2b,c, the experts were consistently outperformed by the models across all donor numbers and for 0-donor number in particular. While visual inspection by experts is more error-prone than cryogenic characterization, it is considerably faster as it only takes seconds as compared to 1–2 weeks needed for the latter. As a result, the ability of the ML models to achieve >90% accuracy using solely STM data is a significant achievement, enabling efficient real-time determination of donor numbers. This will allow researchers to systematically obtain desired donor numbers and discard suboptimal devices (e.g., devices containing 0 donors) already at the STM lithography stage, ultimately saving time and resources. In general, an expert will not attempt to classify a device visually; however, we use this comparison to illustrate the difficulty of the task even for domain experts.
Due to the nature of the small data set, it was necessary to perform a number of additional validation steps to determine the learned relationship represented by the model. During model design, devices were exchanged between the training and validation sets to ensure that the model was capable of achieving high accuracies irrespective of a particular set of training or validation examples. Additionally, we were able to verify the scaling of model accuracy with the number of training examples by artificially reducing the size of the training set until the model was no longer performant.
One way for the model to overfit with small data is to learn features that are correlated across the training data but have no physical link to the task. For example, if the STM operating conditions for the 0-donor numbers introduced an imaging artifact only seen in that example, the model may infer this to be a useful attribute for identifying 0-donor sites. In general, this behavior can be difficult to correct. In the present work, however, we may use physical insights to assess a trained model’s understanding of the problem. The translational augmentation steps provide some amount of correction against these effects, as features are shifted around the image. This means noise and artifacts are sometimes displaced outside the bounding box, forcing the model to learn more general features. Additionally, we can use feature visualization to examine the output of the intermediate layers in the model to determine what information is likely being used by the network.40 The output of the intermediate convolutional layers gives insight into the filters that are being used by the model to interpret an input image. In general, this may not always be possible and cannot be considered conclusive proof of a particular learning outcome. It is possible, however, to make some inferences about what kind of filters the model is using if features in the convolutional layers can be correlated to the physical structure of the devices. By examining the intermediate layers (similar to that shown in Figure 3), we find that the networks converge to a sparse and consistent use of intermediate filters. This implies that the augmentation and training techniques have led to a representation that is not a result of overfitting the training data, but instead one that has learned a small number of useful filters for processing the data, with the other intermediate filters ≈0. We also observe that the models are robust to artificially introduced noise and will continue to correctly predict the donor number with a reduced signal-to-noise ratio. True network dissection is not possible for the present work as we cannot provide pixel-level feature labeling due to incomplete knowledge of the associated surface chemistry.
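In a Keras-style implementation, the intermediate feature maps used for this inspection can be extracted with a probe model; a sketch is given below, with the layer name assumed:

```python
from tensorflow.keras import Model

def intermediate_maps(model, images, layer_name):
    """Build a probe model exposing the output of an intermediate
    convolutional layer; inspecting these feature maps reveals which
    filters the trained network actually uses (many converge to ~0)."""
    probe = Model(inputs=model.input,
                  outputs=model.get_layer(layer_name).output)
    return probe.predict(images, verbose=0)

# e.g., maps = intermediate_maps(model, x_val, "conv2d")  # name assumed
```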
Figure 3.
Breakdown of the neural network structure and intermediate feature layers. The network structure is broken down into successive layers with representative samples from feature maps after each of the pool layers. Each convolutional layer learns 128 kernels of size 3 that serve to filter the image. The expanded regions show a selection of these filtered images as well as an average of all 128 layers to provide a snapshot of information processing carried out by the CNN. In the case of the predosed images, the models tend toward performing an edge detection filter. This spatial information is often synonymous with patch size measurement and is similar to the information used in the traditional analysis of predosing images. In contrast, the model trained on the postdosing images learns a filter that extracts information about the height of the peaks on the surface. The final feature maps of each model are fed to their respective densely connected layers, which perform classification of the feature layers. This architecture facilitates learning of the feature extractor and classifier simultaneously.
Using the intermediate layers, we can estimate how the network may be analyzing the STM images, and from this, we find that the model is interpreting the STM images in a physically valid way similar to human experts. In the top half of Figure 3 where we only consider the lithographic template, we show that the model performs operations that are similar to that of an edge detection filter before the final dense layers. This is consistent with the conventional STM image analysis, where the number of exposed Si dimers is taken into account when estimating the number of donors that can be incorporated. In contrast, the images taken after dosing are interpreted differently by the network that is trained on them. Instead of performing an edge detection to look at the relative size of the patch, the network uses feature heights similar to the approach taken by experts. This supports the idea that the ML approach utilizes the same identification methods used by experts; however, a large uncertainty is associated with the approximate techniques due to the incomplete knowledge of the probabilistic chemical reactions between phosphine and the silicon surface.
The representation learned by the models is consistent, with all high-performing models converging on the same filtering. This is verified by examining the intermediate layers for each high-performing model with minor differences stemming from the random initialization of the network at the beginning of training.
Conclusions
While we have successfully demonstrated that models can be trained with high accuracy, it is important to note the caveats of a small data set. Deep learning or similar “black box” models will always be limited by the generality of the training data. That is, these models are able to interpolate between features; extrapolation, however, remains challenging for such models and indeed for humans alike. We expect that through repeated use of this model, further edge cases and scenarios that are not expressed within the current training set will be identified. The distinct advantage of the proposed method is that the model can simply be retrained as more STM data become available, without requiring a redesign of the entire algorithmic structure. Alternative extraction methods, in contrast, may require redesign, which can be time-consuming and costly. This method will also scale efficiently to large data sets as it is built on efficient deep learning training procedures. A simple way to extend the data set for future use is to collate a number of unclassified images and employ a form of unsupervised learning to create an extended data set. This bypasses the bottleneck of extensive donor number identification and is a subject for future work.
Manually designing models using conventional image processing is limited by a lack of knowledge of how the chemical dissociative pathways interact within a larger dot with many reaction pathways. This limits the design of feature extractors to simple filtering or trial and error. It is known that a stochastic, energy-reducing process occurs during the incorporation of PHx species into the silicon crystal lattice.41,42 We attribute the difference in accuracy between the pre- and postdosing models, as well as the higher variance exhibited by the models trained on the predosing images, to this stochasticity.
Interestingly, our predictive model can be used in combination with conventional donor identification methods to further understand the chemical dissociation pathways of phosphine on hydrogen-terminated silicon surfaces. By comparing the output of the pre- and postdosing models we can infer information about the likelihood of the dissociation pathways of phosphine related to the final incorporated donor number in the device. This is critical information for scaling phosphorus-doped silicon quantum devices. Since the ML method here ascribes a donor number based on the size and shape of the lithographic patch, we can determine the optimal lithographic region for any desired donor configuration.
Conventional deep learning models routinely achieve accuracy approaching 100%, whereas the present work remains in the low 90% range for the predosing model and ≈97% for the postdosing model. As previously mentioned, we expect that this accuracy will continue to improve with the inclusion of additional fabrication data. In the interim, however, the reported accuracy is more than sufficient for improving the efficiency of the research and fabrication pipelines, removing a crucial bottleneck in donor identification.
In summary, we have demonstrated a scalable and accurate method for predicting the donor number during the fabrication of Si:P devices. We find that accuracies in excess of 90% can be consistently achieved by mitigating overfitting through reduced model complexity, image preprocessing, and data augmentation. This work extends the notion of using small data sets where well-known physical insights can be applied to determine model efficacy. We expect that the use of this technology will help to provide a scalable pathway to manufacturing quantum devices for computational and sensing applications.
Methods/Experimental
STM images of the silicon-based quantum dots are acquired before phosphine dosing, where a silicon surface has been passivated with a monolayer of hydrogen to act as a lithographic mask. A lithographic patch is created by pulsing the STM tip, creating a region in which a phosphorus donor may be incorporated. Images acquired before the phosphine dosing comprise the data set used to train an ensemble of ML models that can process before-dosing images. The device is then dosed with phosphine followed by annealing (further information on this process is detailed in Supporting Information, S1). Subsequent STM images of the dosed devices comprise the after-dosing data set used to train a second ensemble of ML models that can process after-dosing images. To unequivocally determine the donor number of a given quantum dot, cryogenic measurements are performed, including electron spin resonance and charge measurements (further information is detailed in Supporting Information, S2). Both the before- and after-dosing images undergo a series of augmentation and preprocessing steps to prepare the input images for the ML models. A bounding region of ≈6 nm is centered on each quantum dot, which constitutes the region of interest. During augmentation, this bounding box is randomly translated up to ±10 pixels in either dimension to generate 100 unique augmentations per quantum dot site. Following the data augmentation, the images were down-scaled to 50 × 50 pixels. The pixel values were then shifted and clipped to an 8-bit integer to perform a threshold operation, followed by a renormalization to the full 8-bit range. The images are divided into training and validation sets with a 90:10 split. Using the network structure specified in Figure 1d and detailed in Supporting Information, S3, two separate ensembles of ML models are trained on the before- and after-dosing images, respectively. Early stopping and dropout are used to mitigate overfitting by monitoring the validation loss during training. The top four performing models from each ensemble were selected to form an “expert ensemble” for benchmarking.
Acknowledgments
The research was supported by the Australian Research Council Center of Excellence for Quantum Computation and Communication Technology (project number CE170100012), the US Army Research Office under contract number W911NF-17-1-0202 and Silicon Quantum Computing Pty Ltd. M.Y.S. acknowledges an Australian Research Council Laureate Fellowship.
Supporting Information Available
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsnano.4c00080.
Further information on methods used for surface feature analysis based on the phosphine dosing chemistry can be found in S1; experimental details on the cryogenic determination of donor numbers is presented in S2; details on the machine learning architecture and comparisons to more conventional methods are detailed in S3 and S4 (PDF)
The authors declare the following competing financial interest(s): M.Y.S is a director of Silicon Quantum Computing Pty Ltd.
References
- Muhonen J. T.; Laucht A.; Simmons S.; Dehollain J. P.; Kalra R.; Hudson F. E.; Freer S.; Itoh K. M.; Jamieson D. N.; McCallum J. C.; Dzurak A. S.; Morello A. Quantifying the quantum gate fidelity of single-atom spin qubits in silicon by randomized benchmarking. J. Phys.: Condens. Matter 2015, 27 (15), 154205. DOI: 10.1088/0953-8984/27/15/154205.
- He Y.; Gorman S. K.; Keith D.; Kranz L.; Keizer J. G.; Simmons M. Y. A two-qubit gate between phosphorus donor electrons in silicon. Nature 2019, 571 (7765), 371–375. DOI: 10.1038/s41586-019-1381-2.
- Morello A.; Pla J. J.; Bertet P.; Jamieson D. N. Donor Spins in Silicon for Quantum Technologies. Adv. Quant. Technol. 2020, 3 (11), 2000005. DOI: 10.1002/qute.202000005.
- Ruess F. J.; Oberbeck L.; Simmons M. Y.; Goh K. E. J.; Hamilton A. R.; Hallam T.; Schofield S. R.; Curson N. J.; Clark R. G. Toward Atomic-Scale Device Fabrication in Silicon Using Scanning Probe Microscopy. Nano Lett. 2004, 4 (10), 1969–1973. DOI: 10.1021/nl048808v.
- Wyrick J.; Wang X.; Kashid R. V.; Namboodiri P.; Schmucker S. W.; Hagmann J. A.; Liu K.; Stewart Jr M. D.; Richter C. A.; Bryant G. W.; Silver R. M. Atom-by-Atom Fabrication of Single and Few Dopant Quantum Devices. Adv. Funct. Mater. 2019, 29 (52), 1903475. DOI: 10.1002/adfm.201903475.
- Morello A.; Pla J. J.; Zwanenburg F. A.; et al. Single-shot readout of an electron spin in silicon. Nature 2010, 467 (7316), 687–691. DOI: 10.1038/nature09392.
- Fuechsle M.; Miwa J. A.; Mahapatra S.; Ryu H.; Lee S.; Warschkow O.; Hollenberg L. C. L.; Klimeck G.; Simmons M. Y. A single-atom transistor. Nat. Nanotechnol. 2012, 7 (4), 242–246. DOI: 10.1038/nnano.2012.21.
- Fuhrer A.; Füchsle M.; Reusch T. C. G.; Weber B.; Simmons M. Y. Atomic-Scale, All Epitaxial In-Plane Gated Donor Quantum Dot in Silicon. Nano Lett. 2009, 9 (2), 707–710. DOI: 10.1021/nl803196f.
- Mądzik M. T.; Asaad S.; Youssry A.; et al. Precision tomography of a three-qubit donor quantum processor in silicon. Nature 2022, 601 (7893), 348–353. DOI: 10.1038/s41586-021-04292-7.
- Hill C. D.; Peretz E.; Hile S. J.; House M. G.; Fuechsle M.; Rogge S.; Simmons M. Y.; Hollenberg L. C. L. A surface code quantum computer in silicon. Sci. Adv. 2015, 1, e1500707. DOI: 10.1126/sciadv.1500707.
- Tosi G.; Mohiyaddin F. A.; Schmitt V.; Tenberg S.; Rahman R.; Klimeck G.; Morello A. Silicon quantum processor with robust long-distance qubit couplings. Nat. Commun. 2017, 8 (1), 450. DOI: 10.1038/s41467-017-00378-x.
- Krauth F.; Gorman S.; He Y.; Jones M.; Macha P.; Kocsis S.; Chua C.; Voisin B.; Rogge S.; Rahman R.; Chung Y.; Simmons M. Flopping-Mode Electric Dipole Spin Resonance in Phosphorus Donor Qubits in Silicon. Phys. Rev. Appl. 2022, 17, 054006. DOI: 10.1103/PhysRevApplied.17.054006.
- Wang Y.; Tankasala A.; Hollenberg L. C. L.; Klimeck G.; Simmons M. Y.; Rahman R. Highly tunable exchange in donor qubits in silicon. npj Quant. Inf. 2016, 2 (1), 16008. DOI: 10.1038/npjqi.2016.8.
- Büch H.; Mahapatra S.; Rahman R.; Morello A.; Simmons M. Y. Spin readout and addressability of phosphorus-donor clusters in silicon. Nat. Commun. 2013, 4 (1), 2017. DOI: 10.1038/ncomms3017.
- Wilson H. F.; Warschkow O.; Marks N. A.; Curson N. J.; Schofield S. R.; Reusch T. C. G.; Radny M. W.; Smith P. V.; McKenzie D. R.; Simmons M. Y. Thermal dissociation and desorption of PH3 on Si(001): A reinterpretation of spectroscopic data. Phys. Rev. B 2006, 74, 195310. DOI: 10.1103/PhysRevB.74.195310.
- Ivie J. A.; Campbell Q.; Koepke J. C.; Brickson M. I.; Schultz P. A.; Muller R. P.; Mounce A. M.; Ward D. R.; Carroll M. S.; Bussmann E.; Baczewski A. D.; Misra S. Impact of Incorporation Kinetics on Device Fabrication with Atomic Precision. Phys. Rev. Appl. 2021, 16, 054037. DOI: 10.1103/PhysRevApplied.16.054037.
- Rashidi M.; Wolkow R. A. Autonomous Scanning Probe Microscopy in Situ Tip Conditioning through Machine Learning. ACS Nano 2018, 12 (6), 5185–5189. DOI: 10.1021/acsnano.8b02208.
- Rashidi M.; Croshaw J.; Mastel K.; Tamura M.; Hosseinzadeh H.; Wolkow R. A. Deep learning-guided surface characterization for autonomous hydrogen lithography. Mach. Learn.: Sci. Technol. 2020, 1 (2), 025001. DOI: 10.1088/2632-2153/ab6d5e.
- Gordon O. M.; Moriarty P. J. Machine learning at the (sub)atomic scale: next generation scanning probe microscopy. Mach. Learn.: Sci. Technol. 2020, 1 (2), 023001. DOI: 10.1088/2632-2153/ab7d2f.
- West M. T.; Usman M. Framework for Donor-Qubit Spatial Metrology in Silicon with Depths Approaching the Bulk Limit. Phys. Rev. Appl. 2022, 17, 024070. DOI: 10.1103/PhysRevApplied.17.024070.
- LeCun Y.; Bengio Y.; Hinton G. Deep learning. Nature 2015, 521 (7553), 436–444. DOI: 10.1038/nature14539.
- Goodfellow I.; Bengio Y.; Courville A. Deep Learning; MIT Press, 2016.
- Scarselli F.; Chung Tsoi A. Universal Approximation Using Feedforward Neural Networks: A Survey of Some Existing Methods, and Some New Results. Neural Networks 1998, 11 (1), 15–37. DOI: 10.1016/S0893-6080(97)00097-X.
- Conneau A.; Schwenk H.; Barrault L.; Lecun Y. Very Deep Convolutional Networks for Natural Language Processing. 2016, arXiv:1606.01781. http://arxiv.org/abs/1606.01781.
- Hershey S.; Chaudhuri S.; Ellis D. P.; Gemmeke J. F.; Jansen A.; Moore R. C.; Plakal M.; Platt D.; Saurous R. A.; Seybold B.; et al. CNN Architectures for Large-Scale Audio Classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2017; pp 131–135.
- Yamashita R.; Nishio M.; Do R. K. G.; Togashi K. Convolutional neural networks: an overview and application in radiology. Insights Imaging 2018, 9 (4), 611–629. DOI: 10.1007/s13244-018-0639-9.
- Krithiga R.; Geetha P. Deep learning based breast cancer detection and classification using fuzzy merging techniques. Mach. Vis. Appl. 2020, 31 (7), 1–18. DOI: 10.1007/s00138-020-01122-0.
- Bojarski M.; Del Testa D.; Dworakowski D.; Firner B.; Flepp B.; Goyal P.; Jackel L. D.; Monfort M.; Muller U.; Zhang J.; et al. End to End Learning for Self-Driving Cars. 2016, arXiv:1604.07316. http://arxiv.org/abs/1604.07316.
- Liu Y.; Racah E.; Correa J.; Khosrowshahi A.; Lavers D.; Kunkel K.; Wehner M.; Collins W.; et al. Application of Deep Convolutional Neural Networks for Detecting Extreme Weather in Climate Datasets. 2016, arXiv:1605.01156. http://arxiv.org/abs/1605.01156.
- Mei A. B.; Milosavljevic I.; Simpson A. L.; Smetanka V. A.; Feeney C. P.; Seguin S. M.; Ha S. D.; Ha W.; Reed M. D. Optimization of quantum-dot qubit fabrication via machine learning. Appl. Phys. Lett. 2021, 118, 204001. DOI: 10.1063/5.0040967.
- Ziegler J.; McJunkin T.; Joseph E.; Kalantre S. S.; Harpt B.; Savage D.; Lagally M. G.; Eriksson M.; Taylor J. M.; Zwolak J. P. Toward robust autotuning of noisy quantum dot devices. Phys. Rev. Appl. 2022, 17 (2), 024069. DOI: 10.1103/PhysRevApplied.17.024069.
- Durrer R.; Kratochwil B.; Koski J. V.; Landig A. J.; Reichl C.; Wegscheider W.; Ihn T.; Greplova E. Automated tuning of double quantum dots into specific charge states using neural networks. Phys. Rev. Appl. 2020, 13 (5), 054019. DOI: 10.1103/PhysRevApplied.13.054019.
- Schuff J.; Carballido M. J.; Kotzagiannidis M.; Calvo J. C.; Caselli M.; Rawling J.; Craig D. L.; van Straaten B.; Severin B.; Fedele F.; et al. Fully Autonomous Tuning of a Spin Qubit. 2024, arXiv:2402.03931. http://arxiv.org/abs/2402.03931.
- Koch R.; Van Driel D.; Bordin A.; Lado J. L.; Greplova E. Adversarial Hamiltonian learning of quantum dots in a minimal Kitaev chain. Phys. Rev. Appl. 2023, 20 (4), 044081. DOI: 10.1103/PhysRevApplied.20.044081.
- Navarathna R.; Jones T.; Moghaddam T.; Kulikov A.; Beriwal R.; Jerger M.; Pakkiam P.; Fedorov A. Neural networks for on-the-fly single-shot state classification. Appl. Phys. Lett. 2021, 119, 114003. DOI: 10.1063/5.0065011.
- Fricke L.; Hile S. J.; Kranz L.; Chung Y.; He Y.; Pakkiam P.; House M. G.; Keizer J. G.; Simmons M. Y. Coherent control of a donor-molecule electron spin qubit in silicon. Nat. Commun. 2021, 12 (1), 3323. DOI: 10.1038/s41467-021-23662-3.
- Campbell Q. T.; Koepke J. C.; Ivie J. A.; Mounce A. M.; Ward D. R.; Carroll M. S.; Misra S.; Baczewski A. D.; Bussmann E. Quantifying the Variation in the Number of Donors in Quantum Dots Created Using Atomic Precision Advanced Manufacturing. J. Phys. Chem. C 2023, 127, 6071–6079. DOI: 10.1021/acs.jpcc.3c00479.
- Schofield S. R.; Curson N.; Simmons M.; Rueß F.; Hallam T.; Oberbeck L.; Clark R. Atomically precise placement of single dopants in Si. Phys. Rev. Lett. 2003, 91 (13), 136104. DOI: 10.1103/PhysRevLett.91.136104.
- Simonyan K.; Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. 2015, arXiv:1409.1556. http://arxiv.org/abs/1409.1556.
- Molnar C. Interpretable Machine Learning, 2019. https://christophm.github.io/interpretable-ml-book/.
- Schofield S. R.; Curson N. J.; Warschkow O.; Marks N. A.; Wilson H. F.; Simmons M. Y.; Smith P. V.; Radny M. W.; McKenzie D. R.; Clark R. G. Phosphine Dissociation and Diffusion on Si(001) Observed at the Atomic Scale. J. Phys. Chem. B 2006, 110 (7), 3173–3179. DOI: 10.1021/jp054646v.
- Warschkow O.; Curson N. J.; Schofield S. R.; Marks N. A.; Wilson H. F.; Radny M. W.; Smith P. V.; Reusch T. C. G.; McKenzie D. R.; Simmons M. Y. Reaction paths of phosphine dissociation on silicon (001). J. Chem. Phys. 2016, 144 (1), 014705. DOI: 10.1063/1.4939124.