Small Methods. 2024 Mar 28;8(12):2301573. doi: 10.1002/smtd.202301573

Accelerating the Development of Thin Film Photovoltaic Technologies: An Artificial Intelligence Assisted Methodology Using Spectroscopic and Optoelectronic Techniques

Enric Grau‐Luque 1,2, Ignacio Becerril‐Romero 1, Fabien Atlan 1,2, Daniel Huber 3, Martina Harnisch 3, Andreas Zimmermann 3, Alejandro Pérez‐Rodríguez 1,4, Maxim Guc 1, Victor Izquierdo‐Roca 1
PMCID: PMC11672176  PMID: 38546017

Abstract

Thin film photovoltaic (TFPV) materials and devices present a high complexity with multiscale, multilayer, and multielement structures and with complex fabrication procedures. To deal with this complexity, the evaluation of their physicochemical properties is critical for generating a model that proposes strategies for their development and optimization. However, this process is time‐consuming and requires high expertise. In this context, the adoption of combinatorial analysis (CA) and artificial intelligence (AI) strategies represents a powerful asset for accelerating the development of these complex materials and devices. This work introduces a methodology to facilitate the adoption of AI and CA for the development of TFPV technologies. The methodology covers all the necessary steps from the synthesis of samples for CA to data acquisition, AI‐assisted data analysis, and the extraction of relevant information for research acceleration. Each step details the necessary concepts, requirements, and procedures and is illustrated with examples from the literature. Then, the application of the methodology to a complex set of samples from a TFPV production line highlights its ability to rapidly glean significant insights even in intricate scenarios. The proposed methodology can be applied to other types of materials and devices beyond PV and using different characterization techniques.

Keywords: accelerated research, artificial intelligence, combinatorial analysis, methodology, thin film photovoltaics


A methodology facilitates the combination of artificial intelligence and combinatorial analysis for the development and optimization of thin film photovoltaic technologies. It describes all the necessary steps from sample synthesis to data analysis and the extraction of information for research acceleration. Its application to an example case highlights its ability to rapidly provide insights even from highly complex data sets.


1. Introduction

Thin film photovoltaic (TFPV) technologies are based on complex devices created by stacking several multiscale micro‐ and nano‐layers, some of them formed by multielement compounds (e.g., Cu(In,Ga)Se2, Cu2ZnSn(S,Se)4, CdTe, Sb2Se3, etc.). This is performed through complex multistep and multitechnique fabrication processes (evaporation, sputtering, chemical bath deposition, etc.) that require a fine‐tuning of different process parameters for the correct deposition and/or synthesis of each layer. Figure 1 illustrates this complexity with a cross‐section of a TFPV device showing its internal layer structure, where the different possible materials are indicated together with the most critical parameters to be controlled.

Figure 1.

Diagram of a generic TFPV device indicating the general structure, possible compounds for each layer, and parameters of the layers that are usually subjected to monitoring.

To deal with this complexity and obtain a profound understanding of the entire system, the evaluation and quantification of the compositional, morphological, structural, optical, and electrical properties, among others, of TFPV devices is critical for generating a model that allows proposing strategies for their further improvement and optimization, especially in terms of photovoltaic (PV) performance. Combinatorial analysis (CA), which involves the preparation and multi‐analysis of combinatorial samples, has been shown to be one of the most powerful and timesaving approaches for such complex systems.[ 1 ] Nowadays, this approach can be greatly improved by combining several advanced characterization techniques that provide as complete a picture of the devices as possible. In this context, light‐based characterization techniques have proven to be very versatile for CA as they can provide information about a wide range of different material and device properties, including composition, morphology, crystal structure, and thickness, while offering non‐destructive and fast signal acquisition. Examples of such techniques include Raman and photoluminescence (PL) spectroscopies, on which the methodology proposed in the present study is focused. Raman spectroscopy has shown its potential for the advanced characterization of different PV materials, including perovskites,[ 2 ] kesterite,[ 3 ] and chalcopyrite,[ 4 ] revealing details on their structures, defect formation, and general layer performance. Raman spectroscopy has also yielded powerful results in other materials science fields such as superconductors, polymers, corrosion, and environmental materials.[ 5 , 6 , 7 , 8 ] Likewise, PL has also been employed to characterize different PV materials such as kesterite,[ 9 , 10 ] perovskites,[ 11 ] and silicon[ 12 ] as it allows obtaining valuable information about the defects and the recombination mechanisms present in the materials. As such, it is clear that the combination of Raman and PL spectroscopies offers insightful details of materials, making it a powerful asset for research and technology development in TFPV. Other spectroscopic techniques like X‐ray fluorescence (XRF) provide compositional information that complements that of Raman and PL, resulting in further insight into material properties. If all this information about material properties is combined with optoelectronic characterization like current–voltage (IV) measurements, which provide information about PV device performance, a comprehensive and holistic picture of TFPV devices can be obtained.[ 1 ]

However, acquiring raw characterization data does not directly provide a picture of TFPV devices. This requires the extraction and quantification of relevant parameters from the characterization data through the use of data analysis techniques. In particular, spectroscopic data analysis requires that quantifiable information such as peak positions, intensities, areas, or symmetries, to mention a few, is extracted from the raw spectra. This type of analytical data analysis is labor‐intensive and requires highly trained personnel. Additionally, the large amount of information hidden in spectral data is difficult to unravel using solely analytical data analysis techniques. In this regard, artificial intelligence (AI) can help explore spectroscopic data and exploit their full potential by working with the full information contained in them. Machine learning (ML) is slowly becoming more common in materials research and, combined with CA, has already shown promising results.[ 13 ] This is exemplified in ref. [14], where ML and CA are used to successfully explore the mechanisms behind defect formation in Cu2ZnGeSe4 kesterite‐based thin film (TF) solar cells. Additional examples can be found on the application of only ML algorithms in comprehending the limitations of PV devices and enhancing their performance, although these are focused on specific technologies and on resolving their specific drawbacks,[ 15 , 16 , 17 , 18 ] which narrows the targeted audience and strongly limits a more general cross‐technological application of the proposed methods. This very focused application of AI algorithms to PV, despite being of high relevance, results in a lack of information that creates several barriers for researchers attempting to implement combined CA and ML approaches to the specific and critical topic of the development and optimization of PV materials and devices. These barriers include the reduced availability of large data sets, the need for expert knowledge, and the lack of clear procedures for their application,[ 19 , 20 ] which slows down the widespread implementation and further development of CA and ML approaches. One of the limitations is the proper pre‐processing of measurement data. For instance, in the case of spectroscopic data, pre‐processing allows not only emphasizing the relevant changes in the spectra but also correctly combining the data obtained from different techniques for their input into ML algorithms. Also, the use of ML requires substantial amounts of high‐quality data for a precise analysis of the physicochemical parameters of new materials and devices, which necessitates the use of automated systems for massive characterization measurements. All this requires deep theoretical, statistical, analytical, and programming knowledge. Therefore, practical guides and methodologies that help researchers apply such tools are paramount to accelerate their universal adoption and ultimately shorten the development times of new materials and devices.

In this context, the main objective of this work is to propose a methodology to facilitate the adoption of ML and CA for the development of TFPV technologies. This work explains the necessary concepts and procedures, from the synthesis of relevant sets of samples for CA to the pre‐processing and AI‐assisted analysis of spectroscopic (mainly Raman and PL) characterization data. All the steps of the methodology contain application examples based on recently published results where they were successfully implemented. To complete the work, the full methodology is applied to a set of samples produced by Sunplugged GmbH in their pilot line during the development of their mechanically flexible solar modules based on the low bandgap CuInSe2 absorber. The methodology, however, is not limited to the techniques or materials used as examples in the current work and can be applied to other case studies. We strongly believe that methodologies and explanations of this type are fundamental to boost materials research, and TFPV in particular, towards new limits, aiming for a more robust, faster, and automated research practice that should produce high quality results with a strong application potential in a more versatile, efficient, and cheaper way.

2. Methodology Overview

The overall structure of the methodology proposed in this work is schematically shown in Figure 2. The main goal of the procedure is to obtain relevant information from the CA of relevant samples using AI to create a model of TFPV devices that serves as feedback for technology improvement. Overall, the methodology starts with the synthesis and characterization of a combinatorial sample or of a combinatorial set of samples in order to obtain relevant device and/or material properties. The characterization data obtained are then divided into features and targets and subjected to conditioning and fusion, respecting data traceability and preparing them for input into AI algorithms. In the next step, the conditioned data are analyzed by applying the AI‐assisted approach. The present work proposes the use of cascaded principal component analysis (PCA) and linear discriminant analysis (LDA) classification algorithms as a powerful tool to process spectroscopic data. After data analysis, the next step is to verify that there are no overfitting problems and to validate the results obtained. The proposed methodology leads to a classification of the data that allows decisions to be made about the most relevant and optimum production parameters and generates knowledge about the critical properties of the materials and/or devices. Moreover, the methodology also explains how to select critical samples, techniques, and spectral ranges to generate more solid feedback for further technology improvement.

Figure 2.

General flow of the proposed methodology for accelerated research using CA and ML. See text for more details.

It is worth noticing that the combinatorial sample preparation as well as the characterization of materials and devices is oriented to the specific case of TFPV technologies and may not be directly extrapolatable to other types of technologies. However, the rest of the methodology is more general and can be applied to other cases. This makes it possible to easily extend the proposed methodology to any complex multilayer and multicomponent system. The subsequent sections and subsections of the article provide a more detailed explanation of each step of the proposed methodology and of the critical points that should be considered.

3. Sample Preparation for Combinatorial Analysis

The first step of the proposed methodology is the preparation of samples suitable for CA. Combinatorial samples or sets of samples are those in which a property is deliberately varied in a controlled way in‐sample or sample‐to‐sample, respectively. The analysis of combinatorial samples represents an efficient way of obtaining relevant insights that can be used both for extracting information about fundamental material properties and for technological optimization. The focus on combinatorial samples in the present approach is supported by the high potential already shown by this type of samples for the research acceleration of various technologies.[ 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 ] In the case of TFs, different physical and chemical deposition techniques can be employed for the preparation of combinatorial samples and sets of samples, which result in gradient or discrete sample libraries, respectively. A discrete library consists of individual samples in which each has a discrete variation of a property, normally related to composition. On the other hand, a gradient library is a single sample with a deliberate inhomogeneity consisting of a continuous variation (gradient) of a property across its surface, e.g., a sample with a graded thickness in one of its layers. Diagrams and real examples of each type of sample library are presented in Figure 3. Both approaches, discrete and gradient sample libraries, can be employed in the methodology proposed in this paper and lead to useful results. However, both approaches present advantages and disadvantages that need to be taken into account when choosing one or the other (see Section S1 in Supporting Information for more details about different types of combinatorial samples). Finally, once a combinatorial sample is produced, it is very important to discretize it into analysis areas (pixel cells in the case of TFPV devices) that allow the spatial correlation of the characterization data. This discretization can be physical or virtual and will be discussed in detail in Subsection 4.3.

Figure 3.

A) Diagram of a discrete sample set with process temperature and time variations, B) diagram of a continuous spread sample with one graded layer, C) picture of a discrete sample set, and D) picture of a continuous spread graded sample.

To sum up, the first step for performing a CA experiment that can help boost the technological development of TFPV technologies is the fabrication of a suitable sample set in which the material (e.g., composition, thickness…) or processing (e.g., synthesis temperature, deposition speed…) parameters to be investigated present significant variations that can lead to insightful results.

4. Data Acquisition and Conditioning

Once a carefully designed sample library has been produced, the next step is to acquire data through its physicochemical characterization. The following subsections describe different concepts and provide guidelines for data acquisition and conditioning that need to be taken into account for their later use with AI data analysis techniques.

4.1. Characterization Techniques

Considering the high complexity of the compounds and layer architectures used in TFPV devices, the characterization of the sample library must be as complete and comprehensive as possible. This requires the use of multiple techniques on the same discretized area of the sample, defined as analysis area or pixel cell, to maximize the information collected and be able to reveal correlations. The characterization methods employed must be non‐destructive, with a spatial resolution comparable to or higher than the pixel cell size, and, ideally, enable fast acquisition times and automation capabilities.[ 1 ] Nondestructive techniques must be employed due to the necessity of carrying out multiple measurements without altering the properties of the sample, while the spatial resolution must be high enough to resolve significant property variations in graded combinatorial samples. The use of fast, automated mapping measurement procedures is highly advisable due to the high number of measurements to be performed on the sample libraries, which, otherwise, could result in extremely long data acquisition times for obtaining the high statistics necessary for AI. Significant advances have been made in automated measurement systems that are self‐controlled and do not require significant sample preparation time or the permanent control and supervision of an operator. If such systems are not available, standard spectroscopic systems can be automated by coupling measuring probe‐heads to programmable motorized XYZ gantry systems or translation stages, combined with the use of optical fibers or of detectors that can be integrated within the probe heads. This simple approach will lead to a significant reduction of the labor and time needed for acquiring a statistically relevant amount of data and to an increase in the size of the data sets. Automation is thus a critical component that can significantly contribute to the evolution, enhancement, and development of any research area, and of TFPV materials and devices in particular. It brings enhanced efficiency and suitability to the research process, with fast and consistent data acquisition and reduced long‐term costs that yield better products.

Another option to speed up data acquisition is the combination of several techniques in parallel in some type of multi‐sensor tool. However, commercial multi‐sensor systems are extremely rare and, normally, custom solutions have to be employed for this. The sensors integrated in a multi‐sensor tool need to be compatible with each other, and multiplexing strategies for signal excitation and collection may be needed. An example of a multi‐sensor tool can be found in Figure 4, where a radial‐like connection of different probe‐heads is proposed, allowing the simultaneous and spatially correlated acquisition of different spectra (Raman, PL, reflectance, etc.). Moreover, it is critical that the characterization techniques employed are complementary and provide insights into a wide variety of relevant properties of TFPV devices, such as compositional, optical, structural, and optoelectronic ones. For TFPV technologies based on chalcogenide compounds, this includes methods such as X‐ray fluorescence (XRF),[ 30 , 31 , 32 ] energy‐dispersive X‐ray spectroscopy (EDS),[ 33 , 34 ] X‐ray diffraction (XRD),[ 30 , 32 ] optical spectroscopy,[ 35 ] PL spectroscopy,[ 36 , 37 ] and Raman spectroscopy,[ 30 , 34 , 38 ] among others. In the case of the development and optimization of PV devices, all this information needs to be ultimately correlated with their final performance (optoelectronic properties), which is commonly obtained from IV measurements. In this case, it is advisable to discretize the samples into pixel cells with a size that allows converting them directly into isolated solar cells from which spatially resolved information about their IV characteristics can be obtained and correlated with the rest of the data. Additionally, the information provided by the characterization techniques can be complemented with parameters from the fabrication process such as synthesis temperature and pressure, deposition power, solution concentration, and others. A combined analysis of different techniques and fabrication parameters along with optoelectronic data offers valuable insights into how variations in a selected material property or production parameter impact device performance. These data, after proper analysis, will allow the determination of the optimal range of the material/PV device property varied in the combinatorial sample set and will give insights for the further optimization of the performance of the TFPV technology through subsequent iterations.

Figure 4.

Schematic of a multi‐sensor tool example containing PL, reflectance, and Raman sensors from multiplexed laser and wide emission LED sources.

It is worth noticing that the data obtained from spectroscopic or optoelectronic characterization techniques are in the form of a spectrum or a curve. These data can be used either as a whole (as a vector) or, alternatively, different parameters can be extracted through proper data processing and used for further analysis as a set of scalars. For instance, from a whole Raman spectrum just the scalar values corresponding to the main peak positions could be used or, conversely, a complete IV curve can be used instead of scalar values of optoelectronic properties. The type of data selected (matrix, vector, scalars…) influences the choice and performance of the methods employed for further data analysis. This is a critical aspect for the correct selection of the AI algorithms for data processing and analysis, which will be addressed in the following sections.

In this way, obtaining relevant input data for further analysis requires the spatially correlated characterization of the sample set using multiple nondestructive and fast complementary inspection techniques like XRF, Raman, PL, and IV. Together, these measurements constitute a characterization data set that can provide a detailed picture of the TFPV materials and devices after proper data processing and analysis.

4.2. Data Set Size

An adequately sized data set is crucial for the successful application of data analysis techniques and for extracting insightful information. For example, ref. [39] presents a case in which Cu(In,Ga)Se2 (CIGS) TF high efficiency solar cells processed with different RbF post‐deposition temperatures present subtle yet relevant variations in fundamental material properties that are difficult to detect within the data dispersion, that could only be revealed thanks to the large size of the data set (480 data points), and that would have otherwise remained hidden. The importance of the data set size is particularly critical when applying AI‐assisted methodologies. ML methods are normally applied to big data (BD), which is typically considered a data set large enough for a human not to be able to process it with analytical or traditional IT tools.[ 40 ] Even though BD has no standard size definition,[ 41 ] the amount could normally be placed in the order of tens of thousands of data points and beyond. Conversely, in materials science, experimental data sets are typically in the realm of tens to hundreds of data points, rarely in the range of thousands. However, it has been argued that a good starting point for a reasonable ML model is 50 data points,[ 20 ] and there are strategies that significantly increase the effectiveness of some algorithms' application. For instance, crude estimation of property (CEP) has shown good results with data sets of roughly 100 points.[ 42 ] Also, as demonstrated in ref. [14], even fewer than 200 data points measured on a combinatorial sample can lead to good results using ML algorithms for V OC prediction. For more information about data size and its influence on the obtained conclusions, please refer to Section S2 in Supporting Information.

In summary, the size of the data set needs to be large enough not only to successfully apply AI data analysis algorithms, but also to obtain reliable and insightful results from data analysis. For AI‐assisted methodologies, this size is completely dependent on the experiment and algorithms selected, but something in the range of hundreds of data points and beyond is advised.

4.3. Data Conditioning, Fusion, and Traceability

Once the data have been acquired, the next step in the methodology is their conditioning and fusion for their later use in ML algorithms, keeping in mind that the data must remain traceable.

In this process, a critical point is data separation into targets and features, which directly follows data acquisition as shown in Figure 2. Targets are the properties to predict or classify by the ML algorithm, while features, also known as descriptors, are the variables used to make that prediction. Choosing the correct targets and features to be used in AI‐assisted methodologies is one of the most critical steps in data analysis.[ 13 , 20 , 43 ] The input of irrelevant features or inadequate targets will lead to no results or to confusing ones, whereas a good selection will increase the chances of a successful experiment with insightful and interpretable results. In the case of TFPV devices, various targets can be defined, such as fabrication parameters, the chemical composition of a specific layer, or, more commonly, optoelectronic data of the final solar cell (efficiency, open circuit voltage (V OC), short‐circuit current (J SC), and fill factor (FF)). In most cases, the target data will be scalars, each associated with a specific sample of a discrete library or a specific area of a graded sample. On the other hand, features can be the results provided by characterization techniques such as Raman, PL, or XRF as well as the fabrication parameters described above. The features are heterogeneous data of one‐ (scalars) or high‐ (vectors, images) dimensionality, which adds complexity to the data treatment related to heterogeneous data fusion. An important remark is that the data selected as features should not also appear among the targets within the same workflow.

Data pre‐processing can also be required for specific data types or measurement techniques. The main objective of data pre‐processing is to avoid the introduction of non‐relevant features that are not directly related to the sample itself, but rather to the equipment (e.g., instabilities, characteristics of certain components, design limitations, artifacts…) or to the measuring environment. This is especially critical when using spectroscopic data, in which noise, artifacts, spikes, or background signals may add non‐relevant information to the spectra. The data arising from each characterization technique have different pre‐processing requirements. For example, in the case of Raman spectroscopy it is commonly necessary to calibrate the spectral range and correct peak positions with a reference sample, remove spikes, and subtract the baseline. Figure 5 shows an illustrative example of spectroscopic data (Raman and PL) before and after pre‐processing (and fusing, which is explained below). Any pre‐processing should preferably be automated to make it as systematic and standardized as possible and to avoid random human mistakes and inconsistencies. Unfortunately, the automation of data pre‐processing or advanced analysis still faces significant problems, which may be solved for particular cases by custom approaches that are barely available to the general audience. An example of a tool for the automation of data pre‐processing for the case of spectroscopic data is the open‐source toolbox "spectrapepper" developed by IREC, which can be used for fast and automated data processing and analysis.[ 44 ]
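As an illustration, the sketch below implements the two pre‐processing operations mentioned above (spike removal and baseline subtraction) with generic numpy/scipy calls. It is a minimal sketch and not the spectrapepper API; the kernel size, spike threshold, polynomial degree, and placeholder spectrum are assumptions that would have to be tuned for each instrument.

```python
import numpy as np
from scipy.signal import medfilt

def remove_spikes(spectrum, kernel=5, threshold=5.0):
    """Replace cosmic-ray spikes by the local median value (assumed criterion)."""
    smoothed = medfilt(spectrum, kernel_size=kernel)
    residual = spectrum - smoothed
    spikes = np.abs(residual) > threshold * residual.std()
    cleaned = spectrum.copy()
    cleaned[spikes] = smoothed[spikes]
    return cleaned

def subtract_baseline(shift, spectrum, degree=3):
    """Fit a low-order polynomial baseline and subtract it (crude estimate)."""
    coeffs = np.polyfit(shift, spectrum, degree)
    return spectrum - np.polyval(coeffs, shift)

shift = np.linspace(100, 500, 1024)   # Raman shift axis in cm-1 (placeholder)
raw = 50 + 0.05 * shift + np.random.default_rng(0).normal(0, 1, 1024)
clean = subtract_baseline(shift, remove_spikes(raw))
```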

Figure 5.

Example of a high dimensional spectrum combining Raman and PL spectra for a single measured point.[ 39 ] A) shows the raw Raman measurement, B) the raw PL measurement, and C) the fused vectors after processing.

The data conditioning process should be completed with a data scaling step, which consists of scaling the data so that they are numerically comparable among them. For instance, if a scalar feature has a maximum value of 5 and is to be fused with a vector that has a maximum value of 5000, normalization may be necessary for the ML algorithm to accurately consider the scalar feature inside the fused data vector. This could be done by normalizing each technique to its global maximum, normalizing to the sum under the curve, normalizing to peak ratios, or using another normalization method adequate to the technique (see details about normalization approaches in Section S3 in Supporting Information). It is worth noticing that data scaling is not a straightforward procedure, and the best option needs to be evaluated case by case to ensure that the original information is not altered and that artifacts do not appear in the process, which could greatly affect the data analysis results.

Once the process of data conditioning (pretreatment, separation, and standardization) is performed, the data need to be fused into single files that contain a vector with all the information and that can be fed into the ML algorithms. In the case of homogeneous data, i.e., when all the data are of the same type, either scalar features or vector features can be joined together in a single vector of higher dimension in a straightforward way. Such a vector then becomes a part of the input file for the ML algorithm, and a specific indicator can be added to each of these vectors to keep the traceability of the data, as will be discussed below. This kind of data fusion has already been successfully employed in several publications, where Raman spectra measured under different excitation conditions or Raman and PL spectra have been fused into a single vector for further ML analysis.[ 14 , 39 ] Figure 5C shows an example of such a high dimensional spectrum (vector) obtained by concatenating Raman and PL spectra (after their conditioning).[ 39 ]
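A minimal sketch of this scaling-and-fusion step is given below, assuming placeholder Raman and PL vectors and an illustrative scalar composition ratio; the vector lengths and the per-maximum scaling are assumptions, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(1)
raman = rng.random(1024)   # conditioned Raman spectrum (placeholder data)
pl = rng.random(512)       # conditioned PL spectrum (placeholder data)
cgi = 0.92                 # scalar feature, e.g., an XRF composition ratio

# Scale each technique independently so their magnitudes are comparable.
raman /= raman.max()
pl /= pl.max()

# Concatenate everything into one high-dimensional feature vector; one such
# vector per pixel cell forms a row of the ML input matrix.
fused = np.concatenate([raman, pl, [cgi]])
print(fused.shape)         # (1537,)
```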

On the other hand, the fusion of diverse data types (scalar, vector, 2D data, etc.) obtained from different characterization techniques presents additional challenges. The primary goal is to retain the relevance of each type of data, regardless of its dimension. This issue is currently open and requires further research to establish a reliable data fusion procedure. For the present methodology, we propose a general data fusion approach that consists of the correct partition of the data over the spatially defined analysis areas of the sample (these areas can be physically or virtually delimited). In the case of solar cells, these analysis areas would correspond to small area pixel solar cells in which IV measurements can be performed to map the optoelectronic performance of the sample. For such a pixel cell, it is important to have reliable data from the spectroscopic techniques that are representative of the whole IV analysis area, which mainly depends on the ratio between the size of the analysis area and the measurement light spot. If the ratio is too big (e.g., a pixel cell size of a few mm2 and a spot size of some microns), a single point spectrum may not be representative of the whole pixel cell, which may result in weak or no correlations between features and targets in the later analysis. In this case, the spot size can be increased using adequate optical components (objectives, lenses, mirrors…) to cover as much of the measuring area as possible. Alternatively, the spot size can be virtually increased by measuring several spectra distributed throughout the area of the pixel cell (mapping) and averaging them, or by continuously moving the measurement spot inside the pixel cell during a single spectrum acquisition (scanning measurement). In this way, the resulting spectrum will contain information related to the whole pixel cell.

Contrary to the previous situation, it is also possible that the measuring area of some characterization techniques is larger than the pixel cell. In such a case, it is important to reshape the data obtained from the technique to ensure their correctness and relevance for each pixel cell. The easiest approach here is to assign the same measured data to all the pixel cells that lie inside the measuring area; for pixel cells that lie between two (or more) measuring areas, an average value (or any other suitable interpolation) of the measurements covering the pixel cell can be assigned. An example of an adequate acquisition approach and data reshaping situation is illustrated in Figure 6, where data from three characterization techniques (optoelectronic, spectroscopic Raman/PL, and compositional XRF) are to be fused. As explained above, the spectroscopic Raman/PL measurements can be performed by scanning throughout the area of the pixel cell to provide relevant information on the whole pixel cell (represented by blue S‐shaped arrows). However, the spot size of the XRF compositional measurements (large orange circles) is significantly larger than the pixel cell size (yellow squares), about twice the size. In this case, the fusion can be performed simply by assigning the same compositional data to two adjacent pixel cells.
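The sketch below illustrates this reshaping logic under the assumed geometry of the example (one XRF spot covering exactly two pixel cells); the composition values are placeholders.

```python
import numpy as np

# One XRF composition value per measuring spot; each spot is assumed to
# cover two adjacent pixel cells.
xrf_spots = np.array([0.90, 0.94, 0.98])
xrf_per_cell = np.repeat(xrf_spots, 2)   # -> [0.90 0.90 0.94 0.94 0.98 0.98]

# A pixel cell straddling two measuring areas receives their average instead.
boundary_value = xrf_spots[0:2].mean()   # 0.92
```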

Figure 6.

Data conditioning and fusion for the case of techniques with different measuring areas in relation to the pixel cell size. The figure on the left shows a scheme of a sample with physically defined solar cells for which the optoelectronic parameters are measured, a possible mapping line of the spectroscopic techniques, and an XRF spot size. The next two schemes show the direct correlation between the data obtained by the different techniques and the proposed way of data reshaping and traceability.

In this situation, an important point is to preserve the traceability of the data at each step, which will be necessary for further analysis and explanation of the obtained results (see Section 6). This can be done by numbering each pixel cell and associating the specific combined vector, or part of it, with this number. For the latter, when a part of the combined vector needs to be identified (e.g., just the PL spectrum or the XRF data from the whole vector), a specific identifier should also be included. As a result, just a few identifying numbers can provide full information for tracing back the data. Here, we propose the following identifying nomenclature: the first and second numbers represent the X and Y coordinates of the pixel cell, the third number represents a specific technique (e.g., 1 for XRF, 2 for Raman spectrum, 3 for V OC, etc.), and the fourth number indicates whether the data were shared or averaged between different cells or correspond strictly to only one cell. Of course, a more advanced traceability scheme can be included in the data treatment and analysis processes, but it is important to have traceability at every step.
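As a minimal illustration, the hypothetical helper below encodes the four-number scheme proposed above; the technique codes follow the examples given in the text.

```python
# Technique codes as exemplified in the text: 1 = XRF, 2 = Raman, 3 = VOC.
TECHNIQUE = {"XRF": 1, "Raman": 2, "VOC": 3}

def make_id(x, y, technique, shared):
    """Four-number identifier: X cell, Y cell, technique code, shared flag."""
    return (x, y, TECHNIQUE[technique], 1 if shared else 0)

tag = make_id(4, 7, "XRF", shared=True)   # -> (4, 7, 1, 1)
```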

5. AI‐Assisted Data Analysis

Analyzing data from TFPV materials characterization, especially spectroscopic data like Raman and PL spectra, poses significant challenges due to their high complexity. However, this analysis is crucial to extract valuable insights for further technology improvement. For instance, Raman spectra can offer insights into various aspects such as crystalline quality, structure type, presence of defects and secondary phases, or layer thickness variations. All these layer properties are reflected in the spectra by subtle changes in the characteristics of the peaks: position, full width at half maximum, absolute or relative intensity, symmetry, etc. An example of the different physicochemical properties that can be obtained from Raman spectra can be found in ref. [38], where this kind of analysis is applied to the study of complex quaternary kesterite materials for TF solar cell applications. These properties, while crucial for the comprehensive characterization and understanding of the measured devices, are difficult to calculate reliably in an automated and unsupervised manner due to the inherent complexity of the data. Typically, the expertise of specialized human resources is required to treat/characterize these data accurately and consistently, a process that can be time‐consuming, costly, and inefficient, especially for large data sets. In response to these challenges, the proposed methodology involves the application of ML algorithms capable of handling this type of data with minimal human supervision, while still providing valuable knowledge. This approach leverages dimension reduction algorithms which, rather than isolating specific data features, consider all the available information and simplify it into a more manageable format. This strategy facilitates a more efficient and effective data analysis for TFPV materials.

The next subsections cover different aspects of the AI‐assisted analysis proposed in this methodology, from the algorithms employed to their testing and validation.

5.1. PC‐LDA Cascaded Algorithm

For the AI‐assisted analysis of the data obtained from the characterization techniques recommended in the present methodology, we propose a cascaded application of two ML algorithms, principal component analysis (PCA)[ 45 ] and linear discriminant analysis (LDA).[ 46 ] The combined application of these algorithms (PC‐LDA) allows classifying the samples according to the defined targets in an effective way. Both are dimension reduction algorithms that operate based on similar but ultimately different principles. PCA is an unsupervised method that seeks the elements of the feature data that best explain the variations within the data set. It does so by performing linear combinations that maximize the difference between the data points until a predefined dimensionality is achieved. The final dimensions are called principal components. For instance, if the final dimension is 50D, then there will be fifty principal components at the end. Conversely, LDA is a supervised technique that searches for the information that optimally differentiates predefined groups based on the target variable. In a similar way to PCA, it does this by performing linear combinations that maximize the difference between the defined groups until a predefined dimensionality is reached. The final dimensions are called discriminants. For instance, if the final dimension is 2D, as recommended in the present methodology, then there will be two final discriminants. Essentially, these algorithms condense the dimensionality of the data in a manner that conserves critical information within a reduced and more manageable space. These distinctive attributes of PCA and LDA mean that their combination as a cascaded PC‐LDA, where PCA is performed before LDA, can benefit from two important properties of the data: the difference between data points and the difference between classification groups. Thus, it is a convenient choice for dealing with the complex and high‐dimensional data arising from the characterization of TF devices using spectroscopic and optoelectronic techniques. The ability of these methods to utilize variability across the entire data set as well as within each classification group, and the conservation of significant information in reduced dimensions, facilitates feature extraction and feature selection. This becomes particularly valuable when dealing with data sets constituted of large vectors where the extraction of meaningful information can pose significant challenges, as is the case of spectroscopic data analysis. Consequently, these characteristics grant the PC‐LDA combination an advantage over the separate use of PCA or LDA. To illustrate the advantage of PC‐LDA, Table 1 shows the training, testing, and cross‐validation scores for the same classification problem described in ref. [39] using other common algorithms including random forest (RF), support vector machines (SVM), and quadratic discriminant analysis (QDA), along with LDA, PCA, and PC‐LDA. Additionally, class‐specific scores (see Subsection 5.3 for more details) are shown for the training and testing sets. The scores shown were obtained after 1000 iterations to provide relevant and reliable conclusions.
The comparison of the different algorithms allows concluding that, in general, RF shows good overall scores but considerable overfitting (i.e., the model learns too well either the training set or a biased portion of it, see Section S4 in Supporting Information for more details), with a difference of 0.14 between the training and testing sets. Furthermore, RF is not a dimension reduction algorithm and hence does not enable visual representations in lower dimensions, which limits its further testing and its use for the development of physical models (see Subsection 5.3 and Section 6). For the same data set, SVM shows lower overfitting levels with comparable performance while being a dimension reduction algorithm. However, SVM does not keep the relationship between classes due to its randomization nature. Thus, it is not possible to obtain deeper insights into the correlations, even when using lower‐dimension visualizations. Finally, QDA, despite being similar to PCA and LDA, shows poor testing performance with heavy overfitting. The PC‐LDA model shows the lowest overfitting, as indicated by the small difference between the training and testing scores, together with good overall testing accuracy. However, for class‐specific accuracy, the LDA model outperforms the others in two of the three classes and shows the highest cross‐validation score. Nevertheless, it shows considerable overfitting when comparing the training and testing scores. PC‐LDA proves to be a balanced choice as it offers a blend of good generalization, good overall testing accuracy, a good cross‐validation score, and the lowest overfitting considering both the general and the class‐to‐class scores. With the latter, PC‐LDA appears as a model that performs well across all classes, with good generalization ability and consistency, while maintaining the information from higher dimensions and allowing the extraction of correlation insights from the final discriminants. In other words, the selection of PC‐LDA for the methodology goes beyond the pure training and testing scores and allows working with the output results later on to obtain deeper insights. However, it has to be noted that the use of standalone PCA, LDA, QDA, or any other similar algorithm cannot be totally discarded, as they are all suitable for the methodology presented in this work, and the final selection of the most appropriate algorithm has to be ultimately evaluated for each specific data set.

Table 1.

Performance metrics of the RF, SVM, QDA, PCA, LDA, and PC‐LDA algorithms for the classification problem presented in ref. [39]. The metrics of the best model after 1000 iterations are shown.

Algorithm   General scores                    Train set specific           Test set specific
            Train   Testing   Cross‐val.      Class 1  Class 2  Class 3    Class 1  Class 2  Class 3
RF          0.98    0.84      0.72            0.96     1.00     0.98       0.79     0.88     0.78
SVM         0.93    0.84      0.74            0.91     0.93     0.96       0.84     0.83     0.86
QDA         1.00    0.35      0.30            1.00     1.00     1.00       0.38     0.18     0.74
PCA         0.83    0.81      0.71            0.76     0.85     0.83       0.79     0.85     0.76
LDA         0.92    0.79      0.77            0.89     0.91     0.95       0.78     0.79     0.81
PC‐LDA      0.84    0.82      0.76            0.82     0.82     0.90       0.82     0.82     0.83

As a result of the application of this cascaded algorithm, the initial dimensionality (>1000), or vector length, of the input feature data goes through two dimension reduction processes: a first one by PCA and a second one by LDA. The input data will be reduced to only two discriminants (D1 and D2), or final dimensions, after the PC‐LDA transformation, producing a number of clusters that depends on the number of classification groups defined initially from the target variables, as illustrated in Figure 7. In principle, these classification groups should be clearly defined to produce the best classification results possible. This may be straightforward in the case of discrete target variables (e.g., processing parameters), in which each classification group can be directly associated with a discrete value. On the contrary, the correct definition of classification groups becomes especially critical in the case of continuous target variables like optoelectronic parameters. In the latter case, discrete classification groups can be defined using ranges. For instance, in ref. [39] two types of classification groups have been defined: i) the temperature of the post‐deposition treatment (discrete target variable); ii) ranges of the V OC of the final solar cells (continuous target variable) defined as: group 1—V OC > 705 mV, group 2—705 mV > V OC ≥ 690 mV, and group 3—V OC < 690 mV. Additionally, to achieve a good separation between the classification groups, an optimum interface between the PCA and LDA algorithms has to be implemented. This interface is defined by the dimension reduction provided by the PCA step. We propose optimizing this parameter through several iterations, varying the dimension reduction of the PCA algorithm and evaluating the train and test scores for each iteration, as shown in Figure S3A (Supporting Information) for the data set used in ref. [39]. In the scenario of a data set obtained from samples with a continuous target variable, both the discriminant plot and the model performance will manifest similar behaviors. This is due to the gradual transition between classification groups, resulting in a blurred boundary where groups converge. Minor variations in input features may not precisely mirror the target variable, introducing variability. Consequently, clusters will overlap slightly, impacting performance.

Figure 7.

PC‐LDA application scheme example. Left—The results obtained from two spectroscopic techniques (V1 and V2) are fused with a scalar S to create a single vector that becomes the N‐dimensional feature input. Center—The dimensionality of the vector is reduced by PCA, and the resulting vector is used as input for LDA, which further reduces the dimensions into the final discriminants. Right—Discriminant graph for the case of reduction to two dimensions.
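The sketch below shows how such a cascaded PC‐LDA model and the optimization of its PCA/LDA interface could be set up with scikit-learn; the data, group labels, and the list of PCA dimensions swept are placeholders, not the actual data of ref. [39].

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((480, 1537))   # fused feature vectors (placeholder data)
y = rng.integers(0, 3, 480)   # three classification groups (placeholder)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Optimize the PCA/LDA interface by sweeping the PCA output dimension and
# comparing train and test accuracies for each setting.
for n in (10, 25, 50, 100):
    pc_lda = make_pipeline(PCA(n_components=n),
                           LinearDiscriminantAnalysis(n_components=2))
    pc_lda.fit(X_tr, y_tr)
    print(n, pc_lda.score(X_tr, y_tr), pc_lda.score(X_te, y_te))
```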

5.2. Data Set Split

In order to employ AI algorithms for data analysis, these need to be trained, tested, and validated. For this, the data set has to be split into a training and a test subset, a fundamental step in the application of AI for various reasons. Firstly, it allows the evaluation of the performance of the model on unseen data (test data), which is an important measure of its generalization capability, a critical factor in ML. By setting aside a proper portion of the available data as a testing set, we can assess how well the model can predict new outcomes based on its learning from the training data. Secondly, the testing set can be utilized to fine‐tune model parameters and to prevent overfitting. Monitoring the performance of the algorithm on the testing data helps to detect this problem and adjust the parameters of the model (i.e., the trained ML algorithm) to improve its performance. Therefore, the train/test split is a critical step in developing robust and reliable ML models. For more information about this process, please refer to Section S5 in the Supporting Information.

5.3. Test and Validation

The exact way to test and validate a model and, furthermore, the criteria for defining the success or failure of a model are determined by the end user.[ 47 ] In this regard, comprehensive testing methods, criteria, and needs can be found in the literature.[ 48 , 49 , 50 , 51 , 52 ] In this methodology, we propose to evaluate how well a classification algorithm performs by comparing predicted and real classifications. In other words, by evaluating how well the algorithm can predict the classification group through the comparison of the prediction and the real target, which gives a number between 0 (no correct predictions) and 1 (all predictions are correct). This is useful when the prediction of an exact value is not necessary, for example in applications where the classification involves few groups or is of a binary nature. Additionally, checking the robustness of the training data subset by a cross‐validation approach is also critical to ensure that the algorithm performs well. More information about cross‐validation and other approaches for testing and validating ML results can be found in Section S6 in the Supporting Information.
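A minimal cross-validation sketch, again with placeholder data and an assumed PCA dimension, could look as follows; the mean and spread of the fold accuracies indicate how robust the model is to the choice of training subset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X, y = rng.random((480, 1537)), rng.integers(0, 3, 480)   # placeholder data

pc_lda = make_pipeline(PCA(n_components=50),
                       LinearDiscriminantAnalysis(n_components=2))
scores = cross_val_score(pc_lda, X, y, cv=5)   # accuracy of each of 5 folds
print(scores.mean(), scores.std())
```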

6. System Modeling

Despite the high performance scores that a model might yield, it is important to acknowledge that the direct results derived from ML algorithms tend to be inherently challenging for humans to interpret. This is largely due to the complexity of these models, which often entail thousands of operations to make their predictions, making the tracking and understanding of the entire sequence of operations practically impossible. Here, we propose three strategies that can provide deep insights into the interpretation of the results obtained from data analysis using AI: sensitivity analysis, discriminant correlation, and the identification of representative samples.

To gain insights into the workings and decision‐making of the developed algorithm, a sensitivity analysis can be performed. Essentially, this means perturbing the features and observing how these changes influence the decisions of the model. The results can pinpoint which features or data points in the spectra the algorithm finds most critical or influential. The general concept behind a sensitivity analysis is to evaluate the probability of a sample being classified in a classification group before and after the perturbation of the feature data. The automation of sensitivity analysis can be performed with the open‐source toolbox "pudu" developed by IREC, with a special focus on spectroscopic data problems.[ 53 ] A practical example is illustrated in Figure 8, referencing the data from ref. [14]. Here, the algorithm learned from a combined data set of Raman measurements performed under 442, 532, and 785 nm excitation wavelengths, focusing on classifying the chemical composition of the absorber layer, which was the target data. For the sensitivity analysis, first the classification probabilities of a sample were calculated as delivered by the probability function of the PC‐LDA algorithm and stored. Then, the value of each feature pixel (feature vector element) was individually increased by 10% and the probability of being classified as the next best‐performing group was reevaluated by the model and compared to the original stored probability. This difference value was then computed for every pixel and defined as the "importance." Finally, the results were separated by wavelength, averaged per classification group, and normalized to the highest positive value. With the above, a positive importance means that a positive change of 10% increases the probability of the sample being classified in the superior group, and a negative importance means that the probability decreases. The outcomes for the 532 nm excitation are presented alongside the average spectrum of the data set (Figure 8). In consequence, the importance of each pixel that defines the classification results or, strictly speaking, the importance of a pixel in shifting the classification of a specific spectrum from one group to another can be defined. In the presented example, the main changes that influence the classification by the composition of the absorber are focused on the spectral range corresponding to the most intense and second most intense Raman peaks in the spectra (ranges highlighted in light yellow in Figure 8).

Figure 8.

Example of a sensitivity analysis made for Raman data measured under 532 nm excitation. The top graph represents the average Raman spectrum of all cells of the combinatorial sample in ref. [14].
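A simplified sketch of this perturbation loop is given below using generic numpy/scikit-learn calls rather than the actual pudu API; for brevity it tracks the probability of the currently predicted group instead of the next best-performing one, and the +10% perturbation mirrors the procedure described above. The model is assumed to be any fitted classifier exposing predict_proba, such as the PC‐LDA pipeline sketched in Subsection 5.1.

```python
import numpy as np

def sensitivity(model, x, delta=0.10):
    """Change in the predicted top-class probability when each feature
    pixel of sample x is individually increased by `delta` (here 10%)."""
    base = model.predict_proba([x])[0]
    top = int(base.argmax())
    importance = np.zeros(len(x))
    for i in range(len(x)):
        perturbed = x.copy()
        perturbed[i] *= 1.0 + delta
        importance[i] = model.predict_proba([perturbed])[0][top] - base[top]
    return importance   # then average per group and normalize, as in the text
```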

The performed sensitivity analysis can be employed to identify not only the critical ranges in the spectra, but also which of the used techniques are most relevant for characterizing the samples, as it will show which parts of the input feature data (which contain fused data from all the techniques) have a higher importance. In a more complex case, when a higher variety of techniques is used, this type of sensitivity analysis will be highly important as a first screening, allowing the subsequent research to focus only on the most critical techniques that can provide valuable information about the analyzed set of samples. This is critical for accelerated research experiments.

Another complementary approach to interpreting the results obtained from ML data analysis is the active search for correlations between the discriminants provided by the algorithm (e.g., D1 and D2 in Figure S5 in the Supporting Information) and the physical parameters of the samples. Within this approach, it is advisable that, once the critical ranges of the spectra are identified through a sensitivity analysis as shown above, correlations of the peaks in the defined ranges with the discriminants are actively searched for, which might also provide a physical meaning to these derived values of the ML algorithm. For instance, in the example showcased above, a simple ratio of the integrated intensities of the peaks at 176 and 205 cm−1 inside the importance ranges (yellow bands in Figure 8) was found to have a clear correlation with D1 (see Figure 9A). The specific physical meaning of this parameter (ratio of integrated intensities) has already been described in ref. [14] and is directly related to the presence of point and/or cluster structural defects in the absorber layer of the solar cells, which directly depends on the absorber composition. This gives a clear insight that changes in the absorber composition have a direct influence, through variations in the structural defects, on the Raman spectra that were used as features for the classification by the PC‐LDA algorithm. Moreover, a further analysis shows that there is a correlation between D2 and device efficiency in the range of the best performing solar cells (>2%), as shown in Figure 9B. It is important to note that these discriminant correlations do not necessarily mean that the discriminants directly represent the specific physical properties, but rather suggest that these are strongly related to the target variable. In the example, it is clear that the [Zn]/[Ge] ratio, the original PC‐LDA model target, heavily impacts the presence of structural defects in the absorber layer and is of utmost importance for high efficiency devices. This connection is significant even though none of the analytical parameters (structural defect concentration or device efficiency) were directly used as input or features in the model, yet they show a direct correlation with the results of the dimension reduction process.

Figure 9.

A) Relative integrated intensity of the Raman peak at 176 cm−1 measured with 532 nm excitation plotted against discriminant D1. B) Efficiency of the solar cells plotted against discriminant D2.
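The sketch below illustrates this correlation check with placeholder arrays; the integration windows around the 176 and 205 cm−1 peaks are arbitrary assumptions, and in practice the spectra, discriminant values, and windows come from the actual analysis.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
shift = np.linspace(100, 300, 512)   # Raman shift axis (placeholder)
spectra = rng.random((200, 512))     # one conditioned spectrum per pixel cell
d1 = rng.random(200)                 # first discriminant value per cell

def band_area(lo, hi):
    """Integrated intensity of every spectrum inside the [lo, hi] window."""
    mask = (shift >= lo) & (shift <= hi)
    return trapezoid(spectra[:, mask], shift[mask], axis=1)

# Ratio of the integrated intensities around the two flagged peaks,
# correlated with D1 via the Pearson coefficient.
ratio = band_area(170, 182) / band_area(199, 211)
r, p = pearsonr(ratio, d1)
```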

Finally, another important output that can be obtained from the results of the ML analysis is the identification of relevant or representative samples for each classification group, on which the research needs to focus to progress rapidly. This can be done by analyzing the discriminant graph and selecting the samples or cells closest to the center of mass of each classification group (Figure S5, Supporting Information). These representative cells or samples can afterwards be subjected to a more detailed analysis by techniques that may not be suitable for automated data acquisition and high statistical analysis, but that can provide more information about each classification group and the main differences between them. A similar approach of selecting the representative cells of different classification groups for a deeper characterization was performed in ref. [39] and helped support the model proposed to explain the observed changes in the different sets of samples. In this way, the classification results can be supported by deep knowledge about the physicochemical properties that limit the performance of the devices, providing new insights for further technology improvement.
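A minimal sketch of this selection, assuming placeholder discriminant coordinates and group labels, is given below.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((200, 2))      # (D1, D2) coordinates of each pixel cell
y = rng.integers(0, 3, 200)   # classification group of each cell

for group in np.unique(y):
    members = np.flatnonzero(y == group)
    centroid = D[members].mean(axis=0)           # center of mass of the group
    distances = np.linalg.norm(D[members] - centroid, axis=1)
    print(group, members[distances.argmin()])    # representative cell index
```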

It is worth specifying that the proposed analysis of the results of this dimension reduction algorithm provides information about the representative set of samples, the optimal techniques, and the specific ranges or parameters on which the investigation needs to focus when performing the detailed analysis of a combinatorial sample. In this way, the proposed workflow closes the cycle by generating new knowledge and providing direct feedback for technology improvement. This can be considered a significant step towards accelerated research strategies, in which a strong reduction of the efforts of highly experienced personnel is achieved by focusing only on the most critical samples, techniques, and data.

7. Example Application Case: Optimization of the Production of CuInSe2‐Based PV Devices

In order to demonstrate the potential of the methodology described above, this section shows its application to a particular case: the industrial development of a CuInSe2 (CISe) TFPV technology by the company SUNPLUGGED GmbH for fabricating flexible and customized solar foil. This is a particularly interesting application case as the samples were fabricated during the optimization of the manufacturing process, resulting in a set of samples with significant inhomogeneities fabricated under different process parameters. As such, the heterogeneous and complex nature of the samples results in a large number (2135) of heterogeneous and complex data points that are extremely difficult to analyze using standard scientific methods, as this would require not only strong fundamental knowledge of the materials involved in the PV technology, but also a considerable amount of time from highly qualified personnel.

In this context, this section shows how the application of the methodology allows obtaining high‐value, reliable conclusions in a fast way for the optimization of the production technology. First, the samples, characterization techniques, and data conditioning approaches employed are described. Then, the results obtained from the application of ML algorithms for classifying the data in terms of device performance (VOC) are presented. Finally, two different approaches are employed to gain insights into the limiting factors of the technology from the ML results (generation of a physical model).

7.1. Sample Description

The samples were obtained from the flexible solar foil production of SUNPLUGGED GmbH in their industrial roll‐to‐roll (RtR) pilot line with the following layer architecture and deposition/synthesis processes from bottom (substrate) to top: i) flexible stainless steel foil substrate, ii) SiOx‐based insulation and barrier layer with a proprietary formulation deposited by slot‐die coating, iii) Mo back contact deposited by sputtering, iv) CISe absorber synthesized by hybrid sputtering and evaporation of Cu and In under an evaporated Se atmosphere, v) CdS buffer layer deposited by chemical bath deposition, and vi) i‐ZnO/AZO transparent front contact deposited by sputtering. Seventeen strips with a size of 10×30 cm2 were cut from different rolls produced under different fabrication conditions.

As explained in the methodology, in the case of the development and optimization of PV devices, the discretization of the samples into pixel solar cells allows obtaining spatially resolved information about their IV characteristics (which constitute the main optimization target) that can be correlated with the rest of the data. Accordingly, the samples were discretized into 3×3 mm2 individual solar cells using a mechanical scriber, resulting in a total of 2135 pixel cells. The CISe absorber layers presented inhomogeneities as a result of the non‐optimized synthesis processes, leading to gradients of the optoelectronic parameters along the directions parallel and perpendicular to the movement of the web in the roll‐to‐roll process. However, it should be noted that the small size (≈9 mm2) of the discretized areas resulted in negligible variations within each cell.

7.2. Data Acquisition and Data Conditioning

As the main objective of this application case is the optimization of a PV technology, two different types of data were acquired: spectroscopic (sensitive to fundamental material properties) and optoelectronic (which determine the PV performance of the material).

Raman and PL spectroscopies were employed to acquire data from the CISe absorber material in the 2135 discretized pixel solar cells. A special optical probe developed at IREC and mounted on a motorized XY gantry allowed measuring the Raman and PL spectra at the same point quasi‐simultaneously in an automated way. A large area spot (≈70 µm2) was used in order to acquire spectra representative of each discretized area. The probe was coupled to a high resolution Horiba FHR640 monochromator with a CCD detector for Raman spectra acquisition in backscattering configuration, and to a B&WTek Sol 1.7 spectrometer with an InGaAs detector for acquiring the PL data in near‐backscattering configuration. The simultaneous Raman and PL characterization was carried out under 532 nm laser excitation. A laser power density lower than 150 W cm−2 was selected to prevent any thermal effects on the samples.

Different data conditioning strategies were applied to the large spectroscopic data set (2135 spectra per technique) in an automated way using the spectrapepper Python package.[44] In the case of the Raman spectra, the Raman shift was calibrated by fixing the main peak of monocrystalline Si at 520 cm−1, and the baseline contribution was removed. The PL spectra were corrected using the spectrum of a calibrated lamp source. Then, the data were cleaned to ensure their accuracy, consistency, and lack of errors, as incorrect or inconsistent data can negatively impact the performance of the ML model. In this way, samples with missing, corrupted, or negative IV properties, along with samples with missing, corrupted, or negative total area of their Raman or PL spectra, were removed. This resulted in a final count of 2044 data points (≈96% of the original data). After cleaning, the data were normalized to a similar scale: the Raman spectra were normalized to the relative intensity of the peak at 152 cm−1, and the PL spectra were normalized to the global maximum (the maximum value of all the PL spectra). For each data point, the conditioned spectra were then fused into a single vector of 1536 dimensions (1024 from Raman and 512 from PL). An example of the resulting vector can be seen in Figure 10.
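For illustration, a simplified version of this conditioning chain is sketched below in plain NumPy. It mimics the steps described above (shift calibration, baseline removal, normalization, fusion) but is not the spectrapepper implementation; all names are hypothetical and the linear baseline is a deliberate simplification.

```python
import numpy as np

def calibrate_shift(shift, measured_si_peak, reference=520.0):
    """Offset the Raman shift axis so the Si reference peak sits at 520 cm-1."""
    return shift + (reference - measured_si_peak)

def subtract_linear_baseline(y):
    """Remove a straight-line baseline drawn between the spectrum endpoints."""
    return y - np.linspace(y[0], y[-1], y.size)

def condition_and_fuse(raman, pl, shift, pl_global_max, norm_peak=152.0):
    """Condition one Raman/PL pair and fuse them into a single feature vector."""
    raman = subtract_linear_baseline(raman)
    i_norm = np.argmin(np.abs(shift - norm_peak))
    raman = raman / raman[i_norm]        # normalize to the 152 cm-1 peak
    pl = pl / pl_global_max              # normalize to the global PL maximum
    return np.concatenate([raman, pl])   # 1024 + 512 = 1536-dimensional vector
```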

Figure 10. Example of the fused Raman and PL spectra single vector used as input feature in the ML algorithm.

On the other hand, current–voltage (IV) curves of each discretized pixel cell were measured under AM1.5 illumination using a Sun 3000 class AAA solar simulator (Abet Technologies) calibrated with a Si reference cell (Newport). From each IV curve, the main optoelectronic parameters were extracted as scalar data.
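A minimal sketch of such a scalar extraction is shown below, assuming hypothetical arrays v (voltage, V) and j (current density, mA cm−2, photogenerated current positive) sampled along the power quadrant of the IV curve; the exact extraction routine used in this work is not specified.

```python
import numpy as np

def iv_parameters(v, j, p_in=100.0):
    """Extract V_OC, J_SC, FF, and efficiency from one IV curve.
    p_in: AM1.5 irradiance in mW cm-2; j is assumed decreasing with v."""
    jsc = np.interp(0.0, v, j)       # current density at V = 0
    voc = np.interp(0.0, -j, v)      # voltage at J = 0 (-j is increasing)
    p_max = (v * j).max()            # maximum power point, mW cm-2
    ff = p_max / (voc * jsc)         # fill factor
    eff = 100.0 * p_max / p_in       # power conversion efficiency in %
    return {"Voc": voc, "Jsc": jsc, "FF": ff, "Eff": eff}
```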

Since the Raman, PL, and optoelectronic values were all measured on and assigned to each individual pixel cell, no data partitioning or averaging was needed to spatially correlate the data in this application case. This also facilitated data traceability, which was ensured by simply assigning a number to each analyzed pixel cell.

7.3. AI‐Assisted Data Analysis and Result Validation

The data set was divided into features and targets. Following the methodology, the spectroscopic data (i.e., the fused vectors containing the Raman and PL spectra, see Figure 10) were used as features, and the scalar VOC values of the solar cells were used as target. The VOC was selected as target because i) the optimization of this parameter is fundamental for the overall optimization of the PV technology under study, and ii) it is strongly controlled by the properties of the CISe absorber layer (the main layer probed with the inspection techniques and acquisition parameters employed). The target data were then divided into four classification groups, keeping a relatively similar amount of data in each of them: 1) "high", VOC ≥ 429 mV; 2) "medium‐high", 402 ≤ VOC < 429 mV; 3) "medium‐low", 320 ≤ VOC < 402 mV; and 4) "low", VOC < 320 mV. Finally, all the data were split into training (≈70%) and test (≈30%) subsets, allowing a robust training and a reliable validation of the results.
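A possible implementation of this binning and split with scikit‐learn is sketched below. The variable names X (fused feature vectors) and voc (VOC in mV) are hypothetical, and the stratified split is an assumption, as the exact splitting procedure is not specified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

bins = [320.0, 402.0, 429.0]             # group limits in mV, as reported above
labels = np.digitize(voc, bins)          # 0 = low, 1 = medium-low, 2 = medium-high, 3 = high
names = np.array(["low", "medium-low", "medium-high", "high"])[labels]

# ~70% / ~30% split; stratification keeps the group proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0)
```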

A PC‐LDA cascade ML algorithm was trained to classify the solar cells into the different VOC classification groups using the spectroscopic Raman and PL feature data. The cascaded algorithm was employed as follows: i) first, the PCA part of the algorithm was used to reduce the dimensions of the feature data from 1536 down to 250, which was found to be the optimal value for this particular case through an iterative process; ii) then, the LDA part of the algorithm was used to further reduce the dimensionality of the data down to 2D. The 2D discriminant maps obtained for the training and test of the ML algorithm are shown in Figure 11. From the discriminant graphs, it can be observed that, even though the classification groups overlap considerably, there is a clear separation between their centers of mass. This is reflected in the overall training and test scores of the algorithm: 0.80 and 0.76, respectively. The classification efficacy of the algorithm, both for training and test, can be studied more clearly by analyzing the confusion matrices in Figure 12. It can be observed that, for both the training and test sets, the main misclassification occurs between the low and medium‐low groups as well as between the high and medium‐high groups, whereas the misclassification between the medium‐high and medium‐low groups is lower. This indicates that reshaping the target classification groups (selecting different limits) using a different criterion may significantly improve the overall classification score.
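A cascade of this kind can be assembled, for example, with scikit‐learn as sketched below (continuing from the split above); the exact implementation used in this work may differ.

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

# PC-LDA cascade: PCA reduces 1536 -> 250 dimensions, LDA then reduces to 2
# discriminants (at most n_classes - 1 = 3) and acts as the classifier.
model = make_pipeline(
    PCA(n_components=250),
    LinearDiscriminantAnalysis(n_components=2))

model.fit(X_train, y_train)
print("train score:", model.score(X_train, y_train))   # 0.80 reported in the text
print("test score:",  model.score(X_test, y_test))     # 0.76 reported in the text

cm = confusion_matrix(y_test, model.predict(X_test))   # cf. Figure 12
D_test = model.transform(X_test)                       # 2D discriminant map, cf. Figure 11
```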

Figure 11. Discriminant graph for the classification of the CISe solar cells in terms of VOC from spectroscopic data. Full and clear circles indicate the training and test data points, respectively, and stars indicate the center of mass of each classification group. Scores for the training and test sets are included at the top.

Figure 12. Training (left) and test (right) confusion matrices for the classification of the CISe solar cells in terms of VOC from spectroscopic data.

7.4. System Modeling

In order to gain deeper insights into the ML results presented above, and following the methodology, two complementary strategies were applied to generate a physically meaningful model of the system that can help in the optimization of the PV technology: i) a sensitivity analysis to discern the main regions of the data that influence device performance, and ii) the selection of representative data points from each classification group for further physical analysis. The results of both approaches are presented in Figure 13.

Figure 13. Raman (left) and PL (right) spectra of representative cells for each classification group. The color scale indicates the relative importance of the different spectral ranges as a result of the sensitivity analysis.

Regarding the sensitivity analysis, performed with the pudu library,[53] the data were perturbed by increasing the value of each element of the input feature vectors by 10%, and the probability of this perturbation causing the data point to be classified into the next higher VOC target classification group, known as the importance value, was calculated. The importance can be positive or negative (red and blue lines, respectively, in Figure 13) depending on whether the 10% change has a positive or a negative impact on the classification into the next higher class. In this way, the calculated importance values allow detecting the spectral ranges that are correlated with a higher VOC of the solar cells (a plain‐Python sketch of this perturbation procedure is given after the list below). It should be noted that, with this approach, no sensitivity analysis can be done for the high VOC group, as no higher class exists. However, as can be seen in Figure 13, for the other groups the sensitivity analysis clearly reveals the spectral ranges that correlate with the VOC and that can be employed to predict this optoelectronic parameter. From the color highlights in Figure 13, it can be concluded that:

  1. For Raman spectroscopy, two main bands around 175 and 235 cm−1 have a strong influence on the VOC value. However, it should be borne in mind that the band at 152 cm−1 was used to normalize the data and, thus, the intensity of the other bands should be considered only relative to this one.

  2. For PL spectroscopy, the overall increase of the PL band intensity seems to be key for improving the VOC of the PV devices.
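The following is a minimal plain‐Python version of the importance calculation announced above (the actual analysis used the pudu library[53]); it assumes a classifier exposing predict_proba, such as the PC‐LDA pipeline sketched earlier, and integer class labels ordered from low to high VOC.

```python
import numpy as np

def importance_for_next_class(model, x, delta=0.10):
    """Per-feature importance: change in the predicted probability of the next
    higher class when each feature of x is increased by +10%."""
    x = np.asarray(x, dtype=float)
    base = model.predict_proba([x])[0]
    current = int(np.argmax(base))
    if current == base.size - 1:
        raise ValueError("no higher class exists for the top V_OC group")
    target = current + 1                     # next higher V_OC class
    imp = np.zeros(x.size)
    for i in range(x.size):                  # one model call per feature
        xp = x.copy()
        xp[i] *= 1.0 + delta
        imp[i] = model.predict_proba([xp])[0][target] - base[target]
    return imp                               # >0 helps, <0 hurts the classification
```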

To complement the results of the sensitivity analysis and to gain further insight into the physical meaning of the important changes in the spectral data, representative cells from each classification group were selected for further analysis, taken as the seven data points closest to the center of mass of each group (see the selected spectra in Figure 13). In the case of the Raman spectra, all the classification groups present a clear band at 152 cm−1, while the main changes that differentiate the groups lie in the bands at 175 and 235 cm−1, which present different relative intensities for each group. In the case of the PL spectra, the main difference between the groups is clearly the intensity of the PL band (higher for the higher VOC groups and lower for the lower VOC groups). These results are fully aligned with those obtained from the sensitivity analysis.

From these results, it is possible to model the system based on previous knowledge from the literature. In the case of PL, the intensity of the PL band of an absorber layer is a well‐known indicator of the VOC value in particular, and of the overall performance of a TF solar cell in general.[54, 55] In the case of Raman, the presence of a band at 152 cm−1 is related to the ordered vacancy compound (OVC) phase.[56] On the other hand, the bands at 175 and 235 cm−1 are assigned to the main Raman peak of the chalcopyrite phase of the CuInSe2 compound and to a mixture of the E and B symmetry modes of the same phase, respectively.[57] Moreover, recent investigations revealed that the Raman band at 235 cm−1 includes a peak that can be assigned to a defective chalcopyrite phase.[4] The ratio between these three phases (OVC, chalcopyrite, and defective chalcopyrite) was found to have a direct influence on the VOC of high efficiency chalcopyrite‐based PV devices.[4] Thus, the differences in this ratio between the classification groups indicate that it is a strong parameter influencing the VOC of the pixel PV devices. Interestingly, some of the data points from the medium‐high group present a very low PL intensity. The analysis of the corresponding Raman spectra indicates that these pixel cells do not show the Raman peak of the CISe phase at 175 cm−1, which allows directly assigning the observed PL band to the presence of the CISe phase, at least for the groups with high and medium‐high VOC.

This demonstrates that the application of the methodology to a heterogeneous and complex set of samples and a large data set can provide relevant insights. In the case of the optimization of SUNPLUGGED's PV technology toward higher efficiencies, it can be concluded that the VOC of the devices is controlled by the ratio between three different phases (OVC, CISe, and defective CISe) corresponding to the three Raman bands revealed as relevant by the sensitivity analysis and by the representative spectra of each classification group. This conclusion is reinforced by the correlations found between these phases and the intensity of the PL signal, which was also revealed as the main parameter controlling the VOC of the devices by the sensitivity and representative spectra analyses. These conclusions could be extracted from the data in a fast way without the need for highly qualified experts and were then validated against scientific results from the literature. Without the proposed methodology, the analysis of these data would have been extremely complex and time‐consuming.

8. Conclusions

This work has presented in detail a robust and clear methodology for accelerating and facilitating research on complex thin film photovoltaic materials and devices through the use of combinatorial analysis and artificial intelligence. The methodology covers all the necessary steps, from the synthesis and selection of relevant sets of samples for combinatorial analysis to data acquisition with spectroscopic and optoelectronic inspection techniques, AI‐assisted data analysis using PC‐LDA machine learning algorithms, and the extraction of relevant information for research acceleration. Each step details the necessary concepts, requirements, and procedures and is illustrated with examples from the literature. Moreover, a complex and heterogeneous set of industrial thin film photovoltaic samples from a fabrication process under optimization has been used as an example case for the application of the methodology, demonstrating that relevant insights can be obtained in a fast way and without the need for highly qualified personnel. The proposed methodology can be applied to other types of materials and devices beyond PV and using different characterization techniques, proving especially useful in cases where the high complexity makes conventional analysis impractical and/or excessively complicated and time consuming. Ultimately, the methodology delivers insightful information about the studied technology in an accelerated fashion, offering an explainable approach to the decisions of the machine learning model.

In this way, the main aim of the present work is to provide the scientific community with a guided methodology that facilitates the adoption of AI and combinatorial analysis for material and device research and development, with a focus on TFPV technologies, which represent a step forward for the massive adoption of PV, key for the energy transition toward renewables. We strongly believe that sharing this type of knowledge and knowhow is fundamental to push materials research toward new limits, providing new ways for high quality and highly automated research that can accelerate the development and lab‐to‐market transition of new key technologies.

Conflict of Interest

The authors declare no conflict of interest.

Supporting information

Supporting Information

SMTD-8-2301573-s001.pdf (504.6KB, pdf)

Acknowledgements

Co‐funded by the European Union (Grant Agreement No. 101058459 Platform‐ZERO). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Health and Digital Executive Agency (HADEA). Neither the European Union nor the granting authority can be held responsible for them. This project has received funding from the European Union's Horizon 2020 research and innovation programme under Marie Skłodowska‐Curie grant agreement No. 801342 (Tecniospring INDUSTRY) and the Government of Catalonia's Agency for Business Competitiveness (ACCIÓ). Authors from IREC belong to the MNT‐Solar Consolidated Research Group of the “Generalitat de Catalunya” (ref. 2021 SGR 01286) and are grateful to European Regional Development Funds (ERDF, FEDER Programa Competitivitat de Catalunya 2007‐2013). M.G. acknowledges the financial support from MCIN/AEI/10.13039/501100011033 and from FSE+ within the Ramón y Cajal (RYC2022‐035588‐I) program.

Grau‐Luque E., Becerril‐Romero I., Atlan F., Huber D., Harnisch M., Zimmermann A., Pérez‐Rodríguez A., Guc M., Izquierdo‐Roca V., Accelerating the Development of Thin Film Photovoltaic Technologies: An Artificial Intelligence Assisted Methodology Using Spectroscopic and Optoelectronic Techniques. Small Methods 2024, 8, 2301573. 10.1002/smtd.202301573

Contributor Information

Maxim Guc, Email: mguc@irec.cat.

Victor Izquierdo‐Roca, Email: vizquierdo@irec.cat.

Data Availability Statement

The data that support the findings of this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.10436746, reference number 10436746.

References

1. Fonoll‐Rubio R., Becerril‐Romero I., Vidal‐Fuentes P., Grau‐Luque E., Atlan F., Perez‐Rodriguez A., Izquierdo‐Roca V., Guc M., Sol. RRL 2022, 6, 2200235.
2. Pistor P., Meyns M., Guc M., Wang H.‐C., Marques M. A. L., Alcobé X., Cabot A., Izquierdo‐Roca V., Scr. Mater. 2020, 184, 24.
3. Fonoll‐Rubio R., Andrade‐Arvizu J., Blanco‐Portals J., Becerril‐Romero I., Guc M., Saucedo E., Peiró F., Calvo‐Barrio L., Ritzer M., Schnohr C. S., Placidi M., Estradé S., Izquierdo‐Roca V., Pérez‐Rodríguez A., Energy Environ. Sci. 2021, 14, 507.
4. Guc M., Bailo E., Fonoll‐Rubio R., Atlan F., Placidi M., Jackson P., Hariskos D., Alcobe X., Pistor P., Becerril‐Romero I., Perez‐Rodriguez A., Ramos F., Izquierdo‐Roca V., Acta Mater. 2022, 223, 117507.
5. Das R. S., Agrawal Y. K., Vib. Spectrosc. 2011, 57, 163.
6. Liang F., Xu H., Wu X., Wang C., Luo C., Zhang J., Chin. Phys. B 2018, 27, 037802.
7. Rostron P., Gaber S., Gaber D., Int. J. Eng. Tech. Res. 2016, 6, 2454.
8. Neuville D. R., de Ligny D., Henderson G. S., Rev. Mineral. Geochem. 2014, 78, 509.
9. Grossberg M., Krustok J., Raudoja J., Timmo K., Altosaar M., Raadik T., Thin Solid Films 2011, 519, 7403.
10. Atlan F., Becerril‐Romero I., Giraldo S., Rotaru V., Sánchez Y., Gurieva G., Schorr S., Arushanov E., Pérez‐Rodríguez A., Izquierdo‐Roca V., Guc M., Sol. Energy Mater. Sol. Cells 2023, 249, 112046.
11. Kirchartz T., Márquez J. A., Stolterfoht M., Unold T., Adv. Energy Mater. 2020, 10, 1904134.
12. Binetti S., Le Donne A., Sassella A., Sol. Energy Mater. Sol. Cells 2014, 130, 696.
13. Chen C., Zuo Y., Ye W., Li X., Deng Z., Ong S. P., Adv. Energy Mater. 2020, 10, 1903242.
14. Grau‐Luque E., Anefnaf I., Benhaddou N., Fonoll‐Rubio R., Becerril‐Romero I., Aazou S., Saucedo E., Sekkat Z., Perez‐Rodriguez A., Izquierdo‐Roca V., Guc M., J. Mater. Chem. A 2021, 9, 10466.
15. Liu Y., Yan W., Han S., Zhu H., Tu Y., Guan L., Tan X., Sol. RRL 2022, 6, 2101100.
16. Hu Y., Hu X., Zhang L., Zheng T., You J., Jia B., Ma Y., Du X., Zhang L., Wang J., Che B., Chen T., Liu S. (Frank), Adv. Energy Mater. 2022, 12, 2201463.
17. Bandaru N., Enduri M. K., Reddy Ch. V., Reddy Kakarla R., Sol. Energy 2023, 263, 111941.
18. Karade V. C., Sutar S. S., Shin S. W., Suryawanshi M. P., Jang J. S., Gour K. S., Kamat R. K., Yun J. H., Dongale T. D., Kim J. H., Adv. Funct. Mater. 2023, 33, 2303459.
19. Gu G. H., Noh J., Kim I., Jung Y., J. Mater. Chem. A 2019, 7, 17096.
20. Mahmood A., Wang J.‐L., Energy Environ. Sci. 2021, 14, 90.
21. Ecker D. J., Crooke S. T., Nat. Biotechnol. 1995, 13, 351.
22. Xiang X.‐D., Sun X., Briceño G., Lou Y., Wang K.‐A., Chang H., Wallace‐Freedman W. G., Chen S.‐W., Schultz P. G., Science 1995, 268, 1738.
23. Danielson E., Golden J. H., McFarland E. W., Reaves C. M., Weinberg W. H., Di Wu X., Nature 1997, 389, 944.
24. Wang J., Yoo Y., Gao C., Takeuchi I., Sun X., Chang H., Xiang X.‐D., Schultz P. G., Science 1998, 279, 1712.
25. Van Dover R. B., Schneemeyer L. F., Fleming R. M., Nature 1998, 392, 162.
26. Takeuchi I., Famodu O. O., Read J. C., Aronova M. A., Chang K.‐S., Craciunescu C., Lofland S. E., Wuttig M., Wellstood F. C., Knauss L., Orozco A., Nat. Mater. 2003, 2, 180.
27. Takeuchi I., Lauterbach J., Fasolka M. J., Mater. Today 2005, 8, 18.
28. Simon C. G., Sheng L. G., Adv. Mater. 2011, 23, 369.
29. Ding S., Liu Y., Li Y., Liu Z., Sohn S., Walker F. J., Schroers J., Nat. Mater. 2014, 13, 494.
30. Park J.‐C., Lee J.‐R., Al‐Jassim M., Kim T.‐W., Opt. Mater. Express 2016, 6, 3541.
31. Fairbrother A., Dimitrievska M., Sánchez Y., Izquierdo‐Roca V., Pérez‐Rodríguez A., Saucedo E., J. Mater. Chem. A 2015, 3, 9451.
32. Eid J., Liang H., Gereige I., Lee S., Van Duren J., Prog. Photovoltaics Res. Appl. 2015, 23, 269.
33. Jiang W., Li M., Sha J., Zhou C., Mater. Des. 2020, 192, 108687.
34. Davydova A., Rudisch K., Scragg J. J. S., Chem. Mater. 2018, 30, 4624.
35. Zakutayev A., Baranowski L. L., Welch A. W., Wolden C. A., Toberer E. S., in Proc. IEEE 40th Photovoltaic Specialists Conf., IEEE, Piscataway, NJ, 2014, pp. 2436–2438.
36. Liang H., Liu W., Lee S., van Duren J., Franklin T., Patten M., Nijhawan S., in Proc. IEEE 38th Photovoltaic Specialists Conf., IEEE, Piscataway, NJ, 2012, pp. 3102–3107.
37. Teeter G., Du H., Leisch J. E., Young M., Yan F., Johnston S. W., Dippo P., Kuciauskas D., Romero M. J., Newhouse P., Asher S. E., Ginley D. S., in Proc. IEEE 35th Photovoltaic Specialists Conf., IEEE, Piscataway, NJ, 2010, pp. 650–655.
38. Schorr S., Gurieva G., Guc M., Dimitrievska M., Pérez‐Rodríguez A., Izquierdo‐Roca V., Schnohr C. S., Kim J., Jo W., Merino J. M., J. Phys. Energy 2019, 2, 012002.
39. Fonoll‐Rubio R., Paetel S., Grau‐Luque E., Becerril‐Romero I., Mayer R., Pérez‐Rodríguez A., Guc M., Izquierdo‐Roca V., Adv. Energy Mater. 2022, 12, 2103163.
40. Chen M., Mao S., Zhang Y., Leung V. C. M., Big Data: Related Technologies, Challenges and Future Prospects, 1st ed., Springer International Publishing, London, UK, 2014.
41. Mohanty H., in Big Data: An Introduction, Springer, New Delhi, India, 2015, pp. 1–28.
42. Zhang Y., Ling C., npj Comput. Mater. 2018, 4, 25.
43. Ghiringhelli L. M., Vybiral J., Levchenko S. V., Draxl C., Scheffler M., Phys. Rev. Lett. 2015, 114, 105503.
44. Grau‐Luque E., Atlan F., Becerril‐Romero I., Perez‐Rodriguez A., Guc M., Izquierdo‐Roca V., J. Open Source Software 2021, 6, 3781.
45. Kong X., Hu C., Duan Z., Principal Component Analysis Networks and Algorithms, 1st ed., Springer, Singapore, 2017.
46. Izenman A. J., in Modern Multivariate Statistical Techniques (Eds: Allen G. I., de Veaux R. D., Nugent R.), 1st ed., Springer, New York, 2008, pp. 237–280.
47. Groce A., Kulesza T., Zhang C., Shamasunder S., Burnett M., Wong W.‐K., Stumpf S., Das S., Shinsel A., Bice F., McIntosh K., IEEE Trans. Software Eng. 2014, 40, 307.
48. Barr E. T., Harman M., McMinn P., Shahbaz M., Yoo S., IEEE Trans. Software Eng. 2015, 41, 507.
49. Riccio V., Jahangirova G., Stocco A., Humbatova N., Weiss M., Tonella P., Empirical Software Eng. 2020, 25, 5193.
50. Xie X., Ho J. W. K., Murphy C., Kaiser G., Xu B., Chen T. Y., J. Syst. Software 2010, 84, 544.
51. Seca D., arXiv:2105.01407, 2021.
52. Breck E., Cai S., Nielsen E., Salib M., Sculley D., in Proc. 2017 IEEE Int. Conf. on Big Data, IEEE, Piscataway, NJ, 2018, pp. 1123–1132.
53. Grau‐Luque E., Becerril‐Romero I., Perez‐Rodriguez A., Guc M., Izquierdo‐Roca V., J. Open Source Software 2023, 8, 5873.
54. Siebentritt S., Weiss T. P., Sood M., Wolter M. H., Lomuscio A., Ramirez O., J. Phys. Mater. 2021, 4, 042010.
55. Unold T., Gütay L., in Advanced Characterization Techniques for Thin Film Solar Cells, Vol. 1 (Eds: Abou‐Ras D., Kirchartz T., Rau U.), Wiley‐VCH, Weinheim, Germany, 2016, pp. 275–297.
56. Insignares‐Cuello C., Broussillou C., Bermúdez V., Saucedo E., Pérez‐Rodríguez A., Izquierdo‐Roca V., Appl. Phys. Lett. 2014, 105, 021905.
57. Tanino H., Maeda T., Fujikake H., Nakanishi H., Endo S., Irie T., Phys. Rev. B 1992, 45, 13323.
