Journal of Assisted Reproduction and Genetics
2023 Jan 13;40(2):265–278. doi: 10.1007/s10815-023-02713-2

Assuring quality in assisted reproduction laboratories: assessing the performance of ART Compass — a digital ART staff management platform

Carol Lynn Curchoe 1, Charles Bormann 2, Elizabeth Hammond 3, Scarlett Salter 4, Claire Timlin 1, Lesley Blankenship Williams 1, Daniella Gilboa 5, Daniel Seidman 5, Alison Campbell 4, Dean Morbeck 3
PMCID: PMC9935773  PMID: 36637586

Abstract

Purpose

Staff management is the most cited ART/IVF laboratory inspection deficiency. Small ART/IVF clinics may struggle to perform these activities because of low staff numbers; similarly, large ART/IVF networks may be challenged by high staff volume and large datasets. Here, we sought to investigate the performance of an automated, digital platform solution to manage this necessary task.

Methods

The ART Compass (ARTC) digital staff management platform was used to assess the clinical decision-making of ART laboratory staff. The survey modules presented standardized instructions to technologists and measured inter- and intra-technologist variability for subjective “clinical decision-making” type questions. Internal and external comparisons were achieved by providing technologists with two reference answers: (1) that of their own lab director and (2) the most popular response collectively provided by all lab director level accounts. The platform is hosted on HIPAA-compliant Amazon web servers and is accessible via web browser and mobile applications for iOS (Apple) and Android devices.

Results

Here, we investigated the performance of a digital staff management platform for single embryologist IVF practices and for three IVF lab networks (sites A, B, C) from 2020 to 2022. Embryology dish preparation survey results show variance among respondents in the following: PPE use, media volume, timing of oil overlay, and timing of moving prepared dishes to incubators. Surveying the perceived Gardner score and terms in use for early blastocysts reveals a lack of standardization of terminology and fair to poor agreement. We observed moderate inter-technologist agreement for ICM and TE grade (0.47 and 0.52, respectively). Lastly, the clinical decision of choice to freeze or discard an embryo revealed that agreement to freeze was highest for the top-quality embryos, and that some embryos can be highly contested, evenly split between choice to freeze or discard.

Conclusions

We conclude that a digital platform is a novel and effective tool to automate, routinely monitor, and assure quality for staff-related parameters in ART and IVF laboratories. Use of a digital platform can increase regulatory compliance and provide actionable insight for quality assurance in both single embryologist practices and for large networks. Furthermore, clinical decision-making can be augmented with artificial intelligence integration.

Supplementary Information

The online version contains supplementary material available at 10.1007/s10815-023-02713-2.

Keywords: Clinical decision-making, Competency, Quality assurance, Standardization, Training, Assessment, Proficiency, Embryo quality, Blastocyst development, Embryo viability, Embryology, Andrology, LQMS, Laboratory quality management systems

Introduction

The IVF laboratory director must identify and monitor key performance indicators (KPIs) and assess competency for IVF laboratory success. Staff management (encompassing education, training or new staff onboarding, proficiency, ongoing competency, continuing education, and real-time in-cycle KPIs) is a pillar of IVF laboratory quality management systems because of its enormous impact on clinical outcomes. Key performance indicators are routinely used to continuously monitor and assess culture conditions; however, they are a proxy for human subjective clinical decision-making, which is far more difficult to quantify. It is often difficult to identify the source of problems when a program does not produce satisfactory pregnancy outcomes. To that end, all accredited laboratories must document continuous monitoring of quality control and assurance parameters, quality improvement assessments, and competency assessments for staff [1]. Rigorously standardized, written quality control protocols are essential for meaningful staff competency assessment comparisons. This presents a challenge for both single embryologist practices and for large networks with multiple geographic locations. Accurate and successful IVF laboratory tests and procedures depend on rigorous adherence to protocols, competency to perform a wide range of procedures, and on the critical clinical decision-making of technologists [2].

CLIA regulations require six different ways to assess competency for all personnel performing laboratory testing. Evaluating and documenting competency of personnel is required at least semiannually during the first year the individual tests patient specimens and at least annually thereafter. Competency assessment must be performed for each test or procedure that technologists are approved by the laboratory director to perform. Competency assessments are part of a laboratory’s quality management system and should be periodically reviewed and used for continuous improvement [3].

However, ART labs face particular challenges with ongoing competency assessment: ART procedures have countless lab-specific variations, many varieties of grading systems, regional terminology, and rapidly shifting technology, and settings range from single embryologist practices to multi-lab networks with hundreds of technologists requiring large-scale data collection [4]. In the USA, the current subjective, clinical decision-making proficiency tests are limited, perhaps to just one cleavage stage embryo and one blastocyst image every 6 months, and they cannot be customized to a lab’s own grading system, clinical question(s) of interest, or procedures.

Quality control of andrology laboratory methods that measure inter-technician or intra-laboratory variability over large geographic areas tends to use one of two approaches to reduce bias and standardize testing, data collection, and evaluation [5, 6]: either a central laboratory distributes QC samples to the study sites for competency assessments over time, or large groups convene at one central location for several days. However, both methods have potential for bias. Laboratories that use distributed materials, such as standard slides or videos, must replace them frequently or risk technicians becoming familiar with the images. When a central location is used for a limited number of days, it is difficult to blind QC samples or to analyze them at the regular laboratory pace and intensity.

Campbell et al. (2022) noted that improvement in IVF lab KPIs and competencies often requires interdisciplinary teaming, and that an understanding of team performance is achieved by recording and analyzing data. When using data to track employee performance, the following best practices were recommended [7]: set clear, measurable, and attainable goals that benefit the organization; capture data on a regular basis; engage employees in their own improvement and development; provide regular feedback; and offer support by checking in often. Choucair et al. (2021) further proposed a robust road map to modernize knowledge transfer in the modern IVF lab [8], which included multiple digital methods: online self-assessment programs; digital technology integration through blogs, podcasts, and influential videos; and an online education management platform to report training logbooks, including a “knowledge assessment passport,” among other action items.

One of the most important processes in ART is the preparation of dishes for embryo culture because it has an enormous impact on embryo development. Care and precision must be taken for optimal dish preparation, adhering rigorously to the written protocol of the lab. However, our understanding of the impact of droplet size, evaporation, osmolality, and sterility on embryo development has evolved over time [9]. Drift in technique can occur, particularly as workloads shift and new embryologist training subtly varies with new understanding. Quality assurance monitoring of dish preparation can facilitate troubleshooting when and if results drop below a threshold value.

Consistency and accuracy in embryo staging and grading (quality assessment) are another important, yet challenging factor in the IVF laboratory, which relies on the technologist’s judgment [10]. Efforts have been undertaken to standardize embryo staging (terminology) as well as quality grading with published scales and systems [3, 11]. However, paramount to these systems is consistency from technologists in their assessments. Historically, many lines of evidence point to high variability and inconsistency in embryo staging and grading, and artificial intelligence is poised to improve this [12].

Lastly, embryo quality assessment and grading are tied to clinical decision-making. Transferring or freezing top-quality fresh or frozen-warmed embryos combined with selection criteria has been associated with higher implantation rates and better clinical outcomes [13]. However, even low-grade “c-quality” embryos can produce healthy pregnancies [14]. The number of frozen blastocysts a patient has predicts their likelihood of eventually achieving an IVF pregnancy. IVF labs are challenged by standardization of clinical decision-making, particularly with regard to outlying, low-quality embryos [15]. What is the poorest quality embryo we should be freezing [16]?

Here, we discuss the development and performance of a digital platform to automate and standardize assessments for subjective clinical decision-making, assess competency, calculate interlaboratory and inter-technician variation, track technologists over time, and view comparative data at the technician and clinic level. Mobile application technology allows for rapid data collection and real-time reporting, real-time analysis, management and distribution of multimedia files, the ability to utilize hardware add-ons or proprietary device hardware features (gyroscope, microphone, etc.), and to collect biometric (fingerprint) data to enhance security. The ability to implement compliance standards ensures that tester data can be reported electronically to a central agency without compromising privacy standards or sacrificing efficiency.

The performance studies presented here include the following: embryo culture dish preparation, blastocyst (early and late) grading and terminology, and clinical decision-making (freeze or discard blastocysts), without and with AI assistance.

Most importantly, the survey platform was designed to allow the ART/IVF laboratory director to gain insight into the clinical decision-making of the most senior staff and compare that to junior staff members to inform the key performance indicators (KPIs) that are routinely used to continuously monitor and assess culture conditions.

Materials and methods

Development of a mobile application

Xcode (Apple Inc., Cupertino, CA, USA), distributed free of charge as proprietary software, served as the development environment for the iOS app. Java and Kotlin were used for the development of the Android mobile application, and a HIPAA-compliant Laravel/PHP framework was used to develop the Internet-based application. HIPAA-compliant Amazon web servers host the data. Open-source SDKs were installed locally for development purposes by cloning the GitHub repositories when necessary. The MySQL database, REST API, and administrator panel were developed with the HIPAA-compliant Laravel framework. Passport authentication, biometric (fingerprint or face) authentication, and encryption protect user data.

Competency assessments

Over 80 different surveys were developed for andrology and embryology competency assessment. Assessment materials were served digitally, and the technologists chose to use either a web browser or mobile device, allowing the images to be analyzed at the same pace and intensity as the IVF laboratory workflow. Testing materials, images, and videos can be continuously refreshed to reduce the possibility of recognition bias. The modules provided standardized instructions and were used to measure inter- and intra-technologist variability among embryologists. Subjective clinical decision-making surveys included here are as follows: embryo culture dish preparation, early blastocyst terminology and expansion score, blastocyst ICM quality, trophectoderm quality, and fate decision, i.e., to freeze or discard.

Scoring for subjective competency assessments

The scoring rubrics for subjective competency assessments are primed by five senior-level lab directors from within the ART Compass scientific advisory board before being broadly deployed (see “Acknowledgements” section). When the first “director” level account answers a question, that answer is worth 5 points (result 1). When a second director level account answers, a matching answer reinforces result 1, while a different answer becomes worth 3 points (result 2). As more directors answer, the most popular director answer counts as 5 points (result 1) and the second most popular answer counts as 3 points (result 2). When a technologist answers, a match with result 1 earns 5 points and a match with result 2 earns 3 points; any other answer earns 0 points. The final score is the total points achieved divided by the total points available, multiplied by 100.
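The rubric arithmetic above can be sketched in a few lines of Python. This is an illustrative reimplementation rather than the platform's actual code; the function names are ours.

```python
from collections import Counter

def rubric_results(director_answers):
    """Derive result 1 (most popular director answer, worth 5 points)
    and result 2 (second most popular, worth 3 points)."""
    counts = Counter(director_answers).most_common()
    result1 = counts[0][0]
    result2 = counts[1][0] if len(counts) > 1 else None
    return result1, result2

def technologist_score(tech_answers, director_answer_sets):
    """Score a technologist against the director consensus.

    tech_answers         -- one answer per question
    director_answer_sets -- per question, the list of director answers
    """
    earned = 0
    available = 5 * len(tech_answers)  # 5 points possible per question
    for answer, directors in zip(tech_answers, director_answer_sets):
        result1, result2 = rubric_results(directors)
        if answer == result1:
            earned += 5
        elif result2 is not None and answer == result2:
            earned += 3
        # any other answer earns 0 points
    return 100 * earned / available
```

For example, a technologist who matches the second most popular director answer on one question (3 points) and the majority answer on a second question (5 points) scores 8 of 10 available points, i.e., 80.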

Embryo culture dish preparation

The embryo culture preparation methods survey contained 10 questions relating to use of personal protective equipment, oil overlay, drop size, embryos per drop, plasticware, drop pattern, and culture media type, with variable answer choices. Results were grouped by account level, lab director, or technologist. The survey was completed by 32 IVF lab directors and embryologists (worldwide) from 2021 to 2022.

Early blastocyst terminology

The early blastocyst staging survey contained 17 images of early blastocysts and was completed by 55 embryologists (worldwide) from 2020 to 2021. Grading options (cavitating morula/Gardner 1, early blastocyst/Gardner 1, cleaving morula/Gardner 1 or 2, early blastocyst/Gardner 2, poor blastocyst, or other) were given as choices to determine the terms in active use, and agreement on perceived Gardner grade, for embryos accumulating fluid in a cavity. The Fleiss kappa coefficient (k) measures inter-technologist agreement among embryologists when assigning blastocyst grade. Kappa value interpretation is as follows: < 0.20: poor; 0.21–0.40: fair; 0.41–0.60: moderate; 0.61–0.80: good; and 0.81–1.00: very good.
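As a reference for how this agreement statistic works, the following is a minimal Python sketch of the Fleiss kappa computation together with the interpretation bands used in this study. The study itself computed kappa in R with the irrCAC package, so this is an illustrative reimplementation under the usual Fleiss formulation (each image rated by the same number of raters).

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from an n_subjects x n_categories matrix of
    rating counts; each row must sum to the same number of raters."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    total = n_subjects * n_raters
    # observed agreement P_i, averaged over subjects
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    # marginal proportion of each category
    p_cat = [sum(row[j] for row in counts) / total
             for j in range(len(counts[0]))]
    p_exp = sum(p * p for p in p_cat)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

def interpret(k):
    """Map a kappa value onto the bands used in this study."""
    if k <= 0.20:
        return "poor"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "good"
    return "very good"
```

With perfect agreement across two raters and two categories, the function returns 1.0; with systematic disagreement it goes negative.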

Quality scoring: blastocyst ICM and TE

One hundred expanded (stage 4) blastocyst images in three planes were included in each survey for quality scoring of blastocyst cell layers, proctored to Fertility Associates, New Zealand (site A) and Sunfert, Malaysia (site B).

One survey was for quality scoring of the inner cell mass (ICM) grade (Fig. 3A), and another was for quality scoring of the trophectoderm (TE) grade (Fig. 3B). The same images were used for both surveys. Blastocysts were presented with proportionally mixed grades (ranging from grade A to X). A total of 42 embryologists participated, including 23 from site A and 19 from site B, between April and July 2020. Embryologists were advised to complete the tests individually in one sitting. Percentage agreement is defined as the number of agreement scores divided by the total scores. The Fleiss kappa coefficient (k) measures inter-technologist agreement among embryologists when assigning blastocyst grade. Kappa value interpretation is as follows: < 0.20: poor; 0.21–0.40: fair; 0.41–0.60: moderate; 0.61–0.80: good; and 0.81–1.00: very good.
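The "agreement scores / total scores" definition can be operationalized in more than one way; the sketch below assumes agreement means matching the modal grade for each image, averaged across images. That operationalization is our assumption for illustration, not a formula stated by the authors.

```python
from collections import Counter

def percent_agreement(grades_per_image):
    """Mean share of raters matching the modal grade for each image.

    grades_per_image -- a list of images, each a list of the grades
                        (e.g., "A"/"B"/"C"/"X") assigned by all raters.
    """
    shares = []
    for grades in grades_per_image:
        modal_count = Counter(grades).most_common(1)[0][1]
        shares.append(modal_count / len(grades))
    return 100 * sum(shares) / len(shares)
```

For two images graded by four raters, with three of four agreeing on the first image and two of four on the second, the mean agreement is (75% + 50%) / 2 = 62.5%.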

Fig. 3.

Fig. 3

Example images with top and poor percentage agreement for ICM grade. A Complete agreement for ICM grade X. B Complete agreement for ICM grade A. C Polarized agreement for ICM; grade A, 4.8%; grade B, 28.6%; grade C, 47.6%; and grade X, 19.0%. D Polarized agreement for ICM; grade B, 21.4%; grade C, 31.0%; and grade X, 47.6%. E Polarized agreement for ICM; grade A, 9.5%; grade B, 40.5%; grade C, 45.2%; and grade X, 4.8%

Time-lapse images in three planes were provided, and technologists (n = 42) were asked to independently evaluate the ICM and TE grade for 100 blastocysts using our modified Gardner criteria (A, B, C, and X). Agreement for blastocyst (1) grade and (2) usability (inferred by grade) was assessed.

Clinical decision-making: freeze or discard?

Two hundred forty blastocyst images with three focal planes were displayed in 12 surveys, each containing 20 images, to three IVF lab networks: Fertility Associates, New Zealand (site A), Sunfert, Malaysia (site B), and CARE UK (site C), for an unlimited time frame. Site C was further analyzed by individual laboratories.

Embryologists were asked whether they would choose to cryopreserve or dispose of (discard) the blastocyst pictured. Raters were classified as technicians (n = 196) or lab directors (n = 21), before measuring interobserver agreement. Agreement between all members of an individual lab (directors and technicians) was also measured. Interobserver agreement was assessed using the Fleiss kappa coefficient (k) in R with the irrCAC package [17]. Kappa value interpretation is as follows: < 0.20: poor; 0.21–0.40: fair; 0.41–0.60: moderate; 0.61–0.80: good; and 0.81–1.00: very good.

The data were then divided into embryo grade categories: (1) obvious top-quality embryos, (2) good quality embryos, (3) obvious poor-quality embryos, and (4) borderline quality embryos. Agreement among technicians, directors, and between labs was then assessed using Gwet’s agreement coefficient (AC1) in R using the irrCAC package [17]. Coefficient interpretation is as previously stated: < 0.20: poor; 0.21–0.40: fair; 0.41–0.60: moderate; 0.61–0.80: good; and 0.81–1.00: very good.
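Gwet's AC1 differs from Fleiss' kappa only in its chance-agreement term, which makes it more stable when category prevalences are skewed (for example, nearly unanimous "freeze" decisions among top-quality embryos). The following is a minimal Python sketch as an illustrative stand-in for the irrCAC computation, assuming the same number of raters per subject.

```python
def gwet_ac1(counts):
    """Gwet's first-order agreement coefficient (AC1) from an
    n_subjects x n_categories matrix of rating counts."""
    n = len(counts)
    q = len(counts[0])
    r = sum(counts[0])  # raters per subject (assumed constant)
    # observed pairwise agreement, as in Fleiss' kappa
    p_a = sum(
        sum(c * (c - 1) for c in row) / (r * (r - 1)) for row in counts
    ) / n
    # chance agreement under Gwet's model
    pi = [sum(row[j] for row in counts) / (n * r) for j in range(q)]
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_a - p_e) / (1 - p_e)
```

Perfect agreement yields 1.0, and the statistic stays well-behaved even when one category (e.g., "freeze") dominates, which is the design rationale for using AC1 on the stratified embryo categories here.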

Use of EMA AiScore to augment clinical decision-making

The EMA AI algorithm [18] was used to calculate an embryo ranking (1–10; 1 = poor, 10 = good), which was correlated with embryologist interpretations. Five senior embryologists were asked to calibrate their highest and lowest quality scores against the EMA score and then re-evaluate the highly contested embryo images for the decision to freeze or discard. Lastly, when the EMA score varied significantly from the embryologist scores, they were asked to re-evaluate the images.
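One way the "significant variance" trigger for re-evaluation might be implemented is sketched below. The thresholds and the function name are illustrative assumptions of ours, not EMA's actual logic.

```python
def flag_for_review(ema_scores, majority_decisions,
                    freeze_threshold=6, discard_threshold=4):
    """Flag embryo indices where the EMA ranking (1 = poor, 10 = good)
    conflicts with the majority human decision; thresholds are
    illustrative, not EMA's documented cutoffs."""
    flagged = []
    for idx, (score, decision) in enumerate(
            zip(ema_scores, majority_decisions)):
        if decision == "discard" and score >= freeze_threshold:
            flagged.append(idx)  # AI rates highly, humans would discard
        elif decision == "freeze" and score <= discard_threshold:
            flagged.append(idx)  # AI rates poorly, humans would freeze
    return flagged
```

Embryos flagged this way would be routed back to senior embryologists, mirroring the re-evaluation step described above.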

Results

A summary of number of surveys proctored versus number of respondents is provided in Supplementary Table 1.

Embryo culture dish preparation

We observed variance in the following: plating media and oil (Fig. 1A), PPE use in preparing culture dishes (Fig. 1B), and the timing of movement of dishes to incubators. We also noted a trend toward culture of embryos in 10 μL of media per embryo or less and a movement toward one-step and continuous culture (and away from sequential media) among our survey respondents (Table 1).

Fig. 1.

Fig. 1

High variance in embryo culture dish survey responses. A Variance in plating embryo culture media and oil. B Variance in personal protective equipment donned for culture dish preparation. C Variance in timing of moving a freshly plated culture dish to an incubator

Table 1.

Embryo culture dish preparation survey. Trends and observations in dish culture preparation

Question Answer choices Director level Tech level Combined
Do you wear PPE when you make dishes? Yes — gloves and face mask 4 10 14
Yes — gloves only 3 3 6
No 3 10 13
Do you move the embryo culture dish(es) to the incubator as soon as they are covered with oil? No, I plate dishes then cover them with oil, then plate some more, and eventually move them all into incubators 5 14 19
Yes, each dish is moved into an incubator ASAP 5 9 14
When making culturing dishes, do you… Add media drops first and then cover with oil 8 15 23
Add a bit of media first, then cover with oil, and then either replace media or add to media once oil is present 1 3 4
Add oil first and then underlay media 1 5 6
If you add media drops before covering them with oil, what is the maximum number of dishes prepared with media in a row prior to laying down oil? One. I prepare one dish at a time 2 6 8
Two. I lay media drops down for two dishes and then cover both with oil 2 8 10
Three to four. I lay media drops down for three to four dishes first and then cover those three to four dishes with oil 5 5 10
Five to six
Seven or greater 1 1
This question does not apply to me 1 3 4
What is the maximum number of embryos per drop? 1 3 3
2/3 1 3 4
4/5 7 10 17
6/7 1 3 4
Did not answer 1 4 5
What size (approximately) are your typical culturing media drops? Less than 20 μL 3 3
20–40 μL 5 11 16
40–60 μL 4 7 11
Greater than 60 μL 1 2 3
Media amount per embryo (based on answers to two previous questions) Approximately 10 μL 4 7 11
Less than 10 μL 4 8 12
Greater than 10 μL 1 4 5
What type(s) of embryo media system is used from fertilization to day 5/6/7 blastocyst development? Sequential media: “cleavage” media for days 1–3 and then “blast” media for days 3–7 1 3 4
Sequential media: fertilization media and then one step 1 4 5
Sequential media: fertilization media, cleavage media, and blastocyst media 0 1 1
Single-step media refreshed at day 3 3 2 5
Single-step media (i.e., monoculture; same media is used for all 5 or 6 days) 5 14 19
INVOcell or similar offered? 1 1 2
EmbryoScope or similar? 2 4 6

Early blastocyst terminology

Out of 935 possible answers, the term “cavitating morula” was selected 105 times, and the nonexistent grade “cleaving morula” was selected 59 times. The kappa value to examine the consistency of Gardner 1 or Gardner 2 grades was fair to poor for all images. Figure 2 shows representative images of early blastocysts used in the survey.

Fig. 2.

Fig. 2

All early blastocyst images had fair to poor embryologist agreement. The images presented here are representative of early blastocysts (Gardner 1 (i and ii) — Gardner 2 (iii and iv)) that garnered fair to poor agreement. None of the images presented in the survey garnered the highest (good) agreement

Quality scoring: blastocyst ICM and TE

The 100 blastocyst cases assessed were of mixed quality, representing typical agreement values between embryologists for assigning ICM and TE grade given a range of blastocyst qualities. The mean percent agreement between technologists when grading was 61.5% and 66.7% for ICM and TE, respectively (Table 2). Therefore, higher percentage agreement was observed between technologists for TE compared to ICM. Tables 3 and 4 show the Fleiss kappa coefficient for ICM and TE grade by clinic (sites A and B). Overall, there was moderate agreement for ICM and TE grade for both sites (and combined). Figure 3 shows representative images. ICM grades X (Fig. 3A, no or degenerate ICM) and A (Fig. 3B, good) had the highest agreement, while ICM grades B and C had the lowest agreement (Fig. 3, fair). These results show that embryologists inconsistently classified ICM grade when it was of moderate to poor quality (Fig. 3C–E, grades B or C) but consistently classified ICM grade when there was no apparent ICM (grade X). Figure 4 shows representative images for TE quality grading. For TE grade, embryologists consistently identified a top-quality TE (Fig. 4A, grade A). TE grades of moderate to poor quality displayed low embryologist agreement, similar to the ICM grading (Fig. 4B–D, fair to poor).

Table 2.

ICM and TE agreement. The average percent agreement between technologists when grading was 61.5% and 66.7% for ICM and TE respectively

Average (%) Min (%) Max (%)
Sites A and B
ICM 61.5 53.1 67.0
TE 66.7 53.6 72.9
Site A
ICM 61.5 55.2 66.2
TE 68.6 61.8 72.9
Site B
ICM 61.6 53.1 67.0
TE 64.4 53.6 71.7

Table 3.

Fleiss kappa coefficient for ICM and TE grade. The Fleiss kappa coefficient for ICM and TE grade by clinic

Category All sites Site A Site B
ICM Category TE Category ICM Category TE Category ICM Category TE Category
A 0.53 Moderate 0.62 Good 0.53 Moderate 0.68 Good 0.57 Moderate 0.57 Moderate
B 0.37 Fair 0.48 Moderate 0.36 Fair 0.55 Moderate 0.39 Fair 0.41 Moderate
C 0.39 Fair 0.46 Moderate 0.40 Fair 0.52 Moderate 0.39 Fair 0.40 Fair
X 0.69 Good 0.57 Moderate 0.70 Good 0.57 Moderate 0.67 Good 0.56 Moderate
Overall 0.47 Moderate 0.52 Moderate 0.47 Moderate 0.58 Moderate 0.48 Moderate 0.47 Moderate

Table 4.

Fleiss kappa coefficient for blastocyst usability. The Fleiss kappa coefficient for blastocyst usability, as inferred by ICM and TE grade. There was good agreement for useability based on ICM grade and moderate agreement for useability based on TE grade

Category All sites Site A Site B
ICM Category TE Category ICM Category TE Category ICM Category TE Category
Usability 0.69 Good 0.57 Moderate 0.70 Good 0.57 Moderate 0.67 Good 0.56 Moderate

Fig. 4.

Fig. 4

Example images with top and poor percentage agreement for TE grade. A Complete agreement for TE grade A. B Polarized agreement for TE; grade B, 38.1%; grade C, 54.8%; and grade X, 7.1%. C Polarized agreement for TE; grade A, 2.4%; grade B, 28.6%; grade C, 61.9%; and grade X, 7.1%. D Polarized agreement for TE; grade A, 33.3%; grade B, 59.5%; and grade C, 7.1%

Clinical decision-making: freeze or discard

The Fleiss kappa coefficients for agreement on whether to discard or freeze embryos on the last day of culture are listed in Tables 5 and 6. Agreement was moderate among technicians and good among lab directors, as determined by our kappa value interpretation, with overlapping 95% confidence intervals. Overall agreement, with technicians and directors combined, was classified as moderate.

Table 5.

Fleiss kappa coefficients for technicians and directors at site C. The Fleiss kappa coefficients for agreement on whether to discard or freeze embryos at site C, by rater group, composed of director and technician ratings

Freeze or discard 95% confidence interval Category
Technicians 0.58 0.53–0.63 Moderate
Directors 0.60 0.54–0.63 Good
Overall 0.59 0.55–0.63 Moderate

Table 6.

Fleiss kappa coefficients by individual laboratory for site C

Lab
(no. of embryologists)
Freeze or discard 95% confidence interval Category
A (n = 3) 0.89 0.83–0.94 Very good
B (n = 9) 0.70 0.64–0.75 Good
C (n = 6) 0.63 0.57–0.69 Good
D (n = 5) 0.72 0.66–0.79 Good
E (n = 5) 0.76 0.69–0.82 Good
F (n = 5) 0.64 0.57–0.70 Good
G (n = 9) 0.71 0.65–0.77 Good
H (n = 6) 0.71 0.62–0.79 Good
I (n = 8) 0.71 0.66–0.77 Good
J (n = 4) 0.54 0.45–0.64 Moderate
K (n = 7) 0.78 0.71–0.85 Good

At site C (a large network of 11 IVF laboratories with standardized protocols), we identified significant differences in blastocyst classification between individual embryologists (Table 6). The embryologists unanimously agreed on the fate of the embryos in only 38.8% of cases; of these, 16.7% were “agreement to cryopreserve” and 22.1% were “agreement to discard.” This means that embryologists disagreed in over 60% of cases, with some choosing to cryopreserve and others choosing to dispose of the pictured blastocyst. Representative blastocyst images with 50% agreement are shown in Fig. 5.

Fig. 5.

Fig. 5

Highly contested blastocysts. The images presented here are representative of blastocysts that garnered a complete split decision (to cryopreserve or discard) with only a 50% overall agreement for usability scoring

Table 6 presents the Fleiss kappa coefficients for agreement on whether to discard or freeze embryos by individual laboratories at site C, composed of director and technician ratings. For the most part, agreement within individual labs was high; however, our data show that agreement between embryologists varies from lab to lab.

Table 7.

Gwet’s agreement coefficients (AC1) for technicians and directors by embryo category. When the embryo images were divided into categories reflecting embryo quality (obvious top quality, obvious bottom quality, borderline, etc.), agreement on whether to freeze or discard an embryo was very strong among the high-quality embryos and decreased as quality became lower

Embryo category Technicians Directors
Gwet’s agreement coefficient 95% confidence interval Category Gwet’s agreement coefficient 95% confidence interval Category
1 0.998 0.995–1 Very good 1 N/A Very good
2 0.993 0.986–1 Very good 0.994 0.982–1 Very good
3 0.801 0.722–0.88 Very good 0.883 0.826–0.939 Very good
4 0.344 0.277–0.411 Fair 0.401 0.328–0.475 Fair

When the embryo images were divided into categories reflecting embryo quality (obvious top quality, obvious bottom quality, borderline, etc.), agreement on whether to freeze or discard an embryo was very strong among the high-quality embryos and decreased as quality became lower or more questionable (Table 7).

Use of EMA AiScore to augment clinical decision-making

The subset of embryos with the most significant differences in utilization classification between individual embryologists (16.7% “agreement to cryopreserve” and 22.1% “agreement to discard”) were given an EMA score, all earning mid-range scores of 4, 5, or 6 (Table 8). Five senior embryologists re-evaluated the images and changed their decision from “discard” to “freeze” with the EMA AI assistant. EMA identified two extra blastocysts to freeze out of every data set of 20 images, increasing the potential number of real “decisions to freeze” by 10%.

Table 8.

EMA AIscore correlates with embryologist clinical decision-making. The EMA artificial intelligence score and a representative set of 20 calibrator images, plus four highly contested embryo images that had a significant difference in utilization classification between individual embryologists. Embryo numbers 5 and 13 in the selected data showed the strongest disagreement between the EMA score and the average rater’s scores. Each data set of 20 images generated two strong disagreements between EMA and embryologists, and those images were re-evaluated by senior embryologists. In each instance (denoted by *), five senior embryologists changed their score from discard to freeze, when their decision was augmented by the EMA score

50% agreement 6, 4, 6 Change to freeze
50% agreement 5, 5, 5 Change to freeze
50% agreement 4, 6, 4 Change to freeze
50% agreement 5, 7, 6 Change to freeze
Image 1 8, 8, 5 Good
Image 2 9, 9, 9 Good
Image 3 > 1, > 1, > 1 Poor
Image 4 6, 9, 7 Test
Image 5* 6, 9, 7 Test
Image 6 6, 4, 6 Test
Image 7 8, 7, 8 Good
Image 8 1, 3, 1 Poor
Image 9 6, 6, 7 Test
Image 10 > 1, 2, 4 Poor
Image 11 5, 4, 4 Test
Image 12 2, 5, 5 Test
Image 13* 5, 8, 9 Test
Image 14 9, 8, 9 Good
Image 15 > 1, > 1, > 1 Poor
Image 16 2, 2, 2 Test
Image 17 4, 2, > 1 Poor
Image 18 4, 2, 1 Test
Image 19 9, 8, 9 Good
Image 20 8, 7, 8 Good

Discussion

Here, we sought to provide a flexible, efficient, digital, and automated staff management solution that is customizable and accessible on any device from “the cloud,” as one way to achieve the best practices noted by Campbell et al. and Choucair et al.

The ART Compass (ARTC) digital staff management platform was developed and validated in a pilot study. Feedback was gathered from users to ensure that the platform could fulfill the requirements noted above, and the learnings and recommendations from the pilot study were used to improve the ARTC platform (Supplementary Table 2). ARTC is a full staff management platform that includes various features to manage personnel files, continuing education, annual procedure evaluations, real-time “in-cycle” KPIs, and subjective and objective competency assessments. The scope of the current work is limited to the subjective competency assessments module.

ARTC competency surveys are user generated, unlimited, and completely customizable (from the images to the buttons to the survey directions). ARTC allows the lab director to notify employees in advance that the assessment should be completed by a specific deadline. The assessments can be done while the employee is performing tasks, using routine sample images or video. The results of the survey are shared with the employee instantly upon completion of the test (to avoid bias) or after each question is answered (as a training tool).

Our mobile application technology was designed to allow standardized specimens (images or videos) to be served to each technologist at each study site simultaneously, allowing even very small IVF clinics to compare an individual technician’s scores to the mean of all technicians answering on ARTC as well as to technicians in a central laboratory. Test pictures, videos, and written test questions can be refreshed from a large database of multimedia files to reduce bias in a laboratory’s testing procedures.
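As an illustration of the comparison logic described above, the sketch below computes the most popular (modal) answer per survey question and each technologist's agreement with that mode. The function names and data layout are hypothetical, for illustration only, and are not taken from the ARTC codebase:

```python
from collections import Counter

def modal_answers(responses):
    """responses: {question_id: {tech_id: answer}} -> most popular answer per question."""
    return {q: Counter(by_tech.values()).most_common(1)[0][0]
            for q, by_tech in responses.items()}

def agreement_with_mode(responses, tech_id):
    """Fraction of answered questions where one technologist matches the modal answer."""
    modes = modal_answers(responses)
    answered = [q for q in responses if tech_id in responses[q]]
    hits = sum(responses[q][tech_id] == modes[q] for q in answered)
    return hits / len(answered)

# Example: two survey questions answered by three technologists
responses = {
    "q1_gardner_grade": {"tech_a": "4AA", "tech_b": "4AA", "tech_c": "3BB"},
    "q2_disposition":   {"tech_a": "freeze", "tech_b": "discard", "tech_c": "freeze"},
}
agreement_with_mode(responses, "tech_c")  # tech_c matches the cohort mode on 1 of 2 questions
```

The same comparison could be run against a single reference rater (e.g., the lab director's answers) instead of the cohort mode, mirroring the platform's dual internal/external comparison.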

The pre-survey video can deliver learning content, instructions, or demonstrations. ARTC documents competency assessments and records, with time- and date-stamped results, and keeps them confidential. These records become part of the laboratory's quality documents and can be periodically reviewed and used for continuous improvement and quality assurance. The records of employees who have separated from the practice are retained in the platform.

Some federated IVF lab networks enforce the same protocols network wide, whereas others allow each lab that joins the network to retain its local protocols. As the coalescence of IVF labs into federated networks becomes more common in the industry [19], the question of measuring complex competencies, like embryo culture dish-making practices, across a network becomes more pertinent. We know that small droplet size, evaporation, high-temperature conditions, and unsterile conditions should be avoided during embryo culture dish preparation. The data gathered here show that interesting and actionable results can be obtained about an intricate procedure like dish preparation.

Adoption of standard terminology for early blastocyst grading is incomplete and varies widely by geographic region. Additionally, the very low agreement on perceived Gardner scale grades 1–2 lends further evidence to the trend of combining these two grades into a single grade — "early blastocyst." Similar issues exist with manual assessment of other embryo developmental stages. Consistency in manual annotation, specifically in training datasets, becomes more important as the field of artificial intelligence and predictive modeling advances.

Previously, fair agreement between embryologists has been reported for ICM (kappa 0.349) and TE grade (kappa 0.397) [20]. Fair agreement has also been reported for usability for a cohort of blastocysts that exhibited borderline morphology (kappa 0.301), but the overall agreement for usability will be dependent on the proportion of high-quality blastocysts within the cohort, with very high agreement expected for exclusively top-quality blastocysts (kappa 1.0) [16].

Overall, we observed moderate inter-technologist agreement for ICM and TE grade (0.47 and 0.52, respectively), and as expected, the level of agreement was dependent on blastocyst quality. In general, an absent/degenerate ICM (grade X) and conversely an extensive TE network (grade a) received the highest inter-technologist agreement, where this is clinically useful for determining usability based on the ICM and ranking based on the TE. While overall moderate inter-technologist agreement was found, the data reiterates that blastocyst grading is subjective, with grade discrepancies between embryologists frequently apparent. The figures provide some examples of “polarizing” morphological features; similar cases are likely to arise frequently during real-world grading in the absence of more objective measures of grade or usability, such as those provided by an artificial intelligence assistant.
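The agreement coefficients quoted here were computed with Gwet's irrCAC package [17]. As an illustrative sketch of how such chance-corrected statistics work, the classic Fleiss' kappa (a related but not identical coefficient, shown here only for illustration) can be computed in a few lines:

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a ratings table where table[i][j] is the number of
    raters assigning subject i to category j; every subject has n raters."""
    n_subjects = len(table)
    n_raters = sum(table[0])  # raters per subject (assumed constant)
    total = n_subjects * n_raters

    # p_j: overall proportion of assignments falling in each category
    p = [sum(row[j] for row in table) / total for j in range(len(table[0]))]

    # P_i: extent of agreement among raters for each subject
    P = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
         for row in table]

    P_bar = sum(P) / n_subjects       # mean observed agreement
    P_e = sum(pj * pj for pj in p)    # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Three embryologists deciding freeze vs discard for three embryos:
# rows = embryos, columns = (freeze count, discard count)
ratings = [[3, 0], [2, 1], [0, 3]]
kappa = fleiss_kappa(ratings)  # -> 0.55, "moderate" on the Landis-Koch scale
```

A coefficient near 1 indicates near-perfect agreement beyond chance, near 0 indicates agreement no better than chance, which is why borderline-morphology cohorts pull the statistic down.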

This variability in grading carries over into the clinical decision-making embryologists face, such as whether to discard or freeze embryos on the last day of a patient's cycle. Our findings show moderate agreement among technicians (0.58) and good agreement among lab directors (0.60) on whether to freeze or discard. We also found variations in agreement within individual labs. For the most part, embryologists are in good agreement within their individual labs, but some labs agree less than others. This could be due to variations in training quality or in the frequency and volume of cycles to which they are exposed. Baxter Bendus et al. (2006) demonstrated that clinics running fewer than 500 cycles a year had larger interobserver variation than those running more than 500 [21].

We also assessed agreement between raters when the data was broken down into categories of embryo grade, with group 1 being obvious top quality, group 2 good quality, group 3 obvious poor quality, and group 4 borderline quality. Interestingly, agreement decreased as embryo quality decreased. There was a slight decrease in agreement between top-quality and good-quality embryos. For the decision to freeze or discard on the last day of culture, agreement on obvious poor-quality embryos was not as strong as agreement on top-quality or grade 2 embryos. This is interesting, as it may highlight another aspect of subjectivity in embryo grading: assigning a grade to an embryo is one subjective matter, but whether to utilize that embryo is another, one that may be more easily influenced by an embryologist's personal preferences or by situations of low embryo production or low pregnancy retention in patients. Our borderline-quality embryos had the least agreement, as expected, since their quality was highly subjective.

This study highlights the subjectivity in blastocyst classification within the field of IVF. The decision about which embryos to select for utilization has a direct impact on clinical pregnancy rates; it is therefore vital that subjectivity is minimized as much as possible through regular quality assurance and the use of objective methods for embryo classification, such as artificial intelligence [22]. The EMA scores for the subset of highly contested embryos all fell in the mid-range. Senior embryologists re-evaluated the images and unanimously changed their decision from “discard” to “freeze” with input from the EMA AI assistant. EMA identified two extra blastocysts to freeze out of every data set of 20 images, increasing the potential number of frozen blastocysts by 10%. We know the number of frozen blastocysts is a predictor of IVF cycle success [23, 24].

We expect future analysis of clinic outcome data, and of whether this subjectivity exists across all embryo grades, to support the use of digital staff management technology, such as ART Compass, to improve standardization.

Limitations

This performance study used diverse surveys over a wide geographic area, with diverse embryologist generations, training levels, and types of IVF practices. However, there are certain limitations to our findings. ARTC surveys collect self-reported data. Only blastocyst-stage embryos were assessed; additional competency surveys will therefore be assessed for performance in the future. In the current work, survey takers who score below a certain percentage are automatically assigned to the "Re-Take" category in ARTC. We are currently validating the performance of automated remediation, in the format of pre-quiz, remediation, and post-quiz, with a large IVF lab network. In the current system, subjective competency assessments are treated as if there is no single "correct" answer; in some cases there may be a single correct interpretation for images, so we are also currently exploring options to handle these.

Lastly, the true relevance of the grading and selection inconsistencies is unknown. Work to compare compiled survey data with ongoing KPI analysis and to establish correlation with clinical significance is ongoing. However, embryo quality and morphokinetics have long been used as proxy indicators of embryo potential. This is a simplistic view that the frequent use of PGT has modified: while the general correlation holds, we now know that even embryos with excellent morphology can be aneuploid and have low clinical pregnancy potential.

In conclusion, a challenge for any new technology is to prove that it meets the accuracy standards of tried-and-true methods. The ease of use by embryologists worldwide, including at three large IVF networks, indicates that a digital staff management platform, facilitated by mobile smartphone applications, provides an automated, user-friendly, and accessible platform for quality assurance. Effective and efficient assessment of competency and KPIs is an ongoing challenge for laboratories; the ART Compass digital staff management platform is a novel and effective tool to monitor QA parameters in the IVF laboratory, for single-embryologist practices and federated IVF lab networks.

Supplementary information

ESM 1 (DOCX 13.5 kb)

Acknowledgements

We would like to acknowledge the entire scientific advisory board of ART Compass — Kimball O. Pomeroy, Anthony Anderson, Michael Baker, Eva Schenckman, Said Daneshmand, Ashley Geka, and Dean Morbeck — and Senior Scientific Advisor Charles Bormann. We could never express enough gratitude to our software development partners, Arvaan Technolabs, especially Pooja Patel, and the entire team of developers. Countless others have freely given their time, energy, feedback, and ideas. Thousands of embryologists have taken our surveys. There is no way to thank them appropriately.

Declarations

Conflict of interest

CLC is the founder of ART Compass, a fertility guidance technology and a big data and artificial intelligence software platform for IVF lab management. CB is a shareholder of AI-related patents. DG and DS are co-founders and shareholders of AIVF. EH, SS, CT, LBW, AC, and DM declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Olofsson JI, Banker MR, Sjoblom LP. Quality management systems for your in vitro fertilization clinic's laboratory: why bother? J Hum Reprod Sci. 2013;6(1):3–8. doi: 10.4103/0974-1208.112368.
2. Matson PL. Internal quality control and external quality assurance in the IVF laboratory. Hum Reprod. 1998;13(Suppl 4):156–165. doi: 10.1093/humrep/13.suppl_4.156.
3. Alpha Scientists in Reproductive Medicine and ESHRE Special Interest Group of Embryology. The Istanbul consensus workshop on embryo assessment: proceedings of an expert meeting. Hum Reprod. 2011;26(6):1270–1283. doi: 10.1093/humrep/der037.
4. Niederberger C, et al. Forty years of IVF. Fertil Steril. 2018;110(2):185–324.e5. doi: 10.1016/j.fertnstert.2018.06.005.
5. Rothmann SA, Reese AA. Semen analysis: the test techs love to hate. MLO Med Lab Obs. 2007;39(4):18–20.
6. Pacey AA. Is quality assurance in semen analysis still really necessary? A view from the andrology laboratory. Hum Reprod. 2006;21(5):1105–1109. doi: 10.1093/humrep/dei460.
7. Campbell A, et al. The in vitro fertilization laboratory: teamwork and teaming. Fertil Steril. 2022;117(1):27–32. doi: 10.1016/j.fertnstert.2021.09.031.
8. Choucair F, Younis N, Hourani A. The value of the modern embryologist to a successful IVF system: revisiting an age-old question. Middle East Fertil Soc J. 2021;26(1):15. doi: 10.1186/s43043-021-00061-8.
9. Swain JE, et al. Microdrop preparation factors influence culture-media osmolality, which can impair mouse embryo preimplantation development. Reprod Biomed Online. 2012;24(2):142–147. doi: 10.1016/j.rbmo.2011.10.008.
10. Coticchio G, et al. Fertility technologies and how to optimize laboratory performance to support the shortening of time to birth of a healthy singleton: a Delphi consensus. J Assist Reprod Genet. 2021;38(5):1021–1043. doi: 10.1007/s10815-021-02077-5.
11. Racowsky C, et al. Standardization of grading embryo morphology. Fertil Steril. 2010;94(3):1152–1153. doi: 10.1016/j.fertnstert.2010.05.042.
12. Bormann CL, et al. Consistency and objectivity of automated embryo assessments using deep neural networks. Fertil Steril. 2020;113(4):781–787.e1. doi: 10.1016/j.fertnstert.2019.12.004.
13. Heitmann RJ, et al. The simplified SART embryo scoring system is highly correlated to implantation and live birth in single blastocyst transfers. J Assist Reprod Genet. 2013;30(4):563–567. doi: 10.1007/s10815-013-9932-1.
14. Kemper JM, et al. Should we look for a low-grade threshold for blastocyst transfer? A scoping review. Reprod Biomed Online. 2021;42(4):709–716. doi: 10.1016/j.rbmo.2021.01.019.
15. Burns T, et al. Do patient factors influence embryologists' decisions to freeze borderline blastocysts? J Assist Reprod Genet. 2020;37(8):1975–1997. doi: 10.1007/s10815-020-01843-1.
16. Hammond ER, et al. Should we freeze it? Agreement on fate of borderline blastocysts is poor and does not improve with a modified blastocyst grading system. Hum Reprod. 2020;35(5):1045–1053. doi: 10.1093/humrep/deaa060.
17. Gwet KL. irrCAC: computing chance-corrected agreement coefficients (CAC). 2019.
18. Bori L, et al. Could the EMA artificial neural network grade blastocysts as an embryologist? Fertil Steril. 2021.
19. Patrizio P, et al. The changing world of IVF: the pros and cons of new business models offering assisted reproductive technologies. J Assist Reprod Genet. 2022;39(2):305–313. doi: 10.1007/s10815-022-02399-y.
20. Storr A, et al. Inter-observer and intra-observer agreement between embryologists during selection of a single day 5 embryo for transfer: a multicenter study. Hum Reprod. 2017;32(2):307–314. doi: 10.1093/humrep/dew330.
21. Baxter Bendus AE, et al. Interobserver and intraobserver variation in day 3 embryo grading. Fertil Steril. 2006;86(6):1608–1615. doi: 10.1016/j.fertnstert.2006.05.037.
22. Bori L, et al. The higher the score, the better the clinical outcome: retrospective evaluation of automatic embryo grading as a support tool for embryo selection in IVF laboratories. Hum Reprod. 2022;37(6):1148–1160. doi: 10.1093/humrep/deac066.
23. Zhu HB, et al. Culturing surplus poor-quality embryos to blastocyst stage have positive predictive value of clinical pregnancy rate. Iran J Reprod Med. 2014;12(9):609–616.
24. Song J, et al. Predictive value of the number of frozen blastocysts in live birth rates of the transferred fresh embryos. J Ovarian Res. 2021;14(1):83. doi: 10.1186/s13048-021-00838-5.

Articles from Journal of Assisted Reproduction and Genetics are provided here courtesy of Springer Science+Business Media, LLC
