Front Digit Health. 2022 Sep 2;4:958284. doi: 10.3389/fdgth.2022.958284

Table 2.

Overview of gaps in best practices for model maintenance. For each domain, guiding questions are listed with their associated gaps/needs.

Maintenance policies
How should model ownership impact local control over maintenance?
  • Policies establishing updating expectations for proprietary models
  • Clarity and fairness of local updating opportunities for proprietary models
  • Prototypes for establishing collaborative updating of models owned by multiple health systems

How do we ensure comparable performance across demographic groups is sustained during the maintenance phase?
  • Guidance on whether and when changes in model fairness warrant pausing AI-enabled tools
  • Methods for addressing fairness drift when model performance deteriorates differentially across subpopulations
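
The fairness-drift gap above is partly a tooling question. As one hedged illustration of what subgroup monitoring could look like, the sketch below tracks the true-positive rate per demographic group and raises an alert when the between-group gap widens beyond the baseline gap plus a tolerance. The function names (`subgroup_tpr`, `fairness_drift_alert`), the choice of TPR as the metric, and the 0.05 tolerance are illustrative assumptions, not recommendations from the article.

```python
def subgroup_tpr(labels, preds, groups):
    """True-positive rate per subgroup from binary labels/predictions.

    Hypothetical helper for illustration: labels/preds are 0/1 lists,
    groups is a parallel list of subgroup identifiers.
    """
    rates = {}
    for g in set(groups):
        tp = sum(1 for y, p, gg in zip(labels, preds, groups)
                 if gg == g and y == 1 and p == 1)
        pos = sum(1 for y, gg in zip(labels, groups) if gg == g and y == 1)
        rates[g] = tp / pos if pos else float("nan")
    return rates


def fairness_drift_alert(baseline_gap, current_rates, tolerance=0.05):
    """Alert when the max-min TPR gap grows beyond baseline + tolerance."""
    vals = [v for v in current_rates.values() if v == v]  # drop NaN groups
    current_gap = max(vals) - min(vals)
    return current_gap > baseline_gap + tolerance, current_gap
```

In practice the metric, the subgroup definitions, and the tolerance would all need clinical and ethical input, which is exactly the guidance the table identifies as missing.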

How do we communicate model changes to end users and promote acceptance?
  • Design of effective communication strategies for warning end users of model performance drift and for informing users when updated models are implemented
  • Guidance on aligning messaging with end-user AI literacy

Performance monitoring

At what level should model performance be monitored and maintained?
  • Guidance on aligning monitoring and maintenance with use case needs
  • Recommendations for handling monitoring in smaller health systems, including determining minimum sample sizes and methods for collaborative monitoring
  • Policies supporting collaborative model maintenance in low-data-resource settings
  • Guidance on managing interim periods of local performance drift between releases of proprietary models that cannot be locally updated

What aspects of performance should be monitored?
  • Generalizable recommendations on frequency and sample sizes for measuring performance across a variety of metrics
  • Customizable and expandable tools to monitor a matrix of metrics
  • Guidelines for aligning metrics of interest with use case needs
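
A "customizable and expandable matrix of metrics" suggests a registry-style design, where new metrics can be added per use case without changing the monitoring loop. The sketch below is a minimal version of that idea; the names (`METRICS`, `monitor`) and the four starter metrics are illustrative assumptions.

```python
def _safe_div(n, d):
    """Division that returns NaN rather than raising on an empty cell."""
    return n / d if d else float("nan")


# Expandable registry: each entry maps a metric name to a function of
# the confusion-matrix counts. A use case can add or remove entries.
METRICS = {
    "sensitivity": lambda tp, fp, tn, fn: _safe_div(tp, tp + fn),
    "specificity": lambda tp, fp, tn, fn: _safe_div(tn, tn + fp),
    "ppv":         lambda tp, fp, tn, fn: _safe_div(tp, tp + fp),
    "accuracy":    lambda tp, fp, tn, fn: _safe_div(tp + tn,
                                                    tp + fp + tn + fn),
}


def monitor(labels, preds, metrics=METRICS):
    """Compute every registered metric from binary labels/predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return {name: f(tp, fp, tn, fn) for name, f in metrics.items()}
```

Which metrics belong in the registry, and how often to compute them, is precisely the use-case-alignment guidance the table flags as missing.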

How do we define meaningful changes in performance?
  • Frameworks for selecting drift detection methods
  • Guidance on establishing clinically acceptable ranges of performance and defining clinically relevant decision boundaries
  • Methods for tailoring drift detection algorithms to detect clinically important changes
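
One way to "tailor a drift detection algorithm to a clinically important change" is to parameterize a standard detector with a clinically chosen shift size. The sketch below does this for a one-sided CUSUM over per-window error rates, setting the allowance to half the clinically important shift `delta`; the function name, the windowed-error framing, and the decision limit `h` are illustrative assumptions, not a method from the article.

```python
def cusum_alarm(window_errors, target, delta, h=4.0):
    """One-sided CUSUM for upward drift in a per-window error rate.

    target: acceptable (in-control) error rate
    delta:  smallest clinically important increase in error rate
    h:      decision limit, in units of delta (a tuning assumption)

    Returns the first window index at which the CUSUM statistic exceeds
    h * delta, or None if no alarm is raised.
    """
    k = delta / 2.0  # allowance: half the clinically important shift
    s = 0.0
    for i, e in enumerate(window_errors):
        # accumulate only excess error beyond target + allowance
        s = max(0.0, s + (e - target - k))
        if s > h * delta:
            return i
    return None
```

Choosing `delta` (what change matters clinically) and `h` (how much evidence to demand before alarming) is exactly where the table's call for clinically acceptable ranges and decision boundaries bites.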

Are there other aspects of AI models that we should monitor, in addition to performance?
  • Approaches to systematically surveil external factors that may impact model inputs and to monitor input data distributions
  • Guidance on when to update in response to changes in model inputs if performance remains stable
  • Systems for disseminating information on changes anticipated to affect common AI models
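
Monitoring input data distributions, separately from performance, is often done with a divergence statistic between a baseline sample and a current sample of each feature. As a hedged sketch, the function below computes a Population Stability Index (PSI) over baseline-derived bins; the bin count and the common rules of thumb (roughly 0.1 to watch, 0.25 to act) are conventions from industry practice, not thresholds given by the article.

```python
import math


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one input.

    Bins are derived from the baseline's range; out-of-range current
    values are clipped into the edge bins. eps avoids log(0) for empty
    bins. Larger PSI means a larger distribution shift.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def frac(sample):
        counts = [0] * n_bins
        for x in sample:
            idx = min(int((x - lo) / width), n_bins - 1)
            counts[max(idx, 0)] += 1
        return [c / len(sample) + eps for c in counts]

    b, c = frac(baseline), frac(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

Note this flags shifted inputs even when downstream performance looks stable, which is the situation the second bullet asks for guidance on.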

Model updating

What updating approaches should be considered?
  • Approaches to optimizing update-method selection based on the performance characteristics most relevant to use case needs
  • Expanded suites of testing procedures that cover more updating methods and improve computational efficiency
  • Guidance on defining acceptable performance, and methods to determine which updating methods, if any, restore acceptable performance
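
To make the notion of "updating methods" concrete, the sketch below implements one of the simplest options from the model-updating literature, recalibration in the large: shift the model's intercept on the logit scale so that average predicted risk matches the observed event rate, leaving discrimination untouched. The function names and the moment-matching shortcut (matching the mean logit rather than refitting an intercept by maximum likelihood) are illustrative assumptions.

```python
import math


def _logit(p):
    return math.log(p / (1 - p))


def _inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


def recalibrate_in_the_large(preds, labels):
    """Intercept-only recalibration of predicted risks.

    preds: predicted risks strictly in (0, 1); labels: observed 0/1
    outcomes. Returns the logit-scale shift and a function mapping an
    original predicted risk to its recalibrated value.
    """
    observed = sum(labels) / len(labels)
    mean_logit = sum(_logit(p) for p in preds) / len(preds)
    # Shift chosen so the average logit matches the observed rate's
    # logit -- a simple approximation to refitting the intercept.
    shift = _logit(observed) - mean_logit
    return shift, lambda p: _inv_logit(_logit(p) + shift)
```

Heavier options (slope recalibration, partial or full refitting) trade more flexibility for more data and validation burden, which is why the table calls for methods to decide which option, if any, restores acceptable performance.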

Should clinically meaningful or statistically significant changes in performance guide updating practice?
  • Guidance on whether to update models when a statistically significant improvement is achievable but would not be clinically meaningful
  • Methods for comparing updating options that incorporate tests for both statistical and clinical significance
  • Recommendations for decision-making when available updating methods do not restore performance to acceptable levels
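
A decision rule that incorporates both kinds of significance can be sketched directly: accept an update only if the improvement is statistically supported (here, a bootstrap confidence interval excluding zero) and exceeds a minimal clinically important difference (MCID). The function name, the paired per-case accuracy framing, and the specific CI construction are illustrative assumptions, not the article's method.

```python
import random


def should_update(correct_old, correct_new, mcid, n_boot=2000, seed=0):
    """Paired comparison of old vs. updated model on the same cases.

    correct_old / correct_new: per-case 0/1 correctness indicators.
    mcid: minimal clinically important difference in accuracy.
    Returns True only if the bootstrap 95% CI for the accuracy gain
    excludes zero AND the point estimate meets the MCID.
    """
    rng = random.Random(seed)
    n = len(correct_old)
    diffs = [b - a for a, b in zip(correct_old, correct_new)]
    point = sum(diffs) / n
    boots = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boots.append(sum(sample) / n)
    boots.sort()
    lower = boots[int(0.025 * n_boot)]  # 2.5th percentile of the gain
    return lower > 0 and point >= mcid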

How do we handle biased outcome feedback after model implementation?
  • Recommendations for assessing outcome feedback when effective AI-enabled interventions alter the outcomes the model is evaluated against
  • Methods for model development, validation, and updating that are robust to confounding by the intervention