Published in final edited form as: J Manuf Sci Eng. 2022;144(7). doi: 10.1115/1.4053155

Procedural Guide for System-Level Impact Evaluation of Industrial Artificial Intelligence-Driven Technologies: Application to Risk-Based Investment Analysis for Condition Monitoring Systems in Manufacturing

Michael Sharp, Mehdi Dadfarnia, Timothy Sprock, Douglas Thomas

Abstract

Industrial artificial intelligence (IAI) and other analysis tools with obfuscated internal processes are growing in capability and ubiquity within industrial settings. Decision-makers share their concern regarding the objective evaluation of such tools and their impacts at the system level, facility level, and beyond. One application where this style of tool is making a significant impact is in Condition Monitoring Systems (CMSs). This paper addresses the need to evaluate CMSs, a collection of software and devices that alert users to changing conditions within assets or systems of a facility. The presented evaluation procedure uses CMSs as a case study for a broader philosophy evaluating the impacts of IAI tools. CMSs can provide value to a system by forewarning faults, defects, or other unwanted events. However, evaluating CMS value through scenarios that did not occur is rarely easy or intuitive. Further complicating this evaluation are the ongoing investment costs and risks posed by the CMS from imperfect monitoring. To overcome this, an industrial facility needs to regularly and objectively review CMS impacts to justify investments and maintain competitive advantage. This paper’s procedure assesses the suitability of a CMS for a system in terms of risk and investment analysis. This risk-based approach uses the changes in the likelihood of good and bad events to quantify CMS value without making any one-time point-wise estimates. Fictional case studies presented in this paper illustrate the procedure and demonstrate its usefulness and validity.

Keywords: manufacturing simulation, evaluating algorithms, cost-benefit analysis, industrial AI tools, condition monitoring systems, inspection and quality control, modeling and simulation, monitoring and diagnostics

1. Introduction

As industrial artificial intelligence (IAI) and other generalized algorithms expand the possibilities of analytic and decision-making tools in physical industry, so too grows the awareness of the need to critically evaluate these tools across multiple levels of impact on assets, systems, facilities, and beyond. The philosophy of evaluating these tools should never be entirely agnostic to the domain and specific application; however, some aspects hold in all evaluations:

  • Evaluations should take place at different levels of influence or impact.

  • Evaluations should also be interpretable in terms relatable to end-users, decision-makers, and other stakeholders that may not have intimate knowledge of the tool’s architecture or development.

  • All evaluation processes should be iterative, qualified, and current: knowing where evaluations are and are not valid is as important as the output of the evaluation itself.

Condition monitoring for reliability is one area where sophisticated IAI software and devices are actively expanding [1,2]. These autonomous or semi-autonomous tools use sophisticated devices to measure the state of an asset and extract patterns of good versus bad behavior. Their often-obfuscated inner logic is typically built with some combination of physics, historical observations, human understanding, advanced analytics, or artificial intelligence [3]. Recently, machine learning and IAI have grown in appeal as the industrial internet of things (IIOT) provides new opportunities for data generation, collection, and curation [4].

A condition monitoring system (CMS) is a collection of software and devices that monitor the state of one or more assets and respond or alert to detrimental changes. These systems fall under the umbrella of asset performance measurement and management systems, which aid asset operators in achieving sustainable asset productivity with a maximized return on investment [5]. A CMS assesses the operations of a process or equipment to aid high-level planning and rapid response to threats to the asset. This planning and threat response can prevent failures and identify problems to help facilities avoid unexpected downtime and costly repairs.

Adoption of any IAI tool needs to be driven by demand at the system level or above. The higher the tier from which that demand originates, the more substantial the potential justification for the tool and the broader the range of impacts the evaluation must cover. Thus, the evaluation goes beyond asking what a tool must do against some performance standard and instead frames the question around the desired system-level (or higher) outcomes. Part of the reason such higher-level evaluations are necessary for tools with obfuscated processes (such as IAI) is that many base-level performance metrics can be misleading and provide an incomplete picture of the tool's true value or merit.

In this light, CMSs can be evaluated based on their economic justification and profitability to a facility [6–10]. The range of mechanisms for realizing an economic benefit from applying a CMS, or any IAI-driven technology, depends strongly on the specific application area and domain. For example, the implementation process for a CMS discussed in Ref. [6] focuses on assuring managers that cost savings exceed the cost of condition monitoring. The study in Ref. [7] makes similar implementation considerations, with a list of use cases that have seen economic benefits from using a CMS. An analysis of the life-cycle costs of wind power systems across different maintenance strategies finds that a CMS benefits maintenance planning and can cover the costs of implementing the CMS itself [8]. The research in Refs. [9,10] considers returns on investment from health monitoring applications for light-emitting diode (LED) lighting systems and multifunctional displays used in aircraft, respectively. The existing literature either focuses on profitability only for specific applications or case studies, or does not trace the risk-mitigating impact of the CMS to its economic benefits [11].

In terms of economic benefits, estimating only the direct value an IAI tool such as a CMS provides is not always enough to justify the investment. After estimating that value, calculating additional investment analysis or returns metrics can help decide whether the tool is the right choice. These calculations should account for both immediate and long-term costs and benefits, and can be done on various time scales or in various cost units so the justification is expressed in real, usable terms. Additionally, calculating investment analysis metrics for multiple alternatives allows the intuitive comparisons that are vital for decision-makers.

The remainder of this paper provides a recommendation guide for quantifying the investment returns of a CMS in terms of risk reduction and dollars saved or lost. We present a process to determine how long a CMS must successfully operate to make back its initial investments. We offer guidance on finding information about critical equipment and incorporating this into your estimations. Finally, we explore the mathematical foundations of risk and investment analysis metrics and suggest simulation tools of various levels of fidelity to evaluate the CMS. This paper aims to help practitioners decide where, how, and if investing in CMS is right for their facility within a framework conducive to evaluating any IAI tool at the system impact level and above.

2. Investment Analysis for Condition Monitoring Systems From a Risk Evaluation Perspective

The value a CMS provides to a system or asset relates directly to the total reduction of risk caused by that CMS. A CMS is an inherently preventative system, and its value must be assessed in the context of this function. This means that any evaluation must fundamentally assess the likelihood or frequency of undesirable scenarios both with and without the CMS.

This section elaborates on the topics of asset risk and CMS as a risk mitigation tool. Here, we discuss the importance of identifying and quantifying the risks of the system. These are crucial for evaluating the risk-mitigating effects a CMS has on the system. Because the value of the CMS comes from this risk mitigation, it must be accurately quantified as part of an investment returns analysis of applying the CMS onto the asset.

Figure 1 summarizes the high-level steps for evaluating the potential benefits of using a CMS.

Fig. 1. High-level process to evaluate viability of CMS for a particular system

2.1. Identifying Asset Risks and Condition Monitoring System Suitability.

No asset, system, or enterprise is without risk of failure [12,13], and identifying that risk is critical for developing mitigating strategies. When evaluating where, how, and if to apply a CMS, you must understand the risk scenarios the CMS will help mitigate. It is critical to identify undesirable scenarios, failure modes, and hazards that contribute to an asset’s risk. This not only determines if the CMS is capable of mitigating the identified risk scenarios but also serves as the foundation for understanding the baseline level of risk of the asset. Both of these are needed for determining the value of applying the CMS.

Performing a risk analysis involves defining each undesirable risk scenario you want to consider. According to Kaplan et al. [14], a risk analysis must answer three fundamental questions:

  1. What can happen (i.e., what can go wrong?)

  2. How likely is it that that will happen?

  3. If it does happen, what are the consequences?

From this, we can develop the common expression defining risk as the probability of an undesirable scenario times the impact of that scenario:

Risk = (Probability or Frequency) × (Consequences)  (1)

There are many different methods and processes for identifying and documenting hazards and sources of risk. They include hazard and operability analysis, failure modes and effects analysis, hierarchical trees, checklists, hazard index methods, and the more quantitative fault and event tree analyses [15–17]. Although each provides different information, most relay information relevant to both scenario costs and likelihoods (or frequencies), the two fundamental variables that define risk. Different facilities or industries may have access to different sources of hazard or risk information.

Although a CMS can prevent financial losses when used correctly, there are two major considerations before investing in a CMS. First, does the process or system in question have the risk of undesirable events that could be lowered by monitoring? Second, is the CMS being reviewed able to lower those risks to a point that justifies its installation and investment? This is similar to asking, can monitoring help, and if so, how much?

Not every asset benefits from monitoring. Some assets do not lend themselves well to monitoring due to inaccessibility or a lack of observable degradation symptoms. Some processes or equipment faults do not strongly affect operations or can safely and economically run to failure. Yet other assets may be so robustly constructed relative to their use that the chance of failure is practically zero. Only assets that can be monitored and whose faults can negatively impact operations are candidates for the economical investment of a CMS.

Equally important is evaluating or estimating the efficacy of the CMS on your specific system, particularly if that CMS is built on “black-box” style data-driven machine learning (ML) technologies. Unwary decision-makers may adopt these or other “one-size-fits-all” solutions assuming the tools will positively impact their enterprise, while lacking the internal expertise to properly evaluate them. In many cases, insufficient analysis stems from a lack of specifically tailored procedures or a misunderstanding of how to test the tool correctly. Such hurdles make it difficult for managers to quantify and predict a tool's effects on company assets prior to investing [18].

It is critical to understand the real impacts of a CMS on your system when considering investing. If the benefits, costs, and risks of a CMS are not well understood, there is a strong potential for purchasing unnecessary or under-justified solutions, misapplying solutions, or misinterpreting their results. To avoid these negative scenarios, we suggest philosophies and practices that refocus testing and evaluation on impacts to the system, which are both practically more important and easier to understand for those not immersed in the field of analytics.

2.2. Measuring and Quantifying the Effects of a Condition Monitoring System on Risks.

CMSs can be based on a variety of architectures and provide any number of metrics or indications about the asset they are monitoring. Fault diagnosis from a CMS typically provides a user with varying levels of information to detect, locate, and identify a fault [19]. Ultimately, they must provide a user with the knowledge that there is or is not a threat to the system, prompting the user to take either preventive or corrective control actions [20]. The ability to enact these control actions should reduce the overall risk of the system. The value of the CMS comes from the difference between the system’s risks with and without the actions taken on the basis of CMS indicators [21].

Every system or asset has many associated risks, which need to be expressed in common terms to be evaluated together. Adding together the risk for each scenario can give a clear picture of the total risk associated with that asset. However, simple addition of risks may not make logical sense if the consequences are all in different terms or units; for example, some consequences reduce worker safety, others produce unusable products, others cause unforeseen downtime, and so on. What is needed is a common “cost” unit attached to the severity of the consequences in every risk or undesirable scenario. Risks, the product of a scenario's likelihood and consequences, then take on this common “cost” unit so they can be compared objectively across scenarios.

In industrial facilities, risks can be quantified in monetary units [22,23]: potential money lost and potential money gained. Fundamentally, any industrial facility may be thought of as a business, and business risk is quantified in terms of expected and lowered profits. These losses have explicit monetary values in every scenario. For example, unforeseen downtime incurs monetary costs from additional labor and repair while also lowering potential gains through reduced output.

A failure mode and effects analysis (FMEA) is a common document used to identify and quantify risks that a CMS can impact [24]. In this inductive, bottom-up approach, the process reviews a system and its components to identify potential hazards and failure modes (that can become undesirable scenarios). Then, each failure mode is reviewed to document the effects (costs, lost hours, regulatory non-compliance, etc.) they have on the system.

There are variants of the FMEA’s hazard identification process that offer some combination of quantitative [25] and qualitative [26] measures of risk. Some also provide a risk or criticality matrix to visualize the severity and criticality of an asset’s individual or aggregate failure modes. In addition, information about asset risk can identify needed risk control strategies, such as the inclusion of more CMS capabilities, safeguarding actions taken based on CMS indicators, or the frequency of on-machine measurements of an operation [27].

Another potential source of risk-related information is a reliability-centered maintenance (RCM) program. RCM provides a framework for deciding the configuration and application of appropriate tasks for risk control strategies. Resonating with the work presented here, RCM even calls for selecting the most cost-effective task when there are feasible alternatives. In fact, an RCM process often makes use of failure modes, effects, and criticality analysis (FMECA), a variant of FMEA that emphasizes the criticality of failures via a combination of their severity and probability of occurrence [28]. However, RCM does not emphasize or consider the justification for the costs of maintenance strategies that require investment in software and instrumentation, and it has no generalized analytical approach for making such justifications [29]. Therefore, although potentially useful in the evaluation of a CMS, RCM does not fully cover the economic concerns and system-level impacts of investing in such a tool. Standards exist detailing the framework for these analyses and the type of information provided [25,26,30–33].

In some cases, explicit values of risk or related numbers are simply unavailable. This leads to some documents or personnel providing less definitive values such as expressions of “high,” “low,” or “moderate”. Where more qualitative values are reported for risk without explicit number values, an “intuitive range” or distribution of values can be assigned to allow risk calculations. The accuracy of these inferred values need not be perfect so long as they generally represent the system and are consistent across the analysis.
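To make this concrete, here is a minimal Python sketch of one way to attach numeric bounds to qualitative labels; every range below is a hypothetical placeholder of our own, not a value from the paper, and should be replaced with figures that represent your system.

```python
# Minimal sketch: map qualitative likelihood labels to assumed numeric
# ranges (occurrences per year). All ranges are hypothetical placeholders.
QUALITATIVE_FREQUENCY_RANGES = {
    "low": (0.1, 1.0),        # assumed: roughly once per decade to once per year
    "moderate": (1.0, 12.0),  # assumed: up to about monthly
    "high": (12.0, 52.0),     # assumed: monthly to weekly
}

def frequency_bounds(label: str) -> tuple[float, float]:
    """Return (best-case, worst-case) occurrence frequencies for a label."""
    return QUALITATIVE_FREQUENCY_RANGES[label.lower()]

# Example: a "moderate" scenario is analyzed at both ends of its range,
# supporting the best/worst case bounding analysis recommended in this guide.
best_case, worst_case = frequency_bounds("moderate")
```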

Although the implied goal that inspires the use of a CMS is to mitigate risk, even the CMS itself has inherent associated risks that should be considered in judging the value it may present to an asset. Like the inherent asset risks, quantifying the risks of a CMS relies on finding the undesirable outcomes that the CMS might produce. The most intuitive undesirable CMS scenarios come from false alarms, missed alarms, and incorrect alarms. You must also consider risks to the CMS itself in terms of needing maintenance or as a subsystem capable of failure (a sensor may malfunction, a behavior model may drift, etc.). By understanding the risks of a CMS and the real-world impact or costs associated with them, we can obtain a quantified value of the risk for using that CMS.

In an ideal situation, the inherent risks of a CMS will be outweighed by the reduced risks in the assets it monitors, resulting in a net positive for the system being monitored. Only by understanding all the effects that a CMS has on and brings to a system in terms of risk can you begin to quantify the value of that CMS for the system.

2.3. Calculating Condition Monitoring System Value and Investment Returns.

Judgments for using a CMS must be made by comparing the level of risk with and without using the CMS. By finding the change in risk provided by a CMS, we can establish the ongoing value it provides to a facility. From the view that lowering a negative is a net positive, lowering risk adds value to a system over time. This value is the difference between projected losses with and without a CMS.

“Without a CMS” does not mean without any maintenance strategy; it refers instead to whatever is currently operating on your system, which could even be another, currently operating CMS. The key is to be able to estimate or understand the rates of correct, unnecessary, unwanted, and missed alerts or maintenance actions in comparison to those that would occur with the CMS under evaluation in place. More detail on this is presented later.

The fundamental principle of this work with regard to the value of a CMS is two-fold. First, the value added to a facility by implementing a CMS can be quantified as the change in risk it provides to the asset or system. Second, the value of a CMS is inseparably linked to the system, asset, or facility where it will be used.

Understanding the expected risks to profit from a facility with and without a CMS can provide an objective function of value provided by the CMS over time. This, along with the up-front and ongoing costs of deploying and maintaining the CMS, shows the expected investment returns in a way that is both easily understood by investors and can be used to objectively compare different CMSs. Comparing candidate CMS tools is useful, as some tools may be more suited to some systems or assets than others: one-size-fits-all solutions are rare. Overall, this instills confidence in decisions and makes strong justifications for any actions taken in regard to CMS implementations. The procedures that follow in the next section detail this process. The subsequent section will show examples of the procedures in practice.

3. Procedural Guide

3.1. Outline of Procedures for Conducting an Investment Analysis of a Condition Monitoring System.

The basic outline we propose for determining the returns of a CMS has five steps. These steps could be used for any tool or system designed to lower some risk in a facility but are tailored to the example of a CMS in this section.

  1. Determine the baseline risk to the system without the CMS.

  2. List the costs of installing and operating the CMS.

  3. Assess the risks of operating the CMS.

  4. Estimate the value of the CMS applied to the target system.

  5. Conduct a risk-based investment analysis with metrics used by businesses to evaluate investments, such as net present value (NPV) and internal rate of return (IRR) [34].

The end goal of these steps is to have an accept/reject decision regarding investment in a CMS. These steps can also compare the investment returns of a CMS against alternative CMS options and accordingly proceed with the investment decision.

The level of detail that goes into each of these steps should be directly proportional to the asset importance and potential gain that the CMS may provide. For a full in-depth analysis of the CMS, these steps may be applied iteratively, adding more detail to each pass-through. The outlined steps in this paper do not go into the deepest levels of detail that may be required for very expensive investments or critically important assets, but the intuitive extension of the steps can apply to produce an analysis at any level of fidelity.

Many of the steps will likely require estimations or approximations of some values. This is largely acceptable and will not invalidate the results of the analysis. In many cases, it is impossible to know or represent perfectly some of the values required for the risk calculations. So long as best efforts are made to include reasonable values, good insights and information can still be obtained. In cases of uncertain values, we strongly recommend performing the analysis with reasonable “best” and “worst” case values to bound the investment analysis.

An important implicit pre-step in conducting an investment analysis of a CMS is to verify that the CMS has the capabilities to monitor the undesirable events identified in Step 1. If it cannot, then it has no value to the target system and does not require further analysis. If it can only monitor for some of the undesirable scenarios, you can still complete the analysis. The ultimate measure of value and return of the CMS comes from lowering the total risk of the asset it monitors. This does not imply avoiding all risk scenarios, though that may be an additional requirement.

3.1.1. Step 1: Determine a Baseline or Current Level of Risk Associated With the Target Asset (Without the Condition Monitoring System).

Establishing a baseline risk requires knowledge about the undesirable scenarios that may occur in a system or asset. It is vital to estimate the occurrence rate and severity of these scenarios. Hazard identification sources, such as the FMEA discussed in the previous section, hold information about these scenarios. Additional sources of information come from worker experience, maintenance logs, and original equipment manufacturer (OEM) manuals and specification sheets. Together, these sources build a maintenance and failure taxonomy of the asset and its components that provides the symptoms and frequencies of the undesirable scenarios for the asset [35].

After establishing the list of undesirable scenarios that can befall an asset, remove from the list any scenarios that the CMS is incapable of impacting. This involves looking at the symptoms that precede each scenario to determine if the CMS can detect them with enough time to avoid the undesirable outcome. The goal here is to exclude any scenarios from the subsequent analysis that cannot be detected with enough warning for actions to be taken. Any scenario that the CMS cannot affect will have no implications for the value and investment analysis of that CMS.

This is a good time to verify that all imperative detection scenarios are detectable by the CMS with an acceptable lead time. Imperative detection scenarios are those that you deem necessary for the CMS to be able to detect. Often, these are high-risk scenarios that cannot be allowed to go unalerted or go without mitigating action. If any imperative detection scenarios are not sufficiently managed by the CMS, this may be justification for not using the CMS or investing in additional mitigation methods.

The list of undesirable events does not need to be exhaustive, but it does need to be comprehensive: it should focus on the high-risk scenarios that are relevant to and detectable by the CMS. So long as the list includes all high-risk and imperative detection scenarios, minor scenarios or those not directly affected by the CMS can be omitted. Additional scenarios that may not be affected by the CMS may still need to be included if you are comparing multiple CMS options. When comparing the value of multiple CMSs, it is important to use the same list of scenarios across each analysis to ensure equivalent comparative values.

With the list populated, the risk associated with each undesirable scenario is next calculated by multiplying its probability of occurrence by the monetary cost associated with the severity of its consequences.

An informal survey of modern research shows that the most common failure metrics relate to mean time between occurrences or mean time to failure rather than probability of occurrence; these metrics provide a relatively simple measure for comparing and prioritizing failures [36]. Many do not explicitly invoke a distribution of occurrence, but instead use a single descriptive value and largely ignore any uncertainty associated with that value during operations and scheduling. Although not perfectly accurate, these singular “expected values” can be converted to an expected number of occurrences per unit time (i.e., a frequency of occurrence) that is an acceptable substitute for probability in our risk model. An example is a line breakage with a mean time between failure (MTBF) of four months. This translates to 0.25 shutdowns per month, or three shutdowns per year (1 break / 4 months MTBF = 0.25 failures per month). So long as you calculate all the frequencies of undesirable scenarios in the same base units (e.g., per year), the relative risk calculations provide the desired information. For simplicity in demonstrating this procedure in the next section, we will refer to all examples in terms of frequency of occurrence.
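A small helper makes this conversion explicit; this is an illustrative sketch of ours, not code from the paper:

```python
def mtbf_to_frequency(mtbf_months: float, window_months: float = 1.0) -> float:
    """Convert a mean time between failures (in months) into an expected
    number of occurrences per time window (also in months)."""
    return window_months / mtbf_months

# The line-breakage example from the text: MTBF of four months.
per_month = mtbf_to_frequency(4.0)                      # 0.25 failures/month
per_year = mtbf_to_frequency(4.0, window_months=12.0)   # 3 failures/year
```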

The authors acknowledge the relationship between frequency and probability, as well as how concepts such as combining uncertainty [37] and fuzzy measures [38] can be applied to risk. Due to space limitations, they will not be covered here, but those interested can find many resources on these concepts [39].

To estimate the monetary cost of a scenario, all actions taken, products lost, resources expended, and lost productivity must be taken into account. The full range of consequences and their equivalent monetary cost must be totaled. Labor hours cost money, and lost productivity is money not gained, etc. The more detailed and comprehensive the monetary cost of a scenario, the more accurate the calculations of risk for that scenario and investment analysis for the CMS.

Finding the monetary contribution of an undesirable scenario resulting from less-direct costs, such as loss of morale or public opinion, may be difficult and is beyond the scope of this paper. The best estimate of these costs may be used as a surrogate if desired.

The final stage of Step 1 is to estimate the total expected future risks and losses from each undesirable scenario without the CMS. This will serve as a baseline value for calculating the relative change for all future investment analysis calculations. This may be done in the form of a simulation, numeric estimation from future plans, or a simple assumption that the facility will behave as it did in the past. The important thing is to have an expression per unit time of each scenario’s frequency or probability as well as its cost that can be projected into the future.

3.1.2. Step 2: List the Costs of Installing and Operating the Condition Monitoring System.

This step needs little explanation. Direct costs to consider are installation, physical equipment, operating costs, maintenance costs, etc. There may also be indirect opportunity costs from resource tie-ups, or other less obvious costs that may require more detailed analysis. For many analyses, those indirect costs are difficult to assess and have little impact on the final analysis.

You may separate the costs into three categories: one-time costs, ongoing static costs, and potential or variable costs over time. One-time costs cover items such as software, installation labor, and equipment. Ongoing static costs are those associated directly with operations, such as operator labor and any service fees. The third category covers less obvious operational costs that occur infrequently over time, such as personnel training or sensor replacements; it is mostly needed only for highly detailed investment analyses.
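One way to organize these categories is sketched below as a Python data structure; the field names are our own illustration, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class CMSCosts:
    """Sketch of the three cost categories described above."""
    one_time: dict = field(default_factory=dict)        # software, installation, equipment
    ongoing_annual: dict = field(default_factory=dict)  # operator labor, service fees
    variable: dict = field(default_factory=dict)        # training, sensor replacement

    def initial_outlay(self) -> float:
        """Total cost at time zero of the investment analysis."""
        return sum(self.one_time.values())

    def annual_outlay(self) -> float:
        """Recurring yearly cost; variable costs folded in as annualized estimates."""
        return sum(self.ongoing_annual.values()) + sum(self.variable.values())
```

Keeping the categories separate preserves the distinction the later investment analysis needs between the initial outlay at time zero and the recurring annual cash outflows.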

The underlying efforts that go into the deployment and operation of a CMS may also play a role in this calculation. Especially if considering developing in-house solutions, the expertise required to develop, maintain, and operate the CMS may be significant. For example, CMSs utilizing physics-based, analytical models tend to require more effort and expertise than deploying CMSs based on black box, machine learning methods [40,41].

3.1.3. Step 3: Assess the Risks of Operating the Condition Monitoring System.

This stage estimates the frequency, consequences, and ultimate cost risks of scenarios with the CMS in place. The foundation list of undesirable scenarios created in Step 1 will be built upon, adding the possible effects the CMS can have on each of those scenarios. Specifically, this stage estimates the cost and the probability or frequency that the CMS will:

  1. correctly identify the undesirable scenario;

  2. provide no alert for that scenario, leaving the problem unnoticed by the CMS;

  3. provide the wrong alert for that scenario, misidentifying it with another problem; and

  4. provide an alert when none of the listed scenarios are happening.

The paper refers to these four CMS outcomes for each undesirable scenario as correct, missed, incorrect or misidentified, and false alerts, respectively. For every undesirable event identified in an asset (in Step 1), a CMS will have different costs and frequencies associated with each of its four possible outcomes.

No CMS is perfect, and not all alerts provided by a CMS are useful or accurate. There is an inescapable trade-off between rates of missed alerts and false or incorrect alerts. The more sensitive the model, the quicker it will be to alert of a problem, but minor perturbations could trigger false or incorrect alerts. Conversely, if the model waits to alert until it is more confident in the problem, there could be costly delays or misses in alerting. A balanced CMS will have some acceptable amount of both.

A starting point to assess the risks associated with operating the CMS is with its developers. You should ask the developers for expected alert frequencies. These alert rates should be specified for, at the minimum, each of the four categories of alerts that correspond to scenarios listed in Step 1. Ideally, they would provide real-world numbers from applying the CMS on assets similar to yours. If not, the developer should be able to provide some expected values for generic cases. Even if these rates are educated guesses, they help assess the risk of a CMS when applied to the asset. There is no way to know these values for certain prior to deployment of the CMS, but the accuracy of these values directly affects the accuracy of the investment analysis.

If a developer cannot provide estimations for the expected rates of these alerts, additional efforts must be made. Simulations, expert judgment, historic data, or generic data from reliability databases are popular ways to get these rate estimates [42]. For quick calculations or in the absence of better information, performing the analysis on “best” and “worst” case values can provide a range of acceptable estimates to move forward with.

Connecting the CMS to a simulation of the target system or asset is a common way to determine the alert rates as the CMS naturally produces the various alerts during testing. Although simulation testing is a powerful tool, remember that the found alert rates will still only be estimates of the true values on the system. Although there are many simulation methods and types, each type has its strengths and weaknesses, and simulation testing is limited by the range and fidelity of the simulation itself. The factors for selecting or creating a proper simulation to test a CMS will be the subjects of future works and are largely left out of this paper.

Simulations generally involve some combination of replaying historic data and digitally recreating assets and environments. Replaying known data collected from the system is an easy method of simulation for testing CMSs, but often these data are not available or all the important scenarios are not represented in sufficient quantities. Digital system recreations are another way to test a CMS and can be tailored to include any and all target scenarios, but the intricacies of development can make this option expensive or difficult if the simulator is not previously available.

If these values are not forthcoming, you should revert to guesses or Fermi estimates. Guessing is the easiest, but least accurate, way to continue the investment analysis. Perform the calculation with three sets of guesses for the different alert rates: perfect or very good values, typical values, and worst-case values. This gives a range of values into which the CMS alert rates are likely to fall. Even when no exact values are known, it is useful to get an idea of the maximum alert rates at which the CMS still provides useful insight. Table 1 shows examples of typical ranges for the rates of the four alert types (see the sketch after the table); note that these values will differ substantially depending on your system requirements.

Table 1.

Example of typical CMS alert rates

Alert type      Good rates  Typical rates  Bad rates
Correct alert   99.99%      93%            60%
Missed alert    0.0001%     5%             10%
Incorrect alert 0.0004%     1%             10%
False alert     0.0005%     1%             20%
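To use Table 1 in a bounding analysis, its three columns can be transcribed as named rate sets and each run through the value calculation of Step 4; the structure below is a hypothetical sketch, with the percentages converted to fractions.

```python
# Alert-rate scenarios transcribed from Table 1 (percentages as fractions).
ALERT_RATE_SCENARIOS = {
    "good":    {"correct": 0.9999, "missed": 0.000001,
                "incorrect": 0.000004, "false": 0.000005},
    "typical": {"correct": 0.93, "missed": 0.05,
                "incorrect": 0.01, "false": 0.01},
    "bad":     {"correct": 0.60, "missed": 0.10,
                "incorrect": 0.10, "false": 0.20},
}
# Running the Step 4 calculation once per scenario bounds the likely value
# of the CMS between the "good" and "bad" outcomes.
```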

It is essential to assess the costs that correspond to each alert outcome. These costs come from the actions and repercussions taken in response to a CMS alert, relative to the actual status of the asset's undesirable scenario. These values may also be estimated through historical data, expert judgment, and Fermi estimates. Both negative and positive alert outcomes will have costs, and it is important to keep all values in terms of money lost or spent. Costs from negative outcomes, such as lost revenue from downtime, unnecessary maintenance costs, and extra labor costs, are all translatable into a monetary impact attributable to the CMS. Positive outcomes such as correctly alerting a scenario also have costs: maintenance, labor, downtime, etc. Ideally, these are lower than the costs of incorrect, missed, or false alerts. Measuring everything in terms of monetary costs helps avoid the confusion sometimes faced when mixing losses and gains. For example, a smaller downtime for maintenance should not be represented as a gain in production time, as that is harder to translate directly into monetary terms.

Furthermore, these four alert types may not be enough to fully represent the consequences faced by every system. For example, consider missed alerts to be any alert that does not occur with enough lead time for an operator to react. The amount of lead time (if any) that the CMS does provide toward avoiding an unwanted scenario may affect the consequences or outcome of that scenario, so consequences may vary with the delay in detection. It may not be as simple as “avoid the scenario” or “all bad consequences occur”: with a 5-min warning, parts of the unwanted scenario could be avoided, while with a 10-min warning, all bad effects could be avoided. In practice, it may be better to distinguish levels of promptness, along with corresponding consequences, for delayed CMS alerts.

The four types of alerts can be split into as many additional categories as needed. Each additional subcategory will need a corresponding alert rate and consequence. The additional subcategory rates and costs can add accuracy to the analysis, but also add complexity and may introduce additional sources of error. Furthermore, enumerating all possible alert types may not be feasible in more complex assets. Adding subcategories of alert types should be done carefully. Primarily focus on alerts with outcomes that have strong differences on the resolution of the scenario affecting the asset, with a lower emphasis on less critical scenarios.

In some cases, rather than report alerts on undesirable scenarios faced by the asset, the CMS will report a continuous signifier of asset health, remaining useful life, or similar. These signifiers also have a probability of being correct, too high, or too low; being too high or too low is equivalent to the CMS's incorrect or misidentified alert rate. The degree to which a CMS's signifier is correct or erroneous can determine the subsequent costs, and leveled categories can capture that degree. Examples of such leveled categories are “sufficiently close,” “minor too high,” “major too high,” “minor too low,” and “major too low”; generally, there are different costs and consequences for being too high versus too low for such metrics.
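One way to implement such leveled categories is to bucket the signifier's relative error; this is a sketch of ours, and the tolerances are illustrative assumptions rather than values from the paper.

```python
def classify_signifier_error(predicted: float, actual: float,
                             minor_tol: float = 0.05,
                             major_tol: float = 0.20) -> str:
    """Bucket a continuous health/RUL estimate into the leveled categories
    described above. Tolerances are illustrative assumptions expressed as
    relative error against the actual value (assumed nonzero)."""
    rel_error = (predicted - actual) / actual
    if abs(rel_error) <= minor_tol:
        return "sufficiently close"
    side = "high" if rel_error > 0 else "low"
    level = "minor" if abs(rel_error) <= major_tol else "major"
    return f"{level} too {side}"

# Example: a remaining-useful-life estimate of 130 h against an actual 100 h.
category = classify_signifier_error(130.0, 100.0)  # "major too high"
```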

3.1.4. Step 4: Estimate the Value of the Condition Monitoring System Applied to the Target System.

This stage evaluates the application of the CMS, or candidate CMS, to the system or asset. The information about both the asset and the CMS found in the previous steps is used to infer the effects of their interaction. Specifically, it uses the frequencies and costs of the CMS’s alert outcomes from Step 3 and compares them to the baseline outcomes of Step 1. The result of this stage should provide an estimate of the total risk over time for an asset when using a CMS.

Possible evaluation methods for this stage include direct rough calculations, generic simulators, specific simulators tailored to the target system, or any combination of the three. Each of these has benefits and disadvantages in terms of cost, effort, and accuracy. The applicability and suitability of each evaluation method may also depend on the CMS and the acquired information from Steps 1 and 3, but that determination is outside the scope of this paper. Iterating through, or using a combination of methods can increase confidence in the final result, and may help provide quick reasonable answers while more in-depth analyses are performed.

3.1.4.1. Rough equations for direct approximation.

Direct rough estimation of the value of a CMS needs the expected cost of each alert type identified in Step 3 when applied to each unwanted scenario. For the four common alert types listed, this means calculating the following for each undesirable scenario listed in Step 1:

  • The expected costs over time associated with a correct alert to the scenario.

  • The expected costs over time for missing an alert about the scenario.

  • The expected costs over time for a misidentified alert indicating a different scenario.

  • Lastly, the expected costs over time for a false alert when there are no undesirable scenarios.

Simplified expressions of risk are sufficient for calculating risk over time. The explicit probabilistic equations to calculate each of those can be quite complex and heavily dependent on the interactions of the target system or asset. However, since many of the numbers used in this analysis are going to be rough approximations, some general simplifications overlooking the more complex interactions can still give a good approximation of the value the CMS would provide.

Continuing with the four common alert types, some simplified equations for a single scenario could be the following:

  • Cost of Correct Alerts Over Time = Correct Alert Rate × Scenario Occurrence Rate × Cost With Alert

  • Cost of Missed Alerts Over Time = Missed Alert Rate × Scenario Occurrence Rate × Cost Without Alert

  • Cost of Incorrect Alerts Over Time = Wrong Alert Rate × Scenario Occurrence Rate × Cost of Wrong Alert

  • Cost of False Alerts Over Time = False Alert Rate × Cost of False Alert

The sum of these four values gives the estimated total risk of that scenario over time. Totaling the monetary risks over time across all undesirable scenarios gives the total risk the asset faces with the CMS deployed. In the next step, this risk over time is compared to the one found in Step 1 to determine the value the CMS provides to the asset.
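These simplified equations translate directly into code; the following Python sketch is our own illustration, with field names of our choosing:

```python
def scenario_cost_risk(occurrence_rate, alert_rates, alert_costs):
    """Simplified risk over time for one undesirable scenario, following the
    four rough equations above. `occurrence_rate` and alert_rates["false"]
    share the same time base (e.g., per day); costs are the monetary
    consequences of each alert outcome for this scenario."""
    return (alert_rates["correct"] * occurrence_rate * alert_costs["correct"]
            + alert_rates["missed"] * occurrence_rate * alert_costs["missed"]
            + alert_rates["incorrect"] * occurrence_rate * alert_costs["incorrect"]
            # Per the text, false alerts occur independently of the scenario.
            + alert_rates["false"] * alert_costs["false"])

def total_cost_risk(scenarios):
    """Total asset risk with the CMS: the sum over all scenarios, each given
    as an (occurrence_rate, alert_rates, alert_costs) tuple."""
    return sum(scenario_cost_risk(*s) for s in scenarios)
```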

3.1.4.2. Using simulators.

Custom simulators can better estimate and provide likely outcomes for using the CMS on the asset. However, they can be time-intensive to create and may still have assumptions and inaccuracies that cause errors in the final reported asset risk. Although simulators are powerful tools and provide greater insight through the flexibility of testing, they are only recommended for conducting an investment analysis of a CMS for high-value systems or assets.

Simulators come in a variety of forms and fidelities. Typical types of simulators for this testing include those that replay historic data, those built on statistics and probabilistic interactions, and those built as digital representations of the target environment. As briefly described previously, there are pros and cons to the various types of simulators, and selecting the one best for your analysis must face the trade-offs between versatility, accuracy, and investment of development.

Probabilistic simulators are a good option that does not require large investments of time or data. These may use numbers such as the rate of failure and the probabilities (or rates) of the various alert categories as inputs to model loose representations of your asset. Using these rates or probabilities with Monte Carlo-style forecasting, you can create a distribution of interactions between the CMS and the target asset, as sketched below.
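A minimal Monte Carlo sketch of this idea follows. The daily-resolution model is a simplifying assumption of ours (at most one failure and one false alert per day), and the numeric inputs are illustrative, loosely based on the dulling failure mode of the paper-mill example in Sec. 4.1.

```python
import random

def simulate_year(days, failure_rate, outcome_rates, outcome_costs,
                  false_rate, false_cost):
    """One replication of CMS/asset interaction: each operating day may see
    at most one failure (resolved as a correct/missed/incorrect alert) and
    at most one false alert. Returns the year's total alert-related cost."""
    total = 0.0
    for _ in range(days):
        if random.random() < failure_rate:
            outcome = random.choices(list(outcome_rates),
                                     weights=list(outcome_rates.values()))[0]
            total += outcome_costs[outcome]
        if random.random() < false_rate:
            total += false_cost
    return total

# Distribution of annual costs over many replications (illustrative inputs).
annual_costs = [
    simulate_year(250, 0.01,
                  {"correct": 0.94, "missed": 0.01, "incorrect": 0.0},
                  {"correct": 250.0, "missed": 550.0, "incorrect": 0.0},
                  0.0005, 85.0)
    for _ in range(10_000)
]
```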

The least labor-intensive simulators are those that replay historical data. Because they use real data, they also give results tailored to your asset. However, as mentioned before, these will be limited to the confirmed scenarios captured within that data. It is uncommon for historical data to exist for all of the Step 1 scenarios, because facilities will want to operate in a way that minimizes these scenarios. Even so, if the risk of some scenarios can be evaluated this way, other simulations or rough calculations for the remaining scenarios can be combined to find the total risk of using the CMS.

Other more complicated simulations may act as a digital replica of the specific asset or system with explicit, detailed physical interactions and even integrated historical data. They can even control for the specific undesirable scenarios in the simulation, providing detailed information about edge case scenarios faced by a system or asset. These simulators are more intensive to create and will normally only be used for assets deemed by the enterprise to have high value or importance.

Regardless of the style of the simulator, it must report the frequency and costs of the CMS’s monitoring alert outcomes, given the undesirable scenarios identified in Step 1. The next step of evaluating the investment value of a CMS captures these monetary risks.

3.1.5. Step 5: Conduct a Risk-Based Investment Analysis

3.1.5.1. Net present value.

Net present value is the present value of all cash inflows less the present value of all cash outflows. Present values are calculated by dividing future cash flows by an interest rate or discount rate to account for the time value of money and inflation. Discount rates are often referred to as hurdle rate, interest rate, cutoff rate, benchmark, or the cost of capital. The discount rate is the minimum rate of return that one might need to engage in a particular investment (e.g., 7% annual return, 9% annual return, or higher/lower).

Net present value is calculated by taking each monetary cost and benefit associated with an investment and adjusting it, using the discount rate, to a common time period, which we will call time zero. The inflows are summed together, and the outflows (costs) are subtracted, resulting in the net present value [34]:

NPV = Σ_{t=0}^{T} (I_t − C_t) / (1 + r)^t  (2)

where I_t is the total cash inflow in time period t; C_t is the total cost in time period t; r is the discount rate; and t is the time period, which is typically measured in years.

Expected values can be entered for cash inflows and outflows to account for various risks such as downtime. You should limit the analysis to a selected study period (e.g., 10 or 20 years). Then, calculate the net present value manually or using a software tool such as the National Institute of Standards and Technology (NIST) Smart Investment Tool [43]. Higher values are better, and a positive value is considered economical. Because these estimations are all relative to the baseline of the target asset, this procedure objectively compares different CMSs. So long as the procedure estimates CMS outcomes on the selected scenarios, their effects on ongoing cost risk provide an objective basis for comparison.
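Eq. (2) is also straightforward to compute directly; this small Python sketch is illustrative and is not the NIST Smart Investment Tool:

```python
def net_present_value(inflows, outflows, discount_rate):
    """NPV per Eq. (2): sum of (I_t - C_t) / (1 + r)^t, where index 0 of
    each list is time zero and each later index is one period further out."""
    return sum((i - c) / (1.0 + discount_rate) ** t
               for t, (i, c) in enumerate(zip(inflows, outflows)))
```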

The total net present value for a CMS can also be analyzed in terms of its contributing factors. This is especially useful in cases where regulations require monitoring, or where some safety- or mission-critical scenario must be avoided. In instances where you have imperative detection scenarios, the total effect of the CMS on those specific scenarios can be viewed in isolation. Based on risk acceptability criteria (such as Refs. [23,44]), there may be undesirable scenarios that pose unacceptable levels of risk even after the application of the CMS. No matter the total net present value of the CMS, the presence of too many high-risk scenarios may justify a search for other risk mitigation options.

Lastly, it is recommended to document the evaluations in this process as well as any resulting risk control strategy decisions based on the evaluations. They can be used as references for future system risk studies and condition monitoring decisions. Many hazard identification frameworks, discussed in Sec. 2, may provide a formalization for such documentation [31].

3.1.5.2. Internal rate of return.

The final step in this procedure is to calculate the internal rate of return. The internal rate of return is the discount rate at which the net present value equals zero:

NPV = 0 = Σ_{t=0}^{T} (I_t − C_t) / (1 + IRR)^t  (3)

In practical terms, this is an annual rate of return. Given the nature of the calculation, it is typically estimated using a software tool such as the NIST Smart Investment Tool [43] or by trial and error. Higher internal rates of return are better, and when the internal rate of return exceeds the decision-maker's discount rate, the investment is considered economical.
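The trial-and-error search can be automated with simple bisection; this sketch of ours assumes a conventional cash-flow profile (an initial outlay followed by net inflows), so the NPV crosses zero exactly once.

```python
def internal_rate_of_return(inflows, outflows, lo=-0.99, hi=10.0, tol=1e-7):
    """Solve Eq. (3): find the discount rate at which the NPV equals zero."""
    def npv(rate):
        return sum((i - c) / (1.0 + rate) ** t
                   for t, (i, c) in enumerate(zip(inflows, outflows)))
    # Bisection: for a conventional investment, NPV is positive near `lo`,
    # negative near `hi`, and decreases monotonically in between.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if npv(mid) > 0 else (lo, mid)
    return (lo + hi) / 2.0
```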

3.2. Summary.

The process described in this section is at its base an estimation technique. When dealing with preventions and “what if” scenarios, it is impossible to say with complete certainty the impact a mitigation action or tool had on a system. To reflect this and better assess the range of possible investment outcomes, the authors recommend altering input values to the analysis to find “best” or “worst” case estimates. This investment analysis procedure is presented as a general guide (Fig. 2), and we expect and recommend adapting it to better reflect concerns of the asset or facility where necessary.

Fig. 2. Summary of procedures to calculate a risk-based net present value and internal rate of return

4. Example Case Study

This section presents two example case studies of calculating a risk-based investment return for a CMS. The first covers a simple single asset; the second emulates a factory facility using a simulation environment. These examples represent fictitious facilities and assets. They demonstrate the principles of conducting an investment analysis of a CMS's mitigating effects on an asset's risks, with variations in the techniques used to estimate the risk from a CMS's alert outcomes.

4.1. Example 1: Paper Mill Cutting Blade.

This example focuses on a simple straight-edged cutting blade used in the final stages of a paper mill. Operating at a cutting speed, or frequency of cut, of 10 Hz, the blade processes approximately 150,000 cuts per day, and each cut results in one ream of paper ready for packing. The line is shut down once per day on average when the blade becomes misaligned or dull. This facility currently uses no warning system or CMS and can only identify a bad blade later in the line, at the quality control checkpoint. A misaligned blade results in an average of 500 unacceptable cuts that must be scrapped. A dull blade, which accounts for 1 in 100 shutdowns, is usually easier to identify early and results in only 300 unacceptable cuts. Preliminary “speed of detection” testing shows that with a CMS, the waste can be reduced to 100 scrapped reams per dulling event and 150 scrapped reams per misalignment. Each instance of dulling requires a $100 sharpening cost, and each instance of misalignment requires a $75 alignment cost. We will consider how valuable a CMS would be to this asset. Note that for simplicity, this example compares the baseline scenario to a single scenario with the CMS. The net present value and internal rate of return, however, could be used to compare additional scenarios; for instance, a third scenario could include a different CMS or more frequent fixed-interval blade inspections.

4.1.1. Step 1: Determine a Baseline or Current Level of Risk Associated With the Target Asset (Without the Condition Monitoring System).

Only two primary failure modes are identified for this asset: dulling and misalignment. From the description, misalignment accounts for 99% of the shutdowns, meaning that on average the frequency of misalignment is 0.99 per day. At a production cost of $1.50 per ream, the cost risk per day is $742.50 in losses for misalignments plus $74.25 in realignment costs, which accumulates to roughly $204k of losses per year including the alignment cost. A similar calculation for dull blades results in a total cost risk of $1375 per year. Table 2 shows an example spreadsheet for calculating these values; each column is labeled at the top, Column A through Column G, with equations below the labels showing how the values are calculated.

Table 2.

Cutting blade baseline cost risk calculation

Column  Quantity                                                        Calculation
A       Failure rate (frequency per day)                                n/a
B       Expected daily lost reams                                       = reams lost per event × A
C       Production cost per product                                     n/a
D       Expected daily production loss                                  = B × C
E       Sharpening/realignment cost per occurrence                      n/a
F       Expected daily maintenance cost                                 = A × E
G       Annual costs and losses (operating 5 days/week, 50 weeks/year)  = (D + F) × 50 × 5

Failure mode   A     B    C      D        E        F       G
Dulling        0.01  3    $1.50  $4.50    $100.00  $1.00   $1,375.00
Misalignment   0.99  495  $1.50  $742.50  $75.00   $74.25  $204,187.50
Total                                                      $205,562.50
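The Table 2 arithmetic is simple enough to verify in a few lines; this sketch is our own restatement of the example, not code from the paper.

```python
# Sketch reproducing the Table 2 baseline risk calculation.
REAM_COST = 1.50          # production cost per ream
OPERATING_DAYS = 50 * 5   # 50 weeks/year at 5 days/week

def baseline_annual_risk(rate_per_day, reams_lost_per_event, repair_cost):
    """Annual cost risk of one failure mode without the CMS."""
    daily_production_loss = rate_per_day * reams_lost_per_event * REAM_COST
    daily_repair_cost = rate_per_day * repair_cost
    return (daily_production_loss + daily_repair_cost) * OPERATING_DAYS

dulling = baseline_annual_risk(0.01, 300, 100.0)      # $1,375.00
misalignment = baseline_annual_risk(0.99, 500, 75.0)  # $204,187.50
baseline_total = dulling + misalignment               # $205,562.50
```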

4.1.2. Step 2: List the Costs of Installing and Operating the Condition Monitoring System.

For simplicity, only the most basic considerations will be made for the costs of the CMS. For this example, the one-time costs of the CMS are software purchase and installation, sensing equipment, and personnel training. There is an ongoing daily cost of $160 that reflects the operational resources, such as man-hours and power consumption. The plant operates 5 days per week for 50 weeks per year, resulting in an operating cost of $30,000 for the CMS. Table 3 lists the numbers for this example.

Table 3.

Cost of CMS for paper mill cutting blade

One-time costs
Software purchase $250,000
Installation $60,000
Equipment $10,000
Personnel training $2500
Total $322,500
Ongoing costs
Operations costs per year $30,000
Total $30,000
Potential/variable costs
N/A $ –
Total $ –

4.1.3. Step 3: Assess the Risks of Operating the Condition Monitoring System.

The proposed CMS uses two dedicated sensors to monitor the blade: one alignment laser and one pressure sensor that monitors the force profile of the cut to watch for dulling. The laser is highly sensitive to misalignments but is subject to false alarms if too much dust accumulates. The pressure sensor is less sensitive and uses AI techniques to learn acceptable cutting profiles, but it can mistake changes in paper quality for changes in blade sharpness.

This CMS is tailored to paper mill cutting blades and has been installed in other facilities, so the false and missed alert rates for the different failure modes can be inferred from those operations. These numbers are listed in Table 4. Generally, you can assume that the “correct” alert rate is 1 minus the sum of the others. Remember, even if these numbers are gross estimations, they can still be used to find reasonable expectations and maximum acceptable limits for each.

Table 4.

Estimated CMS alert rates for paper mill cutting blade

Failure modes False alert rate Missed alert rate Wrong alert rate Correct alert rate
Dulling 0.05 0.01 0 0.94
Misalignment 0.01 0.001 0 0.989

4.1.4. Step 4: Estimate the Value of the Condition Monitoring System Applied to the Target System.

This step multiplies the various alert rates by their associated costs. In this example, each type of alert results in some amount of lost production due to shutdown and maintenance. These examples are oversimplified to illustrate the idea; in a real analysis, you should be as detailed as possible in the assessment of consequences and cost for each fault mode or undesirable scenario.

Note that you only need to account for costs that would change if the CMS alerts. Some repair costs, such as replacement part cost, do not matter for the analysis because they are the same with and without the CMS. A breakdown of the estimated cost risks for using the CMS is shown in Table 5.

Table 5.

Estimated alert cost risks of CMS for paper mill cutting blade

Column  Quantity                                                        Calculation
A       Occurrence rate (frequency per day)                             = failure rate × alert rate^a
B       Lost reams per event                                            n/a
C       Expected daily lost reams                                       = A × B
D       Production cost per product                                     n/a
E       Expected daily production loss                                  = C × D
F       Sharpening, realignment, or assessment cost per occurrence      n/a
G       Expected daily maintenance cost                                 = A × F
H       Annual costs and losses (operating 5 days/week, 50 weeks/year)  = (E + G) × 50 × 5

Failure mode   Alert outcome  A       B    C       D      E        F        G       H
Dulling        Correct alert  0.0094  100  0.94    $1.50  $1.41    $100.00  $0.94   $587.50
Dulling        Missed alert   0.0001  300  0.03    $1.50  $0.05    $100.00  $0.01   $13.75
Dulling        False alert    0.0005  0    0.00    $1.50  $–       $85.00   $0.04   $10.63
Misalignment   Correct alert  0.9791  150  146.87  $1.50  $220.30  $75.00   $73.43  $73,433.25
Misalignment   Missed alert   0.0010  500  0.50    $1.50  $0.74    $75.00   $0.07   $204.19
Misalignment   False alert    0.0005  0    0.00    $1.50  $–       $85.00   $0.04   $10.52
Total                                                                              $74,259.83
^a The false alert rate is a separate calculation.
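Table 5 can likewise be reproduced with the rough equations of Sec. 3.1.4.1; this sketch is our own restatement, with the false-alert occurrence rates taken directly from the table's occurrence column (the table notes they come from a separate calculation).

```python
# Sketch reproducing the Table 5 risk calculation with the CMS in place.
REAM_COST = 1.50          # production cost per ream
OPERATING_DAYS = 50 * 5   # 50 weeks/year at 5 days/week

def alert_annual_risk(occurrence_rate, reams_lost, action_cost):
    """Annual cost risk of one alert outcome; rates are per operating day."""
    daily = occurrence_rate * (reams_lost * REAM_COST + action_cost)
    return daily * OPERATING_DAYS

cms_annual_risk = sum([
    alert_annual_risk(0.01 * 0.94, 100, 100.0),   # dulling, correct      ≈ $587.50
    alert_annual_risk(0.01 * 0.01, 300, 100.0),   # dulling, missed       ≈ $13.75
    alert_annual_risk(0.0005, 0, 85.0),           # dulling, false        ≈ $10.63
    alert_annual_risk(0.99 * 0.989, 150, 75.0),   # misalignment, correct ≈ $73,433.25
    alert_annual_risk(0.99 * 0.001, 500, 75.0),   # misalignment, missed  ≈ $204.19
    alert_annual_risk(0.0005, 0, 85.0),           # misalignment, false   ≈ $10.63
])  # ≈ $74,260 per year, matching Table 5 to within rounding
```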

4.1.5. Step 5: Conduct a Risk-Based Investment Analysis.

Bringing together all the costs found in the previous steps, we can calculate the estimated total return on cost risk from the CMS. This calculation was made using the NIST Smart Investment Tool. The firm in this example uses a 5% discount rate and a study period of 20 years. Using the equations previously discussed, the net present value can be calculated; Table 6 shows the values for this example. The total net present value for investing in the CMS is $939,955, as shown at the bottom of Table 6. The positive value indicates that this investment is economical. The internal rate of return for this investment is 31%, which significantly exceeds the 5% discount rate, again indicating an economical investment.

Table 6.

Net present value and internal rate of return for CMS investment

Year Cash inflow Cash outflow Annual cash flow Net present value Cumulative net present value
0      $0 −$322,500 −$322,500 −$322,500 −$322,500
1 $131,303   −$30,000   $101,303   $96,479 −$226,021
2 $131,303   −$30,000   $101,303   $91,885 −$134,137
3 $131,303   −$30,000   $101,303   $87,509   −$46,628
4 $131,303   −$30,000   $101,303   $83,342  $36,714
5 $131,303   −$30,000   $101,303   $79,373   $116,088
6 $131,303   −$30,000   $101,303   $75,594   $191,681
7 $131,303   −$30,000   $101,303   $71,994   $263,675
8 $131,303   −$30,000   $101,303   $68,566   $332,241
9 $131,303   −$30,000   $101,303   $65,301   $397,541
10 $131,303   −$30,000   $101,303   $62,191   $459,732
11 $131,303   −$30,000   $101,303   $59,230   $518,962
12 $131,303   −$30,000   $101,303   $56,409   $575,371
13 $131,303   −$30,000   $101,303   $53,723   $629,094
14 $131,303   −$30,000   $101,303   $51,165   $680,259
15 $131,303   −$30,000   $101,303   $48,728   $728,987
16 $131,303   −$30,000   $101,303   $46,408   $775,395
17 $131,303   −$30,000   $101,303   $44,198   $819,593
18 $131,303   −$30,000   $101,303   $42,093   $861,686
19 $131,303   −$30,000   $101,303   $40,089   $901,775
20 $131,303   −$30,000   $101,303   $38,180   $939,955
  Total $939,955
    IRR   31%
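
The following sketch reproduces the arithmetic behind Table 6 using the standard net-present-value formula. It is not the NIST Smart Investment Tool itself, only an illustration of the same calculation; small differences from the table come from rounding.

```python
# Net present value of a cash-flow series; cash_flows[0] occurs at year 0.
def npv(rate, cash_flows):
    return sum(cf / (1.0 + rate) ** year for year, cf in enumerate(cash_flows))

flows = [-322_500] + [131_303 - 30_000] * 20  # year-0 outlay, then 20 annual net flows
print(round(npv(0.05, flows)))  # ~939,955 (Table 6, up to rounding)

# Recover the internal rate of return by bisection (the rate where NPV = 0).
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2.0
    lo, hi = (mid, hi) if npv(mid, flows) > 0 else (lo, mid)
print(round(lo, 2))  # ~0.31, i.e., the 31% IRR reported above
```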

4.2. Example 2: Discrete, Multistage Laser-Engraving Operations.

This case study focuses on a set of discrete laser-engraving operations that produce a single type of machined part from a workpiece. The production line involves machines that perform four distinct laser-engraving operations; Fig. 3 specifies the sequence of these operations. The line produces 2700 machined parts at a rate of one part per minute before undergoing a full system check for repairs. This 2700-minute production run occurs 26 times a year. The target CMS uses only currently available sensing apparatus that evaluates part quality at the end of the production line. The CMS does not provide fault identification; instead, it gives fault localization within the system. A process failure mode and effects analysis (PFMEA) supplies the undesirable scenarios that degrade part quality. A simulation environment of the system is used to demonstrate the steps of an investment analysis of the CMS’s mitigating effects on the asset.

Fig. 3.

An example of four machines performing laser-engraving operations to produce a part

4.2.1. Step 1: Determine a Baseline or Current Level of Risk Associated With the Target Asset (Without the Condition Monitoring Systems).

Although the CMS in this case study does not distinguish failure modes, it is still important to develop cost risk scenarios for the list of failure modes (or undesirable scenarios) that the CMS will affect. Using this list to create a weighted total of the cost risk, or, as in this case study, simulating the various scenarios, yields a reasonable baseline for an investment analysis.

The PFMEA lists the failure modes for the laser-engraving operation: all the process failure modes that damage the machined part. It also lists the potential causes and effects of each failure mode, along with the corresponding current risk mitigation controls. Table 7 shows three failure modes, or undesirable scenarios: engravings that are too deep, engravings that are too shallow, and engravings that deform the part.

Table 7.

PFMEA excerpt for laser-engraving operation

Process function/requirement Potential failure mode Failure mode ID # Potential effect(s) of failure Severity Potential cause(s) of failure Occurrence Current process controls (prevention and detection)
Laser engrave characters onto parts;
engravings should be precise
Engraving too shallow 1 Operator: Inconvenience to finish (VI)
Part: Low effort rework quality (VI)
Production Plant: More time to finish engraving (VI)
VI Machine cycle time too short A Operator work instruction;
manufacturing process visual audit
Engraving too deep 2 Operator: Rework out-of-station (IV)
Part: Low rework quality (VI)
Production Plant: Out-process repair increases (IV)
IV Machine cycle time too long B Operator work instruction;
manufacturing process visual audit
Engraving deformed the part 3 Operator: Remove scraps (VI)
Part: Unusable (II)
Production Plant: 100% scrap (II)
II Laser power settings set too high D Operator work instruction;
manufacturing process visual audit

The PFMEA also lists the severity and occurrence of each failure mode. The severity of the failure modes for laser engraving indicates the range of damage that the failure may do to the machined part or production line. The corresponding occurrence indicates the range of likelihood that the machine performing laser engraving damages the part or line through that failure mode. As such, severity and occurrence help derive the cost and likelihood of each failure mode, respectively.

In this example, the failure modes have classification levels for severity and occurrence rather than quantified values for costs and frequencies, similar to the styles found in Refs. [26,31]. Typically, there are three to ten levels. Based on these classification levels, Table 7 shows that engravings that deform the part have high severity and low occurrence, in contrast to the low severity and high occurrence of shallow engravings.

The levels can be translated into relative quantitative values, but they must be grounded in some timeframe or monetary value. The basis for these levels should be included with the PFMEA; if it is not, expert opinion will be needed. For example, if the lowest failure occurrence level corresponds to a frequency of 1 per 100 operating hours, then the next occurrence level up might be assumed to happen twice per 100 operating hours. A hypothetical sketch of this translation follows.
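
The sketch below assumes a doubling rule between occurrence levels and grounds the severity levels in the costs shown later in Table 8; both the rule and the mapping are illustrative assumptions, not values prescribed by the PFMEA.

```python
# Hypothetical level-to-value translation (assumptions, not PFMEA data).
BASE_FREQUENCY = 1 / 100.0  # lowest occurrence level: 1 failure per 100 operating hours

def occurrence_level_to_frequency(level):
    """Level 1 is the lowest; assume each level up doubles the frequency."""
    return BASE_FREQUENCY * 2 ** (level - 1)

# Severity levels grounded in monetary costs (here matching Table 8's costs).
severity_level_to_cost = {"VI": 65.0, "IV": 95.0, "II": 150.0}

print(occurrence_level_to_frequency(2))  # 0.02 failures per operating hour
print(severity_level_to_cost["II"])      # $150 for the most severe mode shown
```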

The occurrence level that comes with each failure mode may be quantified as a frequency or likelihood. Specifying a likelihood from the level may require deriving additional information about that failure mode. The details of such derivations go beyond the scope of this case study, but when they are available, as in this example, they can be used to better evaluate risk through time.

In this example, the simulator quantifies a measure of occurrence for each failure mode experienced by a machine in the production line. This measure embeds the failure rates of each failure mode and the machine’s hazard distribution. As shown in Fig. 4 for one production run, these distributions were derived from information in the PFMEA and OEM guidelines. Note that, due to the distribution shapes, the likelihood of occurrence of each failure mode associated with a machine increases with time; this follows from machines becoming more error-prone with time, usage, and lack of maintenance. An illustrative sketch of one such time-increasing hazard follows Fig. 4.

Fig. 4.

Failure distributions for laser-engraving machines and their failure modes
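
The exact distributions in Fig. 4 come from the PFMEA and OEM guidelines and are not reproduced here. As an illustrative stand-in only, a Weibull hazard with shape parameter greater than one is a common way to model a failure likelihood that rises with operating time; the shape and scale values below are assumptions, not the paper’s fitted parameters.

```python
# Illustrative only: a Weibull hazard with shape k > 1 grows with time,
# mimicking the increasing failure likelihoods in Fig. 4.
def weibull_hazard(t, shape=2.0, scale=2700.0):
    """Instantaneous failure rate h(t) = (k / lam) * (t / lam)**(k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

for t in (500, 1500, 2700):  # minutes into a 2700-minute production run
    print(t, weibull_hazard(t))
```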

The severity level that comes with each failure mode may be estimated as a cost. Specifying a cost from the level may require input from an analyst. When the severity levels do not specify a cost, they must be translated into monetary costs as described in earlier sections.

The risk of each failure mode is the product of its estimated cost and its likelihood of occurrence. Totaling the risks of all failure modes for one component of the asset (one laser-engraving machine in this example) gives the risk of that component, and totaling the risks of every component gives the cost risk of the entire asset. The short sketch below illustrates this aggregation.
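
In the sketch, the costs echo Table 8 while the per-mode likelihoods are placeholders chosen purely for illustration.

```python
# Risk roll-up: failure-mode risk = cost x likelihood, summed per machine,
# then over the line. Likelihood values below are illustrative placeholders.
failure_modes = [
    (65.0, 0.020),   # Failure Mode 1: cost from Table 8, assumed likelihood
    (95.0, 0.010),   # Failure Mode 2
    (150.0, 0.002),  # Failure Mode 3
]

machine_risk = sum(cost * likelihood for cost, likelihood in failure_modes)
line_risk = 10 * machine_risk  # the example line uses 10 identical machines
print(machine_risk, line_risk)
```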

Table 8 shows the severity cost of each failure mode for the machine that performs laser engraving. Figure 5 shows the cost risk of one of these failure modes, the total cost risk incurred by a laser-engraving machine, and the cost risk of the entire production line asset, all throughout a single production run. These plots are from the simulator, which provides dynamically changing risk levels over time.

Table 8.

Severity costs from failure modes on one machine

Severity costs
Failure Mode 1   $65
Failure Mode 2   $95
Failure Mode 3 $150
Fig. 5.

Cost risk of entire production line (aggregated over all machines), cost risk of one machine (aggregated over all its failure modes), and cost risk of a single failure mode

4.2.2. Step 2: List the Costs of Installing and Operating the Condition Monitoring Systems.

The CMS being evaluated is a machine learning-based solution that evaluates the probability that each machine produces defective parts based on real-time system information. The inputs to the CMS come from each produced part: the part quality and the sequence of machines that produced it. The low requirements for setup, additional sensing equipment, and personnel training make it an appealing option.

For simplicity, this example includes only basic CMS costs. As in the previous case study, the CMS costs fall into two categories: one-time costs and ongoing costs. The total one-time cost of $4,500,000 includes software purchase, installation, sensing equipment, and personnel training. Ongoing costs of $20,000 per hour cover CMS operations. Table 9 lists the costs associated with this CMS solution.

Table 9.

Costs of CMS for laser-engraving operations example

One-time costs
Software purchase $500,000
Installation $2,000,000
Equipment $1,000,000
Personnel training $1,000,000
Total $4,500,000
Ongoing costs
Operations costs per hour $20,000
Total $20,000

4.2.3. Step 3: Assess the Risks of Operating the Condition Monitoring Systems.

The CMS solution in this example uses machine sequence data and installed end-of-line quality sensor instrumentation. It does not diagnose particular failure modes; instead, it provides a likelihood that a machine is producing poor-quality parts and needs servicing. As such, there is no opportunity for wrong alerts, only false and missed alerts. Even without identifying particular causes (failure modes), detecting and isolating faults can still help reduce asset risk.

The proposed CMS solution has been installed on other multistage manufacturing lines, and its false and missed alert rates can be inferred and estimated from those operations. Table 10 shows the alert rates, based on past performance, for the CMS’s ability to detect and isolate machine failure.

Table 10.

Estimated CMS alert rates for other multistage manufacturing lines

Failure mode False alert rate Missed alert rate Wrong alert rate Correct alert rate
Machine failure detection and isolation 0.08 0.15 0 0.85

These values will serve as “sanity checks” when the CMS is tested with the simulator in the next step. Because the simulator can be directly connected to the CMS, the CMS’s inherent alert rates should reappear in testing. If there is a large difference between the observed testing rates and those from previous applications in Table 10, there may be cause for deeper investigation; the sketch below shows one simple form of this check.
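
The sketch compares the historical rates in Table 10 against the rates later observed in simulation (reported in Table 11 in Step 4); the tolerance is an assumed analyst choice, not a value from the paper.

```python
# Flag large gaps between historical and simulated alert rates.
historical = {"false": 0.08, "missed": 0.15, "correct": 0.85}  # Table 10
simulated = {"false": 0.06, "missed": 0.10, "correct": 0.90}   # Table 11 (Step 4)

TOLERANCE = 0.10  # assumed maximum acceptable absolute difference
for name in historical:
    gap = abs(simulated[name] - historical[name])
    verdict = "ok" if gap <= TOLERANCE else "investigate"
    print(f"{name} alert rate gap = {gap:.2f} -> {verdict}")
```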

4.2.4. Step 4: Simulate the Value of the Condition Monitoring Systems Applied to the Target System.

The available information and the system simulator provide a convenient way to test the CMS tools directly on our system. The simulator dynamically schedules the production path of each part and simulates product quality based on the health of the machines in that path. Additionally, each machine experiences degradation through operation based on information about its failure mode occurrences. The simulation runtime reflects the production run scenario and repair cycle described above: 2700 machined parts at a rate of one part per minute.

During the simulation, the CMS attempts to isolate the machines that degrade part quality. These machines are likely experiencing failure modes in their ability to perform their respective operations. For this example, the production line makes use of 10 machines to perform the laser-engraving operations in Fig. 3. The operations use four of these machines to produce each part.

The CMS predicts the health status of all machines after each part is produced. This also corresponds to a single step of the simulation.

The CMS keeps a running-window log of produced parts and, for each machine, compares the quality of parts that machine operated on with the quality of parts it did not touch. From this ratio, the CMS estimates the average damage induced at that machine. At each step of the simulation, a threshold discriminates between machines that may be degrading part quality and machines in good health; the simulator then evaluates these health predictions to produce correct, missed, or false alerts. A minimal sketch of this detection logic follows.
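
The sketch assumes each log entry pairs a part-quality score with the set of machines that produced the part; the window size and threshold are hypothetical tuning parameters rather than values from the paper.

```python
from collections import deque
from statistics import mean

WINDOW = 200      # running window of recent parts (assumed size)
THRESHOLD = 0.9   # quality ratio below this flags a machine (assumed value)

# Each entry: (part quality score, set of machine IDs that produced the part).
log = deque(maxlen=WINDOW)

def flag_degrading_machines(all_machines):
    flagged = set()
    for m in all_machines:
        on = [q for q, machines in log if m in machines]       # parts m touched
        off = [q for q, machines in log if m not in machines]  # parts m did not
        if on and off and mean(on) / mean(off) < THRESHOLD:
            flagged.add(m)  # m's parts are measurably worse on average
    return flagged

log.extend([(0.95, {1, 2, 3, 4}), (0.60, {1, 5, 6, 7}), (0.96, {2, 5, 8, 9})])
print(flag_degrading_machines(range(1, 11)))  # the example line has 10 machines
```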

Each produced part yields CMS predictions, and evaluating those predictions yields alerts. CMS alerts induce different inspection and maintenance costs; the cost of a false alert differs from that of a missed or correct alert. Each alert carries a cost risk for the failure (or lack thereof) that triggered it. The correct alert costs also reflect the immediate repair of machines that degrade part quality, restoring those machines to a healthy status. Table 11 shows the costs for false, missed, and correct alerts.

Table 11.

Failure occurrences, alert rates, and alert costs from simulating 1000 CMS results for 2700 min of laser-engraving operations

Failure mode Average occurrence of failed machines per simulation False alert rate False alert cost Missed alert rate Missed alert cost Correct alert rate Correct alert cost
Machine failure detection and isolation 7.5 0.06 $30 0.10 $189.90 0.90 $43.80

Summing the real-time cost risks of these alerts over all the machines yields a simulated risk for the asset with the CMS applied. The total cost risk depends on the rate at which these alerts occur. Table 11 shows the average false, missed, and correct alert rates over the duration of a simulated production run, taken over 1000 simulations. Comparing them to the historical alert rates from Step 3 gives confidence that the two are similar and that the simulated production scenario is not an edge case. The CMS does show a higher sensitivity to this production scenario than in the historical data from CMS applications to other multistage manufacturing lines (see Table 10).

Lastly, Fig. 6 compares the dynamic estimated costs of the asset baseline risk and the asset’s post-CMS risk, both averaged over 1000 simulations, for a single production run of 2700 min. The wide standard deviation reflects all the possible combinations of machine failures captured by the 1000 simulations. Note that the cost risks with the CMS applied are much lower than the asset baseline risk throughout the simulations, and their respective standard deviation ranges diverge after 1913 min. This is due to the real-time machine repairs triggered by the CMS’s correct alerts.

Fig. 6.

Production line risk, with and without a condition monitoring system

4.2.5. Step 5: Conduct a Risk-Based Investment Analysis.

Finally, we can calculate the estimated investment returns for the CMS. First, we consider the benefits of operating the CMS for a production run of 2700 min. These are directly calculated as the asset baseline risk (Step 1) minus the asset’s cost risk with the CMS (Step 4). Then, we also subtract the operational costs associated with the CMS (Step 2). Figure 7 shows the cost benefits that the CMS brings to the asset in a single production run.

Fig. 7.

Return on investment of the condition monitoring system through 2700 min of production

This figure shows that, on average, the CMS brings positive net cost benefits to the manufacturing line once the asset produces 2553 parts (2553 min into the simulation). The average estimated cost benefits over the entire production run accumulate to $62,049. Due to the large variance, the lower bound of cost benefits in a production run accumulates to −$706,100 and the upper bound to $830,200. Given that this operation runs 26 times a year, the average estimated annual cash flow is $1,613,274.

We can now conduct an investment analysis of the CMS solution. The firm in this example uses a 7% discount rate and a study period of 5 years; the study period reflects the duration that the firm uses the CMS before investing in an upgrade. We calculate the net present value using the equations previously discussed in Step 5 of Sec. 3 (the earlier npv() sketch applies here as well; see the check after Table 12). Table 12 shows the values for this example. The total net present value for investing in the CMS is $2,114,742, as shown at the bottom of the table. The positive value indicates that this investment is economical. The internal rate of return for this investment is 23.24%, which exceeds the 7% discount rate, indicating a beneficial investment.

Table 12.

Net present value and internal rate of return for CMS investment

Year Annual cash flow Net present value Cumulative net present value
0 −$4,500,000  −$4,500,000 −$4,500,000
1   $1,613,274    $1,507,733 −$2,992,267
2   $1,613,274    $1,409,096 −$1,583,171
3   $1,613,274    $1,316,912   −$266,259
4   $1,613,274    $1,230,759  $964,500
5   $1,613,274    $1,150,242   $2,114,742
   Total    $2,114,742
     IRR    23.24%
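
As a spot check on Table 12, the npv() sketch from the first example can be reused with this firm’s parameters; any small residual versus the table reflects rounding in the annual cash flow.

```python
def npv(rate, cash_flows):  # same helper as in the first example's sketch
    return sum(cf / (1.0 + rate) ** year for year, cf in enumerate(cash_flows))

flows = [-4_500_000] + [1_613_274] * 5  # year-0 outlay, then five annual net flows
print(round(npv(0.07, flows)))  # ~2,114,742, matching Table 12 up to rounding
```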

5. Conclusions and Future Work

This paper presents steps to quantify the value of a CMS’s mitigation of an asset’s risks. Doing so requires estimating the risk of the CMS’s monitoring outcomes, accounting for the cost of CMS operations, and assessing the baseline risk of the asset. The resulting investment-returns analysis captures the objective risks and costs a CMS solution brings to its asset. The intention is to help decision-makers select a CMS solution whose value to their asset is quantified fairly and intuitively. These assessments should not be performed only at the initial deployment of a CMS; repeating them can help determine whether a solution that was previously proven suitable is still performing at acceptable levels today.

The investment analysis metrics can also instill confidence in decision-makers that their selected CMS can improve system operations. This capability is useful in a marketplace full of IAI and machine learning-based monitoring solutions, where high interest in the potential of these tools and an abundance of vendors promising solutions coexist with a lack of tool evaluators and evaluation methods. Quantifying and predicting the value of data-driven or machine learning-based CMS solutions to company assets can improve confidence in both their purchase and their monitoring results.

Future work will extend the use of risk assessment in CMS evaluation criteria. This will include capabilities to compare CMS solutions with regard to the risk acceptability of undesirable scenarios, that is, by taking into account the degree of acceptability of each of the asset’s failure modes and then observing and comparing the mitigating effects of CMS monitoring outcomes on high-risk failures. In parallel, future work will address information gathering and simulation building, especially when developing the asset baseline risk with an analyst’s judgment (Step 1). Future work will also review different possible CMS evaluation procedures (Step 4), extending the approximation calculations and simulators shown in this paper’s examples. The appropriate procedure depends on the amount of information available about the asset and its baseline risk, and best practices can help identify it.

Footnotes

This material is declared a work of the U.S. Government and is not subject to copyright protection in the United States. Approved for public release; distribution is unlimited.

Disclaimer

The use of any products described in this paper does not imply recommendation or endorsement by the National Institute of Standards & Technology, nor does it imply that products are necessarily the best available for the purpose.

Conflict of Interest

There are no conflicts of interest.

Contributor Information

Michael Sharp, Systems Integration Division, Engineering Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899.

Mehdi Dadfarnia, Systems Integration Division, Engineering Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899.

Timothy Sprock, Systems Integration Division, Engineering Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899; Applied Research Laboratory for Intelligence and Security, University of Maryland, College Park, MD 20742.

Douglas Thomas, Applied Economics Office, Engineering Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899.

Data Availability Statement

The data sets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.

References

  • [1]. Rehorn AG, Jiang J, and Orban PE, 2005, “State-of-the-Art Methods and Results in Tool Condition Monitoring: A Review,” Int. J. Adv. Manuf. Technol., 26(7–8), pp. 693–710.
  • [2]. Roth JT, Djurdjanovic D, Yang X, Mears L, and Kurfess T, 2010, “Quality and Inspection of Machining Operations: Tool Condition Monitoring,” ASME J. Manuf. Sci. Eng., 132(4), p. 041015.
  • [3]. Wu D, Jennings C, Terpenny J, Gao RX, and Kumara S, 2017, “A Comparative Study on Machine Learning Algorithms for Smart Manufacturing: Tool Wear Prediction Using Random Forests,” ASME J. Manuf. Sci. Eng., 139(7), p. 071018.
  • [4]. Jeschke S, Brecher C, Meisen T, Özdemir D, and Eschert T, 2017, “Industrial Internet of Things and Cyber Manufacturing Systems,” Industrial Internet of Things, Springer, Cham, pp. 3–19.
  • [5]. Parida A, 2016, “Asset Performance Measurement and Management: Bridging the Gap Between Failure and Success,” Measurement, 9000(14000), p. 55.
  • [6]. Shrieve P, 1993, “Implementing a Cost-Effective Machinery Condition Monitoring Program,” Profitable Condition Monitoring, Springer, Dordrecht, pp. 3–10.
  • [7]. Nicholls C, 1989, “Cost-Effective Condition Monitoring,” COMADEM 89 International, Springer, Boston, MA, pp. 335–347.
  • [8]. Nilsson J, and Bertling L, 2007, “Maintenance Management of Wind Power Systems Using Condition Monitoring Systems—Life Cycle Cost Analysis for Two Case Studies,” IEEE Trans. Energy Convers., 22(1), pp. 223–229.
  • [9]. Chang M-H, Sandborn P, Pecht M, Yung WK, and Wang W, 2015, “A Return on Investment Analysis of Applying Health Monitoring to LED Lighting Systems,” Microelectron. Reliab., 55(3–4), pp. 527–537.
  • [10]. Feldman K, and Sandborn P, 2008, “Analyzing the Return on Investment Associated With Prognostics and Health Management of Electronic Products,” International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Vol. 43277, pp. 1401–1409.
  • [11]. Zio E, 2016, “Some Challenges and Opportunities in Reliability Engineering,” IEEE Trans. Reliab., 65(4), pp. 1769–1782.
  • [12]. Swanson L, 2001, “Linking Maintenance Strategies to Performance,” Int. J. Prod. Econ., 70(3), pp. 237–244.
  • [13]. Valdez-Flores C, and Feldman RM, 1989, “A Survey of Preventive Maintenance Models for Stochastically Deteriorating Single-Unit Systems,” Nav. Res. Logist., 36(4), pp. 419–446.
  • [14]. Kaplan S, and Garrick BJ, 1981, “On the Quantitative Definition of Risk,” Risk Anal., 1(1), pp. 11–27.
  • [15]. Zio E, 2007, An Introduction to the Basics of Reliability and Risk Analysis, Vol. 13, World Scientific, Singapore.
  • [16]. Dunjó J, Fthenakis V, Vílchez JA, and Arnaldos J, 2010, “Hazard and Operability (HAZOP) Analysis. A Literature Review,” J. Hazard. Mater., 173(1–3), pp. 19–32.
  • [17]. Paté-Cornell ME, 1984, “Fault Trees vs. Event Trees in Reliability Analysis,” Risk Anal., 4(3), pp. 177–186.
  • [18]. Lee J, Davari H, Singh J, and Pandhare V, 2018, “Industrial Artificial Intelligence for Industry 4.0-Based Manufacturing Systems,” Manuf. Lett., 18, pp. 20–23.
  • [19]. Gao Z, Cecati C, and Ding SX, 2015, “A Survey of Fault Diagnosis and Fault-Tolerant Techniques—Part I: Fault Diagnosis With Model-Based and Signal-Based Approaches,” IEEE Trans. Ind. Electron., 62(6), pp. 3757–3767.
  • [20]. Lee J, 1995, “Machine Performance Monitoring and Proactive Maintenance in Computer-Integrated Manufacturing: Review and Perspective,” Int. J. Comput. Integr. Manuf., 8(5), pp. 370–380.
  • [21]. Paté-Cornell ME, 1996, “Uncertainties in Risk Analysis: Six Levels of Treatment,” Reliab. Eng. Syst. Saf., 54(2–3), pp. 95–111.
  • [22]. Herrmann JW, 2015, Engineering Decision Making and Risk Management, John Wiley & Sons, Hoboken, NJ.
  • [23]. Modarres M, 2006, Risk Analysis in Engineering: Techniques, Tools, and Trends, CRC Press, Boca Raton, FL.
  • [24]. Spreafico C, Russo D, and Rizzi C, 2017, “A State-of-the-Art Review of FMEA/FMECA Including Patents,” Comput. Sci. Rev., 25, pp. 19–28.
  • [25]. U.S. Department of Defense, 1980, Military Standard: Procedures for Performing a Failure Mode, Effects, and Criticality Analysis, MIL-STD-1629A.
  • [26]. U.S. Department of Defense, 2012, Military Standard: Department of Defense Standard Practice: System Safety, MIL-STD-882E.
  • [27]. Horst J, Hedberg T, and Feeney AB, 2019, On-Machine Measurement Use Cases and Information for Machining Operations, U.S. Department of Commerce, National Institute of Standards and Technology, Gaithersburg, MD.
  • [28]. Barajas LG, and Srinivasa N, 2008, “Real-Time Diagnostics, Prognostics and Health Management for Large-Scale Manufacturing Maintenance Systems,” International Manufacturing Science and Engineering Conference, Evanston, IL, Oct. 7–10, Vol. 48524, pp. 85–94.
  • [29]. Compare M, Baraldi P, and Zio E, 2020, “Challenges to IoT-Enabled Predictive Maintenance for Industry 4.0,” IEEE Internet Things J., 7(5), pp. 4585–4597.
  • [30]. Society of Automotive Engineers, 2001 (R2020), Recommended Failure Modes and Effects Analysis (FMEA) Practices for Non-Automobile Applications, SAE ARP 5580.
  • [31]. Society of Automotive Engineers, 2009, Potential Failure Mode and Effects Analysis in Design (Design FMEA), Potential Failure Mode and Effects Analysis in Manufacturing and Assembly Processes (Process FMEA), SAE J1739.
  • [32]. U.S. Department of Defense, 1998, Military Handbook: Electronic Reliability Design Handbook, MIL-HDBK-338B.
  • [33]. Society of Automotive Engineers, 1999, Evaluation Criteria for Reliability-Centered Maintenance (RCM) Processes, SAE JA1011.
  • [34]. Thomas D, 2017, “Investment Analysis Methods: A Practitioner’s Guide to Understanding the Basic Principles for Investment Decisions in Manufacturing,” NIST Advanced Manufacturing Series 200-5.
  • [35]. Cooke R, and Bedford T, 2002, “Reliability Databases in Perspective,” IEEE Trans. Reliab., 51(3), pp. 294–310.
  • [36]. Hillman C, 2013, “No MTBF? Do You Know MTBF?,” DfR Solutions Resources Paper.
  • [37]. Dempster AP, 1968, “A Generalization of Bayesian Inference,” J. R. Stat. Soc. Ser. B Methodol., 30(2), pp. 205–232.
  • [38]. Zadeh LA, 1965, “Fuzzy Sets,” Inf. Control, 8(3), pp. 338–353.
  • [39]. Aven T, Baraldi P, Flage R, and Zio E, 2013, Uncertainty in Risk Assessment: The Representation and Treatment of Uncertainties by Probabilistic and Non-Probabilistic Methods, John Wiley & Sons, Hoboken, NJ.
  • [40]. Javed K, Gouriveau R, and Zerhouni N, 2017, “State of the Art and Taxonomy of Prognostics Approaches, Trends of Prognostics Applications and Open Issues Towards Maturity at Different Technology Readiness Levels,” Mech. Syst. Signal Process., 94, pp. 214–236.
  • [41]. Zhou DP, Hu Q, and Tomlin CJ, 2017, “Quantitative Comparison of Data-Driven and Physics-Based Models for Commercial Building HVAC Systems,” Proceedings of the American Control Conference (ACC), IEEE, pp. 2900–2906.
  • [42]. Bedford T, and Cooke R, 2001, Probabilistic Risk Analysis: Foundations and Methods, Cambridge University Press, Cambridge, pp. 70–73.
  • [43]. National Institute of Standards and Technology, 2020, Smart Investment Tool, Version 1.0, https://www.nist.gov/services-resources/software/smart-investment-tool
  • [44]. Reid SG, 2000, “Acceptable Risk Criteria,” Prog. Struct. Eng. Mater., 2(2), pp. 254–262.
