The real challenge of real-world AI
What happens when an AI system makes a mistake? How much trust can we place in machine learning algorithms? These questions are crucial for decision-makers who are looking to integrate AI into their products and services.
The real challenge in AI and ML isn't just getting algorithms to work—it's ensuring they work correctly in all situations, especially those where mistakes can have serious consequences. Imagine an AI-powered medical diagnostic tool that boasts 99% accuracy. Sounds impressive, right? But what if that 1% error rate means misdiagnosing a life-threatening condition? Or consider a self-driving car system: even a tiny error rate could lead to accidents, potentially causing harm or loss of life.
In this article, we’ll explore why the hard part of machine learning and AI isn't achieving high accuracy. Instead, the challenge lies in understanding the critical scenarios where the AI can't afford to be wrong and ensuring it performs reliably in those situations. Here’s what we’ll cover:
Why high accuracy rates can be misleading and the importance of context.
How different applications require different types of accuracy.
Strategies for building robust AI systems that can handle unexpected scenarios.
The importance of human judgment in critical decision-making processes.
Through real-world examples and practical advice, we aim to highlight the importance of robustness, safety, and human oversight in AI systems.
The Illusion of High Accuracy
Why is high accuracy not enough? High accuracy rates can be misleading. It's easy to be impressed by an AI system that claims 99% accuracy, but this number doesn't tell the whole story. In critical applications, even a small error rate can have catastrophic consequences.
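To see how this plays out in numbers, here is a minimal sketch in plain Python: on a dataset where only 1% of cases are critical, a model that never flags anything still reports 99% accuracy while catching none of the cases that matter. The data is made up purely for illustration.

```python
# Toy illustration: high accuracy on imbalanced data can hide total failure
# on the rare, critical class.
y_true = [1] * 10 + [0] * 990   # 10 critical cases (e.g. a serious condition) in 1,000
y_pred = [0] * 1000             # a "model" that never flags anyone

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(f"accuracy: {accuracy:.2%}")  # 99.00% -- looks impressive
print(f"recall:   {recall:.2%}")    # 0.00% -- every critical case is missed
```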
Consider a recent example involving Air Canada's customer support chatbot. The AI system was designed to handle customer queries efficiently, but it mistakenly promised a non-existent discount to customers. This error led to significant customer dissatisfaction and legal action against the airline. While the chatbot might have successfully managed 99% of interactions, the 1% error had serious repercussions, highlighting the risk of relying solely on accuracy metrics without considering the potential impact of errors.
Another example is AI-based quality control in manufacturing. A system with a 99% accuracy rate in detecting product defects still lets 1% of defective products through, which can translate into serious quality problems at scale. The stakes rise further in safety-critical systems. Tesla's Autopilot, an AI-based system designed to control steering, acceleration, and braking, has been involved in several crashes, including an incident in which a Tesla Model Y struck a police vehicle while on Autopilot in Michigan. These and similar crashes have led to increased scrutiny and investigations into the safety of Tesla's Autopilot.
The key point here is that context matters more than raw accuracy. In situations where errors can cause significant harm, it's not enough to achieve high accuracy. We need to ensure that the system performs correctly in all critical scenarios, regardless of how well it performs on average.
This requires a deeper analysis beyond just accuracy metrics. Decision-makers need to look at the specific scenarios where AI errors can occur and assess the potential impact. Only then can they develop strategies to mitigate risks and ensure the safety and reliability of AI systems.
Different Types of Errors in Various Applications
Why do different applications prioritize different types of accuracy? Because the cost of an error is not the same everywhere: a missed detection may be catastrophic in one setting, while a false alarm is the bigger problem in another. Understanding these distinctions is crucial for ensuring the effectiveness and safety of AI systems.
In some applications, recall—the ability to identify all relevant instances—is absolutely critical. We have, for instance, worked on a screening system for detecting relevant patents during a company's Freedom to Operate (FTO) process. In this context, recall is far more important than precision. Missing a relevant patent can lead to expensive legal battles and potential infringement cases. The cost of false negatives (missed patents) is significantly higher than that of false positives (irrelevant patents flagged). Therefore, the system must prioritize capturing every possible relevant patent, even if that means flagging some irrelevant ones for human review.
Conversely, in other applications, precision—the ability to accurately identify relevant instances without including irrelevant ones—is more important. Consider a diagnostic system that determines whether a patient needs invasive surgery. In this case, a high precision rate is essential to avoid unnecessary surgeries, which can pose serious health risks and significant costs. Here, the system must ensure that only those patients who genuinely need surgery are identified, even if it means some cases are not flagged immediately and require further non-invasive tests.
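The two priorities often come down to where the decision threshold sits on the model's scores. Below is a small sketch, using made-up scores and labels with scikit-learn's metrics, of how a low threshold serves a recall-first use case like patent screening while a high threshold serves a precision-first use case like a surgery decision. In practice, the threshold would be tuned on a validation set against the real cost of each error type.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical model scores for 8 items and their true labels (1 = relevant / needs surgery).
scores = np.array([0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1,    1,    0,    1,    0,    1,    0,    0])

for name, threshold in [("recall-first (patent screening)", 0.25),
                        ("precision-first (surgery decision)", 0.75)]:
    preds = (scores >= threshold).astype(int)
    print(f"{name}: "
          f"precision={precision_score(labels, preds, zero_division=0):.2f}, "
          f"recall={recall_score(labels, preds, zero_division=0):.2f}")
# recall-first:    precision=0.67, recall=1.00  -> nothing relevant is missed
# precision-first: precision=1.00, recall=0.50  -> nothing is flagged unnecessarily
```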
In yet another category of applications, prediction bounds and intervals are most important. For example, in financial forecasting or demand planning, understanding the range within which predictions fall is crucial. Companies need to plan for the best and worst-case scenarios to manage risk effectively. In such cases, knowing the prediction intervals allows for appropriate contingency planning and risk management, ensuring that organizations can respond proactively to potential variations in outcomes.
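One common way to obtain such intervals is quantile regression. The sketch below uses scikit-learn's quantile loss for gradient boosting on synthetic demand data; the feature, numbers, and percentile choices are illustrative, not taken from a real forecasting system.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))           # e.g. a marketing-spend feature
y = 50 + 10 * X[:, 0] + rng.normal(0, 8, 500)   # synthetic weekly demand

# Fit one model per quantile to get a lower bound, a central estimate, and an upper bound.
models = {
    "lower (p10)": GradientBoostingRegressor(loss="quantile", alpha=0.10).fit(X, y),
    "median (p50)": GradientBoostingRegressor(loss="quantile", alpha=0.50).fit(X, y),
    "upper (p90)": GradientBoostingRegressor(loss="quantile", alpha=0.90).fit(X, y),
}

next_week = np.array([[7.5]])
for name, model in models.items():
    print(name, round(model.predict(next_week)[0], 1))
# Planning against the p10-p90 range covers best- and worst-case demand, not just the average.
```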
Each type of error—whether it be false positives, false negatives, or uncertain predictions—carries different implications depending on the application. Decision-makers need to understand these nuances and tailor their AI systems to minimize the most critical errors for their specific use case. By doing so, they can enhance the reliability and effectiveness of their AI applications while mitigating the risks associated with errors.
Ensuring Robustness in AI Systems
How can we build AI systems that withstand unexpected scenarios? Ensuring robustness in AI systems is crucial for their reliable performance, especially in critical applications. Robustness means that the system can handle unexpected scenarios and errors without failing.
Redundancy is a key strategy for building robust AI systems. By incorporating multiple algorithms or systems that perform the same task, a company can ensure that if one system fails, another can take over. For example, in an e-commerce recommendation system, having multiple recommendation engines can ensure that if one model underperforms or goes down, others can still provide relevant product suggestions to users, maintaining a seamless shopping experience.
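In code, redundancy often takes the form of an ordered fallback chain. The sketch below is a simplified illustration of that pattern; the engine functions and the default list are placeholders rather than a real recommendation API.

```python
from typing import Callable, List

def recommend_with_fallback(user_id: str,
                            engines: List[Callable[[str], List[str]]],
                            default: List[str]) -> List[str]:
    """Try engines in priority order; fall back if one fails or returns nothing."""
    for engine in engines:
        try:
            items = engine(user_id)
            if items:            # usable, non-empty result
                return items
        except Exception:
            continue             # this engine failed; try the next one
    return default               # last resort, e.g. top-selling products

# Hypothetical usage:
# recs = recommend_with_fallback("user-42",
#                                engines=[personalized_model, collaborative_filter],
#                                default=top_selling_products)
```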
Another essential aspect of robustness is implementing fail-safes. Fail-safes are mechanisms that kick in when the AI system encounters a situation it can't handle. For instance, customer support chatbots often face queries they cannot answer. A robust chatbot system can have a fail-safe that seamlessly transfers the conversation to a human agent when it detects it cannot adequately respond to a customer's query. This ensures that customers receive the help they need without frustration.
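A common way to implement such a fail-safe is a confidence threshold on the bot's own predictions. The sketch below uses a toy intent classifier as a stand-in for a real model, and the threshold value is illustrative.

```python
# Fail-safe sketch: hand the conversation to a human when confidence is low,
# instead of letting the bot guess (or invent a policy, as in the Air Canada case).
CONFIDENCE_THRESHOLD = 0.75

def classify_intent(message: str) -> tuple[str, float]:
    # Stand-in for a real intent classifier returning (intent, confidence).
    return ("refund_request", 0.62) if "refund" in message.lower() else ("unknown", 0.30)

def handle_message(message: str) -> str:
    intent, confidence = classify_intent(message)
    if intent == "unknown" or confidence < CONFIDENCE_THRESHOLD:
        # Route to a human rather than risk a wrong or invented answer.
        return "Connecting you with a human agent who can help."
    return f"Handling intent '{intent}' automatically."

print(handle_message("I want a refund for my cancelled flight"))
print(handle_message("Can you promise me a discount?"))
```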
Rigorous testing and validation are also critical for robustness. AI systems must be tested in diverse and challenging environments to ensure they can handle real-world scenarios. Consider fraud detection systems in financial services. These systems need to be tested with a wide variety of fraudulent and legitimate transaction patterns to ensure they can accurately detect fraud without flagging too many false positives, which could inconvenience customers. Continuous testing and validation help in fine-tuning the system and maintaining its accuracy and reliability over time.
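One practical way to make this systematic is to encode named scenario families, including rare but high-impact ones, as automated tests. The sketch below uses pytest and a toy rule in place of a real fraud model; the scenarios and thresholds are illustrative.

```python
import pytest

def is_fraudulent(amount: float, country_change: bool, txn_per_hour: int) -> bool:
    # Toy stand-in for a real fraud model.
    return amount > 5000 or (country_change and txn_per_hour > 10)

@pytest.mark.parametrize("case, amount, country_change, txn_per_hour, expected", [
    ("typical purchase",          45.0,   False,  1, False),
    ("large legitimate purchase", 4800.0, False,  1, False),  # must not inconvenience the customer
    ("card-testing burst",        2.0,    True,  40, True),   # rare but high-impact pattern
    ("big-ticket fraud",          9000.0, True,   3, True),
])
def test_fraud_scenarios(case, amount, country_change, txn_per_hour, expected):
    assert is_fraudulent(amount, country_change, txn_per_hour) == expected
```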
Graceful degradation is equally important: when errors occur, the system should reduce performance in a controlled manner rather than fail completely. For instance, in a recommendation system, if one algorithm fails, the system should still provide reasonable recommendations using a backup algorithm, rather than showing no recommendations at all.
AI systems should also be tested on a wide range of data and scenarios to ensure they can handle varied inputs and situations. This includes edge cases and rare events that the system might not encounter frequently but that could have significant impacts if not handled properly. For example, testing a natural language processing system with diverse dialects, slang, and unusual sentence structures helps ensure its robustness.
Continuous monitoring and logging of AI system performance can help quickly identify and address issues. By keeping track of system behavior and performance metrics, organizations can detect anomalies early and take corrective actions before they escalate into significant problems. For instance, in financial trading systems, real-time monitoring can help detect unusual trading patterns that might indicate a malfunction.
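Even a simple drift check on a key metric can surface problems early. The sketch below compares a recent window against a historical baseline; the metric, tolerance, and alerting hook are all illustrative.

```python
from statistics import mean

def check_for_drift(history: list[float], recent: list[float], tolerance: float = 0.05) -> bool:
    """Alert when the recent average of a key metric moves away from its baseline."""
    baseline = mean(history)
    current = mean(recent)
    drifted = abs(current - baseline) > tolerance
    if drifted:
        # Stand-in for real alerting (paging, dashboards, logs).
        print(f"ALERT: metric moved from {baseline:.3f} to {current:.3f}")
    return drifted

# e.g. daily precision of a fraud model over the past weeks vs. the last three days
check_for_drift(history=[0.91, 0.92, 0.90, 0.93, 0.91], recent=[0.78, 0.80, 0.79])
```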
Adversarial testing, in which the system is subjected to intentional attacks or stress conditions, is another way to evaluate robustness. This approach helps identify vulnerabilities that might not surface during normal testing. For example, in cybersecurity, red teaming involves simulating cyberattacks to test the system's defenses and identify potential weaknesses.
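On a smaller scale, the same idea can be applied to almost any model by perturbing inputs and checking whether decisions stay stable. The sketch below uses a toy keyword classifier as a stand-in for a real model; the perturbations are deliberately simple.

```python
import random

def classify(text: str) -> str:
    # Toy stand-in for a real text classifier.
    return "spam" if "free money" in text.lower() else "ok"

def perturb(text: str) -> str:
    # Simple stress perturbations: casing, punctuation, character substitution, padding.
    variants = [text.upper(), text + "!!!", text.replace("o", "0"), "  " + text]
    return random.choice(variants)

original = "Claim your free money now"
baseline = classify(original)
failures = [p for p in (perturb(original) for _ in range(20)) if classify(p) != baseline]
print(f"{len(failures)} of 20 perturbed inputs flipped the decision")
# Character substitution ("m0ney") flips this toy model -- exactly the kind of
# weakness adversarial testing is meant to expose before attackers or users do.
```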
Incorporating these strategies—redundancy, fail-safes, rigorous testing, graceful degradation, diverse data testing, continuous monitoring, and adversarial testing—helps in building robust AI systems that can withstand unexpected situations and continue to perform reliably. This robustness sustains user trust and satisfaction, ultimately leading to the successful integration of AI into critical applications.
The Role of Human Oversight in AI
Why is human oversight indispensable in AI systems? Human oversight is a critical component in the deployment and operation of AI systems. While AI can handle vast amounts of data and perform complex tasks, human judgment is essential to ensure these systems operate correctly and ethically.
Complementing your software with human judgment is crucial for tasks that require nuanced understanding and ethical considerations. For instance, content moderation on social media platforms involves identifying harmful or inappropriate content. While AI can flag potentially problematic posts, human moderators are needed to review these flags and make final decisions, considering context and cultural sensitivities that the algorithm might miss.
In critical decision-making scenarios, human intervention becomes even more important. Consider the example we mentioned above of patent screening in a company's Freedom to Operate (FTO) process. AI systems can help identify relevant patents to reduce human workload, but human experts are essential to thoroughly evaluate the findings. They can assess the context and potential legal implications that AI might overlook. This human oversight ensures that no critical patents are missed, thereby avoiding costly legal issues and ensuring the company's freedom to operate.
Continuous monitoring and feedback are essential to maintain the effectiveness of AI systems over time. Personalized learning platforms, which tailor educational content to individual students, require ongoing human oversight to ensure the content remains appropriate and effective. Educators and administrators can provide feedback to the AI system, helping to adjust algorithms and content to better meet students' needs and address any issues that arise.
By integrating human oversight, we can enhance the reliability, ethical standards, and overall performance of AI systems. This combination of AI efficiency and human judgment creates a balanced approach, ensuring that AI systems can handle complex tasks while being guided by human insight and ethical considerations.
Wrapping Up
Ensuring that AI systems operate safely, reliably, and ethically is a multifaceted challenge. While achieving high accuracy is important, it is equally crucial to understand and mitigate the potential risks associated with AI errors. By focusing on robustness, incorporating human oversight, and understanding the types of errors that matter most in different applications, decision-makers can develop AI systems that not only perform well but also safeguard against critical failures.
To get there, take these concrete steps to ensure the robustness and reliability of your AI systems:
Build systems with multiple layers of protection to handle unexpected scenarios.
Ensure that human experts are involved in critical decision-making processes to catch and correct potential AI errors.
Regularly test your AI systems in diverse and challenging environments to identify and mitigate potential failures.
Focus on minimizing errors that matter most for your specific applications, whether it's improving recall, precision, or prediction intervals.
By taking these steps, you can develop AI solutions that are not only high-performing but also reliable and safe in critical situations. Reach out to discuss how we can help you avoid costly mistakes, protect your stakeholders, and build trust in your AI initiatives.