Generative AI: No longer a black box
Part 1.
The promise of Artificial Intelligence (AI) is rapidly becoming reality as its true potential begins to be tapped. The scope of what intelligent applications can do suddenly seems limitless. Fueled by the recent surge in Generative AI technology, continued advances promise to produce more capable intelligent systems that perceive, learn, and act autonomously. But our ability to explain how AI systems make their decisions is still in its infancy. There is a long way to go before we can fully understand and trust AI systems. In this two-part article, we will explore how to achieve a higher level of trust for the next generation of AI systems.
The Need for More Insights
Generative AI, such as Large Language Models (LLMs), is complex machinery. For example, LLMs are able to generate a variety of creative results beyond their initial training dataset. LLMs have learned how to respond to our prompts with natural language, and developers are building applications that leverage LLMs’ knowledge base and reasoning ability.
We also know that LLMs can produce incorrect answers. The media and tech industry have called this phenomenon “hallucinations”, “lies”, or “confabulations”, and we are only beginning to understand and characterize this behavior. Fundamentally, we need more capable means to understand the models’ rationale, characterize their capabilities (both strengths and weaknesses), and set expectations for how they should behave.
Monitoring AI
To triage an AI model, you need a good understanding of the system you are trying to debug. You must recognize the bug, isolate it, and identify its cause. Only then can you determine a fix, apply the modification, and test it. Today’s approaches center on AI monitoring, which tracks the external outputs of the AI system to determine whether something has gone wrong. However, monitoring does not tell us what went wrong. Example monitoring tools are logs, dashboards, and alerts. They focus on aggregate metrics related to accuracy, predictions, features, or raw inputs, and they report the health of the AI model with regard to those metrics.
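To make the limits of monitoring concrete, here is a minimal sketch in Python of an aggregate-metric monitor. The class name `AccuracyMonitor`, the window size, and the threshold are all hypothetical choices for illustration, not part of any particular monitoring product.

```python
from collections import deque


class AccuracyMonitor:
    """Hypothetical helper: rolling accuracy over the last `window` predictions."""

    def __init__(self, window=100, threshold=0.9):
        # Only the most recent `window` outcomes are kept.
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, label):
        # Store True for a correct prediction, False otherwise.
        self.results.append(prediction == label)

    def accuracy(self):
        # No data yet means no meaningful accuracy.
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def alert(self):
        # Fire when rolling accuracy drops below the threshold.
        acc = self.accuracy()
        return acc is not None and acc < self.threshold


monitor = AccuracyMonitor(window=4, threshold=0.75)
for pred, label in [("cat", "cat"), ("dog", "dog"), ("cat", "dog"), ("dog", "cat")]:
    monitor.record(pred, label)

print(monitor.accuracy(), monitor.alert())
```

The alert reports that accuracy has dropped, but nothing in the monitor points to which part of the model caused the drop. That is the gap described above.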
Recently, newer approaches have been developed to offer observability. Here, AI behavior is correlated with elements of the model or training data, to help interpret which features of a model contributed to the degradation or error. This interpretability helps fix model issues by looking more deeply at what in the model went wrong.
If LLMs are not behaving correctly, you need to diagnose and correct the problem. Using a doctor-patient analogy, if a patient is ill, you need a doctor to diagnose the problem. However, current approaches to monitoring and observing AI are akin to a doctor using a stethoscope from afar. Remote observations yield only a speculative diagnosis. In short, our ML scientists lack approaches or tools to triage Generative AI models, whereas our doctors have access to a host of tools (blood tests, ultrasounds, CT, MRI, etc.) to gain confidence in any diagnosis.
AI Introspection and Remediation
What is needed is the ability to find bugs directly and fix them without breaking something else in the Generative AI model. Beyond monitoring and observing an AI model, we need tools that let developers introspect the model and remediate any problems they find.
We learned early on that you cannot just treat the AI model as a black box. An ML scientist needs excellent tools that can query and visualize the internal states of a model. Much like a software developer uses breakpoints, watchpoints, and variable inspection, ML scientists need tools to access the tensors and control flow of a model. They need an easy way to pause execution, step in, and insert any necessary modifications.
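The breakpoint analogy can be sketched with per-layer hooks. The toy example below is an assumption-laden illustration in plain Python: `TinyModel`, `add_hook`, and the two layers are hypothetical and operate on lists of floats rather than real tensors. A hook that returns nothing acts like a watchpoint; a hook that returns a value replaces the intermediate result, which stands in for inserting a modification.

```python
class TinyModel:
    """Hypothetical two-layer toy model over lists of floats, with per-layer hooks."""

    def __init__(self):
        self.layers = [
            ("scale", lambda xs: [2.0 * x for x in xs]),
            ("relu", lambda xs: [max(0.0, x) for x in xs]),
        ]
        self.hooks = {}

    def add_hook(self, layer_name, fn):
        # fn(name, tensor) may return None (inspect only) or a replacement tensor.
        self.hooks[layer_name] = fn

    def forward(self, xs):
        for name, layer in self.layers:
            xs = layer(xs)
            hook = self.hooks.get(name)
            if hook is not None:
                replacement = hook(name, xs)
                if replacement is not None:
                    xs = replacement
        return xs


captured = {}

def watch(name, tensor):
    # Watchpoint: record the intermediate output without changing it.
    captured[name] = list(tensor)

def clamp(name, tensor):
    # Remediation: cap activations at 0.5 after this layer.
    return [min(x, 0.5) for x in tensor]

model = TinyModel()
model.add_hook("scale", watch)
original = model.forward([-1.0, 0.5])

model.add_hook("relu", clamp)
patched = model.forward([-1.0, 0.5])

print(captured["scale"], original, patched)
```

Real frameworks expose comparable mechanisms on actual tensors; the point here is only the pattern of pausing at a layer, looking inside, and optionally changing what flows onward.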
Latent AI’s flagship product, the Latent AI Efficient Inference Platform (LEIP), is now armed with tools for deeper introspection and remediation of a model. Our specialty is models at the edge, which represent one of the harder challenges because embedded systems are less accessible than those in the cloud. This introspection capability lets us easily augment any model tensor and operator to support custom models. It enables fast triage to increase model accuracy (e.g., for model quantization and pruning). It helps in developing robust models, and it can help diagnose models when they are underperforming or “ill”. AI models, even the most complex Generative AI models, are no longer a black box, and that opens a whole new world of understanding what makes them tick.
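As one illustration of what triage for quantization could look like, the sketch below compares each layer's weights before and after a simulated 8-bit round trip and ranks layers by the error introduced. This is not LEIP's API or method: the function names, the symmetric per-layer quantization scheme, and the sample weights are all assumptions made for the example.

```python
def quantize_dequantize(values, bits=8):
    """Simulate symmetric uniform quantization and map the result back to floats."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax              # one scale shared by the whole layer
    return [round(v / scale) * scale for v in values]


def layer_errors(weights_by_layer, bits=8):
    """Return the largest absolute round-trip error for each layer."""
    errors = {}
    for name, values in weights_by_layer.items():
        restored = quantize_dequantize(values, bits)
        errors[name] = max(abs(a - b) for a, b in zip(values, restored))
    return errors


# Made-up weights: "fc" contains one large outlier.
weights = {
    "conv1": [0.01, -0.02, 0.015],
    "fc": [0.01, -0.02, 100.0],
}

errors = layer_errors(weights)
worst = max(errors, key=errors.get)
print(worst, errors)
```

In this toy data the outlier in `fc` stretches the shared scale so far that the small weights round to zero, so `fc` is flagged as the layer to inspect first. Access to per-layer internals is what makes this kind of targeted diagnosis possible.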