“In the Allegory of the Cave, Plato describes a group of people who have lived chained to the wall of a cave all their lives, facing a blank wall. The people watch shadows projected on the wall from objects passing in front of a fire behind them and give names to these shadows. The shadows are the prisoners’ reality, but are not accurate representations of the real world. The shadows represent the fragment of reality that we can normally perceive through our senses, while the objects under the sun represent the true forms of objects that we can only perceive through reason.”
(Source: Wikipedia)
We want to build AI that can understand and interact with the real world. But we’ve been training AI on “shadows” of reality. Text is one type of shadow. Pictures are another. Neither text nor a picture can fully describe a real object, like a penguin. In AI-speak, each type of “shadow” is a mode. Multimodal AI seeks to learn a fuller representation of an object by understanding how it appears across modes. Multimodal AI learns about a penguin not just from the seven-letter word “penguin”, but also from pictures of penguins, from their movement patterns (the waddle-like walk, the torpedo-like swim stroke), and even from how they smell. LLMs are today’s most popular example of modern AI. They are powerful, but they see only the shadow-world of text. LLMs are still imprisoned in Plato’s cave.
Multimodal AI is staging a prison break.
Today’s multimodal AI focuses on the most common modes of perception: text, image, and sound. Multimodal AI can caption a picture. But for commercial applications, we must weave in additional domain-specific modes beyond text, image, and sound.
Consider advanced manufacturing, a domain where our company, Arena, does a lot of work. We are building AI specialists to help design, test, and optimize complex hardware, like leading-edge chips. Chips aren’t like penguins. To learn how a new chip behaves, we need to learn from the modes that matter most: thermal profiles, power-consumption curves, engineers’ notes scribbled into log files, oscilloscope readings (affectionately called eye diagrams), high-pitched sounds beyond the range of human hearing, and even video feeds of what’s being rendered on screen while the chip is at work.
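To make the idea concrete, here is a minimal sketch in Python of what bundling those modes into one record and fusing them might look like. This is purely illustrative: the field names, the summary statistics, and the simple late-fusion step are my assumptions, not Arena’s actual system or schema.

```python
from dataclasses import dataclass
from statistics import fmean

# Hypothetical record bundling the domain-specific modes listed above.
# Field names and units are illustrative, not a real schema.
@dataclass
class ChipTelemetry:
    thermal_profile_c: list[float]   # temperature samples over a test run (deg C)
    power_curve_w: list[float]       # power-consumption samples (watts)
    log_excerpt: str                 # engineer's notes / log lines
    eye_diagram_margin: float        # summary metric from an oscilloscope eye diagram
    ultrasonic_rms: float            # level of high-frequency acoustic emissions

def embed(sample: ChipTelemetry) -> list[float]:
    """Toy 'late fusion': reduce each mode to a few numbers and concatenate
    them into a single feature vector. A real system would use learned
    encoders per mode, but the shape of the idea is the same."""
    return [
        fmean(sample.thermal_profile_c),
        max(sample.thermal_profile_c),
        fmean(sample.power_curve_w),
        float("error" in sample.log_excerpt.lower()),  # crude text feature
        sample.eye_diagram_margin,
        sample.ultrasonic_rms,
    ]

if __name__ == "__main__":
    run = ChipTelemetry(
        thermal_profile_c=[41.2, 55.7, 63.1, 64.0],
        power_curve_w=[3.1, 4.8, 5.2, 5.0],
        log_excerpt="PLL lock error at 2.4 GHz during retrain",
        eye_diagram_margin=0.18,
        ultrasonic_rms=0.07,
    )
    print(embed(run))  # one vector a downstream model could learn from
```

The point of the sketch is the structure, not the math: each mode contributes something the others cannot, and a model only sees the full picture once they are combined.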
Domain-specific multimodal AI is opening up entirely new avenues of application.
Have you ever met a fantastic car mechanic? First, you describe the problem to her. Then, she starts your car, hears a funny sound, notices a strange wobble, and immediately says, “I know what’s wrong. It’s probably the carburetor.” That’s multimodal learning. She couldn’t have diagnosed the problem by just listening to the sound or seeing the wobble. Looking at those modes together was the key.
Domain-specific multimodal AI will turn mediocre mechanics and engineers into savants. Anyone who has to debug or repair complex hardware will have an AI expert in their pocket. These AI experts will be networked together, sharing what they learn with each other in real time. The integrated AI mind will evolve quickly, graduating beyond debugging and repair to help us with design.
Beyond manufacturing, I’m confident that multimodal AI will rapidly become the norm in almost every industry, dramatically accelerating our pace of innovation as a civilization.