Transformers, Cortical Columns & AI’s x86 Moment
The Transformer puts the T in GPT and is the beating heart of generative AI, but I don't think the transformer's full impact has been felt or understood yet. This deep neural network architecture was first introduced in 2017 by Ashish Vaswani and his team at Google in the paper "Attention Is All You Need", which has a whopping 104,270 citations as of January 2024.
But we ain’t seen nothing yet.
There are three big reasons that the transformer is revolutionizing AI:
1. Transformers are our first general “microprocessor” for AI.
They are the first AI architecture that can be scaled up or applied to a broad range of problems without major changes to the underlying design. In 1978, the x86 architecture let us build microprocessors for PCs that could support a general set of applications, from word processing to video games. It's hard to imagine the software revolution that has defined the last five decades happening without the x86. Imagine if the only app your PC could run was Word: you'd need a different PC, with a different chip, for Excel, another one for video games, and another to browse the web. Until the transformer came along, that's what things were like for AI. We needed a different architecture for every problem.
Transformers are AI’s “x86” moment. To go truly mainstream, the x86 needed DOS, Windows and suites of applications on top. But we have the microprocessor now.
2. Transformers work in a way that’s akin to our own brains.
Human brains evolved by adding new parts on top of old ones. The neocortex – our center of intelligence – makes up 70% of our brain by volume. Interestingly, the neocortex grew rapidly (in evolutionary time), not by evolving new, different architectures for vision, language, memory and logical reasoning. Instead, it grew by just making more copies of the same basic circuit. In Jeff Hawkins’ eye-opening book, A Thousand Brains, he calls this basic circuit a “cortical column”. The human brain has about 150,000 cortical columns. The neocortex is general – the same sub-circuits learn to see, hear, talk, and move.
Transformers are the first AI architecture that functions like the neocortex: they are both general and replicable, and they scale by copying and pasting simple root components, the attention head and the transformer layer. Because of this similarity, transformers can tackle a variety of intelligence tasks that go way beyond language. We might know them for LLMs, but at Arena, we use transformers for what we call LXMs: large transformers, pre-trained to learn the non-verbal language of complex systems. We believe that the future won't look like teams of data scientists building custom ML models from scratch, but rather big foundation models built by companies like OpenAI and Arena, fine-tuned by data scientists for specific use cases.
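To make the copy-and-paste idea concrete, here's a rough sketch in PyTorch of a transformer built by stacking identical blocks. It's illustrative only: the layer count, head count, and embedding size are made up, and this is not how Arena's LXMs are implemented. The point is simply that "scaling up" means making more copies of the same circuit.

```python
# A minimal sketch of transformer scaling: one attention-based block, copied N times.
# All sizes below are illustrative, not taken from any real model.
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 256, 8, 6

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)              # self-attention over the sequence
        x = self.norm1(x + a)                  # residual connection + norm
        return self.norm2(x + self.ff(x))      # feed-forward + residual + norm

# Scaling up is literally stacking more copies of the same block.
model = nn.Sequential(*[Block() for _ in range(n_layers)])
x = torch.randn(1, 16, d_model)                # (batch, tokens, embedding)
print(model(x).shape)                          # torch.Size([1, 16, 256])
```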
3. Transformers can understand things in context.
The transformer architecture introduces a mathematical innovation called "attention" that lets the model learn what words mean in context. For example, transformers can understand that a "Captain" in "Star Trek" likely means "Captain Kirk" or "Captain Picard". The attention mechanism lets transformers process arbitrary-length sequences of data, so the context could be the last paragraph or all the books ever written by Shakespeare. To get a sense of scale: when an LLM powered by a transformer reads a book, it looks at every word on every page at the same time, rather than reading slowly from cover to cover.
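For the curious, here's what that attention computation looks like stripped down to its core. This is a minimal NumPy sketch of scaled dot-product attention from the original paper, not a production implementation; the variable names and toy sizes are mine.

```python
# Scaled dot-product attention, the core of "Attention Is All You Need".
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    # Every query attends to every key at once -- this is why a transformer
    # "reads" the whole sequence in parallel rather than word by word.
    scores = Q @ K.T / np.sqrt(d_k)                           # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                                        # context-weighted values

# Toy usage: 4 tokens, 8-dimensional embeddings, attending to themselves.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```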
With transformers, for the first time ever, AI has a flexible microprocessor that can bring intelligence to many different problems, with brain-like circuitry that can scale up by copying itself and can understand nuanced context. The transformer isn’t the final step on the path to AGI, but it’s the first true foundation that enables it.
Eventually, the transformer will be replaced, but its successor will share its general-purpose intelligence and replicability while being even more data- and energy-efficient.
Thanks to Arena Machine Learning Scientists Chris Bryant and Etan Green for enriching my understanding of transformers and reading drafts of this post.