Data doesn't fuel AI. Entropy does.

Jan 01, 2024

Before finding my way to startups, I was a quantum physicist. One of the most fundamental concepts we learn in physics is entropy. Computer scientists often call entropy by another name – information. Importantly, information is not the same thing as data.

Here’s a simple primer:

Information = the degree of “surprise” when you learn something from data. For example, the sentence “the sun will rise tomorrow” carries less information than the sentence “the world will end tomorrow”, despite having essentially the same number of characters. Physicists arrived at the concept of entropy in the 1800s by counting the number of “states” a group of particles could take. Computer scientists arrived at it by trying to compress data for communication, and they landed in the same place. There are only so many ways you can fundamentally arrange things in any system. Count those ways, take the logarithm, and you have the entropy. It’s also the amount of information contained in the system.
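To make “surprise” concrete, here is a minimal Python sketch using Shannon's standard definitions: the surprisal of an event with probability p is -log2(p) bits, and entropy is the average surprisal over all outcomes. The function names and the two probabilities at the bottom are made up for illustration; nobody has measured the odds of the world ending tomorrow.

```python
import math


def surprisal_bits(p: float) -> float:
    """Surprise, in bits, of seeing an event that had probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


def entropy_bits(probs) -> float:
    """Shannon entropy: the average surprisal over a whole distribution."""
    return sum(p * surprisal_bits(p) for p in probs if p > 0)


# Illustrative, assumed probabilities (not measured values).
p_sun_rises = 0.999999
p_world_ends = 0.000001

print(surprisal_bits(p_sun_rises))   # a tiny fraction of a bit
print(surprisal_bits(p_world_ends))  # roughly 20 bits

# For N equally likely arrangements, entropy is log2(N).
print(entropy_bits([1 / 8] * 8))
```

A fair coin flip carries exactly one bit, and eight equally likely arrangements carry three bits, which is the “logarithm of the number of ways” in action.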

In the AI world, I’m seeing a failure mode where people assume that all data is useful information. In reality, data is just a first-order approximation of information. Here’s an example:

Let’s say Galileo drops 1 coin off the Leaning Tower of Pisa. In that first drop, he records a little data: the coin hit the ground after X seconds, and the wind was strong and from the west. This is high entropy data. He drops 3 more coins and writes down that they also hit the ground after X seconds. This adds more entropy, but less than the first observation. Then he drops a heavier object and scribbles another high entropy entry into his notebook. Then he jots down data for a lighter object, which is also a high entropy recording. He drops many more objects. They all hit the ground after the same number of seconds. Each additional entry is lower entropy, and soon observing more drops tells him nothing interesting. The entropy of an additional drop is near zero. Galileo has squeezed all the “entropic” juice from this experiment. He’s learned that gravitational acceleration is constant and independent of an object’s mass. More experimental data likely won’t change his conclusions.
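The shrinking value of each extra drop can be sketched in a few lines of Python. This is a toy model under loud assumptions: the observer predicts the next fall time from a simple add-one-smoothed tally of what has been seen so far, there are 10 possible rounded fall times, and every drop lands at the same hypothetical 3.4 seconds. It ignores the heavier and lighter objects in the story and only shows the repeated-measurement part.

```python
import math
from collections import Counter


def surprisal_per_observation(observations, num_possible_outcomes):
    """Bits of surprise for each observation, given everything seen before it.

    Uses add-one (Laplace) smoothing, so an outcome never seen before still
    has a nonzero predicted probability.
    """
    counts = Counter()
    seen = 0
    bits = []
    for obs in observations:
        p = (counts[obs] + 1) / (seen + num_possible_outcomes)
        bits.append(-math.log2(p))
        counts[obs] += 1
        seen += 1
    return bits


# Hypothetical notebook: 50 drops, all with the same rounded fall time.
drops = [3.4] * 50
bits = surprisal_per_observation(drops, num_possible_outcomes=10)

print(round(bits[0], 2))   # the first drop is the most surprising
print(round(bits[-1], 2))  # the fiftieth drop tells him very little
```

In this toy model every repeated result is less surprising than the one before it, which is the sense in which the entropy of an additional drop heads toward zero.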

[Image. Source: Bing Image Creator]

The same rule applies in AI. The more entropy in the data, the more valuable it is.

Yet instead of seeking out entropy, we chase mountains of data, presuming that the bigger the mountain, the more valuable the information, and we end up sitting on a pile of noise.

At Arena, we believe that the value of a dataset and its price should be based on its entropy. Imagine a giant spreadsheet of data that’s for sale. Each cell should be priced differently based on its entropy. More entropy = higher price. Less entropy = lower price.
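As a rough illustration of per-cell pricing, here is a hedged Python sketch. It is not Arena's method, just one simple way to turn the idea into arithmetic: treat each column's observed value frequencies as probabilities and charge each cell in proportion to its surprisal. The function name `price_cells`, the `price_per_bit` parameter, and the sample column are all hypothetical.

```python
import math
from collections import Counter


def price_cells(column, price_per_bit=1.0):
    """Price each cell by its surprisal under the column's own frequencies.

    Assumption: the empirical frequency of a value in the column stands in
    for its probability. Rare values cost more, common values cost less.
    """
    counts = Counter(column)
    n = len(column)
    return [price_per_bit * math.log2(n / counts[value]) for value in column]


# Hypothetical sensor-status column: three routine readings and one rare one.
column = ["ok", "ok", "ok", "fault"]
prices = price_cells(column)

for value, price in zip(column, prices):
    print(value, round(price, 3))
```

Under this scheme a column where every cell holds the same value is priced at zero, while the single rare reading is the most expensive cell in the sheet. A real pricing scheme would also need to account for what the buyer already knows, which this sketch leaves out.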

There are lots of zombie datasets out there, being shoved mindlessly into large models in the misguided hope that they will enhance performance. We can do better. We can actively acquire high entropy data from the world. We look at how much more data we’ve collected than we had before, but that isn’t the relevant metric. The relevant metric for building the AI of the future is how much of the information in the real world we have captured digitally. Spoiler alert… it’s really, really, really low. We need to rethink our sensors to go and get it. Entropic pricing will help by incentivizing data players to focus on capturing entropically rich AI fuel vs. just more data.

Is there anyone working on “entropic pricing” for datasets or building “max entropy sensors”? I’d love to learn more. Or, if you’re a company looking to assess the entropic value of your data, we’d be happy to help.

Pratap’s Substack
