
"Machines buying data to make better decisions"
Once a concept straight from cyberpunk fiction, the idea of machines autonomously buying datasets for development and decision-making purposes is no longer esoteric, especially for those following recent developments in agentic AI. At this point, it looks like an inevitable consequence of agentic automation.
Few people realize that this trend predates AI's ChatGPT moment by a few years. It's a little-talked-about innovation in crypto that has been in production for over 5 years. DeFi protocols, which are machines we call “smart contracts”, already tap into a data market to purchase datasets algorithmically. Lending protocols, much like "automated banks", run on machines that programmatically purchase market data to price collateral and assess risk. The venues operating in that data market are known as “oracles”, with projects like Chainlink at the forefront.
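To make the "automated bank" idea concrete, here is a minimal sketch of how a lending protocol might use a purchased price to value collateral and assess risk. It is an illustration under stated assumptions, not any specific protocol's logic: the `Position` type, the `health_factor` function, the 0.8 liquidation threshold, and all prices and amounts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Position:
    collateral_amount: float  # units of the collateral asset (hypothetical)
    debt_usd: float           # outstanding loan, denominated in USD


def health_factor(position: Position, oracle_price_usd: float,
                  liquidation_threshold: float = 0.8) -> float:
    """Risk-adjusted collateral value divided by debt.

    A value below 1.0 means the position would be eligible for liquidation.
    The threshold is an assumed parameter, not taken from a real protocol.
    """
    if position.debt_usd == 0:
        return float("inf")
    collateral_value = position.collateral_amount * oracle_price_usd
    return collateral_value * liquidation_threshold / position.debt_usd


# The same position, priced with two different oracle updates (made-up prices).
position = Position(collateral_amount=10.0, debt_usd=12_000.0)
hf_before = health_factor(position, oracle_price_usd=2_000.0)
hf_after = health_factor(position, oracle_price_usd=1_400.0)

print(f"health factor at $2,000: {hf_before:.2f}")
print(f"health factor at $1,400: {hf_after:.2f}")
```

The only input that changes between the two calls is the price the contract bought from the oracle, which is why the quality and cost of that feed matter so much to these protocols.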
Oracles started the "machines purchasing data" paradigm that will likely be a major theme in AI in the near future. When we started Portex, we made a bet that this concept was only going to expand. Our initial hypothesis was that crypto would serve as the launchpad, since it provides an ideal financial rail for autonomous agents, especially with stablecoins.
In mid-2024, we built an oracle that was 93% more efficient than Chainlink and 40% more efficient than the state-of-the-art at the time. We made a bet that oracles were the way data was going to become more easily tradable.
As it turned out, we were wrong and had to course correct. But this quest for PMF revealed some interesting dynamics, perhaps unique to crypto market structure, that we wanted to share.
Oracle Economics: Attractive at the Surface
As former data providers to nearly all of the major oracle systems, we had some familiarity with the economics of oracle providers. As we were designing our infrastructure, we sourced considerable onchain intel on oracles, including the margins or "take rate" that Chainlink node operators receive. After accounting for onchain transaction fees (gas), data acquisition, and cloud expenditures, we found that many node operators maintained margins of ~37% in 2023.
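The take-rate arithmetic behind a figure like that is simple. The sketch below uses the three cost categories named above; the function name and every dollar amount are hypothetical, chosen only so the result lands near 37%, and are not actual node operator data.

```python
def operator_margin(revenue: float, gas: float,
                    data_acquisition: float, cloud: float) -> float:
    """Margin as a fraction of revenue after the three cost categories."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    costs = gas + data_acquisition + cloud
    return (revenue - costs) / revenue


# Hypothetical annual figures in USD, for illustration only.
margin = operator_margin(
    revenue=100_000.0,
    gas=40_000.0,
    data_acquisition=15_000.0,
    cloud=8_000.0,
)
print(f"margin: {margin:.0%}")
```

Because gas is paid per onchain update, an operator's margin in this model is sensitive to both update frequency and network fees.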
If "your margin is my opportunity", we thought our approach was sound: introduce a more efficient, cheaper service that eliminates intermediation and offer a marketplace for a variety of more interesting data feeds. If Chainlink attained over $200M in data sales since 2021 with mostly crypto pricing data, the size of the market appeared favorable.
This was also happening at a time when Polymarket was making waves and showing how oracle-driven apps could expand beyond pricing data. Onchain applications were growing in size and adoption, so we assumed the addressable market for oracle services would follow suit.
Reality Check: structural moats that are hard to beat
While it appeared clear to us that the oracle space was ripe for disruption, a few realities began to set in.
First, we realized that Chainlink's tight grasp on the market outweighed any incremental gains in efficiency. The Chainlink (“Link Marine”) effect is real and its memetic value hard to quantify. Switching costs are high for a piece of infrastructure that is essential to making applications run smoothly.
Second, the need for data that goes beyond prices simply did not materialize in oracle-powered systems. Under a very unfavorable regulatory regime, Polymarket-style protocols that needed more diverse datasets were not being launched. What's worse: crypto projects were being extremely conservative with their treasuries. The result? A market limited to lending protocols focused on pricing data.
We still believe many (perhaps most?) of the problems in crypto can be solved with data that goes beyond pricing, even though pricing data will likely continue to make up the overwhelming majority of the oracle market.
We had very promising results when trying to solve the problem of scams with our Reputation Oracle, proving that there is a path for a world where crypto phishing is no longer prevalent. We collaborated with protocols fighting Sybils (bots) and also saw promising potential with bot identification. The TAM, however, still appeared empirically small when talking to countless protocols that could purchase this data, and the race-to-the-bottom dynamics with price feeds prevailed.
The outward shift in the demand curve
From the beginning, our initial focus on protocols as data buyers was predicated on the fact that, if we're being honest, it's still hard to onboard to crypto. The idea of AI companies using stablecoins to purchase data was still far-fetched, with high opportunity costs associated with onboarding for mid-sized organizations. Focusing on protocols made sense at that time.
But that all changed earlier this year, when Stripe acquired Bridge, a stablecoin infrastructure company. The acquisition vastly expanded Stripe's offering, allowing traditional payment mechanisms (ACH/credit cards) to settle into stablecoins. All of a sudden, a path to leverage our infrastructure to serve data buyers outside of crypto opened up.
Unlike our initial target market, AI companies have an outsized demand for data. Multimillion dollar data licensing deals between top AI labs and data vendors are prima facie evidence for this. At the same time, open platforms like Hugging Face are showing the emergence of open-source AI with a growing ecosystem of open datasets, model hosting, inference, research, and more.
Dataset-related activity on Hugging Face has accelerated since ChatGPT's launch in November 2022 and the release of open-weight models like Meta's Llama 3. Over 460,000 datasets have been uploaded to Hugging Face by 100,000 unique users. These datasets have been downloaded more than 60 million times.
The emerging open data economy on Hugging Face is nothing short of impressive. We see it as a critical piece of infrastructure for open-source AI, and we will have many more insights and data to share on this soon.
While we're champions of Hugging Face (and operate there), we see a need for a financial incentive layer to surface truly novel datasets and compensate researchers for the time, effort, and resources they commit to their datasets. This is exactly where we think crypto rails, and especially stablecoins, have the most potential to organize and formalize the data economy.
A new horizon
Over the past few months, we have had countless conversations with AI builders across industries to better understand their data needs. We tailored our infrastructure accordingly, maximizing crypto's utility while leveraging partners like Stripe so that we made no compromises on UX or compliance. The result is a new platform that truly frees up the power of novel data as the ultimate differentiator for AI outcomes.
After working with AI researchers, data scientists, and startups, we finally wrapped up our pilot program in July. We're excited to share our platform with the world and realize our vision of enabling direct price discovery for buyers and sellers of data. This is especially exciting for community-owned datasets that give users not only true upside from the growth of AI, but also a say in its direction and alignment.
The world is waking up to the fact that data, synthetic or otherwise, is the moat, as evidenced by recent events like Meta's investment in Scale AI. Crypto provides literally the only feasible architecture to make this a reality, but as it turned out, those who need it the most are on the outside.