Matadisco: Can We Bootstrap Public Data Discovery with ATProto?

Let's talk about a system for universal data discovery that doesn't depend on any single institution staying funded or online.

Finding open data is hard

We've never had more open data, yet finding the right data remains incredibly hard. Petabytes of satellite imagery, climate models, public health statistics, public domain art, and genomic sequences sit in public repositories. But researchers waste hours navigating fragmented portals or duplicating archives.

As Jed Sundwall argues, data has value only when it's discoverable, accessible, and ready to use. The scientific community codified this in 2016 with the FAIR Principles (Findable, Accessible, Interoperable, Reusable), and major funders now require it. But we’re still stumbling on the most basic requirement: findability.

IPFS solves half the problem, as shown by geospatial data projects like Gainforest, EASIER, and ORCESTRA. They use content addressing to make data reliable: datasets can't be silently modified, references don't break, and integrity can always be cryptographically verified. Many projects also use IPFS's peer-to-peer networking to make data transfer efficient.

But discovery remains the missing piece. How do you find data across a sea of different archives and mirrors? And how do we ensure that discovery can’t be unilaterally controlled or censored? We need a way to make any open dataset — whether hosted on institutional servers, IPFS, or cloud storage — discoverable in a way that no single entity controls.

Motivated by this challenge and Tom Nicholas’s 2025 essay “Science needs a social network”, we created Matadisco.

In this post, we'll introduce the why and how of Matadisco. For more technical details and a tutorial, check out the website, GitHub repo, or Volker Mische's post.

Introducing Matadisco

Matadisco uses the AT Protocol to create an open, decentralized layer for data discovery. It has three key parts: the records (lightweight pointers to metadata), the network (ATProto for decentralized distribution), and the portals (custom filters built by various communities for discovery).

Part 1: Records

Here’s what a Matadisco record contains:

  • a required link to the metadata
  • a data preview (optional)
  • a timestamp

This schema is published as an ATProto Lexicon, ATProto's format for defining structured record types that any application can recognize and use. The record is deliberately minimal, drawing on lessons learned from other federated systems. Because it requires only the metadata URI, it can work with any metadata standard: STAC for satellite imagery, DataCite for research datasets, IIIF for cultural heritage, RSS for publications, and more.

Part 2: Publishing to ATProto

Once records are created, they get published to a Bluesky PDS (Personal Data Server), which broadcasts them into the ATProto network. From there, independent relay nodes aggregate and redistribute the records across the ecosystem.
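The publishing step can be sketched with ATProto's `com.atproto.repo.createRecord` XRPC endpoint. The PDS host, access token, and collection NSID below are placeholders; real code would first authenticate against the PDS (via `com.atproto.server.createSession`) to obtain the token.

```python
import json
import urllib.request

PDS_URL = "https://pds.example.com"  # placeholder PDS host

def publish_record(repo_did: str, access_token: str, record: dict,
                   collection: str = "org.matadisco.record") -> dict:
    """POST a record to a PDS via the com.atproto.repo.createRecord
    XRPC endpoint; the PDS then broadcasts it into the ATProto network."""
    body = json.dumps({
        "repo": repo_did,          # the publishing account's DID
        "collection": collection,  # hypothetical NSID for Matadisco records
        "record": record,
    }).encode()
    req = urllib.request.Request(
        f"{PDS_URL}/xrpc/com.atproto.repo.createRecord",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {access_token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # contains the new record's at:// URI and CID
```

From the publisher's point of view that is the whole job: once the PDS accepts the record, relays take over distribution.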

We chose ATProto for both technical and strategic reasons. Technically, it's designed for exactly this: letting anyone publish records and build on shared data. Every record is cryptographically verified.

Beyond the technical design, the ATProto ecosystem is growing rapidly, with 42 million users and an estimated 400 applications. Bluesky, the original creator of ATProto, has also taken steps towards open governance at the IETF. Independent infrastructure is expanding too: multiple relays now operate beyond Bluesky's main service (including Blacksky's atproto.africa and relay.fire.hose.cam). Public institutions could help lock this network open by running their own relays, an accessible option with very modest costs.

Part 3: Portals for Discovery

Portals aggregate records from across the ecosystem into specialized discovery interfaces.

Building a portal is straightforward: subscribe to an ATProto relay, filter for relevant records, and display them. Our prototype demonstrates this in about 100 lines of code, using Copernicus Sentinel-2 satellite imagery as an example. It consists of two parts:

  • Matadisco-publisher listens for updates to the Sentinel-2 dataset via Element 84's Earth Search STAC catalogue, creates new records, and publishes them into the ATProto network
  • Matadisco-viewer subscribes to a relay, filters for new records, and displays them (in this case, with an optional preview).

Because records flow through an open network, institutions manage their catalogs independently while participating in shared discovery. Once researchers find relevant data, they access it through their existing tools — Python, R, GIS software, or whatever the workflow requires.

Open Questions & What’s Next

Matadisco is experimental. There’s still a lot to figure out!

How should the schema evolve across different domains? We're currently testing with additional data sources: German geodata catalogues, GLAM collections using IIIF, and public broadcasting archives. The lexicon schema will evolve as we learn from different domains.

Is this actually usable across different domains? The minimal schema should work with any metadata standard, but "should work" and "works well" are different things. We're collaborating with the Exhibit.so team who contribute to the IIIF 3D technical specifications group on adapting this for cultural heritage collections. We would welcome similar conversations in other fields.

How do we connect datasets across mirrors and archives? The current design makes datasets discoverable, but how do we help users find the same dataset whether it's hosted at the original source, a university archive, or a community mirror? This is where content addressing—like IPFS CIDs—could help: the same dataset would have the same identifier regardless of where it's stored, making it trivial to discover all available copies.
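The idea behind content addressing can be illustrated with a deliberately simplified sketch: derive the identifier from the bytes themselves, so identical copies share an identifier no matter where they are hosted. (Real IPFS CIDs wrap the digest in multihash/multibase encodings; a bare SHA-256 hex digest is used here only to show the principle.)

```python
import hashlib

def content_id(data: bytes) -> str:
    """Simplified content address: identical bytes always yield the
    same identifier, regardless of where the copy is stored."""
    return hashlib.sha256(data).hexdigest()

original = b"sentinel-2 scene bytes..."
mirror = b"sentinel-2 scene bytes..."   # byte-identical mirror copy
tampered = b"sentinel-2 scene bytes!!!"

assert content_id(original) == content_id(mirror)    # same data, same ID
assert content_id(original) != content_id(tampered)  # any change, new ID
```

With location-based URLs, the original, the university archive, and the community mirror all look like different datasets; with content addressing they collapse into one identity.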

How do we make this durable? Making this work long-term will require institutional participation: research organizations, libraries, archives, and data repositories publishing records and potentially operating relays. We also need to figure out governance for how schemas evolve. 

We'd especially love to hear from people working in open data, public archiving, metadata standards, or scientific infrastructure:

  • Try it: Check out the live demo at matadisco.org
  • Code & discuss: Explore the schema and open issues at github.com/ipfs-fdn/matadisco
  • Join the conversation: (add link), especially if you're considering publishing metadata or building a portal
  • Meet in person: Volker Mische will present at FOSSGIS Göttingen in March, and Robin Berjon will give a lightning talk at ATmosphere Conf. Come say hello.