Composable Data Platform: A New Way to Access Data on Stellar

Author

Molly Karcher

This article is the first in an expansive series on the Composable Data Platform, the next-generation data-access platform on Stellar. It gives a high-level overview of the architecture, along with its goals and capabilities. Follow-ups will go into depth on the different components of the architecture, dig into meaningful use cases, and explore how you can use the platform to supercharge your own applications.


The State of the World

Today, SDF owns and operates two products that mediate data from the Stellar blockchain: Horizon, an API that interacts with the network, and Hubble, a full-history analytics dataset. At present, SDF hosts public versions of these products to encourage small-scale development and experimentation on the network. This model has led to challenges in the ecosystem:

  • Limited Flexibility - the monolithic nature of the current products forces an all-or-nothing approach. Providers must either adopt SDF’s chosen data model or develop their own independent integration from scratch. Building an application from scratch is time-consuming and requires SME-level domain expertise.
  • Logical Centralization - the current products are difficult to run, making SDF-hosted products a community fallback. This has encouraged many to become dependent on a single infrastructure organization.

To support and foster a healthy, decentralized network, it is important to us that data storage and access are distributed across as many ecosystem participants as possible.


Following both the launch of Soroban and the change in SDF Horizon’s retention window, we are seeing a sharp rise in interest from analytics providers, infrastructure providers, indexers, and the like in providing network data as a service. Unfortunately, given the current state of the world, it’s harder than it should be for these providers to on-ramp quickly when trying to serve Stellar data. Because Horizon and Hubble are monolithic, they tend to drive an all-or-nothing approach: you’re either nudged into adopting SDF’s chosen data model, or you roll your own integration entirely independent of these tools.

Composable Data Platform

The Composable Data Platform (hereafter referred to as CDP) is a collection of open-source tools and libraries that work together to streamline data access for the Stellar ecosystem. The intent is to allow each ecosystem participant to plug-and-play as needed and customize their solution based on their individual application needs.

The key components that make up CDP include:

  • Galexie: Extracts ledger data (transaction metadata, or “TxMeta”) from the blockchain using a Stellar Core node and writes it to a pipe, queue, or long-term data storage solution.
  • Data Object Storage: Serves as a long-term storage mechanism for raw, immutable TxMeta, stored compressed in Stellar’s XDR format. Conceptually, this is a data lake.
  • Ledger Backend: Ingests the raw data from a configurable TxMeta source. Captive Core or a Data Lake can serve as a TxMeta source.
  • Processors: Transform the raw data into a meaningful, human-readable format and enable end-users to customize their data processing to their specific application needs.
  • Loaders: Define domain-specific schemas and load the data somewhere to be consumed by an application’s end-users.
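
One way to picture how these components interlock is as a chain of small interfaces. The sketch below is illustrative Python only and is not the ingest SDK's API: `LedgerBackend`, `Processor`, `Loader`, `InMemoryBackend`, `PaymentProcessor`, `ListLoader`, and `run_pipeline` are all hypothetical names, and the dict-shaped ledgers stand in for real XDR-encoded TxMeta.

```python
from typing import Iterable, Iterator, List, Protocol


class LedgerBackend(Protocol):
    """Hypothetical TxMeta source (e.g. Captive Core or a data lake)."""

    def get_ledgers(self, start: int, end: int) -> Iterator[dict]: ...


class Processor(Protocol):
    """Hypothetical transform from raw ledger data to application records."""

    def process(self, ledger: dict) -> Iterable[dict]: ...


class Loader(Protocol):
    """Hypothetical sink that writes records to an application data store."""

    def load(self, records: Iterable[dict]) -> None: ...


class InMemoryBackend:
    """Stand-in source: a real backend would decode XDR, not read dicts."""

    def __init__(self, ledgers: List[dict]) -> None:
        self._ledgers = {ledger["sequence"]: ledger for ledger in ledgers}

    def get_ledgers(self, start: int, end: int) -> Iterator[dict]:
        # Yield ledgers in sequence order over the inclusive range.
        for seq in range(start, end + 1):
            if seq in self._ledgers:
                yield self._ledgers[seq]


class PaymentProcessor:
    """Keeps only operations of type 'payment' and tags them with the ledger."""

    def process(self, ledger: dict) -> Iterable[dict]:
        for op in ledger["operations"]:
            if op["type"] == "payment":
                yield {"ledger": ledger["sequence"], **op}


class ListLoader:
    """Stand-in loader that appends to a list instead of a database."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def load(self, records: Iterable[dict]) -> None:
        self.rows.extend(records)


def run_pipeline(backend: LedgerBackend, processor: Processor,
                 loader: Loader, start: int, end: int) -> None:
    """Wire the three stages together; each stage can be swapped freely."""
    for ledger in backend.get_ledgers(start, end):
        loader.load(processor.process(ledger))
```

Because `run_pipeline` depends only on the three interfaces, any stage could be replaced (a different source, a different transform, a different database) without touching the other two.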

At first glance, this might appear quite simple, even obvious. Indeed, this is in essence what Hubble and Horizon both do today. The two share code for some of these components (as they both use the ingest SDK), but for the most part they represent parallel, divergent codebases that often tightly couple most of this functionality together. This makes it extremely difficult (if not impossible) for you to customize deployments of them to suit your own data needs.

CDP reimagines that monolithic architecture by clearly defining and then externally exposing each of those inner components. Components are represented as standalone interfaces that can be configured and operated independently, while seamlessly interlocking with every other component to provide a customized data layer for your application.


What can you do with this?

This unlocks endless possibilities! Importantly, it gives you, the developer, the power to completely customize your data consumption and access patterns, depending on what kind of application you’re building. For example:

  • Choose your XDR (TxMeta) source based on your own application’s unique liveness, consistency, and availability requirements
  • Abstract away your XDR data source, allowing you to swap it in or out via configuration with zero changes to your application code
  • Remove the need for your live application backend to run Captive Core within the application itself, drastically reducing its resource requirements
  • Customize which XDR processors you use (or create your own!), allowing you to ingest and store only the data that matters to you, which safeguards against excessive long-term storage costs
  • Implement your own schema (and Data Loader), allowing you to choose the optimal database for your application, instead of being pigeonholed into PostgreSQL (Horizon) or BigQuery (Hubble)
  • Contribute back useful, fun, or unique Ledger Backends, TxMeta storage options, or Processors to the open source ecosystem
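
To make the "swap via configuration" idea concrete, here is a minimal sketch in Python. It assumes a hypothetical config key `txmeta_source` and two placeholder source functions; none of these names come from CDP itself, and the canned ledgers simply stand in for data that would really come from Captive Core or an exported data lake.

```python
from typing import Callable, Dict, Iterator


def captive_core_source(config: dict) -> Iterator[dict]:
    """Placeholder for a live source; here it only yields canned ledgers."""
    for seq in config.get("canned_sequences", []):
        yield {"sequence": seq, "source": "captive_core"}


def data_lake_source(config: dict) -> Iterator[dict]:
    """Placeholder for reading exported ledger files from object storage."""
    for seq in config.get("canned_sequences", []):
        yield {"sequence": seq, "source": "data_lake"}


# Registry keyed by the (hypothetical) config value "txmeta_source".
SOURCES: Dict[str, Callable[[dict], Iterator[dict]]] = {
    "captive_core": captive_core_source,
    "data_lake": data_lake_source,
}


def open_source(config: dict) -> Iterator[dict]:
    """Pick a TxMeta source purely from configuration."""
    kind = config["txmeta_source"]
    try:
        factory = SOURCES[kind]
    except KeyError:
        raise ValueError(f"unknown txmeta_source: {kind!r}") from None
    return factory(config)


def count_ledgers(config: dict) -> int:
    """Application code: unaware of which source produced the ledgers."""
    return sum(1 for _ in open_source(config))
```

In this sketch, changing `txmeta_source` from `"captive_core"` to `"data_lake"` changes where ledgers come from while `count_ledgers` stays untouched, which is the property the list above describes.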

Consider the vastly different backend architectural choices that could be made across these examples:

  • Wallet backend that only cares about data pertaining to their customer accounts
  • Asset issuer that only cares about data pertaining to their assets
  • Contract developer that only cares about debugging data related to their contracts
  • Centralized exchange that only cares about transactions in and out of their omnibus accounts
  • Trading bot that only cares about the latest offer and trade data
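
As an illustration of the first case, a wallet backend might run a processor that drops everything not touching its customers' accounts. This is a hedged sketch rather than a real CDP processor: `wallet_processor`, `touches_watched_account`, the `source`/`destination` field names, and the `ACCT_...` identifiers are all made up for the example and are not valid Stellar addresses.

```python
from typing import Iterable, Iterator, Set


def touches_watched_account(op: dict, watched: Set[str]) -> bool:
    """True if the operation's sender or receiver is a customer account."""
    return op.get("source") in watched or op.get("destination") in watched


def wallet_processor(ledgers: Iterable[dict], watched: Set[str]) -> Iterator[dict]:
    """Yield only operations relevant to the watched accounts.

    Everything else is discarded before it reaches storage, so the
    application stores only the data it actually needs.
    """
    for ledger in ledgers:
        for op in ledger["operations"]:
            if touches_watched_account(op, watched):
                yield {"ledger": ledger["sequence"], **op}
```

An asset issuer or exchange could follow the same shape with a different predicate, such as matching on an asset code or an omnibus account instead of a customer account set.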

Ultimately, only you know what data your application needs, so only you can decide the optimal schema (and data store) to hold it. If you need help thinking through options or figuring out how CDP fits in, reach out to us on Discord! We have two channels: #hubble for analytics questions, and #horizon for operational, real-time questions. We’re available to help brainstorm, and your feedback will help us decide what meaningful extensions we add next to CDP.

Integrate Today!

The first piece of the puzzle, Galexie, is out and available for public use; check it out on GitHub or Docker Hub! It currently supports a single object storage option (GCS), and we’re eager to hear feedback on what storage mediums may be most valuable to you in the future.

We are leveraging the performance gains and simplicity of CDP by refactoring portions of our existing products, Hubble and Horizon. SDF’s Hubble now uses a Galexie-exported GCS data lake as its backend - see stellar-etl for details. Horizon support for re-ingesting from a Galexie-exported data lake is available in v2.32.0. Later posts in this series will go in depth on how we refactored our services, as well as what you can expect in terms of cost and wall-clock time if you opt to use these new components.

To start building your own application independent of Horizon or Hubble, take a look at our ingest SDK. It encapsulates CDP’s Ledger Backend interface and can be used to build your own ingestion pipeline configured to consume from a Galexie-exported data lake.

What's next?

Stay tuned for next week’s post on Galexie, where we’ll be doing a deep-dive on installation, development, and usage. Galexie is the first major component of CDP, enabling developers to efficiently export and memoize Stellar network data for processing.

We’re actively working on developing a library to house our processors (or transforms), which will help to transform the raw XDR format into data models that you’re more familiar with if you’re used to utilizing Horizon or Hubble for data access.

This may all sound a little overwhelming and abstract, but we’ll be publishing extensive example implementations to demystify the new platform. Later posts in this series will also highlight existing key use cases and illustrate how you could use the full power of CDP in your own application.

In the meantime, join us in our Developer Discord to chat through any questions, concerns, or feature requests as we work to modernize data access on Stellar!