
Beyond the Data Abyss

How We Learned to Fall in Love with our Streaming Data Again.

12 min read · Jul 17, 2025


A colorful rendering of what Space could look like abound with reds and yellows, blues and blacks. Not the classic void of space but something more happy and inviting.
Imagine life at the brink of creation. Your data is in a similar primordial state until you gain control and learn to harness it. Image by Author via MidJourney.

In part one of this series, "Escaping the Void of the Data Abyss," we put our heads together and learned about the countless horrors that will sneak up on you in the night (or over the years) with respect to streaming data at massive scale, through the lens of what didn't work at Nike. We then learned about the strategy Nike took to right the sins of the past, and how Buf came to our rescue and saved the day. Lastly, I left us with a cliffhanger at the end of the first entry, saying I'd talk about some of the problems (and solutions) that came about as we started to scale up our protobuf-first strategy. Alas, here we are. Ready. Set. Let's go.

Aligning on a Protobuf-First Data Strategy

This can be a hard sell, at first. Most people, when given the choice between a) object flexibility and mutability or b) strict type safety, object-level governance, and message-level immutability, will tend to do whatever is easiest, which falls squarely in the mutability camp. But mutability ultimately equals extra work, since you'll be paying the piper on ingest as things change, which in many cases means "after emission into a Kafka-like service."

This is a problem: streaming JSON data is still mutable, and on its own it has no notion of schema governance. That just means it has a comedic way of surfacing corrupt records at the worst possible times. This can be annoying for web services, but it is crippling for event streaming, since there is no way to rewind and replay corrupt events. Simply put, it leads to data loss, and data loss is an outage!

Most people, when given the choice between a) object flexibility and mutability or b) strict type safety, object-level governance, and message-level immutability, will tend to do whatever is easiest, which falls squarely in the mutability camp.

So, to recap: JSON is great for HTTP services. JSON is bad for reliable streaming services. Protobuf makes JSON services more manageable as well (through type safety and object-level invariants), so it's a win-win in both respects, and the easier sell is simply to use protobuf as the backbone for traditional web services and as the go-to for event streaming. At Nike, we needed it for both.

Sidebar: The Streaming Data Problem

The problem I mentioned earlier is fairly ubiquitous across most enterprises. In a nutshell: structured data without a binary, schema-backed serialization format, flowing through streaming systems, causes problems. The trouble is that these problems sneak up on us over time with slowly changing data (row or columnar), and this can eventually corrupt historically captured data, to the point of no longer being able to read records.

There is likely a graph somewhere that plots the length of time a data product has been in existence against the number of changes to its underlying data structure(s), with the count of invalid records in that dataset climbing as a function of both over the product's lifetime. If not, then imagine a chart showing something slowly getting worse over time.

Not breaking backwards compatibility is one reason why protovalidate is so important (see part 1 for more details), and it's one of the easier ways of selling the protobuf-first strategy. It's simply *impossible to force through a breaking change.
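To make that concrete, here's a minimal sketch (the message and field names are hypothetical, not our actual schemas) of how protovalidate constraints sit right next to the fields they guard:

```protobuf
syntax = "proto3";

package sketch.clickstream.v1;

import "buf/validate/validate.proto";

// Hypothetical clickstream event. The annotations below are enforced by
// protovalidate at runtime, while buf's breaking-change detection guards
// the shape of the schema itself.
message PageViewEvent {
  // Must be a well-formed UUID.
  string event_id = 1 [(buf.validate.field).string.uuid = true];

  // Must be a valid URI.
  string page_url = 2 [(buf.validate.field).string.uri = true];

  // Epoch milliseconds; must be positive.
  int64 occurred_at_ms = 3 [(buf.validate.field).int64.gt = 0];
}
```

The invariants travel with the schema, so every consumer (gateway, pipeline, warehouse) rejects the same garbage for the same reasons.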

Back to aligning on the strategy.

Moving Past the Hard Sell

Going back to the hard sell (which was "hey, we're going to go all in on protobuf"), making it less of a hard sell takes some careful planning. Remember, people are most often afraid of what they don't know. Unless people really trust you, they'll need to see "proof" (which we had through the clickstream project), and they'll also be on the lookout for an "easy button" unless they are convinced they can continue to deliver while undergoing a complete re-architecture.

One of the myths we found floating around the rumor mill was that "protobuf is too rigid to work with due to its immutability." Luckily, we could easily squash that myth: while protobuf is immutable, it is not immutable in the way we think of sealed or final objects (in software). In fact, protobuf definitions can and will change over time. The big difference is that the specification accounts for how changes can be made while ensuring backwards compatibility with prior versions of the same message.
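Here's a small sketch of what that evolution looks like in practice (field names are made up for illustration): new data gets new field numbers, and anything removed is reserved so it can never be reused incompatibly.

```protobuf
syntax = "proto3";

package sketch.membership.v1;

message User {
  // A field was removed in an earlier revision; reserving its number and
  // name prevents anyone from reusing them with a different meaning.
  reserved 2;
  reserved "nickname";

  string user_id = 1;

  // Added later. Old readers simply ignore field 3; new readers decoding
  // old payloads see the empty-string default.
  string display_name = 3;
}
```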

Knowing that we could achieve backwards compatibility and gain compile-time guarantees, along with the time-saving benefits of code generation, proved to be the lightbulb moment most engineers were waiting for. Most of them had been dealing with broken data promises and their associated outages (botched deployments, 400s, 500s, broken data pipelines) for years and were happy to swim to shore. An easy sell for engineers is always "you'll do less work."

Regarding message and service immutability
A protobuf definition is immutable "at the point in time" that it is compiled. It is easier to think about "this point in time" in terms of a version. We simply say "for a given version of a protobuf message, it is immutable." But how do we account for a given version?

Lessons Learned: Versioning Strategy for our Protobuf

Meanwhile, back in Nike-land, we ran into the problem of versioning our protobuf early on in the project. What's the best pattern, especially when we were compiling many "artifacts" for various platforms and languages? We asked ourselves:

  • Do we cut a new release on every change?
  • What if changes occur in places we don’t care about?
  • How do we version and release specific sub-sets of messages?
  • What about releases? Do we wait to release a new “complete” snapshot using git-tags following our standard release versioning?
  • Is there a better way to share common messages so we don't fall into old habits (like denormalizing our messages vs. normalizing via composition)?

As you can tell there was a lot on our minds.

Initially, we followed the data-domain pattern (a play on the domain-driven modeling approach) within a large monorepo. Since we were using a monorepo, we could create shared resources (using local references), and our domain-specific resources (our protobuf messages and Connect services) could still benefit from sandboxing (within a given domain) while still sharing the same parent directories. In essence, we could take the one-ring-to-rule-them-all approach and hope for the best.

proto/
  common/
    product/
      item.proto
      sku.proto
    membership/
      user.proto
      ...
    order/
      order.proto
      return.proto
    iso/
      country.proto
      language.proto
      ...
  domain_a/
  domain_b/
  ...
  domain_n/
The Problem: Monorepos tend to start off with the best of intentions, but depending on the size and scale of the enterprise, they can become unruly. When we crossed the threshold of around 1,000 protobuf types (types being messages, enums, and RPCs), we realized we were headed for a dead end.

Additionally, the problem of the monorepo isn't just tech related: scaling the number of collaborators (around 20–30 people), who all have their own needs and deadlines, is a chore in and of itself. We needed a way to share versioned artifacts without requiring everything to live under the same roof.

We had to find an easier way. We needed decentralized modules. Really, we simply needed a solution that understood the human-scalability concerns.

The Solution: The Buf Schema Registry (BSR).

The BSR is a single pane of glass for all things within the protobuf ecosystem, and it proved to be a wonderful tool for coordination and collaboration as well. It meant that engineers across our API services, data domains, and analytics domains could come together in a central place while still operating with autonomy.

The Swiss-Army Knife that is the Buf Schema Registry

To fix the issues of scaling out our protobuf-first strategy, across both people and organizational domains, we (Nike) ended up solidifying a longer-term partnership with Buf.

It was a no-brainer to move forward because of the Buf Schema Registry (BSR), but around the same time that we were going through procurement, a new product offering was just on the horizon: Bufstream (we'll get to that at the end of this post), and it was of great interest to our plan for protobuf-based domination.

The BSR provided a way for us to create versioned remote modules, rather than continuing to maintain a centralized monorepo. Adopting the BSR solved the problem of too many cooks in the mono-kitchen almost overnight. Teams could now operate independently within organizations (similar to how you would use organizations within GitHub), and we could all still share common modules through a "common" organization holding specific collections (repos). For example, think of common standards like ISO country codes: these can be shared as a collection of enums rather than free-form strings, which means no more typos.

All of the tricky dependency management is handled automatically by the Buf CLI, with assistance from buf.yaml (which updates your buf.lock). When it comes time to create "versioned" resources, this is done through "labels" (in a similar way to using git tags for releases), bound to a specific git commit (at least, that's how we'd recommend doing it). See Figure 1–1, which showcases versioning in action.

Shows a User Interface (UI) showcasing the documentation for Buf’s protovalidate. The tab on the web page is open to the “docs” tab, and the README file of the project is rendered beautifully.
Figure 1–1: A view of the “docs” tab within the public Buf Schema Registry. Explore Here.
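As a rough sketch of the configuration behind all of this (the module and organization names are hypothetical, and assume a private BSR instance): a consuming module lists its remote dependencies in buf.yaml, and a release becomes a label pushed from CI against a specific commit.

```yaml
# buf.yaml (v2 config): a hypothetical domain module that depends on a
# shared "common" module hosted in the BSR.
version: v2
modules:
  - path: proto
    name: buf.example.com/order/events
deps:
  - buf.example.com/common/iso
lint:
  use:
    - STANDARD
breaking:
  use:
    - FILE

# Typical flow: `buf dep update` refreshes buf.lock, and CI runs something
# like `buf push --label v1.4.0` to publish a versioned label.
```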

With the ability to version our protobuf modules in a standardized, decentralized way, plus the beautiful addition of remote dependency tracking, we were finally cooking with fire.

But we still needed to figure out the best pattern for releasing our artifacts. Luckily, this was already a feature baked into the Buf Schema Registry.

Light Bulb Moment: Server-Side Artifact Generation

One of the more amazing features of the BSR is the ability to lazily generate artifacts. In Figure 1–2, you’ll see the SDK tab highlighted. The view provides you (the engineer) with simple directions to fetch the specific “versioned” resources you need.

Figure 1–2: Utilizing Lazy SDK Generation hints via the SDK Tab.
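As a concrete flavor of those directions (the owner and module names below are made up for illustration), pulling a generated Go SDK for a module looks roughly like this:

```sh
# Generated protobuf types for a hypothetical BSR module, via Go's module proxy.
go get buf.build/gen/go/acme/order-events/protocolbuffers/go@latest

# Connect client/server stubs generated from the same module.
go get buf.build/gen/go/acme/order-events/connectrpc/go@latest
```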

This feature really cuts down on the level of effort required to work with globally distributed teams, and we could now do more work asynchronously. Teams no longer needed to reach out in Slack to understand which version of a given artifact was safe to deploy. Gone were the useless meetings to coordinate releases across our API services. We now had enterprise-wide invariants; we could finally trust first and ask questions later!

Lesson Learned: GitHub Actions Rule

Utilizing protobuf and building gRPC services (via Connect) was a huge win. The icing on the cake was the power of the BSR's bot users and Buf's official GitHub Action (buf-action).

What are Bot Users?
If you are unfamiliar with bot users, they are also referred to as "headless" users or service principals. In general, a bot user is a user that will never need to "see" the UI, and is typically used for CI/CD or other automations.

We realized we could create a new bot user for each onboarded organization within our enterprise BSR. We could then utilize GitHub Actions (shown in Figure 1–3) to enforce "stage>pr>push" patterns in a unified way.

Our custom GitHub workflow included a step that would fetch the bot-user token, and afterwards all commits to the BSR would be signed by the bot user of a given organization. While it can be nice to "test" things out locally and push experiments (as a human), in the long run we turned off "write" access for non-owners within the BSR to prevent garbage from littering the environment. This strategy may not be the correct way of working for your company, but for us the additional safeguard meant we could enforce our ways of working and sleep well at night.
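A minimal sketch of that workflow, assuming the organization's bot-user token is stored as a repository secret (the secret name and file path are our own conventions, not Buf's):

```yaml
# .github/workflows/buf.yml
# Lint and breaking-change checks run on pull requests; pushes to main
# publish to the BSR, signed by the organization's bot user.
name: buf
on:
  push:
    branches: [main]
  pull_request:

jobs:
  buf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: bufbuild/buf-action@v1
        with:
          # Token minted for the organization's bot user in the BSR.
          token: ${{ secrets.BUF_BOT_TOKEN }}
```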

Figure 1–3. Buf's official GitHub Action.

Now that we had a strategy for how we’d roll out protobuf across the company, we could start to build additional layers onto this solid foundation. This was a good time for us to step back and consider why things were going well.

Project Retrospective

We started small and proved that our ideas would work and scale. It often feels like you are moving at a snail's pace getting everything required into place, but that is more of a time dilation. The momentum of a new project brings with it so many novel ideas, but we had to remember to stick to our milestones and finish what we started. Then we could add additional layers and complexity, with the trust earned by our earlier actions and successes.

The other thing we did was pave a golden path for all the teams that would be following in our footsteps. This meant we had already paid down the system-wide complexity and figured out ways of working that evolved through real trial and error, and we did so with mission-critical focus and concentration.

Semantic Streaming with Bufstream

In part one of this series (and highlighted here in Figure 1–4), I talked about our ingestion architecture and how we built our gateway ingestion services to receive clickstream events, validate those event streams, and either a) emit error messages back to the client, or b) write to Kafka to be consumed within Databricks.

Shows a data ingestion architectural diagram with client SDKs on the left. Event data is emitted via a Gateway API Service and each record is written to Kafka for consumption downstream via Databricks.
Figure 1–4. Our Original Ingestion Architecture

This architecture was our starting point. Given the goal of having a protobuf-first data strategy, we realized fairly quickly that we were going to run into additional points of contention, even with our generic "protobuf-aware" PySpark applications. The overhead alone of running full-time streaming applications, or coordinating scheduled jobs using Structured Streaming and trigger(availableNow=True), meant that we'd need to bring best-in-class automation to the table (literally).
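For a sense of the overhead we wanted to eliminate, here is roughly the shape of the per-topic job we'd otherwise be maintaining (the topic, paths, and table names are illustrative, not our production code): a scheduled Structured Streaming pass that decodes protobuf off Kafka and appends to a lakehouse table.

```python
# Sketch of a per-topic ingestion job: read protobuf-encoded records from
# Kafka, decode them with a compiled descriptor set, and append to a table.
from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # illustrative broker
    .option("subscribe", "clickstream.page_view.v1")   # illustrative topic
    .load()
)

decoded = raw.select(
    from_protobuf(
        "value",
        "PageViewEvent",
        descFilePath="/dbfs/schemas/clickstream.desc",  # compiled descriptor set
    ).alias("event")
).select("event.*")

query = (
    decoded.writeStream
    .trigger(availableNow=True)  # drain whatever is available, then stop
    .option("checkpointLocation", "/dbfs/checkpoints/clickstream")
    .toTable("analytics.clickstream_page_views")
)
query.awaitTermination()
```

Multiply that by every topic (plus checkpoints, clusters, and schedules) and the appeal of pushing this work down into the streaming layer becomes obvious.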

Luckily for us, we had an ace up our sleeve: Bufstream. While we are currently working to scale out this new architecture, the early results are looking really promising. For us, reducing the overhead required for ingesting from N Kafka topics and writing all records to N lakehouse tables means we'll reduce our ingestion complexity by a factor of N. This is huge, even if it doesn't seem that way.

Figure 1–5. Future-Facing Ingestion Architecture

This is all made possible because Bufstream can do zero-copy writes into the Iceberg lakehouse format. With the acquisition of Tabular by Databricks, Apache Iceberg became a first-class citizen within Unity Catalog for our Databricks workspace(s), which means we wouldn't need to do anything more complex than write our analytical data to a Bufstream topic. Now that Buf provides support for Unity Catalog managed Iceberg tables, this is a simple win. Take the win! The hard work is now done behind the scenes for us, all within our own VPC.

I’ll be writing more in depth about Bufstream, so look out for future posts.

The Future is Bright (Nike x Buf)

Now that Nike has partnered with Buf, and after integrating the BSR, nailing down the golden paths for managing our versioned protobuf, figuring out the right set of GitHub Actions to simplify deployment of our Connect RPC services, and beginning the journey toward simplifying our Kafka footprint using Bufstream, I can say the future is looking extremely bright.

One last thing for this post: a quote from one of my close friends at Nike. He said, "before using Buf's tooling we (Nike) were looking at months to do basic changes to our existing APIs, and now that we've nailed down the right patterns, we are looking at DAYS to do the same work."

before using Buf’s tooling we (Nike) were looking at months to do basic changes to our existing APIs, and now that we’ve nailed down the right patterns, we are looking at DAYS to do the same work.

Now, when asked whether you prefer a) object flexibility and mutability or b) strict type safety, object-level governance, and message-level immutability, the answer is clear. Choose protobuf and the message-level invariants that you will grow to trust.

If you missed Part 1 of this series, check it out below.

* Impossible: you can force a breaking change by turning off protovalidate; luckily, with the BSR you can enforce breaking-change detection at the registry level across the entire enterprise.




Written by Scott Haines

Developer Relations Engineer @ Buf. Ex-Distinguished Software Engineer @ Nike. I write about all things data, my views are my own.
