
Escaping the Void of the Data Abyss

Leaning on the Guiding Lights of Structured Data and Protocol-Level Invariants to Avoid the Evil Clutches of Bad Data and Rewrite the Sins of the Past.

17 min read · Jun 30, 2025


Does your Data Scare You? Image by Author via Midjourney.

If you’re anything like me, you don’t frighten easily. I love horror movies. I truly appreciate the clever use of foreshadowing, especially when we all know what is clearly coming next. Being a member of that silent audience, sitting on the edge of our seats, gives us the thrill of premonition from a place of safety while the movie takes us right around the corner. What comes next is literally why we’re here!

Does Your Data Scare You?

Jump scares and coming face to face with monsters aren’t what we crave, though, when it comes to our data. They just happen to be one of the realities of working at a large enterprise. Comically bad data can make you feel just as helpless as the damsel in distress, leaving you paralyzed and dragged along for the ride like the audience in the movie theater, even though we already understand what is probably right around the corner: in this case, broken data promises and sadness.

So why does our data scare us?

It doesn’t have to. In fact, just like the unsung hero in most horror movies, it is up to us to flip the script and change the outcome of the story. What comes next is in your hands (dear audience) but as a lifeline, I’ll tell you a recent story from Nike in hopes that you too can escape the event horizon of the data abyss!

Steering around Data Swamps and Preventing the Impending Datapocalypse

I joined Nike in 2022 and my mission was fairly “simple” — I was being hired to change the data culture and find creative ways to fix the monstrous problems we had with “all things enterprise data and analytics”. Simple. Right? Nope.

I had unwittingly just stumbled into a horror movie set in a scary world of hopes and poorly executed data pipeline dreams* — luckily, this wasn’t my first time escaping the abyss. Just like Van Helsing, I too had tools at my disposal and prior success to help me on my way. What I needed first was to assemble a crew and materialize a mission that could set off a series of events to rid the world of the data swamp monsters lurking in the data deep — once and for all.

Oh. But where to get started?

Focusing on Mission Criticality and Solving the Hard Problems

It’s easy to do work in the shadows. Alone. It is much harder to get work accomplished with the blinding light of Sauron’s eye watching your every move — all the while building trust (at a new company) and assembling a core team as an IC (distinguished software engineer).

When you’re working to disrupt the status quo (the most likely reason I was there), there are people who want you to succeed — and then there is everyone else. Luckily for me, I had the support of my boss (head of ED&AI) and found a crew of folks who wanted to work hard and solve complex problems (across multiple organizations in a matrixed company… scary!).

In order for us to solve our data problems (which I’ll go into next), we were going to need to roll back the clock to a time before the worst of the bad decisions were made, and then rewrite the playbook — and, as we found out later, the entire analytics and data serving stack.

We were going to once again need to lean heavily on my best friend Protobuf, but this time we were going to pick up some new advanced tools along the way (foreshadowing!).

If you’ve ever heard me talk before, you’ve probably heard me talk about my love of all things protobuf. If not, check out the video from RedisConf 2020.

So about those data problems!

The Hard Problem: Nike’s Clickstream

After a few months at Nike something became apparent (seemingly bubbling to the surface on its own). There was something wrong with the clickstream. This wasn’t like a simple “oops”, this was so much more than “ooo-oo-oooops”. This was sinister and insidious like a good haunted house story.

Something was systemically wrong, and oh, this is it (I thought to myself), this is why I’m here. I had found my mission. I had figured out where I could lean in and create meaningful change. But what was so wrong, you might be asking — because you don’t know the story like I do.

A creepy abandoned house in a barren field: not the Data Lakehouse you were Expecting! Image by Author via Midjourney.

What had started out with the best of intentions — essentially all of our clickstream events, our hand-tuned SDKs, and even our end-to-end data delivery strategy — was beginning to crack, crumble, and fall apart. There were even early warning signs of complete system-wide collapse (things had grown too big for the architecture). Much like the abandoned “maybe haunted” house in the picture above, things were not looking too good. We had to make a call. Do we fix it (and can we)? Or do we abandon ship (and essentially burn things to the ground)? This is a difficult question, and it deserves some truly objective reasoning.

Retrofit or Rebuild?

There are pros and cons to using JSON for data-intensive applications — mostly cons when it comes to reliably streaming mission-critical data. In this case, the mutability of JSON data was causing system-wide issues on the clickstream. Still, the question remained: do we retrofit or rebuild? And can we fix the sins of the past?

If you have some time on your hands, you can dig into a longer post on analytical stream processing. It goes into the “why” behind “why not JSON”.

Initially, the idea was that we could just retrofit the system since after all it was built with the best of intentions. So the team and I dug in, and we uncovered some more ugly truths.

The Sins of the Past

Back in the day (sometime around 2014-ish) a consulting company (one of the big ones with gigantic market cap) sold Nike on the idea of using “TSV” (yep, tab-separated values) files to encapsulate individual events for the clickstream. This solution was created to solve the problem of long lead times for changes on the clickstream (events specifically). The main problem identified by “said consulting company” was the delta between modeling an event, and then instrumenting said event on our desktop and mobile sites (and applications) with proper testing — it was simply taking “longer than expected”.

So people nodded along, and “said consulting company” was paid to implement the solution, train as many teams as possible across Nike to use their complex solution, and then they left their solution in “our very capable hands”.

The solution was that any product manager (PM) could edit a TSV file, which would then be parsed and compiled into JSON Schema. This (plus a read-only UI) was meant to let teams across Nike utilize the event schemas (after implementing their own interpretation of what each event “required”) for any given event. Sounds okay, right?

In theory, this wasn’t a bad idea. In practice, problems arose quickly — like when a typo had the amazing effect of removing a previously “generated” value from a JSON Schema enum. To paint a clearer picture, this little typo had the cascading power to break all downstream data consumers, leading to broken ML models, dashboards, and reports. In short, the butterfly effect in the data world. Because there was no testing for breaking changes (more foreshadowing!) and zero object-level governance or semantic validation at play.
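To make that failure mode concrete, here is a tiny, hypothetical slice of what one of those generated schemas might have looked like (purely an illustration, not an actual Nike schema):

{
  "properties": {
    "checkout_step": {
      "type": "string",
      "enum": ["cart", "shipping", "payment", "confirmation"]
    }
  }
}

One mistyped row in the TSV and “payment” quietly disappears from the generated enum. Every event that had been happily carrying that value is suddenly “invalid” (or worse, sails through completely unchecked), and everything downstream inherits the damage.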

When faced with problems, what is better than adding additional complexity?

Due to symptoms of the sunk cost fallacy, an even wilder problem was created on top of a bad solution. Rather than continuing to utilize unified events, communicate between teams, test changes per release using our shiny new (fragile) system, and ultimately “share” reusable events across all experiences (nike.com, the Nike App on iOS/Android, and others), the product managers decided that they could simply copy (or fork) events so they could “own” their own and not bother with a unified strategy. So where we once had 1 event, we now had between 9 and 36 versions of each event (yes, the math doesn’t make sense, but remember the horror theme?).

Love it or Hate It. Object-Level Governance is Incredibly Important

Given the strain on the system, the years of events being forked (vs reused or composed), and the complexity in painting a unified analytical story across the various Nike experience event streams — we were quickly coming to a difficult realization. As you probably guessed, the retrofit was off the table.

After all, about 8 years had already gone by at this point in time and the cracks in the system were essentially the entire system. So we decided that the best path forward was to do a rebuild. But we were going to do it right this time. Even if it would kill us. Mission Accepted.

Because we had loose definitions of events — given the type-free nature of TSV-based definitions — we effectively had zero object-level governance and could kiss semantic validation goodbye. This meant that there was no system in place to “modify” an event in a standard way. If you think about classic relational databases (OLTP), each table has a schema. If you want to modify the table’s schema, then protocol-level invariants (promises) enforce how a change can occur and how, what, and when the change will affect the given table.

When it comes to streaming systems, like the classic event stream (clickstream), you really want to abide by similar rules for schema evolution (as with classic relational databases), even though it may feel like you are moving a little slowly (that will change). This includes rules about type-safety, field-level position (or columnar position), and how nested objects will be encoded — as well as how backwards and forwards compatibility will be achieved and even enforced.

In order to provide event-level invariants, we’d need to lean on a type-safe message format that supports backwards and forwards compatibility — which again means leaning on protocol buffers.
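To ground what those invariants look like in practice, here is a small, hypothetical event definition (the message and field names are purely illustrative, not Nike’s actual schema) showing the kind of rules protobuf lets you enforce and evolve safely:

syntax = "proto3";

package clickstream.v1;

// Hypothetical event used only to illustrate schema evolution rules.
message ProductViewed {
  // Field numbers are the wire-level contract: once assigned they are never
  // changed or reused, which is what keeps old readers and new writers compatible.
  reserved 3;                       // a removed field retires its number...
  reserved "legacy_campaign_code";  // ...and its name, so neither can come back with a new meaning

  string product_id = 1;            // existing fields keep their number and type
  string page_url = 2;
  int64 viewed_at_ms = 4;

  // New data gets a brand-new field number. Old consumers simply ignore it,
  // and new consumers reading old data see the default value.
  string experiment_bucket = 5;
}

Breaking any of those rules is something that can be detected mechanically at build time (foreshadowing!).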

Fixing Nike’s Clickstream and Behavioral Analytics

We had a mission. We had purpose. What we didn’t have yet was a clear plan of attack — we just knew we needed to fix the sins of the past.

Establishing a Plan of Attack

We had a kickoff brainstorming meeting with the small core team of engineers — there were 3 of us; let’s call the other two Doug and Christian (to keep them anonymous) — representing Data Ingestion & Platform, Apps & Experiences, as well as Data, Analytics, and AI. The goal of the session was to come up with a set of standards, system expectations, and ways of working that we could stake our reputation on. Given we’d be doing a lot of work asynchronously and on a limited timeline (3 months to prove things would work), ways of working were not optional — they were essential.

The end result of our time, coffee, and some beers was the following:

  1. Standardize on Protobuf. Not simply for API services, but truly end to end across analytics and ML as well. End-to-end protobuf was not a nice-to-have, it was the way. We’d need a way to ensure our protobuf was written in a standard way across the company, and we’d need to write some test harnesses to ensure backwards compatibility (see zero breaking changes below). JSON — with its mutability and the lax standards around it across the company — was the reason the initial system fell apart, and we weren’t going to fall prey to “loosely” structured data again.
  2. Lean on Code Generation & Semantic Validation. Ridding ourselves of the long lead time for change in the system would mean we’d need the ability to cross-compile our event definitions so we’d have no excuse not to “reuse” and “compose” shared events in a much more governed way. In addition, we’d need a way to compile our definition of “correct” for each event to reduce the time taken from event ingestion to insight — we’d need some kind of compiled validation logic. This would flip the script on the prior system where new events would take months to finally release across all experiences.
  3. gRPC for our SDKs. Taking things one step further, we’d compile down our SDKs utilizing gRPC to prevent the issues we’d encountered with traditional REST for our clickstream. This way, we’d write our interfaces (using the IDL for gRPC) and cross-compile for JavaScript, TypeScript, Swift, Kotlin, and Go, reducing the time required to implement new events.
  4. Zero Breaking Changes. We’d guarantee that all events would always be backwards compatible within a major version (1.0.0 vs 2.0.0). We would only make a breaking change when it mattered. This was in direct response to the lack of governance in the prior system. Trust is built when things just work release after release.
  5. Automate Event Ingestion to Databricks. We were using Databricks for our Lakehouse environment. One of the big pain points of the prior system was scaling out the JSON-based event streams — due to zero semantic validation and no governance, the streams were corrupt more often than not. We’d fix that issue by automating the ingestion of our event streams, leaning on a priori semantic validation, to ensure that only trustworthy data would be appended to our tables and made available via Unity Catalog.

So in short, we’d agreed to 1) standardize on protobuf, 2) implement code generation and semantic validation, 3) utilize gRPC for our SDKs to simplify the exchange of events, 4) provide compile-time guarantees and a mission of zero breaking changes, and lastly, 5) automate the last-mile ingestion of event data by leaning on protocol-level invariants and high trust built on semantic event validation at the ingestion edge. Now the hard work would begin.

From Mission Briefing to Production

During the early research phase we stumbled upon a company called Buf — by accident, really. We’d been digging into some newer changes enabling native protobuf support for Apache Spark and found a file called buf.yaml while exploring the depths of spark-connect (Spark’s gRPC client).

Sometimes things in life are simply serendipitous and we couldn’t have hoped for a better find this early on in the project.

Simplified Standardization with Protobuf

The buf.yaml turned out to be a configuration file for the Buf CLI — and that checked off two of the known unknowns from our list. First, how we’d simplify breaking-change detection — it turned out buf breaking does exactly that (this was clutch) — and second, how we’d ensure our protobuf messages were written in a standard way — that ended up being the secret behind buf lint.

version: v1
breaking:
  use:
    - FILE
lint:
  use:
    - BASIC
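With that configuration in place, the day-to-day usage boils down (roughly; the git tag here is illustrative) to two commands, run locally and in CI:

# flag naming and style drift in the proto sources
buf lint

# compare the current definitions against a prior state (here, a git tag)
# and fail the build if any change would break existing consumers
buf breaking --against '.git#tag=v1.0.0'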

So we’d accidentally discovered amazing tooling that we could use to drastically simplify how we’d manage our protobuf definitions over time. We’d also discovered a way to provide protobuf message-level semantic validation at the Nike edge using protovalidate.

Semantic Validation with Protovalidate

What is protovalidate, you ask? It provides a critical missing component to the protobuf specification — runtime, field-level semantic validation. If you’ve used protobuf before, you are probably familiar with end-of-wire exceptions. These occur while deserializing a binary payload back into a concrete message and are more often than not caused by changing field types in a non-backwards-compatible way. I bring this up since an end-of-wire exception points to unknown changes in the “shape” of the protobuf message — but even with perfect type-safety, the message can still be invalid due to missing “fields”.

If you want to understand protobuf best practices, read the protobuf dos and don’ts. For more on protovalidate, read on.

If you think about API-level contracts, some fields are marked as “required” and others are marked as “optional”. This provides the end user with an understanding of which fields they can rely on versus which ones may or may not exist. Consistent values within the “required” fields (as well as the optional ones when they show up) are critical not just for APIs, but for event streams too.

With protovalidate, we simply annotate our message (shown below), and we can compile down the validation rules to use for runtime semantic validation checks.

message Order {
  // Each Order takes place at a point in Time
  google.protobuf.Timestamp order_created = 1 [(buf.validate.field).cel = {
    id: "not_from_the_future",
    message: "we are not ready to offer scheduled orders. Maybe in the future",
    // Ensure that the server's local time (utc) is used as a gating mechanism for sane timestamps
    expression: "this <= now"
  }];
  // An Order can be purchased at a CoffeeCo Location, otherwise where is the coffee going to be made
  // It is true that the Store could be online, but that makes this reference more complicated than necessary
  coffeeco.v1.Store purchased_at = 2 [(buf.validate.field).required = true];
  // A Customer can Order from our Coffee Location
  coffeeco.v1.Customer customer = 3 [(buf.validate.field).required = true];
  // Each Order may have one or more items. We cannot have an Order without something to Purchase
  repeated coffeeco.v1.Product items = 4 [(buf.validate.field).required = true];
  // Each Order has a monetary value
  coffeeco.v1.Total total = 5 [(buf.validate.field).required = true];
}

The Order above can now be validated at runtime.

if err := s.Validator.Validate(req.Msg); err != nil {
    log.Println("validation failed:", err)
    response = err.Error()
} else {
    // do something with the valid protobuf object
}

Finally, here is a longer example that showcases the full validation flow (at the Go service level).

func (s *CoffeeserviceServer) CoffeeOrder(ctx context.Context,
    req *connect.Request[coffeeservicev1.CoffeeOrderRequest]) (*connect.Response[coffeeservicev1.CoffeeOrderResponse], error) {
    log.Println("Request Headers: ", req.Header())
    var order = req.Msg.Order

    response := ""

    if err := s.Validator.Validate(req.Msg); err != nil {
        log.Println("validation failed:", err)
        response = err.Error()
    } else {
        data, err := proto.Marshal(order)
        if err != nil {
            log.Println(err)
        }
        err = s.Kafka.Produce(&kafka.Message{
            TopicPartition: kafka.TopicPartition{
                Topic:     &s.TopicName,
                Partition: kafka.PartitionAny,
            },
            Key:   []byte(order.Customer.Name),
            Value: data,
        }, nil)
        if err != nil {
            log.Println("Failed to Send Order : stream/coffeeco.v1.orders")
            response = fmt.Sprintf("We Failed to Send your Order, %s\n", order.Customer.Name)
        } else {
            log.Println("Order Published Successfully : stream/coffeeco.v1.orders")
            response = fmt.Sprintf("Thanks for the Order, %s\n", order.Customer.Name)
        }
    }
    log.Println("New Order: ", order)
    res := connect.NewResponse(&coffeeservicev1.CoffeeOrderResponse{
        Response: response,
    })
    res.Header().Set("CoffeeService-Version", "v1")
    return res, nil
}

The service above provides edge-level validation, ensuring that any downstream consumer of “the data” can trust that what is being processed has already been validated at the edge. While this may not seem like a “big deal”, in practice most data that ends up in the hands of data engineers has to go through brutal “cleansing” steps in order to provide “non-corrupt” data for further downstream processing. Why reprocess all of your data when you can instead just rely on it being “semantically valid”?

Around the same time we’d hit a wall with the time-to-first-byte cost of running gRPC across our desktop experiences (due to true page-loads on nike.com vs single-page style application architecture), and while all signs were pointing in the right direction elsewhere, this was crippling. We’d hit a potential pitfall that could break the project. Or so we thought!

Connect is the Missing Link for Enterprise gRPC

During our experiments with protovalidate (originally protoc-gen-validate), we’d discovered the Connect protocol — another hidden gem in the Buf treasure chest. Just when we thought things were going to fall apart, and going back to square one was not on the menu, we found a lifeline: Connect.

From the Docs: Connect is a family of libraries for building browser and gRPC-compatible APIs. If you’re tired of hand-written boilerplate and turned off by massive frameworks, Connect is for you.

Connect provides gRPC networking using the native networking clients for JavaScript (ECMAScript) and TypeScript (as well as Swift, Kotlin, and Go). This meant that all of our SDKs could lean on gRPC, but with faster load times and more native integrations. It also meant that we could spin up a Connect session on the desktop (or mobile web) without sacrificing time to first byte, and our analytics events could be emitted using our new “unified” analytics stack.
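To make that a bit more concrete, here is a rough sketch of what calling the CoffeeOrder endpoint from earlier looks like over Connect, using the Go client to stay consistent with the service code above. The generated package paths and the service name here are assumptions based on that example, not our production code:

package main

import (
    "context"
    "log"
    "net/http"

    "connectrpc.com/connect"

    // Hypothetical generated import paths; your buf.gen.yaml determines the real ones.
    coffeecov1 "example.com/gen/coffeeco/v1"
    coffeeservicev1 "example.com/gen/coffeeservice/v1"
    "example.com/gen/coffeeservice/v1/coffeeservicev1connect"
)

func main() {
    // A plain net/http client is all Connect needs; there is no separate
    // channel or transport machinery to manage.
    client := coffeeservicev1connect.NewCoffeeServiceClient(
        http.DefaultClient,
        "https://api.example.com",
    )

    // Build the coffeeco.v1.Order shown earlier (field population elided).
    order := &coffeecov1.Order{}

    res, err := client.CoffeeOrder(
        context.Background(),
        connect.NewRequest(&coffeeservicev1.CoffeeOrderRequest{Order: order}),
    )
    if err != nil {
        log.Fatalf("order failed: %v", err)
    }
    log.Println("service replied:", res.Msg.Response)
}

The same interfaces cross-compile to TypeScript, Swift, and Kotlin via the Connect SDKs, which is what let the web and mobile experiences emit the exact same events without hand-rolled REST clients.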

Connect was the bridge and the glue to “connect” the final set of dots, and we were well on our way to production. There was just one last thing we needed to get done: we needed to make good on our promise to automate the data ingestion pipelines.

Automating Data Ingestion

At this point in our journey, things were smooth sailing and the crew was having a blast. We knew we were onto something great here, and we’d found an incredible company to partner with — Buf. There was one last thing missing from our initial brainstorming session: we’d need to prove we could automate the ingestion of our clickstream events into Databricks without a full-time dedicated team sitting around. We’d need to make sure the new clickstream could scale out and provide value in a way that most data engineers are familiar with — at the table level.

Data Ingestion Architecture

The ingestion architecture that we’d come up with wasn’t really novel. It was a take on some earlier work from my days at Twilio; the big change was in the available tooling and in some of the changes made to Delta Lake.

I’d been riding the Delta Lake waves since around 2018, and it provided a lot of the same protocol-level invariants that we got from protobuf, just as a columnar-oriented format vs a row-based format. Given protobuf was natively supported in Apache Spark, and given that Delta Lake provided sophisticated streaming capabilities, we could simply leverage the best of both worlds to provide end-to-end streaming.

The Nike analytics ingestion network, utilizing Buf, the BSR, and Connect to provide end-to-end consistency.
End-to-End Data Ingestion: Circa 2024. Image by Author.

We were so proud of what we had achieved that a good friend of mine, Ashok Singamaneni, and I decided to present our work during the Data+AI Summit.

Given the length of this post, I’d suggest watching the following video if you want to dive deeper into the streaming architecture.

The full presentation from Data+AI Summit 2024

Being Brave and Standing up against the Status Quo

It took us three months to get to production. It then took us another year to get everyone across the company on board, to ensure the strategy and vision would continue as the number of people on the project grew. The biggest revelation here was that a core group of passionate engineers can accomplish a lot more than people expect, and this doesn’t just happen by burning the midnight oil. In fact, nothing ever happens for free.

The results were astonishing though. In a matter of months, the core team was able to deliver, and we scaled through Black Friday and Cyber Week with a measly additional operating cost of around $20/day — down around 100x from the prior system. We couldn’t have done any of that without the folks from Buf. They were true silent partners, offering advice and support and helping us along our way. In all reality, if it weren’t for the tooling provided by Buf (buf generate, buf breaking, buf lint, buf image, and the Connect protocol), Nike would still be suffering under the weight of prior bad decisions, and we would never have escaped the clutches of bad data decision-making.

Now that we were fully committed to moving forward, it was time to solidify our partnership with the folks over at Buf, and that takes us to part two of this story, where we’ll talk about some of the problems (and solutions) that came about as we started to scale up our protobuf-first strategy.

Continue on to Part 2.

*: Not all data pipelines were terrible. There are always teams willing to work against the status quo. There was truthfully a lot of slop, though; that happens over time, and it usually comes from a place of neglect and data hoarding.



Written by Scott Haines

Developer Relations Engineer @ Buf. Ex-Distinguished Software Engineer @ Nike. I write about all things data, my views are my own.
