Testing Spark Structured Streaming Applications: Like a Boss

Scott Haines
Mar 14, 2019 · 2 min read


photo credit: https://www.flickr.com/photos/cdevers/

At the end of the day, your best efforts can come crashing down (pun intended) because of one minor little nothing bug in your application logic. The problem is that your little nothing bug can cost your startup, your company, your ego (whatever drives you) lots and lots of pain, frustration, maybe big money, or potentially everything.

So given that we know we “should” be testing our Spark applications, the next logical thing to discuss is exactly how we go about doing just that.

Welcome to the MemoryStream.
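If you have not met it before, MemoryStream is the in-memory streaming Source that ships with Spark (org.apache.spark.sql.execution.streaming.MemoryStream). A test can push records straight into it and Spark treats them just like data arriving from a real stream. Here is a minimal sketch of the moving parts; the event strings are made up purely for illustration.

import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

// a local SparkSession is all a unit test needs
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("memory-stream-sketch")
  .getOrCreate()

import spark.implicits._
// MemoryStream needs an implicit SQLContext (and an Encoder) in scope
implicit val sqlContext: SQLContext = spark.sqlContext

// a typed, in-memory streaming source; String is just for illustration
val events = MemoryStream[String]
events.addData("call.started", "call.ringing", "call.answered")

// toDS() returns a streaming Dataset that our application code can consume
val callEvents = events.toDS()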

Alright. So examples are fun. But fully functional applications are even better. Luckily, the pattern sketched above comes from the unit tests for a workshop I gave back in November 2018. The code can be downloaded on GitHub.

Let’s now walk through how one of these test cases comes together. The pattern for loading a scenario comes from trial by fire while trying to figure out the easiest way to generate repeatable tests based on real fake data.

My team at Twilio (Voice Insights) came up with some neat internal generators of fake telephony data, modeled on real live traffic. Through the generator we can create true-to-life scenarios instead of just testing randomly generated data or some bogus nonsense! This sometimes means we generate thousands of scenarios that will be loaded and tested, and the process is replicated across many of our Spark applications.

Our TestHelper has a lot of nice utilities, including the ability to load a scenario (basically, read JSON and parse it into our protobuf), which is shown below.

import scala.io.Source
import scala.reflect.ClassTag

// Message is the protobuf message base type our events extend
def loadScenario[T <: Message : ClassTag](file: String): Seq[T] = {
  // read the raw scenario file off disk
  val fileString = Source.fromFile(file).mkString
  // parse the JSON into our Scenario wrapper (mapper is a Jackson ObjectMapper)
  val parsed = mapper.readValue(fileString, classOf[Scenario])
  // convert each input record back to JSON and then into its protobuf type
  parsed.input.map { data =>
    val json = mapper.writeValueAsString(data)
    convert[T](json)
  }
}

example of the scenario loading method above ^^
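From a test, using that helper looks something like the snippet below; the CallEvent protobuf type and the scenario path are placeholders for whatever your project actually defines.

// replay a captured scenario as typed protobuf messages
val scenarioEvents: Seq[CallEvent] =
  TestHelper.loadScenario[CallEvent]("src/test/resources/scenarios/call-lifecycle.json")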

We then create our MemoryStream and take our scenario iterator and generate a pile of Kafka data in the form of our MockKafkaDataFrame.

case class MockKafkaDataFrame(key: Array[Byte], value: Array[Byte])

This allows us to stream data through our unit tests in the same format we’ll be reading from Kafka in the real world, and it saves us a ton of time over running a locally embedded Kafka broker.
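Sketched out, that step might look something like this; serializeKey and serializeValue are placeholders for whatever protobuf-to-bytes helpers mirror your production Kafka serializers, and scenarioEvents is the sequence returned by loadScenario above.

import org.apache.spark.sql.execution.streaming.MemoryStream

// reuses the SparkSession, implicits, and SQLContext from the earlier sketch
val kafkaEvents = MemoryStream[MockKafkaDataFrame]

// turn each scenario event into the same key/value bytes Kafka would hand us
kafkaEvents.addData(scenarioEvents.map { event =>
  MockKafkaDataFrame(serializeKey(event), serializeValue(event))
})

// this DataFrame now has the key/value shape our application expects from Kafka
val mockKafkaSourceDF = kafkaEvents.toDF()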

We then create a streaming query that writes to a MemorySink. This allows us to test an end-to-end streaming query without needing to mock out the source and sink in our Structured Streaming application. That means you can plug in the tried-and-true DataSources that ship with Spark and focus instead on ensuring that your application code runs hiccup-free.
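Wired together, the tail end of such a test might look roughly like this; transformCallEvents, the query name, and expectedRowCount are all placeholders for your own application logic and assertions.

// run the application logic against the mock Kafka source and land the
// output in an in-memory table we can query like any other DataFrame
val query = transformCallEvents(mockKafkaSourceDF)
  .writeStream
  .format("memory")
  .queryName("call_aggregates")
  .outputMode("update")
  .start()

// block until every record pushed into the MemoryStream has been processed
query.processAllAvailable()

// the memory sink registers a temp view under the query name
val results = spark.table("call_aggregates")
assert(results.count() == expectedRowCount)

query.stop()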

What the hell. That was easy, right? Yep.
