Create Clean, Beautiful, Protected Data Resources with Apache Spark and Delta Lake

Water crashing together. 3 states of water, from peaceful to disorderly. Intersecting to create beauty in chaos.
There is beauty in the chaos. Photo Credit: Unsplash @powwpic

Data comes in all shapes and sizes

It is amusing that when we talk about data, the best analogy is typically rooted in water. This makes sense: to fathom the idea of data, which comes in all shapes and sizes, people tend to embrace the abstract. A single data record can be likened to a droplet of water. Many droplets of the same kind (record type) cluster and coalesce into isolated data pools (tables or directories), and these pools reside in the larger body that is the data lake.

A Data Lake is an ecosystem…


Learning to break mental blocks and get yourself unstuck

What inspires you? Photo via Unsplash

In Pursuit of Writing ✍️

I’ve been writing off and on for the better part of a decade. These days it is mostly technical writing here on Medium, but over the past few years I’ve had a few articles published, some of which even showed up in printed magazines, and in 2019 I was offered an opportunity to take a risk and work on my first book.

In it for the long haul

I’m working on a new book now, and it has me writing many nights and every weekend. This will be the longest thing I’ve ever written, weighing in at around 400…


Spark SQL Functions to Simplify your Life

Photo Credit: https://unsplash.com/@swimstaralex

Anyone working in analytics and machine learning will eventually need to generate strong composite grouping keys and idempotent identifiers for the data they are working with. These deterministic identifiers help to reduce the effort required for complex bucketing, deduplication, and a slew of other important tasks.

We will look at two ways of generating hashes:

  1. Using Base64 Encoding and String Concatenation
  2. Using Murmur Hashing & Base64 Encoding
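
As a rough illustration of the first approach, the sketch below builds a composite key in plain Python by joining fields with a separator and Base64-encoding the result. The function names `composite_key` and `decode_key`, the `|` separator, and the sample field values are assumptions made for this example and do not come from the article. In Spark itself, the analogous building blocks would be SQL functions such as `concat_ws` and `base64`.

```python
import base64


def composite_key(*fields: str, sep: str = "|") -> str:
    """Join the fields with a separator and Base64-encode the result.

    The same input fields always produce the same key, which is what
    makes the identifier repeatable across runs.
    """
    joined = sep.join(fields)
    return base64.urlsafe_b64encode(joined.encode("utf-8")).decode("ascii")


def decode_key(key: str, sep: str = "|") -> list[str]:
    """Reverse composite_key: Base64 is an encoding, not a one-way hash."""
    joined = base64.urlsafe_b64decode(key.encode("ascii")).decode("utf-8")
    return joined.split(sep)


# Hypothetical record fields used purely for illustration.
key = composite_key("account-42", "2021-01-15", "us-west")
fields = decode_key(key)
```

Because Base64 is reversible, a key built this way exposes its inputs and grows with them. It also assumes the separator never appears inside a field, otherwise two different field lists could produce the same key.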

Spark SQL Functions

The core Spark SQL functions library is a prebuilt library with over 300 common SQL functions. However, looking at the functions index and simply listing things…


A Hands-On Introduction: Getting Up and Running.

An Orchestra Playing a song. Violins and people fading into the background
Photo by Larisa Birta on Unsplash

This tutorial is aimed at engineers who want to get up and running with Spark on Kubernetes. It is my hope that the skills developed across this series will help you become proficient at building and deploying Spark applications using the Kubernetes scheduler.

Given that this is the first tutorial in the series, it is naturally also the simplest. The idea here is to introduce the concepts and components we will be using across the series, including Kubernetes (K8s), Docker and Spark 3.0.1.

Pre-Requisites

A basic understanding of Apache Spark.

What You Will Learn

This first…


Balancing the choice between disk, RAM and the source of truth

Shows a mirror with many different angles and views of the same scene
Many angles provide many views of the same scene. Cache works with partitions similarly. Credit

Apache Spark provides a few very simple mechanisms for caching in-process computations that can help to alleviate cumbersome and inherently complex workloads. This article aims to provide an approachable mental model for breaking down and rethinking how you frame your Apache Spark computations. Whether your next application is a batch or a streaming job, the aim here is to give you the grounding you need to tackle your next Apache Spark project with gusto!

Spark Application Workloads

Let me first start by stating that a simple workload should fit into memory across a relatively low number of executors — because simple workloads are inherently simple…


How observation, probability and common sense lead to better decision making

Creepy cabin in the woods. Setting an alternative stage for Goldilocks to make a point about decision intelligence.
Was this where Goldilocks found the porridge? Photo Credit: https://unsplash.com/@nate_dumlao

Spoiler: Goldilocks survives. Let us first start by saying things could have turned out much worse. Really, much worse. Don’t walk into a stranger’s house, eat their food and nap in their bed. This is just a no-brainer!

In the real world, when we make decisions there is a much more involved and complex inner process at work, regardless of how instantaneously the mind appears to decide how to do X, Y or Z. …


The subtle art of good analytics and explainable insights

An Open Book with a story about tropical pirates, a treasure chest and pirate ships coming off the page.
Storytelling with Data isn’t as whimsical. It does, however, provide the correct mental model. (Source Pixabay)

What is your story?

Across the world, we are producing and consuming data at exponential rates. This is just a fact. With the advent of 5G networks, we should be capable of handling up to 10 Gbps, up from 300 Mbps on 4G LTE. This means we have the bandwidth to send and consume more event data and metrics from all kinds of connected devices and embedded systems than ever before in human history. However, you have to ask yourself, “What is your data telling you?”

When you consider companies that are data-first/data-driven and succeeding within their respective industries (think Google, Facebook, Microsoft) versus…


How messaging, scalability and failure formed a new kind of architecture.

Source: Pixabay

One subtle and novel paradigm in computer systems architecture is the concept of message passing. This simple yet useful construct allowed for order-of-magnitude gains in parallel processing by allowing processes (applications, kernel tasks, etc.) within a single computer OS to take part in a conversation. The beauty of this was that processes could now participate in conversations in either synchronous or asynchronous fashion and distribute work amongst many applications without the otherwise necessary overhead of locking and synchronization.

This novel approach to solving parallel processing tasks within a single system was further expanded to distributed systems processing with the advent of the message…


How communication evolved technology

source: Pixabay

Evolutionary Communication

Communication is the foundation upon which modern society was built. Without words, pictures or symbols, we would not have been capable of formulating complex thoughts or ideas, or even of expressing the most basic of our primal needs. However, by some random chance, we took the right evolutionary steps that would lead us to where we are today, both as a society and technologically. If you consider the past twenty or so years alone, it is astonishing what we’ve been able to accomplish. In essence we are now at…


How to ensure your data remains valid over the years.

Source: Pixabay

Everyone has worked under the wrong assumptions at one point or another in their career, and nowhere have I found this more apparent than with legacy data and much of what ends up in most companies’ data lakes.

The concept of the data lake evolved from the more traditional data warehouse, which was originally envisioned as a means to alleviate the issue of data silos and fragmentation within an organization. The data warehouse achieved this by providing a central store where all data can be accessed, usually through a traditional SQL interface or other…

Scott Haines

Software Architect at Twilio in California. I work on real-time, ridiculously large data (*Big Data) applications with Scala and Spark. Voice Insights. ☎️
