This is the tale of a (not so) young and (not so) bright developer who discovers a whole new fauna of functional life-forms in an attempt to solve a (not so) simple problem.
But before we follow our (not so) young and (not so) bright hero on his adventures, we first need to understand why he decided to set sail toward this strange new world in the first place.
Chapter 1 — Lost in translation
Our hero was a Data Engineer™ (capital D, capital E, with shiny sparks all over); his duty was to massage vast amounts of data, making sure it was properly ordered, and to tame big clusters of machines so that they would agree to digest all that data.
One day, he joined a project that was not like any previous one. His team was tasked with building yet another one of these big pipelines, but with a devious plot twist: there would be so many different sources, each with so many different kinds of data, that using classes would be completely unfeasible.
After much head-scratching and beard-twisting, the team settled on the following plan: sources would provide a schema of the data they wanted to push into the pipeline, and the team would build a single generic job that would use that schema to validate the incoming data, without resorting to creating specific classes for each and every possible data source.
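To give a feel for the idea (these names are hypothetical, not the team's actual code), such a source-provided schema is naturally modeled in Scala as a recursive algebraic data type:

```scala
// Hypothetical sketch: a generic, recursive schema type that any
// source can describe without the pipeline knowing its classes.
sealed trait Schema
case object StringT extends Schema
case object LongT   extends Schema
case object DateT   extends Schema
final case class ArrayT(element: Schema)                 extends Schema
final case class StructT(fields: List[(String, Schema)]) extends Schema

// Once parsed, a source's schema becomes a plain value:
val userSchema: Schema =
  StructT(List(
    "name"     -> StringT,
    "signupOn" -> DateT,
    "aliases"  -> ArrayT(StringT)
  ))
```

A single generic job can then interpret values of this one type, instead of compiling a new case class per source.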
That sounded like a good plan, and our hero and his friends eagerly jumped into its realization. But they soon understood it would not be a picnic.
First there were the schemas themselves. The tribe of the Data Managers™ (capital D, capital M, but nothing sparkling whatsoever) insisted they be managed in humongous spreadsheets, spanning multiple tabs of dozens of columns each, intricately mixing functional documentation with technical descriptions of the incoming data.
These dirty spreadsheets first needed to be parsed and translated into a more tractable format that would serve as a reference in every subsequent process.
Then this sensible schema representation would serve as a blueprint to build a validator for the corresponding data sources, in terms of the Rules of the jto-validation library.
The validated data would then be transformed into a generic data representation, something very similar to JSON, but with native support for dates and timestamps. This generic data would then be written to disk in Parquet format, or emitted to a Kafka topic as Avro.
Of course, since Parquet and Avro require a schema in order to write data, the original generic schema needed to be translated to these specific formats too.
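Each of these translations is, at heart, a structural recursion over the schema. As a hedged sketch (using a hypothetical `Schema` type and an Avro-like string output as a simplified stand-in for a real `org.apache.avro.Schema` builder, which would also require record names), one such conversion might look like:

```scala
sealed trait Schema
case object StringT extends Schema
case object DateT   extends Schema
final case class ArrayT(element: Schema)                 extends Schema
final case class StructT(fields: List[(String, Schema)]) extends Schema

// Hypothetical translation to an Avro-flavored textual schema;
// each branch handles one constructor, recursing on children.
def toAvroLike(s: Schema): String = s match {
  case StringT      => """"string""""
  case DateT        => """{"type":"int","logicalType":"date"}"""
  case ArrayT(elem) => s"""{"type":"array","items":${toAvroLike(elem)}}"""
  case StructT(fields) =>
    val fs = fields
      .map { case (n, t) => s"""{"name":"$n","type":${toAvroLike(t)}}""" }
      .mkString(",")
    s"""{"type":"record","fields":[$fs]}"""
}
```

Multiply this by every target format (Parquet, Avro, the validator Rules, …) and the translation layer grows quickly.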
That made for a lot of translations back and forth, but the team was now able to digest any new source of data with a single piece of code, as long as someone provided the big spreadsheet describing that data.
Chapter 2 — Wandering into the Dark Schematic Forest
Our story could have ended there, with everyone being happy and the project shipped to production, but that wouldn't have made for a very interesting story.
Reviewing what the team had produced, our hero didn’t feel so happy after all. The code looked like a forest of tortured trees, so horrible that showing it here would make this story not suitable for young children.
Each conversion (and there were many of those) was encoded as a fat and ugly match expression, and there was recursion all over the place. Of course, proper(ty-based) testing of such code was also rather difficult, if not impossible, in that context.
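To give a feel for the forest (again with a toy, hypothetical `Schema` rather than the team's real one): every conversion repeats the same recursive walk by hand, so the traversal logic is duplicated in each function and the recursive calls are easy to misplace, which is exactly what made testing painful.

```scala
sealed trait Schema
case object StringT extends Schema
final case class ArrayT(element: Schema)                 extends Schema
final case class StructT(fields: List[(String, Schema)]) extends Schema

// Two unrelated conversions, each re-implementing the same traversal:
def fieldCount(s: Schema): Int = s match {
  case StringT         => 0
  case ArrayT(elem)    => fieldCount(elem)
  case StructT(fields) => fields.size + fields.map(f => fieldCount(f._2)).sum
}

def render(s: Schema): String = s match {
  case StringT      => "string"
  case ArrayT(elem) => s"array<${render(elem)}>"
  case StructT(fields) =>
    fields.map { case (n, t) => s"$n: ${render(t)}" }.mkString("{", ", ", "}")
}
```

Only the per-constructor logic differs between the two; the recursion itself is boilerplate repeated in every match.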
But our hero had a little idea of what might help make things better.
Some time earlier, he had heard stories about a land far away, where a plethora of strange species that fed on recursion flourished. When domesticated, even the least scary of these beasts could eat and digest most of the sources of recursion in a codebase.
“If only I could tame one or two of those beasts,” thought our hero, “I would transform the intricate forest of our code into a nice and tidy à la française garden.”
And so he embarked on a trip that led him way further than he would have imagined…