THIS POST IS THE SECOND IN OUR “BECOMING SUSTAINABLY DATA DRIVEN” SERIES. READ THE OTHER ENTRIES HERE:
- Step One: Acquiring the Data
- Step Two (YOU ARE HERE): Build Data Structures for Efficient and Flexible Consumption
- Step Three: Reporting, Dashboarding and Visualization in a Big Data World
- Step Four: Robust and Agile Advanced Analytics that Make a Difference
“To store or not to store, that is the question…”
Innovation is the driving force in the world of technology. Because of this, few technologies have remained unchanged over time. The relational database, however, is one extremely pervasive exception! An entire Information Management industry has been built around the definition, storage, and consumption of data based on relational databases and tables. While these technologies have improved significantly, they haven’t strayed very far from the basic concept first proposed in 1970.
While this industry has created immense value since 1970, the time has come to innovate, and Big Data technologies are central to this creative destruction. Naturally, we can expect some resistance here: the world of data architects, who define data structures, has remained relatively untouched, while those who consume and store data have had to learn new tools continually over the last 47 years.
Big Data technologies have brought us “schema-on-read”, which allows data to be stored before its structure is known: in the Big Data world, one can iteratively define the structure of the data after storing it. This enables a much more agile approach, which is one of the most frequently cited reasons for organizations to move into the Big Data world. The idea is that “data architecture is a thing of the past”: the claim is that with schema-on-read, you get to keep your data without incurring the cost of figuring out how to store it. Could it be that we have finally found the elusive free lunch?
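To make the contrast concrete, here is a minimal sketch of schema-on-read in plain Python (the same idea underpins tools like Spark and Hive). The record contents, field names, and the `read_with_schema` helper are all illustrative assumptions, not any particular product's API: raw records are stored first, and a consumer applies a schema only when reading.

```python
import json

# Schema-on-write would require defining the table structure before storing
# anything. With schema-on-read, raw records land first, as-is:
raw_records = [
    '{"id": 1, "amount": "19.99", "region": "EMEA"}',
    '{"id": 2, "amount": "5.00"}',  # structure may vary record to record
]

def read_with_schema(raw, schema):
    """Apply a {field: type} schema at read time; missing fields become None."""
    parsed = json.loads(raw)
    return {field: cast(parsed[field]) if field in parsed else None
            for field, cast in schema.items()}

# Only now, at consumption time, does a consumer decide on a structure.
consumer_schema = {"id": int, "amount": float, "region": str}
rows = [read_with_schema(r, consumer_schema) for r in raw_records]
```

Note that the cost of understanding the data has not disappeared; it has merely moved from write time to read time, which is exactly the trade-off discussed below.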
Not so fast! In my experience, the value proposition of data architecture is rarely well articulated, even by the leading practitioners of the field. This contributes to the perception that data architecture is just a cost center that takes too long. New users assume that with schema-on-read you can define your structures whenever you need to consume the data, et voilà, instant results! In practice, this shortcut is not always feasible, for two fundamental reasons: performance and cost. The performance of such an approach is often suspect, especially at large volumes. And when many consumers share a data resource, it makes little sense for every one of them to incur the cost of defining their own schema at consumption time. Doing so can also have drastic consequences from a data governance perspective: if each consumer defines their own schema, are you sure they are all interpreting the data in the same way? Having said that, does it make sense to invest in fully defining a data model up front, before you know your consumption patterns?
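The governance risk is easy to demonstrate. In this hypothetical sketch (the record and field names are invented for illustration), two consumers read the same raw record but make different assumptions about its meaning, and arrive at different values:

```python
import json

raw = '{"order_id": 7, "amount": "1999"}'  # shared raw record, no agreed schema

# Consumer A assumes "amount" is stored in cents.
amount_a = int(json.loads(raw)["amount"]) / 100

# Consumer B assumes "amount" is stored in dollars.
amount_b = float(json.loads(raw)["amount"])
```

Both reads succeed, yet the two consumers now report figures that differ by a factor of 100; without a shared, governed interpretation of the data, schema-on-read happily lets each consumer be confidently wrong in their own way.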
“schema-on-read is not a silver bullet, and data architecture still has an important role to play.”
At Adastra, we recognize that agility is paramount and that working with a sense of urgency is essential for success in today’s business world. While we recognize and support the demise of up-front, big bang data architecture, schema-on-read is not a silver bullet, and data architecture still has an important role to play. Data architecture is about so much more than an up-front investment in defining the structures used to store data; it’s about understanding the data and its relationships, and how these can be put together to generate insights. Ask yourself the following question: does it make sense to have each consumer spend time understanding the data and its relationships, only to throw away the fruits of all that labour and have the next consumer perform the same (or very similar) work again?