16 September 2017
In the past, the customary practice when ingesting data was to develop a data model, and then load the data in accordance with that pre-specified model. Times have changed. With the emergence of the “data lake” architecture pattern, we are no longer constrained to the traditional schema-on-write approach, in which we first define our schema, then write the ingestion code that reads the data and lands it in the model we defined upfront. Big data applications take the opposite tack: we load the data as-is, and figure out later how the data is structured. The latter is referred to as “schema-on-read.”
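To make the contrast concrete, here is a minimal sketch in Python. It uses SQLite for the schema-on-write side and raw JSON lines for the schema-on-read side; the table layout, field names, and sample records are illustrative assumptions, not taken from any particular project.

```python
import json
import sqlite3

# --- Schema-on-write: the model is fixed before any data is loaded. ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER NOT NULL, name TEXT NOT NULL)")

records = [
    {"id": 1, "name": "Acme"},
    {"id": 2},  # missing "name": violates the pre-specified model
]

for rec in records:
    try:
        db.execute("INSERT INTO customer (id, name) VALUES (?, ?)",
                   (rec.get("id"), rec.get("name")))
    except sqlite3.IntegrityError as exc:
        print(f"rejected at write time: {rec} ({exc})")

# --- Schema-on-read: store everything as-is, interpret it at read time. ---
with open("landing_zone.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")  # nothing is rejected here

with open("landing_zone.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        # Structure is imposed only now, when the data is read.
        print(rec.get("id"), rec.get("name", "<unknown>"))
```

Note that the second record is rejected outright under schema-on-write, while under schema-on-read it lands in the file untouched and the gap only surfaces when someone reads it.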
Both approaches have pros and cons. We cannot deny that schema-on-read offers a lot of flexibility: you load and store the data without any constraints. Under the traditional schema-on-write paradigm, a lot of data will be rejected, or must be adjusted before it can be made to “fit” the data model.
Nor can we deny that at a certain moment we will have to present our data to our business users. At that moment we will have to check our data for completeness, correctness, and accuracy. The modeling has to be done anyway.
In her blog on big data modeling (April Reeve, “Big Data Modeling,” July 2013), April Reeve argues that the answer lies in what we mean by data modeling and big data. She concludes that any design involving the movement of data between systems, whether Big Data or not, must specify the lineage in the flow of data from physical data structure to physical data structure. That includes the mappings and transformation rules from persistent data structure to message to persistent data structure. This level of design requires an understanding of both the physical implementation and the business meaning of the data. We don’t usually call this activity modeling, but strictly design.
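A mapping-and-transformation specification of the kind Reeve describes can be as simple as a declarative table of source field, target field, and rule. The sketch below is hypothetical; the field names and rules are assumptions made up for the example, not drawn from her blog.

```python
from datetime import datetime

# Hypothetical source-to-target mapping: each entry records where a field
# comes from, where it lands, and the transformation rule applied in between.
# This is exactly the lineage information the design has to capture.
MAPPING = [
    # (source field, target field,    transformation rule)
    ("cust_no",      "customer_id",   int),
    ("cust_nm",      "customer_name", str.strip),
    ("birth_dt",     "date_of_birth",
     lambda s: datetime.strptime(s, "%d-%m-%Y").date()),
]

def transform(source_record: dict) -> dict:
    """Apply the documented mapping to one source record."""
    return {target: rule(source_record[source])
            for source, target, rule in MAPPING}

print(transform({"cust_no": "42", "cust_nm": " Acme ", "birth_dt": "01-02-1990"}))
# {'customer_id': 42, 'customer_name': 'Acme', 'date_of_birth': datetime.date(1990, 2, 1)}
```

Keeping the mapping as data rather than burying it in ingestion code means the lineage can be reviewed, versioned, and handed to both the engineers and the business.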
I would even go one step further back. In most large organizations, it is still data chaos, and adding Big Data will only make the situation worse. I am convinced that an enterprise data model (EDM) will force an organization to create a data governance plan, and to stand up an organization in support of these goals. Data modeling skills will be necessary to define common business concepts and corporate-wide definitions. If you can’t do this, at the very least you will need to define a specific data dictionary for each deliverable.
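Even such a per-deliverable dictionary can be kept lightweight. As a sketch (the concepts, owners, and attribute formats below are invented for illustration), an entry only needs to pin down the definition, the owner, and the agreed shape of each business concept:

```python
# A minimal per-deliverable data dictionary: one entry per business concept.
# All names, owners, and formats below are illustrative assumptions.
DATA_DICTIONARY = {
    "customer": {
        "definition": "A party that has signed at least one contract with us.",
        "owner": "Sales Operations",
        "attributes": {
            "customer_id":   {"type": "int",  "nullable": False},
            "customer_name": {"type": "str",  "nullable": False},
            "date_of_birth": {"type": "date", "nullable": True},
        },
    },
    "product": {
        "definition": "A sellable item listed in the corporate catalogue.",
        "owner": "Product Management",
        "attributes": {
            "product_id":   {"type": "int", "nullable": False},
            "product_name": {"type": "str", "nullable": False},
        },
    },
}
```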
Big Data should contribute to the value of an EDM. Regardless of the technology and the business use case, the more encompassing the information provision from a data lake or Big Data store, the more value you will get from an EDM.
I have seen big data projects struggle to align with the basic processes and data definitions of an organization (product, customer). As a result, data quality errors popped up, and a lot of rework was necessary to bring everything back in line. This surfaces the classic “pay me now, or pay me later” dynamic: at some point in time, the data modeling has to be done. A data lake (with its schema-on-read approach) may allow you to defer it for a while, but you can never skip it.