Data Modeling and Big Data

Ivan Schotsmans

16 September 2017

In the past, the customary practice when ingesting data was to develop a data model, and then load the data in accordance with this (pre-specified) data model. Times have changed. With the emergence of the “data lake” architecture pattern, we are no longer constrained to the traditional schema-on-write approach. Under that approach, we would first define our schema, then write the ingestion code that reads the data until it ends up in the model defined upfront. Big data applications take the opposite tack: we load the data as-is, and figure out later how the data is structured. The latter is referred to as “schema-on-read.”
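To make the contrast concrete, here is a minimal sketch of the two approaches; the field names and schema are hypothetical, purely for illustration:

```python
import json

# Schema-on-write: validate against a predefined schema at ingestion time.
SCHEMA = {"customer_id": int, "name": str}  # hypothetical, defined up front

def write_with_schema(record, store):
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"record rejected: {field} must be {ftype.__name__}")
    store.append(record)  # only conforming records ever land in the store

# Schema-on-read: store the raw text as-is, interpret the structure at read time.
def read_with_schema(raw_line):
    record = json.loads(raw_line)  # structure discovered only when we read
    return {"customer_id": int(record["customer_id"]), "name": str(record["name"])}

lake = []
lake.append('{"customer_id": "42", "name": "Acme"}')  # loads even though the id is a string
parsed = read_with_schema(lake[0])                    # cleanup deferred to read time
```

Note how the same record that schema-on-write would reject (the id arrives as a string) loads into the “lake” without complaint, and the repair work simply moves to the read path.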

Both approaches have pros and cons. We can’t deny that schema-on-read offers a lot of flexibility: you load and store the data without any constraints. Under the traditional schema-on-write paradigm, a lot of data will be rejected, or must be adjusted before it can be made to “fit” our data model.

We also can’t deny that at a certain moment we will have to present our data to our business users. At that moment we will have to check our data for completeness, correctness, and accuracy. The modeling has to be done anyway.

In her blog on big data modeling (April Reeve, “Big Data Modeling,” July 2013), April Reeve writes that the answer lies in what we mean by “data modeling” and “big data.” Her blog concludes that it is necessary in any design that involves the movement of data between systems, whether Big Data or not, to specify the lineage in the flow of data from physical data structure to physical data structure. That includes the mappings and transformation rules necessary from persistent data structure to message to persistent data structure. This level of design requires an understanding of both the physical implementation and the business meaning of the data. We don’t usually call this activity modeling but, strictly speaking, design.
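Such a source-to-target specification can be as simple as a table of mappings and transformation rules. A minimal sketch, with hypothetical field names and rules:

```python
# Source-to-target mapping: each target field records its source field (the lineage)
# and the transformation rule applied in between. All names here are hypothetical.
MAPPING = {
    "customer_name": {"source": "cust_nm", "transform": str.strip},
    "country_code":  {"source": "cntry",   "transform": str.upper},
}

def apply_mapping(source_record):
    """Move one record from the source structure to the target structure."""
    return {target: rule["transform"](source_record[rule["source"]])
            for target, rule in MAPPING.items()}

target = apply_mapping({"cust_nm": "  Acme Corp ", "cntry": "be"})
```

The point is that the mapping itself is an explicit, reviewable artifact: it documents both where each target field comes from and what was done to it along the way.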

I would go even one step further back. In most large organizations, it’s still data chaos. Adding Big Data will only make the situation worse. I am convinced that an EDM will force an organization to create a data governance plan, and to stand up an organization in support of these goals. Data modeling skills will be necessary to define common business concepts and corporate-wide definitions. If you can’t do this, at the very least you will need to define a specific dictionary for each deliverable.

Big Data should contribute to the value of an EDM. Regardless of the technology and the business use case, the more encompassing the information provisions from a data lake or Big Data store, the more value you will get from an EDM.

I have seen big data projects struggle to align with the basic processes and data definitions of an organization (product, customer). As a result, data quality errors popped up, and a lot of rework was necessary to bring everything in line. This surfaces the classic “pay-me-now, or pay-me-later” dynamic: at some point in time, the data modeling has to be done. A data lake (schema-on-read approach) may allow you to defer it for a while, but you cannot ever skip it.

Data Strategy

Ivan Schotsmans

30 May 2017

The objective is to create a centralized, highly standardized, and tightly controlled data governance function in the enterprise. By centralizing control of the data, enterprise-wide concerns can be addressed. Resolving differences in definitions and interpretations of data will begin the process of providing management with the information it needs to make informed and effective decisions. Data will be designed to meet corporate needs, rather than specific application needs.

A first step in making data available is to convince leadership that data is an asset and must be handled as a valuable corporate resource. This implies that data must be understood throughout the enterprise, and must be reflected in all policies and procedures. This immediately brings data owners and data stewards into the picture. Their role is important: residing in the business community, they are responsible for managing the data on a day-to-day basis. Every knowledge worker is responsible for the accuracy of data, and should apply the definitions and rules defined by the data governance organization.

Data owners and stewards are responsible for data classification. Not every data element has the same use or priority. Usually data gets classified as corporate or local, to identify how the data needs to be handled. This does not mean that local data is not important, but different data standards usually apply, for the simple reason that its impact on the organization is local, whereas corporate data is shared across the entire enterprise. A corporate data asset’s definition, both semantic and syntactic, must be able to accommodate the widest possible usage while retaining a specific business meaning.
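In its simplest form, such a classification is just a registry that stewards maintain and that tells every consumer which standards govern an element. A minimal sketch, with hypothetical element names:

```python
# Data classification sketch (element names are hypothetical): corporate elements
# are governed by enterprise-wide standards, local ones by the owning business unit.
CLASSIFICATION = {
    "customer_id":       "corporate",  # shared across the entire enterprise
    "product_code":      "corporate",
    "branch_shelf_slot": "local",      # only meaningful to one business unit
}

def standards_for(element):
    # Default to "local" until a steward has explicitly classified the element.
    scope = CLASSIFICATION.get(element, "local")
    return "enterprise-wide standards" if scope == "corporate" else "local standards"
```

Even a registry this small makes the handling rule explicit instead of leaving it to each application team’s judgment.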

One aspect of the data strategy that is often forgotten is data redundancy. Copies of data will always be required to make data available and accessible. As a general principle, the goal should be to have as few copies of data as possible, and only those that are required. Preferably, all copies of data must be direct copies of a single master. The data definition of copies should not be changed. The data governance group needs to help ensure that all data and data editing rules are properly addressed by providing one central point of definition and issue resolution. Why a central point? Awareness of existing data and its definitions will facilitate sharing of data across the enterprise. Legacy systems are typically the biggest risk of uncontrolled data duplication. To mitigate this risk, the use of legacy systems must be limited to the absolute minimum wherever possible.

To conclude: a centralized approach means that issue resolution cannot be done application by application, for the simple reason that legacy systems are no longer part of the current application landscape but are only used for reporting. Another reason is that some applications have only a local function. Nevertheless, data elements considered corporate should be handled like any other corporate data element. If necessary, new data stores can be created instead of integrating with existing ones.

How To Build the Business Case for an Enterprise Data Model

Ivan Schotsmans

30 April 2017

We persuaded the organization to pursue an Enterprise Data Model. Our objective is to align the business requirements with the organization’s business values and vision as a starting point. Let’s begin to make the business case.

The first step in establishing your business case is to find a sponsor who will champion and support your initiative: someone with a vision of the end state, who is willing to set aside the necessary budget to realize that goal. You’re not served by a nominal sponsor; instead you need someone with authority within the organization who socializes the message that an EDM is the way to go. This is very important, or else your initiative will end up as a paper EDM: you created the model on paper, but nobody follows your guidelines and rules.

Don’t fall into the trap of making it a purely IT effort. Along the entire implementation process you will impact the day-to-day business. You will never get the prerequisite buy-in when the business itself is not driving the EDM initiative.

Define a clear scope for your EDM. You can’t build the business case as one big bang. Finishing the EDM may take more than 10 years, so make a clear step-by-step plan for the first 3 years, and split it into yearly plans with clear scope and deliverables. Your activities will impact day-to-day development, so it’s important that you can measure the costs, as well as the tangible and intangible benefits, of the EDM in each impacted project or program. All your efforts should be captured in a framework with expected budget, costs, and deliverables. On top of this framework, you need to define a number of KPIs. Each framework should cover a measurable timeframe linked to a budget. This allows a clear overview of the progress and cost of your EDM.

Don’t assume that your EDM is merely a modeling effort. You should also cover the processes linked to each aspect of the data model. To achieve efficiency improvements, you need to focus on data quality and standardization. Both have an impact on daily processes. For example: how are you going to prevent new data quality errors from entering the system? There is a high probability you will have to adapt one or more processes to achieve these gains.
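One common process change is a quality gate at the point of entry, so that records failing the agreed rules never reach the store. A minimal sketch; the rules and field names are hypothetical:

```python
import re

# A data-quality gate at the point of entry: every rule must pass before a
# record is accepted, so new errors never enter the system. Rules are hypothetical.
RULES = {
    "email":       lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "customer_id": lambda v: isinstance(v, int) and v > 0,
}

def quality_gate(record):
    """Return the list of failed rules; an empty list means the record may enter."""
    return [field for field, check in RULES.items()
            if not check(record.get(field, ""))]

errors = quality_gate({"email": "not-an-email", "customer_id": 7})
```

The point for the business case is that the rules live in one governed place rather than being re-implemented, slightly differently, in every entry application.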

It is recommended that you create a separate team to handle the EDM: a team that focuses on the implementation of the EDM and follows each project and/or program, providing reassurance that developments are in line with the objectives of the EDM. You also need an architecture board to approve the data model and process for each project and/or program. This board will decide which data items are modeled next, approve the process, and oversee the costs involved.

Don’t forget that you are not the first to pursue an EDM. Look for peers, talk to vendors that provide EDMs, and incorporate their lessons learned in your business case.

And last but not least: communication. From day one you need to communicate your progress, improvements, budget, costs, and so on. The entire organization needs to be aware of this effort.

Usefulness of an Enterprise Data Model

Ivan Schotsmans

5 April 2017

The market is constantly changing and almost every organization struggles to keep up with these changes: new technologies, increased competition, more data, revenue and cost pressures, and, certainly in Europe, regulatory requirements. Organizations must stay on top of all this data to make accurate decisions. This is where data modeling comes into the picture.

Steve Hoberman, Donna Burbank and Chris Bradley describe the importance of a high-level data model and how to master the techniques of building one (Data Modeling for the Business, Technics Publications, 2009). A data model is a visual representation of the people, places, and things of interest to a business. It is used to facilitate communication between business people and technical staff. Designing a data model is not only vital, but also seen as best practice.

Designing a data model fits into the software development cycle where we design information blocks (silos), concentrating on local entities and the logical dependencies between them (specific business areas). It functions as a communication tool between business and technical staff.

An Enterprise Data Model connects the dots between the different business areas and represents the data of an entire organization. “The enterprise data model is the heart and soul of enterprise data architecture” (DAMA International 2010). Once completed, it will look like an architect’s construction plan for a house. The construction workers will use this plan to build the house and will have to make adjustments during the construction process. An EDM is never finished; it’s a living document. Businesses evolve, and your data model will need constant updates to stay in line with changing business requirements.

The concept of an Enterprise Data Model is more than 30 years old, and in that period it has created a lot of buzz. There are data modeling specialists in favor, and those against. One reason for the disharmony is the huge impact on a company’s day-to-day business. It’s not like a project plan in which you handle a few business requirements; you bundle all business requirements, as-is and to-be, and build your EDM from them. It’s a multi-year plan, and you need to find a balance between your daily business and the implementation of the EDM.

Given that it is so challenging, why are so many people eager to have an Enterprise Data Model? An EDM is seen as a way to save costs in the software development cycle, and a tool to decrease time to market. Even so, it’s very difficult to build a business case that must cover 10 years. Besides the long-term planning, you will have to sell the impact on every project and/or program, small or big.

Let’s start making a business case.