# Data Quality: Noise in Data
“The need for high-quality, trustworthy data in our world will never go away.”
While this is true, it comes with its own set of challenges, the first of which is collecting quality data.
With the growth in data, the need is greater than ever before. Even though we have evolved from data silos to pipelines (ETL/ELT), to streaming, to the modern data stack/warehouse, multi-cloud, and data mesh, we still face the age-old problem of trusting data: What data is good for what purpose? What data can be used where? How can we improve it? What data is sensitive?

It begs the question of why this remains unanswered after so many decades and with so many disciplines of Data Management, such as Data Governance, Data Quality, Data Observability, Data Catalog, Master Data Management, Data Remediation, Data Discovery, Active Metadata Management, Data Privacy, Data Intelligence, Data Lineage, Reference Data Management, and on and on. We talk and do so much that the intent and purpose of why we started, namely trusting and improving data, gets lost. Trust me, I am with you on this. If you don't believe me, see my hand-drawn picture of me.
For those brave data folks who dared to practice all those disciplines and are still hanging in with us, this is our effort to cut through the noise and point out what is actually needed.
First, let's define some factors that identify and expose the noise in the space of Modern Data Quality.
Scaling with the increase in data (Scale) — As data grows, we need a platform that can scale to handle large datasets and diverse types of data and, more importantly, support different types of architectures. When you are dealing with petabytes of data and harvesting both data and metadata, speed and scale are critical, but so is the ability to scale with a no-code to low-code approach.
Each organization is unique (Context) — All organizations are not equal. Even within the same vertical, with similar technologies and architectures and the same data, every organization is still different because its processes, people, and customers are different. Therefore the way we view, measure, and understand the data is completely different. In other words, each organization has something unique in its context and culture, and if we do not understand the data in that context, putting any solution in place is pretty much useless. For example, the range of annual income and its treatment are very different in marketing vs. risk analysis vs. underwriting: same number, same values, but different views and interpretations.
Support as organizations evolve (Maturity) — Mergers and acquisitions are inevitable as an organization grows, and with them comes a blend of complexities, the need to handle diverse environments, and the need to support different levels of maturity. Otherwise, you are talking about a new platform every year, and that is not going to work. The time it takes for a startup to become a unicorn has decreased tremendously, so the ability to support different maturity levels is a must.
Relevance to Business Value (Impact) — Sometimes we focus so much on data that we forget the primary focus: business value. It is not about observing, monitoring, alerting, identifying nulls and blanks, or looking at the various dimensions. Nor is it about detecting outliers purely from a data perspective; it is about how well we relate to the strategic initiatives across different functions. For example, a price reduction to drive retention in a tight economy, targeted toward specific demographics, may show up as an outlier even though it was intended per the business strategy (see the short sketch after this list). The last thing anyone wants is to get spammed, whether by email or Slack, or to spend hours on root cause analysis only to find out everything is fine. It has to make business sense, resonate well, and work harmoniously with the business's missions and goals.
Lack of time to value (Time & Cost) — If you are implementing a heavily process-based platform or a thick layer of governance with people, processes, and technology, forget it! You are wasting time and money. By the time you finish, your data landscape will most likely have changed, and so will your regulatory needs and customer expectations. Time to value should be measured in days, not weeks, not months, and definitely not years!
Support users of all types (Stewardship) — In any organization, we have two types of users: business and technical. Both are relevant for improving quality and enabling a trusted framework; otherwise, we are not speaking the same language but building more silos and barriers.
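To make the Impact point above concrete, here is a minimal, hypothetical sketch (not from any particular platform) of why a purely statistical check needs business context. The product, prices, threshold, and promotions calendar are all illustrative assumptions.

```python
# Hypothetical sketch: a naive z-score outlier check on daily prices,
# tempered by business context (a promotions calendar) so that an
# intentional price reduction does not trigger a spam alert.
from statistics import mean, stdev

# Daily average selling price; the drop on 2024-03-10 is a planned,
# targeted promotion, not a data error (all values are made up).
daily_prices = {
    "2024-03-07": 49.99,
    "2024-03-08": 50.25,
    "2024-03-09": 49.75,
    "2024-03-10": 34.99,  # planned promotion
    "2024-03-11": 50.10,
}

# Business context maintained outside the data pipeline.
promotion_dates = {"2024-03-10"}

prices = list(daily_prices.values())
mu, sigma = mean(prices), stdev(prices)

for day, price in daily_prices.items():
    z = (price - mu) / sigma
    statistically_anomalous = abs(z) > 1.5  # naive threshold for this sketch
    if statistically_anomalous and day not in promotion_dates:
        print(f"ALERT {day}: price {price} looks anomalous (z={z:.2f})")
    elif statistically_anomalous:
        # An outlier by the numbers, but expected per the business strategy,
        # so suppress the alert instead of spamming the team.
        print(f"OK {day}: expected promotional price {price}")
```

The point is not the particular statistic; it is that the promotions calendar, a piece of business context, is what turns a noisy alert into a non-event.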
## Data Accuracy

Data Accuracy is one of the so-called "dimensions" of Data Quality. The goal of these dimensions, and it is a noble one, is that we can measure each of them and, should deficiencies be found, apply a uniform set of best practices. Of course, these best practices will differ from dimension to dimension. But just how feasible is this for Data Accuracy?
We give the definition of Data Accuracy as:
“The degree to which a data value represents what it purports to represent.”
One of the problems with the dimensions of Data Quality is that there are differences of opinion about how each should be defined. No doubt our definition can be improved, but we will take it as a basis for exploring how we should estimate Data Accuracy.
The next thing we must consider is that 100% Data Accuracy is impossible to achieve for observations. Two great minds in Quality Control provide the basis for this understanding. Walter A. Shewhart pointed out that all systems of measurement introduce error. W. Edwards Deming, who was taught by Shewhart, went further and pointed out that there is "no true value of anything."
So it seems we can never achieve 100% Data Accuracy. If that is the case, we will want to know how we can assess Data Accuracy, since we will want to know just how imperfect it is. And this is where we run into another hard truth: we cannot do this by considering the data alone. This is potentially a difficulty for data professionals; we are used to working with data and like working with it. But to estimate Data Accuracy, we need to step outside the data, devise a method to independently assess a sample of the population we are interested in for the data values in question, and compare that with the data under curation. It simply cannot be done from entirely within the data itself. Of course, the way we assess Data Accuracy is likely to vary from data element to data element.
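As a minimal sketch of what such an assessment might look like (an assumption on our part, not a prescribed method), consider one data element, a hypothetical date_of_birth field: draw a sample of records, re-verify the values through an independent process such as contacting the customer or checking a source document, and report the match rate as an accuracy estimate. The record keys and values below are invented for illustration.

```python
# Sketch: estimate Data Accuracy for one data element by comparing a sample
# of curated values against independently re-verified values.
curated = {  # values as they sit in the curated dataset (illustrative)
    "cust-001": "1985-02-14",
    "cust-002": "1990-07-01",
    "cust-003": "1979-11-30",
    "cust-004": "2001-05-22",
    "cust-005": "1966-03-09",
}

# The same sampled records, re-collected through an independent process.
# Even this "reference" carries measurement error, so the result is only
# an estimate, never a true 100% figure.
independently_verified = {
    "cust-001": "1985-02-14",
    "cust-002": "1990-07-11",  # disagreement found by the independent check
    "cust-003": "1979-11-30",
    "cust-004": "2001-05-22",
    "cust-005": "1966-03-09",
}

matches = sum(
    1 for key, value in independently_verified.items() if curated.get(key) == value
)
accuracy_estimate = matches / len(independently_verified)
print(f"Estimated accuracy of date_of_birth on this sample: {accuracy_estimate:.0%}")
# -> Estimated accuracy of date_of_birth on this sample: 80%
```

The essential step is the one outside the data: the independent re-verification. The arithmetic at the end is trivial by comparison.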
There is one exception to this, which is where the data is itself the managed reality. In these cases, the data is not based on observations of something outside itself. Anything dematerialized, such as a bank account, falls into this category. In the database that actively manages a bank account, the data is the reality. However, if we look at bank accounts and try to, say, capture their balances, then we are making observations and we are back to normal data. We will explore this distinction further in the future.
Welcome to the Data Quality Series. In the next set of blogs, we will see how each of the above-mentioned disciplines has its own set of challenges, and how to overcome them to get the trusted, quality-assured data we need.
The goal of this series of blogs is to help data leaders and influencers avoid ending up as victims of fads or broken ships.
Stay tuned for the next installments of the data series…