Garbage In, Garbage Out: How Poor Data Quality Clogs the Machine Learning Training Pipeline
What do the success stories of businesses as diverse as Amazon, Airbnb, and Kakao Bank have in common? The answer is data, and a leadership that was relentless in the pursuit of good data quality. In the digital age, good-quality data is a key differentiator: an invaluable asset that gives organizations an edge over competitors who have not been as dogged about data quality. Those competitors are burdened with substandard, scattered, duplicate, and inconsistent data that weighs them down more heavily than iron shackles. In the world businesses operate in today, the divide is not between big and small organizations but between organizations that have invested in improving their data quality and those that have not.
A single rotten apple spoils the barrel!
We have all heard the story of how a single rotten apple spoils the barrel. It is more or less the same story when it comes to data. Unclean, unrefined, and flawed data does more harm than good. Gartner estimates that poor-quality data costs an organization $15 million per year. Though survey after survey talks about monetary losses, unclean or unrefined data impacts more than the bottom line: it prevents businesses from deriving actionable insights from their data, leads to poor decisions, and drives dissatisfaction among all the people who matter, namely partners, vendors, and regulatory authorities. We have also heard of several instances where poor data quality quickly snowballed into a major issue, such as a money-laundering scam, leading to loss of reputation as well.
Today, when a majority of organizations are leveraging the power of AI and machine learning tools and investing millions to stay ahead of the curve, bad data can be the reason they fail to meet their ROI targets. Organizations pour money into AI and ML tools, but the return on that investment is constrained by bad-quality data.
Bad data hurts the American economy
The impact of bad data on the American economy is not a trickle-down effect; rather, it is a gigantic leak that is hard to plug. Collectively, the cost of bad decisions made from flawed data runs into millions and billions of dollars. Ollie East, Director of Advanced Analytics and Data Engineering at the public accounting and consulting firm Baker Tilly, says that bad data costs American businesses about $3 trillion annually and breeds bad decisions made from data that is simply incorrect, unclean, and ungoverned.
Banks and FIs are no exception to the rule. In fact, because of privacy and regulatory requirements, they stand to lose more from bad data. Of the zebibytes of data (including dark data) in existence organization-wide today, organizations are capitalizing on only a minuscule percentage. Banks and FIs can ensure that they do not lose business, revenue, and clientele on account of poor data quality; it only takes a bit of effort and strategic planning. Further, the phenomenal success of new-age technologies like AI and machine learning has changed the rules of the game and enabled banks and FIs to fish value from even dark data, provided they take a planned approach to data standardization, data consistency, and data verification, and ensure that the data is streamlined for repeated use. Organizations must also account for new data entering their workflows and pipelines and put a suitable mechanism in place to keep it clean and standardized.
To reiterate, why lose the competitive advantage? Here's a look at how organizations, banks and FIs in particular, can harness the power of cleaner, structured data to make their processes crisper, leaner, and undeniably more efficient.
Step 1: Pre-processing data – making data good for downstream processes
Pre-processing of data is the first step in the journey towards cleaner and refined data. Not many organizations today can claim that their data quality meets expectations; the Harvard Business Review states that "only 3% of companies' data meets basic quality standards". Pre-processing of data is therefore critical for the following reasons:
- Identifying what's wrong with the organization's data. What are the core issues?
- As the data is more likely to be used again and again in workflows, processes and systems enterprise-wide, good quality data with the right encryptions minimizes conflict of interest and other such discrepancies.
- Also, as most organizations are likely to be using some kind of AI and ML for their processes involving this data – it is better to get it in shape to reap the maximum benefits.
Garbage In Garbage Out – The true potential of AI and ML can be leveraged only when data quality is good
Today, data scientists and analysts spend more time pre-processing data for quality (fine-tuning it) than analyzing it for business and strategic insights. This iterative pre-processing, though extremely time-consuming, is important because if organizations feed bad, poor-quality, unrefined data into an AI model, it will quite literally spew garbage. Garbage in results in garbage out. To leverage the true potential of AI and ML, the data fed into the downstream machine learning pipeline must be of high quality.
There are, of course, other substantial benefits as well. For one, when data is cleaned at the point of capture or during entry, banks and FIs have a cleaner database for future use. For example, by preventing the entry of duplicates at the point of capture (by either manual or automated means), organizations are spared menial and repetitive work. It is also relatively easy to build the training model once the data is refined and streamlined. And when banks and FIs have a more dependable AI pipeline (thanks to cleaner data), they can gain valuable insights that give them a strategic advantage.
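As an illustration of preventing duplicates at the point of capture, here is a minimal Python sketch. The `CustomerRegistry` class, the `normalize_key` helper, and the choice of name plus email as the identifying fields are all assumptions made for this example; a real system would use whatever identifiers the organization trusts.

```python
import re


def normalize_key(name: str, email: str) -> tuple[str, str]:
    """Build a comparison key that ignores case and stray whitespace."""
    clean_name = re.sub(r"\s+", " ", name).strip().lower()
    clean_email = email.strip().lower()
    return clean_name, clean_email


class CustomerRegistry:
    """Hypothetical in-memory store that rejects duplicates at entry time."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()
        self.records: list[dict[str, str]] = []

    def add(self, name: str, email: str) -> bool:
        """Store the record and return True, or return False for a duplicate."""
        key = normalize_key(name, email)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.records.append({"name": name.strip(), "email": email.strip()})
        return True


registry = CustomerRegistry()
first = registry.add("Asha Rao", "asha.rao@example.com")
# Same person, entered with different spacing and capitalization.
second = registry.add("  asha   RAO ", "Asha.Rao@Example.com")
print(first, second, len(registry.records))
```

The second entry is rejected because it normalizes to the same key as the first, so only one record is stored. Exact-key matching of this kind only catches duplicates that differ in case or spacing; spelling variations need a fuzzier comparison.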
Carrying out data quality checks
To ensure that data is up to date and dependable, organizations can apply several levels of checks or quality tests. The most basic is a quick fact check of the data against a universally known truth: for example, in a dataset the age field cannot have a negative value, nor can the name field be null. However, a quick fact check tests only the data and not the metadata, which is a source of extremely valuable information such as the origin and creator of the data. Therefore, a comprehensive test of data quality calls for holistic or historical analysis of datasets, in which organizations test individual data for authenticity or compare them with historical records for validation.
Manual testing: Here, staff manually verify the values for data types, character lengths, formats, and so on. Manual verification is not desirable, as it is exceedingly time-consuming and highly error-prone. Alternatives include open-source projects and, in some cases, coded solutions built in-house, but neither is as popular as automated data quality testing tools.
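The quick fact check described above (age cannot be negative, name cannot be null) might look like the following Python sketch. The function name `quick_fact_check` and the record layout are assumptions for illustration only.

```python
from typing import Any


def quick_fact_check(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the record passed."""
    problems: list[str] = []

    # The name field must be present and must not be blank.
    name = record.get("name")
    if name is None or not str(name).strip():
        problems.append("name is missing or empty")

    # The age field must be a whole number and cannot be negative.
    age = record.get("age")
    if not isinstance(age, int) or isinstance(age, bool):
        problems.append("age is missing or not an integer")
    elif age < 0:
        problems.append("age cannot be negative")

    return problems


records = [
    {"name": "Asha Rao", "age": 34},
    {"name": None, "age": 41},
    {"name": "Ben Lee", "age": -5},
]
results = [quick_fact_check(r) for r in records]
for record, problems in zip(records, results):
    print(record, "->", problems or "OK")
```

Note that this checks only the values themselves. It says nothing about metadata such as where a record came from, which is why it counts as a basic check.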
Automated data quality testing tools: Using advanced algorithms and data matching techniques, these tools make it easier for organizations to test data quality in a fraction of the time that manual effort takes. However, as noted earlier, machines are only as good as the training they receive. If unclean, flawed data is poured into the training pipeline, it clogs the machine and prevents it from giving the desired results.
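One simple form of data matching is approximate string comparison. The sketch below uses `difflib.SequenceMatcher` from Python's standard library; the `find_likely_duplicates` helper and the 0.85 threshold are assumptions chosen for this example, not values taken from any particular tool.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def find_likely_duplicates(
    names: list[str], threshold: float = 0.85
) -> list[tuple[str, str]]:
    """Return every pair of names whose similarity meets the threshold."""
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            if similarity(left, right) >= threshold:
                pairs.append((left, right))
    return pairs


names = ["John Smith", "Jon Smith", "Jane Doe"]
likely_duplicates = find_likely_duplicates(names)
print(likely_duplicates)
```

Here "John Smith" and "Jon Smith" are flagged as a likely match while "Jane Doe" is not. Comparing every pair is quadratic in the number of records, so this approach suits small batches; larger datasets would need to group candidates first.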
The machines have to be taught like humans to understand and manipulate data so that exceptions can be raised and only clean filtered data remain in the dataset. Organizations can gain intelligence from their data either through rules-based engines or machine learning systems.
1. Rules-based system: Rules-based systems work on a set of strict rules that state what follows "if" a certain criterion is met or not met. Rules-based data quality testing tools allow organizations to validate datasets against custom-defined data quality requirements. Rules-based systems require less effort and are also less risky, since false positives are not a concern. It is often asked whether rules-based tools and processes are slowly becoming antiquated as banks and FIs deal with an explosion of data. Probably not; they are still a long way from going out of fashion. It still makes sense to use the rules-based approach where the risk of false positives is too high and hence only rules that ensure 100 percent accuracy can be implemented.
2. Machine learning systems: A machine learning system simulates human intelligence. It learns from the data that it is given (the training data). Like a child that learns from its parent, it picks up the good, the bad, and the ugly. Hence businesses must be extremely careful at the outset. They cannot expect optimum results if they are not careful with the quality of the data used for training. When it comes to learning capacity and potential, however, ML-based systems are virtually unlimited.
Though there are several ways for the machine to learn, supervised learning is the first step. Every time new data gets incorporated into the datasets, the machine learns. This element of continuous learning means that, in time, the system requires minimal human intervention, which is good because banks and FIs would like to engage their manpower in far more critical tasks. As the machine interprets and categorizes data using its historical antecedents, it becomes much smarter and infinitely more capable than humans.
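To make the rules-based approach from item 1 concrete, here is a minimal Python sketch of a rule engine that separates clean records from exceptions. The `Rule` class, the three sample rules, and the trade-like record fields are all hypothetical; real rules would come from an organization's own data quality requirements.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Rule:
    """One 'if this condition fails, raise an exception' rule."""
    name: str
    field: str
    check: Callable[[Any], bool]


# Hypothetical rules, evaluated in this order for every record.
RULES = (
    Rule("currency is a 3-letter code", "currency",
         lambda v: isinstance(v, str) and len(v) == 3 and v.isalpha() and v.isupper()),
    Rule("amount is positive", "amount",
         lambda v: isinstance(v, (int, float)) and not isinstance(v, bool) and v > 0),
    Rule("account id is present", "account_id",
         lambda v: isinstance(v, str) and v.strip() != ""),
)


def validate(record: dict[str, Any]) -> list[str]:
    """Return the names of all rules the record violates."""
    return [rule.name for rule in RULES if not rule.check(record.get(rule.field))]


def split_clean_and_exceptions(records: list[dict[str, Any]]):
    """Keep passing records; pair each failing record with its violated rules."""
    clean, exceptions = [], []
    for record in records:
        failures = validate(record)
        if failures:
            exceptions.append((record, failures))
        else:
            clean.append(record)
    return clean, exceptions


trades = [
    {"account_id": "AC-1001", "currency": "USD", "amount": 2500.0},
    {"account_id": "", "currency": "usd", "amount": -10},
]
clean, exceptions = split_clean_and_exceptions(trades)
print(len(clean), "clean;", exceptions[0][1])
```

Because every rule is an explicit condition, each exception can be traced back to the exact rule that raised it, which is the property that makes this approach attractive where accuracy requirements are strict.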
In the realm of dark data
Every day, banks and FIs generate, process, and store humongous amounts of data or information assets. Unfortunately, much of this data (nearly 80%) remains in the dark: banks and FIs rarely tap into it for business insights or for monetizing the business. However, machine learning systems can help organizations unearth value from dark data with minimum effort. Learning, in this case, begins with making observations about the data, finding patterns, and eventually using them to make good strategic decisions, all based on historical evidence or previous examples. What the system does here is simply alert the supervisor to an exception, then process that information and learn from it; that is what continuous learning does in the long term.
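The loop of flagging an exception, alerting a supervisor, and learning from the outcome can be sketched in a few lines of Python. This is a deliberately simple statistical stand-in (median and median absolute deviation over historical values), not a trained machine learning model. The `HistoricalBaseline` class, the tolerance of 5, and the `supervisor` callback that stands in for a human reviewer are all assumptions made for the example.

```python
from statistics import median
from typing import Callable


class HistoricalBaseline:
    """Flags values that sit far from the history seen so far.

    Assumes the history is non-empty.
    """

    def __init__(self, history: list[float], tolerance: float = 5.0) -> None:
        self.history = list(history)
        self.tolerance = tolerance

    def is_exception(self, value: float) -> bool:
        """True when value is more than `tolerance` MADs from the median."""
        center = median(self.history)
        mad = median(abs(x - center) for x in self.history)
        if mad == 0:
            return value != center
        return abs(value - center) / mad > self.tolerance

    def process(self, value: float, supervisor: Callable[[float], bool]) -> bool:
        """Accept routine values; ask the supervisor about exceptions.

        Accepted values join the history, so later checks reflect them.
        """
        if self.is_exception(value) and not supervisor(value):
            return False
        self.history.append(value)
        return True


def supervisor(value: float) -> bool:
    """Hypothetical reviewer: approves anything under 200."""
    return value < 200


baseline = HistoricalBaseline([100, 102, 98, 101, 99, 103, 97])
routine = baseline.process(104, supervisor)   # close to history, accepted directly
rejected = baseline.process(250, supervisor)  # flagged, supervisor rejects
approved = baseline.process(120, supervisor)  # flagged, supervisor approves
print(routine, rejected, approved, len(baseline.history))
```

Each approved exception widens what the baseline treats as normal, which is a small-scale version of the continuous learning described above.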
Data quality is a burning issue for most organizations
"By 2022, 60% of organizations will leverage machine-learning-enabled data quality technology to reduce manual tasks for data quality improvement." Gartner
At Magic FinServ, we believe that high-quality data is what drives top and bottom-line growth. Data quality issues disrupt processes and escalate costs, as they call for investment in re-engineering, database processing, customized data scrubbing, and more to get the data in shape.
Organizations certainly wouldn't want that, as they are running short of time already. Since manual testing of data quality is not an option (it is expensive and time-consuming), it is cost-effective and strategically sound to rely on a partner like Magic FinServ with years of expertise.
Ensuring quality data – the Magic FinServ way
Magic FinServ's strategy to ensure high-quality data is centered around its key pillars or capabilities - people, in-depth knowledge of financial services and capital markets, robust partnerships (with the best-in-breed), and a unique knowledge center (in India) for development, implementation, upgrade, testing, and support. Our capabilities go a long way in addressing the key challenges enterprises face today related to their data quality, spiraling data management costs, and cost-effective data governance strategy with a well-defined roadmap for enhancing data quality.
Spinning magic with AI and ML: Magic FinServ has machine-learning-based tools that optimize operational cost by using AI to automate exception management and decision making. We can deliver savings of 30% - 70% in most cases. As a leading digital technology services company for the financial services industry, we bring a rare combination of capital markets domain knowledge and new-age technology skills, enabling leading banks and FinTechs to accelerate their growth.
Cutting costs with cloud management services: We help organizations manage infrastructure costs by offering end-to-end services to migrate (to cloud from enterprise), support, and optimize your cloud environment.
Calling the experts: We can bring in business analysts, product owners, technology architects, data scientists, and process consultants at short notice. Their insight into reference data, including asset classes, entities, benchmarks, corporate actions, and pricing, brings value to the organization. Our consultants are well versed in technology. Apart from traditional programming environments like Java and the Microsoft stack, they are also well versed in data management technologies and databases like MongoDB, Redis Cache, MySQL, Oracle, Prometheus, RocksDB, Postgres, and MS SQL Server.
Partnerships with the best: And last but not least, the strength of our partnerships with the best in the industry gives us an enviable edge. We have not only tied up with multiple reference data providers to optimize costs and ensure quality, but also have partnerships with reputed organizations dealing with complex and intractable environments, multi-domains, covering hundreds of thousands of data sources, to help our clients create a robust data governance strategy and execution plan.
So that is how we contain costs and also ensure that the data quality is top-notch. So why suffer losses due to poor data quality?
Connect with us today by writing to us at firstname.lastname@example.org.