Data-Driven Organizations and Data Quality
Countless organizations have started programs to become more data driven, meaning they aim to use data more efficiently and effectively to improve business processes and decision-making processes. This focus on using data has increased their awareness of data and has made organizations conscious of the importance of high-quality data.
Management of Master Data by IT Specialists
To improve data quality, organizations have started several initiatives. New functions, such as data owners and data stewards, have been introduced, and master data management (MDM) systems have been purchased to help improve the data quality. As the name suggests, master data management systems are developed to help organizations store, organize, manage, and update their master data. These systems enable organizations to store what the correct values should be.
Management of Master Data by Business Users
What IT specialists and software can do is check the data against the business rules and verify whether it complies. Data that does not comply with the business rules is incorrect, but compliant data is not automatically correct. Business users, on the other hand, can look at data and determine that it is incorrect. It is their data and represents their world. A project manager knows that the budget value of 180,000 is not correct, the account manager knows what the correct customer address should be, and the inventory manager knows that the category code of a product is not correct.
Cohelion Master Data Management
Business users must be able to insert, update, map, categorize, and correct master data in such a way that it directly impacts the presented data. This is what the Cohelion Master Data Management product offers. With Cohelion, business users can view and analyze data in reports and dashboards, and when they discover that data is incorrect, they can correct it on the spot; the effect is instantly visible in the dashboard or report, without any delay. Cohelion can offer this because its MDM solution is fully integrated with the reporting and analysis modules. This whitepaper describes the importance of master data to an organization, the importance of allowing business users to change the master data, and the importance of processing those changes in the reports and analyses without delay, and it describes in detail how this works in the Cohelion product.
2. Poor Data Quality Can Harm Organizations
The Costs of Incorrect Data
The story of master data starts with data quality. Studies have shown how much of the data managed by organizations is somehow incorrect. One study showed that “On average, U.S. organizations believe 32 percent of their data is inaccurate. This was a 28 percent increase over last year’s figure of 25 percent.” Incorrect data can harm organizations. The same study found that “This high degree of inaccurate information causes 91 percent of respondents to believe revenue is affected by inaccurate data in terms of wasted resources, lost productivity, or wasted marketing and communications spend.” These percentages are unlikely to be very different for other countries.
Different Forms of Incorrect Data
Data can be incorrect in several ways. Here are some examples:
- Misspelled names: Names of companies, customers, patients, and products can all be spelled incorrectly. Entering names manually is especially likely to lead to typos. For example, patient names can be entered differently within one column of a table, within columns of different tables of one system, and within columns of different systems.
- Incorrect codes: A column may contain different codes to indicate the same value. For example, the codes F, Female, and W may all be used to indicate the female gender. Such incorrect codes make analysis of the data complex and are likely to lead to incorrect analysis results. When certain columns contain many different codes, identifying the wrong ones can be difficult.
- Inconsistent codes within an organization: Different applications may use different sets of codes to represent the same entity. For example, Acme Corp. may have been assigned code “12334” in the finance application, while the ERP system may use the code “BE9-773X” to represent them. Such inconsistencies may also apply to customer codes, supplier codes, factory codes, or product codes.
- Inconsistent codes across organizations: Different organizations may use different codes to represent the same data. For example, one airline may call the cheaper seats economy class and the expensive seats first class, whereas another airline may call them coach class and royal class respectively. They mean the same thing, but they use different words.
- Different versions of the truth: Different applications within an organization may store different versions of the truth. For example, the warehouse management system may register 5 available forklifts, while the asset management system reports 8.
All these and other forms of incorrect data complicate the integration, analysis, and interpretation of data and can lead to misinterpretation. Sometimes incorrect data even makes certain forms of analysis impossible because integration cannot take place. All this can really hurt an organization.
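To illustrate, inconsistencies such as the gender codes above are typically repaired with a small normalization table that maps every known variant to one canonical code. The following Python sketch uses invented code sets purely for illustration:

```python
# Hypothetical code sets: different systems encode gender differently.
GENDER_SYNONYMS = {
    "F": "F", "FEMALE": "F", "W": "F",
    "M": "M", "MALE": "M",
}

def normalize_gender(raw: str) -> str:
    """Map a raw gender code to the canonical code, or reject it."""
    code = GENDER_SYNONYMS.get(raw.strip().upper())
    if code is None:
        # Unknown codes are flagged instead of silently passed through.
        raise ValueError(f"Unknown gender code: {raw!r}")
    return code
```

With such a table, `normalize_gender("Female")` and `normalize_gender("W")` both yield `"F"`, while a code that is not in the table is flagged for a human to review rather than quietly corrupting the analysis.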
3. Sources of the Problems
This chapter describes common reasons why some data is incorrect.
Incomplete Data Entry Checks
Many data entry applications do not check all the entered data properly and completely, resulting in incorrect data being stored in the databases. Evidently, it is impossible to check everything, but it is crucial that such applications are updated to include as many data quality checks as possible, to stem this constant influx of incorrect data.
Inconsistent Applications
Applications exist that check almost all incoming data perfectly, but they may still store inconsistent data. For example, they may use different code sets to indicate products, or use completely different unique keys to identify customers. Applications bought from different vendors, especially, are notorious for using different codes to identify the same data.
Poorly-Trained Data Entry Personnel
Not everyone who enters data is properly trained. These users have learned how to work with the application and what data to enter in which field. However, the importance of entering the data correctly, and the consequences of not doing so, are often not explained. The same applies to these users’ managers and departments: they are not made responsible for the data quality. Whether the data is typed in correctly or not does not affect their jobs.
No Data Ownership Assigned
Ownership of the data stored in many systems has not been assigned yet. As a result, no one feels responsible for data quality. No one will deny that data quality is important, but they will not regard it as their own responsibility. Assigning ownership is the first step for an organization toward raising its data quality level.
External Low-Quality Data
Even if an organization controls all its data, data sources coming from external parties can contain incorrect or inconsistent data.
4. Master Data Management 101
As indicated, initiatives such as digital transformation and becoming data driven have increased awareness of the importance of data and the importance of high-quality data. Organizations have introduced new functions, such as data owners and data stewards, and acquired master data
management systems (MDM) to help with improving the data quality. This chapter provides a brief description of MDM and related topics.
Master Data
Gartner defines master data as follows: “Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise including customers, prospects, citizens, suppliers, sites, hierarchies and chart of accounts.” Master data is different from transaction data. Master data is quite static while transaction data is more dynamic. Master data is relevant to many business domains, while transaction data is relevant to only a few business domains. Examples of master data are customer data, product data, parts lists, and storage locations, and examples of transaction data are inventory levels, orders, invoices, and sensor measurements.
Master Data Management
Master data must be entered, checked, verified, updated, reported, managed, and so on. Gartner defines this task as follows: “Master data management (MDM) is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship,
semantic consistency, and accountability of the enterprise’s official shared master data assets.”
Master Data Management System
As Gartner’s definition indicates, master data management is a technology-enabled discipline. Professional tools are required to support it: the so-called master data management systems. These systems allow organizations to store what the correct values should be. For example, if the finance and sales systems use different addresses for the same customer, the MDM shows what the right address is; if the stock system uses a non-existent product category code, the MDM flags this and offers alternatives; and if two systems use different code sets to identify product categories, the MDM maps the two together.
Master Data Versus Reference Data
There is a subtle difference between master data and reference data. Some see reference data as a form of master data. In a nutshell, master data represents business objects, such as data on customers, products, employees, patients, suppliers, and citizens. Reference data, on the other hand, commonly represents the set of permissible values that can be used. Examples of reference data include the state codes of a country, calendar structures, corporate codes, and product category codes. Reference data tends not to change that often. For some reference codes, international standards exist, such as the ISO 3166-1 standard for country codes, and semi-standards, such as the customer codes from Dun & Bradstreet.
This whitepaper makes no real distinction between master data and reference data, as they both need to be managed in the same way.
5. Master Data and Metadata
Master data is strongly related to metadata. Basically, metadata describes what data means (definitions and descriptions) and what the business rules are that apply to the data. Examples of such business rules are ‘a correct order date must be after the company’s incorporation date’, ‘the birth date of a patient is mandatory’, ‘zip codes must start with four digits and must be followed by a blank and two letters’, and ‘email address values must adhere to the general rules of correct email addresses’.
Some business rules are derived from more general rules and regulations, such as the rule that an email address must conform to the format originally standardized by the Internet Engineering Task Force in the 1980s and updated in RFCs 5322 and 6854. Business rules can also consist of permitted sets of values, such as a list with all the correct country codes. This is where the line between metadata and reference data starts to blur.
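Business rules like those above can be expressed as simple, automatable checks. The following Python sketch is illustrative only; the field names, the incorporation date, and the simplified email pattern are assumptions (real RFC 5322 validation is considerably stricter):

```python
import re
from datetime import date

# Dutch-style zip code rule from the text: four digits, a blank, two letters.
ZIP_RE = re.compile(r"^\d{4} [A-Z]{2}$")
# Deliberately simplified email pattern; RFC 5322 is far more permissive.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

INCORPORATION_DATE = date(1995, 3, 1)  # hypothetical incorporation date

def validate(record: dict) -> list[str]:
    """Return a list of business-rule violations for one record."""
    errors = []
    if record.get("birth_date") is None:
        errors.append("birth date is mandatory")
    if not ZIP_RE.match(record.get("zip", "")):
        errors.append("zip code must be four digits, a blank, and two letters")
    if not EMAIL_RE.match(record.get("email", "")):
        errors.append("invalid email address")
    if record.get("order_date") and record["order_date"] <= INCORPORATION_DATE:
        errors.append("order date must be after the incorporation date")
    return errors
```

A record that satisfies every rule yields an empty list; each violated rule adds one entry, so the result doubles as a data quality report for that record.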
Metadata can also be an indication of whether some data is personally identifiable, or an indication of the data’s security level. The timestamps of creation and modification of any object, including who created, modified, or deleted it, are considered metadata as well.
6. Why is Master Data Important?
As indicated, incorrect data can harm an organization, but getting all the master data in order and managing and using it properly can be a major undertaking that demands a serious investment of time and money. Therefore, this chapter describes some of the key business reasons why investing in MDM is valuable.
Improved Data Integration
For many reports, dashboards, and data science exercises, data stored in multiple internal (and sometimes external) systems needs to be extracted and integrated. If these systems contain inconsistent and incorrect data, the integration solution is complex, and the result may not be perfect. Without MDM, the integration effort may lead to a myriad of independent integration solutions, one for each dashboard and report. In other words, besides having to deal with the complexity of integrating the incorrect and inconsistent data, it also leads to reinventing the wheel over and over. This leads to high development and maintenance costs. With MDM, the integration problem can be solved once, and all the reports, dashboards, and other applications can exploit this single MDM-driven solution.
Create New Business Insights
If data is incorrect and above all inconsistent, certain reports and forms of analysis simply cannot be created, because the required data cannot be integrated. This can severely limit organizations in creating insights into certain business processes and can lead to missed opportunities.
MDM can help make that data fit for integration and thus help create the required business insights.
Improved Analysis Results
The quality of data is directly related to the correctness of reports and dashboards: the lower the data quality, the less reliable the reporting results. Incorrect reporting results can lead to incorrect management decisions or incorrect management of operational processes. Master data management can help improve the data quality and thereby indirectly improve the quality of management decisions and the correctness of operational processes.
Improved Data Compliance
For compliance reasons, every organization must periodically send reports to various regulators and external organizations. These reports must comply with regulations and standards and must be auditable and governable. It is of the utmost importance that these reports are correct. The data on which they are developed must be correct. MDM helps to produce correct data to develop reports that comply with all the rules.
7. The Cohelion Master Data Management Solution
This chapter uses examples to describe how Cohelion manages master data, how business users can enter and manage master data themselves, and how these changes are processed instantly, without interrupting the business users.
7.1 Overall Architecture of Cohelion
The whitepaper entitled ‘Explaining the Cohelion Data Platform’ describes the overall architecture of the Cohelion product. A high-level overview of the platform’s architecture is presented in Figure 1. As can be seen in this figure, master data is a key module of the platform. Note that the master data management module can be used both fully integrated with the rest of the platform and as a stand-alone module that can work together with other ETL or data warehouse tools.
Figure 1 – The high-level architecture of the Cohelion Data Platform.
The referenced whitepaper describes the following features that make the product stand out:
- The Cohelion Data Platform is a master data-driven BI environment. Master data is used to integrate and improve the quality of data. The multi-domain approach for managing master data enables storage and management of any type of business object.
- The Cohelion Data Platform is an integrated product that includes data transformations, metadata, master data, and a Web-based user interface for data analysis, and it supports several optional business solutions, such as forecasting and budgeting, contracting, and customer profitability.
- Although the business solutions are optional, when used, they are fully integrated with the data platform.
- The product covers the entire path that data travels from source systems to data consumers.
7.2 Defining Domains and Hierarchical Levels
Cohelion Master Data Management can be used by organizations to store and manage different types of master data, such as products, customers, locations, and assets, which are called domains. This makes Cohelion MDM a multi-domain solution. Within each domain, data can be organized hierarchically by defining levels. For example, the domain Organization structure can consist of the levels Office, Country, Region, and Legal Entity. Figure 2 shows that the level Offices is part of Organization structure.
Figure 2 – This screenshot shows the Offices elements and their attributes.
7.3 Defining Hierarchies for Entities
Entities can be organized hierarchically. For example, a marketing specialist of a retail company may want to classify customers by location, where the location hierarchy consists of sector, city, and country. Other users may want to group customers differently. The finance department of the same retail company may prefer to classify customers based on financial aspects, such as total customer sales and customer returns. Or one analyst may want to categorize products by whether they are liquid or fixed, and another by size.
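Such alternative hierarchies are, in essence, different groupings over the same entities. The following Python sketch illustrates the idea with an invented location hierarchy; the sector and city names are hypothetical:

```python
# Hypothetical location hierarchy: sector -> (city, country).
HIERARCHY = {
    "Sector 7": ("Amsterdam", "Netherlands"),
    "Sector 9": ("Rotterdam", "Netherlands"),
    "Midtown":  ("New York", "United States"),
}

def rollup(customers, level):
    """Count customers per hierarchy level: 'sector', 'city', or 'country'."""
    counts = {}
    for _name, sector in customers:
        city, country = HIERARCHY[sector]
        key = {"sector": sector, "city": city, "country": country}[level]
        counts[key] = counts.get(key, 0) + 1
    return counts
```

The same customer list can now be rolled up by sector, city, or country without touching the underlying records; adding a second, finance-oriented hierarchy would simply mean adding another lookup table.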
Figure 3 – This screenshot shows the hierarchical structure of the entity Offices.
The horizontal bars to the right of Figure 3 indicate how many elements there are of each entity, so there are clearly more different Office types than Legal entities. Hovering the mouse over a bar displays more detailed information. The numbers in the orange circles indicate how many elements still need to be edited. This is typical for Cohelion: it regards master data not as something that is ever finished, but as a continuous work in progress. Therefore, it should always be easy for users to see what remains to be done.
7.4 Mapping Master Data
Cohelion can import data from source systems and can export master data to target systems. The heart of Cohelion is formed by mappings that indicate how master data relates to data in source or target systems. Figure 4 shows an example of a mapping for Hour types belonging to the human resource (HR) domain. The master data is shown on the left and the data from a source system called Kronos Time and Attendance on the right. In this example, the hour type Planned Absence is mapped to two Hour types in the source system, which are Annual Vacation and Jury Duty, and Unplanned Absence is mapped to Pregnancy and Sick Leave. This means that when data is loaded from the Kronos system, each record
where the label equals Annual Vacation or Jury Duty is mapped to the master data record Planned Absence. The screenshot also shows that this source system does not have data on the hour type 3rd party hours.
Figure 4 – Screenshot showing how master data is mapped to source data.
Evidently, for each source system that contains this type of data, a mapping must be defined. So, master data can be mapped to several source systems. In this example, this master data is also mapped to two other systems, called Attendance Enterprise and Bamboo HR. Mapping all entities from each source application to a master list is much more efficient than creating point-to-point mappings. In this example, point-to-point mappings would require mapping Kronos to Bamboo, to Attendance Enterprise, and to Cohelion; Bamboo to Attendance Enterprise and to Cohelion; and Attendance Enterprise to Cohelion. This knowledge about mapping data from different systems to master data enables data to be integrated from several systems and exported to other systems, such as data warehouses or data lakes. Data can also be exported to applications, allowing them to read in the correct data after it has been matched with the master data.
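The efficiency gain of the master-list approach is easy to quantify: with a central master list, each system needs only one mapping to the hub, while point-to-point requires a mapping between every pair of systems. A small Python sketch of the arithmetic:

```python
def point_to_point(n: int) -> int:
    """Mappings needed when every pair of n systems is mapped directly."""
    return n * (n - 1) // 2

def hub_and_spoke(n: int) -> int:
    """Mappings needed when each system maps only to the central master list."""
    return n - 1  # the hub itself needs no mapping

# The example above involves four parties: Kronos, Bamboo HR,
# Attendance Enterprise, and Cohelion acting as the hub.
print(point_to_point(4))  # 6 point-to-point mappings
print(hub_and_spoke(4))   # 3 mappings to the master list
```

The gap widens quickly: with ten systems, point-to-point needs 45 mappings, while the hub approach needs only 9.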
New data can be imported from all kinds of source systems and sources. After the data has been loaded, it must be mapped to the existing master data. For example, in Figure 5, Accounts data must be loaded from an application called Tagetik. After the data is exported to a file, Cohelion loads it in. Assuming that Accounts master data is already available, Cohelion reads the new data and maps it with what is already there. The result is shown in Figure 5. The left-hand column contains all the existing Accounts and the right-hand column the newly loaded data. The account called Revenue Accounts is known in the source system, but not in the master data. Someone must indicate if this is a new account or if this is an alternative name for an already defined account. Assigning a source record to a master data record is merely a matter of dragging and dropping. For example, if the row for this new account is dragged on top of Other income, Cohelion will know that this account belongs to Other income in this particular source system. The result of this exercise is a mapping of the source data to the managed master data. If no master data exists, the source data is loaded as master data. Users can then study the data and turn it into proper master data.
Figure 5 – Screenshot showing a mapping between master data and a source system.
In these examples of Hour types and Finance accounts, the lists of elements are small. Other domains, such as customers or suppliers, can have thousands and maybe even millions of elements. In that case, it is impractical to scroll through the entire set to find the data that has yet to be mapped or studied. Therefore, filters can be specified to look at specific subgroups. This makes it easier to focus on a selection of the data rather than the entire set.
Note that mapping inconsistent and incorrect data to the correct data does not change the data in source systems. Cohelion indicates how incorrect codes need to be mapped to the correct data. When data needs to be exported to a target system, the incorrect and inconsistent data is transformed into the correct data.
7.5 Schema Definition or Canonical Data Model
For each level in any domain, a schema can be defined consisting of attributes that form a canonical data model. Figure 6, for example, shows a simple schema that is defined for the Customer level. Each Customer entry can be enriched with all these attributes.
Figure 6 – Constructing a canonical data model.
Each attribute can be assigned characteristics. For example, in Figure 7, the characteristics Classification type and Display format are specified for the Building attribute. For attributes with specific formats, such as Dutch zip codes, credit card numbers, and e-mail addresses, regular expressions can be defined that Cohelion uses to check whether loaded data conforms to the formatting rules. The advantage of being able to define characteristics for attributes is that each organization can define their own unique set of characteristics. For example, a traditional retail company may need to define different characteristics for the products they sell than an online retail website selling books only. For the former, characteristics such as size, weight, and ingredients may be important, while they are useless for books. For book sellers, characteristics such as author, publisher, and publication date are more relevant.
Figure 7 – Screenshot showing the characteristics of the Building attribute.
7.6 The Golden Record
Commonly, there is no single application that can best supply all attribute values; in practice, the best data is scattered over many applications. Therefore, Cohelion allows a preferred source to be assigned to each attribute. For example, Figure 8 shows which attribute should come from which source. A specific attribute may come from several source systems, in which case the number of stars indicates which system has the highest priority. If the data in that source system is missing, Cohelion takes the data from the source system with the next highest priority. The same figure shows that the Phone number of the Customer Finance department can come from Cohelion, Salesforce, SAP, or Tagetik. The same applies to the repair statistics, which can be supplied by Bamboo or SAP. This means that when Cohelion exports data for a specific entity, a record is constructed with data from all these source systems. Such a record is called the golden record in Cohelion.
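The golden-record construction described above amounts to a per-attribute priority lookup. A minimal Python sketch, with attribute names and source priorities invented for illustration:

```python
# Hypothetical source priorities per attribute (first = most preferred),
# mirroring the star rankings described in the text.
PRIORITIES = {
    "phone_number": ["Cohelion", "Salesforce", "SAP", "Tagetik"],
    "repair_stats": ["Bamboo", "SAP"],
}

def golden_record(entity_values: dict) -> dict:
    """Per attribute, take the value from the highest-priority
    source that actually has one."""
    record = {}
    for attr, sources in PRIORITIES.items():
        for source in sources:
            value = entity_values.get(source, {}).get(attr)
            if value is not None:
                record[attr] = value
                break  # stop at the first (highest-priority) hit
    return record
```

If the most preferred source lacks a value, the next source in the list fills the gap, so the exported record combines the best available data from all systems.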
Figure 8 – Screenshot showing how golden records for Offices are constructed.
7.7 Instant Impact of Changes on Reports
The screenshots in the previous sections show that Cohelion’s user interface is designed to be used by non-tech-savvy users. This section describes how Cohelion enables master data management for and by business users. It shows how changes to the master data made by business users become available almost instantly in the reports. Figure 9 shows the starting situation. This report, developed with Cohelion, shows monthly Total worked hours for three departments. Suppose the business user notices that something is incorrect, namely that the values for the month of May are too low. The problem is that some Repair shops are incorrectly registered as Manufacturing. As a result, the hours reported in the Manufacturing Business Line are too high.
Figure 9 – Screenshot of the monthly worked hours for business line repairs.
In such a situation, the user switches to the master data and re-assigns the correct Business Line value MRO to the incorrectly configured Repair shop; see Figure 10. Next, he refreshes the report to activate the correction; see Figure 11. There is almost no delay between the moment of the master data change and the moment the change takes effect. This is different from many systems in which a change cannot be made by the users themselves, but where an IT specialist or data steward must be asked to implement it. Even then, it can take hours (sometimes even longer) before the change is visible in the report. With Cohelion, master data changes become effective instantly.
Figure 10 – Screenshot of the change to the master data.
Figure 11 – Screenshot of the monthly worked hours for business line repairs after the master data was changed.
7.8 Monitoring Changes to and Securing Master Data
Data entered can be changed, augmented, deleted, and so on. Cohelion keeps track of all such changes. The example in Figure 12 shows that the Phone number of the Master Customer labeled Axe Capital was changed six days ago by a specific user. This type of logging data can be important to security experts, data quality experts, and auditors. Not all business users and developers are allowed to view and change all the master data. In Cohelion, permissions can be assigned to business users and developers indicating what they are allowed to do with the master data. These permissions can be defined in a fine-grained manner on domains, applications, attributes, and hierarchies.
Figure 12 – Screenshot showing that all changes are logged.
As indicated, Cohelion regards master data as a work in progress. New domains, entities, hierarchies, and attributes are added, deleted, or
changed on a regular basis. Therefore, Cohelion can show in several ways what still needs to be done. A special report is supported that shows the overall completeness status of a domain and all its entities; see Figure 13.
Figure 13 – Screenshot showing the status of domains and entities.
7.9 Accessing Data Through an API
All the master data stored in Cohelion MDM can be managed with the tool itself. Additionally, an API (Application Programming Interface) is available with which external applications can update or extract all data and mappings. The Cohelion API supports a pull mechanism that allows external applications to initiate the data exchange. Additionally, applications can subscribe to changes in any domain. If a mapping, value, or attribute is changed, a notification is sent to the subscribed applications so that they can take immediate action. These APIs make the curated master data instantly available within the entire application landscape.
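The subscribe-and-notify pattern can be modeled in a few lines. The sketch below is an in-memory illustration only; a real integration would register a webhook URL with the Cohelion API rather than a Python callback:

```python
# A minimal in-memory model of the subscribe/notify pattern described above.
class ChangeNotifier:
    def __init__(self):
        self._subscribers = {}  # domain -> list of callbacks

    def subscribe(self, domain, callback):
        """Register interest in changes within one domain."""
        self._subscribers.setdefault(domain, []).append(callback)

    def publish(self, domain, change):
        """Called when a mapping, value, or attribute changes in a domain;
        every subscriber to that domain is notified immediately."""
        for callback in self._subscribers.get(domain, []):
            callback(change)
```

An application subscribed to the Customers domain is notified the moment a customer attribute changes, while changes in other domains pass it by; this is what allows downstream systems to act on master data corrections without polling.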
8. Closing Remarks
More and more organizations are starting to realize the business value of data. Having the right data at the right time can help improve the efficiency and effectiveness of business processes and decision-making processes. Unfortunately, not all the data used by organizations is correct. Data can be inconsistent, incorrect, or missing. The quality of data influences the quality of decision making. Organizations need data governance initiatives to manage their data and to raise the data quality to the right level. An important instrument for data governance that assists with raising the data quality level is master data management.
Many current MDM systems are developed to allow IT specialists to manage all the master data. But IT specialists do not understand the data on the individual level; they understand it on the metadata or model level. They don’t know the individual customers, products, or patients. They won’t see that the address of a specific customer is incorrect or that the category of a specific product is incorrect. They understand there is a problem if the data does not meet predefined business rules. Business users do understand the data on the individual level. In most MDM systems, if business users discover that data is incorrect, they cannot correct it. They must inform those responsible for correcting the data. But even if the proper correction is made, it can take hours, and in some organizations days, before that correction is visible in the user’s reports and dashboards. Because of this delay, business users are almost forced to copy-paste the data from the dashboard to a spreadsheet, correct it there, and then develop the reports based on the corrected data in the spreadsheet. This is far from ideal in all kinds of ways.
About the Author
Rick van der Lans is a highly respected independent analyst, consultant, author, and internationally acclaimed lecturer specializing in data architecture, data warehousing, business intelligence, big data, database technology, and data virtualization. He works for R20/Consultancy (www.r20.nl), which he founded in 1987. In 2018, he was selected the sixth most influential BI analyst worldwide by onalytica.com. He has presented countless seminars, webinars, and keynotes at industry-leading conferences.
Rick helps clients worldwide to design their data warehouse, big data, and business intelligence architectures and solutions and assists them with selecting the right products. He has been influential in introducing the new logical data warehouse architecture worldwide which helps organizations to develop more agile business intelligence systems. He introduced the business intelligence architecture called the Data Delivery Platform in 2009 in several articles, all published at B-eye-Network.com.
He is the author of several books on computing, including his new Data Virtualization: Selected Writings and Data Virtualization for Business Intelligence Systems. Some of these books are available in different languages. Books such as the popular Introduction to SQL are available in English, Dutch, Italian, Chinese and German and are sold worldwide. Over the years, he has authored hundreds of articles and blogs for newspapers and websites and has authored many educational and popular white papers for a long list of vendors. He was the author of the first available book on SQL, entitled Introduction to SQL, which has been translated into several languages with more than 100,000 copies sold. For more information, please visit www.r20.nl, or send an email to firstname.lastname@example.org. You can also get in touch with him via LinkedIn and Twitter (@Rick_vanderlans).
Ambassador of Axians Business Analytics Laren: This consultancy company specializes in business intelligence, data management, big data, data warehousing, data virtualization and analytics. In this part- time role, Rick works closely together with the consultants in many projects. Their joint experiences and insights are shared in seminars, webinars, blogs, and whitepapers.
Cohelion is a one-stop shop for fast-growing organizations that have data integration challenges. With a unique service-oriented approach, the company takes the burden of data integration away from its customers, who in return get a user-friendly data platform with sophisticated data quality controls and built-in business solutions such as Forecasting and Budgeting, Contracting and SLA management, and Customer profitability. Check out the case studies and request a demo at www.cohelion.com.