'Garbage in leads to garbage out' is no excuse
Data warehouses are built from data supplied by many data sources: applications, other processes, or users that deliver the data. A typical data warehouse consolidates this data through a series of transformation steps. These transformation steps, however, assume that the supplied data is correct; hence the saying "garbage in leads to garbage out".
For our customers this was just an excuse, and not an acceptable one. Mistakes are unavoidable in real life. They may be exceptions, but in a large enterprise all those exceptions add up to a lot. Common mistakes include duplicated data, gaps in data, a comma incorrectly used instead of a dot as decimal separator, data supplied in the wrong currency, or quantities supplied in pounds instead of tons or kilograms.
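To make this concrete: two of the mistakes above, duplicated rows and comma decimal separators, can be caught with very simple validation rules before any transformation runs. The sketch below is purely illustrative and not Cohelion's actual implementation; the data and function names are hypothetical.

```python
# Illustrative sketch (not Cohelion's actual checks): catch two common
# supplier mistakes -- duplicated rows and comma decimal separators.

def find_duplicates(rows):
    """Return rows that appear more than once in the supplied batch."""
    seen, dupes = set(), []
    for row in rows:
        key = tuple(row)
        if key in seen:
            dupes.append(row)
        seen.add(key)
    return dupes

def uses_comma_separator(value: str) -> bool:
    """Flag numeric strings like '1,5' where a dot was expected."""
    return "," in value and value.replace(",", "").isdigit()

# Hypothetical batch: (period, value) pairs from a supplier.
batch = [["2024-01", "1,5"], ["2024-02", "2.1"], ["2024-01", "1,5"]]
print(find_duplicates(batch))                              # the repeated row
print([v for _, v in batch if uses_comma_separator(v)])    # ['1,5', '1,5']
```

Checks like these are cheap to run on every delivery, which is why they can sit in front of the warehouse rather than inside the transformation logic.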
The Cohelion anomaly detection framework is designed to proactively detect these exceptions and prevent them from entering the data warehouse. It does so by continuously running checks on all new data. These checks can be as simple as detecting gaps where there is usually data, or comparing newly supplied data against earlier forecasts and budgets. The most sophisticated checks validate against trends detected in historical data, using machine learning algorithms that identify outliers in the datasets.
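As a minimal sketch of what a trend-based check can look like (Cohelion's actual algorithms are not described here, so this is an assumption for illustration): flag a newly supplied value as a potential outlier when it lies more than a few standard deviations away from the historical series.

```python
# Hypothetical trend check, not Cohelion's implementation: a new value is
# suspicious when it deviates more than `threshold` standard deviations
# from the mean of its own history.
from statistics import mean, stdev

def is_outlier(history: list[float], new_value: float,
               threshold: float = 3.0) -> bool:
    """True when new_value deviates more than `threshold` sigmas from history."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold

history = [100.0, 98.0, 103.0, 101.0, 99.0]   # e.g. monthly tonnage
print(is_outlier(history, 102.0))      # normal variation -> False
print(is_outlier(history, 100_000.0))  # e.g. pounds instead of tons -> True
```

A check this simple already catches unit mix-ups like the pounds-versus-tons example, because such mistakes shift values by orders of magnitude rather than by normal month-to-month variation.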
Any detected irregularity is flagged for review by the user responsible for that dataset. By marking suspicious data as valid or invalid, the user is effectively training the algorithm on what is acceptable. Another unique Cohelion feature then lets the user correct any invalid data easily, directly within the platform.
This decentralized approach to monitoring, confirming, and correcting data means that even large organizations can maintain high data quality at a fine-grained level.
With this new feature we help improve our customers' data quality, making sure no garbage goes into the platform.