The Big [Data] Idea:
Big Data is best utilized when you have rich analytics to support your decisions, but this is only possible when the data is of high quality. However, 67% of U.S. executives say they are not comfortable using data from analytics because they lack confidence in data quality. This article will take you through a data quality management framework to help you manage data quality and gain confidence in your analytics.
Why it is Important:
Data that cannot support your decisions is virtually useless. In the era of Big Data, bad data has real financial consequences: it costs U.S. companies $3.1 trillion yearly, and globally the average organization loses $9.7 million every year. Some researchers suggest that employees today spend 50% of their time hunting for the right data and 60% cleaning it. If data quality is ensured at the source or in storage, these costs can be avoided.
What’s ahead:
This article will help you understand data quality and ways to manage it: what quality means in data, a Big Data quality management framework spanning the data life cycle, validation, and a step-by-step quality management process, and who is responsible for data quality.
Understanding Quality in Data
High quality data is the key to driving business intelligence. But with the dynamic nature of data, it’s important to consider all its dimensions.
Data quality has six core dimensions: timeliness, completeness, uniqueness, validity, consistency, and accuracy.
Discrepancies in any of these dimensions can affect the quality of your data. You must also consider the industry in which the data is used, as each dimension represents different use cases in different settings. Here are a few examples of data quality dimensions as they relate to healthcare and banking:
| Quality dimension | What it represents | Example |
| --- | --- | --- |
| Timeliness | How quickly captured data becomes available | Healthcare: the delay between the time a patient's vitals are captured and the time they are visible in your EMR (Electronic Medical Records) system. Banking: a customer has withdrawn cash from an ATM; is it reflected in their account immediately? |
| Completeness | Whether your database holds all the expected information | Healthcare: the number of births recorded in your system vs. the total number of babies delivered in the hospital. Banking: a new account has been opened; does your portal show all the captured details when the customer logs in? |
| Uniqueness | Whether any records are captured more than once | Healthcare: duplicate entries for the same patient in your database. Banking: a customer made a single online transfer to another account; does your automated bank feed record it twice? |
| Validity | Whether your data conforms to the required syntax | Healthcare: the severity of a hearing loss is selected from the list of allowed values. Banking: does your bank statement follow the specific name-middle name-surname format for the account holder's identity? |
| Consistency | Whether different representations of the same information agree | Healthcare: a patient's test result matches between your internal records and the lab's records. Banking: do the customer details on a loan account match those on the savings account? |
| Accuracy | Whether the data describes the real-world object correctly | Healthcare: if a date is entered with the day first rather than in the prescribed MM/DD/YYYY format, the patient record will be wrong. Banking: if your customer is a working professional, does your customer database contain the correct occupation details? |
Big Data Quality Management Framework
The speed at which data flows is truly mind-boggling, so an effective governance framework is required, especially in heavily regulated industries such as banking. The framework ensures that data quality is maintained across the flow of big data, from discovery through definition, design, deployment, and monitoring. Your framework must contain several key components: the Big Data life cycle, a data validation framework, a data quality management process, and clear responsibility for data quality.
Let’s dive into each of these.
Big Data Life Cycle
Big data flows through a series of five stages before it reaches the end user. To avoid compounding data issues, close attention must be paid at each stage. Here's what you need to know about each one:
Stage 1: Data generation
Whether we search for something online, click a link, or send a message, every action generates information. When collected, you'll possess a mix of structured, semi-structured, and unstructured data. At this stage, you might not be able to do much with the unstructured data, but while collecting data in a structured format, you can take measures at the point of capture, such as enforcing entry formats and allowed values, to ensure quality.
Stage 2: Data collection
Your data will likely come from several different sources, so organizing and keeping track of those sources is the main quality task at this stage.
Stage 3: Data processing
Processing a large volume of data is a huge undertaking. At this stage, you format, clean, and compress it; a short sketch of what that can look like follows below.
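As a minimal sketch of this stage, assuming tabular data in a pandas DataFrame (the column names and file name here are hypothetical), formatting, cleaning, and compressing might look like this:

```python
import pandas as pd

# Hypothetical raw extract showing the defects this stage must handle:
# a duplicate row, an invalid date, and a non-numeric measurement
raw = pd.DataFrame({
    "patient_id": ["P001", "P002", "P002", "P003"],
    "visit_date": ["01/15/2024", "01/16/2024", "01/16/2024", "13/45/2024"],
    "weight_kg":  ["70", "68.5", "68.5", "not recorded"],
})

cleaned = (
    raw
    .drop_duplicates()  # remove exact duplicate rows
    .assign(
        # format dates per the prescribed MM/DD/YYYY; bad values become NaT
        visit_date=lambda df: pd.to_datetime(
            df["visit_date"], format="%m/%d/%Y", errors="coerce"
        ),
        # coerce numeric fields; unparseable entries become NaN
        weight_kg=lambda df: pd.to_numeric(df["weight_kg"], errors="coerce"),
    )
    .dropna(subset=["visit_date"])  # drop rows missing a required field
)

# Compress on write (Parquet with snappy compression; requires pyarrow)
cleaned.to_parquet("visits.parquet", compression="snappy")
```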
Stage 4: Data storage
Once your data gets to this stage, it will be published to all users, so any quality issues you find in the data must be resolved by now. When you are storing data that was entered manually, validation checks by data managers are helpful; when data is captured automatically, validation rules built into the capture process handle much of that work for you.
Stage 5: Data Analytics & Visualization
Once you reach the stage of analytics, you will begin to see the results of your quality management efforts. Minor errors that slipped through the earlier stages tend to be exposed visually, so use checks to discover remaining quality issues and flag them to your team so they can make corrections.
Data Validation Framework
Big Data programming languages and libraries provide built-in data validation options that can support you in checking data against quality dimensions like accuracy, completeness, timeliness, consistency, and uniqueness. Standard validation options are listed below, followed by a short code sketch:
| Validation | What it checks |
| --- | --- |
| Type validation | Fields match their expected type, e.g., age must be entered as a whole number and dates must follow the MM/DD/YYYY format |
| Value validation | What is required, allowed, and forbidden in data fields |
| Key validation | Data features your visualizer cannot support, such as an upper-case font, are checked for |
| Range validation | Fields like age, salary, and duration fall within a defined, valid range |
| Consistency check | Data is logically coherent; for example, a shipping date cannot come before the order date |
| Uniqueness check | A data point that must not repeat does not appear in more than one row of your data file |
| Code check | Entries are selected from an allowed list, such as a list of countries or valid postal codes |
| Lineage validation | Test and training databases are compared to ensure that all data generated at the sources reaches storage as required |
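As a hedged illustration of how several of these checks can be expressed in plain Python (the field names, allowed-value list, and range below are hypothetical, and the sample record deliberately combines the healthcare and retail examples from the table):

```python
from datetime import date

# Code check list: allowed severities for hearing loss (hypothetical values)
ALLOWED_SEVERITIES = {"mild", "moderate", "severe", "profound"}

def validate_record(record: dict, seen_ids: set) -> list[str]:
    """Return human-readable validation errors for one record."""
    errors = []

    # Type validation: age must be a whole number...
    if not isinstance(record.get("age"), int):
        errors.append("age must be a whole number")
    # ...and range validation: it must fall in a plausible range
    elif not 0 <= record["age"] <= 120:
        errors.append("age outside the valid range 0-120")

    # Code check: value must come from the allowed list
    if record.get("severity") not in ALLOWED_SEVERITIES:
        errors.append(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")

    # Consistency check: a shipping date cannot precede the order date
    if record.get("ship_date") and record.get("order_date"):
        if record["ship_date"] < record["order_date"]:
            errors.append("ship_date is before order_date")

    # Uniqueness check: the record key must not repeat across rows
    if record.get("id") in seen_ids:
        errors.append(f"duplicate record id {record['id']}")
    seen_ids.add(record.get("id"))

    return errors

seen: set = set()
record = {"id": "R001", "age": 34, "severity": "mild",
          "order_date": date(2024, 1, 10), "ship_date": date(2024, 1, 12)}
print(validate_record(record, seen))  # [] -> the record passes every check
```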
Data Quality Management Process
Data quality can be managed by following this 5-step process:
1. Identification and measurement of data quality: What is not measured cannot be confirmed, so your first step should be to define quality on your own terms and identify metrics to measure data quality against the six dimensions described earlier, such as the percentage of complete records or the rate of duplicate entries; a sketch of computing two such metrics appears below.
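A minimal sketch of measuring two of these metrics, reusing the hypothetical dataset from the processing-stage example (file and column names remain assumptions):

```python
import pandas as pd

df = pd.read_parquet("visits.parquet")  # cleaned in the processing-stage sketch

# Completeness: average share of required fields that are populated
required = ["patient_id", "visit_date", "weight_kg"]
completeness_pct = df[required].notna().mean().mean() * 100

# Uniqueness: share of rows that do not duplicate an earlier row
uniqueness_pct = (1 - df.duplicated().mean()) * 100

print(f"completeness: {completeness_pct:.1f}%  uniqueness: {uniqueness_pct:.1f}%")
```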
2. Define rules and targets for data performance: Set policies for data quality by assigning rules and targets to each quality dimension. For each activity or process, you might define a different focus area for quality checks. For instance, during data migration from legacy systems to the cloud, your focus may be on completeness. For every process, the focus areas can differ, and for each focus area you can identify specific data quality issues. Some examples of these rules include:
| Quality dimension | Rule |
| --- | --- |
| Uniqueness | No entity may be recorded more than once in the database, so new entries must be checked for duplication. Where a record already exists, new information may only be merged into the existing record. |
| Accuracy | Registration status must have a value consistent with the regional accreditation board's reference data; if it does not, the board is approached for verification. |
| Completeness | Constraints can be placed on which fields are mandatory and which are optional before a form is accepted by the system. |
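One way to make such rules operational (a sketch; the rule texts follow the table above, but the target thresholds are illustrative assumptions) is to pair each rule with a measurable target that later monitoring can compare against:

```python
# Each rule pairs a quality dimension with a check and a measurable target
QUALITY_RULES = [
    {"dimension": "uniqueness",   "rule": "no entity recorded more than once",           "target_pct": 100.0},
    {"dimension": "accuracy",     "rule": "registration status matches board reference", "target_pct": 99.0},
    {"dimension": "completeness", "rule": "mandatory form fields populated",             "target_pct": 98.0},
]

def meets_target(measured_pct: float, rule: dict) -> bool:
    """True if the measured quality level satisfies the rule's target."""
    return measured_pct >= rule["target_pct"]

print(meets_target(99.1, QUALITY_RULES[2]))  # True: 98% completeness target met
```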
3. Design of quality improvement processes: Quality improvement cannot be guaranteed unless you create a culture that respects data quality. Create local structures and define interventions to manage quality issues. Several measures can be taken to improve the quality of your data, as listed below:
| Quality improvement measure | What it does |
| --- | --- |
| Profiling | Gathers statistics on the data for quality assessment |
| Matching | Merges or integrates related data records |
| Parsing and standardization | Breaks data into components and recombines them in a specific format |
| Normalization | Eliminates redundancy and performs feature scaling |
| Cleaning | Removes duplicate and incorrect data; modifies values to meet standards and rules |
| Metadata management | Centralizes management of semantic metadata for organizational consistency |
| Data quality dashboard | Shows the measured quality metrics so actual quality can be compared with targets |
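Profiling, the first measure in the table, can be as simple as gathering per-column statistics; here is a minimal pandas sketch, reusing the hypothetical dataset from the earlier examples:

```python
import pandas as pd

df = pd.read_parquet("visits.parquet")  # hypothetical dataset from earlier sketches

# One row of statistics per column: the raw material for quality assessment
profile = pd.DataFrame({
    "dtype":    df.dtypes.astype(str),   # input for type validation
    "null_pct": df.isna().mean() * 100,  # feeds the completeness metric
    "distinct": df.nunique(),            # flags candidate key / code-list fields
})
print(profile)
```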
4. Implementation of quality improvement methods: Create a proper plan not just for making rules but also for monitoring data quality, identifying opportunities for improvement, and implementing improvement measures. Share this plan with your whole team to keep them aware of how important data quality is, and communicate any successes to encourage the team to keep working positively.
5. Monitoring of data quality: You can create reporting systems, monitor performance in the field, or run real-time quality checks on data. To build a reporting system, assign responsibility for checking data to supervisors; you can also create a team of data quality checkers to ensure the quality dimensions are well taken care of. Real-time data quality checks can also be automated with dedicated tools; a minimal sketch of such a recurring check follows below.
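As a hedged sketch of a recurring monitoring job (the measured values are stand-ins; in practice measure() would wire in the metric computations shown earlier):

```python
# Targets taken from the earlier rules sketch (illustrative values)
QUALITY_RULES = [
    {"dimension": "uniqueness",   "target_pct": 100.0},
    {"dimension": "accuracy",     "target_pct": 99.0},
    {"dimension": "completeness", "target_pct": 98.0},
]

def measure(dimension: str) -> float:
    """Placeholder: wire in the real metric computations shown earlier."""
    return {"uniqueness": 100.0, "accuracy": 98.2, "completeness": 99.1}[dimension]

def run_quality_checks() -> None:
    """Compare each measured metric to its target and alert on shortfalls."""
    for rule in QUALITY_RULES:
        measured = measure(rule["dimension"])
        if measured < rule["target_pct"]:
            print(f"ALERT: {rule['dimension']} at {measured:.1f}% "
                  f"(target {rule['target_pct']:.1f}%)")

run_quality_checks()  # schedule hourly/daily via cron or an orchestrator
```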
Responsibility for Data Quality
From the bottom all the way to the top, many individuals across an enterprise encounter data. Therefore, they all influence data quality. It is essential to educate all knowledge workers on the importance of data quality and encourage them to follow the rules required to maintain it.
Data Quality in the Age of Big Data
Quality data in Big Data streams is essential to achieving positive business outcomes, but the path there isn't always easy. Understanding the Big Data life cycle and committing to a data quality management process will give you a roadmap. Additionally, we've designed a QE self-assessment test that you can use to understand whether your quality engineering practices are aligned to deal with the quality challenges of Big Data. If they are not, our team is here to assist you in your digital transformation.
For more on data quality, quality engineering, and machine learning services, check out Apexon's data services and quality engineering practice, or get in touch directly using the form below.