The Big [Data] Idea:
Big Data is best utilized when you have rich analytics to support your decisions, but this is only possible when the data is of high quality. However, 67% of U.S. executives say they are not comfortable using data from analytics because they lack confidence in data quality. This article will take you through a data quality management framework to help you manage data quality and gain confidence in your analytics.
Why it is Important:
Data that cannot support your decisions is virtually useless. In the era of Big Data, bad data has real financial consequences: it costs U.S. companies $3.1 trillion yearly, and globally the average organization loses $9.7 million every year. Some researchers suggest that employees today spend 50% of their time hunting for the right data and 60% cleaning it. If data quality is ensured at the source or in storage, these costs can be avoided.
What’s ahead:
This article will help you understand data quality and ways to manage it: what quality means in data, a Big Data quality management framework spanning the data life cycle, validation, and a step-by-step quality management process, and who is responsible for data quality.
Understanding Quality in Data
High quality data is the key to driving business intelligence. But with the dynamic nature of data, it’s important to consider all its dimensions.
Data quality has six core dimensions: timeliness, completeness, uniqueness, validity, consistency, and accuracy.
Discrepancies in any of these dimensions can affect the quality of your data. You must also consider the industry in which the data is used, as each dimension represents different use cases in different settings. Here are a few examples of data quality dimensions as they relate to healthcare and banking:
| Quality dimension | What it represents | Example |
| --- | --- | --- |
| Timeliness | How quickly captured data becomes available | Healthcare: the delay between the time a patient's vitals are captured and the time they are visible in your EMR (Electronic Medical Records) system. Banking: a customer has withdrawn cash from an ATM; is it reflected in their account immediately? |
| Completeness | Whether your database holds all the expected information | Healthcare: the number of births recorded in your system vs. the total number of babies delivered in the hospital. Banking: a new account has been opened; does your portal show all the captured details when the customer logs in? |
| Uniqueness | Whether any records are captured more than once | Healthcare: duplicate entries for the same patient in your database. Banking: a customer made a single online transfer to another account; does your automated bank feed record it twice? |
| Validity | Whether your data conforms to the required syntax | Healthcare: the severity of a hearing loss is selected from the list of allowed values. Banking: does your bank statement follow the specific name-middle name-surname format for the account holder's identity? |
| Consistency | Whether different representations of the same information agree | Healthcare: a patient's test result matches between your internal records and the lab's records. Banking: do the customer details on a loan account match those on the savings account? |
| Accuracy | Whether the data describes the real-world object correctly | Healthcare: if a date is entered with the day first rather than in the prescribed MM/DD/YYYY format, the patient record will be wrong. Banking: if your customer is a working professional, does your customer database contain the correct occupation details? |
Big Data Quality Management Framework
The speed at which data flows is truly mind-boggling, so an effective governance framework is required, especially in heavily regulated industries such as banking. The framework ensures that data quality is maintained across the flow of big data, from discovery through definition, design, deployment, and monitoring. Your framework must contain several key components: the Big Data life cycle, a data validation framework, a data quality management process, and clear responsibility for data quality.
Let’s dive into each of these.
Big Data Life Cycle
Big data flows through a series of five stages before it reaches the end user. To avoid compounding data issues, close attention must be paid at each stage. Here's what you need to know about each one:
Stage 1: Data generation
Whether we search for something online, click a link, or send a message, every action generates information. When collected, you'll possess a mix of structured, semi-structured, and unstructured data. At this stage, you might not be able to do much with the unstructured data, but while collecting data in a structured format, you can take measures at the point of capture, such as enforcing entry formats and allowed values, to ensure quality.
Stage 2: Data collection
Your data will likely come from several different sources, so organizing and keeping track of those sources is the main quality task at this stage.
Stage 3: Data processing
Processing a large volume of data is a huge undertaking. At this stage, you format, clean, and compress it; a short sketch of what that can look like follows below.
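As a minimal sketch of this stage, assuming tabular data in a pandas DataFrame (the column names and file name here are hypothetical), formatting, cleaning, and compressing might look like this:

```python
import pandas as pd

# Hypothetical raw extract showing the defects this stage must handle:
# a duplicate row, an invalid date, and a non-numeric measurement
raw = pd.DataFrame({
    "patient_id": ["P001", "P002", "P002", "P003"],
    "visit_date": ["01/15/2024", "01/16/2024", "01/16/2024", "13/45/2024"],
    "weight_kg":  ["70", "68.5", "68.5", "not recorded"],
})

cleaned = (
    raw
    .drop_duplicates()  # remove exact duplicate rows
    .assign(
        # format dates per the prescribed MM/DD/YYYY; bad values become NaT
        visit_date=lambda df: pd.to_datetime(
            df["visit_date"], format="%m/%d/%Y", errors="coerce"
        ),
        # coerce numeric fields; unparseable entries become NaN
        weight_kg=lambda df: pd.to_numeric(df["weight_kg"], errors="coerce"),
    )
    .dropna(subset=["visit_date"])  # drop rows missing a required field
)

# Compress on write (Parquet with snappy compression; requires pyarrow)
cleaned.to_parquet("visits.parquet", compression="snappy")
```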
Stage 4: Data storage
Once your data gets to this stage, it will be published to all users, so any quality issues you find in the data must be resolved by now. When you are storing data that was entered manually, validation checks by data managers are helpful; when data is captured automatically, validation rules built into the capture process handle much of that work for you.
Stage 5: Data Analytics & Visualization
Once you reach the stage of analytics, you will begin to see the results of your quality management efforts. Minor errors that slipped through the earlier stages tend to be exposed visually, so use checks to discover remaining quality issues and flag them to your team so they can make corrections.
Data Validation Framework
Big Data programming languages and libraries provide built-in data validation options that can support you in checking data against quality dimensions like accuracy, completeness, timeliness, consistency, and uniqueness. Standard validation options are listed below, followed by a short code sketch:
| Validation | What it checks |
| --- | --- |
| Type validation | Fields match their expected type, e.g., age must be entered as a whole number and dates must follow the MM/DD/YYYY format |
| Value validation | What is required, allowed, and forbidden in data fields |
| Key validation | Data features your visualizer cannot support, such as an upper-case font, are checked for |
| Range validation | Fields like age, salary, and duration fall within a defined, valid range |
| Consistency check | Data is logically coherent; for example, a shipping date cannot come before the order date |
| Uniqueness check | A data point that must not repeat does not appear in more than one row of your data file |
| Code check | Entries are selected from an allowed list, such as a list of countries or valid postal codes |
| Lineage validation | Test and training databases are compared to ensure that all data generated at the sources reaches storage as required |
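As a hedged illustration of how several of these checks can be expressed in plain Python (the field names, allowed-value list, and range below are hypothetical, and the sample record deliberately combines the healthcare and retail examples from the table):

```python
from datetime import date

# Code check list: allowed severities for hearing loss (hypothetical values)
ALLOWED_SEVERITIES = {"mild", "moderate", "severe", "profound"}

def validate_record(record: dict, seen_ids: set) -> list[str]:
    """Return human-readable validation errors for one record."""
    errors = []

    # Type validation: age must be a whole number...
    if not isinstance(record.get("age"), int):
        errors.append("age must be a whole number")
    # ...and range validation: it must fall in a plausible range
    elif not 0 <= record["age"] <= 120:
        errors.append("age outside the valid range 0-120")

    # Code check: value must come from the allowed list
    if record.get("severity") not in ALLOWED_SEVERITIES:
        errors.append(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")

    # Consistency check: a shipping date cannot precede the order date
    if record.get("ship_date") and record.get("order_date"):
        if record["ship_date"] < record["order_date"]:
            errors.append("ship_date is before order_date")

    # Uniqueness check: the record key must not repeat across rows
    if record.get("id") in seen_ids:
        errors.append(f"duplicate record id {record['id']}")
    seen_ids.add(record.get("id"))

    return errors

seen: set = set()
record = {"id": "R001", "age": 34, "severity": "mild",
          "order_date": date(2024, 1, 10), "ship_date": date(2024, 1, 12)}
print(validate_record(record, seen))  # [] -> the record passes every check
```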
Data Quality Management Process
Data quality can be managed by following this 5-step process:
1. Identification and measurement of data quality: What is not measured cannot be confirmed, so your first step should be to define quality on your own terms and identify metrics to measure data quality against the six dimensions described earlier, such as the percentage of complete records or the rate of duplicate entries; a sketch of computing two such metrics appears below.
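A minimal sketch of measuring two of these metrics, reusing the hypothetical dataset from the processing-stage example (file and column names remain assumptions):

```python
import pandas as pd

df = pd.read_parquet("visits.parquet")  # cleaned in the processing-stage sketch

# Completeness: average share of required fields that are populated
required = ["patient_id", "visit_date", "weight_kg"]
completeness_pct = df[required].notna().mean().mean() * 100

# Uniqueness: share of rows that do not duplicate an earlier row
uniqueness_pct = (1 - df.duplicated().mean()) * 100

print(f"completeness: {completeness_pct:.1f}%  uniqueness: {uniqueness_pct:.1f}%")
```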
2. Define rules and targets for data performance: Set policies for data quality by assigning rules and targets to each quality dimension. For each activity or process, you might define a different focus area for quality checks. For instance, during data migration from legacy systems to the cloud, your focus may be on completeness. For every process, the focus areas can differ, and for each focus area you can identify specific data quality issues. Some examples of these rules include:
| Quality dimension | Rule |
| --- | --- |
| Uniqueness | No entity may be recorded more than once in the database, so new entries must be checked for duplication. Where a record already exists, new information may only be merged into the existing record. |
| Accuracy | Registration status must have a value consistent with the regional accreditation board's reference data; if it does not, the board is approached for verification. |
| Completeness | Constraints can be placed on which fields are mandatory and which are optional before a form is accepted by the system. |
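One way to make such rules operational (a sketch; the rule texts follow the table above, but the target thresholds are illustrative assumptions) is to pair each rule with a measurable target that later monitoring can compare against:

```python
# Each rule pairs a quality dimension with a check and a measurable target
QUALITY_RULES = [
    {"dimension": "uniqueness",   "rule": "no entity recorded more than once",           "target_pct": 100.0},
    {"dimension": "accuracy",     "rule": "registration status matches board reference", "target_pct": 99.0},
    {"dimension": "completeness", "rule": "mandatory form fields populated",             "target_pct": 98.0},
]

def meets_target(measured_pct: float, rule: dict) -> bool:
    """True if the measured quality level satisfies the rule's target."""
    return measured_pct >= rule["target_pct"]

print(meets_target(99.1, QUALITY_RULES[2]))  # True: 98% completeness target met
```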
3. Design of quality improvement processes: Quality improvement cannot be guaranteed unless you create a culture that respects data quality. Create local structures and define interventions to manage quality issues. Several measures can be taken to improve the quality of your data, as listed below:
| Quality improvement measure | What it does |
| --- | --- |
| Profiling | Gathers statistics on the data for quality assessment |
| Matching | Merges or integrates related data records |
| Parsing and standardization | Breaks data into components and recombines them in a specific format |
| Normalization | Eliminates redundancy and performs feature scaling |
| Cleaning | Removes duplicate and incorrect data; modifies values to meet standards and rules |
| Metadata management | Centralizes management of semantic metadata for organizational consistency |
| Data quality dashboard | Shows the measured quality metrics so actual quality can be compared with targets |
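Profiling, the first measure in the table, can be as simple as gathering per-column statistics; here is a minimal pandas sketch, reusing the hypothetical dataset from the earlier examples:

```python
import pandas as pd

df = pd.read_parquet("visits.parquet")  # hypothetical dataset from earlier sketches

# One row of statistics per column: the raw material for quality assessment
profile = pd.DataFrame({
    "dtype":    df.dtypes.astype(str),   # input for type validation
    "null_pct": df.isna().mean() * 100,  # feeds the completeness metric
    "distinct": df.nunique(),            # flags candidate key / code-list fields
})
print(profile)
```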
4. Implementation of quality improvement methods: Create a proper plan not just for making rules but also for monitoring data quality, identifying opportunities for improvement, and implementing improvement measures. Share this plan with your whole team to keep them aware of how important data quality is, and communicate any successes to encourage the team to keep working positively.
5. Monitoring of data quality: You can create reporting systems, monitor performance in the field, or run real-time quality checks on data. To build a reporting system, assign responsibility for checking data to supervisors; you can also create a team of data quality checkers to ensure the quality dimensions are well taken care of. Real-time data quality checks can also be automated with dedicated tools; a minimal sketch of such a recurring check follows below.
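As a hedged sketch of a recurring monitoring job (the measured values are stand-ins; in practice measure() would wire in the metric computations shown earlier):

```python
# Targets taken from the earlier rules sketch (illustrative values)
QUALITY_RULES = [
    {"dimension": "uniqueness",   "target_pct": 100.0},
    {"dimension": "accuracy",     "target_pct": 99.0},
    {"dimension": "completeness", "target_pct": 98.0},
]

def measure(dimension: str) -> float:
    """Placeholder: wire in the real metric computations shown earlier."""
    return {"uniqueness": 100.0, "accuracy": 98.2, "completeness": 99.1}[dimension]

def run_quality_checks() -> None:
    """Compare each measured metric to its target and alert on shortfalls."""
    for rule in QUALITY_RULES:
        measured = measure(rule["dimension"])
        if measured < rule["target_pct"]:
            print(f"ALERT: {rule['dimension']} at {measured:.1f}% "
                  f"(target {rule['target_pct']:.1f}%)")

run_quality_checks()  # schedule hourly/daily via cron or an orchestrator
```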
Responsibility for Data Quality
From the bottom all the way to the top, many individuals across an enterprise encounter data. Therefore, they all influence data quality. It is essential to educate all knowledge workers on the importance of data quality and encourage them to follow the rules required to maintain it.
Data Quality in the Age of Big Data
Quality data in Big Data streams is essential to achieving positive business outcomes, but the path there isn't always easy. Understanding the Big Data life cycle and committing to a data quality management process will give you a roadmap. Additionally, we've designed a QE self-assessment test that you can use to understand whether your quality engineering practices are aligned to deal with the quality challenges of Big Data. If they are not, our team is here to assist you in your digital transformation.
For more on data quality, quality engineering, and machine learning services, check out Apexon's data services and quality engineering practice, or get in touch directly using the form below.