Data Quality

Data quality is a measure of whether data meets the requirements of the business. It is very specific to business needs. While managing data quality, certain questions have to be answered:

  • Is the data accurate: Data accuracy means that whatever is extracted from the source system, sometimes in the same order, is loaded unchanged into the target system. Possible causes of inaccurate data are data duplication and joins in which columns get mapped to the wrong values. E.g. the Customer source has the fields customerID, customerAccount, customerName, Address, and the Bank source has the fields bankId, bankAccount, bankName, Address. If we need the output customerName, customerAccount, customerAddress, bankName, then while mapping we need to make sure that the address comes from the customer source and not from the bank source. Other issues can make data inaccurate in a similar way.
  • Is the data up to date: Data warehouses are generally loaded on a weekly or monthly basis. But what if a daily report has to be generated at a specific time of day? In that case we need to set up an ODS (operational data store) or another suitable system that holds up-to-date data before the report is created, because customers do not want obsolete data.
  • Is the data complete: Data completeness is decided by the business requirements and can vary from business to business. E.g. one business wants to report the year of account opening, so it needs only the year, in YYYY form. A bank, however, may need the complete account opening date, in DDMMYYYY format, in order to apply yearly charges.
  • Is the data consistent across all data sources: While loading the data we may need to take care of the data lineage. If a particular piece of data is loaded into 10 different data marts, it has to be consistent across all 10 systems. E.g. a credit card expiration date should be the same in all the data marts, which may be used by different departments of the bank. If one system shows an old expiration date, it can lead to monetary loss or legal issues.
  • Is the data duplicated: Data should not be duplicated unnecessarily. Duplication wastes resources such as storage space, CPU and memory, and it can lead to other data quality issues such as inaccurate data.
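Some of the questions above can be turned into small automated checks. The following is only a minimal sketch in Python, not a definitive implementation. The sample rows, the function names (find_duplicates, build_output, inconsistent_keys) and the choice to join the two sources on the account number are all assumptions made for illustration; only the field names come from the example above.

```python
from collections import Counter

# Hypothetical sample rows; only the field names come from the text.
customers = [
    {"customerID": 1, "customerAccount": "AC100", "customerName": "Asha", "Address": "12 Park Road"},
    {"customerID": 2, "customerAccount": "AC200", "customerName": "Ravi", "Address": "7 Lake View"},
    {"customerID": 2, "customerAccount": "AC200", "customerName": "Ravi", "Address": "7 Lake View"},
]
banks = [
    {"bankId": 10, "bankAccount": "AC100", "bankName": "First Bank", "Address": "1 Main Street"},
    {"bankId": 11, "bankAccount": "AC200", "bankName": "City Bank", "Address": "99 High Street"},
]


def find_duplicates(rows, key):
    """Return the key values that occur more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def build_output(customers, banks):
    """Join on account number (an assumption) and map the address explicitly."""
    bank_by_account = {b["bankAccount"]: b for b in banks}
    seen_ids = set()
    output = []
    for c in customers:
        if c["customerID"] in seen_ids:
            continue  # skip duplicated customer rows
        seen_ids.add(c["customerID"])
        bank = bank_by_account.get(c["customerAccount"])
        if bank is None:
            continue  # no matching bank record
        output.append({
            "customerName": c["customerName"],
            "customerAccount": c["customerAccount"],
            # Both sources have an "Address" field; take the customer one.
            "customerAddress": c["Address"],
            "bankName": bank["bankName"],
        })
    return output


def inconsistent_keys(marts, field):
    """Return the keys whose value for `field` differs between data marts."""
    values = {}
    for mart in marts.values():
        for key, row in mart.items():
            values.setdefault(key, set()).add(row[field])
    return sorted(k for k, v in values.items() if len(v) > 1)
```

Calling find_duplicates(customers, "customerID") flags the repeated customer, build_output(customers, banks) produces one row per customer with the customer's own address, and inconsistent_keys can be run over a dictionary of data marts to list records, such as a card with two different expiration dates, that disagree between systems.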

There can be other data quality measures as well: proper keys have to be defined on the data tables, and domain integrity and business rules have to be enforced. These measures help us check whether our data is meaningful.
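As a rough illustration of these three measures, the sketch below validates a list of rows in Python. Everything specific in it is hypothetical: the function name validate_rows, the fields accountType and openedOn, the set of allowed account types, and the rule that an account cannot be opened in the future are assumptions chosen only to show one key check, one domain check and one business rule.

```python
from datetime import date


def validate_rows(rows, key, allowed_types, today):
    """Return a list of (row_index, problem) tuples for the given rows."""
    problems = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Key check: the key must be present and unique.
        k = row.get(key)
        if k is None:
            problems.append((i, "missing key"))
        elif k in seen_keys:
            problems.append((i, "duplicate key"))
        else:
            seen_keys.add(k)
        # Domain integrity: the value must come from an allowed set.
        if row.get("accountType") not in allowed_types:
            problems.append((i, "invalid accountType"))
        # Business rule (hypothetical): no opening date in the future.
        opened = row.get("openedOn")
        if opened is not None and opened > today:
            problems.append((i, "openedOn in the future"))
    return problems
```

In a real warehouse such rules are usually enforced by the database itself through primary keys and constraints; a check like this is mainly useful for reporting bad rows before they are loaded.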

