Data Analysis · ETL

Data Quality

Data quality is a measure of whether data meets the requirements of the business. It is very specific to business needs. When managing data quality, certain questions have to be answered:

  • Is the data accurate: Data accuracy means that whatever data is taken from the source system (and sometimes in the same order) is loaded accurately into the target system. Possible causes of inaccurate data are data duplication and joins in which columns get mapped to the wrong values. E.g. the Customer source has the fields customerID, customerAccount, customerName, Address and the Bank source has the fields bankId, bankAccount, bankName, Address. If we need the output customerName, customerAccount, customerAddress, bankName, then while mapping we need to make sure that the address comes from the Customer source and not from the Bank source. Similarly, there could be other issues that make data inaccurate.
  • Is the data up to date: Generally, data warehouses are loaded on a monthly or weekly basis. But what if a daily report has to be generated at a specific time of day? In that case we need to set up an ODS (operational data store) or a similar system that holds up-to-date data before the report creation time, to get the desired data quality, because customers don’t need obsolete data.
  • Is the data complete: Data completeness is decided by the business requirements and can vary from business to business. E.g. one business may want to report only the year of account opening, so it just needs the year, i.e. in YYYY form, whereas a bank may need the complete account opening date, i.e. in DDMMYYYY format, to apply yearly charges.
  • Is the data consistent across all data sources: While loading the data we may need to take care of the data lineage. If a particular piece of data is loaded into 10 different data marts, it has to be consistent across all 10. E.g. a credit card expiration date should be consistent in all the data marts, which may be used by different departments of the bank; if one system shows an old expiration date, it can lead to monetary loss or legal issues.
  • Is the data duplicated: Data should not be duplicated unnecessarily. Duplication may consume significant resources, i.e. space, CPU, memory etc., and may also lead to other data quality issues such as inaccurate data.
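
The accuracy, duplication, and consistency questions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records are made up, joining the two sources on the account number is an assumption, and the helper names `build_output`, `find_duplicates`, and `is_consistent` are hypothetical.

```python
# Hypothetical sample records; the field names follow the example above.
customers = [
    {"customerID": 1, "customerAccount": "A-100", "customerName": "Asha", "Address": "12 Park Road"},
    {"customerID": 2, "customerAccount": "A-200", "customerName": "Ben", "Address": "7 Hill Street"},
]
banks = [
    {"bankId": 10, "bankAccount": "A-100", "bankName": "First Bank", "Address": "1 Bank Plaza"},
    {"bankId": 11, "bankAccount": "A-200", "bankName": "City Bank", "Address": "9 Market Square"},
]


def build_output(customers, banks):
    """Join on the account number (an assumed key) and pick each column's source explicitly."""
    bank_by_account = {b["bankAccount"]: b for b in banks}
    rows = []
    for c in customers:
        bank = bank_by_account.get(c["customerAccount"])
        if bank is None:
            continue  # no matching bank record; simply skipped in this sketch
        rows.append({
            "customerName": c["customerName"],
            "customerAccount": c["customerAccount"],
            # Both sources have an "Address" field, so take the customer's on purpose.
            "customerAddress": c["Address"],
            "bankName": bank["bankName"],
        })
    return rows


def find_duplicates(rows, key):
    """Return the key values that occur more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates


def is_consistent(values_by_mart):
    """True when every data mart reports the same value for a field."""
    return len(set(values_by_mart.values())) <= 1


output = build_output(customers, banks)
print(output[0]["customerAddress"])
print(find_duplicates(customers + [customers[0]], "customerID"))
print(is_consistent({"mart_a": "12/2027", "mart_b": "12/2027", "mart_c": "11/2025"}))
```

Naming the source of every output column, instead of relying on a shared column name such as Address, is what keeps the join from silently picking the bank's address.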

There could be other data quality measures as well, such as proper keys defined on the data tables, domain integrity, and business rules. These measures help us check whether our data is meaningful.
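
As a rough sketch of these three kinds of checks, the Python below validates a few rows for key uniqueness, domain integrity, and one business rule. Everything in it is an assumption made for illustration: the account rows, the allowed values in `ALLOWED_TYPES`, the rule that an account cannot be opened in the future, and the `validate` helper itself.

```python
from datetime import date

# Hypothetical account rows, used only to illustrate the three kinds of checks.
accounts = [
    {"accountId": "A-100", "accountType": "savings", "openedOn": date(2015, 3, 1)},
    {"accountId": "A-200", "accountType": "current", "openedOn": date(2018, 7, 15)},
    {"accountId": "A-200", "accountType": "gold", "openedOn": date(2999, 1, 1)},
]

ALLOWED_TYPES = {"savings", "current"}  # assumed domain of accountType


def validate(rows, today):
    """Return a list of (row index, problem) pairs."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Key check: the primary key must be present and unique.
        if not row["accountId"] or row["accountId"] in seen_ids:
            problems.append((i, "duplicate or missing key"))
        seen_ids.add(row["accountId"])
        # Domain integrity: the value must come from the allowed set.
        if row["accountType"] not in ALLOWED_TYPES:
            problems.append((i, "accountType outside domain"))
        # Business rule (assumed): an account cannot be opened in the future.
        if row["openedOn"] > today:
            problems.append((i, "opening date in the future"))
    return problems


problems = validate(accounts, today=date(2024, 1, 1))
for index, problem in problems:
    print(index, problem)
```

In a real warehouse, checks like these would usually be enforced by database constraints or by the ETL tool; the loop above only shows the logic.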

 

Note: The featured image is taken from the internet.

Data Analysis · ETL

Data Preparation For Data Analysis

This data set is taken from https://www.data.gov/food/ and is for analysis purposes only.

Note: This example uses Tableau to perform data preparation and analysis.

Data Preparation: Data preparation is the very first and a crucial step in data analysis. In this process we collect the data on which we want to perform the analysis and clean/format it so that our analysis software can understand it better. Sometimes we may also need to consolidate data from multiple sources.

We need to do data preparation in many cases, such as:

  • The data is not in a machine-friendly format, i.e. it is laid out for human understanding, but software such as Tableau may not be able to read it correctly.
  • Multiple sources are needed for the analysis. Sometimes we need to perform data blending/joining before creating the dashboards.

There could be many other business reasons for which we may need a data preparation phase in our analysis cycle.

Below is an example of data that is good for human readability but not good for Tableau to read:

Unformatted sheet

This data is very neatly laid out and easy to understand, but a machine may run into the following issues while reading it:

  • The first line is purely informational, i.e. “Consumer price indexes historical data, 1974 through 2016”; it can’t be used in the analysis and has no meaning for Tableau.
  • There are hidden columns between Annual 1979 and Annual 2016.
  • There are blank lines in between to increase readability.
  • Columns are merged to give categories to the data.
  • Tableau can compute summaries during the analysis, so we really don’t need an additional summary row such as Row 5.
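
To make these issues concrete, here is a minimal Python sketch of the same kind of cleanup done by hand. It is not how Tableau works internally. The rows and numbers are made up, the layout (title line, header, summary row, blank spacer rows, category heading rows) is an assumed simplification of the sheet, and `tidy` is a hypothetical helper. Hidden columns are a spreadsheet display setting and are not modelled here.

```python
# Hypothetical rows mimicking a human-formatted sheet; the numbers are made up.
raw_rows = [
    ["Consumer price indexes historical data, 1974 through 2016", "", ""],  # title line
    ["Item", "Annual 1974", "Annual 1975"],  # header
    ["All food", "1.0", "2.0"],              # summary row
    ["", "", ""],                            # blank spacer row
    ["Meats", "", ""],                       # merged category cell
    ["Beef", "3.0", "4.0"],
    ["Pork", "5.0", "6.0"],
    ["", "", ""],
    ["Dairy", "", ""],
    ["Milk", "7.0", "8.0"],
]


def tidy(rows, summary_labels=("All food",)):
    """Turn the formatted sheet into flat records with an explicit Category field."""
    header = rows[1]  # row 0 is the title line, so the header is row 1
    records = []
    category = None
    for row in rows[2:]:
        label, values = row[0], row[1:]
        if not any(cell.strip() for cell in row):
            continue  # blank spacer row
        if label in summary_labels:
            continue  # summary row; the analysis tool can aggregate on its own
        if not any(value.strip() for value in values):
            category = label  # a label without values is a category heading
            continue
        record = {"Category": category, header[0]: label}
        for name, value in zip(header[1:], values):
            record[name] = float(value)
        records.append(record)
    return records


for record in tidy(raw_rows):
    print(record)
```

Each output record carries its category explicitly, which is the information the merged cells conveyed only visually.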

Below is a picture of the same file after modifications. The changes made in this file are:

Little enhanced data.png

  • Categorized the food
  • Gave the columns better names
  • Unhid the hidden columns

Blank rows are not removed, as Tableau 10 takes care of that automatically with the help of the Data Interpreter. Below is a picture of how Tableau reads this data:

Data In tableau

Now we can perform all types of analysis on this data, e.g. we can pivot the data so that the years become a column of their own, and so on.
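
One common form of pivoting for data like this is to reshape the one-column-per-year layout into one row per item and year. The Python below is a small sketch of that reshaping, not Tableau's own pivot feature. The records and values are made up, and `pivot_years` and the `Year`/`Value` field names are hypothetical.

```python
# Hypothetical wide records with one column per year; the values are made up.
wide = [
    {"Category": "Meats", "Item": "Beef", "Annual 1974": 3.0, "Annual 1975": 4.0},
    {"Category": "Dairy", "Item": "Milk", "Annual 1974": 7.0, "Annual 1975": 8.0},
]


def pivot_years(records, prefix="Annual "):
    """Turn every 'Annual <year>' column into its own row with Year and Value fields."""
    long_rows = []
    for record in records:
        # Fields that are not year columns are repeated on every output row.
        fixed = {k: v for k, v in record.items() if not k.startswith(prefix)}
        for key, value in record.items():
            if key.startswith(prefix):
                year = int(key[len(prefix):])
                long_rows.append({**fixed, "Year": year, "Value": value})
    return long_rows


for row in pivot_years(wide):
    print(row)
```

With the year held in a single field, it can be used as a dimension, for example as the axis of a trend line, without touching dozens of separate columns.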

Let me know if you need more information on this topic. Please also share your likes/dislikes about the content I post so that I can improve it.

 

Data Analysis

Segmentation analysis for a bank

Segmentation analysis is done to identify the potential customers for a particular product. The storyline below (created using Tableau) is an example of how we can perform segmentation analysis of a bank’s customers, say if we want to sell a life insurance policy to existing bank customers.

customer_segmentation

In the picture below, if we want to sell life insurance in England, we first need to understand the customers:

england

We can see that the major group of customers is aged 30-35, that they are mainly (70%) white-collar employees, and that men and women are proportionately represented. All these customers have a very low bank balance, so a life insurance product can be sold to them with the pitch of securing their future and creating wealth for their retirement.
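
The same kind of breakdown can be sketched outside Tableau. The Python below filters customers by region, finds the largest age band, and reports the job and gender shares and the mean balance within it. The customer records are made up, the five-year band edges are an assumption (so the band shows up as 30-34 here), and `age_band` and `share` are hypothetical helpers. It does not reproduce the dashboard or its numbers.

```python
from collections import Counter

# Hypothetical customer records; all values are made up for illustration.
customers = [
    {"region": "England", "age": 31, "job": "White Collar", "gender": "F", "balance": 120.0},
    {"region": "England", "age": 34, "job": "White Collar", "gender": "M", "balance": 80.0},
    {"region": "England", "age": 33, "job": "Blue Collar", "gender": "F", "balance": 60.0},
    {"region": "England", "age": 52, "job": "White Collar", "gender": "M", "balance": 900.0},
    {"region": "Scotland", "age": 45, "job": "Other", "gender": "M", "balance": 300.0},
]


def age_band(age, width=5):
    """Label an age with its band, e.g. 31 falls in '30-34' when width is 5."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def share(records, field):
    """Fraction of records for each distinct value of a field."""
    counts = Counter(r[field] for r in records)
    total = len(records)
    return {value: count / total for value, count in counts.items()}


# Step 1: restrict to the region of interest.
england = [c for c in customers if c["region"] == "England"]
# Step 2: find the age band with the most customers.
bands = Counter(age_band(c["age"]) for c in england)
largest_band, _ = bands.most_common(1)[0]
# Step 3: profile the customers inside that band.
segment = [c for c in england if age_band(c["age"]) == largest_band]
mean_balance = sum(c["balance"] for c in segment) / len(segment)

print("largest age band:", largest_band)
print("job share:", share(segment, "job"))
print("gender share:", share(segment, "gender"))
print("mean balance:", mean_balance)
```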

This is just an example of how segmentation analysis can help in business decision making. It is used in many industries these days, such as insurance agencies, credit card companies, banks, luxury goods, and many more.

You can look at my work here: Sample Segmentation Analysis for Bank

Input datasets and idea courtesy: superdatascience.com