ETL

Dimensional Modelling – Fact and Dimension Tables

[Image: Retail store snowflake schema]

Fact: A fact is a measurement of a particular business process. Fact tables store these measurements so that business process performance can be tracked.

E.g. if we are selling a product at some dollar value, the measurements will be the units sold and the dollar amount earned from that product's sale.

Each row in a fact table is one measurement event. This data is stored at a specific level of detail called the 'grain'; in the example above, it could be one row per product sold. All rows in a single fact table should have the same grain to avoid data inconsistency.

Mainly there are 3 types of facts (measures):

  1. Additive: Additive facts can be summed across all dimensions to generate insights from the data. E.g. the total sales of a region for a year, in dollars, can be calculated by summing the sales amount over the last year of records in the fact table.
  2. Semi-additive: Semi-additive facts cannot be added in all cases. E.g. an account balance cannot be summed across the time dimension: if the balance is $100 in the morning and becomes $50 by evening after a transaction, the end-of-day balance is not $100 + $50 = $150; it is simply $50.
  3. Non-additive: These facts can never be meaningfully added, e.g. the unit price of a product. We can only perform operations such as average and count on this type of measure (see the sketch below).
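
As a minimal Python sketch of the difference (the sales rows and figures below are made up for illustration):

# Hypothetical fact rows at the grain "one row per product sold".
sales_fact = [
    {"product": "A", "units_sold": 2, "sales_amount": 20.0, "unit_price": 10.0},
    {"product": "A", "units_sold": 1, "sales_amount": 10.0, "unit_price": 10.0},
    {"product": "B", "units_sold": 4, "sales_amount": 60.0, "unit_price": 15.0},
]

# Additive facts: summing across rows gives a meaningful total.
total_units = sum(row["units_sold"] for row in sales_fact)      # 7 units
total_sales = sum(row["sales_amount"] for row in sales_fact)    # 90.0 dollars

# Non-additive fact: summing unit_price (10 + 10 + 15 = 35) is meaningless;
# only operations such as average or count make sense.
avg_unit_price = sum(row["unit_price"] for row in sales_fact) / len(sales_fact)

print(total_units, total_sales, round(avg_unit_price, 2))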

While storing data in a fact table, only valid fact information should be recorded; e.g. if there is no sales activity for a given product, no row should be written to the fact table. We should not populate zeroes to represent no activity, because even without capturing inactivity, fact tables take up around 90% of the space in a dimensional model. We should be very judicious in using fact table space.

The grain of a fact table falls into one of 3 categories:

  1. Transaction: This is the most common type of fact table grain; one row represents a single transaction event.
  2. Periodic Snapshot
  3. Accumulating Snapshot

Fact tables have two or more foreign keys connecting to dimension tables' primary keys. E.g. in the picture above, office ID is the primary key of the dealer's office dimension and a foreign key in the fact table. A fact table's own primary key is generally composed of a subset of its foreign keys, also called a composite key. Fact tables express many-to-many relationships between dimensions.

Dimension Tables: Dimension tables contain the textual context for the fact tables. They describe the 'who, what, when, where, how and why' associated with business events. E.g. Time and DealersOffice are dimensions in the example above. Dimension tables have more attributes than fact tables but far fewer rows. Each dimension table has a single primary key on which it can be joined to the fact table.

Dimensions serve as the source of report labels and make the DW/BI system understandable and useful. Dimensional attribute names should reflect business terms rather than codes that have to be memorized and decoded while reporting.

Sometimes it is hard to tell whether a numeric value is a dimension or a fact; e.g. the cost of a product can serve as either. Continuously valued numbers are almost always facts, whereas discrete values drawn from a small list are usually dimension attributes.

Fact and Dimensional tables in Dimensional modelling 

The diagram above is a snowflake schema for a retail business. In dimensional modelling, schemas are kept as simple as possible, but not simpler, so that data can be processed with fewer joins. A dimensional model should be planned well enough that the schema does not need to change every time the business wants to analyze the data differently. For that reason data is generally kept at the lowest level of granularity: atomic, non-aggregated data is the most expressive.

While creating a report from a dimensional model, dimension attributes supply the report filters and labels, whereas fact tables supply the numeric values.
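
As a minimal sketch of this split, the Python snippet below builds a tiny fact table and dimension table in an in-memory SQLite database and runs a report in which the dimension supplies the label and filter and the fact supplies the numbers. The table and column names are hypothetical, loosely following the dealer's office example above.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical dimension table: one row per dealer's office.
cur.execute("CREATE TABLE dealers_office_dim (office_id INTEGER PRIMARY KEY, office_name TEXT, region TEXT)")
cur.executemany("INSERT INTO dealers_office_dim VALUES (?, ?, ?)",
                [(1, "Downtown", "East"), (2, "Airport", "West")])

# Hypothetical fact table: one row per sale, with a foreign key to the dimension.
cur.execute("CREATE TABLE sales_fact (office_id INTEGER, units_sold INTEGER, sales_amount REAL)")
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                [(1, 2, 20.0), (1, 1, 10.0), (2, 4, 60.0)])

# Report: dimension attributes provide the label and filter, facts are aggregated.
cur.execute("""
    SELECT d.office_name, SUM(f.units_sold) AS units, SUM(f.sales_amount) AS amount
    FROM sales_fact f
    JOIN dealers_office_dim d ON f.office_id = d.office_id
    WHERE d.region = 'East'
    GROUP BY d.office_name
""")
print(cur.fetchall())   # [('Downtown', 3, 30.0)]
conn.close()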

Kimball’s DW/BI architecture:

There are 4 components to consider in the DW/BI architecture:

  1. Operational source systems, i.e. OLTP: These systems capture business transactions. E.g. point-of-sale transaction systems are OLTP systems.
  2. ETL systems: Extract, Transform and Load systems. This layer sits between the OLTP systems and the DW/BI presentation area. ETL is covered in detail in the article below.
  3. Data Presentation Layer
  4. Business Intelligence Applications

 

Source: The Data Warehouse Toolkit, 3rd Edition, by Ralph Kimball and Margy Ross

ETL


[Image: ETL process]

ETL – Extract Transform and Load

The ETL process is used when a business needs to move data from one source system to another target system, from one form (i.e. structure) of data to another, from OLTP to OLAP systems, from an enterprise OLAP to subject-wise OLAPs, and so on. It is basically a layer where data is taken from one system, transformed as per business logic and loaded into another system.

Extraction: An OLTP landscape usually has multiple disparate source systems. Data can be extracted from several sources at once, say a file and a DBMS extracted together in one job, combined and processed, or it can be taken from a single source system. The extraction strategy mainly depends on the source systems and how data is stored in them.

Transformation: ETL transforms the data from one format to another, i.e. it makes the data homogeneous as per the target system. The main transformation classes are listed below (a short sketch follows the list):
  • Data type conversions
  • Joining two or more data sources
  • Performing calculations or aggregations on the source data
  • Applying business logic
  • Generating surrogate keys for the data warehouse
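
As a minimal Python sketch of a few of these transformation classes (the record layouts, column names and surrogate-key scheme are made up for illustration):

from itertools import count

# Hypothetical source rows extracted from two different systems.
orders = [{"order_id": "101", "cust_id": "C1", "amount": "25.50"},
          {"order_id": "102", "cust_id": "C2", "amount": "10.00"}]
customers = {"C1": {"name": "Alice"}, "C2": {"name": "Bob"}}

next_surrogate = count(start=1)   # surrogate-key generator for the warehouse
key_map = {}                      # natural key -> surrogate key

transformed = []
for row in orders:
    cust = customers[row["cust_id"]]              # join to a second source
    if row["cust_id"] not in key_map:             # generate a surrogate key once per natural key
        key_map[row["cust_id"]] = next(next_surrogate)
    transformed.append({
        "customer_sk": key_map[row["cust_id"]],
        "customer_name": cust["name"],
        "amount": float(row["amount"]),           # data type conversion
    })

# Simple aggregation: total amount per customer surrogate key.
totals = {}
for row in transformed:
    totals[row["customer_sk"]] = totals.get(row["customer_sk"], 0.0) + row["amount"]
print(totals)   # {1: 25.5, 2: 10.0}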

Below are some example scenarios where ETL can be used. These are just to give an idea; in real life ETL handles far more complex requirements that would be tedious to implement otherwise. E.g.:

  • If my source, which could be a file or a database, has 10 columns and my target system needs only 5 of them, ETL can be used to load just those columns.
  • If the source has separate first name and last name columns but the target system, used say for reporting, needs a full name, ETL can derive it.
  • If my source supplies dates as DDMMYYYY but the target wants them as YYYY-MM-DD, ETL can do the conversion (see the sketch below).
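
A minimal Python sketch of these three scenarios, using a made-up source record and column names:

from datetime import datetime

# Hypothetical source record with 10 columns; the target needs only a few.
source_row = {"id": "7", "first_name": "Jane", "last_name": "Doe",
              "date": "25122016", "col5": "", "col6": "", "col7": "",
              "col8": "", "col9": "", "col10": ""}

target_row = {
    # 1. Keep only the columns the target system needs.
    "id": source_row["id"],
    # 2. Derive the full name from first and last name.
    "full_name": f"{source_row['first_name']} {source_row['last_name']}",
    # 3. Reformat the date from DDMMYYYY to YYYY-MM-DD.
    "date": datetime.strptime(source_row["date"], "%d%m%Y").strftime("%Y-%m-%d"),
}
print(target_row)   # {'id': '7', 'full_name': 'Jane Doe', 'date': '2016-12-25'}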

Cleansing the data: We may receive corrupt or duplicate data; ETL can easily handle such scenarios based on the business rules provided.

E.g:

  • What should be done when there is duplicate data in the source: do we discard it and inform the source system about the issue, or do we process it by keeping a single record?
  • How NULLs in the data should be handled (see the sketch below).
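
A minimal Python sketch of these two cleansing rules, with made-up column names and a hypothetical 'UNKNOWN' default for NULLs:

raw_rows = [
    {"cust_id": "C1", "city": "Boston"},
    {"cust_id": "C1", "city": "Boston"},      # duplicate of the first record
    {"cust_id": "C2", "city": None},          # NULL city to be defaulted
]

seen, cleansed, rejects = set(), [], []
for row in raw_rows:
    if row["cust_id"] in seen:                # business rule: keep one record,
        rejects.append(row)                   # route the duplicate to a reject file
        continue
    seen.add(row["cust_id"])
    row["city"] = row["city"] or "UNKNOWN"    # business rule: default NULLs
    cleansed.append(row)

print(cleansed)   # two rows, with C2's city defaulted to 'UNKNOWN'
print(rejects)    # the duplicate C1 row, to be reported back to the source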

Loading: ETL acts as a bridge between source and target. Taking data from one source system and connecting to and loading another system (even if the two are completely different, say a mainframe and a relational database) can all be done in a single job.

For notes on good ETL design, see the 'Design considerations for ETL process' section later in this document.

ETL

Interview Questions asked for DWH professional positions

Below are some interview questions I have faced while looking for a job. There is no single right answer to these questions, and the answers can be elaborated much further, but they should give you a heads-up on what to expect in an interview.

Q1: What is the end-to-end procedure followed in your project, from processing ETL source files to reporting, along with the tools and technologies used?

Ans: In my project the flow was:

Source File–>ETL Jobs–>Stage Table–> ODS tables–>Data warehouse–>Data Cubes(Summarized data)–>Reporting tools

(In your project the process may be somewhat different)

Q2: What is the full form and use of an ODS in a data warehouse?

Ans: ODS stands for Operational Data Store. An operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. Unlike a master data store, the data is not passed back to the operational systems; it may be passed on for further operations and to the data warehouse for reporting.

Q3: How will you print name and salary of the employee from employee table who has maximum salary?

Ans: select name,salary from employees where salary = (select max(salary) from employees)

Q4: Why can't we use a GROUP BY clause in the above question?

Ans: If you write a query like the one below

select name, max(salary) from employees;

it will fail with "ORA-00937: not a single-group group function". This error occurs when an individual column and a group function are selected together without the individual column appearing in the GROUP BY clause. Adding "group by name" removes the error, but then the query returns the maximum salary for each name rather than the single highest-paid employee, which is why the subquery approach in Q3 is used.

Q5: How will you optimize a query in Teradata?

Ans: Some query tuning techniques are listed below:

  • Look at the EXPLAIN plan and try to understand where the problem is
  • Check whether the Primary Index is defined properly and, if so, whether the data loaded in the PI column is reasonably unique
  • If a partitioned primary index is defined on the table, make sure queries use it, otherwise performance will degrade
  • Avoid using functions such as TRIM in join conditions
  • Make sure no implicit conversion happens on join columns; keep the data types the same
  • Avoid an IN list when there are many values to compare; use a join instead, perhaps by creating a static table of the matching values

These are just a few tips; more can be found in Teradata performance tuning references.

Q6: What scheduling software is typically used for scheduling ETL jobs on Linux/Unix?

Ans: AutoSys and Control-M are the two most commonly used tools for scheduling ETL jobs. Some ETL tools come with their own schedulers, but these are usually not as flexible as dedicated scheduling tools.

Q7: What are the pros and cons of using AutoSys as a scheduling agent?

Ans: Pros:

  • You can create jobs using the GUI or JIL files
  • It has a wide variety of commands to control job scheduling and execution: you can force-start a job, or have it start on the arrival of a file or at a particular time
  • You can put jobs on hold without affecting the jobs that follow

Cons:

  • If you have to create JIL files for, say, hundreds of jobs, things can get tedious

Q8: What are the different types of data warehouse schemas?

Ans: Below are the different types of schemas in a DWH:

  • Star Schema: A star schema resembles a star, with the fact table in the centre and the dimension tables at the star points. This is the simplest of all the data warehouse schemas. Below is the star schema of a library management system.

[Image: Star schema for a library management system]

  • Snowflake Schema: A dimension is split into multiple related dimension tables; i.e. the dimension tables of a star schema are normalized. Below is the image of a retail management snowflake schema.

[Image: Retail store snowflake schema]

  • Galaxy/Hybrid Schema: This is the schema where conformed dimensions are used, i.e. a single dimension is shared by multiple fact tables. Dimensions can be further normalized in this type of schema.

Q9: What is an SCD? Describe the different types of SCDs.

Ans: SCD means slowly changing dimension, i.e. a dimension whose attributes change slowly over a period of time, say a customer whose address may change several times over the years.

There are 3 types of SCD implementations in a DWH:

  • SCD Type 1: Overwrite the old value
  • SCD Type 2: Add a new row
  • SCD Type 3: Add a new column

Q10: How will you implement SCD Type 2 in a banking system?

Ans: Suppose a customer changes their address; it can be implemented as below:

Initial record:

Actual Customer Table

id | name | year | address
1  | ABC  | 2016 | Boston

Now this customer changes their address; we capture the change by adding a new row with start and end dates:

SCD2 Customer Table

id | name | address  | start_date | end_date
1  | ABC  | Boston   | 01-01-2015 | 31-12-2016
1  | ABC  | Portland | 01-01-2017 | 01-01-2099

In this implementation, start_date is the date the record first arrived. When the address changes, the existing row's end_date is updated with the change date, and a new row is inserted with the new address, the change date as its start_date, and a far-future end_date.
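
A minimal Python sketch of that logic, with an in-memory list standing in for the customer dimension table; the column names follow the example above and the far-future end date is an assumption:

from datetime import date

FAR_FUTURE = date(2099, 1, 1)

customer_dim = [
    {"id": 1, "name": "ABC", "address": "Boston",
     "start_date": date(2015, 1, 1), "end_date": FAR_FUTURE},
]

def apply_address_change(dim, cust_id, new_address, change_date):
    """Expire the current row and insert a new row for the changed address."""
    for row in dim:
        if row["id"] == cust_id and row["end_date"] == FAR_FUTURE:
            if row["address"] == new_address:
                return                          # no change, nothing to do
            row["end_date"] = change_date       # close out the old version
            dim.append({"id": cust_id, "name": row["name"],
                        "address": new_address,
                        "start_date": change_date, "end_date": FAR_FUTURE})
            return

apply_address_change(customer_dim, 1, "Portland", date(2017, 1, 1))
for row in customer_dim:
    print(row)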

These are a few questions I encountered during KP's interview process. Please help me grow the list by adding other questions in the comments.

Some Information is collected from www.wikipedia.com

ETL

Step-by-step approach to solving an ETL problem | ETL mini project

This post gives insight into a basic approach for solving an ETL problem:

Question: An insurance company wants to capture details regarding an agent – agent ID, agent name, office address and office contact. These details are to be captured in an SCD Type 2 dimension, Agent_Dim, once every 7 days. The agent details arrive in a file as below:

[Image: sample agent data file]

Describe the loading process and the table structure of Agent_Dim. The target table has the following structure – AGT_ID, START_DATE, END_DATE, NAME (VARCHAR), ADDRESS (VARCHAR), CITY (VARCHAR), ZIP (VARCHAR), PHONE, IND. Every agent ID coming from the file needs to be checked against the parent AGENT_MASTER. Also, update the indicator IND to 'Y' if the agent belongs to CITY 2.
Solution:

When we get a file and need to apply certain transformation rules to it, the first step is to profile the data in the file, i.e. check the quality of the data against the mapping provided. This can be done manually using spreadsheets (if the file is small) or with the help of an ETL tool. In this example, the following points should be checked while profiling:

  • Check that all source data types are compatible with the target data types, e.g. a non-numeric string cannot be converted to a number, and a larger field cannot fit into a smaller one
  • Check that any field which is NOT NULL in the target table is also populated in the source, or that an appropriate mapping rule supplies a default
  • Decide how any special characters arriving in the source data should be handled

Perform any other checks if needed as per requirement.

Steps in the ETL job:

  • As the first task (phase 0), unload the master agent data to a file so lookups can be done without hitting the database for every record; also unload yesterday's load to compare with the current load (say the file names are masteragent.lkp and yesterdayagentload.lkp)
  • Read the file and perform basic checks, e.g. that the phone number is a valid 10-digit number (along with checks for any other issues found during profiling)
  • Perform the lookup for each incoming record and filter out agents that do not have a record in the master data
  • Perform the following CDC operation:
    • Create a hash of the complete record, say
      temp.hashrecord = trim(agent_id)+trim(convert(date,DDMMYYYY))+trim(agent_name)+trim(office_addr)+trim(office_cntct)
      lookup.hashrecord = trim(agent_id)+trim(start_date)+trim(name)+trim(address+city+zip)+trim(phone)
    • Compare the temp record with the lookup records: if a match is found, mark the record 'no change'; if no match is found, check whether the input record's agent ID is present in the lookup file – if yes, mark it 'update', else mark it 'insert'
  • Discard the 'no change' records. For all other records, check whether the city is 'CITY 2' by splitting the input record's office_address field; create a new output field IND and mark it 'Y' if the city matches, else 'N'
  • For 'insert' records, perform the below mappings:
    out.agent_id :: in.agent_id
    out.start_date :: convert(in.date, 'DDMMYYYY')
    out.end_date :: leave it empty or give some future date, say '01-01-2999'
    out.name :: in.agent_name (if the '$~' characters are valid keep the value as is, else strip these 2 characters using appropriate string functions)
    out.address :: substring of office_address using ',' as the delimiter, taking the first part
    out.city :: substring of office_address using ',' as the delimiter, taking the second part
    out.zip :: substring of office_address using ',' as the delimiter, taking the third part
    out.phone :: in.office_cntct
    out.ind :: the temporary indicator created above
  • For 'update' records, perform an upsert on the existing record: set its end_date to today's date and create a new record as in the insert step above (a sketch of the CDC logic follows).
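
A minimal Python sketch of the hash-based CDC and indicator logic described above; the field values and agent IDs are made up, and in-memory dictionaries stand in for the .lkp files:

import hashlib

def record_hash(fields):
    """Build a single hash from the concatenated, trimmed field values."""
    joined = "|".join(str(f).strip() for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Hypothetical previous load (yesterdayagentload.lkp): agent_id -> record hash.
previous_load = {
    "A1": record_hash(["A1", "01012017", "John", "12 Main St,CITY 1,11111", "5551234567"]),
}
agent_master = {"A1", "A2"}      # agent IDs present in AGENT_MASTER

incoming = [
    ["A1", "01012017", "John", "12 Main St,CITY 1,11111", "5551234567"],  # unchanged
    ["A2", "08012017", "Mary", "9 High St,CITY 2,22222", "5559876543"],   # new agent
]

for rec in incoming:
    agent_id = rec[0]
    if agent_id not in agent_master:
        continue                                  # not in parent AGENT_MASTER: filter out
    action = ("no change" if previous_load.get(agent_id) == record_hash(rec)
              else "update" if agent_id in previous_load
              else "insert")
    ind = "Y" if rec[3].split(",")[1].strip() == "CITY 2" else "N"
    print(agent_id, action, ind)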

This is just to give a brief idea of how to approach ETL problems. Kindly share any suggestions for improving this post.

 

For design considerations, see the 'Design considerations for ETL process' section later in this document.

Note: This question is taken from edureka.com

Data Analysis · ETL

Data Quality

Data quality is a measurement of whether data meets the requirements of the business, and it is very specific to business needs. While managing data quality, certain questions have to be answered:

  • Is the data accurate: Data accuracy means that whatever is taken out of the source system (and, where it matters, in the same order) is loaded accurately into the target system. Issues that make data inaccurate include duplication and, when performing joins, columns being mapped to the wrong values. E.g. a Customer source has the fields customerID, customerAccount, customerName, Address, and a Bank source has bankId, bankAccount, bankName, Address. If the required output is customerName, customerAccount, customerAddress, bankName, then while mapping we need to make sure the address comes from the customer source and not from the bank source. Similar issues can make data inaccurate.
  • Is the data up to date: Data warehouses are generally loaded on a monthly or weekly basis. But what if a daily report has to be generated at a specific time of day? In that case we need to set up an ODS, or a similar system that is up to date before the report creation time, to get the desired data quality, because customers don't want stale data.
  • Is the data complete: Completeness is decided by the business requirements and can vary from business to business. E.g. one business only wants to report the year an account was opened, so YYYY is enough, while a bank may need the complete account-opening date in DDMMYYYY format to apply yearly charges.
  • Is the data consistent across all data sources: While loading data we may need to take care of data lineage. If a particular piece of data is loaded into 10 different data marts, it has to be consistent across all 10 systems. E.g. a credit card expiration date should be consistent in all the data marts used by different departments of a bank; if one system shows an old expiration date, it can lead to monetary loss or legal issues.
  • Is the data duplicated: Data should not be unnecessarily duplicated. Duplication consumes serious resources (space, CPU, memory) and can lead to other data quality issues such as inaccurate results.

There are other data quality measures as well: proper keys defined on the tables, domain integrity, business rules, and so on. These measures help us check whether our data is meaningful (a minimal sketch of such checks follows).
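
A minimal Python sketch of a few such checks (completeness of a required field, duplication, and consistency across two hypothetical data marts); the records and mart names are made up:

# Hypothetical loaded rows and two data marts sharing the card expiry date.
customers = [
    {"customer_id": "C1", "customer_address": "12 Main St"},
    {"customer_id": "C2", "customer_address": "9 High St"},
    {"customer_id": "C2", "customer_address": "9 High St"},   # duplicate row
]
mart_billing = {"C1": "2026-01"}
mart_fraud   = {"C1": "2025-07"}   # inconsistent expiry date for the same card

issues = []

# Completeness check: a required field must not be null or empty.
for row in customers:
    if not row.get("customer_address"):
        issues.append(f"missing address for {row['customer_id']}")

# Duplication check: the same customer_id should not appear twice.
ids = [row["customer_id"] for row in customers]
dupes = {i for i in ids if ids.count(i) > 1}
if dupes:
    issues.append(f"duplicate customer ids: {sorted(dupes)}")

# Consistency check: the same attribute must agree across data marts.
for cust, expiry in mart_billing.items():
    if mart_fraud.get(cust) not in (None, expiry):
        issues.append(f"expiry date mismatch for {cust}")

print(issues)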

 


Data Analysis · ETL

Data Preparation For Data Analysis

This data set is taken from https://www.data.gov/food/ and is used for analysis purposes only.

Note: This example uses Tableau to perform the data preparation and analysis.

Data Preparation: Data preparation is the very first, and a crucial, step in data analysis. In this step we collect the data we want to analyze, clean and format it so that our analysis software can understand it better, and, if needed, consolidate data from multiple sources.

We need data preparation in many cases, for example:

  • The data is not in a machine-friendly layout, i.e. it may be formatted for human readability, and software such as Tableau may not be able to read it correctly.
  • Multiple sources are needed for the analysis; sometimes we need to blend/join data before creating the dashboards.

There can be many other business reasons to include a data preparation phase in the analysis cycle.

Below is an example of data that is good for human readability but not good for Tableau to read: [Image: unformatted sheet]

The data is neatly laid out and easy for a person to understand, but a machine may run into the following issues while reading it:

  • The first line, which is purely informational ("Consumer price indexes historical data, 1974 through 2016"), cannot be used in analysis and has no meaning for Tableau.
  • There are hidden columns between Annual 1979 and Annual 2016.
  • There are blank lines in between to increase readability.
  • Columns are merged to give categories to data.
  • Tableau can compute summaries during analysis, so the additional summary row (Row 5) is not needed.

Below is a picture of the same file after modification. The changes are:

[Image: Enhanced data file]

  • Categorized the food items
  • Gave the columns better names
  • Unhid the hidden columns

Blank rows are not removed, because Tableau 10 takes care of them automatically with the help of the Data Interpreter. Below is a picture of how Tableau reads this data:

[Image: Data in Tableau]

Now you can perform all types of analysis on this data, e.g. pivot the yearly columns into rows and so on (a minimal sketch of such a pivot follows).
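
For example, once the cleaned sheet is exported to a data frame, a minimal pandas sketch like the one below can pivot the yearly columns into rows; the column names follow the file above, but the categories and numbers are made up:

import pandas as pd

# Hypothetical cleaned-up extract of the CPI file: one column per year.
df = pd.DataFrame({
    "Category": ["Meats", "Dairy"],
    "Item": ["Beef and veal", "Milk"],
    "Annual 2015": [7.2, -1.3],
    "Annual 2016": [-6.3, -2.1],
})

# Pivot ("melt") the year columns into rows: one row per item per year.
long_df = df.melt(id_vars=["Category", "Item"],
                  var_name="Year", value_name="Percent change")
long_df["Year"] = long_df["Year"].str.replace("Annual ", "", regex=False).astype(int)
print(long_df)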

Let me know if you need more information on this topic, and share what you like or dislike about the content so that I can improve it.

 

ETL

Design considerations for ETL process

Below are some issues/challenges we face while designing and implementing ETL projects:

Time taken by the batch process to load data: ETL is generally a batch process (real-time processing can also be done with ETL, but it is mainly used for batch), and the window to process an entire batch is usually short. Various factors affect load time, e.g. the volume of source data and the complexity of transformations. So various techniques are used to meet the load window: process data in parallel, filter the data as early as possible, and decide whether heavy operations such as SORT are better done in the ETL tool or in the database. Many design decisions have to be made to improve performance.

Incremental Loads: Incremental loads are generally time-consuming, and we need to be very careful while designing the ETL process for them, because data already exists in the DWH and a new load should not mess up the old data. There should be indicator columns that distinguish data between two load dates, and we need a way to perform CDC or calculate a delta before doing an incremental load (a minimal sketch follows).
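
A minimal Python sketch of one common delta approach: keep a watermark (the timestamp of the last successful load) and extract only rows newer than it. The control-file name, column names and rows are hypothetical:

from datetime import datetime

WATERMARK_FILE = "last_load_ts.txt"      # hypothetical control file

def read_watermark():
    try:
        with open(WATERMARK_FILE) as f:
            return datetime.fromisoformat(f.read().strip())
    except FileNotFoundError:
        return datetime.min                 # first run: load everything

def write_watermark(ts):
    with open(WATERMARK_FILE, "w") as f:
        f.write(ts.isoformat())

def incremental_extract(source_rows):
    """Return only rows updated after the last successful load."""
    watermark = read_watermark()
    delta = [r for r in source_rows if r["updated_at"] > watermark]
    if delta:
        write_watermark(max(r["updated_at"] for r in delta))
    return delta

# Hypothetical source rows with an update timestamp column.
rows = [{"id": 1, "updated_at": datetime(2017, 1, 1, 9, 0)},
        {"id": 2, "updated_at": datetime(2017, 1, 2, 9, 0)}]
print(incremental_extract(rows))   # first run returns both rows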

Change Data Capture: Data should be handled differently depending on whether we want to keep history or only load current data. If we keep history, we need to decide how much space to use and which methodology to follow (SCD Type 2, SCD Type 3, etc.).

Data Duplication: We may need to analyze the data before loading to check whether there are duplicates in the source system. There is no point loading duplicate data into the data warehouse, as it may further corrupt results when performing joins.

Pre- and Post-load Data Validation: It is good practice to create data validation jobs to make sure data is loaded correctly and as per requirements. Validation jobs can run after every step or as the last job.

Data Dependencies: Sometimes data must be loaded in a particular order. This can be handled either in the load logic or by scheduling the jobs accordingly. E.g. if we receive two different files, say account information and customer demographic information, there may be a dependency that account information can only be loaded after the customer's demographic details are loaded.

Recovery Methods: While designing an ETL job, we need to think about recovering both the data and the job itself in case it fails before completing. Depending on the requirement, it should be decided whether changes need to be rolled back on failure, or whether interval commits should be used to avoid loading the same data again. Having recovery points is a good approach for heavy data loads.

Job Scheduling: Scheduling methods should be devised as per the requirement; jobs can be time-dependent or file-dependent (triggered by the arrival of a file).

Error Handling: A proper process has to be set up for job failures in production: what should be done when a job fails, and how the system resumes without affecting consumers. These scenarios and decisions should be covered during the design phase. A strategy should be in place to bring the system back to a running state, and cleanup activities should ensure the next load can start smoothly. To systematize error handling, proper log and audit tables should be created (a minimal sketch follows).
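
A minimal Python sketch of wrapping a load step with audit logging and cleanup so that a failed run can be diagnosed and rerun cleanly; the step name and audit-row structure are made up:

import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
audit_rows = []      # stand-in for an audit table in the warehouse

def run_step(step_name, func):
    """Run one ETL step, record an audit row, and clean up on failure."""
    started = datetime.now()
    try:
        rows = func()
        audit_rows.append({"step": step_name, "status": "SUCCESS",
                           "rows": rows, "started": started, "ended": datetime.now()})
    except Exception as exc:
        audit_rows.append({"step": step_name, "status": "FAILED",
                           "rows": 0, "started": started, "ended": datetime.now()})
        logging.error("step %s failed: %s", step_name, exc)
        cleanup(step_name)
        raise                      # let the scheduler mark the job as failed

def cleanup(step_name):
    logging.info("cleaning up partial output of %s so the next run can start fresh", step_name)

run_step("load_stage_table", lambda: 1000)   # pretend load of 1000 rows
print(audit_rows)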

These are just a sample of the challenges from my experience; there are many more issues to take care of while designing a sound ETL process. Let me know if you have any comments, suggestions or complaints regarding this material.