ETL


ETL Process

ETL – Extract, Transform and Load

The ETL process is used when a business needs to move data from one source system to another target system, or from one form (structure) of data to another, for example from OLTP to OLAP systems, or from an enterprise OLAP system to subject-wise OLAPs. It is basically a layer where data is taken from one system, transformed as per business logic and loaded into another system.

Extraction: An OLTP landscape typically has multiple disparate source systems. Data can be extracted from several sources at once, say data from a file and a DBMS can be extracted in the same job, combined and processed together, or data could be taken from a single source system only. The extraction strategy mainly depends on the source systems and on how data is stored in them.

Transformation: ETL transforms the data from one format to another, i.e. it brings the data into a homogeneous format as expected by the target system. The following can be considered transformation classifications:
-Data type conversions
-Joining two or more data sources
-Performing calculations or aggregations on the source data
-Business logic
-Generating surrogate keys for the data warehouse

Below are some examples of scenarios where we can use ETL. These are just to give an idea; in real life ETL is used to handle far more complex requirements that would be tedious to implement otherwise:

E.g.:

  • If my source, which could be a file or a database, has 10 columns and my target system needs only 5 of them, how will we load the data? We can use ETL.
  • If the source has columns such as first name and last name, but the target system, which could be used for, say, reporting purposes, needs a full name, ETL can be used to derive it.
  • If my source has dates as DDMMYYYY but the target wants them as YYYY-MM-DD, ETL can handle the conversion.
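The three examples above can be expressed in a few lines of code. Below is a minimal sketch using Python with pandas (an assumption, since no specific ETL tool is named here); the file names and column names are hypothetical.

import pandas as pd

# Hypothetical source file with more columns than the target needs
source = pd.read_csv("customers_source.csv")

# 1. Derive full_name from first_name and last_name for reporting
source["full_name"] = source["first_name"].str.strip() + " " + source["last_name"].str.strip()

# 2. Keep only the 5 columns the target system wants
target = source[["cust_id", "full_name", "birth_date", "city", "phone"]].copy()

# 3. Convert birth_date from DDMMYYYY to YYYY-MM-DD
target["birth_date"] = pd.to_datetime(target["birth_date"].astype(str), format="%d%m%Y").dt.strftime("%Y-%m-%d")

# Load: written to a file here; in a real job this could be a database insert
target.to_csv("customers_target.csv", index=False)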

Cleansing the data: The source may contain corrupt or duplicate data; ETL can easily handle such scenarios based on the business rules provided.

E.g:

  • What should be done when there is duplicate data in the source: do we discard the data and inform the source system about the issue, or do we process it by keeping a single record?
  • NULL handling in data.
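As a rough illustration of such cleansing rules in code, here is a pandas sketch with hypothetical file and column names; the exact rules would of course come from the business.

import pandas as pd

records = pd.read_csv("accounts_source.csv")

# Duplicate handling: keep one record per account_id and set the rest aside
# so they can be reported back to the source system
duplicates = records[records.duplicated(subset=["account_id"], keep="first")]
duplicates.to_csv("rejected_duplicates.csv", index=False)
cleansed = records.drop_duplicates(subset=["account_id"], keep="first")

# NULL handling: drop rows with no account_id and default missing balances to 0
cleansed = cleansed.dropna(subset=["account_id"])
cleansed["balance"] = cleansed["balance"].fillna(0)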

Loading: ETL acts as a bridge between source and target. Taking data from one source system and connecting to and loading another system (even when the two are very different, say a mainframe and a relational database) can all be done in a single job.

To read about good ETL design, continue to the next post.

ETL

Interview Questions asked for DWH professional positions

Below are some interview questions I have faced in the course of job hunting. There is no single answer to these questions, and more elaborate answers are possible, but they will give you a heads-up on what to expect in an interview.

Q1: What is the end-to-end process followed in your project, from processing ETL source files to reporting, along with the tools and technologies used?

Ans: Below was the flow created in my project:

Source File–>ETL Jobs–>Stage Table–> ODS tables–>Data warehouse–>Data Cubes(Summarized data)–>Reporting tools

(In your project the process may be somewhat different)

Q2: What is the full form and use of ODS in a data warehouse?

Ans: Full form of ODS is Operational Data Store. Definition of ODS is : An operational data store (or “ODS“) is a database designed to integrate data from multiple sources for additional operations on the data. Unlike a master data store, the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting.

Q3: How will you print name and salary of the employee from employee table who has maximum salary?

Ans: select name,salary from employees where salary = (select max(salary) from employees)

Q4: Why can't we use a group by function in the above question?

Ans: If you write a query like

select name, max(salary) from employees;

it will give the error “ORA-00937: not a single-group group function”. This issue comes when we select an individual column together with a group function, unless the individual column is included in a GROUP BY clause. Adding “group by name” makes the query run, but it then returns the maximum salary per name rather than the single employee with the maximum salary, which is why the subquery approach in Q3 is used.

Q5: How will you optimize a query in Teradata?

Ans: Some query performance tuning techniques are as below:

  • Look at the explain plan and try to understand where the problem is
  • Check whether the Primary Index is defined properly and, if so, whether the data loaded in the PI column is reasonably unique
  • If a partitioned primary index is defined on the table, try to use it in the query, otherwise performance will be degraded
  • Avoid using functions such as TRIM in join conditions
  • Make sure no implicit conversion is happening on the join columns; try to keep the data types the same
  • Avoid using an IN list when there are many values to compare; use a join instead, possibly by creating a static table of the matching values

These are just a few tips; there is a lot more to explore on this topic.

Q6: What are the typical scheduling software used for scheduling ETL jobs on Linux/Unix?

Ans: Autosys and Control-M are the two most commonly used tools for scheduling ETL jobs. Some ETL tools come with their own schedulers, but they are not as flexible as tools built specifically for scheduling.

Q7: Give the pros and cons of using Autosys as a scheduling agent.

Ans: Pros:

  • You can create jobs using its GUI version or JIL files
  • It has a wide variety of commands to handle job scheduling and execution; for example, you can force-start a job, or make it start on the arrival of a file or at a particular time
  • You can put jobs on hold without really affecting the downstream jobs that depend on them

Cons:

  • If you have to create JIL files for, say, hundreds of jobs, things may get tedious

Q8: What are the different types of data warehouse schemas?

Ans: Below are the different types of schemas in DWH:

  • Star Schema: A star schema resembles a star, with the fact table in the centre and the dimension tables at the points of the star. This schema is the simplest of all the data warehouse schemas. Below is the star schema of a library management system.

[Image: star schema of a library management system]

  • Snowflake Schema: One dimension is split into multiple dimensions, i.e. the dimension tables of the star schema are normalized. Below is the image of a retail management snowflake schema.

[Image: retail store snowflake schema]

  • Galaxy/Hybrid Schema: This is the schema where conformed dimensions are used, i.e. a single dimension is shared by multiple fact tables. Dimensions can be further normalized in this type of schema.

Q9: What is an SCD? Describe the different types of SCDs.

Ans: SCD means slowly changing dimension, i.e. a dimension whose attributes change slowly over time, say a customer whose address may change several times over the years.

There are 3 types of SCD implementations in DWH:

  • SCD Type 1: Overwrite the old value
  • SCD Type 2: Add a new row
  • SCD Type 3: Add a new column

Q10: How will you implement SCD type 2 in a banking system?

Ans: Suppose a customer changes their address; it can be implemented as below:

Initial record:

Actual Customer Table
id | name | year | address
1  | ABC  | 2016 | Boston

Now when this customer changes the address, we capture the change by adding a new row with start and end dates:

SCD2 Customer Table
id | name | address  | start_date | end_date
1  | ABC  | Boston   | 01-01-2015 | 31-12-2016
1  | ABC  | Portland | 01-01-2017 | 01-01-2099

In this implementation, start_date is the date on which the record first arrived; when the address changes, the existing record's end_date is updated with the date of the change and a new record is inserted whose end_date is some far-future date.
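A minimal sketch of this logic in Python/pandas is shown below; the DataFrames stand in for hypothetical table extracts, and a real implementation would typically run as SQL updates/inserts or inside an ETL tool.

import pandas as pd

TODAY = pd.Timestamp.today().normalize()
FAR_FUTURE = pd.Timestamp("2099-01-01")

# Hypothetical current dimension row and incoming change
dim = pd.DataFrame([{"id": 1, "name": "ABC", "address": "Boston",
                     "start_date": pd.Timestamp("2015-01-01"), "end_date": FAR_FUTURE}])
incoming = {"id": 1, "name": "ABC", "address": "Portland"}

# Find the currently active row for this customer
active = (dim["id"] == incoming["id"]) & (dim["end_date"] == FAR_FUTURE)

if active.any() and dim.loc[active, "address"].iloc[0] != incoming["address"]:
    # Close the old row ...
    dim.loc[active, "end_date"] = TODAY - pd.Timedelta(days=1)
    # ... and append a new row that becomes the active version
    new_row = {**incoming, "start_date": TODAY, "end_date": FAR_FUTURE}
    dim = pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)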

These are a few questions I encountered during KP’s interview. Please help me grow the list by adding other questions in the comments.

Some information is collected from www.wikipedia.com

python · Web

Web Scraping | Mini Project | Scrape Financial data and send email to your gmail account

While learning a new language or methodology, it is always good practice to start creating small projects to understand the real-life difficulties you will face on actual projects. I am learning Python at the moment, and the best way to test my knowledge is to build projects with it. As beginners we generally don’t have a giant, ambitious project to work on, so why not work on small, realistic projects, just like building sandcastles before piling up the rocks to build a grand castle.

With that in mind, I have come up with the project below to get started on. Try to build the project yourself, using the hints jotted down below, before reading the attached code file.

Stock Data Scraping :

Goal: As the title mentions, the project is to extract the necessary stock information and create a schedule that sends a daily email with that information.

Project Specification:

  • Identify the website you want to scrape
  • Identify patterns in the URL and inspect the page objects
  • Take the ticker symbols as input from a file, so that the code does not have to change every time a symbol is added
  • Use your Gmail account to send the email
  • The email body should contain the stock symbols and their current prices
  • The subject should be: “Your Daily Stock Digest”
  • The mail should go out with a proper From name, not just the email id
  • The To field should have multiple recipients
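For the email-sending part of the spec, here is a minimal sketch using Python’s standard smtplib and email modules; the Gmail credentials, recipient addresses and the prices dictionary are placeholders you would replace with your own values (Gmail also needs an app password or equivalent for SMTP access).

import smtplib
from email.mime.text import MIMEText

def send_digest(prices, sender, password, recipients):
    # Build the body from the scraped symbol -> price mapping
    body = "\n".join(f"{symbol}: {price}" for symbol, price in prices.items())

    msg = MIMEText(body)
    msg["Subject"] = "Your Daily Stock Digest"
    msg["From"] = f"Daily Stock Digest <{sender}>"   # proper From name, not just the email id
    msg["To"] = ", ".join(recipients)                # multiple recipients

    with smtplib.SMTP("smtp.gmail.com", 587) as server:
        server.starttls()
        server.login(sender, password)
        server.sendmail(sender, recipients, msg.as_string())

# Hypothetical usage:
# send_digest({"AAPL": "144.15", "ZEN": "27.86"}, "you@gmail.com", "app-password",
#             ["friend1@example.com", "friend2@example.com"])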

Below is the input file:

[Image: input file with ticker symbols]

Output should look like:

[Image: sample output email]

The yellow-highlighted part should be the email id from which the mail is sent. Below is the code for this project:

Github Project

To get a head start on this project, go through the link: Web Scraping – Stock Data Scraping Using Python 3

Kindly share your views on this article; it will encourage me to post more content like this.


Web

Web Scraping – Stock Data Scraping Using Python 3

Stock Data: The stock market is one of the best performing asset classes by far. To invest in the stock market we don’t need a lump sum amount, so anyone can buy or sell shares at a fraction of the cost (paying only a commission). While looking to buy quality stocks, the latest stock price is an important parameter to look at, along with others.

Web Scraping: Web scraping means data extraction from websites. It is generally used to gather specific data that can be analyzed and used for various purposes. In this article, data will be scraped from a financial website without browsing each web page separately.

If we need to follow, say, 10 different shares and monitor their prices daily, one way is to go to the website every day and check each ticker symbol’s price, which is tedious. The second method is to create an automated routine that gives this information just by running it, or by scheduling it. This way it is also possible to build your own database, say a month’s performance for a particular stock.

Below are the steps involved in creating an automated script for web scraping:

  • Inspect the web page manually
  • Find the patterns in the data to be extracted
  • Fetch the web page (which basically means downloading its contents)
  • Extract the information using that pattern

[Note: This article is for information purposes only]

Once the page is fetched, the extraction takes place. The content of the page is parsed, searched, reformatted and copied to the desired output. The extracted information could be just a small piece of the entire page; in our example, it is just the latest stock price value.


import re
import urllib.request

symbollist = ["AAPL", "ZEN", "DE", "MON", "FWONA", "AUO"]

# The price on the quote page is wrapped like <sup>$</sup>144.15<span> USD</span>,
# so this pattern captures the number between those two tags
pattern = re.compile(r'<sup>\$</sup>(.+?)<span> USD</span>')

for symbol in symbollist:
    # Masked the name of the website; instead of abc use the website you want to
    # scrape and adjust the URL as per that website's URL format
    url = "https://www.abc.com/stock/quote/" + symbol + "?q=" + symbol
    htmltext = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    stockprice = re.findall(pattern, htmltext)
    print("Price of", symbol, "=", stockprice)


Output :

Price of AAPL = ['144.15']
Price of ZEN = ['27.86']
Price of DE = ['108.20']
Price of MON = ['113.95']
Price of FWONA = ['32.27']
Price of AUO = ['3.90']

In the next post, we will build a mini project in web scraping using Python.

Please share your thoughts on the article. If you like it, please like, share or comment. Happy reading!

ETL

Step-by-step approach to solve an ETL problem | ETL mini project

This post gives an insight into the basic approach to solving any ETL problem:

Question: An insurance company wants to capture details regarding an agent – Agent ID, Agent name, Office address, and Office contact. These details are to be captured in an SCD type 2 dimension, Agent_Dim, once every 7 days. The agent details are obtained in a file as below:

[Image: sample agent data file]

Describe the loading process and the table structure of Agent_Dim. The target table has the following structure – AGT_ID, START_DATE, END_DATE, NAME (VARCHAR), ADDRESS (VARCHAR), CITY (VARCHAR), ZIP (VARCHAR), PHONE, IND. All the agent IDs coming from the file need to be checked against the parent AGENT_MASTER. Also, update the indicator IND to Y if the agent belongs to CITY 2.
Solution:

When we get a file and need to apply certain transformation rules to it, the first step is to profile the data in the file, i.e. check the quality of the data against the mapping provided. This can be done either manually using spreadsheets (if the file is small) or with the help of an ETL tool. In this example, the following points should be checked while profiling:

  • Check that all source data types are compatible with the target data types, e.g. text can't always be converted to a number, and a larger field can't fit into a smaller one
  • Check that any field which is NOT NULL in the target table is also not null in the source, or that an appropriate mapping rule is given for defaults
  • Decide how any special characters coming in the source data should be handled

Perform any other checks as needed per the requirements.
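As an illustration of what this profiling could look like in code, below is a small sketch using pandas; the field names follow the mapping later in this post, while the file name, the CSV format and the specific checks are assumptions.

import pandas as pd

agents = pd.read_csv("agent_details.csv", dtype=str)

# Basic profiling: row counts, nulls and duplicates
print(agents.shape)
print(agents.isnull().sum())
print("Duplicate agent ids:", agents.duplicated(subset=["agent_id"]).sum())

# Format check: the office contact should be a valid 10-digit number
bad_phones = agents[~agents["office_cntct"].str.fullmatch(r"\d{10}", na=False)]
print("Records with invalid phone numbers:", len(bad_phones))

# Special characters in the agent name (anything outside letters, digits, space, $ or ~)
odd_names = agents[agents["agent_name"].str.contains(r"[^A-Za-z0-9 $~]", na=False)]
print("Records with unexpected characters:", len(odd_names))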

Steps in the ETL job:

  • As phase 0 or the first task, unload the master agent data into a file to perform a lookup operation (which eliminates the need to hit the database for every record); also unload yesterday's load to compare with the current load (say the files are masteragent.lkp and yesterdayagentload.lkp)
  • Read the file and perform basic checks, e.g. whether the mobile number is a valid 10-digit number (along with any other issues found during profiling)
  • Perform the lookup operation for each incoming record and filter out agents which do not have records in the master data
  • Perform the below CDC (change data capture) operation; a code sketch of this classification follows after these steps:
    • Create a hash of the complete record, say
      temp.hashrecord = trim(agent_id)+trim(convert(date,DDMMYYYY))+trim(agent_name)+trim(office_address)+trim(office_cntct)
      lookup.hashrecord = trim(agent_id)+trim(start_date)+trim(name)+trim(address+city+zip)+trim(phone)
      Compare the temp record with all the lookup records: if a match is found, mark it 'No change'; if no match is found, check whether the input record's agent id is present in the lookup file; if yes, mark it as an update, otherwise as an insert
  • Discard the 'No change' records. For all other records, check whether the city is 'CITY 2' by splitting the input record's office_address field; create a new output field IND and mark it 'Y' if it matches, else 'N'
  • For insert-only records, perform the below mappings:
    out.agent_id :: in.agent_id
    out.start_date :: convert(in.date, 'DDMMYYYY')
    out.end_date :: you can leave it empty or give some future date, say '01-01-2999'
    out.name :: in.agent_name (if the $~ characters are valid keep the value as is, else strip these 2 characters using appropriate functions)
    out.address :: substring of office_address using ',' as the delimiter, taking the first part
    out.city :: substring of office_address using ',' as the delimiter, taking the second part
    out.zip :: substring of office_address using ',' as the delimiter, taking the third part
    out.phone :: in.office_cntct
    out.ind :: the temporary indicator created above
  • For update records, perform an upsert on the existing record: fill its end_date with today's date and create a new record as in the insert step.
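A rough Python sketch of the hash-based CDC classification above is shown below; the field names follow the mapping, while reading the feeds as CSV, the hashing function and the assumption that yesterday's unload has the same layout as today's input are all simplifications for illustration.

import csv
import hashlib

def record_hash(row, fields):
    # Hash of the trimmed, concatenated fields, mirroring the hashrecord idea above
    joined = "|".join(row[f].strip() for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def classify(input_file, yesterday_file):
    fields = ["agent_id", "date", "agent_name", "office_address", "office_cntct"]

    with open(yesterday_file, newline="") as f:
        previous = list(csv.DictReader(f))
    previous_hashes = {record_hash(r, fields) for r in previous}
    previous_ids = {r["agent_id"].strip() for r in previous}

    actions = []
    with open(input_file, newline="") as f:
        for row in csv.DictReader(f):
            if record_hash(row, fields) in previous_hashes:
                actions.append((row["agent_id"], "no change"))  # identical record seen yesterday
            elif row["agent_id"].strip() in previous_ids:
                actions.append((row["agent_id"], "update"))     # known agent, changed attributes
            else:
                actions.append((row["agent_id"], "insert"))     # brand new agent
    return actions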

This is just to give a brief idea of how to proceed with ETL problems. Kindly give me your suggestions to improve this post, if any.


For design considerations, check out the previous post.

Note: This question is taken from edureka.com

Data Analysis · ETL

Data Quality

Data quality is a measurement of whether the data meets the requirements of the business or not, and it is very specific to business needs. While managing data quality, certain questions have to be answered:

  • Is the data accurate: Data accuracy means that the data taken out of the source system is loaded accurately (and, where relevant, in the same order) into the target system. Possible issues that make data inaccurate are data duplication and, while performing joins, columns getting mapped to the wrong values. E.g. a Customer source has the fields customerID, customerAccount, customerName, Address, and a Bank source has the fields bankId, bankAccount, bankName, Address. If we need output as customerName, customerAccount, customerAddress, bankName, then while mapping we need to make sure the address comes from the customer source and not from the bank source. There can be other similar issues that make data inaccurate.
  • Is the data up to date: Generally, data warehouses are loaded on a monthly or weekly basis. But what if a daily report has to be generated at a specific time of day? In that case we need to set up an ODS, or a similar system, that is up to date before the report creation time in order to achieve the desired data quality, because customers don't want obsolete data.
  • Is the data complete: Data completeness is decided based on the business requirements and can vary from business to business. E.g. one business wants to report only the year an account was opened, i.e. YYYY, while a bank may need the complete account opening date in DDMMYYYY format in order to apply yearly charges.
  • Is the data consistent across all data sources: While loading the data we may need to take care of data lineage. If a particular piece of data is loaded into 10 different data marts, it has to be consistent across all 10 systems. E.g. a credit card expiration date should be consistent across all the data marts used by the different departments of a bank; if one system shows an old expiration date it can lead to monetary loss or legal issues.
  • Is the data duplicated: Data should not be unnecessarily duplicated. Duplication leads to serious resource consumption (space, CPU, memory, etc.) and can also cause other data quality issues such as inaccurate data.

There are other data quality measures as well, such as proper keys defined on the tables, domain integrity and business rules. These measures help us check whether our data is meaningful or not.
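A few of these checks can be automated as part of the load; the sketch below uses pandas with hypothetical file and column names, just to show the idea.

import pandas as pd

def basic_quality_report(df, key_columns):
    return {
        "row_count": len(df),
        # Completeness: how many nulls per column
        "null_counts": df.isnull().sum().to_dict(),
        # Duplication: rows repeated on the business key
        "duplicate_keys": int(df.duplicated(subset=key_columns).sum()),
    }

# Hypothetical usage on a customer extract
customers = pd.read_csv("customer_extract.csv")
print(basic_quality_report(customers, key_columns=["customerID"]))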


Note: Featured image is taken from internet.

Data Analysis · ETL

Data Preparation For Data Analysis

This data set is taken from https://www.data.gov/food/ and is used for analysis purposes only.

Note: Tableau is used to perform the data preparation and analysis in this example.

Data Preparation: Data preparation is the very first and a crucial step in data analysis. In this process we collect the data on which we want to perform the analysis and clean/format it so that our analysis software can understand it better; if needed, we may also have to consolidate data from multiple sources.

We need to do data preparation in many cases, such as:

  • The data is not in a machine-friendly format, i.e. it may be laid out for better human understanding, but software such as Tableau may not be able to read it correctly.
  • Multiple sources are needed for the analysis. Sometimes we need to perform data blending/joining before creating the dashboards.

There could be many other business reasons for which we may need a data preparation phase in our analysis cycle.

Below is an example of data which is good for human readability but not good for Tableau to read:

[Image: unformatted sheet]

Though this data is very neatly created and easy for a person to understand, a machine may find the following issues while reading it:

  • The first line, which is more of a description, i.e. “Consumer price indexes historical data, 1974 through 2016”, can't be used in the analysis and has no meaning for Tableau.
  • There are hidden columns between Annual 1979 and Annual 2016.
  • There are blank lines in between to increase readability.
  • Columns are merged to give categories to the data.
  • Tableau can compute summaries during the analysis, so we really don't need the additional summary row (row 5).
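If the same clean-up had to be done in code rather than by editing the spreadsheet by hand, it might look roughly like the sketch below; the file name, the summary-row label and the 'Item' column are assumptions based on the description above.

import pandas as pd

# Skip the descriptive title line so the real header row becomes the column names
prices = pd.read_excel("cpi_historical.xlsx", sheet_name=0, skiprows=1)

# Drop the blank rows that were added only for readability
prices = prices.dropna(how="all")

# Drop the pre-computed summary row (assumed to be labelled 'All items' here),
# since Tableau and similar tools can aggregate on their own
prices = prices[prices["Item"] != "All items"]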

Below is a picture of the same file after modifications. The following changes were made:

[Image: enhanced data sheet]

  • Categorized the food items
  • Better column names
  • Unhid the hidden columns

The blank rows are not removed, as Tableau 10 takes care of that automatically with the help of the Data Interpreter. Below is a picture of how Tableau reads this data:

[Image: the data as read in Tableau]

Now you can perform all types of analysis on this data, e.g. we can pivot the year columns into rows, and so on.
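The same pivot can also be done outside Tableau, for instance with pandas' melt; the hypothetical snippet below uses the same assumed file and 'Item' column as the earlier sketch.

import pandas as pd

prices = pd.read_excel("cpi_historical.xlsx", sheet_name=0, skiprows=1).dropna(how="all")

# Turn the 'Annual 1979' ... 'Annual 2016' columns into (Year, CPI) rows
year_columns = [c for c in prices.columns if c.startswith("Annual")]
long_form = prices.melt(id_vars=["Item"], value_vars=year_columns,
                        var_name="Year", value_name="CPI")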

Let me know if you need more information on this topic, and share your likes/dislikes about the posted content so that I can improve it.