ETL – Extract Transform and Load
ETL process is used when business needs to move data from one source system to another source system or target system, one form to another form i.e structure of data, from OLTP to OLAP systems , Enterprise OLAP to subject wise OLAPs etc. It is basically a layer where data is taken from system, transformed as per business logics and loaded in another system.
Extraction: There are multiple disparate source systems in an OLTP systems. Data can be extracted from various sources at once say data from file or DBMS can be extracted together in job,combined and do some processing on that or data could be taken from one source system only. Data Extraction strategies mainly depends upon source systems and how data is stored in source systems.
Transforming: ETL transforms the data from one format to another or we can say make the data in homogeneous format as per target systems. Below can be considered as transformation classifications:
-Data Type conversions
-Joining one or more data sources
-Performing calculations or aggregations on the source data
-Generating surrogate keys for datawarehouse
Below are some examples of scenarios where we can use ETL. These are just to give idea but in real life ETL is used to handle lot more complex requirements and which are tedious to do otherwise:
- If in my source which could be a file or database, I have 10 columns and in my target system I only want 5 columns out of that. How will we load data we can use ETL.
- If in source there are columns like first name, last name but in my target system which could be used for say reporting purpose I want to use full name , ETL can be used for this purpose.
- My source has date as DDMMYYY but target wants it in form YYYY-MM-DD, ETL can be used for this
Cleansing the data : We can get corrupt data or duplicate data , ETL can easily handle such scenarios based on business rules provided.
- What should be done when we have duplicate data in source, do we want to discard this data and inform source about this issue or we want to process it by keeping one single record.
- NULL handling in data.
Loading: ETL is used as bridge between source and target. Data taking from one source system and connecting and loading in another system(even if both are different systems say mainframe and database ) all can be done in one single jobs.