Below are some issues/challenges we face while designing and implementing ETL projects:
Time taken by batch process to load data: ETL is generally batch process (real time processing can also be done using ETL but mainly it is used for batch ) and window to process entire batch is generally short. There could be various factors affecting load time i.e. volume of source data, complexity of transformations etc.So various techniques are used to handle SOAs i.e data can be processed in parallel mode, filter the data as early as possible, if its better to process data in ETL or database for heavy operations like SORT etc. Various decision decisions has to be taken care to improve performance.
Incremental Loads: Incremental loads are generally time consuming process and we need to be very careful while designing ETL process for incremental loads as data is already existing DWH and new load should not mess up old data. There should be some columns indicators which distinguish data between 2 dates. We need to find ways to perform CDC or calculating delta before doing Incremental loads.
Change Data Capture: Data should be handled in different ways if we want to save history or only load current data. If we want to save history how much space we want to utilize, which methodology we want to use SCD2 or SCD3 etc.
Data Duplication: We may need to analyze the data before loading to check if we have any duplicates in the source system. There is no point loading duplicate data in datawarehouse as it may further corrupt the data while performing joins etc.
Pre and Post Data Validation : It is good practice to create data validations jobs to make sure data is getting loaded correctly and as per requirements. Validation jobs could be performed after every steps or in the last job.
Data Dependencies: Sometimes data should be loaded in particular manner. This can be taken care either while loading the data or scheduling different jobs. E.g if we are getting 2 different files say account information and demographic information of customer and there is dependency that account information can only be loaded after demographic details of customer is loaded.
Recovery methods: While designing an ETL job, we need to think about recovery of the data and job also just in case job fails before completing. As per the requirement it should be decide whether changes needs to be rolled back if there is failure or after if we need to have interval commits to avoid loading same data again. It is good approach to have recovery points in case heavy data loads.
Job Scheduling: Methods should be devised for job scheduling as per requirement. It could be time dependent or file dependent.
Error Handling: Proper process has to be setup for job failures in production. It should be considered that what should be done if job fails in production, how will the system resume without effecting the consumers. These kind of scenarios/decisions should be taken during design phase. Strategy should be in place to bring system back to running. Cleanup activities should be in place for smooth next load start. To systematize the error handling process, proper log and audit tables should be created.
These are just some sample of challenges as per my experience and there could be lot more type of issues need to be taken care of while designing sound ETL process. Let me know if you have any comments/suggestions/complaints regarding this stuff.