What is ETL – Extraction, Transformation and Loading

ETL ProcessData warehousing needs single, consistent and complete data store for operational systems. This task to convert operational data to data warehouse data is called ETL i.e Extraction, Transformation and Loading.

 

You might ask why need we need to transformation? Is operations data not correct? Answer lies in the heterogeneous systems. E.g. If if one system is using 0 for male and 1 for female record on the other hand other system is using M for Male and F for female then what. If one system is using kg as unit of measure while other is using pound. These kind of things need to be taken care of before we query on the data warehouse. 50% data warehouse projects gets scrapped or fail due to low quality of data.

What is ETL – Extraction, Transformation and Loading of OLTP data avoids poor quality of data.

  • Extraction:

    • Extract data from source -> intermediate data storage or data warehouse directly after ETL completes. Intermediate data storage can be flat files, virtual storage access method (VSAM), information management system (IMS) or Indexed Sequential Access Method (IASM).
    • Main goal is to convert data to single format
    • Check for data pattern
    • Reject part of complete data if not met
  • Transformation:
    • Apply series of rules to load data to target
    • Rules can be like
      • Select particular columns, Remove null columns
      • Translate to uniform DW code (e.g. Male,Female to M, F)
      • Derive new values (Revenue = Price * Quantity)
      • Sorting, Aggregating, Joining (e.g First and last name to name)
      • Simple or complex data validations
      • Reject if validation fails
      • Look for slowing

        Data warehouse ETL

        Data warehouse ETL

  • Loading
    • Loads the data into the final target
    • Either overwrite or cumulative on existing data
    • Delete older data

ETL cycle

  1. ETL Cycle initiation
  2. Build reference data
  3. Extract data from sources
  4. Validate the data
  5. Transform (clean, apply business rules, check for data integrity, create aggregates or dis-aggregates)
  6. Stage (load into staging area, if used)
  7. Do audit reports for data (for example, on compliance with business rules. Also, in case of failure, helps to diagnose/repair)
  8. Load to target tables
  9. Archive if nay such rules are set
  10. Clean up older code as per rules

 

Post by- Shridhar

Leave a Reply

Your email address will not be published. Required fields are marked *