Pentaho: An ETL Tool for Data-Intensive Organizations – My Top 7 Transformation Steps
By Andrew Tumwesigye
Data and Business Intelligence Analyst, Dimension Metrics
10th Mar 2025
Introduction
In 2024, after completing various data analytics and visualization certifications, I noticed that job postings increasingly sought business intelligence analysts proficient in ETL (Extract, Transform, Load) tools. Tools like Pentaho, Apache Kafka, Apache Airflow, Apache Hop, and Talend were frequently mentioned. Having already trained in Power BI, I decided to delve into ETL tools. The challenge was selecting the right one, considering factors such as local installation feasibility, user-friendly graphical interfaces, availability of comprehensive community editions, and accessible training programs.
Establishing Criteria for Selecting an ETL Tool
With these criteria in mind, I explored several ETL tools:
Talend: Discontinued its Community Edition, offering only an enterprise version.
Apache Kafka and Apache Airflow: Faced challenges in local setup due to large installation files and complex configurations.
Apache Hop: While easy to set up, it had a less engaging GUI and lacked comprehensive training resources.
Shifting my approach, I searched for ETL tools with community editions and robust training programs. I discovered the Pentaho Data Integration (PDI) Master Certification program offered by StartTech on Udemy. PDI met all my criteria: straightforward download and setup, an intuitive GUI, a free community edition with extensive functionality, and a structured training program. Although the training was designed to take three to four months, I completed it in two, thanks to my prior experience with extraction and transformation in Power Query and Power BI.
Exploring PDI's ETL Capabilities
In PDI, the foundation of data processing is the "transformation." Each transformation consists of a series of steps that extract, transform, and load data. PDI can extract data from multiple formats and sources, including CSV, Excel, XML, and JSON files, cloud platforms like AWS and Azure, and local databases such as MySQL, PostgreSQL, and MongoDB. Notably, PDI handles large datasets well, something that tools like Power Query and the free version of Power BI struggle with.
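For readers who like to see the idea in code, the short Python sketch below mimics this kind of multi-format extraction with pandas. It is only an analogy for PDI's input steps, not how PDI itself works; the in-memory sources stand in for real files and database connections.

import io
import pandas as pd

# In-memory stand-ins for the kinds of sources an input step would read (CSV, JSON, databases, ...)
csv_source = io.StringIO("order_id,amount\n1,100\n2,250")
json_source = io.StringIO('[{"order_id": 3, "amount": 80}]')

sales_from_csv = pd.read_csv(csv_source)     # in practice: pd.read_csv("sales.csv")
sales_from_json = pd.read_json(json_source)  # in practice: pd.read_json("sales.json")

# Combine the extracts into a single table for downstream transformation
print(pd.concat([sales_from_csv, sales_from_json], ignore_index=True))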
My Top 7 Transformation Steps in PDI
With numerous transformation options available, users can create intricate data pipelines. Each function allows for data preview to ensure accuracy. Here are my top seven transformation steps:
1. Sort Rows and Sorted Merge:
These steps are essential when dealing with multiple files representing different periods, such as monthly sales data. After extraction, and once the metadata and column names are consistent across files, the "Sort Rows" step sorts each table by its primary key in ascending order. The "Sorted Merge" step then combines the sorted tables while preserving that order, which matters because later steps, such as duplicate removal, expect sorted input.
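As a rough illustration of the same sort-then-merge pattern, here is a minimal pandas sketch; the monthly tables and column names are invented for the example and are not part of PDI itself.

import pandas as pd

# Hypothetical monthly sales extracts with identical column names (metadata already aligned)
jan = pd.DataFrame({"order_id": [3, 1, 2], "amount": [120, 80, 95]})
feb = pd.DataFrame({"order_id": [6, 4, 5], "amount": [60, 150, 200]})

# "Sort Rows": order each table by its key in ascending order
jan_sorted = jan.sort_values("order_id")
feb_sorted = feb.sort_values("order_id")

# "Sorted Merge": combine the pre-sorted tables so the output stays ordered by the key
merged = (pd.concat([jan_sorted, feb_sorted])
            .sort_values("order_id")
            .reset_index(drop=True))
print(merged)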
2. Unique Values:
This step identifies and removes rows with duplicate primary keys that may appear after merging datasets. In large datasets, duplicates are common, and eliminating them preserves data integrity.
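A pandas analogue of this de-duplication, again with made-up data, might look like this:

import pandas as pd

# Hypothetical merged table where the same order_id appears twice
merged = pd.DataFrame({"order_id": [1, 2, 2, 3], "amount": [80, 95, 95, 120]})

# Keep only the first occurrence of each key, mirroring duplicate removal after a merge
deduplicated = merged.drop_duplicates(subset="order_id", keep="first")
print(deduplicated)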
3. Replace in String and String Operations:
Data inconsistencies, such as misplaced characters or typographical errors, are common. The "Replace in String" and "String Operations" steps let users specify the affected columns and define rules to correct these inconsistencies, improving data quality.
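The same kind of column-level find-and-replace can be sketched in pandas as follows; the customer table and the replacement rules are purely illustrative.

import pandas as pd

# Hypothetical customer table with a stray character and a common typo in the city column
customers = pd.DataFrame({"customer": ["Alice", "Bob"], "city": ["Kampal@", "Entebe"]})

# Column-specific replace rules, similar in spirit to defining search/replace pairs per field
customers["city"] = (customers["city"]
                     .str.replace("@", "a", regex=False)
                     .str.replace("Entebe", "Entebbe", regex=False))
print(customers)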
4. Null Handling:
Datasets often contain null values, posing challenges during analysis. The "Null Handling" function enables users to set default values (e.g., "unknown" for strings or 0 for numbers), facilitating seamless data loading into warehouses and ensuring meaningful analysis.
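In pandas terms, substituting per-column defaults for nulls looks roughly like this; the table and the default values are hypothetical.

import pandas as pd

# Hypothetical table with missing values in both a text and a numeric column
orders = pd.DataFrame({"region": ["East", None, "West"], "discount": [5.0, None, 2.5]})

# Substitute defaults per column, analogous to setting a default value when a field is null
orders = orders.fillna({"region": "unknown", "discount": 0})
print(orders)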
5. Filter Rows:
This function is useful for filtering out invalid data entries. For instance, if an age column contains negative values, the "Filter Rows" step can be configured to exclude such anomalies, ensuring data accuracy.
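A minimal pandas sketch of the same idea, splitting rows that pass the condition from those that fail it, could read:

import pandas as pd

# Hypothetical table where one age value is clearly invalid
people = pd.DataFrame({"person_id": [1, 2, 3], "age": [34, -5, 29]})

# Keep only rows that pass the condition; the excluded rows could be routed elsewhere for review
valid = people[people["age"] >= 0]
invalid = people[people["age"] < 0]
print(valid)
print(invalid)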
6. Calculator Step:
The "Calculator" step offers versatile functions, such as converting negative values to positive using the absolute value function. This is particularly useful for correcting data errors identified during the filtering process.
7. Select Values Step:
The "Select Values" step allows users to rename columns, reformat data types, and remove unnecessary columns. It's instrumental in ensuring consistency, especially when aligning data types between different datasets before loading them into the data warehouse.
Conclusion
Once the transformations are complete, the data is loaded into data marts built on databases such as MySQL or PostgreSQL using the appropriate output and lookup steps. After the various transformations are saved and run successfully, a job is created to combine them (e.g., Customer, Product, Sales). The job can then be scheduled to run at regular intervals, so that new incoming data is processed consistently and is available for timely analysis and decision-making.
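To make the end-to-end idea concrete, here is a minimal Python sketch of a "job" that chains a transformation and loads the result into a local SQLite table. The data, table name, and file path are hypothetical, and this is only an analogy; a real PDI setup would instead run the saved job on a schedule.

import sqlite3
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for a saved transformation: de-duplicate, fill nulls, fix signs
    return (df.drop_duplicates(subset="order_id")
              .fillna({"amount": 0})
              .assign(amount=lambda d: d["amount"].abs()))

def run_job() -> None:
    # One "job" chaining extract -> transform -> load for a hypothetical sales feed;
    # in practice the incoming data would come from the extraction steps above
    incoming = pd.DataFrame({"order_id": [1, 2, 2], "amount": [100, None, -250]})
    with sqlite3.connect("data_mart.db") as conn:
        transform(incoming).to_sql("sales", conn, if_exists="append", index=False)

if __name__ == "__main__":
    run_job()   # a scheduler (cron, Task Scheduler, etc.) could invoke this at intervals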
Disclaimer
The choice of an ETL tool depends on factors like cost, ease of use, and provider support. While other tools may offer similar or superior functionality, my preference for PDI is based solely on my own experience and how well it matched my criteria. I am not affiliated with Pentaho, nor am I compensated to promote the platform. This article highlights my top seven steps; PDI offers many other transformation steps beyond the scope of this discussion.