Introducing Last Mile ETL: One Tool for Better Data Transformation
Written by Patrik Braborec |
Imagine you connect a database to analytics and the data is not in the shape you would like to have it. For example, coordinates should be separated by latitude and longitude, some values are in a different format or type, and some tables can even have a bad design. Normally, you would need to spend a lot of time context-switching between the ELT/ETL pipeline and analytics, where, in the transformation (T) phase you would need to edit code that transforms data from the application shape to the analytics shape. This experience is sort of normal in our industry but does it have to be? For this reason, we are introducing Last Mile ETL! With it, you can do everything in one tool which significantly improves the ability to iterate on and the speed of development, customization, and security. You may now ask how? Well, enough words, it’s time for an example. Let’s deep dive into it!
Exploratory Analytics With Last Mile ETL
What do I mean by "exploratory analytics"? Let’s say, we have just three tables with data, and we would like to explore its value or learn some facts based on this data. It means that instead of one concrete goal to achieve, we will primarily try to explore some value in this data! With defined exploratory analytics, here are three tables in a database (Airports
, Country list
, and GDP
— Gross Domestic Product):
You can see that the coordinates are in one column called coordinates
instead of latitude
and longitude
, or value
in GDP
the table is text rather than numerics. We can address these issues with the help of Last Mile ETL inside the analytics.
Let’s connect the database to the analytics (if you are not familiar with GoodData, I encourage you to check the documentation). The result is the following:
What you can see in the image above are datasets. We can convert a dataset to a so-called SQL dataset:
The SQL dataset gives us the possibility to write SQL queries that are executed directly in the database. Let’s just check what types of airports we have in the database:
It seems that the Airports
table contains several types of airports such as heliports, or even closed airports. Let’s say that I am interested just in the medium and large airports — it’s not a problem at all. I don’t need to go to the ELT/ETL pipeline, I simply create a new SQL dataset called Airports transformed
directly in the analytics, and I can do that with the following SQL code:
You can see that I can do it with quite a simple SQL query and the result is the following:
Unfortunately, there is one more problem we need to solve. The coordinates are not in the format we need for analytics. We need to break this column into longitude
and latitude
to render a geo chart. Let’s edit the Airports transformed
dataset:
Now, with this little transformation, we can simply render a geo chart to see the all airports in the whole world:
As this was rather simple, let’s explore something more valuable — what is the correlation between GDP and the number of airports in a country?
First, we can just list the values of Airports
and GDP
tables:
The bad news is that the Airports
table does not have the full country names (for example, Albania) but only iso codes of countries (for example, AL). On the other hand, the GDP
table does not have iso codes but only has the full name of the country. Sadly, there is one more issue with the column value
. You can see that the values contain commas and this is not the correct format for numbers. Let’s solve the mentioned problems. We can start with the Country list
table where both the full names and the iso codes are found:
Now, we just need to create a new SQL dataset for the GDP where we join Country list
to add the missing iso code (or as I named it — country code), and remove commas from the column value
and convert it to numeric type:
Perfect, we have the GDP SQL dataset that we need! Also, to count the airports correctly, I will add an identifior to the Airport transformed
SQL dataset:
Both of the SQL datasets now contain country codes, we can simply join them based on these country codes:
The result is as follows. I also removed the Country list
dataset as it does not have any value for us right now:
Let’s move to the Analyze tab to find out what the correlation between the GDP and the number of airports in the country is. First of all, we need to calculate the number of airports in the country (for more information, check how to create metrics in GoodData):
The last step is to calculate the correlation between GDP and the number of airports in the country:
Everything is transformed and computed, the result is as follows:
Conclusion
Together we explored what Last Mile ETL is, and how you can benefit from it. In short, you can do everything within the analytics and, as a result, avoid the need for context-switching. On top of that, you can benefit from the security implications — that being, you do not need to give access to the database in order to make simple changes within the analytics. A further plus point is the ability to iterate, meaning that you can start with something simple, and then move to more complex results. Finally, thanks to GoodData's analytics-as-code approach you can easily version everything in Git and thus apply software engineering best-practices to the analytics.
Why not try our 30-day free trial?
Fully managed, API-first analytics platform. Get instant access — no installation or credit card required.
Get startedWritten by Patrik Braborec |