In late August, I was working on a project with the development team as they implemented a new tool to provide better control and performance for data extraction, transformation and loading (ETL). My responsibility was to ensure that the resulting information was loaded correctly.
Written in Python (a language well known within the Business Intelligence team), Airflow was created at Airbnb and was first implemented here in a project designed to take our whole financial system to a new level.
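In Airflow, a pipeline is just a Python file declaring a DAG (directed acyclic graph) of tasks. Below is a minimal sketch of what such a definition looks like; the DAG id, owner and commands are made-up placeholders, not our actual pipelines:

```python
# Minimal Airflow DAG sketch (classic 1.x style). All names are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "bi",                       # hypothetical owner
    "retries": 1,                        # retry failed tasks once
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_finance_etl",        # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2017, 8, 1),
    schedule_interval="@daily",          # run once a day
)

extract = BashOperator(
    task_id="extract",
    bash_command="echo 'pull data from the source systems'",
    dag=dag,
)

load = BashOperator(
    task_id="load",
    bash_command="echo 'load the results into the warehouse'",
    dag=dag,
)

extract >> load  # load only runs after extract succeeds
```

Because the definition is plain Python under version control, anyone on the team can read it, review it and extend it.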
As the project took shape, both the developers and the BI team realized that this should become the main ETL solution inside 99. Many changes came with this decision, especially for the BI team. May I remind you that all of this took only a couple of weeks?!
Making it collaborative
From the beginning, the BI team had a single developer responsible for creating and maintaining our Data Warehouse. He did incredible work and delivered many different datasets, but almost entirely on his own, with little collaboration from the rest of the team.
With Airflow, the main idea was that the whole company could collaborate on developing and deploying the DW, providing data not only for the BI team, but also for CSI (responsible for risk mitigation), Finance and Operations, among others.
In just a few weeks (six and a half, to be more precise), over 50 commits were made, containing not only code improvements but also extractions built by the BI team. The initial purpose of serving a single project had grown into fulfilling a whole company's needs.
For this to work, a new workflow had to be implemented within the BI team, one that was already second nature to the developers.
Git, pull requests, code review and merges were new concepts adopted by the whole team. Not only did this make every step clear, it also gave visibility into what each person is doing, guaranteeing that the ETL covers what we need and spreading better practices along the chain.
This is still a work in progress. As we get more experienced, the whole process will become more natural.
By now, you might be asking yourself: what is BI actually doing with Airflow?
Well, basically we are implementing many extractions for our Data Warehouse, which gives us the ability to rapidly answer many (many) everyday questions. Also, as our knowledge improves, we provide feedback on many pull requests made by other teams.
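To make "extraction" concrete: each one is essentially a small extract-transform-load routine that Airflow schedules for us. The sketch below is illustrative only (it uses sqlite3 as a stand-in for the real source and warehouse so it runs anywhere); table and column names are made up:

```python
import sqlite3


def extract_transform_load(src_conn, dw_conn):
    """Copy daily fare totals from an operational source into a DW table.
    All table and column names here are illustrative, not our schema."""
    # Extract: read raw rows from the operational source.
    rows = src_conn.execute("SELECT ride_date, fare FROM rides").fetchall()

    # Transform: aggregate fares per day in Python.
    totals = {}
    for ride_date, fare in rows:
        totals[ride_date] = totals.get(ride_date, 0.0) + fare

    # Load: upsert the aggregates into the warehouse table.
    dw_conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_revenue "
        "(ride_date TEXT PRIMARY KEY, total REAL)"
    )
    dw_conn.executemany(
        "INSERT OR REPLACE INTO daily_revenue VALUES (?, ?)",
        sorted(totals.items()),
    )
    dw_conn.commit()
    return totals
```

In practice a routine like this would be wrapped in an Airflow `PythonOperator` and pointed at the real databases, with Airflow handling scheduling and retries.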
Currently, our main projects running inside Airflow are:

* BI Data Warehouse: where we store parsed data for fast analysis of 99's operations;
* Finance: all transactions with passengers and drivers are stored here, and used for monthly reviews and tax calculations;
* LTV: although part of the finance project, this is a huge branch dedicated to LifeTime Value calculations for our users;
* CSI: fraud prevention and detection inside our ecosystem.
Many other projects are planning to migrate to this infrastructure.
One might be thinking: “ok, 99 has a lot of data, where do you store all the ETLs?” and that is a great question.
When we started the Airflow implementation, we decided on Postgres 9.5, as it delivered everything we needed for the finance project.
However, once we decided to migrate the BI DW to this structure, our team wanted a significant performance improvement and the capacity to process even more data. We compared several platforms, and AWS Redshift was the logical choice, since we already run everything on their infrastructure.
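One practical difference with Redshift is that bulk loads typically go through S3 and the `COPY` command rather than row-by-row inserts. As a small sketch, a task can build that statement in Python before handing it to the database; the bucket, table and IAM role names below are entirely made up:

```python
def build_redshift_copy(table, s3_path, iam_role):
    """Build a Redshift COPY statement that bulk-loads a table from
    gzipped CSV files on S3. All names passed in are illustrative."""
    return (
        "COPY {table} FROM '{s3_path}' "
        "IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV GZIP;"
    ).format(table=table, s3_path=s3_path, iam_role=iam_role)
```

A hypothetical call would look like `build_redshift_copy("daily_revenue", "s3://example-bucket/daily_revenue/", "arn:aws:iam::123456789012:role/example-redshift")`, with the resulting SQL executed through the warehouse connection that Airflow manages.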
How is it going so far?
We have a very tight schedule for this quarter, but we are confident that everything we need will be delivered on this fantastic platform. As we make progress, I will try to keep this post up-to-date, sharing our discoveries, pitfalls and anything else that might be of interest.
Interested in learning more about the BI team? Drop us a message by clicking here!