Hey there! My name is Andrei Tserakhau, and I’m an engineer. I previously worked at Yandex and am now a Tech Lead at DoubleCloud. DoubleCloud is not just another data platform, but an engine that helps you grow your own data stack on top of our pillars. I write about modern data stacks, how they are built, and how to use them as efficiently as possible.
What is a modern data stack?
What is a data stack? A data stack is a collection of technologies that allow raw, disparate data to be processed before it can be used. A modern data stack (MDS) consists of the specific tools that are used to organize, store, and transform data. These tools take the data from “inedible” (data that cannot be worked with) to “edible” (data that can be worked with).
Modern stacks are mostly built with several layers:
1. Ingestion — ELT / ETL pipelines that act as an entrance for your data
2. Storage — OLAP database or DWH / Data Lake. Acts as the main holder of your raw/clean data for later use.
3. Business Intelligence (BI) — tools to visualize and explore that data
4. Orchestration — something to orchestrate this madness
For some companies, these layers can be wider or narrower, but the common frame is still there.
Traditional ELT/ETL Pipeline Development
Most data pipelines can be broadly divided into two groups: code-based pipelines, written and deployed as software, and non-code-based pipelines, usually built in the UI of a SaaS tool.
Let’s focus first on code-based pipelines and their main challenges:
1. Complexity and maintenance — Pipelines need to be implemented, deployed, and then monitored.
2. Scalability — Scaling resources and infrastructure on demand in response to changing data needs is challenging.
3. Long Development Cycles — Writing code can be hard, especially for data-intensive products.
For non-code-based ones (especially SaaS), these issues are less of a problem, but they come with challenges of their own:
1. Limited visibility and monitoring — This can lead to delays in detecting and addressing data pipeline problems.
2. Hard to reproduce — Reproducing UI-based pipelines can be complicated. Making copies of your pipeline to test some changes can be torture for data engineers.
3. Lack of version control — UI-based pipelines are hard to version, since every change is made manually in the UI instead of being tracked as code.
All of this can push us to want a change. What if we could combine the simplicity of UI-based pipelines with the visibility and reproducibility of code-based pipelines?
How to improve your pipeline developer experience with Terraform
Recent developments and technologies have made it easier than ever to get started building a modern data stack. However, this increased ease of use has likewise raised the likelihood of a data warehouse changing in unchecked and unintended ways. Terraform by HashiCorp is an open-source project that provides infrastructure-as-code and can implement change management of the modern data stack.
Modern data-intensive applications always live close to the actual infrastructure. Terraform answers the question of how to make your data stack declarative, reproducible, and modular while retaining the simplicity of SaaS tools.
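To make “declarative” concrete: a Terraform configuration describes only the desired end state, and Terraform computes the changes needed to get there. Here is a minimal sketch (the provider and resource names are placeholders, not a real provider):

```hcl
terraform {
  required_providers {
    # Placeholder provider; swap in whichever providers your stack uses.
    examplecloud = {
      source  = "example/examplecloud"
      version = "~> 1.0"
    }
  }
}

# The desired state: one analytical database.
# `terraform plan` shows the diff against reality,
# `terraform apply` converges the infrastructure to this description.
resource "examplecloud_database" "analytics" {
  name = "analytics"
  tier = "small"
}
```

Because the whole definition lives in a file, it can be code-reviewed, versioned in Git, and copied to spin up an identical test environment — exactly the properties UI-based pipelines lack.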
Let’s say we have a typical web application that lives inside Terraform and consists mainly of storage — for example, a PostgreSQL database.
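As a sketch, assuming the application’s Postgres runs on AWS RDS (your actual provider and arguments will differ), the existing OLTP storage might be declared like this:

```hcl
# The web application's OLTP storage, managed by Terraform.
# Illustrative values only; sized for a small workload.
resource "aws_db_instance" "app_postgres" {
  identifier        = "webapp-oltp"
  engine            = "postgres"
  engine_version    = "15"
  instance_class    = "db.t3.medium"
  allocated_storage = 50

  db_name  = "webapp"
  username = "app"
  password = var.db_password # keep secrets in variables, not in the config

  skip_final_snapshot = true
}
```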
And what we need is to add some analytical capabilities here:
1. Offload analytical queries from the OLTP database to a separate storage.
2. Aggregate all data in one place.
3. Join it with data outside our scope.
So let’s say that the target infrastructure should look something like this:
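In Terraform terms, the additions might be sketched as follows. Note that the resource types and arguments below are illustrative approximations, not the exact schema of the DoubleCloud Terraform provider:

```hcl
# Illustrative sketch only: resource names and arguments approximate
# the DoubleCloud provider and may not match its real schema.

# 1. Analytical storage: a ClickHouse cluster that aggregates
#    all data in one place and absorbs the analytical load.
resource "doublecloud_clickhouse_cluster" "analytics" {
  project_id = var.project_id
  name       = "analytics"
  cloud_type = "aws"
  region_id  = "eu-central-1"
}

# 2. A transfer that replicates the OLTP Postgres into ClickHouse,
#    offloading analytical queries from the production database.
resource "doublecloud_transfer" "pg_to_clickhouse" {
  project_id = var.project_id
  name       = "pg-to-clickhouse"
  type       = "SNAPSHOT_AND_INCREMENT"
  source     = var.postgres_endpoint_id
  target     = var.clickhouse_endpoint_id
}
```

Joining with external data then becomes a matter of adding more transfers pointing at the same ClickHouse target, all versioned and reviewable as code.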