5 Reasons Why to Write Your Semantic Layer in YAML
Written by Tomáš Muchka
I recently found an interesting article proposing that semantic layers must be written in real programming languages.
The author states: “It’s relatively easy to start building a semantic layer using YAML, so it’s a quick path to get a product to market. However, taking this shortcut shifts the burden to users and creates a poor developer experience. (...) A real semantic layer needs a full-fledged programming language.”
We at GoodData have been interested in analytics as code for quite some time, and we asked ourselves the same questions the Malloy team did. Spoiler alert: we decided to write our semantic layer in YAML, and for the very reason Malloy argues against it: the developer experience.
Here are the top 5 reasons why we think that writing your semantic layer in YAML is a good idea:
1. YAML Is Widely Supported
YAML is widely supported in the dev community. Your IDE can open it without any problem. Even web text editors can work with it without any hassle. Compare this with a custom-developed language - will your IDE support it? And will people understand it out of the box?
All the major editors, such as those listed below, support YAML, either built in or via plugins:
- Sublime Text
- VS Code
- Eclipse
- vim
- TextMate
- IntelliJ IDEA
- Notepad++
- Brackets
- SlickEdit
People generally try a new language only if they see clear benefits. We have firsthand experience with our own multi-dimensional query language, MAQL. We still see MAQL as necessary for serious analytics work, but all our user studies show that people are hesitant to learn it until they see its long-term value. And we are not alone in this thinking. I recently listened to a podcast where the creators of Wasp shared the same experience.
2. YAML Is Easy to Understand
Using YAML doesn't guarantee that people will comprehend its keywords. However, given its widespread usage, they may at least be able to understand its structure and relations. Sure, the YAML structure can be a little verbose at times, but that’s a low price to pay if your colleagues across the company can understand it with minimal effort.
Example of a metric definition in GoodData:
type: metric
id: net_orders
title: Net Orders
description: Total number of orders that have been processed
maql: SELECT {metric/of_orders} WHERE {label/order_status} = "Processed"
format: "#,##0"
And now, let’s compare it with a proprietary language like DAX:
net_orders =
FORMAT(
    CALCULATE(
        SUM(of_orders),
        FilterTable[order_status] = "Processed"
    ),
    "#,##0"
)
Using a full-fledged programming language might be fine for developers or Analytics Engineers, but what about the less technical audience - e.g., data analysts? Could they understand what is going on? And would they be able to propose changes? Or should they be omitted from the code experience altogether?
We at GoodData have tested multiple prototypes of semantic layer definition. These prototypes included Python code and YAML. Guess what? Overall, both Python and YAML were easy for Analytics Engineers to understand. However, they all preferred to use YAML. Why? Because they wanted to collaborate with data analysts and were afraid that a full-fledged programming language would scare them off.
As one of our research participants said: “YAML is simply better for people who are not developers.”
3. YAML Is Declarative
I bet you have already heard about declarative and imperative programming. I will not go into a deep explanation here. Many articles on that topic already exist, e.g., Declarative vs imperative on dev.to.
Neither will I conclude which approach is generally better. For the description of analytics, the declarative approach better suits our needs.
Andrii Chumak, our senior principal software engineer, expressed this well: “YAML is predictable and ‘safe.’ I immediately know what I'll get when I look at declarative languages. When I look at something with a bunch of function calls and perhaps even ifs, elses, or whiles - it's harder to assert what will be the result until I run it.”
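To make that contrast concrete, here is a minimal sketch in Python, used purely for illustration. The dictionary mirrors the net_orders metric shown earlier (minus its description), while `build_metric` and its `only_processed` flag are hypothetical names invented for this example and are not part of GoodData.

```python
# Declarative: the definition is plain data. Reading it tells you the result.
DECLARATIVE_METRIC = {
    "type": "metric",
    "id": "net_orders",
    "title": "Net Orders",
    "maql": 'SELECT {metric/of_orders} WHERE {label/order_status} = "Processed"',
    "format": "#,##0",
}


# Imperative (hypothetical helper): the definition is assembled by control
# flow, so the outcome depends on an argument known only when the code runs.
def build_metric(only_processed: bool) -> dict:
    maql = "SELECT {metric/of_orders}"
    if only_processed:
        maql += ' WHERE {label/order_status} = "Processed"'
    return {
        "type": "metric",
        "id": "net_orders",
        "title": "Net Orders",
        "maql": maql,
        "format": "#,##0",
    }


if __name__ == "__main__":
    # Both paths can describe the same metric, but only the declarative one
    # can be checked by reading it, without executing anything.
    print(build_metric(only_processed=True) == DECLARATIVE_METRIC)
```

With a single flag the imperative version is still easy to follow. The concern in the quote grows with every extra branch or loop, because each one multiplies the number of definitions the code could produce.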
4. YAML Is Flexible
Another aspect where YAML excels is its flexibility. It might be tempting to use a full-fledged programming language to describe data transformation, but is it the best approach to describe data models? And for visualizations or dashboards?
Analytics as code should not be limited to data transformations. We also see benefits in keeping all the code in a single language, as it significantly reduces context switching, especially since analytics code is highly connected, from the semantic layer all the way to the dashboards.
Example of a visualization defined in YAML:
id: fc3dc923-9e18-4fae-ab16-a50bfdb0725d
type: bar_chart
title: Order cost by status
query:
  fields:
    order_unit_cost: metric/order_unit_cost
    order_status: label/order_status
metrics:
  - field: order_unit_cost
    format: "#,##0.00"
view_by:
  - order_status
5. YAML Can Be Fully Integrated with IDE
For a long time, YAML has been seen as a sub-optimal configuration format. However, nowadays, IDE functionalities allow us to build a mature DSL on top of it with all the productivity features we are used to from full-fledged programming languages.
This experience consists of real-time validation (including validation of references), code highlighting, code completion, and reference navigation, all nicely integrated straight into the IDE. This is not just theory. We have developed this kind of integration in our GoodData for VS Code extension.
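As a rough idea of what reference validation involves, the following Python sketch scans a MAQL string for `{type/id}` references and reports any that are missing from a catalog of known objects. It is an illustration only, not how the GoodData for VS Code extension is implemented. The `KNOWN_OBJECTS` catalog, the regular expression, and the `find_broken_references` function are all assumptions made for this example.

```python
from __future__ import annotations

import re

# Hypothetical catalog of objects defined elsewhere in the project.
KNOWN_OBJECTS = {"metric/of_orders", "label/order_status"}

# Assumed reference syntax: {type/id}, as seen in the metric example above.
REFERENCE_PATTERN = re.compile(r"\{([a-z_]+/[A-Za-z0-9_.]+)\}")


def find_broken_references(maql: str, known: set[str]) -> list[str]:
    """Return every {type/id} reference in `maql` that is not in `known`."""
    return [ref for ref in REFERENCE_PATTERN.findall(maql) if ref not in known]


if __name__ == "__main__":
    good = 'SELECT {metric/of_orders} WHERE {label/order_status} = "Processed"'
    bad = "SELECT {metric/of_orderz}"  # deliberate typo in the metric id
    print(find_broken_references(good, KNOWN_OBJECTS))
    print(find_broken_references(bad, KNOWN_OBJECTS))
```

Because the definitions are declarative data rather than executable code, a check like this can run on every keystroke without evaluating anything, which is what makes real-time feedback in the editor practical.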
Conclusion
With today's tooling, there is no real reason why YAML should be worse than a proprietary full-fledged programming language. Its biggest strength for usage in analytics is its flexibility and understandability. Of course, there are also downsides, such as being a little more verbose sometimes, but that’s a low price to pay for what you get.