
How to Create Your Next Analytics Project in Code

Written by Andy Chumak


The “as code” approach has proliferated to almost every aspect of the modern tech company. Why should data analytics be any different? A modern analytics project can be no less complex than, let’s say, an Infrastructure as Code setup, and it can also benefit from versioning, automation, and collaborative coding tools.

In this article, you’ll learn how to define an analytics solution as code, how to set up CI/CD pipelines for such a solution, and how to integrate it with your infrastructure.

If you’d rather read about the benefits of analytics as code compared to traditional solutions, here is an article discussing just that!

GoodData’s Take on Analytics as Code

Analytics as code is not new to GoodData. We’ve had our Declarative API for a while now. Its main purpose is to version analytics projects, copy them between different instances of GoodData, and allow the metadata to be manipulated in a pipeline.

The Python SDK is built on top of the Declarative API and takes it to a whole other level. A project can be defined completely programmatically, or loaded from the Declarative API and manipulated with Python scripts.

GoodData for VS Code is our latest addition to the toolset and the topic of this article. Our goal with this tool is to introduce analytics engineers to software development best practices like coding within an IDE, contributing to GitHub, opening Pull Requests, and utilizing CI/CD pipelines for deployment.

GoodData for VS Code

GoodData for VS Code consists of two complementary tools: a VS Code Extension and a CLI utility. It also defines how you describe analytics objects in code — a language syntax. Let’s go through these items next.

Language Syntax

The GoodData for VS Code language syntax is based on YAML, which we chose for its brevity and simplicity.

Our Python SDK also uses YAML to store declarative definitions in files for versioning. However, these are two different formats. GoodData for VS Code’s format focuses on being human-friendly and brief, while the Python SDK’s format strictly follows the REST API schema from our server. The tools have different use cases in mind and, as they evolve further, we will decide whether to eventually merge them or let them occupy their own niches.

[Figure: Metric definition in YAML]

Currently, GoodData for VS Code lets you define datasets and metrics. Together, these objects describe the semantic layer, the foundation of an analytics project.
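To give a rough idea of what such definitions might look like, here is an illustrative sketch of one dataset and one metric. The file names, field names, and values are assumptions made for illustration and are not a verified copy of the GoodData for VS Code schema; check the extension's own documentation and autocomplete for the actual properties.

```yaml
# Illustrative sketch only: property names and values are assumptions,
# not a verified copy of the GoodData for VS Code schema.

# datasets/orders.yaml (hypothetical file)
type: dataset
id: orders
title: Orders
table_path: public/orders      # assumed: schema/table in the connected database
primary_key: order_id
fields:
  order_id:
    type: attribute
    source_column: order_id
    data_type: STRING
  price:
    type: fact
    source_column: price
    data_type: NUMERIC
---
# metrics/revenue.yaml (hypothetical file)
type: metric
id: revenue
title: Revenue
# The metric refers to the "price" fact defined in the dataset above.
maql: SELECT SUM({fact/price})
format: "#,##0.00"
```

The point of the sketch is the shape: each object is a small, readable file with an id that other objects can refer to, which is what makes reviewing changes in a Pull Request practical.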

We are planning to support visualizations and dashboards with GoodData for VS Code in the following releases, thus allowing you to define a complete analytics project as code.

VS Code Extension

VS Code has decent support for YAML file editing out of the box, especially if you attach the right JSON schema. However, it would still be lacking the context needed to run semantic validation or suggest the right autocomplete option. It's for that reason we created our extension for VS Code. Here are some features that the GoodData extension packs:

  • Unlike the built-in syntax highlighting, our extension also highlights IDs and references between objects, making it easier to navigate the document.

  • We’ve put a lot of effort into analytics project validation. You get the expected schema validation and semantic validation for every file, but we went further and added contextual validation. Your project files are not only cross-validated within the project but also validated against your database to ensure you’re referencing only existing tables and columns. This also opens some possibilities for future integration with other tools in your stack, like dbt or Meltano.

  • Autocomplete does what you expect — it suggests valid options for a given property as you type.

  • The preview feature is a big productivity boost: you can preview your datasets and metrics right from VS Code, without switching to the browser to check the results.

    [Figure: Metric preview]

GoodData’s extension for VS Code is available now on the marketplace. You can also install it right from the extensions tab in your VS Code — just search for “GoodData”.

CLI Utility

GoodData CLI is a command-line app that is meant to be used as a companion to the VS Code extension or on its own in CI/CD pipelines. It is written in JavaScript, so it requires Node.js, and it can be installed directly from npm (npm i -g @gooddata/code-cli).

[Figure: GoodData CLI]

GoodData CLI provides four commands. Some are more interesting when used in combination with the VS Code extension (init and clone), while others were built with CI/CD pipelines in mind (validate and deploy).

The Workflow

No matter how good your tools are and how efficient you are at writing code, you won’t get far without a strong workflow: one that prevents human mistakes yet is flexible enough not to get in the way when you’re on a roll. Let’s see what the setup and CI/CD pipelines could look like for an analytics project.

The Setup

[Figure: GoodData for VS Code setup]

First of all, every analytics engineer needs to have the “manage” permission at the organization level on GoodData Cloud. Ideally, you’ll want to have two organizations: one for development, where all analytics engineers get full access, and another for production, where only CI/CD pipelines can push changes.

Next, each analytics engineer should ideally have their own sandbox workspace within the development organization. That’s because changes need to be deployed in order to run previews for datasets and metrics. If several people shared the same dev workspace, they would risk overwriting each other’s work and ending up with unreliable previews.

With such a setup, every analytics engineer in your team will be able to work independently in their own sandbox, without any risk of inadvertently affecting production. All changes to the production environment are done through CI/CD pipelines after proper gating: code review and automated tests.

CI/CD Pipelines

If all you need is to propagate the work that analytics engineers are doing to the production server, the CI/CD setup can be extremely simple. Here is an example for GitHub Actions.

First, you’ll need to gate any new code that’s being merged to the main branch: GoodData CLI can validate the project and ensure there are no obvious mistakes. The following pipeline runs validation on every Pull Request to the main branch. If you also forbid direct pushes to the branch and make the checks mandatory for Pull Requests in your repo settings, code that fails validation cannot be merged there.

name: GoodData Analytics Gating

on:
  pull_request:
    branches:
      - 'main'

jobs:
  gate:
    runs-on: ubuntu-latest
    env:
      # Define your token in GitHub secrets
      GOODDATA_API_TOKEN: ${{secrets.GOODDATA_API_TOKEN}}

    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Set up NodeJS
        uses: actions/setup-node@v3
      - name: Install GoodData CLI
        run: npm i -g @gooddata/code-cli
      - name: Validate against staging environment
        run: gd validate --profile staging

Next, you’ll want to deploy the new version of analytics after the merge. If your company is embracing Continuous Delivery, this would be your production deployment. If not, you can point it at a staging environment and have other pipelines for production, perhaps triggered manually.

name: GoodData Analytics Deployment

on:
  push:
    branches:
      - 'main'

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Define your token in GitHub secrets
      GOODDATA_API_TOKEN: ${{secrets.GOODDATA_API_TOKEN}}

    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Set up NodeJS
        uses: actions/setup-node@v3
      - name: Install GoodData CLI
        run: npm i -g @gooddata/code-cli
      - name: Validate against production environment
        run: gd validate --profile production
      - name: Deploy to production
        run: gd deploy --profile production --no-validate

Note that in the example above we’ve separated the validation and deployment steps. That’s done purely for convenience when reading the pipeline results. Technically, every deploy command first runs validation, unless you pass the --no-validate option.
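If you prefer a manually triggered production pipeline, a minimal sketch could look like the one below. It uses GitHub Actions' workflow_dispatch trigger and a single deploy step that relies on the built-in validation instead of a separate validate step. The workflow name and the GOODDATA_PROD_API_TOKEN secret name are assumptions made for illustration; the production profile is the same one used in the example above.

```yaml
name: GoodData Analytics Production Deployment

on:
  # Run only when someone starts the workflow by hand from the Actions tab.
  workflow_dispatch:

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Assumption: a separate secret holds the production token.
      GOODDATA_API_TOKEN: ${{ secrets.GOODDATA_PROD_API_TOKEN }}

    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Set up NodeJS
        uses: actions/setup-node@v3
      - name: Install GoodData CLI
        run: npm i -g @gooddata/code-cli
      - name: Validate and deploy to production
        # No --no-validate here, so deploy runs validation first.
        run: gd deploy --profile production
```

The trade-off is readability of the results: with a single step, a validation failure and a deployment failure show up in the same place in the pipeline log.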

There is a catch to this setup, though. GoodData for VS Code only covers the semantic layer (and soon will cover the analytics layer) of your project. But there is much more to a typical project: data source definitions, data filters, workspace hierarchies, user management, permissions, and so on. Furthermore, you might want to have several workspaces with different semantic layers in a single organization. How do you orchestrate a complete deployment? Well, that’s where the older brothers of GoodData for VS Code come in: the Declarative API and the Python SDK. I’ve made a demo project showing what a complete setup might look like, with analytics defined through GoodData for VS Code and the rest done with a Python script. Feel free to fork it on GitHub.

What’s Next?

GoodData for VS Code is currently available as a public beta, and we are committed to developing it further into a stable release. Here are a few topics we are looking into:

  • Adding support for visualization and dashboard definitions in code.
  • Integration with other “as code” tools, both up the data pipeline (e.g. ELT tools like dbt or Meltano) and down the pipeline (like our own React SDK).
  • Test automation for data analytics.

What feature would you like to see implemented next? If you want to be part of the story, reach out to us on our community Slack channel with feedback and suggestions.

Want to try GoodData for VS Code yourself? Here is a good starting point. To use it, you’ll need a GoodData account. The best way to obtain one is to register for a free trial.
