Many-to-Many in Logical Data Models
During a typical multidimensional analysis, every dimension attribute participates in a simple one-to-many relationship with every fact. Design of the model then follows a star or a snowflake schema.
In more complex models, multiple stars/snowflakes exist. Users are guided during analysis and only relevant dimension attributes connected via one-to-many relationships are offered. This provides a safe and intuitive environment for the analyst.
The real-world complexity sometimes requires analysis over many-to-many relationship, also known as M:N. The following sections give you information on how to prepare your logical data model for these more complex use cases.
Set up M:N relationships in LDM
Let us demonstrate the many-to-many relation on the following example with projects and consultants:
Project | Revenue | Consultant |
---|---|---|
alpha | 8,000 | Alice |
Joe | ||
beta | 18,000 | Bob |
Cath | ||
Joe | ||
gamma | 33,000 | Bob |
Cath | ||
Ed | ||
Jim | ||
delta | 12,000 | Alice |
Cath | ||
Ed |
The many-to-many relationship between Project and Consultant in this example means that one consultant works on multiple projects and multiple consultants work on one project.
Analytics should be able to provide answers, for example, to the following questions:
What is the revenue per project where Cath works?
What is the total revenue from projects where Alice works?
How many consultants work on each project?
To answer all the questions we need to model data differently in a relational model.
You can create separate entities for Project and Consultant, and capture their relation in a bridge dataset named Timesheet.
You can calculate Revenue in our example from fact Amount of dataset Invoice using the following formula: SELECT SUM(Amount)
To break down the Revenue metric by an attribute from which an oriented path to the fact Amount exists.
You can use the Project attribute in the Project dataset, or the Id attribute in the Invoice dataset.
You cannot use any attributes from the Consultant or Timesheet datasets.
The engine treats the Timesheet dataset as a center of a different star than the star with Amount.
To help the query engine recognize the Timesheet dataset as a bridge dataset of many-to-many relationship and differentiate it from ordinary datasets, drag an arrow from the Timesheet dataset to the Project dataset.
This many-to-many edge in the logical data model is indicated by a two-sided arrow (see the image below).
The filled arrow points in the original direction of one-to-many relationship between Timesheet and Project.
The empty arrow indicates that this relationship can be used to connect a part of the logical model over the many-to-many relationship from the bridge to the rest of the model.
You can now calculate Revenue from Amount on Invoice and filter it or break it down by any attribute in the data model because an oriented path from each dataset to the Invoice dataset exists.
Double Counting with M:N
When you slice the Revenue by both the Project and the Consultant attributes, you will see the effect of double counting:
The Revenue for the Alpha project repeats for each consultant working on this project.
Also, hours that Ed spent on the Gamma project are split into two timesheets. The total revenue for the Gamma project is only 12,000, you see 24,000.
Sum of all rows in the table is higher than the total revenue from all projects.
To avoid this effect of double counting when using attributes over many-to-many relationships, use the technique called allocation factor.
Allocation factor
In the Timesheet bridge dataset, add the Hours spent fact. This will allow you to allocate the calculated revenue to consultants proportionally based on their allocation (i.e., the number of hours spent on a project).
The formula for Attributed Revenue must take the allocation factor into account and weight the revenue of consultants on the projects by their contribution:
SELECT SUM({fact/amount} * {fact/hours_spent} / (SELECT SUM({fact/hours_spent}) BY {label/project}))
This formula:
Multiplies the Amount from the Invoice table by Hours spent from the bridge dataset Invoice.
Divides it by the sum of all Hours spent on a project (the same value is used in all rows belonging to the same project).
Other example of using facts stored in a bridge dataset is currency conversion - you can use many-to-many relation to convert value from a fact table into a selected currency.
More complex models
Models with a single many-to-many edge (double-sided arrow) are suitable for asymmetric use cases - they let you cross the many-to-many bridge dataset only in a single direction.
The following model shows a complex use case where the Consultant dataset also contains information about the Hour rate of consultants.
The Hour rate allows you to compute the costs of a project. You must add an additional may-to-many edge between Timesheet and Consultant to create an oriented path from the Project dataset to the Consultant dataset via Timesheet.
The formula for the Cost is: SELECT SUM({fact/hour_rate} * {fact/hours_spent})
Limitations to M:N
Allowing many-to-many relationships in your logical data model requires careful planning. You want to avoid ambiguity in how the GoodData engine interprets the relationships and approaches calculations.
The following data model schemes show the NOT-recommended modelling structures among datasets.
Alternative paths
The following model is valid, but you need to be aware how the alternative paths are resolved.
The analytical engine selects the shortest path in any computation (A → B
).
You can set one of the paths as many-to-many, since there is only one shortest path.
Alternative paths with many to many
There are alternative paths (C → B
and C → D → A → B
), but the analytical engine will select C → B
because it is the shortest path.
Ambiguous paths
Two or more paths have the same length: A → B → D
or A → C → D
. The analytical engine selects one path randomly.
Negative filters results when used in M:N model
The syntax WHERE Attribute NOT IN (values)
returns records where the result includes at least one different value from the filtered value. You can also use WHERE NOT Attribute IN (values)
for the same results.
Analytical Designer generates NOT ( [attribute] IN ( [element1, element2, ...] ))
for negative filters.
Project | Consultant |
---|---|
alpha | Alice, Joe |
beta | Alice |
gamma | Joe |
delta | NULL |
Queries and results for `View By: Project`
Query | Result |
---|---|
WHERE {label/Consultant} IN ("Alice") | alpha, beta |
WHERE {label/Consultant} = "Alice" | |
WHERE {label/Consultant} NOT IN ("Alice") | alpha, gamma |
WHERE {label/Consultant} <> "Alice" | |
WHERE NOT {label/Consultant} IN ("Alice") | |
WHERE NOT {label/Consultant} = "Alice" | |
WHERE {label/Consultant} NOT IN ("Alice", "Joe") | no result |
WHERE NOT {label/Consultant} IN ("Alice", "Joe") |