Joining Fact Tables
Joining fact tables can be done but there are some inherent risks so you need to be careful when joining fact tables is required. In the following simple scenario, we have a fact of authors to articles and a separate fact of articles to pageviews. Our customer has asked for the ability to 1) find the authors who have provided the most content in a given time period, find the articles which have the greatest number of pageviews for a given time period and 3) find the authors with the highest number of pageviews in a given time period. Our data model can be seen here:
To answer the first two questions is done by using a simple select statement joining a single fact table with the date dimension and the associated other associated dimensions to retrieve the answer.
TOP 10 AUTHORS BY ARTICLE COUNT
TOP 10 ARTICLES BY MONTH
The third requirement will compel us to join two fact tables to get the answer. Remember to join a fact table you need at least one common dimension shared between the facts to accomplish this task. In this scenario, DIM_ARTICLE will be the common dimension which we will use.
TOP 10 AUTHORS FOR 2019
These simple joins work well on small and medium sized fact tables but as the fact tables increase in size your performance may suffer and therefore you may want to be prepared to address these challenges.
There are two other specific issues to be aware of and avoid, Chasm Traps and Fan Traps.
Chasm traps are queries that require measures from more than one fact table. In the data model below we see 2 separate fact tables, one for sales and one for returns. A common request may be to show the net profit (sales – returns) for the 4th quarter of 2018. We may be tempted to join these two dimensions on DIM_DATE and get the results of the sum of sales and the sum of returns.
Let’s walk through the example by running each query separately. First, let’s get the total sales for each store for the 3rd quarter of 2018 using the following query:
RESULTS
Store ID | Store Name | Units Sold | Gross Sales |
2 | NEW YORK | 2109 | 18,532.55 |
7 | PORTLAND | 2094 | 18,205.14 |
6 | PHILADELPHIA | 2058 | 18,117.64 |
4 | WASHINGTON | 2062 | 17,948.92 |
9 | SAN FRANCISCO | 2040 | 17,647.64 |
5 | SEATTLE | 2034 | 17,638.31 |
1 | ONLINE | 2053 | 17,599.21 |
8 | BOSTON | 2010 | 17,286.39 |
3 | LOS ANGELES | 1962 | 17,008.89 |
Next, let’s get the total refunds for each store for the 3rd quarter of 2018 using the following query:
RESULTS
Store ID | Store Name | Units Returned | Gross Returns |
6 | PHILADELPHIA | 68 | 602.66 |
3 | LOS ANGELES | 67 | 580.03 |
7 | PORTLAND | 64 | 569.28 |
2 | NEW YORK | 61 | 526.16 |
4 | WASHINGTON | 62 | 526.05 |
9 | SAN FRANCISCO | 58 | 510.32 |
8 | BOSTON | 52 | 442.92 |
1 | ONLINE | 48 | 414.57 |
5 | SEATTLE | 43 | 382.70 |
Finally, let’s put both of these together into a single query with the following query:
RESULTS
Store ID |
Store Name | Units Sold |
Gross Sales |
Units Returned |
Gross Returns |
7 | PORTLAND | 12140 | 105,680.48 | 8007 | 69,802.46 |
6 | PHILADELPHIA | 11718 | 102,838.25 | 7942 | 69,391.35 |
2 | NEW YORK | 11692 | 101,895.82 | 7732 | 67,121.34 |
1 | ONLINE | 11545 | 99,375.88 | 7628 | 66,735.25 |
5 | SEATTLE | 11441 | 99,165.43 | 7485 | 64,984.80 |
4 | WASHINGTON | 11414 | 99,132.33 | 7607 | 66,445.23 |
9 | SAN FRANCISCO | 11434 | 99,110.30 | 7697 | 66,808.76 |
8 | BOSTON | 11444 | 98,324.45 | 7609 | 66,681.70 |
3 | LOS ANGELES | 11250 | 97,627.69 | 7442 | 64,708.52 |
The results are inflated due to the cross join which will occurs between the fact tables when trying to get a group function result.
To avoid the chasm trap you need to separate the functions into the base SQL statements we originally designed to verify our results, adding zeros for the columns from the other table, and then union the statements together, wrap them in a SQL statement.
RESULTS
Store Name | Units Sold | Gross Sales | Units Returned | Gross Returns | Net Units Sold | Net Sales |
NEW YORK | 2109 | 18,533 | 61 | 526.16 | 2048 | 18,006 |
PORTLAND | 2094 | 18,205 | 64 | 569.28 | 2030 | 17,636 |
PHILADELPHIA | 2058 | 18,118 | 68 | 602.66 | 1990 | 17,515 |
WASHINGTON | 2062 | 17,949 | 62 | 526.05 | 2000 | 17,423 |
SEATTLE | 2034 | 17,638 | 43 | 382.70 | 1991 | 17,256 |
ONLINE | 2053 | 17,599 | 48 | 414.57 | 2005 | 17,185 |
SAN FRANCISCO | 2040 | 17,648 | 58 | 510.32 | 1982 | 17,137 |
BOSTON | 2010 | 17,286 | 52 | 442.92 | 1958 | 16,843 |
LOS ANGELES | 1962 | 17,009 | 67 | 580.03 | 1895 | 16,429 |
Fan traps occur when there is a that require measures from more than one fact table in a master detail relationship. An example of this is an order and order detail relationship.
For this example, our customer has given us a requirement to calculate the average order item. If we try to sum the order value from the fact order table and then sum the quantity of items ordered from the order detail to help us get the average the order value may be inflated due to the cartesian join between the fact tables. Let’s start with a simple select between the customer dimension and the FACT_ORDER table.
RESULTS
Customer ID | Customer Name | Order Value |
[email protected] | Emmy Cuevas | 353.80 |
But when we add in the FACT_ORDER_DTL to the query the result changes in the sum from the FACT_ORDER table.
RESULTS
Customer ID | Customer Name |
Order Value |
Qty | Detail Order Value |
[email protected] | Emmy Cuevas | 381.70 | 38 | 353.80 |
Again, this will be exacerbated by the additional number of line items in the FACT_ORDER_DTL to the cross join between the fact tables.
The way to address this challenge can be done in multiple ways. The first is to ignore the FACT_ORDER (parent) when summing information from the FACT_ORDER_DTL (child). If you rely only on the detail table you can be assured your answer will be correct.
Another way to address this challenge is to adjust the design by removing the FACT_ORDER table and inserting the ORDER_ID as a degenerate dimensional value into the FACT_ORDER_DTL.