Joining Fact Tables

Jim McHugh

Joining fact tables can be done but there are some inherent risks so you need to be careful when joining fact tables is required. In the following simple scenario, we have a fact of authors to articles and a separate fact of articles to pageviews. Our customer has asked for the ability to 1) find the authors who have provided the most content in a given time period, find the articles which have the greatest number of pageviews for a given time period and 3) find the authors with the highest number of pageviews in a given time period. Our data model can be seen here:

To answer the first two questions is done by using a simple select statement joining a single fact table with the date dimension and the associated other associated dimensions to retrieve the answer.

TOP 10 AUTHORS FOR 2019

These simple joins work well on small and medium sized fact tables but as the fact tables increase in size your performance may suffer and therefore you may want to be prepared to address these challenges.

There are two other specific issues to be aware of and avoid, Chasm Traps and Fan Traps.

Chasm traps are queries that require measures from more than one fact table. In the data model below we see 2 separate fact tables, one for sales and one for returns. A common request may be to show the net profit (sales – returns) for the 4th quarter of 2018. We may be tempted to join these two dimensions on DIM_DATE and get the results of the sum of sales and the sum of returns.

Let’s walk through the example by running each query separately. First, let’s get the total sales for each store for the 3^rd quarter of 2018 using the following query:

RESULTS

Store ID	Store Name	Units Sold	Gross Sales
2	NEW YORK	2109	18,532.55
7	PORTLAND	2094	18,205.14
6	PHILADELPHIA	2058	18,117.64
4	WASHINGTON	2062	17,948.92
9	SAN FRANCISCO	2040	17,647.64
5	SEATTLE	2034	17,638.31
1	ONLINE	2053	17,599.21
8	BOSTON	2010	17,286.39
3	LOS ANGELES	1962	17,008.89

Next, let’s get the total refunds for each store for the 3^rd quarter of 2018 using the following query:

RESULTS

Store ID	Store Name	Units Returned	Gross Returns
6	PHILADELPHIA	68	602.66
3	LOS ANGELES	67	580.03
7	PORTLAND	64	569.28
2	NEW YORK	61	526.16
4	WASHINGTON	62	526.05
9	SAN FRANCISCO	58	510.32
8	BOSTON	52	442.92
1	ONLINE	48	414.57
5	SEATTLE	43	382.70

Finally, let’s put both of these together into a single query with the following query:

RESULTS

Store ID	Store Name	Units Sold	Gross Sales	Units Returned	Gross Returns
7	PORTLAND	12140	105,680.48	8007	69,802.46
6	PHILADELPHIA	11718	102,838.25	7942	69,391.35
2	NEW YORK	11692	101,895.82	7732	67,121.34
1	ONLINE	11545	99,375.88	7628	66,735.25
5	SEATTLE	11441	99,165.43	7485	64,984.80
4	WASHINGTON	11414	99,132.33	7607	66,445.23
9	SAN FRANCISCO	11434	99,110.30	7697	66,808.76
8	BOSTON	11444	98,324.45	7609	66,681.70
3	LOS ANGELES	11250	97,627.69	7442	64,708.52

The results are inflated due to the cross join which will occurs between the fact tables when trying to get a group function result.

To avoid the chasm trap you need to separate the functions into the base SQL statements we originally designed to verify our results, adding zeros for the columns from the other table, and then union the statements together, wrap them in a SQL statement.

RESULTS

Store Name	Units Sold	Gross Sales	Units Returned	Gross Returns	Net Units Sold	Net Sales
NEW YORK	2109	18,533	61	526.16	2048	18,006
PORTLAND	2094	18,205	64	569.28	2030	17,636
PHILADELPHIA	2058	18,118	68	602.66	1990	17,515
WASHINGTON	2062	17,949	62	526.05	2000	17,423
SEATTLE	2034	17,638	43	382.70	1991	17,256
ONLINE	2053	17,599	48	414.57	2005	17,185
SAN FRANCISCO	2040	17,648	58	510.32	1982	17,137
BOSTON	2010	17,286	52	442.92	1958	16,843
LOS ANGELES	1962	17,009	67	580.03	1895	16,429

Fan traps occur when there is a that require measures from more than one fact table in a master detail relationship. An example of this is an order and order detail relationship.

For this example, our customer has given us a requirement to calculate the average order item. If we try to sum the order value from the fact order table and then sum the quantity of items ordered from the order detail to help us get the average the order value may be inflated due to the cartesian join between the fact tables. Let’s start with a simple select between the customer dimension and the FACT_ORDER table.

RESULTS

Customer ID	Customer Name	Order Value
[email protected]	Emmy Cuevas	353.80

But when we add in the FACT_ORDER_DTL to the query the result changes in the sum from the FACT_ORDER table.

RESULTS

Customer ID	Customer Name	Order Value	Qty	Detail Order Value
[email protected]	Emmy Cuevas	381.70	38	353.80

Again, this will be exacerbated by the additional number of line items in the FACT_ORDER_DTL to the cross join between the fact tables.

The way to address this challenge can be done in multiple ways. The first is to ignore the FACT_ORDER (parent) when summing information from the FACT_ORDER_DTL (child). If you rely only on the detail table you can be assured your answer will be correct.

Another way to address this challenge is to adjust the design by removing the FACT_ORDER table and inserting the ORDER_ID as a degenerate dimensional value into the FACT_ORDER_DTL.

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Media Coverage

Media Coverage

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Press Release

Media Coverage

Media Coverage

Press Release

Press Release

Press Release

Press Release

TOP 10 AUTHORS BY ARTICLE COUNT

TOP 10 ARTICLES BY MONTH

TOP 10 AUTHORS FOR 2019

RESULTS

RESULTS

RESULTS

RESULTS

RESULTS

RESULTS

Contact