Collection Joins

_Please see the accompanying demonstration video,_ linked here.

Rasgo joins the features in a collection using the dimensions to match them to one another. As the number of data sources and dimensions in a collection increases, this join logic can become complex. Rasgo allows you to quickly understand how the data sources are being joined together. When examining a collection, under the collection name, Rasgo displays the number of data sources and dimensions that this collection is using to join features.

In this case, the collection contains twelve features from five data sources. In addition, five dimensions (account_id, transaction_id, update_id, status_id and customer_id) are being used to join these features together. Clicking the Joins tab shows one block for each join between two data sources.

The first block shows that features from the data source TRANSACTIONS are being joined to features from the TRANSACTIONS STATUS data source on two dimensions, TRANSACTION_ID and STATUS_ID.

The icon in the upper right corner of this block tells us that Rasgo is performing an inner join between these two data sources.

The next block shows that the features from the TRANSACTIONS data source are being joined to those from the ACCOUNTS data source on just one dimension, ACCOUNT_ID, again as an inner join.

The third block shows that the features from the ACCOUNTS data source are being joined to those from the CUSTOMERS data source on just one dimension, CUSTOMER_ID, again as an inner join.

The last block shows that features from the TRANSACTIONS data source are being joined to features from the UPDATES data source on two dimensions, TRANSACTION_ID and UPDATE_ID, as an inner join.

Finally, on the right-hand side of the screen, there is a graphical representation of these joins. It shows two dimensions being used to join TRANSACTIONS to TRANSACTIONS STATUS and to UPDATES, one dimension joining TRANSACTIONS to ACCOUNTS, and one dimension joining ACCOUNTS to CUSTOMERS.
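The four joins shown on the Joins tab can be sketched in plain Python. This is only an illustration of the join logic, not Rasgo's implementation: the `inner_join` helper, the sample rows, and the non-dimension columns (`AMOUNT`, `STATUS`, `SEGMENT`, `UPDATED_BY`) are hypothetical. Only the data source and dimension names come from the example above.

```python
def inner_join(left, right, keys):
    """Inner-join two lists of dict rows on the given dimension names."""
    # Index the right-hand rows by their key tuple for fast lookup.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        # A left row with no match on every key is dropped (inner join).
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


# Hypothetical sample rows; key columns follow the dimensions in the example.
transactions = [
    {"TRANSACTION_ID": 1, "STATUS_ID": 10, "UPDATE_ID": 100, "ACCOUNT_ID": "A", "AMOUNT": 25.0},
    {"TRANSACTION_ID": 2, "STATUS_ID": 20, "UPDATE_ID": 200, "ACCOUNT_ID": "B", "AMOUNT": 40.0},
    {"TRANSACTION_ID": 3, "STATUS_ID": 30, "UPDATE_ID": 300, "ACCOUNT_ID": "C", "AMOUNT": 15.0},
]
transactions_status = [
    {"TRANSACTION_ID": 1, "STATUS_ID": 10, "STATUS": "settled"},
    {"TRANSACTION_ID": 2, "STATUS_ID": 20, "STATUS": "pending"},
]
accounts = [
    {"ACCOUNT_ID": "A", "CUSTOMER_ID": "X"},
    {"ACCOUNT_ID": "B", "CUSTOMER_ID": "Y"},
]
customers = [{"CUSTOMER_ID": "X", "SEGMENT": "retail"}]
updates = [{"TRANSACTION_ID": 1, "UPDATE_ID": 100, "UPDATED_BY": "system"}]

# Apply the joins in the same order as the blocks on the Joins tab.
result = inner_join(transactions, transactions_status, ["TRANSACTION_ID", "STATUS_ID"])
result = inner_join(result, accounts, ["ACCOUNT_ID"])
result = inner_join(result, customers, ["CUSTOMER_ID"])
result = inner_join(result, updates, ["TRANSACTION_ID", "UPDATE_ID"])

print(len(result))  # number of rows that survive all four inner joins
```

Because every join is an inner join, a row survives only if it finds a match at each step, which is why the final set can be much smaller than any single data source.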

If you have generated the data by clicking Access Data in the upper right corner and then Generate Training Data in the corner of the box that pops up, Rasgo will generate insights about the join. Select the Insights tab and examine the Join Stats.

Here we see that when TRANSACTIONS and TRANSACTIONS STATUS are joined, about 5,000 rows from TRANSACTIONS are not matched. When that combined set is joined to ACCOUNTS, roughly 10,000 rows from ACCOUNTS are lost. Next, CUSTOMERS is joined to this combined set and about 10,000 rows from CUSTOMERS are lost. This leaves roughly 20,000 rows to merge with the UPDATES features. Only about 10,000 rows from the combined set remain in the final data set.
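One way to think about these per-step numbers is as counts of rows whose join keys have no partner on the other side. The sketch below is an assumption about how such counts could be computed, not a description of how Rasgo produces the Join Stats. The `unmatched_counts` helper and the sample rows are hypothetical.

```python
def unmatched_counts(left, right, keys):
    """Return (left_only, right_only): rows on each side with no key match."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    left_keys = {key_of(r) for r in left}
    right_keys = {key_of(r) for r in right}
    # Rows counted here would be dropped by an inner join on `keys`.
    left_only = sum(1 for r in left if key_of(r) not in right_keys)
    right_only = sum(1 for r in right if key_of(r) not in left_keys)
    return left_only, right_only


# Hypothetical rows: transactions 1-5 and status records for transactions 3-6.
transactions = [{"TRANSACTION_ID": i, "STATUS_ID": i % 3} for i in range(1, 6)]
transactions_status = [{"TRANSACTION_ID": i, "STATUS_ID": i % 3} for i in range(3, 7)]

left_only, right_only = unmatched_counts(
    transactions, transactions_status, ["TRANSACTION_ID", "STATUS_ID"]
)
print(left_only, right_only)  # prints "2 1"
```

Here transactions 1 and 2 have no status record and status record 6 has no transaction, so an inner join would drop two rows from the left side and one from the right.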

In the end, all of the data from UPDATES remains, but 67% of the 30,000 rows in CUSTOMERS and in ACCOUNTS are lost, 50% of the 20,001 rows in TRANSACTIONS STATUS are lost, and 60% of the 25,001 rows in TRANSACTIONS are lost.
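These percentages follow from the row counts quoted above, taking the final data set to be about 10,000 rows. The short check below uses that approximate figure as an assumption, and `percent_lost` is a hypothetical helper, not part of Rasgo.

```python
# Approximate size of the final joined data set, as quoted in the text.
remaining = 10_000

# Row counts per data source, as quoted in the text.
source_rows = {
    "CUSTOMERS": 30_000,
    "ACCOUNTS": 30_000,
    "TRANSACTIONS STATUS": 20_001,
    "TRANSACTIONS": 25_001,
}


def percent_lost(total, kept):
    """Share of rows that did not survive the joins, as a whole percent."""
    return round(100 * (total - kept) / total)


lost = {name: percent_lost(total, remaining) for name, total in source_rows.items()}
print(lost)
# {'CUSTOMERS': 67, 'ACCOUNTS': 67, 'TRANSACTIONS STATUS': 50, 'TRANSACTIONS': 60}
```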
