9.3. Data Blending: Behind the Scenes - Theory
Data blending combines data from different data sources to generate a single, unified data set. In this lesson, we'll focus on the left join method, which retains all the records from the left data source and combines them with matching records from the right data source based on a join key or condition.
Imagine two circles, each representing a data source. The left circle represents the left data source, and the right circle represents the right data source. Data blending using left join works by taking all the records from the left data source and combining them with matching records from the right data source based on the join key or condition.
If there is data in the right data source that doesn't match anything in the left data source, it will be left out of the blended data set.
Consider the following example: you have two tables, one with dates and clicks from an ad platform (left data source) and another with dates and offline sales from a CRM (right data source).
To blend the data, you'll use the "date" field as the join key. The blended data will retain all the records from the left data source (ad platform) and combine them with the matching records from the right data source (CRM). If a date from the left data source has no matching date in the right data source, the sales value will be unknown, but the clicks data will still be retained in the blended data set.
- Data blending with left join retains all records from the left data source and combines them with matching records from the right data source based on a join key or condition.
- If there is data in the right data source that doesn't match anything in the left data source, it will be left out of the blended data set.
- Data blending is similar to performing a lookup for join keys and returning matching values.
Key Concepts about Data Blending
- Join keys and join conditions must match exactly: To successfully blend data, the join keys in both data sources must match exactly. This includes matching formats, capitalization, and the absence of trailing spaces. If the join keys do not match, the blended data will return null values for those records.
- All matching rows from the right data source will be returned for every row in the left data source: Data blending works by looking up the join key in both data sources and returning all matching rows. This can cause duplicate data and incorrect totals if there are multiple matches for a single join key.
Example: Data Blending with Multiple Matches
Consider the example where you have a table with colors and the number of searches for each color (left data source) and another table with translations of the colors into Italian (right data source). In the Italian table, there are two translations for "red" and "yellow" but only one for "blue".
When blending the data using the color as the join key, Data Studio will return both translations for "red" and "yellow", resulting in duplicate rows for these colors. This will cause the total number of searches to be incorrect in the blended data set.
To avoid this issue, you can try the following:
- Eliminate duplicates from the right data source before blending.
- Establish a one-to-one relationship between the tables that you're trying to blend with each other.
When troubleshooting data blending issues, it's helpful to create the tables side by side and analyze them to understand what could be causing duplicate or incorrect totals in the blended data source.
Blending Data with Different Granularity
When blending data from different sources, it's important to ensure that the granularity of the data is consistent. If one data source has a higher level of granularity, such as sales per day, and the other has a lower level of granularity, such as sales per SKU, blending the data can cause problems such as duplicated rows and incorrect totals.
Example: Sales and Clicks Data
Consider a scenario where you have two data sources: one with the number of clicks per day from an ad platform and another with sales data broken down by date and SKU. When blending these data sources, the clicks data will be duplicated for each SKU, causing incorrect totals for clicks.
Fixing Granularity Issues
There are a few ways to fix this problem:
- Aggregate the sales data by day to match the granularity of the clicks data. This would remove the SKU breakdown and make the data sources more compatible for blending.
- Obtain the clicks data with the same level of granularity as the sales data, breaking it down by SKU. This would allow for a more accurate blend of the data.
Mental Checklist for Data Blending
When working with large amounts of data, it's essential to verify that the blended data is accurate. Here's a mental checklist to ensure data is preserved correctly during the blending process:
- Ensure a one-to-one relationship between join keys: Make sure that the join keys are unique in both data sources to avoid duplicate rows.
- Check the total number of rows before and after blending: The total number of rows should remain consistent before and after blending. If the number of rows increases after blending, it's likely that there are duplicates in the data.
- Clean up data before blending: Go through the data sources and eliminate any inconsistencies or duplicates that may cause issues during the blending process.
When working with large datasets, it's helpful to double-check the data and perform any necessary cleanup before blending to ensure accurate results. If you notice discrepancies in the blended data, such as an increase in the number of rows or incorrect totals, it's crucial to investigate the cause and resolve it before proceeding with data analysis and visualization.