17.10. SQL Join in BigQuery
Setting Up the Query
In data modeling in the BigQuery lesson, we covered preparing data and joining tables. Now we will dive deeper into joining two tables: Google Search Console as the left table and Google Analytics data as the right table.
First, we'll create clean versions of both tables using common table expressions (CTE). CTEs, which start with WITH
, help us define subqueries that we can use later in our main query.
WITH GA AS ( SELECT ... ), GoogleSearchConsole AS ( SELECT ... )
Now we'll join these two tables based on their landing pages:
SELECT * FROM GoogleSearchConsole LEFT JOIN GA ON GoogleSearchConsole.landing_page = GA.landing_page;
Before running the join, let's see how many rows we have in our initial dataset - it should be 266.
Troubleshooting the Join
After running the join, we get 488 rows instead of 266. Let's figure out why this is happening. It turns out that there are more search queries for each landing page in our dataset.
To fix this issue, we need to create another intermediary table called GA_agg
to aggregate sessions, goal completions, and goal values by landing page:
WITH GA_agg AS ( SELECT landing_page, SUM(sessions) AS sessions, SUM(goal_completions) AS goal_completions, SUM(goal_value) AS goal_value FROM GA GROUP BY landing_page )
Now let's update our main query to use GA_agg
instead of GA
:
SELECT * FROM GoogleSearchConsole LEFT JOIN GA_agg ON GoogleSearchConsole.landing_page = GA_agg.landing_page;
After running this updated query, we get 266 rows as expected.
Selecting Specific Columns
Our current query selects all columns from both tables, but we only want specific columns from the Google Analytics table. We can update our SELECT
statement to include only the desired columns:
SELECT GoogleSearchConsole.*, GA_agg.sessions, GA_agg.goal_completions, GA_agg.goal_value FROM GoogleSearchConsole LEFT JOIN GA_agg ON GoogleSearchConsole.landing_page = GA_agg.landing_page;
Now our query returns a cleaner table with landing page impressions, clicks, sessions, goal completions, and goal value for each search query - just what we wanted!