17.10. SQL Join in BigQuery

Setting Up the Query

In data modeling in the BigQuery lesson, we covered preparing data and joining tables. Now we will dive deeper into joining two tables: Google Search Console as the left table and Google Analytics data as the right table.

First, we'll create clean versions of both tables using common table expressions (CTE). CTEs, which start with WITH, help us define subqueries that we can use later in our main query.

WITH
  GA AS (
    SELECT ...
  ),
  GoogleSearchConsole AS (
    SELECT ...
  )

Now we'll join these two tables based on their landing pages:

SELECT *
FROM GoogleSearchConsole
LEFT JOIN GA
ON GoogleSearchConsole.landing_page = GA.landing_page;

Before running the join, let's see how many rows we have in our initial dataset - it should be 266.

Troubleshooting the Join

After running the join, we get 488 rows instead of 266. Let's figure out why this is happening. It turns out that there are more search queries for each landing page in our dataset.

To fix this issue, we need to create another intermediary table called GA_agg to aggregate sessions, goal completions, and goal values by landing page:

WITH
  GA_agg AS (
    SELECT
      landing_page,
      SUM(sessions) AS sessions,
      SUM(goal_completions) AS goal_completions,
      SUM(goal_value) AS goal_value
    FROM GA
    GROUP BY landing_page
  )

Now let's update our main query to use GA_agg instead of GA:

SELECT *
FROM GoogleSearchConsole
LEFT JOIN GA_agg
ON GoogleSearchConsole.landing_page = GA_agg.landing_page;

After running this updated query, we get 266 rows as expected.

Selecting Specific Columns

Our current query selects all columns from both tables, but we only want specific columns from the Google Analytics table. We can update our SELECT statement to include only the desired columns:

SELECT
  GoogleSearchConsole.*,
  GA_agg.sessions,
  GA_agg.goal_completions,
  GA_agg.goal_value
FROM GoogleSearchConsole
LEFT JOIN GA_agg
ON GoogleSearchConsole.landing_page = GA_agg.landing_page;

Now our query returns a cleaner table with landing page impressions, clicks, sessions, goal completions, and goal value for each search query - just what we wanted!

Previous
Previous

17.11. Calculating % Share of Total using SQL

Next
Next

17.9. Data Modeling in BigQuery