How to Build a BigQuery Data Pipeline: Combine Search Console & GA4 Data for Revenue Insights
Have you ever wondered how to connect the dots between your website's search performance and actual revenue? In this guide, I'll walk you through creating a powerful data pipeline that answers exactly that question. We'll focus on the two crucial first steps: bringing your data into BigQuery and transforming it into meaningful insights.
By the end of this tutorial, you'll know how to connect multiple data sources and use SQL to reveal which search queries are driving the most value for your business. Let's dive in!
Why Build a Data Pipeline in BigQuery?
Before we get our hands dirty with code, let's understand what we're trying to achieve. A data pipeline is essentially a series of data processing steps that takes raw data from various sources, cleans it, transforms it, and prepares it for analysis.
For businesses with a web presence, one of the most valuable analyses is understanding which search queries bring not just traffic, but actual revenue. This requires connecting data from:
Google Search Console: Shows which queries drive impressions and clicks
Google Analytics: Shows which landing pages generate sessions, conversions, and revenue
By combining these data sources in BigQuery, we can perform powerful attribution analysis that's simply not possible in the original platforms.
Bringing Data into BigQuery
The Data Sources We'll Use
For this tutorial, we'll work with two data sources:
Google Analytics 4 data exported to a Google Sheet
Google Search Console data already in BigQuery through its data transfer service
Both datasets contain one month (October) of data from the Looker Studio Masterclass website.
Our GA4 data contains date, session source, landing page, sessions, transactions, and revenue information.
Method 1: Using Google Search Console's Data Transfer Service
Google Search Console offers a native data transfer service that automatically exports your performance data to BigQuery daily. This is a recent addition (within the last couple of years) and is incredibly valuable.
To set this up for your own website:
In Google Search Console, navigate to Settings
Look for the "Bulk data export" option
Follow the setup wizard to connect to your BigQuery project
Once configured, you'll have a continuously updating table with all your search performance metrics, including impressions, clicks, and positions for each query and URL.
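Once the export is running, you can sanity-check it with a quick query. The names below assume the export's default setup (a dataset called searchconsole containing a searchdata_url_impression table); adjust the project, dataset, and table names to whatever your own export uses:

-- A minimal sketch: preview yesterday's query-level data from the native export
-- (table and column names assume the default Search Console export schema)
SELECT
  data_date,
  url,
  query,
  impressions,
  clicks
FROM
  `your-project.searchconsole.searchdata_url_impression`
WHERE
  data_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
ORDER BY
  clicks DESC
LIMIT 20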
Method 2: Creating an External Table from Google Sheets
For our GA4 data, we'll create what's called an "external table" in BigQuery. This is different from a standard table because:
The data remains in Google Sheets
The table is just a reference to that sheet
If the sheet updates, the table reflects those changes immediately
Here's how to create an external table:
In your BigQuery project, navigate to your dataset
Click the three dots menu and select "Create table"
For source, select "Drive" and paste your Google Sheet URL
Set "File format" to "Google Sheet"
Name your table (we'll use GA_Organic)
Check "Auto-detect schema" to let BigQuery figure out data types
Under "Advanced options," set "Header rows to skip" to 1
Click "Create table"
Creating an external table from a Google Sheet in BigQuery
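If you prefer SQL over the UI, the same external table can also be defined with BigQuery DDL. Here's a sketch using a placeholder sheet URL and the GA_Organic name from the steps above; note that querying Drive-backed tables requires granting BigQuery access to Google Drive:

-- Sketch of the equivalent DDL; replace the project, dataset, and sheet URL with your own
CREATE OR REPLACE EXTERNAL TABLE `your-project.your_dataset.GA_Organic`
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/your-sheet-id'],
  skip_leading_rows = 1  -- same as "Header rows to skip" in the UI
);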
Pro Tip: External tables are perfect for data that updates regularly in Google Sheets. If you use tools like Supermetrics or Google's native GA4 add-on to refresh your sheet daily, your BigQuery table will always have the latest data.
Testing Your Connections
Let's verify our data is accessible by running simple queries on both tables:
-- Query GA4 data
SELECT * FROM `your-project.your_dataset.GA_Organic`;

-- Query Search Console data
SELECT * FROM `your-project.your_dataset.Google_Search_Console`;
If both queries return data, congratulations! You've successfully brought two data sources into BigQuery. Now let's transform them into something useful.
Processing & Transforming Data
This is where the real magic happens. Raw data is rarely useful on its own - we need to clean, reshape, and combine it to unlock insights.
Step 1: Preparing the Search Console Data
First, let's prepare our Search Console data. We want to focus on queries that actually sent clicks to our site:
SELECT
  date,
  url,
  query,
  SUM(clicks) AS total_clicks
FROM
  `your-project.your_dataset.Google_Search_Console`
WHERE
  query IS NOT NULL -- Remove anonymized queries
  AND clicks > 0 -- Only include queries that drove clicks
GROUP BY
  date, url, query
ORDER BY
  date, url -- Keep results organized
This query:
Selects only the data we need
Filters out anonymized queries (where query is NULL)
Excludes queries with zero clicks
Aggregates clicks by date, URL, and query
Organizes results chronologically and by URL
Now we have a clean table showing which queries drove traffic to which URLs on which days.
Step 2: Preparing the GA4 Data
Next, let's prepare our GA4 data. We need to select relevant columns and ensure URL formats will match our Search Console data:
SELECT
date,
CONCAT("https://", hostname, landing_page) AS url, -- Create full URLs
sessions,
transactions,
purchase_revenue
FROM
`your-project.your_dataset.GA_Organic`
WHERE
landing_page != '(not set)' -- Remove rows with invalid landing pages
The CONCAT() function is crucial here - it formats our landing page data to match the URL format in Search Console. Without this, our join would fail due to mismatched keys.
Step 3: Using Common Table Expressions (CTEs) for Clarity
For complex transformations, Common Table Expressions (CTEs) are a game-changer. Think of them as temporary, named result sets that make your code more readable and modular.
Here's how we'll use CTEs to organize our query:
WITH GSC AS (
  -- Our Search Console preparation query goes here
),
GA AS (
  -- Our GA4 preparation query goes here
)
-- Our main query will reference these CTEs
SELECT * FROM GSC; -- Just a placeholder for now
Pro Tip: CTEs make complex queries much easier to understand and debug. I always use them when joining multiple data sources or when a query has distinct logical steps.
Step 4: Joining the Datasets with LEFT JOIN
Now we'll combine our datasets using a LEFT JOIN. We'll use the Search Console data as our "left" table because:
We want to keep ALL search queries, even if they don't match a landing page in GA
We're trying to attribute GA metrics (sessions, revenue) to those search queries
WITH GSC AS (
  -- Search Console prep query
),
GA AS (
  -- GA4 prep query
)
SELECT
  GSC.date,
  GSC.url,
  GSC.query,
  GSC.total_clicks,
  GA.sessions,
  GA.transactions,
  GA.purchase_revenue
FROM
  GSC
LEFT JOIN
  GA
ON GSC.date = GA.date
  AND GSC.url = GA.url
When we run this query, we immediately notice a problem. Many queries that we know drove traffic show NULL values for sessions and revenue. What's happening?
Troubleshooting: The URL Mismatch Problem
The most common join issue is mismatched keys. In our case, we thought we fixed our URL formats, but a closer inspection reveals:
GSC URL: https://lookerstudiomasterclass.com/
GA URL after CONCAT: https://lookerstudiomasterclass.com/
They look identical, but there might be trailing slashes, capitalization differences, or other subtle mismatches. We need to ensure our URL formatting is consistent in both CTEs.
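One way to rule out these subtle differences is to normalize the URL the same way in both CTEs, for example by lowercasing it and trimming any trailing slash. Here's a sketch of the GA CTE with that normalization applied; the GSC CTE would get the same LOWER(RTRIM(...)) treatment on its url column:

-- Illustrative normalization (one possible approach; match it to the mismatch you actually find)
SELECT
  date,
  LOWER(RTRIM(CONCAT("https://", hostname, landing_page), '/')) AS url,
  sessions,
  transactions,
  purchase_revenue
FROM
  `your-project.your_dataset.GA_Organic`
WHERE
  landing_page != '(not set)'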
The "Fan-Out" Problem: When One Query Isn't Enough
After fixing our URLs, we run into a more subtle issue. Looking at the joined data, we find multiple queries for the same landing page on the same day, each showing identical session and revenue figures.
Wait - how can three different queries ALL be responsible for the SAME 13 sessions and $14,500 in revenue? This doesn't make sense!
The problem is a data granularity mismatch:
Our GSC data is at the query level (showing which queries drove clicks)
Our GA data is at the landing page level (showing total sessions and revenue for each page)
When we join them, each query gets "credited" with the FULL performance of the landing page. This is called a "fan-out" problem.
Solving the Fan-Out Problem with Window Functions
To solve this, we need to distribute the credit proportionally. If a query drove 2 out of 7 total clicks to a page, it should get 2/7ths of the sessions and revenue.
This is where window functions come in. Window functions perform calculations across a set of table rows related to the current row, without collapsing those rows like a GROUP BY would.
Here's our approach:
Calculate the total clicks for each URL on each day
Calculate each query's share of those clicks
Multiply GA metrics by that share to get attributed values
WITH GSC_step_1 AS (
-- Our initial GSC query
),
GSC_step_2 AS (
SELECT
*,
SUM(total_clicks) OVER (PARTITION BY date, url) AS total_url_clicks,
total_clicks / SUM(total_clicks) OVER (PARTITION BY date, url) AS share_of_clicks
FROM
GSC_step_1
),
GA AS (
-- Our GA4 query
)
SELECT
GSC.date,
GSC.url,
GSC.query,
GSC.total_clicks,
GSC.share_of_clicks,
GA.sessions * GSC.share_of_clicks AS attributed_sessions,
GA.transactions * GSC.share_of_clicks AS attributed_transactions,
GA.purchase_revenue * GSC.share_of_clicks AS attributed_revenue
FROM
GSC_step_2 AS GSC
LEFT JOIN
GA
ON GSC.date = GA.date
AND GSC.url = GA.url
Now the attribution looks logical.
Each query gets credit proportional to the clicks it drove. The sum of attributed sessions equals the original 13 sessions, and the revenue distribution makes sense.
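Before trusting the numbers, it's worth verifying that the proportional credit adds back up to the page-level totals. A sketch of that sanity check, assuming the final query above has been wrapped in one more CTE (here called attributed, a hypothetical name):

SELECT
  date,
  url,
  ROUND(SUM(attributed_sessions), 2) AS summed_sessions,  -- should match the GA sessions for the page
  ROUND(SUM(attributed_revenue), 2) AS summed_revenue     -- should match the GA purchase_revenue
FROM
  attributed  -- hypothetical CTE wrapping the final SELECT above
GROUP BY
  date, url
ORDER BY
  summed_revenue DESC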
Frequently Asked Questions (FAQs)
Can I use BigQuery as a freelancer, or is it only for large companies with in-house teams?
BigQuery is absolutely suitable for freelancers and small agencies! You can help clients set up their BigQuery pipelines while they own the project and pay the (often minimal) costs. I started using BigQuery as a freelancer when a client had too much data for Google Sheets. It's a valuable service to offer clients of any size.
What's the role of tools like Stitch or DBT when I can transfer data directly from Sheets to BigQuery?
Tools like Stitch and Fivetran are valuable when you need to bring data from sources without native BigQuery connectors (like Facebook Ads or Shopify). DBT and similar tools help manage complex transformation workflows with versioning and testing. For simpler projects like ours, BigQuery's native capabilities are often sufficient.
For services like Google Ads that create many tables and views, which should I use?
If Google creates views as part of their data transfer, start with those views as your base. They typically include helpful initial transformations. Examine the underlying query of the view to understand what it's doing, then build your additional transformations on top.
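To inspect what a transfer-created view actually does, you can read its definition from INFORMATION_SCHEMA. A sketch, with a made-up view name to replace with your own:

-- Look up the SQL behind a view created by a data transfer
SELECT
  table_name,
  view_definition
FROM
  `your-project.your_dataset.INFORMATION_SCHEMA.VIEWS`
WHERE
  table_name = 'p_ads_CampaignStats'  -- hypothetical view name; use one from your dataset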
What about BigQuery's "Workflows" feature?
The Workflows feature, powered by Dataform, provides a visual way to design data pipelines. It's extremely promising for managing complex data flows and making your processes more transparent. While I still use scheduled queries for existing projects, I'd strongly consider Workflows for new initiatives.
In this guide, we've covered the crucial first steps of building a data pipeline in BigQuery: bringing in your data and transforming it into actionable insights. We've learned how to:
Create an external table from Google Sheets
Use Google Search Console's native BigQuery export
Clean and prepare our data using SQL
Join datasets with proper key matching
Solve the "fan-out" problem with window functions
Properly attribute website performance to search queries
This foundation allows you to answer the critical business question: "Which search queries are driving the most revenue for my website?"
In the next part of this series, we'll explore how to save our analysis as tables, views, or scheduled queries, and how to visualize the results in Looker Studio. Stay tuned!
Note:
This post is based on a subject covered in the Looker Studio Masterclass Program. To learn more about Looker Studio Masterclass, click here.