16.2. Why Do We Need BigQuery?
Volume of Data
Let's begin with a pain point: the volume of data we can handle with our current data stack. A few years ago, I was working with a client who wanted to report and analyze transactional data. They had about 3,000 to 4,000 transactions daily, each with 50 columns of information. Even if we only kept one month of data in a Google Sheet, we were close to hitting the limit (5 million cells at that time). Now the limit is 10 million cells, but it's still not enough for larger datasets.
Moreover, even at 100,000 or 200,000 cells, if you have calculations and functions to perform in the sheet, it becomes very slow.
Speed of ETL
ETL stands for extract, transform, and load. It's the process of extracting data from a tool or source system, transforming it (modeling it or cleaning it), and loading it into another system or providing it to a visualization tool like Looker Studio.
The speed of ETL matters because even if your tool can handle large amounts of data (like Google Sheets), processing that data might take a long time.
Data Modeling Capabilities
The number of functions available in Google Sheets and Looker Studio are limited compared to what you have in SQL and BigQuery. BigQuery offers more flexibility and can handle more use cases for data analysis and data science.
Lack of Data Ownership
When you connect Looker Studio to an external tool or API like Google Analytics or Facebook Ads API, you don't own your data anymore. You need a place where you can store your own dataset securely.
Limited Ownership Over Modeling
Data modeling performed in Looker Studio is exposed to anyone using the report/dashboard. If you want to create an ownership strategy on your data modeling, you need a proper place to perform it.
When working with APIs, there are several factors to consider:
- Volume of data: How many rows are you requesting? How quickly can the tool process your request?
- API speed: Some tools have slow APIs, which can slow down the entire process.
- API limits: Some APIs have limits on how much data you can request per call.
Data Modeling Limitations in Looker Studio
Looker Studio has limitations when it comes to data modeling capabilities compared to BigQuery or other data warehouses. For example, you can't blend more than five data sources together in Looker Studio, and the number of functions available is limited.
Ownership of Modeling in Looker Studio
When you create functions and processes in Looker Studio, anyone with access to the report/dashboard will be able to see how it's done behind the scenes. This makes it difficult to create an ownership strategy over your data modeling and processing.
So, do you feel limited yet?
Lack of Data Ownership
We often spend a significant amount of time and resources collecting and cleaning quality data. This includes setting up cross-domain tracking, processing traffic sources, tagging websites, implementing user ID tracking, and more. However, once we collect this data, it often goes into marketing tools like Facebook or Google Analytics.
These platforms store the data that we collect and dictate how we can access it, with what frequency, and under what limitations. This is an issue because our time and money have gone into data collection but then these platforms impose restrictions on our access to the data.
Data retention policies in some platforms may also limit how long you can keep your collected data. If you want to retain your collected data longer and have full control over its usage, analysis, and access permissions, you need a platform that allows for this flexibility.
The reporting capabilities of each tool are imposed on us based on their built-in features or API limitations. For example, some clients may use an HR system that doesn't allow for easy API access or CSV exports. In extreme cases like this one mentioned earlier where clients must go through laborious processes to get their own data back from the tool they use.
Limited processing capabilities in some tools can be another challenge when working with collected data. In cases where we want more freedom in processing our collected information or if we need to handle personally identifiable information (PII) securely while being legally compliant, it's essential to consider alternative platforms or solutions.
In conclusion, owning your own dataset gives you more control over how it is processed and analyzed without being restricted by the limitations imposed by third-party tools or platforms. BigQuery can be an excellent solution for overcoming these challenges while providing advanced capabilities for handling large volumes of data at scale