bigquery数据类型

I’ve used BigQuery every day with small and big datasets querying tables, views, and materialized views. During this time I’ve learned some things, I would have liked to know since the beginning. The goal of this article is to give you some tips and recommendations for optimizing your costs and performance.

我每天都使用BigQuery处理大小数据集，用于查询表，视图和实例化视图。在这段时间里，我学到了一些东西，从一开始我就想知道。本文的目的是为您提供一些技巧和建议，以优化成本和性能。

基本概念 (Basic concepts)

BigQuery: Serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility [Google Cloud doc].

BigQuery ：专为业务敏捷性而设计的无服务器，高度可扩展且经济高效的多云数据仓库[ Google Cloud doc ]。

bigquery数据类型_将BigQuery与TB数据一起使用后的成本和性能课程_第1张图片

BigQuery logo BigQuery徽标

GCP Project: Google Cloud projects form the basis for creating, enabling, and using all Google Cloud services including managing APIs, enabling billing, adding and removing collaborators, and managing permissions for Google Cloud resources [Google Cloud doc]. Each project could have more than one BigQuery Dataset.

GCP项目 ：Google Cloud项目构成了创建，启用和使用所有Google Cloud服务的基础，包括管理API，启用计费，添加和删除协作者以及管理Google Cloud资源的权限[ Google Cloud文档 ]。每个项目可能有多个BigQuery数据集。

bigquery数据类型_将BigQuery与TB数据一起使用后的成本和性能课程_第2张图片

GCP Project GCP项目

Dataset: It’s similar to the Databases or Schema concepts in other RDBMS. There are located tables, views, and materialized views.

数据集 ：它类似于其他RDBMS中的数据库或架构概念。存在表，视图和实例化视图。

bigquery数据类型_将BigQuery与TB数据一起使用后的成本和性能课程_第3张图片

BigQuery’s Datasets BigQuery的数据集

Slot: It’s a virtual CPU used by BigQuery to execute SQL queries. BigQuery automatically calculates how many slots are required by each query, depending on query size and complexity [Google Cloud doc]. This is the principal indicator if you are in flat-rate pricing because you make a slot commitment (minimum 500 slots).

插槽：这是BigQuery用于执行SQL查询的虚拟CPU。 BigQuery会根据查询的大小和复杂程度自动计算每个查询需要多少个广告位[ Google Cloud doc ]。如果您采用固定费用定价，因为这是您承诺的广告位承诺(最少500个广告位)，则这是主要指标。

Slot/ms: It’s the total amount of slots per millisecond used by a query. That’s the total number of slots consumed by the query over its entire execution time [Taking a practical approach to BigQuery slot usage analysis]. Notice that a query uses a different quantity of slots during all execution, for example, the number of slots when start making a join is higher than making a filter and also is the duration time.

插槽/毫秒：这是查询使用的每毫秒的插槽总数。这是查询在整个执行时间内所消耗的插槽总数[ 采用实用的方法进行BigQuery插槽使用率分析 ]。请注意，查询在所有执行过程中使用不同数量的插槽，例如，开始建立连接时的插槽数量要比创建过滤器的数量高，并且持续时间也要多。

Numer of byte processed: It’s the total bytes read when you run a query. This is the principal cost if you are in on-demand pricing.

已处理字节数 ：运行查询时读取的总字节数。如果您按需定价，这是主要成本。

Getaway of “Select * “ 逃脱“选择*”

Partition: A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data [BigQuery partitioned table].

分区：分区表是一种特殊的表，分为几部分，称为分区，这使管理和查询数据更加容易[ BigQuery分区表 ]。

bigquery数据类型_将BigQuery与TB数据一起使用后的成本和性能课程_第4张图片

A partitioned and clustered table 分区和集群表

Clustering: A clustered table data is automatically organized based on the contents of one or more columns in the table’s schema [BigQuery clustered table].

集群：集群表数据是根据表架构[ BigQuery集群表 ]中一个或多个列的内容自动组织的。

Resuming this two important concepts, I’ve made a simple image to understand visually how a table get organized.

总结这两个重要概念，我制作了一个简单的图像以直观地了解表的组织方式。

bigquery数据类型_将BigQuery与TB数据一起使用后的成本和性能课程_第5张图片

费用建议 (Cost recommendations)

There is not a magic recipe, however, It’s important to understand the BigQuery’s cost structure and then what exactly I’m spending on it.

但是，这并不是一个神奇的秘诀，重要的是要了解BigQuery的成本结构，然后确切地讲我要花多少钱。

Cost Structure

成本结构

Storage [pricing]
储存 [ 定价 ]

Active: A monthly charge for data stored in tables or in partitions that have been modified in the last 90 days. $0.020 per GB
活跃：按月收费，用于存储在过去90天内已修改的表或分区中的数据。 每GB $ 0.020
Long-term: A lower monthly charge for data stored in tables or in partitions that have not been modified in the last 90 days. $0.010 per GB
长期：对于过去90天内未修改的表或分区中存储的数据，每月收取的费用较低。 每GB $ 0.010

‍ Cost TipPartition your tables whenever is possible. Image this scenario you have a 10 Terabytes DataWarehouse and only use your last month’s data. If all your tables are not partitioned this could mean $200 monthly or for a partitioned slightly more than $100 (without counting the compute cost savings!).

成本提示尽可能对表进行分区。在这种情况下，您有一个10 TB的DataWarehouse，并且仅使用上个月的数据。如果您的所有表都没有分区，则意味着每月200美元，或者分区的价格略高于100美元(不计算节省的计算费用！)。

Since the BigQuery partition limit for each table is 4000 an interesting approach could be to set a ‘maximum history availability’ in this case could be 3 years and the older data migrate to a cheaper storage li GCS nearline ($0.004 per GB), so with this, your DataWarehouse will not grow exponentially.

由于每个表的BigQuery分区限制为4000，因此一种有趣的方法是在这种情况下将“最大历史可用性”设置为3年，并且较旧的数据迁移到价格便宜的GCS近线存储(每GB 0.004美元)，因此这样，您的DataWarehouse不会成倍增长。

2. Compute

2.计算

BigQuery has two pricing models and one hybrid model:

BigQuery有两种定价模型和一种混合模型：

on-demand: pay for the number of bytes processed $5 per TB.
按需：支付处理的字节数，每TB 5美元。
flat-rate: you purchase a monthly Slot commitment (minimum commitment is 500 slots, $10 000 monthly), this means you’ll have a maximum compute capacity (500 CPUs) for your GCP Project, the advantage here is you pay a fixed amount doesn’t matter the number of TB processed.
固定费用：您按月购买插槽承诺(最低承诺为500个插槽，每月支付10,000美元)，这意味着您将为GCP项目拥有最大的计算容量(500个CPU)，这是您需要支付固定金额的处理的TB数量无关紧要。
flat-slot: is similar to flat-rate the main difference is that the commitment is as little as 60 seconds. The cost here is $20 per hour.
flat-slot：类似于固定速率，主要区别在于承诺时间仅为60秒。每小时费用为$ 20。

‍ Cost Tip

费用提示

To decide which strategy is the best for your team you need to follow your expenses and slot utilization. Let’s check a real case and identify which pricing model is the best and why.

要确定哪种策略最适合您的团队，您需要了解支出和插槽利用率。让我们检查一个实际案例，并确定哪种定价模型是最佳的以及为什么。

Cloud Monitoring

云监控

Activate this important tool.

激活此重要工具。

bigquery数据类型_将BigQuery与TB数据一起使用后的成本和性能课程_第6张图片

The first time It needs to create the workstation 第一次需要创建工作站

Then select the BigQuery dashboard

然后选择BigQuery仪表板

bigquery数据类型_将BigQuery与TB数据一起使用后的成本和性能课程_第7张图片

Select BigQuery Dashboard 选择BigQuery仪表板

Now Let’s focus on the ‘Slot utilization’ chart. Here we observe a few peaks over 500 that could be caused by intense not optimized queries. This could tell me that an on-demand model or flat-rate could work, to define let’s check bytes processed.

现在，让我们关注“插槽利用率”图表。在这里，我们观察到500多个峰值可能是由强烈的未经优化的查询引起的。这可以告诉我按需模型或固定费率可以工作，以定义让我们检查已处理的字节。

bigquery数据类型_将BigQuery与TB数据一起使用后的成本和性能课程_第8张图片

Second check the bytes processed and the daily and monthly cost. This query results in the last 30 days we processed 115 Terabytes with a total cost of $572 this means that we are far from the $10,000 fix-rate and we keep with the on-demand price model. Another insight is that on July 16 was the highest peak in terabytes processed so only on this day a good idea would have been to use flex-slot.

其次检查已处理的字节以及每日和每月成本。该查询的结果是，在过去30天内，我们处理了115 TB数据， 总成本为572美元，这意味着我们与10,000美元的固定费用相距甚远，并且我们保持按需价格模型。另一个见解是，7月16日是处理的TB的最高峰，因此只有在这一天，才可以使用flex-slot。

绩效建议 (Performance recommendations)

The best partition field (Not is always day!)

最佳分区字段(并非总是一天！)

Each time I need to figure out which is the best partition strategy, I rely on the Google Cloud recommendation. At least each partition should be 1GB. This brings me that the majority of my tables are now partitioned by month.

每当我需要确定哪种是最佳分区策略时，我都会依靠Google Cloud的建议。至少每个分区应为1GB。这给我带来了我的大多数表现在按月分区。

‍ Performance Tip

性能提示

Evaluate to activate the ‘require_partition_filter’ so each time a person needs to query a table It will be required to add a Where clause filtering with the partition field.

评估以激活“ require_partition_filter”，以便每次有人需要查询表时，都需要在分区字段中添加一个Where子句过滤。

ALTER TABLE mydataset.mypartitionedtable
 SET OPTIONS (require_partition_filter=true)

The best clustering strategy

最佳集群策略

Clustering is another great tool for optimizing your cost and performance. I always think clustering like a ‘box inside a box’ so to access an internal box I need to interact with the upper box.

群集是用于优化成本和性能的另一种出色工具。我一直认为群集就像“ 盒子里的盒子 ”一样，因此要访问内部盒子，我需要与上面的盒子进行交互。

‍ Performance Tip

性能提示

To understand this idea let’s compare the following three queries. The last one only retrieves the data within the partition, and giving the two levels of partitioning means that the query wouldn’t need to ‘open’ each box until finding the ‘city box’. All this will be reflected in a short execute time and fewer bytes consumed.

为了理解这个想法，让我们比较以下三个查询。最后一个仅检索分区内的数据，并且提供两个分区级别意味着查询无需先打开每个框，直到找到“城市框”为止。所有这些都将在较短的执行时间和更少的字节消耗中得到反映。

Avoid windows functions

避免Windows功能

Operations that need to see all the data in the resulting table at once have to operate on a single node. Un-partitioned window functions like RANK() OVER() or ROW_NUMBER() OVER() will operate on a single node. [Looker question]. This means that it doesn’t matter that you have the best price model with thousands of slots, still, all the data will go to a single node if you use a window function.

需要立即查看结果表中所有数据的操作必须在单个节点上进行。未分区的窗口函数(例如RANK() OVER()或ROW_NUMBER() OVER()将在单个节点上运行。 [ Looker问题 ]。这意味着，具有数千个插槽的最佳价格模型并不重要，但是，如果您使用窗口功能，所有数据将转到单个节点。

‍ “Resources exceeded during query execution” Tip

“查询执行期间超出了资源”提示

There are a few strategies you could use here. First for example if you need to use OVER()you need to PARTITION the window function by date and build a string as the primary key [Looker question].

您可以在此处使用一些策略。首先，例如，如果需要使用OVER() ，则需要按日期对window函数进行分区，并构建一个字符串作为主键[ Looker问题 ]。

CONCAT(CAST(ROW_NUMBER() OVER(PARTITION BY event_date) AS STRING), ‘|’,(CAST(event_date AS STRING)) as id

If your query contains an ORDER BY clause It may cause the “Resources exceeded” message, for the same reason, all the data is passed to one node. In this situation, avoid ORDER BY if your result is just for creating a new table.

如果您的查询包含ORDER BY子句，则可能由于相同的原因而导致“超出资源”消息，所有数据都传递到一个节点。在这种情况下，如果结果只是用于创建新表，请避免使用ORDER BY 。

Exists some strategies to increase the resources during the execution time. Please refer to the Google Cloud documentation for more details

存在一些在执行期间增加资源的策略。有关更多详细信息，请参阅Google Cloud文档。

结论 (Conclusions)

BigQuery is a fantastic Data Warehouse with a challenging price model. To keep your cost-controlled without losing the performance, I recommend you to keep reading more articles and useful information I’m sure exists many excellent tips out there ‍ ‍.

BigQuery是一个出色的数据仓库，具有挑战性的价格模型。为了在不损失性能的情况下保持成本控制，我建议您继续阅读更多文章和有用的信息，我相信这里确实存在许多出色的技巧‍ ‍。

PS This month Google Cloud introduced BigQuery Omni,

PS本月Google Cloud推出了BigQuery Omni ，

It’s a flexible, multi-cloud analytics solution that lets you cost-effectively access and securely analyze data across Google Cloud, Amazon Web Services (AWS), and Azure.

它是一种灵活的多云分析解决方案，可让您经济高效地访问和安全地跨Google Cloud，Amazon Web Services(AWS)和Azure进行数据分析。

I hope to be writing about my experience in the following weeks! Also, keep an eye on Materialized Views, now is in Beta that is why I didn’t add too much information here, however, I recommend you to give it a try.

我希望在接下来的几周内写下我的经历！另外，请注意Materialized Views (现在处于Beta版)，这就是为什么我在此处未添加太多信息的原因，但是，我建议您尝试一下。

PS 2 if you have any questions, or would like something clarified, ping me on Twitter or LinkedIn I like having a data conversation If you want to know about Apache Arrow and Apache Spark I had an article A gentle introduction to Apache Arrow with Apache Spark and Pandas with some examples.

PS 2如果您有任何疑问或想澄清一些问题，请在Twitter或LinkedIn上ping我，我想进行数据对话如果您想了解Apache Arrow和Apache Spark，我有一篇文章 通过Apache Spark和Pandas轻松介绍Apache Arrow 有一些例子。

有用的链接 (Useful links)

Thanks to all these persons behind each link.

感谢每个链接后面的所有这些人。

https://cloud.google.com/blog/products/data-analytics/monitoring-resource-usage-in-a-cloud-data-warehouse
https://cloud.google.com/blog/products/data-analytics/monitoring-resource-usage-in-a-cloud-data-warehouse
https://cloud.google.com/bigquery/docs/reservations-workload-management#choosing_between_on-demand_and_flat-rate_billing_models
https://cloud.google.com/bigquery/docs/reservations-workload-management#choosing_between_on-demand_and_flat-rate_billing_models
https://github.com/GoogleCloudPlatform/professional-services/tree/master/tools/bq-visualizer
https://github.com/GoogleCloudPlatform/professional-services/tree/master/tools/bq-visualizer
https://cloud.google.com/bigquery/pricing#on_demand_pricing
https://cloud.google.com/bigquery/pricing#on_demand_pricing
https://cloud.google.com/bigquery/docs/clustered-tables
https://cloud.google.com/bigquery/docs/clustered-tables
https://cloud.google.com/bigquery/docs/partitioned-tables
https://cloud.google.com/bigquery/docs/partitioned-tables
https://cloud.google.com/bigquery/pricing#active_storage
https://cloud.google.com/bigquery/pricing#active_storage
https://cloud.google.com/bigquery/pricing#flat_rate_pricing
https://cloud.google.com/bigquery/pricing#flat_rate_pricing
https://cloud.google.com/bigquery/docs/estimate-costs
https://cloud.google.com/bigquery/docs/estimate-costs
https://cloud.google.com/storage/pricing
https://cloud.google.com/storage/pricing
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language
https://medium.com/enkel-digital/googles-big-query-resources-exceeded-during-query-execution-%EF%B8%8F-13c20b6693e2
https://medium.com/enkel-digital/googles-big-query-resources-exceeded-during-query-execution-%EF%B8%8F-13c20b6693e2
https://discourse.looker.com/t/resources-exceeded-during-query-execution-when-building-derived-table-in-bigquery/4414
https://discourse.looker.com/t/resources-exceeded-during-query-execution-when-building-derived-table-in-bigquery/4414
https://cloud.google.com/bigquery/docs/writing-results#large-results
https://cloud.google.com/bigquery/docs/writing-results#large-results

翻译自: https://medium.com/dataseries/costs-and-performance-lessons-after-using-bigquery-with-terabytes-of-data-54a5809ac912