BigQuery Fun Facts!
If you’re new to BigQuery, I suggest starting here and reading the official docs first.
- Export all your audit and billing logs back to BigQuery for analysis. I don’t know how many times this pattern has saved my butt.
- Don’t be lazy with your SQL. Avoid
SELECT *on big tables unless you absolutely have to. BigQuery charges on data scanned. The fewer columns you reference, the cheaper the query.
- BigQuery likes wide, denormalized tables with nested and repeated data. So, denormalise whenever possible (also see here). However, that doesn’t mean BigQuery can’t handle normalized data and joins. It absolutely can. It just performs better on denormalized stuff because BigQuery is essentially an OLAP engine.
LIMITdoes not reduce costs. It's an anti-pattern. You still pay for the table scan. It will return your results quicker though. There is one exception to this - see #7 for more info (clustering).
- Use custom quotas to control costs when using on-demand. Also, use
max-bytes-billedto control individual query costs. Gotta watch those greens!
- Try to avoid ingestion-based partitioning if you can. It can get cumbersome when wrangling the data and working across multiple time zones. Instead, partition by a column instead. It’s more intuitive and easier to maintain. You can currently partition by date/timestamp or integer range.
- Once you’ve partitioned your data, then cluster it for a free turbo boost on your queries and some cost savings. In addition, and as my good friend Felipe Hoffa pointed out on Twitter here, a clustered table will stop counting bytes early if a
LIMITstatement is used for simple statements. This is another good reason to cluster your tables!
- For partitioned tables, enforce users to specify the partitioned column in their
WHEREclause by using setting the
true. Reduces cost and speeds up query time.
- Contrary to popular belief, BigQuery’s storage layer is not GCS. It’s Colossus. GCS and BigQuery both use Colossus under the hood. Oh, and again, you don’t need to know this stuff, but it’s a fun fact: BigQuery uses Bigtable for its streaming engine, and Spanner for its metadata and query result preview.
- Keep an eye on the materialized views. It’s something that the BigQuery community has been waiting a long time for.
- Set TTLs on datasets/tables when you know data won’t be needed after N period. If you want to archive even further, move it out to GCS cold line. It’s much cheaper.
SELECT *to preview data. Use
- BigQuery Machine Learning (BQML) is cool ‘n all, but dang is it expensive! At the time of writing, it’s ~$470(AUD) p/TB when creating/training the model if you’re using the on-demand pricing model (it’s included as part of the flat-rate model). Be warned.
- Don’t use streaming ingestion if you don’t have to. You need to pay for it. Batch loading is free on the other hand.
- Enable the cache (24hrs). Remember that it’s not shared though. It’s per user, per project. Use something with a service/account in the middle if you want to share the cache i.e. a proxy.
- Editing — not querying — a table resets the long-term storage discount counter. You’ll have to wait 90 days again if you do edit it. Bummer.
- Be wary of 3rd party chrome plugins that promise to save you lots of money or improve the performance of your queries. They require elevated permissions on your GCP project(s), which might not fly with your security/privacy teams — especially at an enterprise level. I’m also dubious about their lofty claims.
- Use the Public Issue Tracker to raise feature requests and get your friends to star them. The BigQuery engineers and PMs hang out a lot there. Don’t comment with “+1”. Instead, actually star it to give it a proper vote.
- Don’t use legacy SQL for new queries. No excuses — just don’t! If you’re running legacy SQL from the old days, get a plan together to migrate off it ASAP. It’s no longer maintained and nothing new is backported.
- Put your SQL in source control. Don’t treat it as a 2nd class citizen. Integrate it into your CI/CD pipelines. That should be a no-brainer.
- Use SESSION_USER() as a workaround to BigQuery not having row level permissions yet. Until it’s available you’ll need to do this as recommended by Google.
- Currently, BigQuery supports customer-managed keys (CMEK), not customer supplied keys (CSEK). See here. Don’t confuse ‘client side encryption’ with CSEK. They are not the same thing.
- Avoid using the native scheduled queries feature in BigQuery if you can. I don’t like how they’ve implemented on top of the BigQuery Data Transfer Service (BQ-DTS). It raises too many concerns and questions by security teams — and rightly so. It also confuses people. Finally, they are tied to user accounts and very hard to untangle should the user/employee offboard the company, and not scalable. Instead, use something like Cloud Scheduler + Cloud Build or Apache Airflow.
- Instead of using Dataflow for ETL, look at BigQuery as a perfectly good ETL tool in its own right. It’s also more performant. But, there are trade-offs e.g. it’s easier to test code (Dataflow) than SQL etc.😉
- BI Engine is currently very immature, but keep an eye on it as it grows up. Get ready to see Looker support, APIs etc. as time goes on, I guess.
- Use batch queries when your queries are not time sensitive. They don’t count toward your 100 concurrent query limit.
- Use approximate functions (within 1% of exact number) when you don’t need exact results e.g.
approx_count_distinct. If you're dealing with numbers in the tens of millions upwards, do you really need the answer to be exact? Who really cares if you're stock trading system is out by a few cents. Oh wait..
- BigQuery has two pricing models: on-demand and flat-rate. Learn the difference between the two of them. If your compute/analysis monthly bill is pushing north of $10K USD, then it’s time to look at flat-rate.
- The limits and quotas page are important to stay abreast of. However, some of them are soft limits. If you’re a big enough customer then they can be raised on per-case basis. Talk to your local Google rep.
- Someone that I work with, and who’s a lot smarter than me, wrote a nifty little open source tool for analysing all your BigQuery views. It’s very handy indeed.
bqcommand line tool is incredibly powerful. You can solve a lot of problems with it, quickly and easily. You hook in all your favourite Bash commands/tools using pipe. See here for an example. #shamelessplug
- Google recently open sourced ZetaSQL, which is the SQL parser and analyzer that drives BigQuery (and others e.g. Spanner). ZetaSQL can be found here. This is very useful if you want to build, for example, a BigQuery test harness/framework 😉
- You can use
FOR SYSTEM_TIME ASrestore previously deleted tables/rows. It supports up to 7 days in the past. Awww, snap! See here.
- BigQuery now has scripting, which allows you to do all sorts of funky stuff like send multiple statements to BigQuery in one request, use variables, tap into control flow statements such as
WHILE, and loops. Neat.
- Just a few weeks ago, the BigQuery team announced that on-demand queries can burst through the default of 2000 concurrent slots “when it can”. I’d love to see more details released around how exactly BigQuery dertermines when it can burst. Very cool, nonetheless.
Originally published at https://www.techojournal.com.