Cloud Dataflow

Harshad Patel · Jun 19, 2020

If you’re new to Cloud Dataflow, I suggest starting with the official docs before diving into these tips.

  1. Develop locally using the DirectRunner rather than on Google Cloud with the DataflowRunner. The DirectRunner runs your pipeline on your own machine, so you don't pay for worker pools on GCP while you iterate (see the runner sketch after this list).
  2. When you do want to shake out a pipeline on Google Cloud using the DataflowRunner, start with a subset of your data and just one small instance. There's no need to spin up massive worker pools. That's just a waste of money, silly (the sizing flags are sketched after this list).
  3. Assess the newer Dataflow Streaming Engine and Dataflow Shuffle services to see whether they reduce costs or improve performance for your pipelines. Check region availability first though, as not all regions are supported (the opt-in flags are shown after this list).
  4. Dataflow supports three Apache Beam SDKs. In order of maturity and feature parity: Java > Python > Go. Personally, I recommend using the Java SDK whenever possible. Java also gives you strict type safety, so there's that too y'all. 🤷
  5. Beam SQL looks promising, but don't use it in production just yet. It's not ready, and it's still missing some SQL features. As a side note, Cloud Dataflow SQL (which is in alpha at the time of writing) is based on Beam SQL. And if you want to go even deeper, Beam SQL is based on Apache Calcite. It's turtles all the way down, folks (a small SqlTransform example follows this list).
  6. This one still catches a lot of people out. Dataflow workers are available in Sydney. Don't confuse that with the regional endpoint, which is a different thing and is not available in Sydney. The regional endpoint is where your pipeline is orchestrated and controlled from, not where the actual worker VMs spin up to process your data (see the region/zone flags after this list). Got it? Great, let's move on.
  7. Keep your security team happy by turning off public IPs if you don't need them. Simply set the --usePublicIps=false flag/parameter (an example follows this list). Easy-peasy-lemon-squeezy.
  8. Assess FlexRS for batch jobs. It uses a mix of regular and preemptible VMs, and might work out cheaper for you. Again, check region availability first (the flag is shown after this list).
  9. If left unspecified, Dataflow picks a default instance type for your pipeline; for a streaming pipeline, for example, that's an n1-standard-4 worker. Most use cases don't need nodes that big, and downsizing will save you quite a bit of coin. Experiment with the instance size during shake-out and testing (see the sizing flags after this list).
  10. Cap the max number of instances… (--maxNumWorkers, included in the sizing sketch below).
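
To make tip 1 concrete, here's a minimal sketch of a Beam Java pipeline where the runner is chosen purely by the flags passed in. The class name, project id and bucket paths are placeholders, and the transforms are just a throwaway line count:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class LineCountShakeout {
  public static void main(String[] args) {
    // Local shake-out:   --runner=DirectRunner
    // Real Dataflow job: --runner=DataflowRunner --project=my-project
    //                    --region=us-central1 --tempLocation=gs://my-bucket/tmp
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadSubset", TextIO.read().from("gs://my-bucket/sample-subset/*.txt"))
     .apply("CountLines", Count.globally())
     .apply("Format", MapElements.into(TypeDescriptors.strings())
                                 .via((Long n) -> "lines: " + n))
     .apply("Write", TextIO.write().to("gs://my-bucket/output/line-count"));

    p.run().waitUntilFinish();
  }
}
```

Exactly the same code runs in both places; only the flags change, and the DirectRunner run costs you nothing in GCP worker time.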
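
For tips 2, 9 and 10, these are the Java SDK pipeline options that control worker sizing. The values here are illustrative, not recommendations, so tune them for your own workload:

```
--numWorkers=1                      # start the shake-out with a single worker
--workerMachineType=n1-standard-2   # override the default (n1-standard-4 for streaming)
--maxNumWorkers=5                   # cap autoscaling so a runaway job stays cheap
```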
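
For tip 3, the opt-in flags on the Java SDK at the time of writing are below; double-check the current docs and region support before relying on them:

```
--enableStreamingEngine              # Dataflow Streaming Engine (streaming jobs)
--experiments=shuffle_mode=service   # Dataflow Shuffle (batch jobs)
```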
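
If you do want to poke at Beam SQL (tip 5) outside of production, here's a rough sketch using SqlTransform from the beam-sdks-java-extensions-sql module. It assumes a Pipeline p as in the earlier snippet, and the schema and values are made up purely for illustration:

```java
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

Schema orderSchema = Schema.builder()
    .addStringField("country")
    .addDoubleField("amount")
    .build();

// A tiny in-memory PCollection<Row> just to have something to query.
PCollection<Row> orders = p.apply(
    Create.of(
        Row.withSchema(orderSchema).addValues("AU", 10.0).build(),
        Row.withSchema(orderSchema).addValues("AU", 5.5).build(),
        Row.withSchema(orderSchema).addValues("NZ", 7.0).build())
    .withRowSchema(orderSchema));

// Inside the query, the input PCollection is addressed as PCOLLECTION.
PCollection<Row> totals = orders.apply(
    SqlTransform.query(
        "SELECT country, SUM(amount) AS total FROM PCOLLECTION GROUP BY country"));
```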
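
For tip 6, the distinction shows up directly in the launch flags. Roughly, you point --region at a supported regional endpoint and pin the workers to Sydney separately. The values are illustrative, and older Java SDK releases used --zone instead of --workerZone:

```
--region=asia-east1                   # regional endpoint: where the job is controlled from
--workerZone=australia-southeast1-a   # where the worker VMs actually run (Sydney)
```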
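
For tip 7, the exact flag on the Java SDK is --usePublicIps=false (the Python equivalent is --no_use_public_ips). Keep in mind that without public IPs the workers generally need a subnetwork with Private Google Access to reach Google APIs; the subnetwork below is a placeholder:

```
--usePublicIps=false
--subnetwork=regions/us-central1/subnetworks/my-private-subnet
```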
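
And for tip 8, FlexRS is switched on per batch job with a single flag; COST_OPTIMIZED is the goal that lets Dataflow use a mix of preemptible and regular VMs (SPEED_OPTIMIZED is the other option):

```
--flexRSGoal=COST_OPTIMIZED
```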