Data Pipeline Orchestration

Harshad Patel
2 min read · Jan 20, 2021

Google Cloud Workflows

Orchestrate and automate Google Cloud and HTTP-based API services with serverless workflows.

You can use Workflows to create serverless workflows that link a series of serverless tasks together in an order you define. Combine the power of Google Cloud’s APIs, serverless products like Cloud Functions and Cloud Run, and calls to external APIs to create flexible serverless applications. Workflows requires no infrastructure management and scales seamlessly with demand, including scaling down to zero.

The product is in Beta. Pre-GA products may have limited support, and changes to pre-GA products may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions.

Using Workflows, you can create serverless flows on GCP that link a series of serverless tasks together in the order you define. Here, the flow of steps in the batch pipeline is orchestrated using Workflows: a Cloud Scheduler job triggers the Workflows execution, which in turn starts the batch pipeline. A workflow is made up of a series of steps described using a YAML-based syntax.
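As a minimal sketch of that syntax, a workflow is a list of named steps, each of which makes a call and can store its result. The endpoint below is a placeholder, not part of this pipeline:

```yaml
main:
  steps:
    # Each step has a name, a call, and optional args/result.
    - getCurrentTime:
        call: http.get
        args:
          url: https://example.com/datetime   # placeholder endpoint
        result: currentTime
    # Return the HTTP response body as the workflow output.
    - returnOutput:
        return: ${currentTime.body}
```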

Orchestration Service Comparison on GCP

Workflows definition for Batch Pipeline

Below is a sample workflow used to trigger the Dataflow job.

Step 1: triggerDataFlowJob:

A Dataflow job is triggered using an HTTP POST request with the relevant input parameters.
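A sketch of this step, assuming the Dataflow templates.launch REST endpoint; the project, region, bucket, and parameter values are placeholders:

```yaml
- triggerDataFlowJob:
    call: http.post
    args:
      # Launches a job from a Dataflow template; project, region,
      # and gcsPath values are placeholders.
      url: https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/templates:launch
      query:
        gcsPath: gs://my-bucket/templates/my-template
      body:
        jobName: batch-pipeline-run
        parameters:
          inputFile: gs://my-bucket/input/data.csv
      auth:
        type: OAuth2
    result: jobResponse   # response contains the created job's ID
```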

Step 2: delayToInitiateDF:

When the Dataflow job is triggered in the previous step, it takes some time for the Dataflow job ID to become available. This step adds a delay before the status is polled.
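This can be a simple sys.sleep call from the Workflows standard library; the 60-second value is an arbitrary choice:

```yaml
- delayToInitiateDF:
    call: sys.sleep
    args:
      seconds: 60   # arbitrary wait for the job ID to become available
```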

Step 3: getDataFlowStatus:

This step fetches the Dataflow job status with an HTTP GET request and passes the result to the next step.
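A sketch, assuming the Dataflow jobs.get endpoint and the jobResponse variable stored by the trigger step; the field names follow the Dataflow REST API:

```yaml
- getDataFlowStatus:
    call: http.get
    args:
      # Reads the job resource; its currentState field holds the status.
      url: ${"https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/jobs/" + jobResponse.body.job.id}
      auth:
        type: OAuth2
    result: statusResponse
```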

Step 4: statusPolling:

This step jumps back to the previous step if the Dataflow job has not yet completed. Once the job status is DONE, the workflow moves on to the next step.
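A sketch of the polling step using a switch; JOB_STATE_DONE is the Dataflow REST API's terminal success state:

```yaml
- statusPolling:
    switch:
      # Job finished: continue past the loop (here, straight to end).
      - condition: ${statusResponse.body.currentState == "JOB_STATE_DONE"}
        next: end
    # Otherwise jump back and poll the status again.
    next: getDataFlowStatus
```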


7x GCP | 2X Oracle Cloud| 1X Azure Certified | Cloud Data Engineer