This post is a step-by-step guide to preparing for the Google Professional Data Engineer certification.

A Professional Data Engineer enables data-driven decision making by collecting, transforming, and publishing data. A Data Engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A Data Engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.

GCP Exam Certification Official Guide:

GCP official documentation:

Exam Overview

Google Cloud Workflows

Orchestrate and automate Google Cloud and HTTP-based API services with serverless workflows.

You can use Workflows to create serverless workflows that link a series of serverless tasks together in an order you define. Combine the power of Google Cloud’s APIs, serverless products like Cloud Functions and Cloud Run, and calls to external APIs to create flexible serverless applications. Workflows requires no infrastructure management and scales seamlessly with demand, including scaling down to zero.
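As a sketch of what "linking serverless tasks" looks like, here is a minimal workflow definition that calls an HTTP endpoint and returns its response. The URL is a hypothetical placeholder, not a real endpoint:

```yaml
# Minimal Workflows definition: one HTTP call, then return the result.
main:
  steps:
    - callService:
        call: http.get
        args:
          url: https://example.com/api/status   # hypothetical endpoint
        result: serviceResponse
    - returnResult:
        return: ${serviceResponse.body}
```

Each step runs in the order listed, and a step's `result` variable can be referenced by later steps with `${...}` expressions.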

The product is in Beta; pre-GA products may have limited support, and changes to pre-GA products may not be compatible with other pre-GA versions. …

Dataflow Streaming Data Generator

This pipeline takes a QPS parameter and a path to a schema file, then publishes fake JSON messages matching the schema (sample messages used for load testing and system integration testing) to a Pub/Sub topic at the specified QPS rate.

The JSON Data Generator library used by the pipeline allows various faker functions to be used for each schema field. See the docs for more information on the faker functions and schema format.
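For illustration, a schema file for this template might look something like the following. The field names and faker functions here are assumptions based on common json-data-generator conventions, so check the template docs for the exact supported functions:

```json
{
  "id": "{{uuid()}}",
  "name": "{{firstName()}}",
  "age": {{integer(18, 95)}},
  "eventTime": "{{timestamp()}}"
}
```

At the configured QPS, the pipeline fills in each `{{...}}` placeholder with generated data and publishes the resulting JSON message to the target Pub/Sub topic.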

Cloud SQL

Fully managed relational database service for MySQL, PostgreSQL, and SQL Server.



Cloud SQL offers sizes to fit any budget. Pricing varies with settings, including how much storage, memory, and CPU you provision. Cloud SQL offers per-second billing and database instances are easy to stop and start.
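To make per-second billing concrete, here is a small Python sketch of the cost arithmetic. The hourly rates below are made-up placeholders for illustration, not actual Cloud SQL prices:

```python
# Estimate Cloud SQL compute cost for an instance billed per second.
# NOTE: these rates are hypothetical placeholders, not real GCP pricing.
VCPU_RATE_PER_HOUR = 0.0413      # USD per vCPU-hour (assumed)
MEM_RATE_PER_GB_HOUR = 0.007     # USD per GB-hour of memory (assumed)

def estimate_cost(vcpus: int, memory_gb: float, seconds_running: int) -> float:
    """Return the compute cost for the given runtime, billed per second."""
    hours = seconds_running / 3600
    hourly_rate = vcpus * VCPU_RATE_PER_HOUR + memory_gb * MEM_RATE_PER_GB_HOUR
    return round(hourly_rate * hours, 4)

# Because instances are easy to stop and start, running only 8 hours a day
# instead of 24 cuts the compute portion of the bill proportionally.
full_day = estimate_cost(2, 8, 24 * 3600)
eight_hours = estimate_cost(2, 8, 8 * 3600)
print(full_day, eight_hours)
```

Storage is billed separately whether the instance is running or not, so stopping an instance reduces compute charges but not storage charges.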

Source Google Cloud Document

Committed use discounts are now also…

If you’re new to Cloud Dataflow, I suggest starting here and reading the official docs first.

If you’re new to BigQuery, I suggest starting here and reading the official docs first.

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
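A tiny example of what working with "labeled" data looks like in practice (the service names and costs are invented sample data):

```python
import pandas as pd

# Build a labeled table and aggregate it: the bread-and-butter pandas workflow.
df = pd.DataFrame({
    "service": ["BigQuery", "Dataflow", "BigQuery", "Cloud SQL"],
    "cost_usd": [120.0, 80.0, 30.0, 45.0],
})

# Group by the label, not a positional index, and sum each group.
totals = df.groupby("service")["cost_usd"].sum()
print(totals["BigQuery"])  # 150.0
```

Rows and columns are addressed by name throughout, which is what makes pandas feel "relational" compared to raw NumPy arrays.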

Installation or Setup: Installing pandas with Anaconda

Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for inexperienced users.

The simplest way to install not only pandas but Python and the most popular packages that make up the SciPy stack (IPython, NumPy, Matplotlib, …) is with Anaconda, a cross-platform (Linux, Mac…

Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines. With a graphical interface and a broad open-source library of preconfigured connectors and transformations, Cloud Data Fusion shifts an organization’s focus away from code and integration to insights and action.
Cloud Data Fusion is based on CDAP, a 100% open-source framework for building data pipelines.


Pricing for the service is broken down into:

How to Create a Private Instance

Before creating…

Harshad Patel

7x GCP | 2x Oracle Cloud | 1x Azure Certified | Cloud Data Engineer
