This post is a step-by-step guide to preparing for the Google Cloud Professional Data Engineer certification.
A Professional Data Engineer enables data-driven decision making by collecting, transforming, and publishing data. A Data Engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A Data Engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.
GCP Exam Certification Official Guide: https://cloud.google.com/certification/data-engineer
GCP official documentation: https://cloud.google.com/docs
Orchestrate and automate Google Cloud and HTTP-based API services with serverless workflows.
You can use Workflows to create serverless workflows that link series of serverless tasks together in an order you define. Combine the power of Google Cloud’s APIs, serverless products like Cloud Functions and Cloud Run, and calls to external APIs to create flexible serverless applications. Workflows requires no infrastructure management and scales seamlessly with demand, including scaling down to zero.
This pipeline takes in a QPS parameter, a path to a schema file, and generates fake JSON messages (sample messages used for load testing and system integration testing) matching the schema to a Pub/Sub topic at the QPS rate.
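A minimal sketch of the generation step, assuming a simple flat schema of `{field: type}` pairs. The field names and the `publish` callback here are hypothetical placeholders; a real pipeline would publish each message to a Pub/Sub topic instead of printing it.

```python
import json
import random
import string
import time


def fake_message(schema):
    """Generate one fake JSON message matching a {field: type} schema."""
    out = {}
    for field, ftype in schema.items():
        if ftype == "int":
            out[field] = random.randint(0, 1000)
        elif ftype == "string":
            out[field] = "".join(random.choices(string.ascii_lowercase, k=8))
    return json.dumps(out)


def generate(schema, qps, seconds, publish=print):
    """Emit fake messages at roughly `qps` messages per second."""
    for _ in range(qps * seconds):
        publish(fake_message(schema))
        time.sleep(1.0 / qps)


# Example: two fake messages per second for one second.
generate({"user_id": "int", "event": "string"}, qps=2, seconds=1)
```

Injecting `publish` as a parameter keeps the generator testable locally while letting the real pipeline pass in a Pub/Sub publisher.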
Fully managed relational database service for MySQL, PostgreSQL, and SQL Server.
Cloud SQL offers sizes to fit any budget. Pricing varies with settings, including how much storage, memory, and CPU you provision. Cloud SQL offers per-second billing and database instances are easy to stop and start.
Committed use discounts are also available.
While developing and testing, run your pipeline locally with the DirectRunner and not on Google Cloud with the DataflowRunner. The Direct Runner allows you to run your pipeline locally, without the need to pay for worker pools on GCP. When you do switch to the DataflowRunner, use a subset of data and just one small instance to begin with. There's no need to spin up massive worker pools; that's just a waste of money.
If you’re new to BigQuery, I suggest starting here and reading the official docs first.
Avoid SELECT * on big tables unless you absolutely have to. BigQuery charges on data scanned: the fewer columns you reference, the cheaper the query.
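As a back-of-the-envelope illustration of why column pruning matters (the per-TiB rate below is an assumption for the example; check current BigQuery on-demand pricing for your region):

```python
# Assumed on-demand rate in USD per TiB scanned -- verify against
# current BigQuery pricing before relying on this number.
PRICE_PER_TIB = 6.25


def query_cost_usd(bytes_scanned, price_per_tib=PRICE_PER_TIB):
    """Estimate on-demand query cost from bytes scanned."""
    return bytes_scanned / 2**40 * price_per_tib


# BigQuery is columnar, so scanning 10 of 100 equally sized columns
# scans roughly 10% of the bytes.
full_table = query_cost_usd(500 * 2**30)  # SELECT * over a 500 GiB table
ten_cols = query_cost_usd(50 * 2**30)     # only the columns you need
print(round(full_table, 2), round(ten_cols, 2))
```

A dry run (`--dry_run` in the `bq` CLI) reports bytes that would be scanned before you pay for a query.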
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
Installation and setup: installing pandas with Anaconda
Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for inexperienced users.
The simplest way to install not only pandas but Python and the most popular packages that make up the SciPy stack (IPython, NumPy, Matplotlib, …) is with Anaconda, a cross-platform (Linux, macOS, Windows) Python distribution.
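Once installed, a quick smoke test of the labeled-data workflow pandas is built for (the column names here are arbitrary examples):

```python
import pandas as pd

# A small labeled dataset: rows are observations, columns are named fields.
df = pd.DataFrame(
    {
        "city": ["Paris", "Paris", "Lyon"],
        "sales": [100, 150, 80],
    }
)

# Group by a label and aggregate -- a typical "relational" operation.
totals = df.groupby("city")["sales"].sum()
print(totals)
```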
Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines. With a graphical interface and a broad open-source library of preconfigured connectors and transformations, Cloud Data Fusion shifts an organization’s focus away from code and integration to insights and action.
Cloud Data Fusion is based on CDAP, a 100% open-source framework for building data pipelines.
Pricing for the service is broken down into pipeline development (billed per instance hour, varying by edition) and pipeline execution (billed through the Dataproc clusters that run the pipelines).