Cloud Data Fusion Private Instance Guide

Harshad Patel
May 30, 2020

Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines. With a graphical interface and a broad open-source library of preconfigured connectors and transformations, Cloud Data Fusion shifts an organization’s focus away from code and integration to insights and action.
Cloud Data Fusion is built on CDAP, a 100% open-source framework for building data pipelines.

Pricing

Pricing for the service is broken down into:

  • Cloud Data Fusion instance hours to operate the data integration interface
  • Cloud Dataproc VMs to execute the transformations prescribed by Cloud Data Fusion

How to Create a Private Instance

Before creating a Data Fusion private instance, we need to create a VPC network and a private sub-network. Cloud Data Fusion requires Private Google Access to establish a private connection with the Dataproc cluster. As part of this setup we need to allocate an IP range. To do so, follow the steps below:

  1. Go to the VPC Network page of the network in which you want to create the private Cloud Data Fusion instance.
  2. Click the Private Service Connection tab.
  3. If prompted, enable the Service Networking API.
  4. Allocate an IP range of size /22 by clicking the Allocate IP Range button.
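
If you prefer the command line to the console, steps 3 and 4 above can be sketched roughly as follows. This is a minimal sketch, not a definitive procedure: the project, network, and range names are placeholder assumptions, and the `run` helper is a hypothetical wrapper that only prints each command unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Placeholder values (assumptions); replace them with your own.
PROJECT="${PROJECT:-my-project}"
NETWORK="${NETWORK:-my-vpc}"
RANGE_NAME="${RANGE_NAME:-datafusion-ip-range}"  # hypothetical range name
PREFIX_LENGTH=22                                   # the /22 size from step 4

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 3: enable the Service Networking API.
run gcloud services enable servicenetworking.googleapis.com --project="$PROJECT"

# Step 4: allocate a /22 internal range on the VPC network.
run gcloud compute addresses create "$RANGE_NAME" \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length="$PREFIX_LENGTH" \
  --network="$NETWORK" \
  --project="$PROJECT"
```

Review the printed commands first, then rerun with DRY_RUN=0 once the values look right.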

Command to create an instance

Export the following variables for ease of use, and refer to them in the actual commands:
export PROJECT="<project-id>"
export LOCATION="<region>"  # Example: us-east1, asia-east1
export DATA_FUSION_API_NAME="datafusion.googleapis.com"

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" "https://$DATA_FUSION_API_NAME/v1beta1/projects/$PROJECT/locations/$LOCATION/instances?instanceId=<INSTANCE_NAME>" -X POST -d '{"description": "Private CDF instance created through REST.", "type": "BASIC", "privateInstance": true, "networkConfig": {"network": "VPC_NETWORK", "ipAllocation": "IP_RANGE"}}'

The ipAllocation field value provided to the call is the one allocated in step 4 above.

Once the private instance is created, it is listed in the Data Fusion UI. You can perform the same operations on it from the UI as you would on a public instance; for example, you can delete the private instance from the UI.
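
Besides the UI, the instance can also be inspected through the same REST API. The sketch below builds the resource URL for a single instance and wraps a GET call in a function. It assumes the same v1beta1 path used in the create call; the instance name and default values are placeholders, and `instance_url` and `describe_instance` are hypothetical helper names.

```shell
#!/bin/sh
# Placeholder values (assumptions); replace them with your own.
PROJECT="${PROJECT:-my-project}"
LOCATION="${LOCATION:-us-east1}"
DATA_FUSION_API_NAME="${DATA_FUSION_API_NAME:-datafusion.googleapis.com}"
INSTANCE_NAME="${INSTANCE_NAME:-my-private-instance}"  # hypothetical name

# Build the REST URL of one instance from the exported variables.
instance_url() {
  echo "https://$DATA_FUSION_API_NAME/v1beta1/projects/$PROJECT/locations/$LOCATION/instances/$1"
}

# Fetch the instance resource as JSON (needs curl and an authenticated gcloud).
describe_instance() {
  curl -s \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "$(instance_url "$1")"
}

echo "Instance URL: $(instance_url "$INSTANCE_NAME")"
# Uncomment to call the API:
# describe_instance "$INSTANCE_NAME"
```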

Peering With Cloud Data Fusion Network

Cloud Data Fusion uses VPC Peering to provide private instances. VPC Peering has to be set up independently on both ends (networks). The peering from the Cloud Data Fusion tenant project network to your network is set up automatically. You must set up the peering from your network to the Cloud Data Fusion network yourself to be able to connect to the private instance.

Finding Tenant Project Id

You can retrieve the tenant project ID from the instance details. It is part of the service account. For example:

Service Account: cloud-datafusion-management-sa@<project-id>-tp.iam.gserviceaccount.com
Tenant Project Id: <project-id>
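
To avoid copying the ID by hand, the extraction can be sketched with plain shell parameter expansion. The sample service account below uses a made-up ID (`abc123`), and the function names are hypothetical. The sketch prints the value both with and without the `-tp` suffix, so you can compare it against what the console shows before pasting it anywhere.

```shell
#!/bin/sh
# Sample value following the pattern above; "abc123" is a made-up ID.
SERVICE_ACCOUNT="cloud-datafusion-management-sa@abc123-tp.iam.gserviceaccount.com"

# Print the part of the email between "@" and the first ".".
sa_project_part() {
  host="${1#*@}"      # drop everything up to and including "@"
  echo "${host%%.*}"  # drop everything from the first "." onward
}

# Remove a trailing "-tp", matching the example shown above.
strip_tp_suffix() {
  echo "${1%-tp}"
}

PROJECT_PART="$(sa_project_part "$SERVICE_ACCOUNT")"
echo "Project part of service account: $PROJECT_PART"
echo "Without the -tp suffix:          $(strip_tp_suffix "$PROJECT_PART")"
```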

Creating VPC Peering

Steps to create a VPC Peering with the tenant project are as follows:

  1. Go to your VPC network.
  2. Select VPC Network Peering.
  3. Click Create Connection.
  4. Give your peering a name, for example datafusion-peering.
  5. Make sure that Your VPC network lists the network you selected while creating the Cloud Data Fusion instance.
  6. Under Peered VPC network, select In another project, then provide the tenant project ID in the Project ID field.
  7. In VPC network name, provide <instance-region>-<instanceid>. The network in the tenant project is named in this format, which is why you provide this name.
  8. Click Exchange custom routes and select both Import custom routes (so you can access the CDF UI) and Export custom routes (so CDF can access on-prem systems connected to your VPC network).
  9. Click Create and wait for the operation to complete.
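
The same peering can likely be created from the command line. The following is a rough sketch of the steps above, assuming placeholder values for the project, network, instance ID, and tenant project. `peer_network_name` and `run` are hypothetical helpers, and `run` only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholder values (assumptions); replace them with your own.
PROJECT="${PROJECT:-my-project}"
LOCATION="${LOCATION:-us-east1}"
NETWORK="${NETWORK:-my-vpc}"
INSTANCE_ID="${INSTANCE_ID:-my-private-instance}"      # hypothetical instance ID
TENANT_PROJECT="${TENANT_PROJECT:-tenant-project-id}"  # from the instance details
PEERING_NAME="${PEERING_NAME:-datafusion-peering}"

# Tenant network name in the <instance-region>-<instanceid> format (step 7).
peer_network_name() {
  echo "$1-$2"
}

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the peering from your network and exchange custom routes both ways.
run gcloud compute networks peerings create "$PEERING_NAME" \
  --project="$PROJECT" \
  --network="$NETWORK" \
  --peer-project="$TENANT_PROJECT" \
  --peer-network="$(peer_network_name "$LOCATION" "$INSTANCE_ID")" \
  --import-custom-routes \
  --export-custom-routes
```

Afterwards, `gcloud compute networks peerings list --network="$NETWORK"` should show the new peering and its state.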

Originally published at https://www.techojournal.com on May 30, 2020.
