Google Dataflow utility pipelines (File Conversion and Streaming Data Generation)

Dataflow Streaming Data Generator

This pipeline takes a QPS parameter and a path to a schema file, and publishes fake JSON messages matching that schema (sample messages used for load testing and system integration testing) to a Pub/Sub topic at the given QPS rate.

The JSON Data Generator library used by the pipeline allows various faker functions to be used for each schema field. See the library's docs for more information on the faker functions and the schema format.

Source: Google official documentation
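For illustration, a schema file for the generator might look like the following. The field names here are made up, and the exact set of faker functions available should be checked against the JSON Data Generator library's docs:

```json
{
  "id": "{{uuid()}}",
  "firstName": "{{firstName()}}",
  "age": {{integer(18, 90)}},
  "eventTimestamp": "{{timestamp()}}"
}
```

Each `{{...}}` expression is replaced by the library with a freshly generated value, so every message published to the topic is unique but still matches the declared shape.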

Running the Streaming Data Generator template

  1. Go to the Dataflow page in the Cloud Console.
  2. Click Create job from template.
  3. Select the Streaming Data Generator template from the Dataflow template drop-down menu, and enter a job name in the Job Name field.
  4. Enter your parameter values in the provided parameter fields.
  5. Click RUN.
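The same job can also be launched from the gcloud CLI instead of the Console. This is only a sketch: the project, region, bucket, and topic names below are placeholders, and the parameter names (`schemaLocation`, `qps`, `topic`) should be verified against the template's documentation:

```shell
# Launch the Streaming Data Generator Flex Template via gcloud.
# All resource names below are placeholders for your own project.
gcloud dataflow flex-template run "streaming-data-generator-job" \
    --project=my-project \
    --region=us-central1 \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/Streaming_Data_Generator \
    --parameters=schemaLocation=gs://my-bucket/schema.json,qps=100,topic=projects/my-project/topics/load-test-topic
```

Running from the CLI makes it easy to script load tests, e.g. launching several generator jobs with different QPS values.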

File Format Conversion

This template creates a batch pipeline that reads files from Google Cloud Storage (GCS), converts them to the desired format, and writes them back to a GCS bucket. The supported file conversions are:

  • CSV to Avro
  • CSV to Parquet
  • Avro to Parquet
  • Parquet to Avro

Pipeline Requirements

  • Input files in the GCS bucket are accessible to the Dataflow pipeline.
  • Output GCS bucket exists and is accessible to the Dataflow pipeline.
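Both requirements can be checked quickly with gsutil before launching the job (bucket and path names below are placeholders):

```shell
# Confirm the input files are visible with the credentials Dataflow will use
gsutil ls gs://my-input-bucket/data/*.csv

# Confirm the output bucket exists (-b lists the bucket itself, not its contents)
gsutil ls -b gs://my-output-bucket
```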

Running File Format Conversion Pipelines

Follow steps 1 and 2 from the previous section.

3. Select the Convert file formats between Avro, Parquet & CSV template from the Dataflow template drop-down menu. Enter a job name in the Job Name field.

4. Enter your parameter values in the provided parameter fields.

5. Click RUN.
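As with the generator, the conversion job can be launched from the CLI instead of the Console. A sketch, assuming the template's documented parameter names (`inputFileFormat`, `outputFileFormat`, `inputFileSpec`, `outputBucket`, `schema`); all paths are placeholders:

```shell
# Launch the File Format Conversion Flex Template (CSV -> Avro) via gcloud.
gcloud dataflow flex-template run "csv-to-avro-conversion" \
    --project=my-project \
    --region=us-central1 \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/File_Format_Conversion \
    --parameters=inputFileFormat=csv,outputFileFormat=avro,inputFileSpec=gs://my-input-bucket/data/*.csv,outputBucket=gs://my-output-bucket/converted/,schema=gs://my-input-bucket/schema.avsc
```

Note that CSV conversions need an Avro schema file (the `schema` parameter above) so the pipeline knows the column types, whereas Avro and Parquet files already carry their own schema.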
