CI/CD Pipeline with Cloud Build and Composer (with Terraform)

Marcelo Marques · Apr 25, 2021 · 5 min read

Hey

Sometimes I use a Google tutorial to do some training. But I like to automate things (yes, I know you know!). So, let's talk about CI/CD for data processing in GCP. I'm going to use this tutorial:

In summary:

This tutorial describes how to set up a continuous integration/continuous deployment (CI/CD) pipeline for processing data by implementing CI/CD methods with managed products on Google Cloud. Data scientists and analysts can adapt the methodologies from CI/CD practices to help to ensure high quality, maintainability, and adaptability of the data processes and workflows.

Looks good, huh? We will use 5 things here:

  1. Terraform — The tutorial is hands-on, but I will "transpose" it to Terraform. 🤓
  2. Cloud Build — Similar to Jenkins; this is where we will create the pipelines, triggers, …
  3. Cloud Composer — It's a managed Apache Airflow in GCP. We will use it to define the steps of the workflow, like starting the data processing, testing, and checking results.
  4. Dataflow, to run an Apache Beam job as a sample.
  5. There's also Cloud Source Repositories, which is the "GitHub" from Google (but reeeeeeaaaly far away from GitHub).

All the code can be found here: CI/CD Repository

First thing: we need a user with "Owner" permission on some folder (I will not create this at the root level; there's a way to create it in a specific folder. And I know Owner is not the best way to grant permission, but this is for test purposes). You can get the list of folders in GCP with this command:

gcloud resource-manager folders list --organization=<Your Org ID>
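
If you still need to grant that Owner role on the folder, something like this should do it (just a sketch; fill in your own folder ID and user email):

# Grant Owner on the folder to the user that will run Terraform
gcloud resource-manager folders add-iam-policy-binding <Your Folder ID> \
  --member="user:<your-user-email>" \
  --role="roles/owner"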

Cool! Now update the terraform.tfvars file (I'm using Terraform version 0.13.6) in the bootstrap folder. The file is really simple!
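
Just to give an idea, filling it from the terminal looks something like this (the variable names here are my guess; check the variables.tf in the bootstrap folder for the real ones):

# Assumed variable names; adjust to whatever variables.tf actually declares
cat > terraform.tfvars <<EOF
org_id          = "<Your Org ID>"
folder_id       = "<Your Folder ID>"
billing_account = "<Your Billing Account ID>"
EOF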

From here, please note that this is a PAID test. Some resources will charge you, so remember to delete the project when you finish. :)

Run the Terraform steps:

terraform init
terraform plan    # Good to review, right?
terraform apply

You should see something like this:

Plan: 54 to add, 0 to change, 0 to destroy.

The apply process should take 30 minutes. Just go get some coffee.

The output should return this:

Take note of the Cloud Build project and csr_repo.id.
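
If you lose those values later, you can print them again from the bootstrap folder:

terraform output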

You should be ready to go! If you go to Cloud Build, you will see 2 triggers:

Composer is now created as well:

Let's test our plan trigger. So, just to understand: any commit to a branch that is not "master" will execute the plan trigger. Let's see. First, let's clone the CSR repository (go outside of the code that you cloned before):

gcloud source repos clone gcp-cicd --project=<CloudBuild Project ID>

Now change to a different branch (I will use plan) and copy everything inside source-code from our previous repo into this one (change the command according to your actual path).

git checkout -b plan
cp -rf ../gcp-cicd-terraform/source-code/* .
git add -A
git commit -m "First Commit"
git push --set-upstream origin plan

If you check your Cloud Build page, you will see the plan build started:

If you open it, you can see all steps and information:

In the Airflow UI, you can see the DAG information:

And in Dataflow, the Job Graph:

So, what happened? This:

  1. A developer commits code changes to the Cloud Source Repositories.
  2. Code changes trigger a test build in Cloud Build.
  3. Cloud Build builds the self-executing JAR file and deploys it to the test JAR bucket on Cloud Storage.
  4. Cloud Build deploys the test files to the test-file buckets on Cloud Storage.
  5. Cloud Build sets the variable in Cloud Composer to reference the newly deployed JAR file.
  6. Cloud Build tests the data-processing workflow Directed Acyclic Graph (DAG) and deploys it to the Cloud Composer bucket on Cloud Storage.
  7. The workflow DAG file is deployed to Cloud Composer.
  8. Cloud Build triggers the newly deployed data-processing workflow to run.
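
If it helps to picture those steps outside of Cloud Build, here is a rough equivalent as plain gcloud/gsutil commands (the bucket, JAR and DAG names are assumptions taken from the original Google tutorial; yours may differ):

# Step 3: copy the self-executing JAR to the test JAR bucket
gsutil cp target/word-count-beam-bundled-0.1.jar gs://<TEST_JAR_BUCKET>/dataflow_deployment_<BUILD_ID>.jar

# Step 5: set the Composer variable to reference the newly deployed JAR
gcloud composer environments run <COMPOSER_ENV_NAME> --location <COMPOSER_REGION> \
  variables -- --set dataflow_jar_file_test dataflow_deployment_<BUILD_ID>.jar

# Steps 6-7: deploy the workflow DAG to the Composer environment
gcloud composer environments storage dags import --environment <COMPOSER_ENV_NAME> \
  --location <COMPOSER_REGION> --source dags/test_word_count.py

# Step 8: trigger the data-processing workflow
gcloud composer environments run <COMPOSER_ENV_NAME> --location <COMPOSER_REGION> \
  trigger_dag -- test_word_count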

Cool, our process is now working in plan/test!! Now we can just apply to the prod pipeline!

For this article, I will do a manual deployment to production by running the Cloud Build production deployment build. The production deployment build follows these steps:

  1. Copy the WordCount JAR file from the test bucket to the production bucket.
  2. Set the Cloud Composer variables for the production workflow to point to the newly promoted JAR file.
  3. Deploy the production workflow DAG definition to the Cloud Composer environment and run the workflow.
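
If you ever want to do that promotion by hand instead of through the trigger, it boils down to something like this (bucket and variable names are assumptions again; deploying the prod DAG works the same way as the test one above):

# Step 1: promote the tested JAR from the test bucket to the production bucket
gsutil cp gs://<TEST_JAR_BUCKET>/<DATAFLOW_JAR_FILE> gs://<PROD_JAR_BUCKET>/<DATAFLOW_JAR_FILE>

# Step 2: point the production workflow at the promoted JAR
gcloud composer environments run <COMPOSER_ENV_NAME> --location <COMPOSER_REGION> \
  variables -- --set dataflow_jar_file_prod <DATAFLOW_JAR_FILE>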

There are some ways to automate these steps with Cloud Functions, or even during the plan pipeline, but the idea here is just to understand a simple way. So, first thing, we need to get the JAR filename to update our trigger. Let's use a gcloud command:

gcloud composer environments run <COMPOSER_ENV_NAME> \
--location <COMPOSER_REGION> variables -- \
--get dataflow_jar_file_test 2>&1 | grep -i '.jar'

Now that we have this, let's update the apply trigger with this value. Go to Cloud Build and edit the apply trigger (change "_DATAFLOW_JAR_FILE_LATEST" to the result from before):

Now let's run the trigger (just run):
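
You can do that from the console with the Run button, or from the terminal with something like this (the trigger name is whatever Terraform created for the apply trigger):

gcloud beta builds triggers run <APPLY_TRIGGER_NAME> --branch=master --project=<CloudBuild Project ID>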

Let's check:

Now we have the DAG deployed to Composer. You can see it if you go to the Airflow UI:

Let's just run the job. In the Airflow UI, just click on "Trigger Dag":

Now you can go to Dataflow and check the job:

And that's it! You now have a CI/CD pipeline that you can use for data processing, or any other kind of process.

To destroy the resources, it's simple: just go inside the bootstrap folder and run:

terraform destroy

I hope you like this! As always, feel free to reach out to me, provide feedback, anything!!

Stay safe, folks!
