Unlock the Power of Cloud Run: Data Preprocessing for Batch Jobs

In today’s fast-paced world of cloud computing, businesses are constantly looking for ways to streamline processes, reduce operational overhead, and scale efficiently. This was the challenge I faced when tasked with automating a job to preprocess data for our recommendation engine. The dataset we were working with was complex, with movie IDs interspersed between multiple customer reviews, making it difficult to structure the data correctly for downstream processing. We needed a service that could handle batch jobs, scale automatically, and integrate seamlessly with other cloud components like Google Cloud Storage and BigQuery.

After considering several options, I decided to implement the job using Spring Batch and run it on Google Cloud Run. Spring Batch was a natural choice due to its ability to manage large-scale, high-volume batch processing efficiently. However, the challenge wasn’t just the batch processing itself — it was how to run this job in a serverless, scalable environment. After evaluating the options, Google Cloud Run emerged as the perfect solution. It provided the scalability, cost-efficiency, and ease of integration with other Google Cloud services, which made it ideal for handling our preprocessing jobs.

In this post, I’ll walk you through why Cloud Run was the perfect choice, the benefits it brought to the table, and how we built an automated CI/CD pipeline to deploy and execute the batch jobs seamlessly on Cloud Run.

Why Choose Google Cloud Run?

When building cloud-native applications, one of the most important decisions is choosing the right platform. In my case, I had to choose between Google Kubernetes Engine (GKE), Cloud Functions, and Cloud Run. While GKE is a powerful tool for managing containerized applications, I didn’t need full-blown Kubernetes for this use case. The task was to run a batch job that preprocesses a dataset, and I wanted to avoid the overhead of managing a Kubernetes cluster. Cloud Functions, on the other hand, wasn’t a fit either because of the complexity of the job and its limits on execution time.

Cloud Run, however, emerged as the ideal solution. It offers the following key benefits:

  • Fully Managed: Cloud Run abstracts away the underlying infrastructure, allowing you to focus solely on your code without worrying about server management.
  • Automatic Scaling: Cloud Run automatically scales your containers up or down depending on demand, ensuring you’re only paying for what you use.
  • Ease of Integration: Cloud Run integrates seamlessly with other Google Cloud services like Cloud Storage, BigQuery, and Pub/Sub, which was important for the data-heavy tasks we were dealing with.
  • Support for Containerized Applications: Cloud Run supports Docker containers, which meant we could package our Spring Batch job into a container and run it seamlessly in a serverless environment.

The Preprocessing Step: Why Spring Batch?

When working with datasets for a recommendation engine, it’s important to ensure the data is properly structured before it can be used for training the model. I found an old Netflix dataset on Kaggle, which was a great starting point for building the recommendation engine. The dataset, however, was not in the ideal format for processing. Here’s what the structure looked like:

MOVIE ID
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE
MOVIE ID
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE
MOVIE ID
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE

As you can see, each movie’s reviews were grouped together, with the movie ID appearing only as a header line above its block of reviews. The challenge was that the review rows themselves did not carry the movie ID, and since Apache Beam (used in Dataflow) processes records in parallel across multiple worker nodes, it was difficult to guarantee that the correct movie ID was assigned to each customer’s review.

This led to the need for a preprocessing job that could structure the dataset properly before it was ingested into Dataflow for further transformations. Since Spring Batch is excellent for managing batch jobs and processing large datasets, it became the obvious choice. Using Spring Batch, we could do the following (a simplified sketch follows the list):

  • Read the dataset from Google Cloud Storage.
  • Process the data in manageable chunks.
  • Ensure that each review was correctly associated with its corresponding movie ID.
  • Write the preprocessed data back to Google Cloud Storage.
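
To make the idea concrete, here is a minimal, self-contained sketch of the flattening logic, not the actual reader/processor classes from the project. It assumes the raw file marks each movie with a header line ending in a colon (e.g. “42:”), followed by customer,rating,date rows, and emits one movieId,customerId,rating,date row per review:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class MovieRatingFlattener {

    // Carries the most recent movie ID forward and prefixes it to every review row.
    public static List<String> flatten(List<String> rawLines) {
        List<String> output = new ArrayList<>();
        String currentMovieId = null;
        for (String line : rawLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.endsWith(":")) {
                // Movie header line, e.g. "42:" -> remember the ID for the rows that follow
                currentMovieId = trimmed.substring(0, trimmed.length() - 1);
            } else if (currentMovieId != null) {
                // Review line "customerId,rating,date" -> prefix it with the current movie ID
                output.add(currentMovieId + "," + trimmed);
            }
        }
        return output;
    }

    public static void main(String[] args) throws IOException {
        // args[0] = raw input file, args[1] = flattened output file (local paths, for this sketch only)
        List<String> raw = Files.readAllLines(Path.of(args[0]));
        Files.write(Path.of(args[1]), flatten(raw));
    }
}

In the actual job, the same stateful logic runs inside a Spring Batch step, reading from and writing back to Google Cloud Storage in chunks rather than local files.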

The Deployment Process: CI/CD Pipeline for Cloud Run

Once the decision to use Google Cloud Run for executing our preprocessing job was made, the next step was automating the process of deploying the application, running it as a job, and ensuring seamless updates with minimal manual intervention. This led us to build an automated CI/CD pipeline — a crucial part of ensuring that our data preprocessing job runs efficiently and consistently every time we push changes to the repository.

Let’s walk through the entire deployment pipeline, step by step, highlighting the key components that make it work:

Step 1: Dockerizing the Spring Batch Application

The first task in automating the deployment was containerizing the Spring Batch application. Docker is essential for packaging the application with all its dependencies, making it portable and executable in a serverless environment like Google Cloud Run.

Here’s a breakdown of how the Docker container was set up:

  • Base Image: We start with the Eclipse Temurin 21 base image, which is a production-ready Java runtime, perfectly suited for running Spring Boot applications. This version of Java is compatible with the Spring Boot version we are using for the batch job.
FROM eclipse-temurin:21
WORKDIR /app
RUN mkdir -p /data/out && chmod -R 777 /data/out
COPY build/libs/recommendations-0.0.1-SNAPSHOT.jar recommendations-preprocessing-job.jar
ENTRYPOINT ["java", "-jar", "recommendations-preprocessing-job.jar"]
  • WORKDIR: We define /app as the working directory where the application will reside within the container.
  • Data Directory: The RUN mkdir -p /data/out && chmod -R 777 /data/out command creates the output directory (/data/out) and sets appropriate permissions. This is where the job will store the processed data before it’s uploaded or passed to the next stage of the pipeline.
  • Copy the JAR: The JAR file (recommendations-0.0.1-SNAPSHOT.jar) that contains the Spring Batch application is copied into the container.
  • ENTRYPOINT: This command tells Docker to run the application as a Java process, using the java -jar command to execute the JAR file.

Once the Docker image is ready, we need a process to build, store, and deploy this image.
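
Before handing this off to CI, the image can be sanity-checked locally. The commands below are only a sketch; the environment variable names match the ones the deployed job expects, but the values and the input/bucket arguments are placeholders:

# Build the image from the Dockerfile above
docker build -t recommendations-preprocessing-job .

# Run the job once with placeholder configuration (illustrative values only;
# reading from GCS additionally requires Google credentials, e.g. mounted application default credentials)
docker run --rm \
  -e DB_URL="jdbc:postgresql://host.docker.internal:5432/recommendations" \
  -e DB_USERNAME="postgres" \
  -e DB_PASSWORD="postgres" \
  -e OUTPUT_PATH="/data/out" \
  recommendations-preprocessing-job \
  "gs://your-input-bucket/raw-ratings.txt" "your-input-bucket"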

Step 2: Automating the Build and Deployment with GitHub Actions

Building a fully automated deployment pipeline is a crucial step for ensuring consistent, reliable, and repeatable deployments. For this project, we leveraged GitHub Actions to automate both the build and deployment process for our Cloud Run job. The pipeline ensures that each commit gets validated, tested, and deployed efficiently without manual intervention.

Continuous Integration and Deployment

Our CI/CD workflow is triggered by specific events in the repository, such as pushes to the main, develop, releases/**, and hotfix/** branches, as well as pull requests. We also added the workflow_dispatch trigger for manual deployments when needed:

name: CI-CD
on:
  push:
    branches: [ "main", "develop", "releases/**", "hotfix/**" ]
  pull_request:
    types: [ opened, synchronize, reopened ]
  workflow_dispatch:

Workflow Permissions and Authentication

In this CI/CD process, it’s essential to authenticate with Google Cloud so the workflow can interact with resources such as Cloud Run and Artifact Registry. We set up the necessary permissions for GitHub to interact with these services. Specifically, we used the google-github-actions/auth action for GCP authentication, configured to impersonate a service account via Workload Identity Federation for secure access.

permissions:
  id-token: write
  contents: read
  issues: write
  pull-requests: write

jobs:
  cloudrun:
    name: CI-CD
    runs-on: ubuntu-latest
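
The permissions block above grants the workflow an OIDC token (id-token: write); the authentication step itself is not shown in the snippets in this post, but with google-github-actions/auth it typically looks like the sketch below. The secret names here are placeholders, not necessarily the ones used in the project:

- name: Google Auth
  uses: google-github-actions/auth@v2
  with:
    # Placeholder secret names for the Workload Identity provider and the service account to impersonate
    workload_identity_provider: ${{ secrets.GCP_WIF_PROVIDER }}
    service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}

This step exchanges GitHub’s OIDC token for short-lived Google Cloud credentials, which the later gcloud and docker commands in the job reuse.
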
Steps of the Workflow
1. Setting Up the Environment

The first few steps of the job set up the environment by checking out the code from GitHub, configuring the Java environment, and installing any dependencies using Gradle:

- name: Git Checkout
  uses: actions/checkout@v4

- name: Setup Java
  uses: actions/setup-java@v4
  with:
    distribution: temurin
    java-version: 21
    cache: gradle

2. Build the Application

Next, we build the application using Gradle. This step compiles the code, runs tests, and generates the necessary JAR files for packaging into a Docker container.

- name: Install Dependencies
  run: ./gradlew build --no-daemon

3. Code Quality Checks

We also implemented a code coverage check using JaCoCo. This ensures that our tests adequately cover the code base. If coverage falls below the 80% threshold, the pull request is marked as failed:

- name: Run Coverage
  run: |
    chmod +x gradlew
    ./gradlew jacocoTestReport
  if: github.event_name != 'workflow_dispatch'

- name: Fail PR if overall coverage is less than 80%
  if: ${{ github.event_name == 'pull_request' && steps.jacoco.outputs.coverage-overall < 80.0 }}
  uses: actions/github-script@v6
  with:
    script: |
      core.setFailed('Overall coverage is less than 80%!')

4. Build and Push Docker Image

After ensuring code quality, the next crucial step is to build the Docker image. Using the provided Dockerfile, we package the application into a container image, which is then pushed to Google Artifact Registry for easy access from Cloud Run:

- name: Docker Auth
  run: |-
    gcloud auth configure-docker ${{ secrets.PROJECT_REGION }}-docker.pkg.dev --quiet

- name: Build Image
  run: |
    docker build . --file Dockerfile --tag ${{ secrets.PROJECT_REGION }}-docker.pkg.dev/${{ secrets.PROJECT_ID }}/${{ secrets.REPOSITORY_NAME }}/${{ secrets.GCP_IMAGE_NAME }}

- name: Push Image
  run: docker push ${{ secrets.PROJECT_REGION }}-docker.pkg.dev/${{ secrets.PROJECT_ID }}/${{ secrets.REPOSITORY_NAME }}/${{ secrets.GCP_IMAGE_NAME }}

5. Deploy to Cloud Run

Once the image is built and pushed, the deployment step begins. This step checks if the Cloud Run job already exists. If it does, the job is updated with the new image. If not, a new Cloud Run job is created:

- name: Deploy Cloud Run Job
  run: |
    JOB_EXISTS=$(gcloud beta run jobs describe ${{ secrets.JOB_NAME }} --project ${{ secrets.PROJECT_ID }} --region ${{ secrets.PROJECT_REGION }} --format="value(name)" || echo "not found")    
    if [ "$JOB_EXISTS" != "not found" ]; then
      echo "Job already exists, updating..."
      gcloud beta run jobs update ${{ secrets.JOB_NAME }} \
      --image ${{ secrets.PROJECT_REGION }}-docker.pkg.dev/${{ secrets.PROJECT_ID }}/${{ secrets.REPOSITORY_NAME }}/${{ secrets.GCP_IMAGE_NAME }} \
      --project ${{ secrets.PROJECT_ID }} \
      --region ${{ secrets.PROJECT_REGION }} \
      --set-env-vars "DB_URL=${{ secrets.POSTGRES_DB_URL }},DB_USERNAME=${{ secrets.POSTGRES_DB_USERNAME }},DB_PASSWORD=${{ secrets.POSTGRES_DB_PASSWORD }},OUTPUT_PATH=${{ secrets.OUTPUT_PATH }}" \
      --args="${{ secrets.GCP_STORAGE_BUCKET_INPUT_PATH }}","${{ secrets.GCP_STORAGE_BUCKET_NAME }}"
    else
      echo "Job does not exist, creating a new job..."
      gcloud beta run jobs create ${{ secrets.JOB_NAME }} \
      --image ${{ secrets.PROJECT_REGION }}-docker.pkg.dev/${{ secrets.PROJECT_ID }}/${{ secrets.REPOSITORY_NAME }}/${{ secrets.GCP_IMAGE_NAME }} \
      --project ${{ secrets.PROJECT_ID }} \
      --region ${{ secrets.PROJECT_REGION }} \
      --set-env-vars "DB_URL=${{ secrets.POSTGRES_DB_URL }},DB_USERNAME=${{ secrets.POSTGRES_DB_USERNAME }},DB_PASSWORD=${{ secrets.POSTGRES_DB_PASSWORD }},OUTPUT_PATH=${{ secrets.OUTPUT_PATH }}" \
      --args="${{ secrets.GCP_STORAGE_BUCKET_INPUT_PATH }}","${{ secrets.GCP_STORAGE_BUCKET_NAME }}"
    fi

6. Notifications

To ensure transparency and provide real-time feedback, we integrated Slack notifications. These inform the team whether the deployment succeeded or failed:

- name: Slack notification
  uses: 8398a7/action-slack@v3
  with:
    status: custom
    fields: workflow,job,commit,repo,ref,author,took
    custom_payload: |
      {
        attachments: [{
          color: '${{ job.status }}' === 'success' ? 'good' : '${{ job.status }}' === 'failure' ? 'danger' : 'warning',
          text: `Action Name: ${process.env.AS_WORKFLOW}\n Repository Name:${process.env.AS_REPO}@${process.env.AS_REF} by ${process.env.AS_AUTHOR} ${{ job.status }} in ${process.env.AS_TOOK}`,
        }]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
  if: always()

This CI/CD pipeline automates the process of building, testing, deploying, and running Cloud Run jobs in Google Cloud. By using GitHub Actions, we streamlined the deployment flow, ensuring consistency across different environments. Every commit is thoroughly tested, packaged, and deployed to Cloud Run in a seamless, automated fashion.

Running the Pre-Processor Job with GitHub Actions

Once the environment is set up and authenticated, the next crucial step in the CI/CD pipeline is executing the pre-processor job on Google Cloud Run. This step leverages the flexibility of Cloud Run jobs to handle batch processing efficiently.

Triggering the Cloud Run Job

In our workflow, the pre-processor job is triggered using the gcloud CLI to execute a job that was previously deployed to Cloud Run. The execution is done using the following command:

- name: Run Job
  run: |
    gcloud beta run jobs execute ${{ secrets.JOB_NAME }} --region ${{ secrets.PROJECT_REGION }}

This command tells Google Cloud to execute the specified Cloud Run job in the given region. Let’s break this down:

  • ${{ secrets.JOB_NAME }}: This refers to the name of the Cloud Run job, which is stored securely in GitHub secrets. It ensures that the job being executed is always the correct one, as it’s dynamically fetched.
  • ${{ secrets.PROJECT_REGION }}: This is the region where the job is hosted on Google Cloud Run, allowing you to target the correct Cloud Run environment.

GitHub Repository

For those interested in exploring the code behind this blog post and replicating the setup, I’ve made the entire project available on GitHub. You can find the repository, including the Dockerfile, GitHub Actions workflows, and the Spring Batch code, in the link below:

GitHub Repository: Recommendations Preprocessing Job

Feel free to fork the repository, explore the setup, and adapt it to your own projects. If you have any questions or suggestions, don’t hesitate to open an issue or contribute to the project!

Conclusion

In this blog post, we’ve walked through the end-to-end process of setting up an automated CI/CD pipeline for Google Cloud Run, from building the Docker image to executing batch jobs in a scalable and efficient manner. By leveraging Cloud Run jobs, we were able to create a serverless solution that automatically scales based on demand, reduces operational overhead, and ensures that our data processing tasks are handled seamlessly.

The integration with tools like GitHub Actions for continuous deployment and Slack notifications for workflow updates adds visibility and automation, making the pipeline easier to monitor and manage.

In a world where speed, scalability, and cost-efficiency are paramount, this architecture ensures that the preprocessing job for our recommendation engine runs smoothly and can easily adapt to future demands. With the flexibility and power of Google Cloud services, we’ve built a robust system that helps us focus on innovation, without worrying about infrastructure.

Whether you’re building machine learning pipelines, processing large datasets, or simply looking to automate batch jobs, leveraging Cloud Run with a solid CI/CD setup is a powerful and scalable approach that can help your business stay agile and cost-effective.
