Unlock the Power of Cloud Run: Data Preprocessing for Batch Jobs
In today’s fast-paced world of cloud computing, businesses are constantly looking for ways to streamline processes, reduce operational overhead, and scale efficiently. This was the challenge I faced when tasked with automating a job to preprocess data for our recommendation engine. The dataset we were working with was complex, with movie IDs interspersed between multiple customer reviews, making it difficult to structure the data correctly for downstream processing. We needed a service that could handle batch jobs, scale automatically, and integrate seamlessly with other cloud components like Google Cloud Storage and BigQuery.
After considering several options, I decided to implement the job using Spring Batch and run it on Google Cloud Run. Spring Batch was a natural choice due to its ability to manage large-scale, high-volume batch processing efficiently. However, the challenge wasn’t just the batch processing itself — it was how to run this job in a serverless, scalable environment. After evaluating the options, Google Cloud Run emerged as the perfect solution. It provided the scalability, cost-efficiency, and ease of integration with other Google Cloud services, which made it ideal for handling our preprocessing jobs.
In this post, I’ll walk you through why Cloud Run was the perfect choice, the benefits it brought to the table, and how we built an automated CI/CD pipeline to deploy and execute the batch jobs seamlessly on Cloud Run.
When building cloud-native applications, one of the most important decisions is choosing the right platform. In my case, I had to choose between Google Kubernetes Engine (GKE), Cloud Functions, and Cloud Run. While GKE is a powerful tool for managing containerized applications, I didn’t need full-blown Kubernetes for this use case. The task was to run a batch job that preprocesses a dataset, and I wanted to avoid the overhead of managing a Kubernetes cluster. Cloud Functions, on the other hand, didn’t seem to be a fit either because of the complexity of the job and the limitations around execution time.
Cloud Run, however, emerged as the ideal solution: it scales automatically with demand, follows a pay-per-use pricing model, requires no cluster management, and integrates cleanly with other Google Cloud services such as Cloud Storage and BigQuery.
When working with datasets for a recommendation engine, it’s important to ensure the data is properly structured before it can be used for training the model. I found an old Netflix dataset on Kaggle, which was a great starting point for building the recommendations engine. The dataset, however, was not in the ideal format for processing. Here’s what the structure looked like:
MOVIE ID
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE
MOVIE ID
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE
MOVIE ID
CUSTOMER ID, RATING, DATE
CUSTOMER ID, RATING, DATE
As you can see, each movie’s reviews were grouped together with the movie ID appearing multiple times throughout the dataset. The challenge was that each customer ID had multiple ratings attached, and since Apache Beam (used in Dataflow) automatically runs in parallel across multiple worker nodes, it was difficult to ensure that the correct movie ID was assigned to each customer’s review.
This led to the need for a preprocessing job that could structure the dataset properly before it was ingested into Dataflow for further transformations. Since Spring Batch excels at managing batch jobs and processing large datasets, it became the obvious choice. Using Spring Batch, we could read the raw file sequentially, carry the current movie ID forward onto every review row that follows it, and write out cleanly structured records for downstream processing, as the sketch below illustrates.
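To make this concrete, here is a minimal, hypothetical sketch of that logic (the class and record names are illustrative, not the actual project code): a Spring Batch ItemProcessor remembers the most recent movie ID it has read and attaches it to every review row that follows, returning null for movie-ID lines so they are filtered out of the output.

import org.springframework.batch.item.ItemProcessor;

// Hypothetical sketch: carry the current movie ID forward onto each review row.
public class MovieAwareLineProcessor implements ItemProcessor<String, Review> {

    private Long currentMovieId; // state kept while the file is read sequentially in a single step

    @Override
    public Review process(String line) {
        if (!line.contains(",")) {
            // A line without commas holds only a movie ID and starts a new block of reviews.
            currentMovieId = Long.parseLong(line.trim());
            return null; // returning null filters the movie-ID line out of the output
        }
        // "CUSTOMER ID, RATING, DATE" rows are tagged with the movie ID seen above them.
        String[] parts = line.split(",");
        return new Review(currentMovieId,
                Long.parseLong(parts[0].trim()),
                Integer.parseInt(parts[1].trim()),
                parts[2].trim());
    }
}

// Simple holder for the structured output record (illustrative).
record Review(Long movieId, Long customerId, int rating, String date) {}

The key point is that the read must stay sequential for this state to be valid, which is exactly why this step runs in Spring Batch before the structured output is handed to the parallel Dataflow pipeline.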
Once the decision to use Google Cloud Run for executing our preprocessing job was made, the next step was automating the process of deploying the application, running it as a job, and ensuring seamless updates with minimal manual intervention. This led us to build an automated CI/CD pipeline — a crucial part of ensuring that our data preprocessing job runs efficiently and consistently every time we push changes to the repository.
Let’s walk through the entire deployment pipeline, step by step, highlighting the key components that make it work:
The first task in automating the deployment was containerizing the Spring Batch application. Docker is essential for packaging the application with all its dependencies, making it portable and executable in a serverless environment like Google Cloud Run.
Here’s a breakdown of how the Docker container was set up:
FROM eclipse-temurin:21
WORKDIR /app
RUN mkdir -p /data/out && chmod -R 777 /data/out
COPY build/libs/recommendations-0.0.1-SNAPSHOT.jar recommendations-preprocessing-job.jar
ENTRYPOINT ["java", "-jar", "recommendations-preprocessing-job.jar"]
The image is based on eclipse-temurin:21, which provides the Java 21 runtime the application needs. WORKDIR /app sets /app as the working directory where the application will reside within the container. The RUN mkdir -p /data/out && chmod -R 777 /data/out command creates the output directory (/data/out) and sets appropriate permissions; this is where the job stores the processed data before it is uploaded or passed to the next stage of the pipeline. The built JAR (recommendations-0.0.1-SNAPSHOT.jar), which contains the Spring Batch application, is copied into the container as recommendations-preprocessing-job.jar. Finally, the ENTRYPOINT uses the java -jar command to execute the JAR when the container starts.

Once the Docker image is ready, we need a process to build, store, and deploy this image.
Building a fully automated deployment pipeline is a crucial step for ensuring consistent, reliable, and repeatable deployments. For this project, we leveraged GitHub Actions to automate both the build and deployment process for our Cloud Run job. The pipeline ensures that each commit gets validated, tested, and deployed efficiently without manual intervention.
Our CI/CD workflow is triggered by specific events in the repository, such as pushes to the main, develop, releases/**, and hotfix/** branches, as well as pull requests. We also added the workflow_dispatch trigger for manual deployments when needed:
name: CI-CD
on:
push:
branches: [ "main", "develop", "releases/**", "hotfix/**" ]
pull_request:
types: [ opened, synchronize, reopened ]
workflow_dispatch:
In this CI/CD process, it's essential to authenticate with Google Cloud in order to interact with resources such as Cloud Run and Artifact Registry. We set up the necessary permissions for GitHub to interact with these services and used the google-github-actions/auth action for GCP authentication, configured with Workload Identity Federation so the workflow can impersonate a service account without storing long-lived keys (a sketch of that step appears after the job definition below).
permissions:
id-token: write
contents: read
issues: write
pull-requests: write
jobs:
cloudrun:
name: CI-CD
runs-on: ubuntu-latest
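The authentication step itself looks roughly like the following. This is a sketch only: the workload identity provider path and the service account address shown here are placeholders, which in our setup come from repository secrets.

- name: Authenticate to Google Cloud
  uses: google-github-actions/auth@v2
  with:
    # Placeholder values; in practice these are supplied via repository secrets.
    workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github-pool/providers/github-provider
    service_account: ci-deployer@my-project.iam.gserviceaccount.com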
The first few steps of the job set up the environment by checking out the code from GitHub, configuring the Java environment, and installing any dependencies using Gradle:
- name: Git Checkout
uses: actions/checkout@v4
- name: Setup Java
uses: actions/setup-java@v4
with:
distribution: temurin
java-version: 21
cache: gradle
Next, we build the application using Gradle. This step compiles the code, runs tests, and generates the necessary JAR files for packaging into a Docker container.
- name: Install Dependencies
run: ./gradlew build --no-daemon
We also implemented a code coverage check using Jacoco. This ensures that our tests are adequately covering the code base. If coverage falls below a threshold (80%), the pull request is marked as failed:
- name: Run Coverage
run: |
chmod +x gradlew
./gradlew jacocoTestReport
if: github.event_name != 'workflow_dispatch'
- name: Fail PR if overall coverage is less than 80%
  # Assumes an earlier step with id 'jacoco' (for example, a JaCoCo report action) exposes the coverage-overall output.
  if: ${{ github.event_name == 'pull_request' && steps.jacoco.outputs.coverage-overall < 80.0 }}
uses: actions/github-script@v6
with:
script: |
core.setFailed('Overall coverage is less than 80%!')
After ensuring code quality, the next crucial step is to build the Docker image. Using the provided Dockerfile, we package the application into a container image, which is then pushed to Google Artifact Registry for easy access from Cloud Run:
- name: Docker Auth
run: |-
gcloud auth configure-docker ${{ secrets.PROJECT_REGION }}-docker.pkg.dev --quiet
- name: Build Image
run: |
docker build . --file Dockerfile --tag ${{ secrets.PROJECT_REGION }}-docker.pkg.dev/${{ secrets.PROJECT_ID }}/${{ secrets.REPOSITORY_NAME }}/${{ secrets.GCP_IMAGE_NAME }}
- name: Push Image
run: docker push ${{ secrets.PROJECT_REGION }}-docker.pkg.dev/${{ secrets.PROJECT_ID }}/${{ secrets.REPOSITORY_NAME }}/${{ secrets.GCP_IMAGE_NAME }}
Once the image is built and pushed, the deployment step begins. This step checks if the Cloud Run job already exists. If it does, the job is updated with the new image. If not, a new Cloud Run job is created:
- name: Deploy Cloud Run Job
run: |
JOB_EXISTS=$(gcloud beta run jobs describe ${{ secrets.JOB_NAME }} --project ${{ secrets.PROJECT_ID }} --region ${{ secrets.PROJECT_REGION }} --format="value(name)" || echo "not found")
if [ "$JOB_EXISTS" != "not found" ]; then
echo "Job already exists, updating..."
gcloud beta run jobs update ${{ secrets.JOB_NAME }} \
--image ${{ secrets.PROJECT_REGION }}-docker.pkg.dev/${{ secrets.PROJECT_ID }}/${{ secrets.REPOSITORY_NAME }}/${{ secrets.GCP_IMAGE_NAME }} \
--project ${{ secrets.PROJECT_ID }} \
--region ${{ secrets.PROJECT_REGION }} \
--set-env-vars "DB_URL=${{ secrets.POSTGRES_DB_URL }},DB_USERNAME=${{ secrets.POSTGRES_DB_USERNAME }},DB_PASSWORD=${{ secrets.POSTGRES_DB_PASSWORD }},OUTPUT_PATH=${{ secrets.OUTPUT_PATH }}" \
--args="${{ secrets.GCP_STORAGE_BUCKET_INPUT_PATH }}","${{ secrets.GCP_STORAGE_BUCKET_NAME }}"
else
echo "Job does not exist, creating a new job..."
gcloud beta run jobs create ${{ secrets.JOB_NAME }} \
--image ${{ secrets.PROJECT_REGION }}-docker.pkg.dev/${{ secrets.PROJECT_ID }}/${{ secrets.REPOSITORY_NAME }}/${{ secrets.GCP_IMAGE_NAME }} \
--project ${{ secrets.PROJECT_ID }} \
--region ${{ secrets.PROJECT_REGION }} \
--set-env-vars "DB_URL=${{ secrets.POSTGRES_DB_URL }},DB_USERNAME=${{ secrets.POSTGRES_DB_USERNAME }},DB_PASSWORD=${{ secrets.POSTGRES_DB_PASSWORD }},OUTPUT_PATH=${{ secrets.OUTPUT_PATH }}" \
--args="${{ secrets.GCP_STORAGE_BUCKET_INPUT_PATH }}","${{ secrets.GCP_STORAGE_BUCKET_NAME }}"
fi
To ensure transparency and provide real-time feedback, we integrated Slack notifications, which inform the team whether the deployment succeeded or failed:
- name: Slack notification
uses: 8398a7/action-slack@v3
with:
status: custom
fields: workflow,job,commit,repo,ref,author,took
custom_payload: |
{
attachments: [{
color: '${{ job.status }}' === 'success' ? 'good' : '${{ job.status }}' === 'failure' ? 'danger' : 'warning',
text: `Action Name: ${process.env.AS_WORKFLOW}\n Repository Name:${process.env.AS_REPO}@${process.env.AS_REF} by ${process.env.AS_AUTHOR} ${{ job.status }} in ${process.env.AS_TOOK}`,
}]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
if: always()
This CI/CD pipeline automates the process of building, testing, deploying, and running Cloud Run jobs in Google Cloud. By using GitHub Actions, we streamlined the deployment flow, ensuring consistency across different environments. Every commit is thoroughly tested, packaged, and deployed to Cloud Run in a seamless, automated fashion.
Once the environment is set up and authenticated, the next crucial step in the CI/CD pipeline is executing the pre-processor job on Google Cloud Run. This step leverages the flexibility of Cloud Run jobs to handle batch processing efficiently.
In our workflow, the pre-processor job is triggered using the gcloud
CLI to execute a job that was previously deployed to Cloud Run. The execution is done using the following command:
- name: Run Job
run: |
gcloud beta run jobs execute ${{ secrets.JOB_NAME }} --region ${{ secrets.PROJECT_REGION }}
This command tells Google Cloud to execute the specified Cloud Run job in the given region. Let’s break this down:
${{ secrets.JOB_NAME }}: the name of the Cloud Run job, stored securely in GitHub secrets, which ensures the correct job is always executed.
${{ secrets.PROJECT_REGION }}: the region where the job is hosted on Google Cloud Run, so the command targets the correct environment.

For those interested in exploring the code behind this blog post and replicating the setup, I've made the entire project available on GitHub. You can find the repository, including the Dockerfile, GitHub Actions workflows, and the Spring Batch code, in the link below:
GitHub Repository: Recommendations Preprocessing Job
Feel free to fork the repository, explore the setup, and adapt it to your own projects. If you have any questions or suggestions, don’t hesitate to open an issue or contribute to the project!
In this blog post, we’ve walked through the end-to-end process of setting up an automated CI/CD pipeline for Google Cloud Run, from building the Docker image to executing batch jobs in a scalable and efficient manner. By leveraging Cloud Run jobs, we were able to create a serverless solution that automatically scales based on demand, reduces operational overhead, and ensures that our data processing tasks are handled seamlessly.
The integration with tools like GitHub Actions for continuous deployment and Slack notifications for workflow updates adds visibility and automation, making the pipeline easier to monitor and manage.
In a world where speed, scalability, and cost-efficiency are paramount, this architecture ensures that the preprocessing job for our recommendation engine runs smoothly and can easily adapt to future demands. With the flexibility and power of Google Cloud services, we’ve built a robust system that helps us focus on innovation, without worrying about infrastructure.
Whether you’re building machine learning pipelines, processing large datasets, or simply looking to automate batch jobs, leveraging Cloud Run with a solid CI/CD setup is a powerful and scalable approach that can help your business stay agile and cost-effective.