The go-to resource for upgrading Ruby, Rails, and your dependencies.

Achieving Zero-Downtime Database Migrations on Google Cloud Run


In the mid-nineteenth century, replacing a railroad bridge was a delicate operation. You couldn’t stop the trains for a week; commerce depended on the schedule. Instead, engineers developed techniques to build the new bridge alongside or even around the old one, transitioning the tracks only when the new structure was entirely ready. The trains kept running.

We face a similar challenge when deploying web applications. When migrating a Ruby on Rails application to a serverless container platform like Google Cloud Run, teams often encounter unexpected downtime during routine deployments. In traditional virtual machine environments — what we might call the old bridge — a deployment script might temporarily take the application offline, run bin/rails db:migrate, and then restart the web server. While this causes a brief interruption, the process is linear and predictable.

Cloud Run, though, operates differently. During a deployment, Cloud Run initiates new container instances running the updated code while the existing instances continue to serve active user requests. Traffic is then gradually routed to the new instances. If you configure your container’s entrypoint to execute database migrations before starting the web server, you create a race condition. The database schema changes while the old containers are still processing requests.

When the old code interacts with the new database schema (for example, inserting a record into a table whose column has been renamed or removed), ActiveRecord raises an ActiveRecord::StatementInvalid error. Users see HTTP 500 responses until the old instances are completely drained and shut down. To achieve true zero-downtime deployments on Cloud Run, we must decouple the database migration process from the container startup sequence.

Decoupling Migrations with Cloud Run Jobs

A robust architectural pattern for running migrations on Google Cloud Run is to utilize Cloud Run Jobs. Unlike Cloud Run Services, which listen for HTTP requests and scale based on incoming traffic, Cloud Run Jobs are designed to execute a specific task to completion and then terminate. This makes them the ideal environment for administrative tasks like schema migrations or data backfills.

Instead of bundling the migration command into the CMD or ENTRYPOINT of your Dockerfile, you orchestrate the deployment through your continuous integration pipeline, such as GitHub Actions or Google Cloud Build.
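As a sketch of what that means for the image itself (the base image, paths, and server command here are illustrative assumptions, not a prescribed setup), the container should start only the web server, with no migration command anywhere in its startup path:

```dockerfile
# Illustrative Dockerfile sketch; base image and paths are assumptions.
FROM ruby:3.3-slim

WORKDIR /app
COPY Gemfile Gemfile.lock ./
RUN bundle install
COPY . .

# Start the web server only. Migrations are NOT run here; they are executed
# separately by a dedicated Cloud Run Job before this image receives traffic.
CMD ["bin/rails", "server", "-b", "0.0.0.0", "-p", "8080"]
```

Keeping the entrypoint migration-free also makes cold starts faster, since instances no longer wait on a database check before accepting requests.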

There are several ways to orchestrate a deployment, but the most reliable method divides it into three distinct steps:

  1. Build and Push: The CI pipeline builds the new Docker container image and pushes it to Google Artifact Registry.
  2. Execute the Migration Job: The pipeline updates a dedicated Cloud Run Job with the new container image and executes it. This job runs bin/rails db:migrate.
  3. Deploy the Web Service: Only after the Cloud Run Job completes successfully does the pipeline update the Cloud Run Service to route traffic to the new image.
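The three steps above might be expressed in a Cloud Build configuration along these lines; this is a sketch in which the project, repository, and region values are placeholders, while the job and service names match the examples below:

```yaml
# Sketch of a Cloud Build pipeline; project, repo, and region are placeholders.
steps:
  # 1. Build and push the image to Artifact Registry.
  - name: "gcr.io/cloud-builders/docker"
    args: ["build", "-t", "us-docker.pkg.dev/my-project/my-repo/app:$COMMIT_SHA", "."]
  - name: "gcr.io/cloud-builders/docker"
    args: ["push", "us-docker.pkg.dev/my-project/my-repo/app:$COMMIT_SHA"]

  # 2. Point the migration job at the new image and run it to completion.
  #    --wait makes this step fail the whole build if the migration fails.
  - name: "gcr.io/google.com/cloudsdktool/cloud-sdk"
    entrypoint: "gcloud"
    args: ["run", "jobs", "update", "rails-migration-job",
           "--image", "us-docker.pkg.dev/my-project/my-repo/app:$COMMIT_SHA",
           "--region", "us-central1"]
  - name: "gcr.io/google.com/cloudsdktool/cloud-sdk"
    entrypoint: "gcloud"
    args: ["run", "jobs", "execute", "rails-migration-job",
           "--region", "us-central1", "--wait"]

  # 3. Only after the migration succeeds, roll the web service forward.
  - name: "gcr.io/google.com/cloudsdktool/cloud-sdk"
    entrypoint: "gcloud"
    args: ["run", "deploy", "rails-web-service",
           "--image", "us-docker.pkg.dev/my-project/my-repo/app:$COMMIT_SHA",
           "--region", "us-central1"]
```

Because each step runs sequentially and a failed step aborts the build, a broken migration stops the pipeline before the web service ever sees the new image.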

For example, using the Google Cloud CLI (gcloud), and assuming the job was created once beforehand with gcloud run jobs create, you might execute the migration job like this:

$ gcloud run jobs update rails-migration-job \
  --image my-new-image:v2
$ gcloud run jobs execute rails-migration-job \
  --wait

And then, if that succeeds, you deploy the web service:

$ gcloud run deploy rails-web-service \
  --image my-new-image:v2

By isolating the migration step, we ensure that the web service containers boot quickly and do not attempt to run concurrent migrations. However, orchestrating the infrastructure is only half of the solution.

The Expand and Contract Pattern

Because your Cloud Run deployment process now updates the database before the new code is fully rolled out, the database schema must remain entirely compatible with the currently running version of your application. If a migration drops a column that the old code expects to be present, the old containers will crash.

To handle destructive changes safely, development teams use the “expand and contract” pattern. This strategy breaks a single breaking change into multiple, backward-compatible deployments.

Consider the common scenario of renaming a database column from first_name to given_name. Rather than issuing a single RENAME COLUMN statement, we divide the work into four phases.

Phase 1: Expand

First, we add the new column to the database and update the application code to write to both columns, but continue reading from the old one. We start by generating a migration to add given_name:

class AddGivenNameToUsers < ActiveRecord::Migration[7.1]
  def change
    add_column :users, :given_name, :string
  end
end

In our ActiveRecord model, we might use callbacks to ensure that whenever a record is saved, the value is populated in both fields:

class User < ApplicationRecord
  # Phase 1: write to both columns while reads still use first_name.
  before_save :sync_names

  private

  # Copy the legacy value into the new column on every save, so the two
  # columns stay in sync until the transition phase takes over.
  def sync_names
    self.given_name = first_name
  end
end

We deploy this code. The database expands to accommodate the new structure, and the old code continues functioning perfectly because first_name remains intact.

Phase 2: Migrate Data

Once the expansion phase is deployed, we must backfill the existing records. We write a data migration — often executed as a separate Rake task via a Cloud Run Job — to copy historical data from first_name to given_name for all existing rows in the database:

namespace :data_migrations do
  task backfill_given_names: :environment do
    # find_each loads rows in batches of 1,000 to keep memory usage flat.
    User.where(given_name: nil).find_each do |user|
      # update_column writes directly to the database, skipping callbacks
      # and validations, which keeps the backfill fast and side-effect free.
      user.update_column(:given_name, user.first_name)
    end
  end
end
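For large tables, a set-based backfill is usually much faster than row-by-row updates from Ruby. A sketch of the equivalent SQL, which in production you would run in bounded batches so each statement holds row locks only briefly:

```sql
-- Equivalent set-based backfill. On very large tables, constrain this with
-- an id range and loop over the ranges rather than updating every row at once.
UPDATE users
SET given_name = first_name
WHERE given_name IS NULL;
```

Either way, the backfill is idempotent: re-running it only touches rows where given_name is still NULL, so a failed Cloud Run Job execution can simply be retried.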

Phase 3: Transition

Next, we update the application code to rely entirely on the new given_name column for both reads and writes. We can also instruct ActiveRecord to explicitly ignore the old column:

class User < ApplicationRecord
  self.ignored_columns += ["first_name"]
end

We deploy this change. At this point, the application is no longer utilizing the old column.

Phase 4: Contract

Finally, we create a migration to drop the first_name column from the database.

class RemoveFirstNameFromUsers < ActiveRecord::Migration[7.1]
  def change
    remove_column :users, :first_name, :string
  end
end

Because the currently running code explicitly ignores this column, running this migration will not cause any errors in the active Cloud Run instances. Once deployed, the database contracts to its final, desired state.

Enforcing Safe Practices

Of course, implementing these patterns requires discipline across the engineering team. A developer can inadvertently write a migration that locks a heavily trafficked table or drops an active column.

To prevent these issues from reaching production, teams should integrate tools like the strong_migrations gem. This library hooks into ActiveRecord and will raise an exception in development and test environments if it detects a potentially dangerous operation. For example, if a developer attempts to add a new column with a non-null default value, strong_migrations will halt the migration and provide instructions on how to accomplish the change safely. It forces developers to acknowledge the risk and guides them toward the safe, multi-step approach required for zero-downtime environments.

Conclusion

Deploying Ruby on Rails to Google Cloud Run brings significant gains in scalability and reduces operational overhead, but it requires a shift in how we manage stateful resources. By leveraging Cloud Run Jobs for migration execution and adhering to the expand and contract pattern for schema changes, teams can eliminate deployment-related downtime and provide a seamless experience for their users.

Sponsored by Durable Programming

Need help maintaining or upgrading your Ruby on Rails application? Durable Programming specializes in keeping Rails apps secure, performant, and up-to-date.
