Setting Up Log Aggregation with Datadog or New Relic After a Heroku Migration


In the late 19th century, the development of municipal water systems meant that urban residents no longer had to dig their own wells. You could pay a fee, connect to the city’s water main, and the municipality would handle the immense operational complexity of sourcing, pressurizing, and routing water to your home.

Platform-as-a-Service (PaaS) providers like Heroku operate on a similar principle, abstracting away significant operational complexity. Heroku’s Logplex, for example, automatically captures standard output (stdout) and standard error (stderr) from every dyno, aggregates them, and routes them to designated add-ons. It’s a feature we often take for granted.

When we migrate off Heroku — whether to reduce costs, increase performance, or gain architectural control with AWS ECS, Kubernetes, or Kamal — this automatic log routing is lost. Our application will, of course, continue to write logs to stdout. Without an ingestion mechanism, though, those logs are ephemeral; if a container restarts, the logs are gone, rendering them useless during a production incident.

This, of course, raises the question of how to replace Heroku Logplex. To do so, we must explicitly engineer a log aggregation pipeline. This involves modifying our Ruby on Rails application to emit structured data, deploying a forwarding agent to capture it, and configuring a log management platform like Datadog or New Relic to index it for search and alerting.

Architectural Trade-offs: Datadog vs. New Relic

Before we get into the mechanics of log routing, though, we need to choose a destination. While many log aggregation and Application Performance Monitoring (APM) providers exist, Datadog and New Relic are two of the most common enterprise choices for Ruby on Rails applications. Strictly speaking, both are comprehensive observability platforms, not just log aggregators. They offer robust integrations for Ruby, but they prioritize different operational paradigms. I tend to prefer Datadog for complex, distributed infrastructure, but depending on your circumstances, one may be more useful than the other.

Datadog, for example, excels at infrastructure correlation. If a container experiences a CPU spike, Datadog allows us to seamlessly pivot from an infrastructure metric directly into the logs generated by that specific container during the exact time window. Its Ruby agent (ddtrace) integrates cleanly with modern Rails applications, providing detailed spans for Active Record queries and HTTP requests. However, Datadog’s pricing model can become aggressively expensive at scale, particularly for log ingestion and retention. Therefore, we must carefully filter our log streams to control costs.

New Relic, on the other hand, has a long history in the Ruby community. Its APM tooling provides deep, granular insights into thread profiling and memory allocation, and for teams focused on application-level performance over complex Kubernetes topologies, it often provides a more focused developer experience. Its pricing structure has undergone multiple revisions; at the time of this writing, it offers a unified data ingest model that some organizations find more predictable than Datadog’s per-host, per-million-events billing.

Of course, both platforms require deliberate configuration to extract maximum value; neither will automatically parse complex, multi-line Ruby stack traces without explicit instruction from us.

Standardizing Output: The Importance of Structured Logging

The default Ruby on Rails logger outputs plain text. A single web request generates multiple lines of output detailing the route, parameters, database queries, and rendering time. If we were to inspect our stdout stream, we might see something like this:

Started GET "/users/123" for 127.0.0.1 at 2023-10-27 10:00:00 -0400
Processing by UsersController#show as HTML
  Parameters: {"id"=>"123"}
  User Load (0.5ms)  SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2  [["id", 123], ["LIMIT", 1]]
Completed 200 OK in 15ms (Views: 10.2ms | ActiveRecord: 0.5ms)

When these disparate text lines are forwarded to a log aggregator, they are treated as independent events. This makes searching for an error that occurred for a specific user an exercise in fragile text parsing.

To resolve this, our Rails application must emit structured logs, typically in JSON format. Structured logging ensures that every log event is a single, parsable object containing all relevant context. For this, the lograge gem is a standard, robust solution in the Rails ecosystem.

We can configure lograge for structured output by modifying our production environment’s configuration:

# config/environments/production.rb
Rails.application.configure do
  # Enable lograge
  config.lograge.enabled = true

  # Set the log formatter to output JSON
  config.lograge.formatter = Lograge::Formatters::Json.new

  # Add custom data to the log output
  config.lograge.custom_options = lambda do |event|
    {
      time: Time.now.utc.iso8601,
      host: Socket.gethostname,
      # These keys are only present if we add them to the
      # instrumentation payload ourselves (e.g. via
      # append_info_to_payload in ApplicationController)
      user_id: event.payload[:user_id],
      request_id: event.payload[:request_id]
    }
  end
end

With this configuration, that same web request will generate a single line of JSON. I’ve abbreviated the output for brevity:

{"method":"GET","path":"/users/123","format":"html", ...snip... "user_id":123,"request_id":"5f8a9b21"}
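Note that Rails does not put user_id (or request_id) into the instrumentation payload on its own; we have to add them ourselves. A minimal sketch, assuming a current_user helper such as the one Devise provides:

```ruby
# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  private

  # Rails invokes append_info_to_payload for every request; anything
  # we merge into the payload here surfaces in lograge's
  # custom_options lambda as event.payload[:user_id] and
  # event.payload[:request_id].
  def append_info_to_payload(payload)
    super
    payload[:request_id] = request.request_id
    payload[:user_id] = current_user&.id if respond_to?(:current_user, true)
  end
end
```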

You may also notice that we include user_id and request_id in our custom options. By adding the request_id, we establish a correlation identifier. If we inject this same request_id and user_id into our background jobs (for example, with Sidekiq or Resque), we can trace a user’s action from the initial web request through all asynchronous processing within our logging platform.
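One way to propagate that identifier is a Sidekiq client middleware that stamps every enqueued job with the current request_id. In the sketch below, RequestContext is my own illustrative thread-local store, not part of Sidekiq; in a real application, ActiveSupport::CurrentAttributes is a common alternative, and we would set RequestContext.request_id from a Rack middleware or before_action.

```ruby
# A hypothetical thread-local store for the current request's ID.
module RequestContext
  def self.request_id=(id)
    Thread.current[:request_id] = id
  end

  def self.request_id
    Thread.current[:request_id]
  end
end

# Sidekiq client middleware: runs when a job is pushed to Redis, so
# anything we add to the job hash is serialized alongside it.
class RequestIdClientMiddleware
  def call(worker_class, job, queue, redis_pool)
    job["request_id"] ||= RequestContext.request_id
    yield
  end
end
```

We would register this with chain.add(RequestIdClientMiddleware) inside Sidekiq.configure_client, and then log job["request_id"] from a matching server middleware so the worker’s log lines carry the same identifier as the originating web request.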

Infrastructure Implementation: The Forwarder Pattern

One common approach we might consider during a Heroku migration is to send logs directly from the Ruby application to the Datadog or New Relic API. This approach, though, is fragile. If the logging provider’s API experiences latency, the Ruby process will block waiting for a response, leading to cascading failures.

A more robust solution is to treat logs as a continuous, local stream. The application writes to stdout, and the container runtime (like Docker) manages that stream. A separate, dedicated process — a log forwarder — is then responsible for reading that stream and reliably transmitting it to the aggregator.

In an AWS ECS or Kubernetes environment, this is typically implemented with a sidecar container or a DaemonSet. For this role, the Datadog Agent and Fluent Bit are standard choices. Fluent Bit, for example, is a lightweight C program that can tail container logs, buffer them to survive network partitions, and forward them in batches.

A basic configuration might look like this:

# Example fluent-bit.conf snippet for Datadog
[OUTPUT]
    Name        datadog
    Match       *
    Host        http-intake.logs.datadoghq.com
    TLS         on
    API_Key     ${DATADOG_API_KEY}
    compress    gzip

This architecture decouples our application’s performance from the availability of the log ingestion API. From the application’s perspective, the forwarder is invisible; the application simply writes to stdout and remains blissfully unaware of whether the logs are successfully transmitted.

Cost Management and Data Retention Strategies

Log aggregation costs scale linearly with traffic. If an application serves millions of requests daily, logging every successful 200 OK response will rapidly consume the infrastructure budget. Effective log management, therefore, requires strict filtering at both the application and the forwarder level.

First, we should eliminate noise at the source. Load balancers, for example, frequently ping an application’s health check endpoint (e.g., /up in Rails 7.1+). These requests provide no diagnostic value during an incident and should not be logged in production. We can instruct lograge to ignore them:

# config/environments/production.rb
config.lograge.ignore_actions = ['Rails::HealthController#show']
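For filtering beyond exact action names, lograge also provides an ignore_custom hook, which accepts a lambda and suppresses any event for which it returns true. A sketch that drops anything resembling a probe path (the /up and /healthz prefixes here are illustrative; substitute whatever your load balancer actually hits):

```ruby
# config/environments/production.rb
config.lograge.ignore_custom = lambda do |event|
  # Returning true suppresses the log line for this request entirely.
  event.payload[:path].to_s.start_with?("/up", "/healthz")
end
```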

Second, we can utilize index management within our aggregator. Both Datadog and New Relic allow us to ingest logs but selectively index them. For example, we might configure the platform to index 100% of logs with an error severity, but only sample 5% of info level logs for statistical analysis. The remaining 95% of info logs can then be archived to a service like AWS S3 for compliance, without incurring expensive hot-storage indexing costs.

Conclusion

Replacing Heroku Logplex requires upfront architectural planning. By enforcing structured JSON logging, utilizing robust log forwarders, and implementing aggressive filtering strategies, we can build an observable, highly searchable environment that scales securely and economically. It is, in effect, building our own municipal water service — a significant undertaking, but one that grants us control over a critical piece of our infrastructure.
