Under load, the breaker doesn’t guarantee the exact number of failures before switching to the open state. #91

Open
jejefferson opened this issue Apr 9, 2025 · 1 comment


jejefferson commented Apr 9, 2025

⚠️ Problem: the Circuit Breaker doesn’t guarantee the exact number of failed requests before tripping, regardless of the configured threshold

The Circuit Breaker doesn’t track how many requests are in flight. If too many arrive at once, the protected service gets overloaded and response times grow dramatically. As a result, you only get timeouts and no request finishes successfully.
If you increase the timeout, the queue grows, potentially without bound, consuming memory.


🧪 Example Scenario

In my case, PostgreSQL responded in about 72 ms without load, but under load response times grew to 30–40+ seconds due to an exhausted connection pool and a huge request queue.
We simulate a failing service:

func req() (any, error) {
    // Simulated overloaded dependency: every call hangs for 30 s and then fails.
    time.Sleep(30 * time.Second)
    return nil, errors.New("wow, no guarantee")
}
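
For context, a minimal sketch of how such a call might be wired behind an HTTP endpoint, assuming the generic gobreaker/v2 API that the Execute snippet below comes from; the handler and breaker variable are illustrative and reuse the req() above plus the endpoint from the k6 script:

package main

import (
    "log"
    "net/http"

    "github.com/sony/gobreaker/v2"
)

// Hypothetical wiring: every incoming HTTP request goes through Execute, so at
// 1000 RPS with 30 s latency roughly 30,000 req() calls pile up before a
// single failure is ever counted.
var cb = gobreaker.NewCircuitBreaker[any](gobreaker.Settings{Name: "PostgresBreaker"})

func handler(w http.ResponseWriter, r *http.Request) {
    if _, err := cb.Execute(req); err != nil {
        http.Error(w, err.Error(), http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/api/someresponsefrompostgresfulltable", handler)
    log.Fatal(http.ListenAndServe(":9999", nil))
}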

Running a k6 stress test at 1000 RPS, the first error is only reported after 30 seconds, by which point roughly 1000 RPS × 30 s ≈ 30,000 requests are already in flight!

Why? Because the Circuit Breaker transitions state based on results, not on in-flight or slow calls.
Until a response is received, no failure is registered:

func (cb *CircuitBreaker[T]) Execute(req func() (T, error)) (T, error) {
    generation, err := cb.beforeRequest()
    ...
    result, err := req() // latency determines when a failure is counted; meanwhile the queue grows without bound
    cb.afterRequest(generation, cb.isSuccessful(err))
    return result, err
}

1. ✅ When the CB works fine

  • Traffic ramps up slowly → the breaker has time to react
  • Low latency → fewer in-flight calls
  • Controlled ramp-up → stable recovery of the protected service

2. ❌ When it breaks (e.g. an overloaded protected resource)

  • Sudden spike → latency increases dramatically. The CB itself produces such spikes when it flips from "half-open" to "closed" and back to "open".
  • Delayed error tracking → requests keep piling up and the queue grows
  • The breaker opens too late, when the protected resource is already overwhelmed with requests and every one of them times out.

[Breaker graph image]


💡 Possible Improvements

  1. Track slow requests and trip the breaker to open
    resilience4j uses SLOW_CALL_RATE_THRESHOLD:
    "Configures a threshold in percentage. The CircuitBreaker considers a call as slow when the call duration is greater than slowCallDurationThreshold(Duration). When the percentage of slow calls is equal to or greater than the threshold, the CircuitBreaker transitions to open and starts short-circuiting calls."
    (A Go sketch of this idea follows after this list.)

  2. Monitor in-flight load
    Use a ratio of in-flight / total, or simply cap the number of in-flight calls. (Also sketched after this list.)
    Example from Elasticsearch

  3. Predictive backpressure
    Track error thresholds over time and proactively reject new requests before overload hits.
    See the comment below with the proposal; it's a tough but necessary way to handle this properly.
    Yes, I know throttling and rate-limiting aren't typically part of a circuit breaker, but we live in the real world, where smart solutions are required.
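
A minimal sketch of idea 1, assuming it is layered on top of gobreaker as a wrapper rather than built into the library; the helper name and threshold are illustrative:

package slowcall

import (
    "fmt"
    "time"
)

// WithSlowCallCheck treats calls slower than threshold as failures, so the
// breaker's IsSuccessful/ReadyToTrip logic can react to latency as well as to
// errors. Hypothetical helper, not part of gobreaker.
func WithSlowCallCheck[T any](threshold time.Duration, req func() (T, error)) func() (T, error) {
    return func() (T, error) {
        start := time.Now()
        result, err := req()
        if elapsed := time.Since(start); err == nil && elapsed > threshold {
            return result, fmt.Errorf("slow call: %s > %s", elapsed, threshold)
        }
        return result, err
    }
}

// Usage: postgresCB.Execute(WithSlowCallCheck(2*time.Second, req))

Note that this still records a slow call only after it returns, so it mainly helps when the dependency is slow but not hung; that is why idea 2 is complementary. A sketch of idea 2, an in-flight cap that rejects instead of queueing (names and limit are illustrative):

package inflight

import "errors"

// ErrTooManyInFlight is returned when the concurrency cap is reached.
var ErrTooManyInFlight = errors.New("too many in-flight requests")

// Limiter caps concurrent calls with a buffered channel used as a semaphore.
type Limiter struct {
    slots chan struct{}
}

func NewLimiter(max int) *Limiter {
    return &Limiter{slots: make(chan struct{}, max)}
}

// Do rejects immediately instead of queueing when the cap is reached, so a
// latency spike cannot build an unbounded backlog in front of the breaker.
func (l *Limiter) Do(req func() error) error {
    select {
    case l.slots <- struct{}{}:
        defer func() { <-l.slots }()
        return req()
    default:
        return ErrTooManyInFlight
    }
}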


📝 Documentation highlight needed. This behavior should be clearly documented.

Quoted from a resource linked in the repo README:

“If the timeout is too long, a thread running a circuit breaker may be blocked for an extended period...
In this time, many other application instances may also attempt to invoke the service...
tying up a significant number of threads before they all fail.”

MSDN


📝 Artefacts

Circuit breaker config used for Postgres:

	postgresCB := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:        "PostgresBreaker",
		MaxRequests: 3,
		Interval:    time.Minute,
		Timeout:     30 * time.Second,
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			return counts.ConsecutiveFailures >= 3
		},
	})

k6 stress config used:

import http from 'k6/http';
import { check } from 'k6';

export let options = {
    scenarios: {
        stress_test: {
            executor: 'ramping-arrival-rate',
            startRate: 100, // Start with 100 RPS
            timeUnit: '1s',
            preAllocatedVUs: 500, // Initial VUs allocation
            maxVUs: 50000, // Allow max scaling to find RPS limit
            stages: [
                { duration: '60s', target: 1000 }, // Ramp up to 1000 RPS
                { duration: '30s', target: 1500 }, // Ramp up to 1500 RPS
            ],
        },
    },
};

export default function () {
    // check the endpoint that retrieves 1.2 MB of JSON data from the table.
    let res = http.get('http://127.0.0.1:9999/api/someresponsefrompostgresfulltable');
    check(res, {
        'is status 200': (r) => r.status === 200,
    });
}

jejefferson commented Apr 9, 2025

A proposal for handling peak loads and traffic spikes.

Let's look at the above and sum up what we have.

  1. The Sony circuit breaker handles state transitions well when requests increase gradually — that’s great!
  2. However, the protected service shows a dramatic increase in latency under sudden load, allowing many requests to pass through before the first error (typically a timeout) is even reported.
  3. The Circuit Breaker itself shapes the traffic into peaks, jumping from low to high. This happens because, after the timeout, the protected service can easily pass the "half-open" check, so the load it receives looks like a square wave: П_П_П_П.

So how can we tune Scenario 2 to behave more like Scenario 1? Let’s figure out what we need to change.


Peak load flattening algorithm.

  1. Always start with 1 request per second.
  2. Gradually increase the RPS over time until the first error is encountered. The rate of increase can depend on the latency of current requests.
  3. Record the value N: the number of requests the service was able to handle before the first error was reported (see the t0 marker).
  4. Once the rate limiter is released, allow all requests to proceed without limits; further handling will be controlled by the circuit breaker state (e.g. open).
  5. Repeat from step 2, starting from the value N learned at step 3.

Or, even simpler, if we don't add a rate limiter to the Sony CB (a rough sketch of this variant follows after the tradeoffs below):

  1. After transitioning from "half-open" to "closed", continue throttling requests. Optional: a config value for how long.
  2. Throttle 99, pass 1; throttle 98, pass 2; and so on. If N is not 0, start from the value N saved at step 3 of the previous iteration.
  3. When the first error occurs, for example at 50 throttled / 50 passed, save the percentage or count N of requests that passed.
  4. Subsequent errors will be registered faster, somewhere around 40 throttled / 60 passed, and latency will still be small because the system is not overwhelmed.
  5. The CB will change state to "open" early. Resources are saved, and the recovering service will appreciate it.
  6. Go to step 1 after the timeout.

In both variants the tradeoffs will be:

  • Throttle requests, or let the queue grow until the breaker is fully "closed". Optionally, provide a config for in-flight limits and allow the queue to grow only while it's non-empty.
  • Should we optimize by learning how many requests passed in the previous iteration, or always start from zero?
    Starting from MaxRequests is more stable if the system state has changed, which is common with shared resources.
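
A rough sketch of the second variant, assuming it is implemented as a wrapper in front of gobreaker rather than inside it; the type names, the window of 100, and the learned-N handling are all illustrative:

package adaptivethrottle

import (
    "errors"
    "sync"
)

// ErrThrottled is returned for requests rejected during the ramp-up phase.
var ErrThrottled = errors.New("throttled during post-recovery ramp-up")

// Throttle gradually re-admits traffic after the breaker closes: it passes
// `allowed` out of every 100 requests, raises `allowed` while calls keep
// succeeding, and remembers the level reached when the first error appears,
// so the next recovery can start from that learned value N.
type Throttle struct {
    mu      sync.Mutex
    allowed int // requests passed per window of 100
    learned int // N from the previous ramp-up, 0 if unknown
    seen    int // position inside the current window
    active  bool
}

// StartRamp is meant to be called when the breaker transitions from
// half-open to closed.
func (t *Throttle) StartRamp() {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.active = true
    t.allowed = t.learned
    if t.allowed < 1 {
        t.allowed = 1
    }
    t.seen = 0
}

// Admit decides whether a request may proceed while the ramp is active.
func (t *Throttle) Admit() error {
    t.mu.Lock()
    defer t.mu.Unlock()
    if !t.active {
        return nil
    }
    t.seen++
    if t.seen > 100 { // new window: pass one more request per 100 than before
        t.seen = 1
        if t.allowed < 100 {
            t.allowed++
        } else {
            t.active = false // fully ramped up, stop throttling
        }
    }
    if t.seen <= t.allowed {
        return nil
    }
    return ErrThrottled
}

// Report feeds call results back: the first error freezes the learned level N
// and hands control back to the circuit breaker.
func (t *Throttle) Report(err error) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.active && err != nil {
        t.learned = t.allowed
        t.active = false
    }
}

Wired around the breaker, a call would first go through Admit, then cb.Execute(req), then Report(err); StartRamp could be hooked into Settings.OnStateChange when the new state is gobreaker.StateClosed.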

@jejefferson jejefferson changed the title Race condition: Execute doesn’t strictly guarantee how many failed executions may occur before the breaker transitions to the open/half state. Race condition under the load: Execute doesn’t strictly guarantee how many failed executions may occur before the breaker transitions to the open/half state. Apr 9, 2025
@jejefferson jejefferson changed the title Race condition under the load: Execute doesn’t strictly guarantee how many failed executions may occur before the breaker transitions to the open/half state. Under load, the breaker doesn’t guarantee the exact number of failures before switching to the open state. Apr 9, 2025