Under load, the breaker doesn’t guarantee the exact number of failures before switching to the open state. #91

Open
jejefferson opened this issue Apr 9, 2025 · 1 comment


jejefferson commented Apr 9, 2025

⚠️ Problem: the Circuit Breaker doesn’t guarantee the exact number of failed requests before tripping, regardless of the configured threshold

The Circuit Breaker doesn’t track how many requests are in flight. If too many arrive at once, the protected service gets overloaded and response times grow dramatically. As a result, you only get timeouts and no request finishes successfully.
If you increase the timeout, the queue grows, potentially without bound, consuming memory.


🧪 Example Scenario

In my case, PostgreSQL responded in about 72 ms without load, but under load response times grew to 30–40+ seconds due to an exhausted connection pool and a huge request queue.
We simulate a failing service:

func req() (any, error) {
    // Simulated overloaded dependency: every call hangs for 30 s and then fails.
    time.Sleep(30 * time.Second)
    return nil, errors.New("wow, no guarantee")
}
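
For context, a minimal sketch of how such a call might be wired behind an HTTP endpoint, assuming the generic gobreaker/v2 API that the Execute snippet below comes from; the handler and breaker variable are illustrative and reuse the req() above plus the endpoint from the k6 script:

package main

import (
    "log"
    "net/http"

    "github.com/sony/gobreaker/v2"
)

// Hypothetical wiring: every incoming HTTP request goes through Execute, so at
// 1000 RPS with 30 s latency roughly 30,000 req() calls pile up before a
// single failure is ever counted.
var cb = gobreaker.NewCircuitBreaker[any](gobreaker.Settings{Name: "PostgresBreaker"})

func handler(w http.ResponseWriter, r *http.Request) {
    if _, err := cb.Execute(req); err != nil {
        http.Error(w, err.Error(), http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/api/someresponsefrompostgresfulltable", handler)
    log.Fatal(http.ListenAndServe(":9999", nil))
}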

Running a k6 stress test at 1000 RPS, the first error is only reported after 30 seconds, by which point roughly 1000 RPS × 30 s ≈ 30,000 requests are already in flight!

Why? Because the Circuit Breaker transitions state based on results, not on in-flight or slow calls.
Until a response is received, no failure is registered:

func (cb *CircuitBreaker[T]) Execute(req func() (T, error)) (T, error) {
    generation, err := cb.beforeRequest()
    ...
    result, err := req() // latency determines when a failure is counted; meanwhile the queue grows without bound
    cb.afterRequest(generation, cb.isSuccessful(err))
    return result, err
}

1. ✅ When the CB works fine

  • Traffic ramps up slowly → the breaker has time to react
  • Low latency → fewer in-flight calls
  • Controlled ramp-up → stable recovery of the protected service

2. ❌ When it breaks (e.g. an overloaded protected resource)

  • Sudden spike → latency increases dramatically. The CB itself produces such spikes when it flips from "half-open" to "closed" and back to "open".
  • Delayed error tracking → requests keep piling up and the queue grows
  • The breaker opens too late, when the protected resource is already overwhelmed with requests and every one of them times out.

[Breaker graph image]


💡 Possible Improvements

  1. Track slow requests and trip the breaker to open
    resilience4j uses SLOW_CALL_RATE_THRESHOLD:
    "Configures a threshold in percentage. The CircuitBreaker considers a call as slow when the call duration is greater than slowCallDurationThreshold(Duration). When the percentage of slow calls is equal to or greater than the threshold, the CircuitBreaker transitions to open and starts short-circuiting calls."
    (A Go sketch of this idea follows after this list.)

  2. Monitor in-flight load
    Use a ratio of in-flight / total, or simply cap the number of in-flight calls. (Also sketched after this list.)
    Example from Elasticsearch

  3. Predictive backpressure
    Track error thresholds over time and proactively reject new requests before overload hits.
    See the comment below with the proposal; it's a tough but necessary way to handle this properly.
    Yes, I know throttling and rate-limiting aren't typically part of a circuit breaker, but we live in the real world, where smart solutions are required.
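
A minimal sketch of idea 1, assuming it is layered on top of gobreaker as a wrapper rather than built into the library; the helper name and threshold are illustrative:

package slowcall

import (
    "fmt"
    "time"
)

// WithSlowCallCheck treats calls slower than threshold as failures, so the
// breaker's IsSuccessful/ReadyToTrip logic can react to latency as well as to
// errors. Hypothetical helper, not part of gobreaker.
func WithSlowCallCheck[T any](threshold time.Duration, req func() (T, error)) func() (T, error) {
    return func() (T, error) {
        start := time.Now()
        result, err := req()
        if elapsed := time.Since(start); err == nil && elapsed > threshold {
            return result, fmt.Errorf("slow call: %s > %s", elapsed, threshold)
        }
        return result, err
    }
}

// Usage: postgresCB.Execute(WithSlowCallCheck(2*time.Second, req))

Note that this still records a slow call only after it returns, so it mainly helps when the dependency is slow but not hung; that is why idea 2 is complementary. A sketch of idea 2, an in-flight cap that rejects instead of queueing (names and limit are illustrative):

package inflight

import "errors"

// ErrTooManyInFlight is returned when the concurrency cap is reached.
var ErrTooManyInFlight = errors.New("too many in-flight requests")

// Limiter caps concurrent calls with a buffered channel used as a semaphore.
type Limiter struct {
    slots chan struct{}
}

func NewLimiter(max int) *Limiter {
    return &Limiter{slots: make(chan struct{}, max)}
}

// Do rejects immediately instead of queueing when the cap is reached, so a
// latency spike cannot build an unbounded backlog in front of the breaker.
func (l *Limiter) Do(req func() error) error {
    select {
    case l.slots <- struct{}{}:
        defer func() { <-l.slots }()
        return req()
    default:
        return ErrTooManyInFlight
    }
}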


📝 Documentation highlight needed. This behavior should be clearly documented.

Quoted from a resource linked in the repo README:

“If the timeout is too long, a thread running a circuit breaker may be blocked for an extended period...
In this time, many other application instances may also attempt to invoke the service...
tying up a significant number of threads before they all fail.”

MSDN


📝 Artefacts

Circuit breaker config used for Postgres:

	postgresCB := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:        "PostgresBreaker",
		MaxRequests: 3,
		Interval:    time.Minute,
		Timeout:     30 * time.Second,
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			return counts.ConsecutiveFailures >= 3
		},
	})

k6 stress config used:

import http from 'k6/http';
import { check } from 'k6';

export let options = {
    scenarios: {
        stress_test: {
            executor: 'ramping-arrival-rate',
            startRate: 100, // Start with 100 RPS
            timeUnit: '1s',
            preAllocatedVUs: 500, // Initial VUs allocation
            maxVUs: 50000, // Allow max scaling to find RPS limit
            stages: [
                { duration: '60s', target: 1000 }, // Ramp up to 1000 RPS
                { duration: '30s', target: 1500 }, // Ramp up to 1500 RPS
            ],
        },
    },
};

export default function () {
    // check the endpoint that retrieves 1.2 MB of JSON data from the table.
    let res = http.get('http://127.0.0.1:9999/api/someresponsefrompostgresfulltable');
    check(res, {
        'is status 200': (r) => r.status === 200,
    });
}

jejefferson commented Apr 9, 2025

A proposal for handling peak loads and traffic spikes.

Let's look at the above and sum up what we have.

  1. The Sony circuit breaker handles state transitions well when requests increase gradually — that’s great!
  2. However, the protected service shows a dramatic increase in latency under sudden load, allowing many requests to pass through before the first error (typically a timeout) is even reported.
  3. The Circuit Breaker itself shapes the traffic into peaks, jumping from low to high. This happens because, after the timeout, the protected service can easily pass the "half-open" check, so the load it receives looks like a square wave: П_П_П_П.

So how can we tune Scenario 2 to behave more like Scenario 1? Let’s figure out what we need to change.


Peak load flattening algorithm.

  1. Always start with 1 request per second.
  2. Gradually increase the RPS over time until the first error is encountered. The rate of increase can depend on the latency of current requests.
  3. Record the value N: the number of requests the service was able to handle before the first error was reported (see the t0 marker).
  4. Once the rate limiter is released, allow all requests to proceed without limits; further handling will be controlled by the circuit breaker state (e.g. open).
  5. Repeat from step 2, starting from the value N learned at step 3.

Or, even simpler, if we don't add a rate limiter to the Sony CB (a rough sketch of this variant follows after the tradeoffs below):

  1. After transitioning from "half-open" to "closed", continue throttling requests. Optional: a config value for how long.
  2. Throttle 99, pass 1; throttle 98, pass 2; and so on. If N is not 0, start from the value N saved at step 3 of the previous iteration.
  3. When the first error occurs, for example at 50 throttled / 50 passed, save the percentage or count N of requests that passed.
  4. Subsequent errors will be registered faster, somewhere around 40 throttled / 60 passed, and latency will still be small because the system is not overwhelmed.
  5. The CB will change state to "open" early. Resources are saved, and the recovering service will appreciate it.
  6. Go to step 1 after the timeout.

In both variants the tradeoffs will be:

  • Throttle requests, or let the queue grow until the breaker is fully "closed". Optionally, provide a config for in-flight limits and allow the queue to grow only while it's non-empty.
  • Should we optimize by learning how many requests passed in the previous iteration, or always start from zero?
    Starting from MaxRequests is more stable if the system state has changed, which is common with shared resources.
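
A rough sketch of the second variant, assuming it is implemented as a wrapper in front of gobreaker rather than inside it; the type names, the window of 100, and the learned-N handling are all illustrative:

package adaptivethrottle

import (
    "errors"
    "sync"
)

// ErrThrottled is returned for requests rejected during the ramp-up phase.
var ErrThrottled = errors.New("throttled during post-recovery ramp-up")

// Throttle gradually re-admits traffic after the breaker closes: it passes
// `allowed` out of every 100 requests, raises `allowed` while calls keep
// succeeding, and remembers the level reached when the first error appears,
// so the next recovery can start from that learned value N.
type Throttle struct {
    mu      sync.Mutex
    allowed int // requests passed per window of 100
    learned int // N from the previous ramp-up, 0 if unknown
    seen    int // position inside the current window
    active  bool
}

// StartRamp is meant to be called when the breaker transitions from
// half-open to closed.
func (t *Throttle) StartRamp() {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.active = true
    t.allowed = t.learned
    if t.allowed < 1 {
        t.allowed = 1
    }
    t.seen = 0
}

// Admit decides whether a request may proceed while the ramp is active.
func (t *Throttle) Admit() error {
    t.mu.Lock()
    defer t.mu.Unlock()
    if !t.active {
        return nil
    }
    t.seen++
    if t.seen > 100 { // new window: pass one more request per 100 than before
        t.seen = 1
        if t.allowed < 100 {
            t.allowed++
        } else {
            t.active = false // fully ramped up, stop throttling
        }
    }
    if t.seen <= t.allowed {
        return nil
    }
    return ErrThrottled
}

// Report feeds call results back: the first error freezes the learned level N
// and hands control back to the circuit breaker.
func (t *Throttle) Report(err error) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.active && err != nil {
        t.learned = t.allowed
        t.active = false
    }
}

Wired around the breaker, a call would first go through Admit, then cb.Execute(req), then Report(err); StartRamp could be hooked into Settings.OnStateChange when the new state is gobreaker.StateClosed.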

@jejefferson jejefferson changed the title Race condition: Execute doesn’t strictly guarantee how many failed executions may occur before the breaker transitions to the open/half state. Race condition under the load: Execute doesn’t strictly guarantee how many failed executions may occur before the breaker transitions to the open/half state. Apr 9, 2025
@jejefferson jejefferson changed the title Race condition under the load: Execute doesn’t strictly guarantee how many failed executions may occur before the breaker transitions to the open/half state. Under load, the breaker doesn’t guarantee the exact number of failures before switching to the open state. Apr 9, 2025