aws/sqs: prevent ackLoop starvation #3103

rockwotj · 2025-01-04T05:06:37Z

The ackLoop can be starved, so we add a max outstanding limit to ensure that messages get acked properly, and fix the refresh behavior in general.

Making this a batch input might help too. My theory is that:

- Polling returns 10 messages each time it grabs the lock
- Acking also grabs the lock, but is only able to process 1 message at a time

So if there is a backlog and the locking is fair, the poll loop is going to produce a lot more values than the ack loop can consume, and we end up with unbounded memory usage for the in-flight messages. A batch input where 10 messages are acked at a time would probably help with lock fairness. Just having the cond var in there seems to help a ton, but there is now also an explicit mechanism to ensure that we don't have unbounded memory usage.
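To make the "cond var plus explicit limit" idea concrete, here is a minimal sketch of a bounded in-flight counter. The `inflightLimiter` type and its method names are hypothetical and are not the types used in `input_sqs.go`; the sketch only illustrates how a `sync.Cond` can make the poll side wait until the ack side has drained enough messages.

```go
package main

import "sync"

// inflightLimiter is a hypothetical helper (not the PR's actual type).
// It caps how many messages may be outstanding at once.
type inflightLimiter struct {
	mu          sync.Mutex
	cond        *sync.Cond
	outstanding int
	max         int
}

func newInflightLimiter(max int) *inflightLimiter {
	l := &inflightLimiter{max: max}
	l.cond = sync.NewCond(&l.mu)
	return l
}

// Acquire blocks until n more messages fit under the limit.
// A batch larger than max is still admitted once nothing else is
// outstanding, so an oversized batch cannot block forever.
func (l *inflightLimiter) Acquire(n int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for l.outstanding > 0 && l.outstanding+n > l.max {
		l.cond.Wait()
	}
	l.outstanding += n
}

// Release marks n messages as acked or nacked and wakes any waiters.
func (l *inflightLimiter) Release(n int) {
	l.mu.Lock()
	l.outstanding -= n
	l.mu.Unlock()
	l.cond.Broadcast()
}

// Outstanding reports the current number of in-flight messages.
func (l *inflightLimiter) Outstanding() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.outstanding
}
```

In this sketch the poll loop would call `Acquire(len(batch))` before handing messages downstream, and the ack path would call `Release(1)` per message, so memory stays bounded even if acking is slower than polling.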

We fix all this by moving the refresh out to another loop and only trying to refresh 10 messages at once (which is the API limit for a single batch request). The only improvement from here is to make more requests in parallel, but I think that with the refreshes limited we don't need to do that.

Just adding the enforcement of this limit seems to alleviate some lock contention, and now the ackLoop doesn't get stuck and stop acking stuff. The limit here should help too in cases where there is actual backup. The property name was inspired by the gcp pubsub input.

If the advanced delete message prop was ever `false`, the in flight tracker would just accumulate memory until OOM. Much better to just noop the actual delete, because now when there are nacks we will reset the visibility faster.

From testing I am getting false positive log messages and am not sure why the atomic still has a value unless there is some sneaky copy in there somewhere. Just move to checking in the original map when there are failures, which should hopefully be rare anyways.

Duh, we *don't* want to refresh messages if the deadline is *greater* than half the timeout. Sigh 🤦
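For anyone following along, the inverted condition can be sketched as a small predicate. This assumes "deadline" means the time remaining before the visibility timeout expires; the function name `needsRefresh` is hypothetical, and treating exactly half the timeout as "refresh" is a choice made for the sketch, not something taken from the PR.

```go
package main

import "time"

// needsRefresh is a hypothetical helper illustrating the corrected check.
// remaining is the time left before the message becomes visible again,
// timeout is the queue's visibility timeout.
// Messages with more than half the timeout left are skipped.
func needsRefresh(remaining, timeout time.Duration) bool {
	return remaining <= timeout/2
}
```

With a 30 second timeout, a message with 20 seconds left is skipped, while one with 10 seconds left is refreshed.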

rockwotj · 2025-01-06T16:39:38Z

internal/impl/aws/input_sqs.go

-		if timeoutStr, exists := m.Attributes[sqsiAttributeNameVisibilityTimeout]; exists {
-			// Might as well keep the queue timeout setting refreshed as we
-			// consume new data.
-			if tmpTimeoutSeconds, err := strconv.Atoi(timeoutStr); err == nil {
-				handle.timeoutSeconds = tmpTimeoutSeconds
-			}


FYI this attribute is not mentioned in the docs and doesn't seem to be present in my testing...

Most of this is just cleanup, the only real fix is correcting the timeout to match the queue.

rockwotj · 2025-01-06T18:07:56Z

internal/impl/aws/input_sqs.go

+	if !a.conf.DeleteMessage {
+		return nil
+	}
+	const maxBatchSize = 10


The API limit here on the amazon side is 10, so currently if you set MaxNumberOfMessages > 10 then you can't delete or reset visibility for anything...
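As a rough illustration of respecting that limit, the sketch below splits a list of IDs into batches of at most `maxBatchSize`. The helper name `chunkIDs` is an assumption for the example, and the actual batch delete or visibility calls are left out.

```go
package main

// maxBatchSize mirrors the constant in the diff above: the batch
// APIs accept at most 10 entries per request.
const maxBatchSize = 10

// chunkIDs is a hypothetical helper that splits ids into consecutive
// batches of at most size entries. The batches share the backing
// array of ids, so callers should not mutate ids while using them.
func chunkIDs(ids []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	batches := make([][]string, 0, (len(ids)+size-1)/size)
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		batches = append(batches, ids[start:end])
	}
	return batches
}
```

Each returned batch could then be sent as its own request, so a larger poll size no longer causes every delete or visibility reset to be rejected.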

mihaitodor

Nice job! 🏆 Just left some small nitpicky questions, but feel free to

internal/impl/aws/input_sqs.go

Prevents errors in the batch API, even if it's unlikely. Also fix a log statement.
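The SQS batch APIs reject a request whose entries do not have distinct IDs, which is why deduplicating before building a batch avoids errors. A minimal sketch of order-preserving deduplication follows; `dedupeIDs` is a hypothetical name and not the function added in this PR.

```go
package main

// dedupeIDs is a hypothetical helper that returns ids with duplicates
// removed, keeping the first occurrence of each ID in its original order.
func dedupeIDs(ids []string) []string {
	seen := make(map[string]struct{}, len(ids))
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		if _, ok := seen[id]; ok {
			continue // already queued for this batch
		}
		seen[id] = struct{}{}
		out = append(out, id)
	}
	return out
}
```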

aws/sqs: improve logging

0ea5b3e

rockwotj requested review from Jeffail and mihaitodor January 4, 2025 05:11

rockwotj added 4 commits January 4, 2025 05:28

aws/sqs: fix memory leak

ec3ce58

If the advanced delete message prop was ever `false`, the in flight tracker would just accumulate memory until OOM. Much better to just noop the actual delete, because now when there are nacks we will reset the visibility faster.

update changelog

e544ecc

update docs

824dea2

rockwotj force-pushed the sqs branch from 66c7376 to 824dea2 Compare January 4, 2025 05:28

rockwotj added 7 commits January 5, 2025 04:30

aws/sqs: make message refresh async

783113f

aws/sqs: limit the number of items refreshed at once

cbed7d0

aws/sqs: pull out refresh loop seperately

2c414da

aws/sqs: invert condition

c66e52d

Duh, we *don't* want to refresh messages if the deadline is *greater* than half the timeout. Sigh 🤦

chore: update docs

6780bb4

aws/sqs: downgrade reset failures to info logs

62fdee3

rockwotj requested a review from ooesili January 6, 2025 16:35

update changelog

43eabc4

aws/sqs: fix and cleanup tests

bc1c3f9

Most of this is just cleanup, the only real fix is correcting the timeout to match the queue.

mihaitodor approved these changes Jan 7, 2025

rockwotj added 2 commits January 7, 2025 15:11

aws/sqs: deduplicate IDs

b5773e6

Prevents errors in the batch API, even if it's unlikely. Also fix a log statement.

aws/sqs: handle duplicate recieves of inflight stuff

341cd6f

rockwotj merged commit 6801337 into main Jan 7, 2025
4 checks passed

rockwotj deleted the sqs branch January 7, 2025 15:22
