
[Question] Configuring Hangfire to avoid logging errors on transient database timeouts #2493

Open
wlthomson opened this issue Jan 9, 2025 · 4 comments

Comments

@wlthomson

I am using Hangfire and Serilog (backed by SQL Server storage) for background tasks in my .NET application, with monitoring set up to notify us whenever an unexpected exception occurs.

Because we host in a cloud environment, transient database errors occur fairly frequently. This is fine for the jobs themselves, since our retry policies are configured to log only once all retries are exhausted. The problem I am experiencing is when a transient database connectivity blip causes Hangfire to fail to access job metadata in our SQL database. Whenever this happens, an exception is thrown and an error is logged (e.g. Execution Worker is in the Failed state now due to an exception, execution will be retried in no more than 00:00:04). In almost all cases, the database recovers on retry.

Is there a way that Hangfire can be configured to only log these errors after a certain number of retries have occurred?
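For context, the bluntest workaround I can think of is to suppress Hangfire's log events below a certain level in the Serilog pipeline, but that would also hide genuine failures, which is why I'm asking about a retry-count-based option instead. A sketch, assuming Hangfire's events reach Serilog with a "Hangfire"-prefixed SourceContext (which is the case when they flow through Microsoft.Extensions.Logging):

```csharp
using Serilog;
using Serilog.Events;

// Sketch: keep application logs at Information, but only let Fatal-level
// events through from the Hangfire source context. This hides the transient
// worker errors, at the cost of also hiding real Hangfire errors.
// The "Hangfire" override key is an assumption; adjust it if your events
// carry a different SourceContext.
Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .MinimumLevel.Override("Hangfire", LogEventLevel.Fatal)
    .WriteTo.Console()
    .CreateLogger();
```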

@wlthomson wlthomson changed the title [Question] Configuring Hangfire to avoid error logs on transient database errors [Question] Configuring Hangfire to avoid logging errors on transient database timeouts Jan 9, 2025
@odinserj
Member

odinserj commented Jan 9, 2025

Hm, all logging levels from Info upwards already limit the number of logged messages to avoid reporting transient errors. Only the Debug and Trace levels log everything. So when you see that something is in the Failed state, it means that several retry attempts have already been made.

Any chance you are logging at the Debug or Trace level?

@wlthomson
Author

wlthomson commented Jan 29, 2025

Thanks for responding so quickly and apologies for the delayed reply.

I've double-checked and can confirm that the logs are occurring at the Error level.

As above, the error message states that a retry will be performed in 4s. Ideally, we wouldn't have an error log emitted until a few retries have occurred.

To clarify, the logging mechanism works as documented when errors occur within our user code (e.g. when a background task fails to read from the DB, no error logs are emitted until the retry count has been exceeded). The problem seems to be when Hangfire itself encounters a transient database error while trying to read job metadata from the database (outside the scope of our user code).

Hopefully that helps... please let me know if there's any other info I can provide!

@odinserj
Member

odinserj commented Feb 4, 2025

Can you post some more of the error messages that occur too often, ideally with the stack traces of the corresponding exceptions? I need to understand where this problem happens in order to see the possible workarounds.

@wlthomson
Author

wlthomson commented Mar 16, 2025

Can you post here some more error messages that occur too often

The frequency itself isn't the problem. The issue is that the error is caused by an expected error state (a connectivity blip) that we expect to resolve itself on retry. Unfortunately I can't provide any other examples. As of writing, this is the only exception that seems to be raised before a retry has been attempted.

possibly with the stack traces of the corresponding exceptions?

Here is a stack trace of the exception which is being surfaced:

Microsoft.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 35 - An internal exception was caught)
 ---> System.IO.IOException: Unable to write data to the transport connection: Connection reset by peer.
 ---> System.Net.Sockets.SocketException (104): Connection reset by peer
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Stream.Write(ReadOnlySpan`1 buffer)
   at Microsoft.Data.SqlClient.SNI.SslOverTdsStream.Write(ReadOnlySpan`1 buffer)
   at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Send(SNIPacket packet)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.SNIWritePacket(PacketHandle packet, UInt32& sniError, Boolean canAccumulate, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.WriteSni(Boolean canAccumulate)
   at Microsoft.Data.SqlClient.TdsParserStateObject.WritePacket(Byte flushMode, Boolean canAccumulate)
   at Microsoft.Data.SqlClient.TdsParser.TdsLogin(SqlLogin rec, FeatureExtension requestedFeatures, SessionData recoverySessionData, FederatedAuthenticationFeatureExtensionData fedAuthFeatureExtensionData, SqlConnectionEncryptOption encrypt)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.Login(ServerInfo server, TimeoutTimer timeout, String newPassword, SecureString newSecurePassword, SqlConnectionEncryptOption encrypt)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString connectionOptions, SqlCredential credential, TimeoutTimer timeout)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(TimeoutTimer timeout, SqlConnectionString connectionOptions, SqlCredential credential, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds..ctor(DbConnectionPoolIdentity identity, SqlConnectionString connectionOptions, SqlCredential credential, Object providerInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString userConnectionOptions, SessionData reconnectSessionData, Boolean applyTransientFaultHandling, String accessToken, DbConnectionPool pool)
   at Microsoft.Data.SqlClient.SqlConnectionFactory.CreateConnection(DbConnectionOptions options, DbConnectionPoolKey poolKey, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningConnection, DbConnectionOptions userOptions)
   at Microsoft.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at Microsoft.Data.SqlClient.SqlConnection.Open(SqlConnectionOverrides overrides)
   at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection() in C:\projects\hangfire-525\src\Hangfire.SqlServer\SqlServerStorage.cs:line 393
   at Hangfire.SqlServer.SqlServerStorage.UseConnection[TContext,TResult](DbConnection dedicatedConnection, Func`4 func, TContext context) in C:\projects\hangfire-525\src\Hangfire.SqlServer\SqlServerStorage.cs:line 278
   at Hangfire.SqlServer.SqlServerJobQueue.DequeueUsingSlidingInvisibilityTimeout(String[] queues, CancellationToken cancellationToken) in C:\projects\hangfire-525\src\Hangfire.SqlServer\SqlServerJobQueue.cs:line 143
   at Hangfire.Server.Worker.Execute(BackgroundProcessContext context) in C:\projects\hangfire-525\src\Hangfire.Core\Server\Worker.cs:line 104
   at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state) in C:\projects\hangfire-525\src\Hangfire.Core\Server\BackgroundProcessDispatcherBuilder.cs:line 82
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state) in C:\projects\hangfire-525\src\Hangfire.Core\Processing\BackgroundExecution.cs:line 118
ClientConnectionId:4a8f097c-f741-4112-9164-84b0795879d7

As indicated by the trace, the exception seems to be surfaced by SqlConnection.Open (the cause of which is understood to be a transient cloud connectivity issue). The outer exception message is:

Execution Worker is in the Failed state now due to an exception, execution will be retried in no more than 00:00:04.
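For what it's worth, one mitigation I'm considering is to absorb the blip inside the connection itself, so Hangfire never sees the exception. This is a sketch, not something I've confirmed works: it assumes a Hangfire.SqlServer version whose UseSqlServerStorage accepts a connection factory, and Microsoft.Data.SqlClient 3.0+, where the configurable retry logic applies to SqlConnection.Open:

```csharp
using System;
using Hangfire;
using Microsoft.Data.SqlClient;

// Sketch: retry transient failures inside SqlConnection.Open() itself.
// Assumes Microsoft.Data.SqlClient 3.0+; note that in some versions the
// retry-logic feature must first be enabled via an AppContext switch, and
// whether this particular login error counts as "transient" for the
// provider's built-in error list is an open question.
var retryProvider = SqlConfigurableRetryFactory.CreateExponentialRetryProvider(
    new SqlRetryLogicOption
    {
        NumberOfTries = 3,                          // total attempts per Open()
        DeltaTime = TimeSpan.FromSeconds(2),        // base backoff interval
        MaxTimeInterval = TimeSpan.FromSeconds(30), // cap on any single delay
    });

// connectionString is assumed to be defined elsewhere in the app.
GlobalConfiguration.Configuration.UseSqlServerStorage(() =>
    new SqlConnection(connectionString)
    {
        RetryLogicProvider = retryProvider,
    });
```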

Edit: apologies for the duplicate posts.
