Necessity of linger on exit for servers that time out #63

shikokuchuo · 2023-03-23T19:54:45Z

shikokuchuo
Mar 23, 2023
Maintainer

As servers have the option to time out, or to exit after a set number of tasks, it would be ideal to exit the process immediately thereafter. At present, however, this is only possible after an 'exitlinger' period, which by default is set to 1s. This should be sufficient for sending objects of ~1 GB in size.

What is currently not possible is for exit to be conditional upon the send being completed.

This is, I believe, due to:

- If no linger period is implemented in R, the interpreter thinks execution has ended and reaps all child threads even though the send is still in progress asynchronously at the C level.
- The C functions in the NNG library do not help, as a send is recorded as complete once the socket accepts the message for transport. In other words, NNG's notion of a completed send only means that responsibility has been transferred to the system sockets; it does not guarantee that the send actually completes if the process is reaped in the meantime.
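
To illustrate the second point in isolation, here is a minimal Python sketch using plain OS sockets. It is not NNG or mirai code, and the function name `send_then_read` is made up for the example. It only shows that a send call can return as soon as the operating system has accepted the bytes, before the other side has read anything:

```python
import socket


def send_then_read(payload: bytes) -> bytes:
    """Send on one end of a connected socket pair, then read on the other end.

    Assumes the payload is small enough to fit in the kernel buffer,
    otherwise sendall() would block because nobody is reading yet.
    """
    sender, receiver = socket.socketpair()
    try:
        # sendall() returns once the kernel has buffered the bytes.
        # The receiving side has not read anything at this point.
        sender.sendall(payload)

        # Only now does the receiver start reading.
        received = bytearray()
        while len(received) < len(payload):
            chunk = receiver.recv(len(payload) - len(received))
            if not chunk:
                break
            received.extend(chunk)
        return bytes(received)
    finally:
        sender.close()
        receiver.close()
```

From the sender's point of view the send is "complete" after `sendall()` returns, yet whether the data is ever consumed depends on what happens afterwards.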

It would be great if a solution can be found.

shikokuchuo · 2023-03-24T17:38:03Z

shikokuchuo
Mar 24, 2023
Maintainer Author

This man page for socket close suggests it may not be possible through the existing NNG interface: https://nng.nanomsg.org/man/tip/nng_close.3.html

Closing the socket while data is in transmission will likely lead to loss of that data. There is no automatic linger or flush to ensure that the socket send buffers have completely transmitted. It is recommended to wait a brief period after calling nng_send() or similar functions, before calling this function.


wlandau · 2023-05-08T15:13:44Z

wlandau
May 8, 2023

In the case of long-running computation, it seems like this would matter most when sending the result of a completed task back to the client, rather than receiving data for a new task. And in the former case, would it be possible for the server to pause its idle timers etc. before initiating a send? Unless I am missing something, it seems like this would just be a matter of expressing the timer logic differently in R.


shikokuchuo · 2023-05-08T15:40:45Z

shikokuchuo
May 8, 2023
Maintainer Author

The issue is that we can do what we like prior to the send, or afterwards for that matter, but we simply do not know when it has finished. That is an interplay between the C process and the system TCP stack, which R has no access to at present.


wlandau · 2023-05-08T19:19:27Z

wlandau
May 8, 2023

That makes sense.

By the way, this discussion made me concerned that a server could exit and lose the data far before the client has a chance to download it. I am happy to see that lightweight tasks seem to be available somewhere well after the server exits. On my company's cluster, I started a dispatcher on one node:

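library(mirai)
# The dispatcher listens at this websocket URL on the current node.
url <- sprintf("ws://%s:57000", getip::getip())
print(url)
daemons(
  n = 1L,
  url = url,
  dispatcher = TRUE,
  token = FALSE
)
# Wait until the dispatcher reports a status matrix.
while (!is.matrix(daemons()$daemons)) {
  Sys.sleep(0.1)
}
# Wait until the server launched on the other node comes online.
while (daemons()$daemons[, "online"] < 1L) {
  Sys.sleep(0.1)
}
# Submit four lightweight tasks, then collect the results only after a delay.
tasks <- replicate(4, mirai(rnorm(n = 1)))
Sys.sleep(4)
print(as.numeric(lapply(tasks, function(task) task$data)))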

During the while() loop with daemons()$daemons[, "online"] , I launched a server on a different node on the local network:

R -e 'mirai::server(url = "ws://x.x.x.x:57000", idletime = 1000, exitlinger = 1000)'

The server visibly came and went, and the client did not make an attempt to collect the data until a couple of seconds after that. And yet no result went missing!

print(as.numeric(lapply(tasks, function(task) task$data)))
#> [1]  1.3502759 -0.2049120  0.1465165 -0.5801425

This is really amazing. Where do the results live between the server exit and the moment the client starts to collect them?


shikokuchuo · 2023-05-08T19:25:10Z

shikokuchuo
May 8, 2023
Maintainer Author

> That makes sense.
>
> By the way, this discussion made me concerned that a server could exit and lose the data far before the client has a chance to download it. I am happy to see that lightweight tasks seem to be available somewhere well after the server exits. On my company's cluster, I started a dispatcher on one node:

Ha yes TCP is surprisingly resilient.

> During the while() loop with daemons()$daemons[, "online"] , I launched a server on a different node on the local network:
> R -e 'mirai::server(url = "ws://x.x.x.x:57000", idletime = 1000, exitlinger = 1000)'
> The server visibly came and went, and the client did not make an attempt to collect the data until a couple seconds after that. But yet no result went missing!

The send is eager, so it is done while the server is still alive. This assumes, though, that it finishes transmitting before the 'exitlinger' period ends and the process dies.

> This is really amazing. Where do the results live between the server exit and the moment the client starts to collect them?

I believe the data is just buffered at the client (listener) TCP socket, so it can be collected at any time by NNG.
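
A small Python sketch of that buffering behaviour, using loopback TCP from the standard library only. This is an analogy rather than what NNG does internally, and `collect_after_sender_exit` is a hypothetical name. The sender connects, sends a small payload and closes its socket before the listening side has even accepted the connection:

```python
import socket


def collect_after_sender_exit(payload: bytes) -> bytes:
    """Read data from a sender that has already closed its connection.

    Assumes a small payload that fits in the kernel buffers, so the
    sender does not block while nobody is reading.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    # The "server" connects, sends its result and goes away.
    sender = socket.create_connection(("127.0.0.1", port))
    sender.sendall(payload)
    sender.close()

    # Only now does the "client" side accept and read. The bytes were
    # held by the operating system in the meantime.
    conn, _ = listener.accept()
    chunks = []
    while True:
        chunk = conn.recv(65536)
        if not chunk:  # empty read: the sender has closed
            break
        chunks.append(chunk)
    conn.close()
    listener.close()
    return b"".join(chunks)
```

This only covers the case where the sender closes gracefully after the data has been handed over; it says nothing about a process that is killed while a large transfer is still in flight.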


wlandau · 2023-05-19T15:17:28Z

wlandau
May 19, 2023

Seems like there would have to be new logic. Just for the sake of thinking out loud:

1. Server: when beginning a send, increment a statistic like sends.
2. Server: create a new condition variable to count dispatcher-side receives.
3. Dispatcher: check for incoming data without actually downloading it, similar to .unresolved() (is this possible?)
4. Dispatcher: in the event loop, if (2) shows that the data is completely ready for download from listener TCP socket, then trigger a pipe event to increment the server-side receives condition variable.
5. Server: if the sends statistic and receives CV are equal to each other, then it is safe to exit.

Is this all possible? Am I missing something? I'm not sure if (4) is possible because the dispatcher is non-polling. Without polling, I suppose a callback mechanism would be needed, and from #42 (comment) it sounds like a callback mechanism does not exist at the NNG level.


shikokuchuo · 2023-05-19T22:31:30Z

shikokuchuo
May 19, 2023
Maintainer Author

It's just a question of efficiency. You can always do something like send a received ack when dispatcher receives the result from server and have server wait for that. Just sending messages will be more efficient than establishing a new pipe in [4].

However this will mean having a 'receive task' state at server, followed by a 'receive ack' state. Probably robust, but likely 'something they did 30 years ago'...

And I think this will mean doing this for every task, I don't think there's a good way for server to signal 'I want to exit, send an ack next time'.
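
For concreteness, the ack idea could be sketched roughly as follows in Python. This is a generic illustration over a local socket pair, not mirai or NNG code: the length-prefixed framing, the one-byte `ACK` value and all function names are assumptions made up for the example.

```python
import socket
import struct
import threading

ACK = b"\x06"  # hypothetical one-byte acknowledgement


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before message was complete")
        buf.extend(chunk)
    return bytes(buf)


def dispatcher(sock: socket.socket, received: list) -> None:
    """Receive one length-prefixed result, then acknowledge it."""
    (size,) = struct.unpack("!Q", recv_exact(sock, 8))
    received.append(recv_exact(sock, size))
    sock.sendall(ACK)


def server_send_and_wait(sock: socket.socket, payload: bytes) -> bool:
    """Send a result, then block in a 'receive ack' state."""
    sock.sendall(struct.pack("!Q", len(payload)) + payload)
    return recv_exact(sock, 1) == ACK


def run_ack_demo(payload: bytes):
    """Run one send/ack round trip and return (acked, data seen by dispatcher)."""
    server_sock, dispatcher_sock = socket.socketpair()
    received = []
    worker = threading.Thread(target=dispatcher, args=(dispatcher_sock, received))
    worker.start()
    try:
        acked = server_send_and_wait(server_sock, payload)
    finally:
        worker.join()
        server_sock.close()
        dispatcher_sock.close()
    return acked, received[0]
```

Once `server_send_and_wait()` returns `True`, the server knows the dispatcher holds the full result and can exit without any linger period. The cost is the extra state and round trip on every task, as noted above.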


shikokuchuo · 2023-09-18T21:40:21Z

shikokuchuo
Sep 18, 2023
Maintainer Author

Eliminated the 'exitlinger' in ephemeral (dot) daemons through synchronisation techniques in 35e94ec. The message receive completion callback automatically closes the pipe - triggering a condition variable on the daemon to wake and proceed to exit.

Need to consider how best to implement something similar for the other cases.


shikokuchuo · 2023-09-19T23:36:32Z

shikokuchuo
Sep 19, 2023
Maintainer Author

With the latest commit 51ec228 (requiring dev nanonext) - the exitlinger is likely to be eliminated after the next round of releases. Whilst this commit doesn't get us there yet, it seems solvable.

Notably this commit does solve another important issue - the 'backlogged workers' problem of always having to re-launch these daemons (i.e. wlandau/crew#79 (comment)). @wlandau this should hopefully also allow the crew scheduling algorithm to be simplified.

3 replies

wlandau Sep 20, 2023

> Notably this commit does solve another important issue - the 'backlogged workers' problem of always having to re-launch these daemons

This sounds like an exciting prospect! I will investigate.

wlandau Sep 20, 2023

Worked for me! I was able to really simplify the logic in crew. Also, there was a spot where I needed to call mirai::status() twice in quick succession, and now I only need to call it once (where calls are spaced seconds_interval apart, with default value 0.25 seconds). I think this should be more robust and kinder to the dispatcher. In fact, something about all these changes seems to fix #65 on my end. So all the TLS-related development and debugging is done as far as I am concerned.

shikokuchuo Sep 20, 2023
Maintainer Author

That's fantastic! I'm really glad this had the unintended consequence of fixing #65! I also came to the conclusion these issues were linked when I was doing my own testing on removing 'exitlinger' today.

Incidentally, those changes (on the 'dev' branch of mirai) were ready to merge based on the old version of crew, but are now failing the manual tests 'tasks_max' and 'transient' after the changes you made here. If you have any ideas, do let me know.

In any case, the results we have achieved today are more important than removing 'exitlinger'. That can wait until I have a more creative solution.

shikokuchuo · 2023-09-21T10:17:25Z

shikokuchuo
Sep 21, 2023
Maintainer Author

This is solved as of 7d6f4c2. In all cases there is a fallback to the global timeout of 5s to ensure exiting processes never hang.

7 replies

wlandau Sep 21, 2023

Ah, so if we are talking about how a daemon waits for a response from the dispatcher, this is different from download time. Sounds like it is okay if the input/output data for a task takes more than 5 seconds to transfer over the websocket connection. Please correct me if I am wrong.

shikokuchuo Sep 21, 2023
Maintainer Author

Not quite. The assumption is that what happens asynchronously at the system level takes less than 5s. This isn't exactly the time of the send, as some of it happens synchronously with our send function call. But whereas previously the default for this value was 1s (as we always waited this amount of time), we have now relaxed this massively to 5s (as this is only a maximum, and the actual wait is likely to be much lower). We could even set this higher, as this is just a failsafe - if dispatcher remains available then it will just take however long it takes.

wlandau Sep 21, 2023

Mainly why I am asking is because I am thinking about targets pipelines where some tasks have data that takes more than 5 seconds to upload to a worker or download to the host (maybe even a couple minutes). As long as the dispatcher stays running, will this be possible? If the dispatcher exits, the download/upload no longer matters, and I am okay if the pipeline just drops everything and errors out.

shikokuchuo Sep 21, 2023
Maintainer Author

Forget the above, in ca2dced I've switched it around so that the wait is dependent on the condition variable instead. This is safe as either a response signal or dispatcher exiting (connection drop) will trigger the wake and exit. This is better than having a time-bound failsafe.
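
As a rough illustration of that pattern (not the code in the commit), here is a Python sketch with `threading.Condition`. The `ExitGate` class, the reason strings and the timer standing in for the network callback are all hypothetical:

```python
import threading


class ExitGate:
    """Blocks the exiting process until some event says it is safe to go."""

    def __init__(self) -> None:
        self._cv = threading.Condition()
        self._reason = None

    def signal(self, reason: str) -> None:
        """Called from a callback thread on a response or a dropped connection."""
        with self._cv:
            if self._reason is None:  # keep the first event that arrived
                self._reason = reason
            self._cv.notify_all()

    def wait(self) -> str:
        """Sleep until signalled; no timeout is involved."""
        with self._cv:
            self._cv.wait_for(lambda: self._reason is not None)
            return self._reason


def run_gate_demo(event: str) -> str:
    """Simulate one exit: a timer thread plays the role of the callback."""
    gate = ExitGate()
    callback = threading.Timer(0.05, gate.signal, args=(event,))
    callback.start()
    reason = gate.wait()  # wakes on whichever event fires
    callback.join()
    return reason
```

Here `run_gate_demo("response")` and `run_gate_demo("disconnect")` both return promptly, which mirrors the point that either outcome wakes the waiting process, so the wait never depends on a clock.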

wlandau Sep 21, 2023

That's fantastic! It really reassures me to know this is no longer time-dependent. And I can't believe how fast crew can churn through transient workers now!

I just went through all my tests again, and everything looks good on both Mac and Linux except wlandau/crew#126. (I have not visited https://github.com/wlandau/crew/blob/main/tests/mirai/test-tls-max_tasks.R in a long time, and I am probably missing something obvious.)
