
Workflow execution failure during dispatcher PR #72

Open
burnout87 opened this issue Oct 3, 2023 · 18 comments

@burnout87
Contributor

During the development of oda-hub/dispatcher-app#585, the following workflow failed:

https://github.com/oda-hub/dispatcher-app/actions/runs/6149352313/job/16685039350

Please feel free to move this issue somewhere more suitable if needed.

@volodymyrss
Member

It's a duplicate of #72

@volodymyrss
Member

Actually, I'll keep it open to confirm it's a duplicate.

@dsavchenko
Member

This is not a duplicate; tests for both plugins were failing.
The complication is that it's not always reproducible.
It also seems that the causes (or at least their manifestations in the logs) were probably different.

@dsavchenko dsavchenko self-assigned this Oct 4, 2023
@burnout87
Contributor Author

I think the same error appeared again:

https://github.com/oda-hub/dispatcher-app/actions/runs/6484236816/job/17607602126

@dsavchenko
Member

Yes, it's the same error, and I can't reproduce it locally. I see lots of timeouts while polling the dispatcher with oda-api in test_full_stack.
The dispatcher is functional, though, and replies correctly, but only after the request has already timed out.

We discussed a bit that we want to profile and reduce latency if possible. But for the time being, as it only appears in the pipeline, I propose #73
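
For illustration, this is roughly the kind of polling loop where that failure mode shows up: a per-request timeout fires even though the dispatcher eventually answers correctly. This is only a sketch, not the actual test_full_stack code (which uses oda-api); the URL, parameters, and timeouts are hypothetical placeholders.

```python
# Sketch of a polling loop with a per-request timeout (not the real test code).
import time
import requests

DISPATCHER_URL = "http://localhost:8001/run_analysis"  # hypothetical local dispatcher


def poll_until_done(params, per_request_timeout=30, max_wait=300):
    """Poll the dispatcher until the job is no longer 'submitted'."""
    deadline = time.time() + max_wait
    while time.time() < deadline:
        try:
            # If the dispatcher takes longer than per_request_timeout to reply,
            # this raises even though the reply itself would have been correct.
            r = requests.get(DISPATCHER_URL, params=params, timeout=per_request_timeout)
            r.raise_for_status()
            data = r.json()
            if data.get("query_status") != "submitted":
                return data
        except requests.Timeout:
            # The failure mode seen in CI: the request times out, but the
            # dispatcher is functional and replies shortly after.
            print("request timed out, retrying")
        time.sleep(5)
    raise TimeoutError("dispatcher did not finish within max_wait")
```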

@burnout87
Contributor Author

I re-ran the workflow, and this time it completed.

@dsavchenko
Member

Still, let's keep this issue open for some time

@burnout87
Contributor Author

I think I encountered the same issue again; is it related? A TimeoutError is mentioned.

https://github.com/oda-hub/dispatcher-app/actions/runs/7087471000/job/19287806505?pr=626

@dsavchenko
Member

dsavchenko commented Dec 4, 2023

That's a different kind of timeout:

TimeoutError: The provided start pattern Serving Flask app could not be matched within the specified time interval of 30 seconds

It's related to the live_nb2service fixture, which starts nb2service as a separate process via the xprocess lib and waits for 'Serving Flask app' in its stdout.

Didn't we have changes to nb2workflow which could e.g. affect the verbosity?

xprocess sometimes used to cause issues when debugging tests locally: an uncleanly terminated test can leave the process running, making it impossible to start another one because the port is still in use. But in CI it always worked well. I will investigate further.
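
For context, such a fixture typically looks roughly like the sketch below (using pytest-xprocess); the actual live_nb2service fixture may differ, and the command line, port, and notebook path here are placeholders.

```python
# Rough sketch of an xprocess-based fixture like live_nb2service (placeholders only).
import pytest
from xprocess import ProcessStarter


@pytest.fixture
def live_nb2service(xprocess):
    class Starter(ProcessStarter):
        # xprocess tails the process output and raises
        # "TimeoutError: The provided start pattern ... could not be matched
        # within the specified time interval" if the pattern never appears.
        pattern = "Serving Flask app"
        timeout = 30
        # Hypothetical command line; the real fixture starts nb2service differently.
        args = ["nb2service", "--port", "9393", "tests/example_nb"]

    # ensure() starts the process (or reuses a running one) and blocks until
    # the pattern is matched in its output or the timeout expires.
    xprocess.ensure("nb2service", Starter)
    yield "http://localhost:9393"

    # Terminate the process so a leftover instance does not keep holding the port.
    xprocess.getinfo("nb2service").terminate()
```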

@dsavchenko
Member

It was a transient issue. I'm not sure what the cause was, and I wasn't able to reproduce it locally. I restarted the CI job and it passed.

@burnout87
Contributor Author

OK, it did the same for me.

@volodymyrss
Member

It would be better to have more robust process starting/tracking behaviour to avoid this. Though I suspect that this particular issue will not happen in production since if the service is not starting, the pod will be recreated.

@volodymyrss
Member

Still, let's keep it open for tracking.

@dsavchenko
Member

> Though I suspect that this particular issue will not happen in production since if the service is not starting, the pod will be recreated.

Exactly, this mechanism is only used in tests.

@volodymyrss
Member

> Though I suspect that this particular issue will not happen in production since if the service is not starting, the pod will be recreated.

> Exactly, this mechanism is only used in tests

Well, it might be that the server is not starting for some reason; that would then be an issue for nb2workflow itself. A port already being in use is indeed a common issue; in the dispatcher I made a custom xprocess analogue which tries to deal with this.
I think we might be able to adapt xprocess to behave better, but let's leave it for now.
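
One simple way to sidestep the "port already in use" problem is to let the OS pick a free port before starting the test service. This is only an illustration, not the custom helper used in dispatcher-app.

```python
# Sketch: ask the OS for an unused port instead of hard-coding one.
import socket


def find_free_port() -> int:
    """Bind to port 0 so the OS assigns an unused port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


# The chosen port can then be passed to the service command line and to the
# test client, so a process left over from an earlier run cannot collide.
port = find_free_port()
```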

@burnout87
Contributor Author

@volodymyrss
Member

Looks like it happened again:

https://github.com/oda-hub/dispatcher-app/actions/runs/7130587275/job/19417305982

Production was down for some 15 minutes; is something there calling it?

@burnout87
Contributor Author

Now I noticed it; I've seen some crashes elsewhere as well.
