Remove the coordinator #3131
Conversation
Remove the policy coordinator and policy leader election mechanisms from fleet-server. Deprecate the coordinator_idx value in fleet-server's json schema and remove coordinator_idx references when processing policies.
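For context, a minimal sketch of what the deprecation amounts to in the policy model; the field names are assumptions based on the .fleet-policies documents, not the repository's actual generated code:

package model

import "encoding/json"

// Sketch only: the JSON schema keeps coordinator_idx so existing policy
// documents still unmarshal, but nothing reads or writes it anymore.
type Policy struct {
	PolicyID       string          `json:"policy_id"`
	RevisionIdx    int64           `json:"revision_idx"`
	CoordinatorIdx int64           `json:"coordinator_idx"` // Deprecated: ignored when processing policies
	Data           json.RawMessage `json:"data"`
}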
Force-pushed from f691382 to 28c2822 (Compare)
FleetPoliciesLeader = ".fleet-policies-leader"
FleetServers = ".fleet-servers"
I've removed the functions that create entries in .fleet-policies-leader as well as the one that writes to .fleet-servers. Is there any other component that uses the .fleet-servers entries?
We should create a follow-up issue to check and clean up this index in the next major version if it is not used.
@michel-laterman the Fleet status API relies on the .fleet-servers entries (https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/routes/setup/handlers.ts#L22), so we should change that in Kibana before going forward with this PR; otherwise the Fleet UI will be broken and users will not be able to add agents.
Do we want to change Kibana's behaviour as well, or change fleet-server to register itself somewhere else?
I think for this one it's probably okay to change the Kibana behavior; we can rely on retrieving agents with fleet-server installed instead.
This pull request is now in conflicts. Could you fix it @michel-laterman? 🙏
Changes LGTM.
Is there anything that can go wrong if there are multiple fleet servers, some of them on an older version with a coordinator?
If we have the scenario where an older version is running at the same time as a newer version, it will only provide policies where the …

@cmacknz, would this sequence present an issue to the agent?
Just to confirm, this only duplicates the policy change action and not other action types? If the policy is exactly the same I think this is fine; the agent should realize there are no changes to the running set of components and make no changes. I would recommend testing this quickly to confirm, though.

There is no actual requirement for independent Fleet Servers to run the same version, is there? In practice this is often the case, but there is nothing enforcing it. I am wondering if we could get into an edge case where the agent is continuously re-processing a policy change going between two Fleet Servers with different versions. Is this possible, or something that we need to guard against? Does Fleet Server itself tolerate the …
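For illustration, the revision comparison being relied on here might look roughly like this; it is a sketch of the idea under discussion, not fleet-server's actual dispatch code:

package main

import "fmt"

// shouldDispatch sketches the guard discussed above: a POLICY_CHANGE action is
// only emitted when the stored policy revision is newer than the revision the
// agent last acknowledged, so checking in against two Fleet Servers that hold
// the same policy revision does not generate duplicate work for the agent.
func shouldDispatch(ackedRevision, policyRevision int64) bool {
	return policyRevision > ackedRevision
}

func main() {
	fmt.Println(shouldDispatch(5, 5)) // false: same revision seen from a second Fleet Server
	fmt.Println(shouldDispatch(5, 6)) // true: a genuinely newer policy revision
}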
OK, I did some testing and removing the coordinator_idx value should not affect the agent. I tested the following sequence:

The agent behaved normally throughout.
This pull request is now in conflicts. Could you fix it @michel-laterman? 🙏
We need to hold off on merging this until the following Kibana issues are resolved:
This pull request is now in conflicts. Could you fix it @michel-laterman? 🙏
@michel-laterman does this one also close elastic/kibana#173538?
The v8.5 migration generated new output keys for an agent by incrementing the coordinator_idx for the policy, which forced the policy outputs to be prepared. The behaviour was changed to instead detect whether any of the agent's outputs has no api_key value (empty string) and subscribe with a revision of 0.
Force-pushed from af2552a to 76a2334 (Compare)
// use revision_idx=0 if the agent has an output where no API key is defined
// This will force the policy monitor to emit a new policy to regenerate API keys
revID := agent.PolicyRevisionIdx
for _, output := range agent.Outputs {
	if output.APIKey == "" {
		revID = 0
		break
	}
}
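As a usage illustration, the situation that triggers the fallback above could be reproduced with simplified stand-in types (these are not the repository's model types):

package main

import "fmt"

// output is a simplified stand-in for an agent output entry; the real code
// works with the generated model types.
type output struct{ APIKey string }

func main() {
	outputs := map[string]output{
		"default": {APIKey: ""}, // e.g. an agent migrated from v8.5 whose key was never generated
	}

	revID := int64(42) // would normally come from agent.PolicyRevisionIdx
	for _, o := range outputs {
		if o.APIKey == "" {
			revID = 0 // subscribe from revision 0 so the policy monitor re-emits the policy and regenerates API keys
			break
		}
	}
	fmt.Println("subscribe with revision_idx:", revID) // prints 0
}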
The coordinator incrementation in dl/migration.go has been removed; this should result in the same behaviour and removes the need for elastic/kibana#173538.
FYI elastic/kibana#173537 was recently completed and merged, so this should be unblocked now. Are there any BWC scenarios we need to consider?
I can't think of any.
@pierrehilbert I believe the control-plane and data-plane teams aren't up-to-date yet. Perhaps you'd like to change who is supposed to review this PR on behalf of the control-plane.
buildkite run perf-tests
I've re-validated this change;
At this point I'm certain that the change will function. However, I continued the experiment and once more halted the fleet-server, replaced it with one with the coordinator, and saw the agent was still healthy. I'm confident in the change and the testing done.
@michel-laterman It would be good to test an existing fleet server (with coordinator) + agents, and upgrading the fleet server to this version without the coordinator.
@jen-huang, that's what I tested
🤦 I misread some of the lines. Ty :)
LGTM 🚀
Note this will be deployed to serverless as soon as it passes the different quality gates. Will the migration from that version back to using the coordinator work in case we need to revert? (I think so, but just checking.)
I did a brief test where I stopped a fleet-server (with no coordinator) and replaced it with one with a coordinator, and the agent was able to become healthy in the UI.
What is the problem this PR solves?
The fleet-server coordinator and leader election mechanisms are no-ops that add unneeded complexity when trying to understand the codebase.
How does this PR solve the problem?
Remove the policy coordinator and policy leader election mechanisms from fleet-server. Deprecate the coordinator_idx value in fleet-server's json schema and remove coordinator_idx references when processing policies.
Design Checklist
I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added an entry in ./changelog/fragments using the changelog tool

Related Issues