-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to enroll with a specific ID #4290
Conversation
After chatting with @blakerouse over slack we should fail the enrollment of this "new" agent if the policy has temper protection enabled. |
Changes so far look good to me, waiting for updates relevant to @jlind23's comment above to approve |
model/openapi.yml
Outdated
If another agent is enrolled with the same ID the other agent will no longer be able to communicate, | ||
this new agent is considered a replacement of the other agent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty big change from our discussion.
Please also add a sentence saying the (replaced) agent will still be able to send data into ES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a slight change, because it is not possible to get an API key token again after the initial create. That made me have no choice to change the behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did add more as requested to the description of this field to inform that it will still allow data to flow.
See #4226 (comment) |
@jlind23 @michel-laterman @kaanyalti I have updated this PR based on the discuss I had with @jlind23 about security with this feature. This PR now includes an additional |
Except a small ending of a trace-span the code changes look good to me. I understand the potential pitfalls with this feature and definitely see how the |
@pkoutsovasilis @michalpristas I updated the PR with the request fixes. Thanks for the reviews. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for resolving comment. one question about behavior other than that I'm ok with the change.
i let you decide how you want to address the point i raised in this iteration
return nil, err | ||
} | ||
|
||
if agent.Id != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can it return agent.id == ""
? if not this if statement is not needed.
if so. we dont have this case handled as this should not be the same as empty ID when req.ID
is not used.
we whould probably not continue with empty id, probably we should fail. generating a new one breaks the purpose of providing it via req.id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It absolutely will return an agent.id == ""
. This happens when we check to see if an existing agent already exists with that ID. The ErrNotFound will not be returned from _checkAgent
, it will return nil
error and this will be `agent.id == "".
@cmacknz @pkoutsovasilis Thanks for catching this now for FIPS. I have updated the implementation to use PBKDF2 and SHA512 like the Elastic Agent. Let me know what you think. |
The ID of the agent. | ||
This is the ID that will be used to reference this agent, if no ID is passed one will be generated. | ||
If another agent is enrolled with the same ID the other agent will no longer be able to communicate, | ||
this new agent is considered a replacement of the other agent. The other agent will be able to continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other agent will be able to continue sending data to ES.
How do the ES API keys eventually get revoked? This normally happens through unenrolling an agent via the UI, but with replacement of the agent, you lose that.
Should replacing an agent revoke it's ES API keys to treat it like a force unenroll but triggered through fleet server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I maybe misunderstood what happens the first time, the ES API keys are tied to the agent ID so they are kept alive because they are reused.
So this is documenting that this feature only works as expected if the agent you are replacing stops running. Otherwise you will have two agents ingesting data but one of them is unmanaged and has had its Fleet API key revoked. Correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. It will be able to keep ingesting data, which we want for a short period, but will not be able to communicate with Fleet. Then its up to the deployer to turn it off when the other one is working.
This works for Kubernetes, as it will allow a new pod to spawn while the previous version is still running. Once the new version is up and running and healthy then the other pod will be stopped.
Str("AgentId", agent.Id). | ||
Str("APIKeyID", agent.AccessAPIKeyID). | ||
Msg("Invalidate old api key with same id") | ||
err = invalidateAPIKey(ctx, zlog, et.bulker, agent.AccessAPIKeyID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does any of the state in the .fleet-agents document need to be reset? For example the new agent is not going to have the same persisted action state or .fleet-action sequence number, and will not yet have a copy of the policy.
If we do any state resets there, it has to happen after the API key confirmed invalidated and not just when it is requested to avoid race conditons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up until this operation completes, the other agent can checkin and mutate .fleet-agents, if it is still actively running while this happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The .fleet-action
sequence number I do not reset, as I don't think we want this Elastic Agent to replay all actions of the previous Elastic Agent. If this is used outside of the container use-case, you would not expect it to roll through performing an upgrade action because it replaced a previous Elastic Agent.
I took this into account and didn't reset the sequence number for that exact reason. The policy and revision is not tied to the sequence number in anyway. It is actually stored directly on the .fleet-agents
document, which is reset by this code here - https://github.com/elastic/fleet-server/pull/4290/files#diff-b3d2d3a49ef214fba08e3aa10772162cbb65c378e24ef6c4a0c5df65b051f71aR378
|
||
// confirm that its on the same policy | ||
// it is not supported to have it the same ID enroll into different policies | ||
if agent.PolicyID != policyID { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an unlikely race condition where a user assigns the agent to a new policy as this replacement is happening. The Fleet server state isn't guaranteed to be current and we don't have any kind of lock or transaction in the index to help us avoid this type of problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A user could unenroll the agent as this function is executing to enroll it again with a new API key is another weird situation that could arise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of agentless, neither is possible. Outside of agentless, those are possible. We could keep checking the version of the document at each step (that seems like a lot), but could be a way to ensure that it is not be mutated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should just document the intended use case clearly for this, primarily containers or use inside Agentless where we control the order of operations.
We may end up using this for moving agents between fleet clusters, but this problem is also unlikely there as they are two independent fleet instances in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we really document specific flows in Fleet Server. Those flows I believe are normally document publicly as Elastic Agent documentation. Let me know where you think I should put this information and I can do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation LGTM, seems FIPS compliant AFAIK. You could use the Go 1.24 release candidate to double check this, it has a mode that causes non-FIPS crypto to return errors.
I think there are some interesting but unlikely concurrency problems that could be introduced here around:
- The agent you are replacing being in the process of checking in as you are replacing it. Possibly this gives it a chance to update .fleet-agents information on you. I'm not sure this matters.
- User actions on the UI side of the .fleet-agents index being in conflict with the re-enrollment here. The user unenrolling the agent you are replacing concurrently is probably the most interesting case.
- The new, replaced agent having it's action state reset to zero. Maybe this doesn't matter, but we should test it. Will the agent resetting it's action sequence number to 0 have any unexpected effects?
noting infosec security review discussion here #4226 |
This pull request is now in conflicts. Could you fix it @blakerouse? 🙏
|
Most of my responses to your inline comments are related here, but I will re-iterate.
Luckily the way Fleet Server works it only updates fields on a document and not a whole document. Because of this I am not worried as much of all the fields being reset to the wrong value for check-in. Looking at the code the following fields are updated on check-in: https://github.com/elastic/fleet-server/blob/main/internal/pkg/checkin/bulk.go#L227 I don't see any conflicts there that would cause an issue for the newly replaced Elastic Agent to be able to successfully check-in and update that information. The previous Elastic Agent that just checked-in will not be able from that point forward.
Not an issue in agentless, as it cannot be unenrolled. I do think outside of agentless there is a small window where this could get in a bad state. I think it is a very small window, but I don't know if the effort involved to prevent that window completely is worth it currently.
This is not reseting the action state, see my inline comment. It is just reseting the policy revision to ensure that on its first check-in that it gets it latest policy. I think if we solve this issue here, we don't even need to reset this here - elastic/elastic-agent#6446 |
|
Add ability to enroll and provide the agent id as well as a replace-token to allow an existing agent to be replaced by a new agent that has the same agent id. (cherry picked from commit 265bfbf) # Conflicts: # NOTICE.txt # go.mod
Add ability to enroll and provide the agent id as well as a replace-token to allow an existing agent to be replaced by a new agent that has the same agent id. (cherry picked from commit 265bfbf)
Add ability to enroll and provide the agent id as well as a replace-token to allow an existing agent to be replaced by a new agent that has the same agent id. (cherry picked from commit 265bfbf) Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
* Add ability to enroll with a specific ID (#4290) Add ability to enroll and provide the agent id as well as a replace-token to allow an existing agent to be replaced by a new agent that has the same agent id. (cherry picked from commit 265bfbf) --------- Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
What is the problem this PR solves?
This solves an issue where an Elastic Agent is being replaced with a new Elastic Agent instance for the same host, pod, or workload. This allows the enrolling Elastic Agent to tell the ID that it wants to use, that ID can be currently in-use and this enrollment will take over the record of that Elastic Agent. To take off the existing Elastic Agent both the original and the new enrollment must use the same
replace_token
during the enrollment. This ensures that the original enrollment informs Fleet Server that it can be replaced, and ensures that the replacement has the same token to perform the replacement.How does this PR solve the problem?
It solves the issue by taking a new
id
field in the enroll HTTP request. Thatid
is then used as the Elastic Agent ID and determines if this is a new Elastic Agent or if it should take over an existing Elastic Agent record.How to test this PR locally
At the moment the integration tests are the best way to test this, as the ability to use this field has not been exposed yet on the Elastic Agent.
Design Checklist
Checklist
[ ] I have made corresponding change to the default configuration files(no config changes)./changelog/fragments
using the changelog toolRelated issues