Add ability to enroll with a specific ID #4290

blakerouse · 2025-01-08T01:41:32Z

What is the problem this PR solves?

This solves an issue where an Elastic Agent is being replaced with a new Elastic Agent instance for the same host, pod, or workload. It allows the enrolling Elastic Agent to specify the ID that it wants to use. That ID can be currently in use, in which case this enrollment takes over the record of that Elastic Agent. To take over the existing Elastic Agent, both the original and the new enrollment must use the same replace_token during enrollment. This ensures that the original enrollment informs Fleet Server that it can be replaced, and that the replacement holds the same token needed to perform the replacement.

How does this PR solve the problem?

It solves the issue by taking a new id field in the enroll HTTP request. That id is then used as the Elastic Agent ID and determines if this is a new Elastic Agent or if it should take over an existing Elastic Agent record.

How to test this PR locally

At the moment the integration tests are the best way to test this, as the ability to use this field has not been exposed yet on the Elastic Agent.

Design Checklist

I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc. (already covered by enroll handle)

Checklist

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
~~[ ] I have made corresponding change to the default configuration files~~ (no config changes)
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Closes Add ability to provide the agent-id in enroll API #4226

jlind23 · 2025-01-08T14:38:43Z

After chatting with @blakerouse over Slack, we should fail the enrollment of this "new" agent if the policy has tamper protection enabled.

kaanyalti · 2025-01-08T16:18:22Z

Changes so far look good to me, waiting for updates relevant to @jlind23's comment above to approve

michel-laterman · 2025-01-08T16:26:57Z

model/openapi.yml

+            If another agent is enrolled with the same ID the other agent will no longer be able to communicate,
+            this new agent is considered a replacement of the other agent.


This is a pretty big change from our discussion.
Please also add a sentence saying the (replaced) agent will still be able to send data into ES

It is a slight change, because it is not possible to get an API key token again after the initial create. That left me no choice but to change the behavior.

I did add more as requested to the description of this field to inform that it will still allow data to flow.

jlind23 · 2025-01-08T20:56:02Z

Changes so far look good to me, waiting for updates relevant to @jlind23's comment above to approve

See #4226 (comment)

blakerouse · 2025-01-08T22:35:12Z

@jlind23 @michel-laterman @kaanyalti I have updated this PR based on the discussion I had with @jlind23 about security with this feature. This PR now includes an additional replace_token in the enrollment API. I have updated both the PR description and the API specification to describe it.

internal/pkg/api/handleEnroll.go

pkoutsovasilis · 2025-01-09T04:49:09Z

Except for a small issue with ending a trace span, the code changes look good to me. I understand the potential pitfalls with this feature and definitely see how the replace_token helps in minimising some of them, but to me this feature still serves only special-case scenarios and is not for streamlined usage 🙂

internal/pkg/api/handleEnroll.go

blakerouse · 2025-01-09T16:15:48Z

@pkoutsovasilis @michalpristas I updated the PR with the requested fixes. Thanks for the reviews.

internal/pkg/api/handleEnroll.go

pkoutsovasilis

LGTM

michalpristas

Thanks for resolving the comments. I have one question about behavior; other than that I'm OK with the change.
I'll let you decide how you want to address the point I raised in this iteration.

michalpristas · 2025-01-10T08:02:49Z

internal/pkg/api/handleEnroll.go

+			return nil, err
+		}
+
+		if agent.Id != "" {


Can it return agent.id == ""? If not, this if statement is not needed.
If so, we don't have this case handled, as it should not be treated the same as an empty ID when req.ID is not used.
We should probably not continue with an empty ID; we should probably fail. Generating a new one defeats the purpose of providing it via req.id.

It absolutely will return an agent.id == "". This happens when we check to see if an agent already exists with that ID. The ErrNotFound will not be returned from _checkAgent; it will return a nil error and the result will have agent.id == "".
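
The lookup behavior discussed in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the actual handler code: `findAgent`, `checkAgent`, `errNotFound`, and the in-memory map are hypothetical stand-ins for the real index lookup.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for the not-found error.
var errNotFound = errors.New("agent not found")

// Agent is a trimmed-down, hypothetical agent record.
type Agent struct {
	ID string
}

// findAgent stands in for the index lookup and reports a missing
// record with errNotFound.
func findAgent(store map[string]Agent, id string) (Agent, error) {
	a, ok := store[id]
	if !ok {
		return Agent{}, errNotFound
	}
	return a, nil
}

// checkAgent swallows the not-found error, so "no existing agent"
// surfaces as a zero-value Agent (ID == "") together with a nil error.
func checkAgent(store map[string]Agent, id string) (Agent, error) {
	a, err := findAgent(store, id)
	if errors.Is(err, errNotFound) {
		return Agent{}, nil
	}
	return a, err
}

func main() {
	store := map[string]Agent{"agent-1": {ID: "agent-1"}}

	existing, _ := checkAgent(store, "agent-1")
	missing, err := checkAgent(store, "agent-2")

	fmt.Println(existing.ID != "") // an existing record: take-over path
	fmt.Println(missing.ID == "", err == nil)
}
```

With this shape, the empty ID on the returned record signals "nothing to replace", which is different from the request itself carrying no ID.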

blakerouse · 2025-01-21T22:11:17Z

@cmacknz @pkoutsovasilis Thanks for catching this now for FIPS. I have updated the implementation to use PBKDF2 and SHA512 like the Elastic Agent. Let me know what you think.

internal/pkg/api/handleEnroll.go

internal/pkg/config/pbkdf2.go

cmacknz · 2025-01-31T21:39:24Z

model/openapi.yml

+            The ID of the agent.
+            This is the ID that will be used to reference this agent, if no ID is passed one will be generated.
+            If another agent is enrolled with the same ID the other agent will no longer be able to communicate,
+            this new agent is considered a replacement of the other agent. The other agent will be able to continue


The other agent will be able to continue sending data to ES.

How do the ES API keys eventually get revoked? This normally happens through unenrolling an agent via the UI, but with replacement of the agent, you lose that.

Should replacing an agent revoke its ES API keys to treat it like a force unenroll but triggered through fleet server?

I think I maybe misunderstood what happens the first time, the ES API keys are tied to the agent ID so they are kept alive because they are reused.

So this is documenting that this feature only works as expected if the agent you are replacing stops running. Otherwise you will have two agents ingesting data but one of them is unmanaged and has had its Fleet API key revoked. Correct?

That is correct. It will be able to keep ingesting data, which we want for a short period, but will not be able to communicate with Fleet. Then it's up to the deployer to turn it off when the other one is working.

This works for Kubernetes, as it will allow a new pod to spawn while the previous version is still running. Once the new version is up and running and healthy then the other pod will be stopped.

cmacknz · 2025-01-31T21:47:52Z

internal/pkg/api/handleEnroll.go

+				Str("AgentId", agent.Id).
+				Str("APIKeyID", agent.AccessAPIKeyID).
+				Msg("Invalidate old api key with same id")
+			err = invalidateAPIKey(ctx, zlog, et.bulker, agent.AccessAPIKeyID)


Does any of the state in the .fleet-agents document need to be reset? For example the new agent is not going to have the same persisted action state or .fleet-action sequence number, and will not yet have a copy of the policy.

If we do any state resets there, they have to happen after the API key is confirmed invalidated, and not just when invalidation is requested, to avoid race conditions.

Up until this operation completes, the other agent can checkin and mutate .fleet-agents, if it is still actively running while this happens.

The .fleet-action sequence number I do not reset, as I don't think we want this Elastic Agent to replay all actions of the previous Elastic Agent. If this is used outside of the container use-case, you would not expect it to roll through performing an upgrade action because it replaced a previous Elastic Agent.

I took this into account and didn't reset the sequence number for that exact reason. The policy and revision are not tied to the sequence number in any way. They are actually stored directly on the .fleet-agents document, which is reset by this code here - https://github.com/elastic/fleet-server/pull/4290/files#diff-b3d2d3a49ef214fba08e3aa10772162cbb65c378e24ef6c4a0c5df65b051f71aR378

cmacknz · 2025-01-31T21:49:43Z

internal/pkg/api/handleEnroll.go

+
+			// confirm that its on the same policy
+			// it is not supported to have it the same ID enroll into different policies
+			if agent.PolicyID != policyID {


There is an unlikely race condition where a user assigns the agent to a new policy as this replacement is happening. The Fleet server state isn't guaranteed to be current and we don't have any kind of lock or transaction in the index to help us avoid this type of problem.

Another weird situation that could arise is a user unenrolling the agent while this function is executing to enroll it again with a new API key.

In the case of agentless, neither is possible. Outside of agentless, both are possible. We could keep checking the version of the document at each step (that seems like a lot), but it could be a way to ensure that it is not being mutated.

We should just document the intended use case clearly for this, primarily containers or use inside Agentless where we control the order of operations.

We may end up using this for moving agents between fleet clusters, but this problem is also unlikely there as they are two independent fleet instances in that case.

I don't think we really document specific flows in Fleet Server. Those flows, I believe, are normally documented publicly as Elastic Agent documentation. Let me know where you think I should put this information and I can do that.

cmacknz

The implementation LGTM, seems FIPS compliant AFAIK. You could use the Go 1.24 release candidate to double check this, it has a mode that causes non-FIPS crypto to return errors.

I think there are some interesting but unlikely concurrency problems that could be introduced here around:

The agent you are replacing being in the process of checking in as you are replacing it. Possibly this gives it a chance to update .fleet-agents information on you. I'm not sure this matters.
User actions on the UI side of the .fleet-agents index being in conflict with the re-enrollment here. The user unenrolling the agent you are replacing concurrently is probably the most interesting case.
The new, replaced agent having its action state reset to zero. Maybe this doesn't matter, but we should test it. Will the agent resetting its action sequence number to 0 have any unexpected effects?

levinebw · 2025-02-04T20:16:51Z

noting infosec security review discussion here #4226

mergify · 2025-02-05T14:01:20Z

This pull request is now in conflicts. Could you fix it @blakerouse? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b enroll-with-id upstream/enroll-with-id
git merge upstream/main
git push upstream enroll-with-id

blakerouse · 2025-02-05T14:21:39Z

Most of my responses to your inline comments are related here, but I will re-iterate.

The agent you are replacing being in the process of checking in as you are replacing it. Possibly this gives it a chance to
update .fleet-agents information on you. I'm not sure this matters.

Luckily, the way Fleet Server works, it only updates individual fields on a document and not the whole document. Because of this, I am not as worried about all the fields being reset to the wrong value on check-in. Looking at the code, the following fields are updated on check-in:

https://github.com/elastic/fleet-server/blob/main/internal/pkg/checkin/bulk.go#L227

I don't see any conflicts there that would cause an issue for the newly replaced Elastic Agent to be able to successfully check in and update that information. The previous Elastic Agent that just checked in will not be able to check in from that point forward.
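
The field-level update argument can be illustrated with a short sketch. The field names and the `applyPartial` helper below are hypothetical and only mimic the idea of a partial document update; they are not the fields listed in checkin/bulk.go.

```go
package main

import "fmt"

// applyPartial merges only the given fields into the stored document
// and leaves every other field untouched, like a partial update.
func applyPartial(document, update map[string]any) {
	for k, v := range update {
		document[k] = v
	}
}

func main() {
	// Hypothetical agent document; field names are illustrative.
	agent := map[string]any{
		"access_api_key_id":   "old-key",
		"policy_revision":     7,
		"last_checkin_status": "online",
	}

	// The replacement enrollment rewrites the enrollment-related fields.
	applyPartial(agent, map[string]any{
		"access_api_key_id": "new-key",
		"policy_revision":   0,
	})

	// A late check-in from the old agent touches only check-in fields,
	// so it does not undo what the enrollment wrote.
	applyPartial(agent, map[string]any{"last_checkin_status": "degraded"})

	fmt.Println(agent["access_api_key_id"], agent["policy_revision"], agent["last_checkin_status"])
}
```

The argument only holds as long as the two writers touch disjoint sets of fields; an overlapping field would still be last-writer-wins.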

User actions on the UI side of the .fleet-agents index being in conflict with the re-enrollment here. The user unenrolling the agent you are replacing concurrently is probably the most interesting case.

Not an issue in agentless, as it cannot be unenrolled. I do think outside of agentless there is a small window where this could get in a bad state. I think it is a very small window, but I don't know if the effort involved to prevent that window completely is worth it currently.

The new, replaced agent having its action state reset to zero. Maybe this doesn't matter, but we should test it. Will the agent resetting its action sequence number to 0 have any unexpected effects?

This is not resetting the action state, see my inline comment. It is just resetting the policy revision to ensure that on its first check-in it gets its latest policy. I think if we solve this issue, we don't even need to reset this here - elastic/elastic-agent#6446

elastic-sonarqube · 2025-02-05T14:52:18Z

Quality Gate passed

Issues
2 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
79.2% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

Add ability to enroll and provide the agent id as well as a replace-token to allow an existing agent to be replaced by a new agent that has the same agent id. (cherry picked from commit 265bfbf) # Conflicts: # NOTICE.txt # go.mod


blakerouse added 2 commits January 7, 2025 20:34

Add ability to enroll and provide the agent id.

c489556

Fix lint.

01bc665

blakerouse added Team:Elastic-Agent-Control-Plane backport-8.x labels Jan 8, 2025

blakerouse self-assigned this Jan 8, 2025

blakerouse requested a review from a team as a code owner January 8, 2025 01:41

blakerouse requested review from pkoutsovasilis and kaanyalti January 8, 2025 01:41

Add changelog entry.

4e383f5

michel-laterman reviewed Jan 8, 2025

blakerouse added 4 commits January 8, 2025 17:14

Add replace_token.

1e0b16a

Add crypto dep.

7e78f58

More fixes.

7a8ef63

Fix lint.

0a59a2b

blakerouse mentioned this pull request Jan 8, 2025

Add ability to enroll with defined ID and replace_token elastic/elastic-agent#6498

Merged

5 tasks

pkoutsovasilis reviewed Jan 9, 2025

michalpristas added the enhancement label Jan 9, 2025

michalpristas reviewed Jan 9, 2025

Updates from code review.

fec4f23

pkoutsovasilis reviewed Jan 9, 2025

Use now variable.

8e43c86

pkoutsovasilis approved these changes Jan 9, 2025

michalpristas approved these changes Jan 10, 2025

kaanyalti approved these changes Jan 10, 2025

pkoutsovasilis reviewed Jan 22, 2025

pkoutsovasilis reviewed Jan 22, 2025

pkoutsovasilis reviewed Jan 22, 2025

Update from code review.

d76931f

pkoutsovasilis approved these changes Jan 22, 2025

cmacknz reviewed Jan 31, 2025

blakerouse mentioned this pull request Feb 5, 2025

Add ability to provide the agent-id in enroll API #4226

Closed

Merge branch 'main' into enroll-with-id

964b578

pierrehilbert added the backport-9.0 label Feb 5, 2025

cmacknz approved these changes Feb 5, 2025

blakerouse merged commit 265bfbf into elastic:main Feb 5, 2025
9 checks passed

blakerouse deleted the enroll-with-id branch February 5, 2025 16:09

This was referenced Feb 5, 2025

[8.x](backport #4290) Add ability to enroll with a specific ID #4416

Merged

[9.0](backport #4290) Add ability to enroll with a specific ID #4417

Merged

This was referenced Feb 11, 2025

[8.x](backport #6498) Add ability to enroll with defined ID and replace_token elastic/elastic-agent#6806

Merged

[9.0](backport #6498) Add ability to enroll with defined ID and replace_token elastic/elastic-agent#6807

Merged

cmacknz mentioned this pull request Mar 4, 2025

pbkdf2 settings validation is FIPS compliant #4542

Merged

3 tasks

michel-laterman mentioned this pull request Mar 6, 2025

[8.18](backport #4543) Update to go v1.24.0 #4552

Merged

1 task
