Add ability to provide the agent-id in enroll API #4226

Closed
blakerouse opened this issue Dec 17, 2024 · 12 comments · Fixed by #4290
Assignees
Labels
Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@blakerouse
Contributor

blakerouse commented Dec 17, 2024

Describe the enhancement:

Add the ability to specify the agent-id of the enrolling Elastic Agent.

Describe a specific use case for the enhancement or feature:

On serverless, an Elastic Agent is static but the pod doesn't have any persistent storage so it cannot store the enrollment information between restarts of the Elastic Agent. There have also been other reports of this issue from customers where they do not need persistent storage from the integration and requiring the Elastic Agent to have it just for the enrollment information is not possible.

To provide a stable Elastic Agent in the agents list in Kibana, this would allow an Elastic Agent to enroll with the ID they want to have. This would also replace an existing Elastic Agent if one already has the same ID.

The newly enrolled Elastic Agent will replace the previous Elastic Agent, preventing it from communicating with the Fleet Server any more.

Describe any security issues:

This does open the possibility that a bad actor who had both the enrollment token and the ID of an Elastic Agent could enroll over the top of it. That would cut off communication for the current Elastic Agent, because the other Elastic Agent would become the one communicating with the Fleet Server.

To prevent this, an additional replace-token would be added to the enrollment API. This would be any unique value, stored as a pbkdf2-sha512 hash on the Elastic Agent record. If an Elastic Agent is enrolled without this token, no other Elastic Agent is allowed to enroll with the same ID (trying to enroll with the same ID would error). If an Elastic Agent is enrolled with the replace token and it's the first enrollment, it would successfully enroll. On a second enrollment to replace the Elastic Agent, the exact same replace token must be provided; if it matches (compared using the pbkdf2-sha512 hash), it is considered a replacement of the Elastic Agent and the enrollment is allowed to complete.
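
The rules in the paragraph above can be sketched as a small decision function. This is only an illustration of the described behavior, not the Fleet Server implementation: the in-memory `store`, the `enroll` function, the `EnrollError` type, the per-record salt, and the low demo iteration count are all assumptions made for the example.

```python
import hashlib
import hmac
import os

ITERATIONS = 1_000  # assumption: kept low so the demo runs quickly


class EnrollError(Exception):
    """Raised when an enrollment with an existing agent ID is rejected."""


def _hash_token(token: str, salt: bytes) -> bytes:
    # PBKDF2 with HMAC-SHA512, the scheme named in the issue description.
    return hashlib.pbkdf2_hmac("sha512", token.encode(), salt, ITERATIONS)


def enroll(store: dict, agent_id: str, replace_token=None) -> str:
    """Apply the replace-token rules; `store` maps agent ID to a record (hypothetical)."""
    existing = store.get(agent_id)
    if existing is None:
        # First enrollment: remember the token hash only if a token was supplied.
        record = {"salt": None, "token_hash": None}
        if replace_token is not None:
            salt = os.urandom(16)
            record = {"salt": salt, "token_hash": _hash_token(replace_token, salt)}
        store[agent_id] = record
        return "enrolled"
    if existing["token_hash"] is None:
        # Enrolled without a token: nobody may enroll with this ID again.
        raise EnrollError("agent ID already enrolled without a replace token")
    if replace_token is None:
        raise EnrollError("replace token required to reuse this agent ID")
    candidate = _hash_token(replace_token, existing["salt"])
    if not hmac.compare_digest(candidate, existing["token_hash"]):
        raise EnrollError("replace token does not match")
    return "replaced"
```

In this sketch a first call with a token returns `"enrolled"`, a second call with the same ID and token returns `"replaced"`, and any mismatch or missing token raises `EnrollError`.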

@michel-laterman
Contributor

I think we already provide this through the enrollment_id in the API:

    enrollment_id:
      type: string
      description: |
        The enrollment ID of the agent.
        To replace an agent on enroll fail.
        The existing agent with a matching enrollment_id will be deleted if it never checked in. The new agent will be enrolled with the enrollment_id.

It was added with #2655

@blakerouse
Contributor Author

@michel-laterman "The existing agent with a matching enrollment_id will be deleted if it never checked in." What if it has checked in?

@kpollich kpollich assigned kpollich and unassigned kpollich Jan 7, 2025
@michel-laterman
Contributor

@blakerouse and I had a brief conversation about this.

We've decided to add an ID field to enrollment requests that is distinct from the existing enrollment_id value.
If this field is used and identifies an existing agent, that agent's current policy and existing API keys are used by the "new agent".
If the agent does not exist, it's treated as a new enrollment.
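
A minimal sketch of the ID-reuse branch described here, under stated assumptions: `AgentRecord`, the `enroll` signature, and the in-memory `agents` mapping are hypothetical stand-ins and do not reflect the actual Fleet Server request or storage types.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AgentRecord:
    agent_id: str
    policy_id: str
    api_keys: List[str]


def enroll(agents: Dict[str, AgentRecord], policy_id: str,
           requested_id: Optional[str] = None) -> AgentRecord:
    """Hypothetical enroll handler showing only the ID-reuse branch."""
    if requested_id is not None and requested_id in agents:
        # Existing agent: the "new agent" takes over its policy and API keys.
        return agents[requested_id]
    # Unknown or missing ID: treat the request as a brand-new enrollment.
    agent_id = requested_id if requested_id is not None else str(uuid.uuid4())
    record = AgentRecord(agent_id, policy_id, [uuid.uuid4().hex])
    agents[agent_id] = record
    return record
```

Enrolling twice with the same ID returns the original record, so the second caller ends up with the first caller's policy and keys rather than a fresh entry in the agents list.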

This is so that we don't break/get blocked on existing scale tests when delivering this feature; and as a follow up we can see if we can make the scale tests just use the new ID value and deprecate enrollment_id (cc @juliaElastic).

I've also looked a bit more into opamp for how it handles duplicate IDs. In short, this type of workflow (where we may have more than one pod that are "the same agent") isn't supported.
This is fairly clear from the implications of the duplicate WebSocket connections section.

When sending a message, an agent can specify its own instance_uid value or request one from the server.
The server can also force agents to use a new instance_uid value at any time.

Additionally AgentToServer messages are expected to be sequential (indicated by sequence_num) as a mechanism for detecting missed messages.

Supporting this workflow is something we'll need to handle once we start supporting opamp.
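
To illustrate why sequential `sequence_num` values conflict with several pods acting as "the same agent", here is a rough sketch of a gap check. It is not taken from any OpAMP implementation; `SequenceTracker` and its `observe` method are hypothetical names, and the check is simplified to "each message must be exactly one higher than the last".

```python
from typing import Dict


class SequenceTracker:
    """Tracks the last sequence_num seen per instance_uid (hypothetical helper)."""

    def __init__(self) -> None:
        self._last: Dict[str, int] = {}

    def observe(self, instance_uid: str, sequence_num: int) -> bool:
        """Return True if the message follows directly after the previous one."""
        previous = self._last.get(instance_uid)
        self._last[instance_uid] = sequence_num
        # The first message from an instance has nothing to compare against.
        return previous is None or sequence_num == previous + 1
```

If two pods share one `instance_uid` but each keeps its own counter, their interleaved messages would look like missed messages to a check of this kind.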

@jlind23
Contributor

jlind23 commented Jan 8, 2025

@nimarezainia do you think this is something we could piggyback on in order to migrate an Agent from one cluster to another using the same ID in the enroll command?

@blakerouse
Contributor Author

@jlind23 After our discussion of the security implications, I have added a section to the description about the additional agent-token API option for enrollment. Hopefully this implementation would alleviate those implications.

@elastic/product-security Could you give the security implications a review?

@jlind23
Contributor

jlind23 commented Jan 16, 2025

@levinebw @jkakavas could you please take a look at this as per Blake's comment above?

@jlind23
Contributor

jlind23 commented Jan 21, 2025

@levinebw @jkakavas Were you able to spend some time on this?

@blakerouse
Contributor Author

Updated description to change from bcrypt to pbkdf2-sha512, so it would be FIPS compliant.

@jkakavas
Member

jkakavas commented Feb 4, 2025

Hi @blakerouse,

I am trying to wrap my head around this (generally a flow diagram or some kind of design doc works wonders for a review 🙏), can you validate my understanding?

  • The general use case is that we want to allow an agent to enroll to a Fleet Server as the same agent.
  • This is problematic because we don't have persistent storage on the agent side in all use cases, so we can't ensure that we can keep track of the same enrollment token. Subsequently (after a restart or other cases) an agent will have a new enrollment token, but we'd still want to treat it as the old agent.
  • For that the first thought was to use an Agent ID to differentiate between agents. The risk associated with that is that if a malicious actor can guess the Agent ID, they can enroll their agent instance and disable the legitimate agent with that Agent ID.

Now, we propose that we introduce a replace-token that needs to be sent along with the initial enrollment, stored server side and then must be resent every time an agent wants to reuse an Agent ID.

How is the Agent ID calculated or persisted on the pod? Can we make this impractical to guess instead of introducing a new token to be generated and stored?
How will the replace-token be stored or calculated client side so that it can be reused by the new agent instance with the same Agent ID if there is no persistent storage?
If it can be persisted somehow so that the new instance can use it, why not persist the enrollment token in the first place?

There is a high chance I am missing something, so happy to get some more info here or sync in a zoom chat about this or read something more detailed to get up to speed.

@blakerouse
Contributor Author

> Hi @blakerouse,
>
> I am trying to wrap my head around this (generally a flow diagram or some kind of design doc works wonders for a review 🙏), can you validate my understanding?
>
> • The general use case is that we want to allow an agent to enroll to a Fleet Server as the same agent.

Correct.

> • This is problematic because we don't have persistent storage on the agent side in all use cases, so we can't ensure that we can keep track of the same enrollment token. Subsequently (after a restart or other cases) an agent will have a new enrollment token, but we'd still want to treat it as the old agent.

An enrollment token is like an authorization to enroll; it can be shared by many Elastic Agents. Currently you are assigned a new Agent ID at enrollment, and that is the issue at hand. We need to be able to specify at enrollment time that we want a specific Agent ID, which will overwrite an existing Agent.

It is an issue of persistent storage where we lose the saved credentials to continue using the same Agent ID.

> • For that the first thought was to use an Agent ID to differentiate between agents. The risk associated with that is that if a malicious actor can guess the Agent ID, they can enroll their agent instance and disable the legitimate agent with that Agent ID.

Correct.

> Now, we propose that we introduce a replace-token that needs to be sent along with the initial enrollment, stored server side and then must be resent every time an agent wants to reuse an Agent ID.
>
> How is the Agent ID calculated or persisted on the pod? Can we make this impractical to guess instead of introducing a new token to be generated and stored? How will the replace-token be stored or calculated client side so that it can be reused by the new agent instance with the same Agent ID if there is no persistent storage? If it can be persisted somehow so that the new instance can use it, why not persist the enrollment token in the first place?

We could make the Agent ID hard to guess, but Agent IDs are shown in the Fleet UI. If you have access to the Fleet UI, you can easily get this ID. It will generate a unique replace-token and store it in a Kubernetes secret; Kubernetes will be the persistent storage in that case. This unique replace-token is not visible to the user in the Fleet UI, and even if you view the .fleet-agents document you would only see a hash of the replace-token, so you could not reverse engineer the token. You need to know the token itself.

At the moment the Elastic Agent doesn't have a way of saving all of its enrollment information as a Kubernetes secret, and we don't want to give Elastic Agent any credentials for communicating with the Kubernetes API to store this information, as that would be worse for security.

> There is a high chance I am missing something, so happy to get some more info here or sync in a zoom chat about this or read something more detailed to get up to speed.

I don't think you are missing anything, hopefully my answers above provide more clarity.

@jkakavas
Member

jkakavas commented Feb 5, 2025

@blakerouse Thank you. I had trouble reconciling these two:

> On serverless, an Elastic Agent is static but the pod doesn't have any persistent storage so it cannot store the enrollment information between restarts of the Elastic Agent.

and

> It will generate a unique replace-token and store that in a secret in Kubernetes, Kubernetes will be the persistent storage in that case.

but I think I get it now. We can persist secrets but we cannot persist all the information we need for an agent to be enrolled (policies, etc.), so it has to go through enrollment again after restart. The enrollment token is not tied to a specific agent, so we can't store that instead.

> If you have access to the Fleet UI then you would easily be able to get this ID.

Is someone who legitimately has access to the Fleet UI, someone we need to protect against ? Couldn't they disengage an agent ? Or is it more granular that there are viewers who can see agents and their IDs but have no more control over these agents and these policies ?

In general, I don't have any qualms with the design that includes the replace-token as long as we generate the token on the agent in a secure manner ( i.e. https://pkg.go.dev/crypto/rand ) and we store it securely server-side. PBKDF2 sounds proper as long as we select a high enough number of iterations and a proper hash function ( SHA512 ticks both security and compliance boxes )
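
As a rough illustration of generating the token "in a secure manner", the sketch below draws it from the operating system's CSPRNG. Python's `secrets` module plays the role here that `crypto/rand` plays in Go. The function name `new_replace_token` and the 32-byte length are assumptions made for the example, not values from the thread.

```python
import secrets


def new_replace_token(num_bytes: int = 32) -> str:
    """Return a URL-safe random token drawn from the OS CSPRNG."""
    return secrets.token_urlsafe(num_bytes)
```

The resulting string is what would be kept in persistent storage (for example a Kubernetes secret, as discussed above) and presented again on every re-enrollment.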

@blakerouse
Contributor Author

> @blakerouse Thank you. I had trouble reconciling these two:
>
> On serverless, an Elastic Agent is static but the pod doesn't have any persistent storage so it cannot store the enrollment information between restarts of the Elastic Agent.
>
> and
>
> It will generate a unique replace-token and store that in a secret in Kubernetes, Kubernetes will be the persistent storage in that case.
>
> but I think I get it now. We can persist secrets but we cannot persist all the information we need for an agent to be enrolled (policies, etc.), so it has to go through enrollment again after restart. The enrollment token is not tied to a specific agent, so we can't store that instead.

Correct.

> If you have access to the Fleet UI then you would easily be able to get this ID.

Correct, it is in the URL used to view the Agent and is rendered on the page.

> Is someone who legitimately has access to the Fleet UI, someone we need to protect against ? Couldn't they disengage an agent ? Or is it more granular that there are viewers who can see agents and their IDs but have no more control over these agents and these policies ?

We do protect against customers performing actions on serverless agents in Kibana, like unenroll.

> In general, I don't have any qualms with the design that includes the replace-token as long as we generate the token on the agent in a secure manner ( i.e. https://pkg.go.dev/crypto/rand ) and we store it securely server-side. PBKDF2 sounds proper as long as we select a high enough number of iterations and a proper hash function ( SHA512 ticks both security and compliance boxes )

#4290

The PR uses pbkdf2-sha512 with 210,000 iterations (which is the OWASP 2023 recommendation) - https://cheatsheetseries.owasp.org/cheatsheets/Password_Storage_Cheat_Sheet.html#pbkdf2
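
For readers who want to see what hashing with these parameters looks like, here is a minimal sketch using Python's standard library. It is not the code from the PR: the function names, the random per-record salt, and the `iterations$salt$hash` storage format are assumptions for illustration; only the algorithm (PBKDF2 with HMAC-SHA512) and the 210,000 iteration count come from the thread.

```python
import hashlib
import hmac
import os

ITERATIONS = 210_000  # figure quoted in the thread for PBKDF2-HMAC-SHA512


def hash_replace_token(token: str) -> str:
    """Return 'iterations$salt_hex$hash_hex' for storage (format is hypothetical)."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha512", token.encode(), salt, ITERATIONS)
    return f"{ITERATIONS}${salt.hex()}${digest.hex()}"


def verify_replace_token(token: str, stored: str) -> bool:
    """Recompute the hash with the stored parameters and compare in constant time."""
    iterations, salt_hex, hash_hex = stored.split("$")
    digest = hashlib.pbkdf2_hmac(
        "sha512", token.encode(), bytes.fromhex(salt_hex), int(iterations)
    )
    return hmac.compare_digest(digest, bytes.fromhex(hash_hex))
```

Storing the iteration count next to the hash is a common design choice because it lets the count be raised later without invalidating existing records.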


5 participants