Track and report upgrade details #3119

ycombinator · 2023-07-25T00:25:40Z

Describe the enhancement:

Elastic Agent should track details about an ongoing upgrade and report them to Fleet Server.

Describe a specific use case for the enhancement or feature:

To understand where upgrades fail or get stuck without progressing.

What is the definition of done?

Elastic Agent is able to track the various upgrade states (defined below) representing details about an ongoing upgrade and correctly transition between these states.
Elastic Agent is able to report a change in an upgrade state to Fleet Server via a new upgrade_details field in the check-in API.
For backwards compatibility, Elastic Agent continues to acknowledge the UPGRADE action with Fleet Server, via the acknowledgements API, as before.

Details

In the future, Elastic Agents will send details about an ongoing upgrade to Fleet Server. These details will be communicated via a new upgrade_details field in check-in API requests.

The proposed structure of the upgrade_details field is:

{
  "upgrade_details": { // new field; present when upgrade is in progress
    "target_version": "8.12.0", // version being upgraded to; always present
    "action_id": "xxxxxxxx", // ID of the UPGRADE action
    "state": "UPG_*",
    "metadata": {
      "scheduled_at": "2023-08-09T10:11:12Z", // when state == "UPG_SCHEDULED"
      "download_percent": 16.4, // when state == "UPG_DOWNLOADING"
      "failed_state": "UPG_*" // when state == "UPG_FAILED"
      "error_msg": "" // when state == "UPG_FAILED"
    }
  }
}

Where upgrade_details.state is expected to hold one of the following values:

State	Meaning
`UPG_REQUESTED`	Upgrade requested by user
`UPG_SCHEDULED`	Upgrade scheduled for <date/time>
`UPG_DOWNLOADING`	Downloading new Agent artifact version
`UPG_EXTRACTING`	Extracting new Agent artifact version
`UPG_REPLACING`	Replacing old Agent artifact version with new one version
`UPG_RESTARTING`	Starting new Agent version
`UPG_WATCHING`	Monitoring new Agent version
`UPG_ROLLBACK`	Upgrade unsuccessful; rolling back to Agent version
`UPG_FAILED`	Upgrade failed due to error from state

The possible transitions between these states are:

The various points in the upgrade workflow at which these state transitions would occur are:

sequenceDiagram
    actor U as User
    participant UI as Fleet UI
    participant ES
    participant FS as Fleet Server
    participant A as Agent
    participant UW as Upgrade Watcher
    participant UM as Upgrader Marker

    U->>UI: Initiate upgrade
    UI->>ES: Update Agent doc in `.fleet-agents`<br />set `upgrade_started_at`
    UI->>UI: Show Agent status as "updating"
    UI->>ES: Create new doc in `.fleet-actions` for `UPGRADE` action
    A->>FS: Check-in request
    FS->>ES: Read pending actions from .fleet-actions
    FS->>A: Check-in response
    A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_REQUESTED`
    A->>A: Queue upgrade action
    alt If upgrade is scheduled for future
        A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_SCHEDULED`
    end
    alt If upgrade start fails
       A->>FS: Ack failed upgrade
       A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_ERROR` with `upgrade_details.failed_state_id` = `UPG_REQUESTED` or `UPG_SCHEDULED`
       FS->>ES: Update Agent doc in `.fleet-agents`<br />set `upgrade_status` = "failed"
       FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
       UI-->>UI: Agent status = "upgrade failed"
       UI->>UI: Agent status remains as "updating" (bug) (fallback)
    else
       opt If previous upgrades found
          A->>FS: Ack previous upgrades
          A->>A: Remove previous upgrades from queue
       end
       A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_DOWNLOADING`
       A->>A: Download new Agent artifact
       opt If download fails
          A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_ERROR` with `upgrade_details.failed_state_id` = `UPG_DOWNLOADING`
          FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
          UI-->>UI: Agent status = "upgrade failed"
       end
       A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_EXTRACTING`
       A->>A: Extract new Agent artifact
       opt If extraction fails
          A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_ERROR` with `upgrade_details.failed_state_id` = `UPG_EXTRACTING`
          FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
          UI-->>UI: Agent status = "upgrade failed"
       end
       A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_REPLACING`
       A->>A: Replace current Agent artifact with new one
       opt If extraction fails
          A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_ERROR` with `upgrade_details.failed_state_id` = `UPG_REPLACING`
          FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
          UI-->>UI: Agent status = "upgrade failed"
       end
       A->>UM: Create
       A->>UW: Start
       A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_RESTARTING`
       A->>A: Rexec to start new Agent artifact
       opt If restart fails
          A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_ERROR` with `upgrade_details.failed_state_id` = `UPG_RESTARTING`
          FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
          UI-->>UI: Agent status = "upgrade failed"
       end
       A-->>FS: Check-in request: `upgrade_details.state_id` = `UPG_WATCHING`
       UI-->>UI: Show Agent status as "upgrade watching"
       UW->>UW: Watch new Agent
   end
   opt Success
       A->>FS: Ack successful upgrade
       FS->>ES: Write successful ack in `.fleet-actions-results`
       FS->>ES: Update Agent doc in `.fleet-agents`<br />set `upgrade_status` = null<br />`upgraded_at` = <now><br />`upgrade_started_at` = null
       UI->>UI: Show Agent status as "healthy" (fallback)
       UW->>UM: Remove
       A-->>UM: Watch for removal
       A-->>FS: Check-in request: remove `upgrade_details` field
       FS-->>ES: Update Agent doc in `.fleet-agents`<br />remove `upgrade_details` field
       UI-->>UI: Show Agent status as "healthy"
   end
   opt Rollback
       UW->>UM: Write `upgrade_details.state_id` = `UPG_ROLLBACK`
       UW->>A: Start
       A-->>UM: Read `upgrade_details.state_id`
       A-->>FS: Check-in request: `upgrade_details.state_id` = value from UM
       A->>FS: Ack failed upgrade
       FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
       FS->>ES: Update Agent doc in `.fleet-agents`<br />set `upgrade_status` = null<br />`upgraded_at = <now> (fallback)
       UI->>UI: Show Agent status as "healthy" (fallback)
       UW->>UM: Remove
       A-->>UM: Watch for removal
       A-->>FS: Check-in request: remove `upgrade_details` field
       FS-->>ES: Update Agent doc in `.fleet-agents`<br />remove `upgrade_details` field
       UI-->>UI: Show Agent status as "healthy"
   end

The text was updated successfully, but these errors were encountered:

elasticmachine · 2023-07-25T06:11:52Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

ycombinator · 2023-10-02T19:35:21Z

I'm going to start implementing this enhancement now. I'm breaking it down into the following sub-tasks with the idea of each sub-task resulting in a single PR, to make it easier to review and also track incremental progress.

~~When in Fleet-managed mode, whenever there is a state transition, call check-in API call to Fleet, with upgrade details in new upgrade_details field.~~ See Track and report upgrade details #3119 (comment)
Track current upgrade details within Agent for all upgrade states (except UPG_SCHEDULED and UPG_ROLLBACK; see follow-up items below): Track upgrade details #3527
Include the current upgrade details in the output of elastic-agent status: Include upgrade details in output of elastic-agent status during ongoing upgrade #3615
Include the current upgrade details in the output of elastic-agent diagnostics and diagnostics requests from Fleet: Include upgrade details in diagnostics bundle during ongoing upgrade #3624
Follow up: Track UPG_ROLLBACK state. This is being done as a follow up as it requires data to be exchanged between the Upgrade Watcher process and the "main" Elastic Agent process: [Upgrade Details] Implement tracking of UPG_ROLLBACK state #3694
Follow up: When in Fleet-managed mode, track UPG_SCHEDULED state. This is being done as a follow up as the scheduling happens in the Fleet Dispatcher code, which is a separate component from the Upgrader where most of the rest of the upgrade states are tracked: [Upgrade Details] Implement tracking of UPG_SCHEDULED state #3757
When in Fleet-managed mode, include the current upgrade details in next call to check-in API via upgrade_details field: Send upgrade details to Fleet Server in check-in API requests #3528
Regression testing: ensure that Elastic Agent continues to acknowledge the UPGRADE action with Fleet Server via the acknowledgements API.
Cleanup: generate UPG_* constants for use in Agent code from corresponding Fleet OpenAPI definition. See also Track upgrade details #3527 (comment).

cmacknz · 2023-10-03T14:44:08Z

When in Fleet-managed mode, whenever there is a state transition, call check-in API call to Fleet, with upgrade details in new upgrade_details field.

The impact of the increased check ins will need to go through scale testing, so you'll need to update Horde or perhaps take on some of #2169.

I would suggest you make this possible but don't actually change the check in frequency so we can release this without impacting the system stability, and increase the checkin frequency separately.

## Summary Closes elastic/ingest-dev#1937 This PR implements the UI side of the new Elastic Agent upgrade states. Note: the following changes will be present regardless of whether Elastic Agent has upgrade details: - Wider `Version` column - `Upgrade available` text is now a badge ### Screenshots <img width="1903" alt="Screenshot 2023-10-05 at 14 23 46" src="https://github.com/elastic/kibana/assets/23701614/6d24b9d6-2561-4018-b8b0-9582095804bc"> <img width="382" alt="Screenshot 2023-10-05 at 11 36 08" src="https://github.com/elastic/kibana/assets/23701614/ed550127-4c03-423d-9a7c-dace1211f67d"> <img width="314" alt="Screenshot 2023-10-05 at 17 09 03" src="https://github.com/elastic/kibana/assets/23701614/80d18cf1-31d4-4969-acf3-e6afd39aee92"> <img width="382" alt="Screenshot 2023-10-05 at 11 36 23" src="https://github.com/elastic/kibana/assets/23701614/2a8257d7-6f8d-40be-a629-1634ed15f054"> <img width="382" alt="Screenshot 2023-10-05 at 11 36 30" src="https://github.com/elastic/kibana/assets/23701614/a4fef046-4bcd-4c99-a51e-ce994f9c6565"> <img width="382" alt="Screenshot 2023-10-05 at 11 36 40" src="https://github.com/elastic/kibana/assets/23701614/501a1b6d-d41b-448a-9d37-5987717d81d1"> <img width="382" alt="Screenshot 2023-10-05 at 11 36 47" src="https://github.com/elastic/kibana/assets/23701614/6cdb7f42-2cb3-4861-b6e5-86267602c2fa"> <img width="382" alt="Screenshot 2023-10-05 at 11 36 53" src="https://github.com/elastic/kibana/assets/23701614/9a04421c-4b8a-437d-8bb9-114ece3bea6a"> <img width="382" alt="Screenshot 2023-10-05 at 11 36 58" src="https://github.com/elastic/kibana/assets/23701614/1bb1b362-e30a-4985-877f-9908c778acf0"> <img width="382" alt="Screenshot 2023-10-05 at 11 37 05" src="https://github.com/elastic/kibana/assets/23701614/66d0bf65-67f2-432b-a3a8-7dd5773368a0"> <img width="1196" alt="Screenshot 2023-10-05 at 11 37 35" src="https://github.com/elastic/kibana/assets/23701614/353f5942-e74f-468a-9430-895b7dbe19e5"> ### Steps to reproduce Note: the [Elastic Agent changes](elastic/elastic-agent#3119) are not ready yet, so the proposed testing steps aim to mock the upgrade states by editing the agent document(s) manually. 1. Run Kibana on this branch. 2. Enroll at least one agent (or more to test more quickly). 3. Create a "super duper user" in order to edit the agent document(s) (see below). Use this user to edit your agent document(s) and add upgrade details (see below for examples). ⚠️ These upgrade details won't stay for very long, so make sure to check the UI immediately. 4. In Fleet UI, check that the UI correctly reflects the upgrade state (badge and tooltip). 5. Check that an upgrading agent with no upgrade details correctly gets a tooltip informing that the upgrade details are not available. This can be mocked by editing the agent document again and setting `upgrade_started_at` to some timestamp and `upgraded_at` to `null`. #### How to create a "super duper user" In dev tools, run the following two commands: ``` PUT _security/role/super_duper_user { "cluster" : [ "all" ], "indices" : [ { "names" : [ "*" ], "privileges" : [ "all" ], "field_security" : { "grant" : [ "*" ], "except" : [ ] }, "allow_restricted_indices" : true } ], "applications" : [ ], "run_as" : [ ], "metadata" : { }, "transient_metadata" : { "enabled" : true } } PUT _security/user/strong_user { "roles": [ "super_duper_user", "superuser" ], "full_name": "Super Duper User", "email": "super@elastic.co", "password": "changeme", "metadata": {}, "enabled": true } ``` #### Adding upgrade details to an agent Example commands: ``` POST .fleet-agents/_update/<agent_id> { "doc": { "upgrade_details": { "target_version": "8.11", "action_id": "xxxxxxxx", "state": "UPG_SCHEDULED", "metadata": { "scheduled_at": "2023-10-04T16:34:12Z" // edit this to a better time to check that the number of hours in the tooltip message is correct } } } } ``` ``` POST .fleet-agents/_update/<agent_id> "doc": { "upgrade_details": { "target_version": "8.11", "action_id": "xxxxxxxx", "state": "UPG_DOWNLOADING", "metadata": { "download_percent": 16.4 } } } } ``` ``` POST .fleet-agents/_update/<agent_id> { "doc": { "upgrade_details": { "target_version": "8.11", "action_id": "xxxxxxxx", "state": "UPG_FAILED", "metadata": { "failed_state": "UPG_DOWNLOADING", "error_msg": "Something went BOOM" } } } } ``` ### Checklist - [ ] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] This renders correctly on smaller devices using a responsive layout. (You can test this [in your browser](https://www.browserstack.com/guide/responsive-testing-on-local-server)) - [ ] This was checked for [cross-browser compatibility](https://www.elastic.co/support/matrix#matrix_browsers) --------- Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

blakerouse · 2023-12-04T18:45:29Z

Tracking and reporting upgrade details is completed in 8.12.

amitkanfer · 2023-12-05T07:56:47Z

well done team! Happy to see this getting closed. 🚀

ycombinator added the enhancement New feature or request label Jul 25, 2023

pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Jul 25, 2023

pierrehilbert assigned ycombinator Jul 28, 2023

ycombinator mentioned this issue Sep 29, 2023

Quick multiple upgrades while agent is still in grace period break agent installation #2706

Closed

ycombinator mentioned this issue Oct 3, 2023

Adding FSM for upgrades #3501

Closed

7 tasks

jillguyonnet mentioned this issue Oct 4, 2023

[Fleet] Elastic Agent upgrade states UI elastic/kibana#167539

Merged

5 tasks

This was referenced Oct 5, 2023

Track upgrade details #3527

Merged

Send upgrade details to Fleet Server in check-in API requests #3528

Merged

[Fleet UI] Forbid Agent from being upgraded if upgrade is currently in progress elastic/kibana#168171

Closed

ycombinator mentioned this issue Oct 17, 2023

Include upgrade details in output of elastic-agent status during ongoing upgrade #3615

Merged

7 tasks

ycombinator mentioned this issue Oct 18, 2023

Include upgrade details in diagnostics bundle during ongoing upgrade #3624

Merged

7 tasks

jillguyonnet mentioned this issue Oct 30, 2023

Document agent detailed upgrade states elastic/ingest-docs#596

Merged

ycombinator mentioned this issue Nov 2, 2023

[Upgrade Details] Implement tracking of UPG_ROLLBACK state #3694

Merged

7 tasks

This was referenced Nov 13, 2023

[Upgrade Details] Implement tracking of UPG_SCHEDULED state #3757

Merged

[Upgrade Details] Track failed download attempts #3769

Closed

jillguyonnet mentioned this issue Nov 16, 2023

[Fleet] Update restart agent upgrade action to use upgrade details elastic/kibana#171419

Closed

pierrehilbert added QA:Ready For Testing Code is merged and ready for QA to validate QA:Needs Validation Needs validation by the QA Team labels Nov 20, 2023

blakerouse assigned AndersonQ and unassigned ycombinator Nov 27, 2023

blakerouse closed this as completed Dec 4, 2023

amolnater-qasource removed the QA:Ready For Testing Code is merged and ready for QA to validate label May 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Track and report upgrade details #3119

Track and report upgrade details #3119

ycombinator commented Jul 25, 2023 •

edited

Loading

elasticmachine commented Jul 25, 2023

ycombinator commented Oct 2, 2023 •

edited by blakerouse

Loading

cmacknz commented Oct 3, 2023

blakerouse commented Dec 4, 2023

amitkanfer commented Dec 5, 2023

Track and report upgrade details #3119

Track and report upgrade details #3119

Comments

ycombinator commented Jul 25, 2023 • edited Loading

elasticmachine commented Jul 25, 2023

ycombinator commented Oct 2, 2023 • edited by blakerouse Loading

cmacknz commented Oct 3, 2023

blakerouse commented Dec 4, 2023

amitkanfer commented Dec 5, 2023

ycombinator commented Jul 25, 2023 •

edited

Loading

ycombinator commented Oct 2, 2023 •

edited by blakerouse

Loading