Track and report upgrade details #3119

Closed
3 tasks done
ycombinator opened this issue Jul 25, 2023 · 5 comments
Assignees
Labels
enhancement New feature or request QA:Needs Validation Needs validation by the QA Team Team:Elastic-Agent Label for the Agent team

Comments

@ycombinator
Contributor

ycombinator commented Jul 25, 2023

Describe the enhancement:

Elastic Agent should track details about an ongoing upgrade and report them to Fleet Server.

Describe a specific use case for the enhancement or feature:

To understand where upgrades fail or get stuck without progressing.

What is the definition of done?

  • Elastic Agent is able to track the various upgrade states (defined below) representing details about an ongoing upgrade and correctly transition between these states.
  • Elastic Agent is able to report a change in an upgrade state to Fleet Server via a new upgrade_details field in the check-in API.
  • For backwards compatibility, Elastic Agent continues to acknowledge the UPGRADE action with Fleet Server, via the acknowledgements API, as before.

Details

In the future, Elastic Agents will send details about an ongoing upgrade to Fleet Server. These details will be communicated via a new upgrade_details field in check-in API requests.

The proposed structure of the upgrade_details field is:

{
  "upgrade_details": { // new field; present when upgrade is in progress
    "target_version": "8.12.0", // version being upgraded to; always present
    "action_id": "xxxxxxxx", // ID of the UPGRADE action
    "state": "UPG_*",
    "metadata": {
      "scheduled_at": "2023-08-09T10:11:12Z", // when state == "UPG_SCHEDULED"
      "download_percent": 16.4, // when state == "UPG_DOWNLOADING"
      "failed_state": "UPG_*", // when state == "UPG_FAILED"
      "error_msg": "" // when state == "UPG_FAILED"
    }
  }
}
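
As a rough illustration of the proposed structure, the sketch below builds an `upgrade_details` payload and checks that the state-specific metadata is present. The helper name `build_upgrade_details` and the decision to treat those metadata keys as required are assumptions made for illustration; the proposal only names the fields and the states they apply to.

```python
import json

# Metadata keys the proposal ties to particular states (from the comments in the
# structure above). Treating them as required is an assumption of this sketch.
REQUIRED_METADATA = {
    "UPG_SCHEDULED": ("scheduled_at",),
    "UPG_DOWNLOADING": ("download_percent",),
    "UPG_FAILED": ("failed_state", "error_msg"),
}


def build_upgrade_details(target_version, action_id, state, metadata=None):
    """Return a dict shaped like the proposed check-in field (hypothetical helper)."""
    metadata = dict(metadata or {})
    missing = [key for key in REQUIRED_METADATA.get(state, ()) if key not in metadata]
    if missing:
        raise ValueError(f"state {state} is missing metadata: {missing}")
    details = {"target_version": target_version, "action_id": action_id, "state": state}
    if metadata:
        details["metadata"] = metadata
    return {"upgrade_details": details}


payload = build_upgrade_details(
    "8.12.0", "xxxxxxxx", "UPG_DOWNLOADING", {"download_percent": 16.4}
)
print(json.dumps(payload, indent=2))
```

Unlike the annotated structure above, the printed output is strict JSON without comments.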

Where upgrade_details.state is expected to hold one of the following values:

| State | Meaning |
| --- | --- |
| UPG_REQUESTED | Upgrade requested by user |
| UPG_SCHEDULED | Upgrade scheduled for <date/time> |
| UPG_DOWNLOADING | Downloading new Agent artifact version |
| UPG_EXTRACTING | Extracting new Agent artifact version |
| UPG_REPLACING | Replacing old Agent artifact version with new one |
| UPG_RESTARTING | Starting new Agent version |
| UPG_WATCHING | Monitoring new Agent version |
| UPG_ROLLBACK | Upgrade unsuccessful; rolling back to previous Agent version |
| UPG_FAILED | Upgrade failed due to an error in an earlier state (reported in `failed_state`) |
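
The states listed above map naturally onto an enumeration. The following is a minimal sketch, assuming a hypothetical `UpgradeState` type and `parse_state` helper; neither name comes from the Agent or Fleet Server code.

```python
from enum import Enum


class UpgradeState(str, Enum):
    """Values that upgrade_details.state is expected to hold."""

    REQUESTED = "UPG_REQUESTED"
    SCHEDULED = "UPG_SCHEDULED"
    DOWNLOADING = "UPG_DOWNLOADING"
    EXTRACTING = "UPG_EXTRACTING"
    REPLACING = "UPG_REPLACING"
    RESTARTING = "UPG_RESTARTING"
    WATCHING = "UPG_WATCHING"
    ROLLBACK = "UPG_ROLLBACK"
    FAILED = "UPG_FAILED"


def parse_state(value):
    """Return the matching UpgradeState; raises ValueError for unknown values."""
    return UpgradeState(value)


print(parse_state("UPG_WATCHING").name)
```

Parsing incoming values through the enumeration rejects any state name that is not in the list.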

The possible transitions between these states are:

[Diagram: upgrade_states (state transition diagram; image not included)]

The various points in the upgrade workflow at which these state transitions would occur are:

sequenceDiagram
    actor U as User
    participant UI as Fleet UI
    participant ES
    participant FS as Fleet Server
    participant A as Agent
    participant UW as Upgrade Watcher
    participant UM as Upgrade Marker

    U->>UI: Initiate upgrade
    UI->>ES: Update Agent doc in `.fleet-agents`<br />set `upgrade_started_at`
    UI->>UI: Show Agent status as "updating"
    UI->>ES: Create new doc in `.fleet-actions` for `UPGRADE` action
    A->>FS: Check-in request
    FS->>ES: Read pending actions from .fleet-actions
    FS->>A: Check-in response
    A-->>FS: Check-in request: `upgrade_details.state` = `UPG_REQUESTED`
    A->>A: Queue upgrade action
    alt If upgrade is scheduled for future
        A-->>FS: Check-in request: `upgrade_details.state` = `UPG_SCHEDULED`
    end
    alt If upgrade start fails
       A->>FS: Ack failed upgrade
       A-->>FS: Check-in request: `upgrade_details.state` = `UPG_FAILED` with `upgrade_details.metadata.failed_state` = `UPG_REQUESTED` or `UPG_SCHEDULED`
       FS->>ES: Update Agent doc in `.fleet-agents`<br />set `upgrade_status` = "failed"
       FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
       UI-->>UI: Agent status = "upgrade failed"
       UI->>UI: Agent status remains as "updating" (bug) (fallback)
    else
       opt If previous upgrades found
          A->>FS: Ack previous upgrades
          A->>A: Remove previous upgrades from queue
       end
       A-->>FS: Check-in request: `upgrade_details.state` = `UPG_DOWNLOADING`
       A->>A: Download new Agent artifact
       opt If download fails
          A-->>FS: Check-in request: `upgrade_details.state` = `UPG_FAILED` with `upgrade_details.metadata.failed_state` = `UPG_DOWNLOADING`
          FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
          UI-->>UI: Agent status = "upgrade failed"
       end
       A-->>FS: Check-in request: `upgrade_details.state` = `UPG_EXTRACTING`
       A->>A: Extract new Agent artifact
       opt If extraction fails
          A-->>FS: Check-in request: `upgrade_details.state` = `UPG_FAILED` with `upgrade_details.metadata.failed_state` = `UPG_EXTRACTING`
          FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
          UI-->>UI: Agent status = "upgrade failed"
       end
       A-->>FS: Check-in request: `upgrade_details.state` = `UPG_REPLACING`
       A->>A: Replace current Agent artifact with new one
       opt If replacement fails
          A-->>FS: Check-in request: `upgrade_details.state` = `UPG_FAILED` with `upgrade_details.metadata.failed_state` = `UPG_REPLACING`
          FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
          UI-->>UI: Agent status = "upgrade failed"
       end
       A->>UM: Create
       A->>UW: Start
       A-->>FS: Check-in request: `upgrade_details.state` = `UPG_RESTARTING`
       A->>A: Rexec to start new Agent artifact
       opt If restart fails
          A-->>FS: Check-in request: `upgrade_details.state` = `UPG_FAILED` with `upgrade_details.metadata.failed_state` = `UPG_RESTARTING`
          FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
          UI-->>UI: Agent status = "upgrade failed"
       end
       A-->>FS: Check-in request: `upgrade_details.state` = `UPG_WATCHING`
       UI-->>UI: Show Agent status as "upgrade watching"
       UW->>UW: Watch new Agent
   end
   opt Success
       A->>FS: Ack successful upgrade
       FS->>ES: Write successful ack in `.fleet-actions-results`
       FS->>ES: Update Agent doc in `.fleet-agents`<br />set `upgrade_status` = null<br />`upgraded_at` = <now><br />`upgrade_started_at` = null
       UI->>UI: Show Agent status as "healthy" (fallback)
       UW->>UM: Remove
       A-->>UM: Watch for removal
       A-->>FS: Check-in request: remove `upgrade_details` field
       FS-->>ES: Update Agent doc in `.fleet-agents`<br />remove `upgrade_details` field
       UI-->>UI: Show Agent status as "healthy"
   end
   opt Rollback
       UW->>UM: Write `upgrade_details.state` = `UPG_ROLLBACK`
       UW->>A: Start
       A-->>UM: Read `upgrade_details.state`
       A-->>FS: Check-in request: `upgrade_details.state` = value from UM
       A->>FS: Ack failed upgrade
       FS-->>ES: Set `upgrade_details` = value of `upgrade_details` from check-in request
       FS->>ES: Update Agent doc in `.fleet-agents`<br />set `upgrade_status` = null<br />`upgraded_at` = <now> (fallback)
       UI->>UI: Show Agent status as "healthy" (fallback)
       UW->>UM: Remove
       A-->>UM: Watch for removal
       A-->>FS: Check-in request: remove `upgrade_details` field
       FS-->>ES: Update Agent doc in `.fleet-agents`<br />remove `upgrade_details` field
       UI-->>UI: Show Agent status as "healthy"
   end
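
The ordering in the sequence diagram can be sketched as a small state tracker. The transition table below is inferred from the diagram only and uses the state names from the state list; the class name `UpgradeTracker` is hypothetical. Treat it as an assumption, not as the Agent's actual implementation.

```python
# Forward order of the states as they appear in the sequence diagram.
ORDER = [
    "UPG_REQUESTED", "UPG_SCHEDULED", "UPG_DOWNLOADING", "UPG_EXTRACTING",
    "UPG_REPLACING", "UPG_RESTARTING", "UPG_WATCHING",
]

# Assumed transition table: each state may advance to the next one or fail.
ALLOWED = {state: {ORDER[i + 1], "UPG_FAILED"} for i, state in enumerate(ORDER[:-1])}
ALLOWED["UPG_REQUESTED"].add("UPG_DOWNLOADING")  # unscheduled upgrades skip UPG_SCHEDULED
ALLOWED["UPG_WATCHING"] = {"UPG_ROLLBACK"}       # the watcher may trigger a rollback


class UpgradeTracker:
    """Tracks the current upgrade state and remembers where a failure happened."""

    def __init__(self):
        self.state = "UPG_REQUESTED"
        self.failed_state = None

    def transition(self, new_state):
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        if new_state == "UPG_FAILED":
            self.failed_state = self.state  # reported as metadata.failed_state
        self.state = new_state


tracker = UpgradeTracker()
for step in ("UPG_DOWNLOADING", "UPG_EXTRACTING", "UPG_FAILED"):
    tracker.transition(step)
print(tracker.state, tracker.failed_state)
```

In this sketch `UPG_FAILED` and `UPG_ROLLBACK` have no outgoing transitions, so any further call to `transition` raises an error.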
@ycombinator ycombinator added the enhancement New feature or request label Jul 25, 2023
@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Jul 25, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@ycombinator
Contributor Author

ycombinator commented Oct 2, 2023

I'm going to start implementing this enhancement now. I'm breaking it down into the following sub-tasks with the idea of each sub-task resulting in a single PR, to make it easier to review and also track incremental progress.

@cmacknz
Member

cmacknz commented Oct 3, 2023

When in Fleet-managed mode, whenever there is a state transition, call check-in API call to Fleet, with upgrade details in new upgrade_details field.

The impact of the increased check-ins will need to go through scale testing, so you'll need to update Horde or perhaps take on some of #2169.

I would suggest you make this possible but don't actually change the check-in frequency, so we can release this without impacting system stability, and increase the check-in frequency separately.
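
One way to read this suggestion: state transitions only update the details that the next regularly scheduled check-in will carry, and never trigger a check-in themselves. A minimal sketch follows, with hypothetical names (`UpgradeDetailsHolder`, `build_checkin_body`) and a simplified request body that is not Fleet Server's real schema.

```python
import threading


class UpgradeDetailsHolder:
    """Stores the latest upgrade details; the upgrader writes, the check-in loop reads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._details = None

    def set(self, details):
        with self._lock:
            self._details = dict(details)

    def clear(self):
        with self._lock:
            self._details = None

    def snapshot(self):
        with self._lock:
            return dict(self._details) if self._details is not None else None


def build_checkin_body(holder, status="online"):
    """Build the body for the next periodic check-in (simplified, assumed shape)."""
    body = {"status": status}
    details = holder.snapshot()
    if details is not None:
        body["upgrade_details"] = details  # field is omitted when no upgrade is running
    return body


holder = UpgradeDetailsHolder()
holder.set({"target_version": "8.12.0", "action_id": "xxxxxxxx", "state": "UPG_DOWNLOADING"})
holder.set({"target_version": "8.12.0", "action_id": "xxxxxxxx", "state": "UPG_EXTRACTING"})
print(build_checkin_body(holder))
```

The trade-off is that a state which starts and ends between two check-ins is never reported; only the most recent state is sent.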

kpollich pushed a commit to elastic/kibana that referenced this issue Oct 17, 2023
## Summary

Closes elastic/ingest-dev#1937

This PR implements the UI side of the new Elastic Agent upgrade states.

Note: the following changes will be present regardless of whether
Elastic Agent has upgrade details:
- Wider `Version` column
- `Upgrade available` text is now a badge

### Screenshots

<img width="1903" alt="Screenshot 2023-10-05 at 14 23 46"
src="https://github.com/elastic/kibana/assets/23701614/6d24b9d6-2561-4018-b8b0-9582095804bc">

<img width="382" alt="Screenshot 2023-10-05 at 11 36 08"
src="https://github.com/elastic/kibana/assets/23701614/ed550127-4c03-423d-9a7c-dace1211f67d">

<img width="314" alt="Screenshot 2023-10-05 at 17 09 03"
src="https://github.com/elastic/kibana/assets/23701614/80d18cf1-31d4-4969-acf3-e6afd39aee92">

<img width="382" alt="Screenshot 2023-10-05 at 11 36 23"
src="https://github.com/elastic/kibana/assets/23701614/2a8257d7-6f8d-40be-a629-1634ed15f054">

<img width="382" alt="Screenshot 2023-10-05 at 11 36 30"
src="https://github.com/elastic/kibana/assets/23701614/a4fef046-4bcd-4c99-a51e-ce994f9c6565">

<img width="382" alt="Screenshot 2023-10-05 at 11 36 40"
src="https://github.com/elastic/kibana/assets/23701614/501a1b6d-d41b-448a-9d37-5987717d81d1">

<img width="382" alt="Screenshot 2023-10-05 at 11 36 47"
src="https://github.com/elastic/kibana/assets/23701614/6cdb7f42-2cb3-4861-b6e5-86267602c2fa">

<img width="382" alt="Screenshot 2023-10-05 at 11 36 53"
src="https://github.com/elastic/kibana/assets/23701614/9a04421c-4b8a-437d-8bb9-114ece3bea6a">

<img width="382" alt="Screenshot 2023-10-05 at 11 36 58"
src="https://github.com/elastic/kibana/assets/23701614/1bb1b362-e30a-4985-877f-9908c778acf0">

<img width="382" alt="Screenshot 2023-10-05 at 11 37 05"
src="https://github.com/elastic/kibana/assets/23701614/66d0bf65-67f2-432b-a3a8-7dd5773368a0">

<img width="1196" alt="Screenshot 2023-10-05 at 11 37 35"
src="https://github.com/elastic/kibana/assets/23701614/353f5942-e74f-468a-9430-895b7dbe19e5">



### Steps to reproduce

Note: the [Elastic Agent
changes](elastic/elastic-agent#3119) are not
ready yet, so the proposed testing steps aim to mock the upgrade states
by editing the agent document(s) manually.

1. Run Kibana on this branch.
2. Enroll at least one agent (or more to test more quickly).
3. Create a "super duper user" in order to edit the agent document(s)
(see below). Use this user to edit your agent document(s) and add
upgrade details (see below for examples). ⚠️ These upgrade details won't
stay for very long, so make sure to check the UI immediately.
4. In Fleet UI, check that the UI correctly reflects the upgrade state
(badge and tooltip).
5. Check that an upgrading agent with no upgrade details correctly gets
a tooltip informing that the upgrade details are not available. This can
be mocked by editing the agent document again and setting
`upgrade_started_at` to some timestamp and `upgraded_at` to `null`.
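
The condition mocked in step 5 can be sketched as a small predicate over an agent document. The function name and the exact rule are assumptions based on the mocking instructions above, not the actual Kibana logic.

```python
def is_upgrading_without_details(agent_doc):
    """True when the document looks like an in-progress upgrade with no upgrade_details."""
    started = agent_doc.get("upgrade_started_at")
    upgraded = agent_doc.get("upgraded_at")
    has_details = bool(agent_doc.get("upgrade_details"))
    return bool(started) and upgraded is None and not has_details


mocked = {"upgrade_started_at": "2023-10-05T11:36:08Z", "upgraded_at": None}
print(is_upgrading_without_details(mocked))
```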

#### How to create a "super duper user"

In dev tools, run the following two commands:
```
PUT _security/role/super_duper_user
{
  "cluster" : [
    "all"
  ],
  "indices" : [
    {
      "names" : [
        "*"
      ],
      "privileges" : [
        "all"
      ],
      "field_security" : {
        "grant" : [
          "*"
        ],
        "except" : [ ]
      },
      "allow_restricted_indices" : true
    }
  ],
  "applications" : [ ],
  "run_as" : [ ],
  "metadata" : { },
  "transient_metadata" : {
    "enabled" : true
  }
}

PUT _security/user/strong_user
{
  "roles": [
    "super_duper_user",
    "superuser"
  ],
  "full_name": "Super Duper User",
  "email": "super@elastic.co",
  "password": "changeme",
  "metadata": {},
  "enabled": true
}
```

#### Adding upgrade details to an agent

Example commands:

```
POST .fleet-agents/_update/<agent_id>
{
  "doc": {
    "upgrade_details": {
      "target_version": "8.11",
      "action_id": "xxxxxxxx",
      "state": "UPG_SCHEDULED",
      "metadata": {
        "scheduled_at": "2023-10-04T16:34:12Z" // edit this to a better time to check that the number of hours in the tooltip message is correct
      }
    }
  }
}
```
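
To pick a `scheduled_at` value that yields a predictable tooltip, the remaining time can be computed as sketched below. How the UI rounds the number of hours is not stated here, so the rounding up used in this sketch is an assumption, and `hours_until` is a hypothetical helper.

```python
import math
from datetime import datetime, timezone


def hours_until(scheduled_at, now):
    """Whole hours (rounded up, never negative) from `now` until `scheduled_at`."""
    scheduled = datetime.strptime(scheduled_at, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    seconds = (scheduled - now).total_seconds()
    return max(0, math.ceil(seconds / 3600))


# Fixed "now" so the result is reproducible; use datetime.now(timezone.utc) in practice.
now = datetime(2023, 10, 4, 14, 0, 0, tzinfo=timezone.utc)
print(hours_until("2023-10-04T16:34:12Z", now))
```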

```
POST .fleet-agents/_update/<agent_id>
{
  "doc": {
    "upgrade_details": {
      "target_version": "8.11",
      "action_id": "xxxxxxxx",
      "state": "UPG_DOWNLOADING",
      "metadata": {
        "download_percent": 16.4
      }
    }
  }
}
```

```
POST .fleet-agents/_update/<agent_id>
{
  "doc": {
    "upgrade_details": {
      "target_version": "8.11",
      "action_id": "xxxxxxxx",
      "state": "UPG_FAILED",
      "metadata": {
        "failed_state": "UPG_DOWNLOADING",
        "error_msg": "Something went BOOM"
      }
    }
  }
}
```

### Checklist

- [ ] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] This renders correctly on smaller devices using a responsive
layout. (You can test this [in your
browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser
compatibility](https://www.elastic.co/support/matrix#matrix_browsers)

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@pierrehilbert pierrehilbert added QA:Ready For Testing Code is merged and ready for QA to validate QA:Needs Validation Needs validation by the QA Team labels Nov 20, 2023
@blakerouse blakerouse assigned AndersonQ and unassigned ycombinator Nov 27, 2023
@blakerouse
Contributor

Tracking and reporting upgrade details is completed in 8.12.

@amitkanfer
Contributor

well done team! Happy to see this getting closed. 🚀

@amolnater-qasource amolnater-qasource removed the QA:Ready For Testing Code is merged and ready for QA to validate label May 14, 2024