
fix process upgrade details when null #3264

Merged
8 commits merged into elastic:main on Feb 12, 2024

Conversation

@juliaElastic (Contributor) commented Feb 9, 2024

What is the problem this PR solves?

The agent was never upgradeable again after an upgrade to 8.12.0.

How does this PR solve the problem?

When upgrade_details: null came from the agent doc, the logic incorrectly assumed it had a value and set upgraded_at: now. This resulted in the agent never being upgradeable again in the UI (an agent is not allowed to upgrade again within 10 minutes of an upgrade).
Updated the schema so that the null value is correctly translated to nil, and the condition in handleCheckin can check for nil.
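Conceptually, the fix amounts to something like this minimal Go sketch (illustrative only, not the actual fleet-server code; the type's fields are stand-ins, not the exact schema-generated ones):

package main

import (
	"encoding/json"
	"fmt"
)

// UpgradeDetails stands in for the schema-generated type; these fields
// are illustrative.
type UpgradeDetails struct {
	TargetVersion string `json:"target_version"`
	State         string `json:"state"`
}

type agentDoc struct {
	// A pointer lets "upgrade_details": null decode to nil instead of
	// a zero-valued struct that looks like real data.
	UpgradeDetails *UpgradeDetails `json:"upgrade_details"`
}

func main() {
	var doc agentDoc
	if err := json.Unmarshal([]byte(`{"upgrade_details": null}`), &doc); err != nil {
		panic(err)
	}
	// Analogous to the nil check in handleCheckin: only act on upgrade
	// details when they are actually present.
	if doc.UpgradeDetails == nil {
		fmt.Println("upgrade_details is null; do not set upgraded_at")
	} else {
		fmt.Println("upgrade_details present:", doc.UpgradeDetails.State)
	}
}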

How to test this PR locally

  • Start a stack on version 8.12.1 or later.
  • Enroll an agent on 8.11.4 or older.
  • Upgrade the agent to 8.12.0.
  • Wait 10 minutes so that the watcher period ends.
  • Expect the agent to be upgradeable again, to 8.12.1 (an example upgrade call is sketched after this list).
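For the upgrade steps, the agent can be upgraded from the Fleet UI, or with the single-agent upgrade API along these lines (a sketch; the host, credentials, and agent ID are placeholders to adjust):

curl -sk -XPOST -u elastic:changeme \
  -H 'kbn-xsrf: true' -H 'content-type: application/json' \
  http://localhost:5601/api/fleet/agents/<agent-id>/upgrade \
  -d '{"version": "8.12.0"}'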

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Closes #3263

@juliaElastic juliaElastic added the bug Something isn't working label Feb 9, 2024
@juliaElastic juliaElastic self-assigned this Feb 9, 2024
@juliaElastic juliaElastic requested a review from a team as a code owner February 9, 2024 12:12
@juliaElastic juliaElastic added the backport-v8.12.0 Automated backport with mergify label Feb 9, 2024
@michel-laterman (Contributor) left a comment


Would we need to add this as a known issue to some of the 8.12.x releases?
Is there a work-around for customers?

@@ -497,3 +497,7 @@ type ToRetireAPIKeyIdsItems struct {
 	// Date/time the API key was retired
 	RetiredAt string `json:"retired_at,omitempty"`
 }
+
+// UpgradeDetails Additional upgrade status details.
+type UpgradeDetails struct {
Review comment from a Member:

Should we use the UpgradeDetails model from the agent-lib here, or is it something different?

@juliaElastic (Contributor Author) replied:

It is the same, though I'm not sure how to reference it from schema.json. @michel-laterman, do you know?

@juliaElastic (Contributor Author) commented:

Would we need to add this as a known issue to some of the 8.12.x releases? Is there a work-around for customers?

How can we add this to the known issues?
I guess a workaround is to either force-upgrade those agents, or manually delete the upgrade_details: null value from the agent docs.

@kpollich (Member) commented Feb 9, 2024

How can we add to the known issues?

We can file an issue to ingest-docs summarizing the issue and @kilfoyle should be able to help (providing he has some time of course!) with getting a known issue into the release notes retroactively.

I guess a workaround is to either force upgrade those agents

This sounds good to me as a recommendation. Force upgrading is hopefully not overly cumbersome for users, and should be fairly straightforward for us to document as an API request if I understand correctly?

@juliaElastic (Contributor Author) commented:

We can file an issue to ingest-docs summarizing the issue and @kilfoyle should be able to help (providing he has some time of course!) with getting a known issue into the release notes retroactively.

Thanks, raised elastic/ingest-docs#907

@lucabelluccini (Contributor) commented Feb 9, 2024

Hello, as suggested by @michel-laterman:

@juliaElastic (Contributor Author) commented:

Updated the doc request with an example of how to use the force flag with the bulk_upgrade API.
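For reference, a bulk upgrade request with the force flag looks roughly like this (host, credentials, and agent IDs are placeholders):

curl -sk -XPOST -u elastic:changeme \
  -H 'kbn-xsrf: true' -H 'content-type: application/json' \
  http://localhost:5601/api/fleet/agents/bulk_upgrade \
  -d '{"agents": ["<agent-id-1>", "<agent-id-2>"], "version": "8.12.1", "force": true}'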

@juliaElastic juliaElastic enabled auto-merge (squash) February 9, 2024 15:51
@juliaElastic juliaElastic disabled auto-merge February 9, 2024 15:51
@juliaElastic (Contributor Author) commented:

Unfortunately it looks like the force flag doesn't work as a workaround here; the 10-minute validation applies even then, so we have to come up with another workaround.
I'm thinking of an update_by_query command that would remove upgrade_details: null values from the agent docs, unless there is a better idea.

@lucabelluccini (Contributor) commented:

Would the update by query persist until the next check-in?
Also, we would need to be super careful with the filtering section of the update by query; the risk of breaking other agents is not negligible.

@juliaElastic (Contributor Author) commented Feb 9, 2024

Would the update by query persist until the next check-in? Also, we would need to be super careful with the filtering section of the update by query; the risk of breaking other agents is not negligible.

upgrade_details is not changed on check-in, only when another upgrade is attempted. That shouldn't be a problem in this case: if the agents are upgraded to 8.12.1, they won't need another upgrade until 8.12.2, which will hopefully include the fleet-server fix.

I've put together the steps below, which update those docs where upgrade_details is null or not defined (the exists query does not match explicit null values, so the must_not clause catches both cases). They have to be run as a superuser.
I tested locally that updating the docs makes the logic stop updating the upgraded_at field, so the agent should be available for upgrade after 10 minutes. After waiting 10 minutes, the agent indeed shows an upgrade available again.

POST _security/role/fleet_superuser
{
  "indices": [
    {
      "names": [".fleet*", ".kibana*"],
      "privileges": ["all"],
      "allow_restricted_indices": true
    }
  ]
}

POST _security/user/fleet_superuser
{
  "password": "password",
  "roles": ["superuser", "fleet_superuser"]
}

curl -sk -XPOST --user fleet_superuser:password -H 'content-type:application/json' \
  -H 'x-elastic-product-origin:fleet' \
  http://localhost:9200/.fleet-agents/_update_by_query \
  -d '{
  "script": {
    "source": "ctx._source.remove(\"upgrade_details\")",
    "lang": "painless"
  },
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "upgrade_details"
        }
      }
    }
  }
}'

DELETE _security/user/fleet_superuser
DELETE _security/role/fleet_superuser

// Wait at least 10 minutes, then try to upgrade the agent again; it should be available in the UI.

@lucabelluccini
Copy link
Contributor

Thank you @juliaElastic


@juliaElastic merged commit 6980022 into elastic:main on Feb 12, 2024
mergify bot pushed a commit that referenced this pull request Feb 12, 2024
* fix process upgrade details when null

* fix test

* added changelog

* updated schema to let null be converted to nil automatically

* refactor to reduce complexity

(cherry picked from commit 6980022)
juliaElastic added a commit that referenced this pull request Feb 12, 2024
* fix process upgrade details when null

* fix test

* added changelog

* updated schema to let null be converted to nil automatically

* refactor to reduce complexity

(cherry picked from commit 6980022)

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>

Labels
backport-v8.12.0 Automated backport with mergify
bug Something isn't working

Successfully merging this pull request may close these issues:
  • Agent upgraded_at field keeps updating to current time (#3263)