Feature/multi loader logs collection #598

Open
wants to merge 16 commits into
base: main

nosnelmil
Contributor

Summary

Extends multi-loader to collect key logs and metrics from cluster nodes on the Knative platform. Users can optionally collect any of the following:

  • TOP (resource usage metrics)
  • Prometheus snapshots
  • Logs from the Activator Pod
  • Logs from the Autoscaler Pod
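
As a rough illustration of how such an opt-in selection could be checked, the sketch below validates a parsed config against the four values named under Implementation Notes (top, prometheus, activator, autoscaler). It is not the code from this PR: the function name `validate_metrics`, the dict-based config, and the exact-match, case-sensitive comparison are all assumptions made for the example.

```python
# Hypothetical sketch, not the PR's implementation.
# The allowed values are taken from the PR description; everything else
# (function name, dict-shaped config, case-sensitive matching) is assumed.
ALLOWED_METRICS = {"top", "prometheus", "activator", "autoscaler"}


def validate_metrics(config):
    """Return the requested metrics, or raise ValueError on invalid input."""
    metrics = config.get("Metrics", [])  # the field is optional: default to none
    if not isinstance(metrics, list):
        raise ValueError("Metrics must be an array")
    unknown = [
        m for m in metrics
        if not isinstance(m, str) or m not in ALLOWED_METRICS
    ]
    if unknown:
        raise ValueError(f"unsupported metrics: {unknown}")
    return metrics
```

Rejecting unknown values up front means a typo in the config surfaces before any experiment starts rather than as a missing log afterwards.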

Implementation Notes ⚒️

  • Added a Metrics field to the multi-loader config. It accepts an array containing any of the following values: top, prometheus, activator, autoscaler.
  • Introduced the optional fields MasterNode, ActivatorNode, AutoscalerNode, and WorkerNodes, which let users specify node IPs manually instead of relying on multi-loader to determine them (rarely needed in typical setups).
  • Uses kubectl to automatically determine node IPs and classify them based on their roles.
  • Resets TOP metrics for all nodes before starting any experiment.
  • Collects Activator Pod logs from:
    /var/log/pods/knative-serving_activator-*/activator/*
  • Collects Autoscaler Pod logs from:
    /var/log/pods/knative-serving_autoscaler-*/autoscaler/*
  • Copies Prometheus snapshots by first triggering a snapshot via the Prometheus API on the master node and then retrieving the generated snapshot.
  • The log collection logic also runs during a multi-loader dry run to:
    • Validate that identified IPs are reachable.
    • Ensure SSH access and necessary permissions.
    • Execute log retrieval commands to detect potential errors.
    • Delete any collected logs after validation, as no experiments have been executed yet.
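
The kubectl-based node discovery mentioned above could look roughly like the following. This is a minimal sketch under stated assumptions, not the PR's code: it assumes the input is the JSON printed by `kubectl get nodes -o json`, that control-plane nodes carry one of the standard `node-role.kubernetes.io/...` labels, and that the `InternalIP` address is the one to connect to. The function names are hypothetical, and locating the Activator and Autoscaler nodes is not covered.

```python
import json

# Standard Kubernetes role labels; the second one is used by older releases.
CONTROL_PLANE_LABELS = (
    "node-role.kubernetes.io/control-plane",
    "node-role.kubernetes.io/master",
)


def internal_ip(node):
    """Return the node's InternalIP address, or None if it has none."""
    for addr in node.get("status", {}).get("addresses", []):
        if addr.get("type") == "InternalIP":
            return addr.get("address")
    return None


def classify_nodes(kubectl_json):
    """Split nodes into a master IP and a list of worker IPs.

    The result keys mirror the optional config fields named in this PR
    (MasterNode, WorkerNodes); the mapping itself is an assumption.
    """
    groups = {"MasterNode": None, "WorkerNodes": []}
    for node in json.loads(kubectl_json).get("items", []):
        ip = internal_ip(node)
        if ip is None:
            continue  # skip nodes without a routable internal address
        labels = node.get("metadata", {}).get("labels") or {}
        if any(label in labels for label in CONTROL_PLANE_LABELS):
            groups["MasterNode"] = ip
        else:
            groups["WorkerNodes"].append(ip)
    return groups
```

In practice the input string could come from running `kubectl get nodes -o json` in a subprocess and passing its standard output to `classify_nodes`.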

External Dependencies 🍀

  • N/A

Breaking API Changes ⚠️

  • N/A

@nosnelmil nosnelmil force-pushed the feature/multi-loader-adv-logs branch 2 times, most recently from 3dfaaaf to d5d74ac Compare February 4, 2025 03:16
@nosnelmil nosnelmil marked this pull request as draft February 4, 2025 03:17
@nosnelmil nosnelmil force-pushed the feature/multi-loader-adv-logs branch 9 times, most recently from 44717ff to c4798ba Compare February 13, 2025 06:59
@nosnelmil nosnelmil marked this pull request as ready for review February 13, 2025 07:08
@nosnelmil nosnelmil force-pushed the feature/multi-loader-adv-logs branch 2 times, most recently from 535398d to 0e11bd8 Compare February 19, 2025 16:18
Contributor

@leokondrashov leokondrashov left a comment

Looks nice. I left some comments.

@nosnelmil nosnelmil force-pushed the feature/multi-loader-adv-logs branch 4 times, most recently from ba47317 to bfb51ab Compare March 4, 2025 03:59
@nosnelmil
Contributor Author

@leokondrashov, as discussed, I added the log consolidation logic in 0dc0950.

@nosnelmil nosnelmil requested a review from leokondrashov March 4, 2025 04:36
Contributor

@leokondrashov leokondrashov left a comment

Looks great, thanks. Please fix a couple of minor comments.

Signed-off-by: Lenson <nosnelmil@gmail.com>

add node discovery validators

Signed-off-by: Lenson <nosnelmil@gmail.com>

add collect TOP metric functions

Signed-off-by: Lenson <nosnelmil@gmail.com>

add multi-loader metric_manager

Signed-off-by: Lenson <nosnelmil@gmail.com>

add autoscaler log collection

Signed-off-by: Lenson <nosnelmil@gmail.com>

add activator log collection

Signed-off-by: Lenson <nosnelmil@gmail.com>

add prometh log collection

Signed-off-by: Lenson <nosnelmil@gmail.com>

refactor metric manager contants

Signed-off-by: Lenson <nosnelmil@gmail.com>

minor fix for node discovery

Signed-off-by: Lenson <nosnelmil@gmail.com>

fix node discovery

Signed-off-by: Lenson <nosnelmil@gmail.com>

minor fix

Signed-off-by: Lenson <nosnelmil@gmail.com>

minor fix

Signed-off-by: Lenson <nosnelmil@gmail.com>

add logs for prometh

Signed-off-by: Lenson <nosnelmil@gmail.com>

add pause between prometh collection

Signed-off-by: Lenson <nosnelmil@gmail.com>

update wait time

Signed-off-by: Lenson <nosnelmil@gmail.com>

update condition for node discovery

Signed-off-by: Lenson <nosnelmil@gmail.com>

update logging

Signed-off-by: Lenson <nosnelmil@gmail.com>

update kind ssh update script

Signed-off-by: Lenson <nosnelmil@gmail.com>

fix setup kind ssh

Signed-off-by: Lenson <nosnelmil@gmail.com>

update setup metrics script

Signed-off-by: Lenson <nosnelmil@gmail.com>

fix log collection test

commit a05990d
Author: Lenson <nosnelmil@gmail.com>
Date:   Mon Feb 3 15:39:39 2025 +0800

    update test trigger

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit 3edb3b4
Author: Lenson <nosnelmil@gmail.com>
Date:   Mon Feb 3 15:33:06 2025 +0800

    update test

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit 56a0f7d
Author: Lenson <nosnelmil@gmail.com>
Date:   Mon Feb 3 15:18:40 2025 +0800

    fix

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit 67c520d
Author: Lenson <nosnelmil@gmail.com>
Date:   Mon Feb 3 15:06:20 2025 +0800

    fix

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit 48ff845
Author: Lenson <nosnelmil@gmail.com>
Date:   Mon Feb 3 14:46:29 2025 +0800

    test'

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit 295c761
Author: Lenson <nosnelmil@gmail.com>
Date:   Mon Feb 3 14:45:35 2025 +0800

    add adv log collection tests

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit 8469bdb
Author: Lenson <nosnelmil@gmail.com>
Date:   Mon Feb 3 14:45:05 2025 +0800

    update logging

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit 10e295a
Author: Lenson <nosnelmil@gmail.com>
Date:   Mon Feb 3 14:44:42 2025 +0800

    update kind ssh update script

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit c56a9d8
Author: Lenson <nosnelmil@gmail.com>
Date:   Mon Feb 3 13:19:27 2025 +0800

    add KinD ssh setup script

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit bf9a804
Author: Lenson <nosnelmil@gmail.com>
Date:   Mon Feb 3 10:31:55 2025 +0800

    update condition for node discovery

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit b3f078b
Author: Lenson <nosnelmil@gmail.com>
Date:   Fri Jan 31 18:35:03 2025 +0800

    add multi loader log collection

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add node discovery validators

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add collect TOP metric functions

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add multi-loader metric_manager

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add autoscaler log collection

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add activator log collection

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add prometh log collection

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    refactor metric manager contants

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    minor fix for node discovery

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    fix node discovery

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    minor fix

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    minor fix

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add logs for prometh

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add pause between prometh collection

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update wait time

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit 9bac3c4
Author: Lenson <nosnelmil@gmail.com>
Date:   Tue Jan 21 13:00:50 2025 +0800

    update multi loader docs

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update multi-loader docs

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit bfd17be
Author: Lenson <nosnelmil@gmail.com>
Date:   Mon Jan 20 16:30:13 2025 +0800

    minor multi loader fix

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    fix incorrect retry logging

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    remove iat and generated cli args

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    remove make clean from clean up

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit 91042aa
Author: Lenson <nosnelmil@gmail.com>
Date:   Thu Jan 16 15:53:19 2025 +0800

    update tests

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update multi loader e2e tests

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    revert setup.cfg

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    chmod script

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update unit tests

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    fix e2e test

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update tests

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit 69c3c3a
Author: Lenson <nosnelmil@gmail.com>
Date:   Tue Dec 31 11:49:55 2024 +0800

    add failfast flag

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update failfast flag description

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update comments

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update wordlist with multiloader specific words

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    simplify run experiment logic

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    refactor partial experiment naming

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    fix wrong indexing

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add progress in logging

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit fc3ad98
Author: Lenson <nosnelmil@gmail.com>
Date:   Sun Nov 17 14:07:35 2024 +0800

    refactor multi loader

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add multi-loader tests

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update test

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    refactor multi-loader tests

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add loader experiment

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update logs

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update log verbosity

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update logs

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update logs

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    rename multiloader driver to runner

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    refactor common files to multiloader folder

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    refactor multiloader functions

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    rename createNewStudy function name

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    fix formatting

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    remove extra features

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    remove extra features

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add validation for platform

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit ca5e2ad
Author: Lenson <nosnelmil@gmail.com>
Date:   Sat Nov 16 18:49:35 2024 +0800

    add multi loader documentation

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update docs

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    fix docs

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update documentation

    Signed-off-by: Lenson <nosnelmil@gmail.com>

commit 3c7e6b5
Author: Lenson <nosnelmil@gmail.com>
Date:   Sat Nov 16 12:36:43 2024 +0800

    add multi-loader

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add multi-loader config reader

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add multi loader base

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add multi loader base

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add node group struct

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add multi loader runner

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    refactor multi loader config

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add multi loader config validators

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add knative specific config enricher

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add additional knative platform type

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add base runner entry point

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    refactor multi loader config

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update multi loader config struct

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update unpack study doc

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add unpack study

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add prepare experiment

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update experiment config temp path

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add run loader function

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update log parser

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update log parser

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update log parser

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add clean up function

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add logs to indicate run status

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    expose entry points for multi loader runner

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add multi loader runner execution

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update default multi loader config path

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add cpu limit validator

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    remove extra knative feature

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    remove knative extra features

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add multi loader tests

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add basic config

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update basic config

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update basic config

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    add basic configs

    Signed-off-by: Lenson <nosnelmil@gmail.com>

    update base config

    Signed-off-by: Lenson <nosnelmil@gmail.com>

Signed-off-by: Lenson <nosnelmil@gmail.com>

update e2e test

Signed-off-by: Lenson <nosnelmil@gmail.com>
@nosnelmil nosnelmil force-pushed the feature/multi-loader-adv-logs branch from 67ea824 to 02c8260 Compare March 12, 2025 02:31
Signed-off-by: Lenson <nosnelmil@gmail.com>

update metrics description in docs

Signed-off-by: Lenson <nosnelmil@gmail.com>

add interval for prometh snapshot collection

Signed-off-by: Lenson <nosnelmil@gmail.com>
@nosnelmil nosnelmil force-pushed the feature/multi-loader-adv-logs branch from 02c8260 to 8580b19 Compare March 12, 2025 02:38
@nosnelmil
Contributor Author

Hi @cvetkovic, this PR extends the previously added multi-loader tool with enhanced log collection. The new feature lets users gather logs from the Activator and Autoscaler nodes, retrieve TOP metrics from all cluster nodes, and capture Prometheus snapshots. Users can also choose exactly which metrics to collect through the newly introduced Metrics field in the multi-loader configuration.

I would appreciate your review. If everything looks good, I will tidy up the commits and prepare the branch for merging into main. Thank you!


2 participants