Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(host_metrics source): add a new collector for tcp stats #22057

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

aryan9600
Copy link

@aryan9600 aryan9600 commented Dec 18, 2024

Summary

Add a new tcp collector to the host_metrics source to expose information about the systems's TCP stack. It exposes three metrics:

  • tcp_connections_total: The total number of TCP connections. It includes the state of the connection as a tag.
  • tcp_tx_queued_bytes_total: The sum of the number of bytes in the send queue across all connections.
  • tcp_rx_queued_bytes_total: The sum of the number of bytes in the receive queue across all connections.

These metrics are only available for Linux systems as it uses the netlink subsystem.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

  • Unit tests for the code interacting with netlink and for the code writing to the metrics buffer.
  • Manual testing by running the change locally with the below config file and comparing the output with ss -a -t on both Linux x86 and arm systems.
sources:
  host_metrics:
    type: host_metrics
    collectors:
      - tcp
sinks:
  console:
    type: console
    inputs:
      - host_metrics
    encoding:
      codec: text

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

  • Please read our Vector contributor resources.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run dd-rust-license-tool write to regenerate the license inventory and commit the changes (if any). More details here.

References

closes #21972

@aryan9600 aryan9600 requested review from a team as code owners December 18, 2024 17:43
@bits-bot
Copy link

bits-bot commented Dec 18, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added domain: sources Anything related to the Vector's sources domain: external docs Anything related to Vector's external, public documentation labels Dec 18, 2024
Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aryan9600 ! Exposing these new metrics seem useful. What do you think of putting them under the existing network collector though? Exporting as network_tcp_*.

@aryan9600
Copy link
Author

What do you think of putting them under the existing network collector though?

i wanted to keep it as a separate collector, as these metrics are only available for linux. also, to avoid any changes to the output of users when they upgrade to the newer version with these changes. if its a blocker, then i can move it to the network collector, no issue. let me know :)

@jszwedko
Copy link
Member

What do you think of putting them under the existing network collector though?

i wanted to keep it as a separate collector, as these metrics are only available for linux. also, to avoid any changes to the output of users when they upgrade to the newer version with these changes. if its a blocker, then i can move it to the network collector, no issue. let me know :)

I think we already have a precedent of collectors having OS-specific metrics (e.g. filesystem_used_ratio only exists for Linux) so I think they'd be a good fit under collector even if they only exist for Linux.

@aryan9600 aryan9600 changed the title feat(host_metrics): add a new tcp collector feat(host_metrics): modify the network collector to expose TCP stats Dec 20, 2024
@aryan9600 aryan9600 requested a review from jszwedko December 20, 2024 19:09
@aryan9600 aryan9600 changed the title feat(host_metrics): modify the network collector to expose TCP stats feat(host_metrics source): modify the network collector to expose TCP stats Dec 21, 2024
Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @aryan9600, thank you for this PR!

I noticed big change in Cargo.lock, I wonder if you can roll that file back and apply individual updates for the newly added crates only.

@aryan9600 aryan9600 requested a review from pront January 3, 2025 13:34
@pront pront self-assigned this Jan 3, 2025
@aryan9600
Copy link
Author

a problem with having these TCP metrics inside the network collector is that the latter allows to filter the output metrics using network devices, but the former doesn't support this filtering. this is due to the fact that netlink doesn't return the network interface in the response for SOCK_DIAG messages. we could modify the tests and docs to accommodate for this or create a new collector for these metrics. open to any other suggestions.
cc: @pront @jszwedko

@pront
Copy link
Member

pront commented Jan 6, 2025

Hi @aryan9600, one test is failing. I will take another look at this PR once that is fixed. You can run any test locally with cargo test.

Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ci-run-unit-mac

@pront pront force-pushed the host-metrics-tcp branch from 6419d0b to 9faccc7 Compare January 7, 2025 16:45
@pront
Copy link
Member

pront commented Jan 7, 2025

Running this locally on macOS I get multiple compilation errors originating netlink-packet-sock-diag.

error[E0425]: cannot find value `AF_AX25` in crate `libc`
  --> .../.cargo/registry/src/index.crates.io-6f17d22bba15001f/netlink-packet-sock-diag-0.4.2/src/constants.rs:10:31
   |
10 | pub const AF_AX25: u8 = libc::AF_AX25 as u8;
   |                               ^^^^^^^ not found in `libc`

@jszwedko
Copy link
Member

jszwedko commented Jan 7, 2025

a problem with having these TCP metrics inside the network collector is that the latter allows to filter the output metrics using network devices, but the former doesn't support this filtering. this is due to the fact that netlink doesn't return the network interface in the response for SOCK_DIAG messages. we could modify the tests and docs to accommodate for this or create a new collector for these metrics. open to any other suggestions.

Ah, good point, that could be actually be a reason to separate the collector if it is not filterable by network device.

@aryan9600
Copy link
Author

@jszwedko ack, moving these metrics back into its own dedicated collector.

Add a new `tcp` collector  to the `host_metrics` source to expose
information about the systems's TCP stack. It exposes three metrics:
* `tcp_connections_total`: The total number of TCP connections. It
  includes the `state` of the connection as a tag.
* `tcp_tx_queued_bytes_total`: The sum of the number of bytes in the
   send queue across all connections.
* `tcp_rx_queued_bytes_total`: The sum of the number of bytes in the
  receive queue across all connections.

The collector is only enabled for Linux as it uses the netlink
subsystem.

Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
@aryan9600
Copy link
Author

Running this locally on macOS I get multiple compilation errors originating netlink-packet-sock-diag.

@pront this should be fixed.

@aryan9600 aryan9600 requested a review from pront January 8, 2025 17:58
@aryan9600
Copy link
Author

tests are passing now, previous CI run failed because third party licenses were not up to date. this has been fixed now.

@aryan9600 aryan9600 changed the title feat(host_metrics source): modify the network collector to expose TCP stats feat(host_metrics source): add a new collector for tcp stats Jan 9, 2025
Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on the UX. I'll leave the code review to @pront

Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments on diagnostics but otherwise this is good to go 🚀

src/sources/host_metrics/tcp.rs Outdated Show resolved Hide resolved
9 => Ok(TcpState::LastAck),
10 => Ok(TcpState::Listen),
11 => Ok(TcpState::Closing),
_ => Err(TcpError::InvalidTcpState { state: value }),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we could pass the string representation of state here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm a bit confused, what would be the string representation of state here, since value doesn't correspond to a valid tcp state?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my bad, I see that this is a catch all for states that don't map to a string.

Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing one last check on my side because last time we tried to generate linux specific platform docs it broke component generation on macOS. This something for us (Vector team) to figure out. PR looks great otherwise.

9 => Ok(TcpState::LastAck),
10 => Ok(TcpState::Listen),
11 => Ok(TcpState::Closing),
_ => Err(TcpError::InvalidTcpState { state: value }),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my bad, I see that this is a catch all for states that don't map to a string.

@pront
Copy link
Member

pront commented Jan 10, 2025

make generate-component-docs on macOS:

image

Raw: diff.txt

@jszwedko: do you happen to recall how we dealt with this in the past?

@jszwedko
Copy link
Member

make generate-component-docs on macOS:

image Raw: [diff.txt](https://github.com/user-attachments/files/18378930/diff.txt)

@jszwedko: do you happen to recall how we dealt with this in the past?

I think maybe the cue files were manually modified rather than using make generate-component-docs. I believe the diff you have is accurate and should be applied 🤔

@jszwedko
Copy link
Member

Though I do see the "Only available on Linux" bit of the doc comment is dropped. That might be an issue in the scaffolding that generates the cue file based on the structs.

@pront
Copy link
Member

pront commented Jan 10, 2025

make generate-component-docs on macOS:
image
Raw: diff.txt
@jszwedko: do you happen to recall how we dealt with this in the past?

I think maybe the cue files were manually modified rather than using make generate-component-docs. I believe the diff you have is accurate and should be applied 🤔

Thanks @jszwedko. @aryan9600 please take a look at the above.

@pront
Copy link
Member

pront commented Jan 10, 2025

Though I do see the "Only available on Linux" bit of the doc comment is dropped. That might be an issue in the scaffolding that generates the cue file based on the structs.

I will take a look at this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: external docs Anything related to Vector's external, public documentation domain: sources Anything related to the Vector's sources
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a collector to track TCP connections in the host_metrics source
5 participants