IPMI Monitoring Data Interruption Issue #3197

Open
sdlwdong opened this issue Mar 30, 2025 · 8 comments
Labels
good first issue, question

Comments

@sdlwdong

Question

Title: IPMI Monitoring Data Interruption Issue

Description:
After we configured IPMI monitoring for a physical machine and confirmed that it was collecting data normally, data collection stopped after a period of time.

  1. The network connection remains stable, with no apparent abnormalities.
  2. We can still retrieve IPMI information by running commands directly on the Hertzbeat machine.
  3. However, when we click "Edit Test", the connection fails with a timeout error.

This suggests there may be an underlying issue with the IPMI integration or with Hertzbeat's ability to maintain the connection over time.

Please help, thanks.

Image

@sdlwdong added the "question" label on Mar 30, 2025
@sdlwdong
Author

Image

@sdlwdong
Author

2025-03-30 17:16:19.654 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]:
id: 494961833735936
app: "ipmi"
metrics: "Chassis"
time: 1743326179654
code: TIMEOUT
msg: "collect timeout"

2025-03-30 17:20:39.656 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]:
id: 494961833735936
app: "ipmi"
metrics: "Chassis"
time: 1743326439656
code: TIMEOUT
msg: "collect timeout"

2025-03-30 17:24:59.658 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]:
id: 494961833735936
app: "ipmi"
metrics: "Chassis"
time: 1743326699658
code: TIMEOUT
msg: "collect timeout"

2025-03-30 17:29:19.660 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]:
id: 494961833735936
app: "ipmi"
metrics: "Chassis"
time: 1743326959660
code: TIMEOUT
msg: "collect timeout"

2025-03-30 17:33:39.663 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]:
id: 494961833735936
app: "ipmi"
metrics: "Chassis"
time: 1743327219663
code: TIMEOUT
msg: "collect timeout"

@tomsun28 added the "good first issue" label on Mar 31, 2025
@tomsun28 changed the title from "IPMI监控数据中断问题" to "IPMI Monitoring Data Interruption Issue" on Mar 31, 2025
@tomsun28
Contributor

Hi @sdlwdong, is there more log information? The current log doesn't seem to show what the problem is.

Hi @gjjjj0101, please take a look if you have time, thanks.

@sdlwdong
Author

sdlwdong commented Apr 1, 2025

> Hi @sdlwdong, is there more log information? The current log doesn't seem to show what the problem is.
>
> Hi @gjjjj0101, please take a look if you have time, thanks.

@gjjjj0101 Hello, which service's logs do you need to see? Please guide me. Thank you!

@gjjjj0101
Contributor

I have located the problem. When the network between the collector and the target machine has a problem, the NIO datagramChannel.receive() call used in the collector does not throw a network timeout exception, so the collection only ends when the manager's collection times out. As a result, the monitor status still shows up, and the displayed collection time stays at an earlier successful collection.
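The behaviour described above can be reproduced outside Hertzbeat. The following is a minimal sketch, not the collector's actual code: `BlockingReceiveDemo` and `receiveStillBlockedAfter` are hypothetical names chosen for illustration. It binds a UDP channel on the loopback interface, sends nothing to it, and checks whether a blocking `receive()` is still waiting after a given delay. In blocking mode the call simply waits for a datagram, so no timeout exception is raised on its own.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it is not part of Hertzbeat.
final class BlockingReceiveDemo {

    /** Returns true if a blocking receive() is still waiting after waitMillis with no datagram sent. */
    static boolean receiveStillBlockedAfter(long waitMillis) throws Exception {
        try (DatagramChannel channel = DatagramChannel.open()) {
            // Bind to an ephemeral loopback port; nothing will ever be sent to it.
            channel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            CountDownLatch returned = new CountDownLatch(1);
            Thread reader = new Thread(() -> {
                try {
                    // Blocking mode is the default: this waits until a datagram arrives.
                    channel.receive(ByteBuffer.allocate(1024));
                } catch (Exception e) {
                    // Reached only when the channel is closed underneath the reader.
                } finally {
                    returned.countDown();
                }
            });
            reader.setDaemon(true);
            reader.start();
            // If the latch has not been released, receive() neither returned nor threw.
            boolean finished = returned.await(waitMillis, TimeUnit.MILLISECONDS);
            return !finished;
        } // Closing the channel here unblocks the reader thread.
    }

    public static void main(String[] args) throws Exception {
        System.out.println("receive() still blocked: " + receiveStillBlockedAfter(1000));
    }
}
```

The only things that end the wait in this sketch are an arriving datagram or the channel being closed, which matches the symptom of a collection task that hangs until an outer timeout monitor gives up on it.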

@gjjjj0101
Contributor

So this is a bug. I am still working out how to fix it; if you have good suggestions, please share them with me.

@harshita2626

Possible solutions:

1. Network configuration check:

# Verify network connectivity to the BMC
# (ping runs continuously by default on Linux; on Windows, add -t)
ping <BMC_IP>
# Check for packet loss
mtr --report <BMC_IP>

2. IPMI tool validation:

# Test raw IPMI connectivity during failure periods
ipmitool -H <BMC_IP> -U <username> -P <password> -I lanplus chassis status

3. Hertzbeat configuration adjustments:

Increase the timeout settings in hertzbeat.yml:

collector:
  dispatch:
    timeout: 30000

@JuJinPark
Contributor

Since no exception like a timeout is thrown (as it’s using UDP), how about manually setting a specific timeout? If there’s no response within a certain period, we could treat it as a failed request.
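One way to sketch the manual timeout suggested here is to put the channel in non-blocking mode and wait on a `Selector` with a deadline. This is an illustration under assumptions, not a patch for Hertzbeat: `UdpReceiveTimeout` and `receiveWithTimeout` are hypothetical names, and the sketch assumes the caller is free to switch the channel to non-blocking mode. If no datagram arrives before the deadline, it throws `SocketTimeoutException`, which the caller could treat as a failed request.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; it is not part of Hertzbeat.
final class UdpReceiveTimeout {

    /**
     * Waits up to timeoutMillis for one datagram and returns the sender's address.
     * Throws SocketTimeoutException if nothing arrives in time.
     */
    static SocketAddress receiveWithTimeout(DatagramChannel channel, ByteBuffer buffer, long timeoutMillis)
            throws IOException {
        // A channel must be non-blocking before it can be registered with a selector.
        channel.configureBlocking(false);
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try (Selector selector = Selector.open()) {
            channel.register(selector, SelectionKey.OP_READ);
            while (true) {
                // In non-blocking mode receive() returns null when no datagram is queued.
                SocketAddress sender = channel.receive(buffer);
                if (sender != null) {
                    return sender;
                }
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    throw new SocketTimeoutException("no response within " + timeoutMillis + " ms");
                }
                // Sleep until the channel is readable or the remaining time runs out.
                selector.select(remainingMillis);
                selector.selectedKeys().clear();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (DatagramChannel channel = DatagramChannel.open()) {
            channel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try {
                receiveWithTimeout(channel, ByteBuffer.allocate(1024), 500);
            } catch (SocketTimeoutException e) {
                // Nothing is sent in this demo, so the request is reported as failed.
                System.out.println("Collection failed: " + e.getMessage());
            }
        }
    }
}
```

The deadline is computed once, so spurious selector wake-ups do not extend the total wait. Opening a selector per call keeps the sketch short; a real collector would more likely reuse one.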


5 participants