File read log interval #22115

Closed
xufeixianggithub opened this issue Jan 3, 2025 · 3 comments

Comments

@xufeixianggithub

Hi @xufeixianggithub,

I see the issue you are describing, but I don't think we'd solve it by adding configuration to the file source. Instead, I think a more holistic way to solve this problem is for the throttle transform to support applying back-pressure. This is being tracked by #13651

You could also consider configuring the sink to apply back-pressure by limiting the concurrency or batch sizes.

I'll close this issue, but let me know if you disagree with my assessment!

Originally posted by @jszwedko in #22095 (comment)
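
For context, sink-side back-pressure of the kind suggested above is typically tuned through the sink's `batch` and `request` settings. A minimal sketch for the elasticsearch sink discussed in this thread (the concurrency and batch values are illustrative, not taken from the issue):

```yaml
sinks:
  web_log_es_sink:
    type: "elasticsearch"
    inputs: ["app_logs_parse"]
    endpoints:
      - "http://172.18.83.221:9201"
    batch:
      max_events: 1000   # smaller bulk requests reduce the load of each request on Elasticsearch
      timeout_secs: 5
    request:
      concurrency: 1     # cap in-flight requests so the sink pulls events more slowly
```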

@xufeixianggithub
Author

Let me describe my usage scenario in detail. The source is a file source and the sink is Elasticsearch, configured with a disk buffer and back-pressure (block when the buffer is full). In practice, according to my test results, buffer files are generated and events are written to them, but the downstream Elasticsearch CPU is still high, and the upstream source still uses about 40% of total CPU on a 4-core 8 GB server even though I configured it to be single-threaded. Looking at `vector top`, throughput through the source and transforms reaches tens of thousands of events per second, and downstream Elasticsearch CPU usage climbs to 90 percent. If I configure the throttle transform I have to drop logs; if I could slow this down without introducing new middleware like Kafka, that would be ideal.
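
For reference, the throttle transform mentioned above enforces a rate limit by dropping events that exceed a threshold within a time window, which is why using it means throwing out logs. A minimal sketch (the threshold and window values are illustrative):

```yaml
transforms:
  throttle_logs:
    type: throttle
    inputs: ["app_logs_src"]
    threshold: 5000   # events allowed per window; events beyond this are dropped
    window_secs: 1    # length of the rate-limiting window, in seconds
```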

@xufeixianggithub
Author

```yaml
sources:
  app_logs_src:
    type: "file"
    include:
      - "/alidata/info-2025-01-01-20.log"
    ignore_older_secs: 125000 # ~1.4 days
    ignore_checkpoints: true
    line_delimiter: "-[END]\n"
    fingerprint:
      strategy: "checksum"
      line: 1
      ignored_header_bytes: 512
    max_line_bytes: 5097152
  internal_metrics_src:
    type: "internal_metrics"

transforms:
  lua_transform:
    type: lua
    inputs: [ "app_logs_src" ]
    version: "2"
    hooks:
      init: |
        function (emit)
          _G.event_count = 0
          _G.max_events_per_batch = 5000
          _G.sleep_time = 2
        end
      process: |
        function (event, emit)
          _G.event_count = _G.event_count + 1

          --print("_G.event_count: " .. _G.event_count)
          --print("Processing event: " .. tostring(event.log))

          if _G.event_count >= _G.max_events_per_batch then
            os.execute("sleep " .. _G.sleep_time)
            _G.event_count = 0
          end
          emit(event)
        end
  app_logs_parse:
    type: "remap"
    inputs: ["lua_transform"]
    source: |
      . |= parse_regex!(.message, r'^[(?P[^\]]+)] [(?P[^\]]+)] [(?P[^\]])] [(?P[^\]]+)] [(?P\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}:\d{3})] [(?P[^\]]+)] [(?P[^\]]+)] - (?P[\s\S])$')
      del(.host)
      del(.source_type)
      del(.file)
    drop_on_error: true
    drop_on_abort: true
    reroute_dropped: true

sinks:
  dropped_sink:
    inputs: [ "app_logs_parse.dropped" ]
    type: file
    path: "/alidata/vector/testScript/vector_error.log"
    encoding:
      codec: json
  temp_log_test_sink:
    inputs:
      - app_logs_parse
    type: file
    path: "/alidata/vector/testScript/createLog/output.log"
    encoding:
      codec: json
      json:
        pretty: true
  web_log_es_sink:
    type: "elasticsearch"
    inputs: ["app_logs_parse"]
    endpoints:
      - "http://172.18.83.221:9201"
    api_version: v7
    bulk:
      index: "application_log.%Y-%m-%d"
    batch:
      max_events: 5000
      timeout_secs: 5
    buffer:
      type: "disk"
      #max_size: 269484544
      max_size: 2269484544
      when_full: "block"
```
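
As an aside, the config above defines an `internal_metrics_src` source but never routes it to a sink. One way to watch throughput and buffer behavior while testing is to expose those metrics, for example with a prometheus_exporter sink (a sketch; the address is illustrative):

```yaml
sinks:
  vector_metrics_sink:
    type: prometheus_exporter
    inputs: ["internal_metrics_src"]
    address: "0.0.0.0:9598"   # scrape endpoint for Vector's internal metrics
```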

@xufeixianggithub
Author

I provided a piece of code that uses Lua for rate limiting. In my tests it can control the speed of all three components (source, transforms, and sink): only 5K logs are read at a time, the number of logs output matches the number of logs in the source file, and CPU usage stays at only a few percent. If anyone has seen or tried this solution, please point out whether it has any hidden pitfalls.

@vectordotdev vectordotdev locked and limited conversation to collaborators Jan 3, 2025
@pront pront converted this issue into discussion #22117 Jan 3, 2025
