File read log interval #22115

Closed
xufeixianggithub opened this issue Jan 3, 2025 · 3 comments

Comments

@xufeixianggithub

Hi @xufeixianggithub,

I see the issue you are describing, but I don't think we'd solve it by adding configuration to the file source. Instead, I think a more holistic way to solve this problem is for the throttle transform to support applying back-pressure. This is being tracked by #13651

You could also consider configuring the sink to apply back-pressure by limiting the concurrency or batch sizes.

I'll close this issue, but let me know if you disagree with my assessment!

Originally posted by @jszwedko in #22095 (comment)
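
For context, sink-side back-pressure of the kind suggested above is typically tuned through the sink's `batch` and `request` settings. A minimal sketch for the elasticsearch sink discussed in this thread (the concurrency and batch values are illustrative, not taken from the issue):

```yaml
sinks:
  web_log_es_sink:
    type: "elasticsearch"
    inputs: ["app_logs_parse"]
    endpoints:
      - "http://172.18.83.221:9201"
    batch:
      max_events: 1000   # smaller bulk requests reduce the load of each request on Elasticsearch
      timeout_secs: 5
    request:
      concurrency: 1     # cap in-flight requests so the sink pulls events more slowly
```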

@xufeixianggithub
Author

Let me describe my usage scenario in detail. The source is a file source and the sink is Elasticsearch, configured with a disk buffer and back-pressure (block when the buffer is full). In practice, according to my test results, buffer files are generated and events are written to them, but the downstream Elasticsearch CPU is still high, and the upstream source still uses about 40% of total CPU on a 4-core 8 GB server even though I configured it to be single-threaded. Looking at `vector top`, throughput through the source and transforms reaches tens of thousands of events per second, and downstream Elasticsearch CPU usage climbs to 90 percent. If I configure the throttle transform I have to drop logs; if I could slow this down without introducing new middleware like Kafka, that would be ideal.
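
For reference, the throttle transform mentioned above enforces a rate limit by dropping events that exceed a threshold within a time window, which is why using it means throwing out logs. A minimal sketch (the threshold and window values are illustrative):

```yaml
transforms:
  throttle_logs:
    type: throttle
    inputs: ["app_logs_src"]
    threshold: 5000   # events allowed per window; events beyond this are dropped
    window_secs: 1    # length of the rate-limiting window, in seconds
```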

@xufeixianggithub
Author

```yaml
sources:
  app_logs_src:
    type: "file"
    include:
      - "/alidata/info-2025-01-01-20.log"
    ignore_older_secs: 125000 # ~1.4 days
    ignore_checkpoints: true
    line_delimiter: "-[END]\n"
    fingerprint:
      strategy: "checksum"
      line: 1
      ignored_header_bytes: 512
    max_line_bytes: 5097152
  internal_metrics_src:
    type: "internal_metrics"

transforms:
  lua_transform:
    type: lua
    inputs: [ "app_logs_src" ]
    version: "2"
    hooks:
      init: |
        function (emit)
          _G.event_count = 0
          _G.max_events_per_batch = 5000
          _G.sleep_time = 2
        end
      process: |
        function (event, emit)
          _G.event_count = _G.event_count + 1

          --print("_G.event_count: " .. _G.event_count)
          --print("Processing event: " .. tostring(event.log))

          if _G.event_count >= _G.max_events_per_batch then
            os.execute("sleep " .. _G.sleep_time)
            _G.event_count = 0
          end
          emit(event)
        end
  app_logs_parse:
    type: "remap"
    inputs: ["lua_transform"]
    source: |
      . |= parse_regex!(.message, r'^[(?P[^\]]+)] [(?P[^\]]+)] [(?P[^\]])] [(?P[^\]]+)] [(?P\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}:\d{3})] [(?P[^\]]+)] [(?P[^\]]+)] - (?P[\s\S])$')
      del(.host)
      del(.source_type)
      del(.file)
    drop_on_error: true
    drop_on_abort: true
    reroute_dropped: true

sinks:
  dropped_sink:
    inputs: [ "app_logs_parse.dropped" ]
    type: file
    path: "/alidata/vector/testScript/vector_error.log"
    encoding:
      codec: json
  temp_log_test_sink:
    inputs:
      - app_logs_parse
    type: file
    path: "/alidata/vector/testScript/createLog/output.log"
    encoding:
      codec: json
      json:
        pretty: true
  web_log_es_sink:
    type: "elasticsearch"
    inputs: ["app_logs_parse"]
    endpoints:
      - "http://172.18.83.221:9201"
    api_version: v7
    bulk:
      index: "application_log.%Y-%m-%d"
    batch:
      max_events: 5000
      timeout_secs: 5
    buffer:
      type: "disk"
      #max_size: 269484544
      max_size: 2269484544
      when_full: "block"
```
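
As an aside, the config above defines an `internal_metrics_src` source but never routes it to a sink. One way to watch throughput and buffer behavior while testing is to expose those metrics, for example with a prometheus_exporter sink (a sketch; the address is illustrative):

```yaml
sinks:
  vector_metrics_sink:
    type: prometheus_exporter
    inputs: ["internal_metrics_src"]
    address: "0.0.0.0:9598"   # scrape endpoint for Vector's internal metrics
```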

@xufeixianggithub
Author

I provided a piece of code that uses Lua for rate limiting. In my tests it can control the speed of all three components (source, transforms, and sink): only 5K logs are read at a time, the number of logs output matches the number of logs in the source file, and CPU usage stays at only a few percent. If anyone has seen or tried this solution, please point out whether it has any hidden pitfalls.

@vectordotdev vectordotdev locked and limited conversation to collaborators Jan 3, 2025
@pront pront converted this issue into discussion #22117 Jan 3, 2025
