Processor trace sampling rebased #10116

Closed

edsiper (Member) commented Mar 21, 2025


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

edsiper and others added 23 commits March 21, 2025 17:53
This patch introduces a new trace sampling processor designed with a
pluggable architecture, allowing easy extension to support multiple
sampling strategies and backends.

The initial implementation includes basic probabilistic sampling, with
future patches planned to add additional sampling methods such as
rate-limiting, latency-based, and tail-based sampling.

The probabilistic sampler can be configured as follows:

  pipeline:
    inputs:
      - name: opentelemetry
        port: 4318

        processors:
          traces:
            - name: sampling
              type: probabilistic
              debug: true
              rules:
                sampling_percentage: 40

    outputs:
      - name: stdout
        match: '*'

in this configuration:
 - debug mode (debug: true) is enabled, allowing detailed logging of sampling decisions.
 - sampling_percentage: 40 ensures that 40% of traces are retained, while the rest are discarded.
 - traces that pass sampling will be forwarded to the stdout output for visibility.
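
A minimal sketch of how a percentage-based decision could work. The commit message does not say how the processor picks traces, so the hashing approach and the name keep_trace below are assumptions for illustration only. Hashing the trace ID gives every span of a trace the same decision, which matches the debug output where whole traces are kept or dropped.

```python
import hashlib


def keep_trace(trace_id_hex: str, sampling_percentage: int) -> bool:
    """Decide whether to keep a trace (hypothetical helper, not Fluent Bit code).

    The trace ID is hashed into a bucket in [0, 100); the trace is kept
    when the bucket is below the configured percentage.
    """
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sampling_percentage


if __name__ == "__main__":
    for trace_id in (
        "5b8efff798038103d269b633813fc60c",
        "6a9dfff798038103d269b633813fc60d",
    ):
        print(trace_id, keep_trace(trace_id, 40))
```

With sampling_percentage: 40, roughly 40% of a large set of trace IDs would pass; a small batch such as the five traces in the log below can deviate from that ratio.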

Fluent Bit v4.0.0
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___  _____
|  ___| |                | |   | ___ (_) |           /   ||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| || |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| ||  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/

[2025/02/28 16:46:00] [ info] [fluent bit] version=4.0.0, commit=0e885e2d60, pid=778903
[2025/02/28 16:46:00] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/02/28 16:46:00] [ info] [simd    ] disabled
[2025/02/28 16:46:00] [ info] [cmetrics] version=0.9.9
[2025/02/28 16:46:00] [ info] [ctraces ] version=0.6.0
[2025/02/28 16:46:00] [ info] [input:opentelemetry:opentelemetry.0] initializing
[2025/02/28 16:46:00] [ info] [input:opentelemetry:opentelemetry.0] storage_strategy='memory' (memory only)
[2025/02/28 16:46:00] [ info] [input:opentelemetry:opentelemetry.0] listening on 0.0.0.0:4318
[2025/02/28 16:46:00] [ info] [processor:sampling:sampling.0] initializing probabilistic sampling processor
[2025/02/28 16:46:00] [ info] [sp] stream processor started
[2025/02/28 16:46:00] [ info] [output:stdout:stdout.0] worker #0 started

🔍 Debug sampling 'probabilistic' (0x779068027940): before
   ┌─────────────────────────────────────────────────────────────────┐
   │ trace_id=5b8efff798038103d269b633813fc60c                       │
   ├─────────────────────────────────────────────────────────────────┤
   │ spans:                                                          │
   │   ├── id=eee19b7ec3c1b174 name=I'm a server span                │
   │   ├── id=eee19b7ec3c1b175 name=Child span of server span        │
   │   ├── id=eee19b7ec3c1b176 name=Database query                   │
   └─────────────────────────────────────────────────────────────────┘

   ┌─────────────────────────────────────────────────────────────────┐
   │ trace_id=6a9dfff798038103d269b633813fc60d                       │
   ├─────────────────────────────────────────────────────────────────┤
   │ spans:                                                          │
   │   ├── id=fff19b7ec3c1b174 name=A span in another trace          │
   └─────────────────────────────────────────────────────────────────┘

   ┌─────────────────────────────────────────────────────────────────┐
   │ trace_id=7c8efff798038103d269b633813fc60e                       │
   ├─────────────────────────────────────────────────────────────────┤
   │ spans:                                                          │
   │   ├── id=0000000000000000 name=Slow request                     │
   └─────────────────────────────────────────────────────────────────┘

   ┌─────────────────────────────────────────────────────────────────┐
   │ trace_id=8d9efff798038103d269b633813fc60f                       │
   ├─────────────────────────────────────────────────────────────────┤
   │ spans:                                                          │
   │   ├── id=0000000000000000 name=High traffic span                │
   │   ├── id=0000000000000000 name=Load testing event               │
   └─────────────────────────────────────────────────────────────────┘

   ┌─────────────────────────────────────────────────────────────────┐
   │ trace_id=9a1bfff798038103d269b633813fc610                       │
   ├─────────────────────────────────────────────────────────────────┤
   │ spans:                                                          │
   │   ├── id=0000000000000000 name=Faulty transaction               │
   │   ├── id=0000000000000000 name=Database rollback                │
   └─────────────────────────────────────────────────────────────────┘

🔍 Debug sampling 'probabilistic' (0x779068027940): after
   ┌─────────────────────────────────────────────────────────────────┐
   │ trace_id=6a9dfff798038103d269b633813fc60d                       │
   ├─────────────────────────────────────────────────────────────────┤
   │ spans:                                                          │
   │   ├── id=fff19b7ec3c1b174 name=A span in another trace          │
   └─────────────────────────────────────────────────────────────────┘

   ┌─────────────────────────────────────────────────────────────────┐
   │ trace_id=7c8efff798038103d269b633813fc60e                       │
   ├─────────────────────────────────────────────────────────────────┤
   │ spans:                                                          │
   │   ├── id=0000000000000000 name=Slow request                     │
   └─────────────────────────────────────────────────────────────────┘

|-------------------- RESOURCE SPAN --------------------|
  resource:
     - attributes:
            - service.name: 'other.service'
     - dropped_attributes_count: 0
     - schema_url: ""
  [scope_span]
    instrumentation scope:
        - name                    : other.library
        - version                 : 2.0.0
        - dropped_attributes_count: 0
        - attributes: undefined
    schema_url: ""
    [spans]
         [span #0 'A span in another trace']
             - trace_id                : 6a9dfff798038103d269b633813fc60d
             - span_id                 : fff19b7ec3c1b174
             - parent_span_id          : undefined
             - kind                    : 2 (server)
             - start_time              : 1544712660000000000
             - end_time                : 1544712662000000000
             - dropped_attributes_count: 0
             - dropped_events_count    : 0
             - dropped_links_count     : 0
             - trace_state             : (null)
             - status:
                 - code    : 0
             - attributes: none
             - events: none
             - [links]
|-------------------- RESOURCE SPAN --------------------|
  resource:
     - attributes:
            - service.name: 'latency.test.service'
     - dropped_attributes_count: 0
     - schema_url: ""
  [scope_span]
    instrumentation scope:
        - name                    : latency.test.library
        - version                 : 3.0.0
        - dropped_attributes_count: 0
        - attributes: undefined
    schema_url: ""
    [spans]
         [span #0 'Slow request']
             - trace_id                : 7c8efff798038103d269b633813fc60e
             - span_id                 : 0000000000000000
             - parent_span_id          : undefined
             - kind                    : 2 (server)
             - start_time              : 1544712660000000000
             - end_time                : 1544712675000000000
             - dropped_attributes_count: 0
             - dropped_events_count    : 0
             - dropped_links_count     : 0
             - trace_state             : (null)
             - status:
                 - code    : 0
             - attributes: none
             - events: none
             - [links]

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
The processors callback for traces supported only the incoming CTraces context,
which the processors were expected to modify in place. This patch changes the function
prototype by adding a new optional argument to set a new output CTraces context.

Behavior on return:

- If the CTraces output context is NULL, the remaining processor units stop
  right away. The assumption is that the processor plugin did some buffering or
  simply discarded the context, so no extra processing is needed.

- If the CTraces output context is different from the incoming CTraces context, it
  overrides the original context (the original context is destroyed).
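
The return behavior above can be sketched as a small processor chain. This is an illustration in Python rather than the actual C callback; the Trace class and run_trace_processors function are hypothetical names, and None stands in for a NULL output context.

```python
from typing import Callable, List, Optional


class Trace:
    """Stand-in for a CTraces context (hypothetical, for illustration)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.destroyed = False

    def destroy(self) -> None:
        self.destroyed = True


Processor = Callable[[Trace], Optional[Trace]]


def run_trace_processors(trace: Trace, processors: List[Processor]) -> Optional[Trace]:
    current = trace
    for process in processors:
        out = process(current)
        if out is None:
            # The processor buffered or discarded the context: stop here.
            return None
        if out is not current:
            # A different output context overrides the original one.
            current.destroy()
            current = out
    return current
```

In this sketch the chain does not destroy the input when a processor returns None, on the assumption stated above that the plugin has taken care of it.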

Signed-off-by: Eduardo Silva <eduardo@calyptia.com>
Signed-off-by: Eduardo Silva <eduardo@calyptia.com>
…output

Signed-off-by: Eduardo Silva <eduardo@calyptia.com>
Signed-off-by: Eduardo Silva <eduardo@calyptia.com>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
For the tail sampling type, this commit adds a new 'latency' conditional that
selects spans based on their duration (end time - start time) by
matching specific thresholds:

- threshold_ms_low : specifies the lower latency threshold. Traces with a
                     duration <= this value will be sampled.

- threshold_ms_high: specifies the upper latency threshold. Traces with a
                     duration >= this value will be sampled.

note that the thresholds are set in milliseconds.

usage:

  pipeline:
    inputs:
      - name: opentelemetry
        port: 4318

        processors:
          traces:
            - name: sampling
              type: tail
              sampling_settings:
                decision_wait: 5s
              conditions:
                - type: latency
                  threshold_ms_low: 200
                  threshold_ms_high: 3000

This tail-based sampling configuration waits 5 seconds before making a decision. It samples
traces based on latency, capturing short traces of 200ms or less and long traces of 3000ms
or more. Traces between 200ms and 3000ms are not sampled unless another condition applies.
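
A minimal sketch of the latency check, assuming timestamps in nanoseconds (as in the span output earlier) and assuming a trace matches when any of its spans crosses a threshold. The function names are hypothetical and the per-span evaluation is an assumption, not something the commit message confirms.

```python
from typing import Iterable, Optional, Tuple


def span_duration_ms(start_ns: int, end_ns: int) -> float:
    """Duration of a span in milliseconds (end time - start time)."""
    return (end_ns - start_ns) / 1_000_000


def latency_matches(
    spans: Iterable[Tuple[int, int]],
    threshold_ms_low: Optional[int] = None,
    threshold_ms_high: Optional[int] = None,
) -> bool:
    """Return True if any (start_ns, end_ns) span is short or long enough."""
    for start_ns, end_ns in spans:
        duration = span_duration_ms(start_ns, end_ns)
        if threshold_ms_low is not None and duration <= threshold_ms_low:
            return True
        if threshold_ms_high is not None and duration >= threshold_ms_high:
            return True
    return False
```

With thresholds of 200 and 3000, the 'Slow request' span shown earlier (15 seconds long) would match, while the 2 second 'A span in another trace' would not.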

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
This commit introduces the string_attribute conditional to the sampling processor, allowing
traces to be sampled based on specific span or resource attributes. Users can define key-value filters
like http.method=POST to selectively capture relevant traces:

pipeline:
  inputs:
    - name: opentelemetry
      port: 4318
      processors:
        traces:
          - name: sampling
            type: tail
            sampling_settings:
              decision_wait: 5s
            conditions:
              - type: string_attribute
                key: "http.method"
                values: ["GET"]
              - type: string_attribute
                key: "service.name"
                values: ["payment-processing"]
  outputs:
    - name: stdout
      match: '*'
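
As a rough sketch, a single string_attribute condition could be evaluated like this. The lookup order (span attributes first, then resource attributes) and the function name are assumptions; how several conditions combine is not covered here.

```python
from typing import Iterable, Mapping


def string_attribute_matches(
    span_attrs: Mapping[str, object],
    resource_attrs: Mapping[str, object],
    key: str,
    values: Iterable[str],
) -> bool:
    """Return True if the key holds one of the allowed string values."""
    allowed = set(values)
    for attrs in (span_attrs, resource_attrs):
        value = attrs.get(key)
        if isinstance(value, str) and value in allowed:
            return True
    return False
```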

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
This patch introduces the match_type property for the string_attribute conditional.
It accepts the values 'strict' (default) and 'exists'.

usage:

pipeline:
  inputs:
    - name: opentelemetry
      port: 4318
      processors:
        traces:
          - name: sampling
            type: tail
            sampling_settings:
              decision_wait: 5s
            conditions:
              - type: string_attribute
                match_type: strict
                key: "http.method"
                values: ["GET"]

              - type: string_attribute
                match_type: exists
                key: "service.name"
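
The difference between the two match types can be sketched as follows. This assumes 'strict' means the attribute value must equal one of the listed values and 'exists' only checks that the key is present; the function name is hypothetical.

```python
from typing import Mapping, Sequence


def attribute_condition_matches(
    attrs: Mapping[str, str],
    key: str,
    match_type: str = "strict",
    values: Sequence[str] = (),
) -> bool:
    if match_type == "exists":
        # Only the presence of the key matters; values are ignored.
        return key in attrs
    if match_type == "strict":
        # The key must be present and hold one of the listed values.
        return key in attrs and attrs[key] in values
    raise ValueError(f"unsupported match_type: {match_type!r}")
```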

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
…onal

This commit introduces support for the numeric_attribute conditional in the sampling
processor, allowing traces to be sampled based on numeric attribute values. Users can
define min and max thresholds.

usage:

  pipeline:
    inputs:
      - name: opentelemetry
        port: 4318

        processors:
          traces:
            - name: sampling
              type: tail
              sampling_settings:
                decision_wait: 2s
              conditions:
                - type: numeric_attribute
                  key: "http.status_code"
                  min_value: 400
                  max_value: 504
    outputs:
      - name: stdout
        match: '*'
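
A small sketch of the range check for numeric_attribute. Whether the real bounds are inclusive is not stated in the commit message; inclusive bounds are assumed here, and the function name is hypothetical.

```python
from typing import Mapping


def numeric_attribute_matches(
    attrs: Mapping[str, object], key: str, min_value: float, max_value: float
) -> bool:
    """Return True if attrs[key] is a number within [min_value, max_value]."""
    value = attrs.get(key)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return min_value <= value <= max_value
```

Under this assumption, the configuration above would select spans whose http.status_code is any value from 400 through 504.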

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Adds a new conditional that samples only the traces whose number of spans falls
within a specific range.

The following configuration options are available:

- min_spans: minimum number of expected spans
- max_spans: maximum number of spans found in the trace

usage:

pipeline:
  inputs:
    - name: opentelemetry
      port: 4318
      processors:
        traces:
          - name: sampling
            type: tail
            sampling_settings:
              decision_wait: 2s
            conditions:
              - type: span_count
                min_spans: 3
                max_spans: 5
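
The span_count check reduces to a range test on the number of spans collected for a trace. The sketch below assumes inclusive bounds and uses a hypothetical function name.

```python
from typing import Sequence


def span_count_matches(spans: Sequence[object], min_spans: int, max_spans: int) -> bool:
    """Return True if the trace holds between min_spans and max_spans spans."""
    return min_spans <= len(spans) <= max_spans
```

Under this assumption, with min_spans: 3 and max_spans: 5, the three-span trace from the earlier debug output would match, while the single-span traces would not.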

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
This commit introduces support for the trace_state conditional in the sampling
processor, allowing traces to be sampled based on metadata stored in the W3C trace_state field.

configuration:

- values: Defines a list of key-value pairs to match against the trace_state. A trace is sampled
if any of the specified values exist in the trace_state. Matching follows OR logic, meaning at
least one value must be present for sampling to occur.

example:

pipeline:
  inputs:
    - name: opentelemetry
      port: 4318

      processors:
        traces:
          - name: sampling
            type: tail
            sampling_settings:
              decision_wait: 2s
            conditions:
              - type: trace_state
                values: [debug=false, priority=high]
  outputs:
    - name: stdout
      match: '*'
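
The OR matching described above can be sketched like this, assuming the trace_state string follows the W3C layout of comma-separated key=value members. Both function names are hypothetical.

```python
from typing import Dict, Iterable, Optional


def parse_trace_state(trace_state: str) -> Dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict, skipping malformed members."""
    members: Dict[str, str] = {}
    for item in trace_state.split(","):
        item = item.strip()
        if not item or "=" not in item:
            continue
        key, value = item.split("=", 1)
        members[key] = value
    return members


def trace_state_matches(trace_state: Optional[str], values: Iterable[str]) -> bool:
    """Return True if at least one wanted key=value pair is present."""
    members = parse_trace_state(trace_state or "")
    for wanted in values:
        key, _, value = wanted.partition("=")
        if key in members and members[key] == value:
            return True
    return False
```

A span whose trace_state is empty or unset, like the (null) values in the earlier output, never matches in this sketch.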

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
The new max_traces option controls the maximum number of traces kept in
memory. When the value is exceeded, the oldest trace (by arrival time) is deleted.
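
The eviction rule can be sketched with an insertion-ordered map. The TraceBuffer class and its add_span method are hypothetical names used only for illustration; the real processor is written in C and may track arrival time differently.

```python
from collections import OrderedDict
from typing import List, Optional


class TraceBuffer:
    """Keeps at most max_traces traces, evicting the oldest arrival first."""

    def __init__(self, max_traces: int) -> None:
        self.max_traces = max_traces
        self.traces: "OrderedDict[str, List[object]]" = OrderedDict()

    def add_span(self, trace_id: str, span: object) -> Optional[str]:
        """Store a span; return the ID of an evicted trace, if any."""
        # setdefault keeps the original position of an existing trace,
        # so ordering reflects the first arrival of each trace.
        self.traces.setdefault(trace_id, []).append(span)
        if len(self.traces) > self.max_traces:
            evicted_id, _ = self.traces.popitem(last=False)
            return evicted_id
        return None
```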

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>