fix(file source)!: use uncompressed content for fingerprinting files (lines and ignored_header_bytes) #22050

roykim98 · 2024-12-18T06:28:18Z

Summary

This PR addresses #13193.

This PR will handle the case where attempting to fingerprint compressed content. It will also handle a use case with log rotation, where an uncompressed active file might be monitored and then rotated into a compressed file with a new inode by a log rotation service. By comparing the uncompressed lines as a source of truth, it prevents messages from processing the files 2X.

I chose to explicitly check the magic bytes header by enumerating the signatures to make checking the different headers consistent. I checked a couple compression libraries, and I figured it would be complicated to try to move the ownership of a reader into multiple decoders with inconsistent functions that did not all permit easy checking of the magic header. Of course, if only gzip is expected to be supported for the foreseeable future, this is a moot point and we can just fall back to instantiating the gzip library to check

BREAKING CHANGE: When sourcing from compressed files, ignored_header_bytes will no longer look at the compressed file's bytes, which would include the magic bytes for the compression header. Instead, it will ignore the bytes from the uncompressed content. Similarly, lines will no longer look for new line delimiters in the compressed content, but the uncompressed content. Arguably, both of these current mechanisms as a bug, as compressed content would not have any explicit lines or intentional header aside from the magic bytes.

Change Type

Bug fix
New feature
Non-functional (chore, refactoring, docs)
Performance

Is this a breaking change?

Yes
No

How did you test this PR?

Unit tests

cargo test --package file-source --lib -- fingerprinter::test --show-output

Does this PR include user facing changes?

Yes. Please add a changelog fragment based on our guidelines.
No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

Please read our Vector contributor resources.

References

Closes: file source: checksum fingerprint is not correct with gzipped files #13193

bits-bot · 2024-12-18T06:28:22Z

All committers have signed the CLA.

jszwedko

Nice work @roykim98 ! I left a couple of comments about error handling, but overall this approach seems good to me. I appreciate the added tests.

lib/file-source/src/fingerprinter.rs

roykim98 · 2024-12-20T03:09:07Z

Made some final tests here:

2024-12-20T03:03:53.093248Z  INFO vector::app: Loading configs. paths=["/Users/roy/vector.yaml"]
2024-12-20T03:03:53.322430Z  INFO vector::topology::running: Running healthchecks.
2024-12-20T03:03:53.324317Z  INFO vector::topology::builder: Healthcheck passed.
2024-12-20T03:03:53.326881Z  INFO vector: Vector has started. debug="true" version="0.44.0" arch="aarch64" revision=""
2024-12-20T03:03:53.327409Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2024-12-20T03:03:53.327408Z  INFO source{component_kind="source" component_id=file component_type=file}: vector::sources::file: Starting file server. include=["/Users/roy/sample-data/*.log*"] exclude=[]
2024-12-20T03:03:53.331783Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: file_source::checkpointer: Attempting to read legacy checkpoint files.
2024-12-20T03:04:30.271335Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/Users/roy/sample-data/sample.log
{"file":"/Users/roy/sample-data/sample.log","message":"this is a test","source_type":"file","timestamp":"2024-12-20T03:04:30.273128Z"}
{"file":"/Users/roy/sample-data/sample.log","message":"this is a second line test","source_type":"file","timestamp":"2024-12-20T03:04:30.273833Z"}
{"file":"/Users/roy/sample-data/sample.log","message":"this is a third line test","source_type":"file","timestamp":"2024-12-20T03:04:30.273962Z"}
2024-12-20T03:04:52.859856Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: file_source::file_server: Watched file has been renamed. path="/Users/roy/sample-data/sample.log.gz" old_path="/Users/roy/sample-data/sample.log"
2024-12-20T03:05:19.523732Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/Users/roy/sample-data/test.log" old_path="/Users/roy/sample-data/sample.log.gz"
2024-12-20T03:05:19.524053Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: file_source::file_server: Switching to watch most recently modified file. new_modified_time=SystemTime { tv_sec: 1734663918, tv_nsec: 74963277 } old_modified_time=SystemTime { tv_sec: 1734663869, tv_nsec: 255448000 }
2024-12-20T03:05:21.576312Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: file_source::file_server: Watched file has been renamed. path="/Users/roy/sample-data/sample.log.gz" old_path="/Users/roy/sample-data/test.log"

Made a 3-line file called sample.log

this is a test
this is a second line test
this is a third line test

Loaded them up in the source, and then compressed it to a new inode, sample.log.gz

Created a similarly named file test.log

this is a test
this is a different test

All had the same fingerprint

sources:
  file:
    type: file
    include: 
      - /Users/roy/sample-data/*.log*
    fingerprint:
      strategy: checksum
      ignored_header_bytes: 0
      lines: 1
    ignore_checkpoints: true  # just so testing is easily repeatable
    data_dir: data/  # make sure this directory exists where you run the pipeline
sinks:
  console:
    type: console
    inputs: ["file"]
    encoding:
      codec: json

jszwedko

Thanks @roykim98 ! I refactored the check function, but otherwise this looks good to me! I appreciate the thorough testing.

lib/file-source/src/fingerprinter.rs

roykim98 · 2024-12-21T00:54:08Z

One of the workflows was failing due to some whitespaces in the *.cue files, which I just removed, but it revoked the automerge label and build checks. Looks like you may have to re-add the automerge label, sorry for the inconvenience @jszwedko .

Ran make check-fmt locally, it passes now.

roykim98 · 2024-12-21T01:49:41Z

# Run the Clippy linter to catch common mistakes.
cargo vdev check rust --clippy
# Ensure all code is properly formatted. Code can be run through `rustfmt` using `cargo fmt` to ensure it is properly formatted.
cargo vdev check fmt
# Ensure the internal metrics that Vector emits conform to standards.
cargo vdev check events
# Ensure the `LICENSE-3rdparty.csv` file is up to date with the licenses each of Vector's dependencies are published under.
cargo vdev check licenses

These additional checks pass locally.

roykim98 · 2024-12-21T02:16:12Z

make check-component-docs
cd rust-doc && make docs

Also ran these

pront · 2025-01-06T16:44:45Z

Apologies if I missed something here, @jszwedko it seems like all your review points were addressed in this one and we are willing to break the existing behavior?

roykim98 · 2025-01-06T17:41:30Z

  Parse using klog............................................FAILED (expectation)  {
  "file": "klog.go",
  "id": 28133,
  "level": "info",
  "line": 70,
  "message": "hello from klog",
  - "timestamp": "2024-05-05T17:59:40.692994Z"
  + "timestamp": "2025-05-05T17:59:40.692994Z"
}

@pront @jszwedko It looks like one check failed, but I suspect it actually doesn't have anything to do with my PR given the only thing that changed is the year date. There is probably some logic that uses an unmocked timestamp and current year.

pront · 2025-01-06T17:48:34Z

True. Can you please git merge origin master and commit the changes?

… files

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>

roykim98 · 2025-01-06T18:10:27Z

@pront I just rebased off the latest master and force-pushed. No conflicts so no new changed lines.

jszwedko · 2025-01-06T18:33:44Z

Apologies, I still have to circle back to this. I think some the comment you left, @roykim98 , is relevant and the error handling isn't quite right at the moment.

roykim98 · 2025-01-06T19:02:49Z

@jszwedko

I presume you mean this comment? #22050 (comment)
Regarding:

            fp.seek(SeekFrom::Start(0))?;
            fp.read_exact(&mut magic)?;

If this is the final thing, I can add a quick commit that will change this to:

            fp.seek(SeekFrom::Start(0))?;
            let result = fp.read_exact(&mut magic);

            if result.is_err() {
                fp.seek(SeekFrom::Start(0))?;
                return Err(result.unwrap_err());
            }

jszwedko · 2025-01-07T23:12:57Z

@jszwedko

I presume you mean this comment? #22050 (comment) Regarding:

            fp.seek(SeekFrom::Start(0))?;
            fp.read_exact(&mut magic)?;

If this is the final thing, I can add a quick commit that will change this to:

            fp.seek(SeekFrom::Start(0))?;
            let result = fp.read_exact(&mut magic);

            if result.is_err() {
                fp.seek(SeekFrom::Start(0))?;
                return Err(result.unwrap_err());
            }

Right, yeah, something like that is what I was thinking; however, thinking about it more, I think the caller needs to know if the error occurred during the seek or during the read_exact to know if the file handle is still usable after the check function finishes. Also, however, our case, if the check function errors we abort reading the file and toss away the handle regardless so it is a bit moot. I think instead we could just add a doc comment that if an error is returned the file handle may be in an unusable state.

jszwedko · 2025-01-07T23:15:56Z

Apologies if I missed something here, @jszwedko it seems like all your review points were addressed in this one and we are willing to break the existing behavior?

Regarding the breaking change, I think it is an appropriate one to make the fingerprinting currently isn't working as expected for compressed input files.

roykim98 · 2025-01-08T04:14:08Z

@jszwedko Added a doc comment and added the attempt to reset on read error.

jszwedko

Great, thanks @roykim98 !

changelog.d/22050-fingerprint-uncompressed-file-content.fix.md

lib/file-source/src/fingerprinter.rs

pront

Looks good. Left a couple of nits.

Co-authored-by: Pavlos Rontidis <pavlos.rontidis@gmail.com>

roykim98 · 2025-01-09T00:53:32Z

@pront Addressed your changes, cargo clippy and cargo test both pass

roykim98 · 2025-01-09T19:42:30Z

@pront Is this good to merge now? Anything else required on my end?

pront · 2025-01-09T19:50:56Z

@pront Is this good to merge now? Anything else required on my end?

Doing a final check now.

lib/file-source/src/fingerprinter.rs

github-actions bot added the domain: external docs Anything related to Vector's external, public documentation label Dec 18, 2024

roykim98 force-pushed the roy/fingerprint-rotation branch from ccdfe60 to ae39cc0 Compare December 18, 2024 06:32

github-actions bot added the domain: sources Anything related to the Vector's sources label Dec 18, 2024

roykim98 marked this pull request as ready for review December 18, 2024 07:12

roykim98 requested review from a team as code owners December 18, 2024 07:12

rtrieu approved these changes Dec 18, 2024

View reviewed changes

jszwedko reviewed Dec 19, 2024

View reviewed changes

lib/file-source/src/fingerprinter.rs Outdated Show resolved Hide resolved

lib/file-source/src/fingerprinter.rs Outdated Show resolved Hide resolved

roykim98 requested a review from jszwedko December 20, 2024 03:09

jszwedko approved these changes Dec 20, 2024

View reviewed changes

jszwedko enabled auto-merge December 20, 2024 22:47

roykim98 commented Dec 21, 2024

View reviewed changes

lib/file-source/src/fingerprinter.rs Outdated Show resolved Hide resolved

auto-merge was automatically disabled December 21, 2024 00:49
Head branch was pushed to by a user without write access

roykim98 requested a review from jszwedko December 21, 2024 00:54

roykim98 and others added 7 commits January 6, 2025 10:08

Check the crc of uncompressed lines when reading from compressed gzip…

3c14ac0

… files

Update documentation

9c020b6

Handle header bytes properly and add a test

0de0859

Use fingerprint instead of fingerprinter

ee120e7

Clarify that only gzip is supported at this time

83e2373

Propagate errors upward as alluded to in @jszwedko comments

1345fab

Refactor check

c5bd911

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>

jszwedko and others added 4 commits January 6, 2025 10:08

Have to seek after finding algorithm too

78c731f

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>

Remove whitespace

4c11ceb

Clippy

9453a28

make check-component-docs

ef6809f

roykim98 force-pushed the roy/fingerprint-rotation branch from c14294c to ef6809f Compare January 6, 2025 18:09

Attempt to reset the position and also add a doc comment

fae0cce

jszwedko approved these changes Jan 8, 2025

View reviewed changes

pront reviewed Jan 8, 2025

View reviewed changes

changelog.d/22050-fingerprint-uncompressed-file-content.fix.md Outdated Show resolved Hide resolved

pront reviewed Jan 8, 2025

View reviewed changes

lib/file-source/src/fingerprinter.rs Outdated Show resolved Hide resolved

pront approved these changes Jan 8, 2025

View reviewed changes

roykim98 and others added 2 commits January 8, 2025 16:47

Update changelog.d/22050-fingerprint-uncompressed-file-content.fix.md

3a2c3e0

Co-authored-by: Pavlos Rontidis <pavlos.rontidis@gmail.com>

Use GZIP_MAGIC

3b57e38

pront reviewed Jan 9, 2025

View reviewed changes

lib/file-source/src/fingerprinter.rs Show resolved Hide resolved

Update lib/file-source/src/fingerprinter.rs

1c968ee

pront enabled auto-merge January 9, 2025 20:00

pront added this pull request to the merge queue Jan 9, 2025

Merged via the queue into vectordotdev:master with commit 68ee4f2 Jan 9, 2025
40 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(file source)!: use uncompressed content for fingerprinting files (lines and ignored_header_bytes) #22050

fix(file source)!: use uncompressed content for fingerprinting files (lines and ignored_header_bytes) #22050

roykim98 commented Dec 18, 2024 •

edited

Loading

bits-bot commented Dec 18, 2024 •

edited

Loading

jszwedko left a comment

roykim98 commented Dec 20, 2024

jszwedko left a comment

roykim98 commented Dec 21, 2024

roykim98 commented Dec 21, 2024

roykim98 commented Dec 21, 2024

pront commented Jan 6, 2025

roykim98 commented Jan 6, 2025 •

edited

Loading

pront commented Jan 6, 2025

roykim98 commented Jan 6, 2025

jszwedko commented Jan 6, 2025

roykim98 commented Jan 6, 2025

jszwedko commented Jan 7, 2025

jszwedko commented Jan 7, 2025

roykim98 commented Jan 8, 2025

jszwedko left a comment

pront left a comment

roykim98 commented Jan 9, 2025

roykim98 commented Jan 9, 2025

pront commented Jan 9, 2025

fix(file source)!: use uncompressed content for fingerprinting files (lines and ignored_header_bytes) #22050

fix(file source)!: use uncompressed content for fingerprinting files (lines and ignored_header_bytes) #22050

Conversation

roykim98 commented Dec 18, 2024 • edited Loading

Summary

Change Type

Is this a breaking change?

How did you test this PR?

Does this PR include user facing changes?

Checklist

References

bits-bot commented Dec 18, 2024 • edited Loading

jszwedko left a comment

Choose a reason for hiding this comment

roykim98 commented Dec 20, 2024

jszwedko left a comment

Choose a reason for hiding this comment

roykim98 commented Dec 21, 2024

roykim98 commented Dec 21, 2024

roykim98 commented Dec 21, 2024

pront commented Jan 6, 2025

roykim98 commented Jan 6, 2025 • edited Loading

pront commented Jan 6, 2025

roykim98 commented Jan 6, 2025

jszwedko commented Jan 6, 2025

roykim98 commented Jan 6, 2025

jszwedko commented Jan 7, 2025

jszwedko commented Jan 7, 2025

roykim98 commented Jan 8, 2025

jszwedko left a comment

Choose a reason for hiding this comment

pront left a comment

Choose a reason for hiding this comment

roykim98 commented Jan 9, 2025

roykim98 commented Jan 9, 2025

pront commented Jan 9, 2025

roykim98 commented Dec 18, 2024 •

edited

Loading

bits-bot commented Dec 18, 2024 •

edited

Loading

roykim98 commented Jan 6, 2025 •

edited

Loading