Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(file source)!: use uncompressed content for fingerprinting files (lines and ignored_header_bytes) #22050

Merged
merged 15 commits into from
Jan 9, 2025

Conversation

roykim98
Copy link
Contributor

@roykim98 roykim98 commented Dec 18, 2024

Summary

This PR addresses #13193.

This PR will handle the case where attempting to fingerprint compressed content. It will also handle a use case with log rotation, where an uncompressed active file might be monitored and then rotated into a compressed file with a new inode by a log rotation service. By comparing the uncompressed lines as a source of truth, it prevents messages from processing the files 2X.

I chose to explicitly check the magic bytes header by enumerating the signatures to make checking the different headers consistent. I checked a couple compression libraries, and I figured it would be complicated to try to move the ownership of a reader into multiple decoders with inconsistent functions that did not all permit easy checking of the magic header. Of course, if only gzip is expected to be supported for the foreseeable future, this is a moot point and we can just fall back to instantiating the gzip library to check

BREAKING CHANGE: When sourcing from compressed files, ignored_header_bytes will no longer look at the compressed file's bytes, which would include the magic bytes for the compression header. Instead, it will ignore the bytes from the uncompressed content. Similarly, lines will no longer look for new line delimiters in the compressed content, but the uncompressed content. Arguably, both of these current mechanisms as a bug, as compressed content would not have any explicit lines or intentional header aside from the magic bytes.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Unit tests

cargo test --package file-source --lib -- fingerprinter::test --show-output

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

References

@bits-bot
Copy link

bits-bot commented Dec 18, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the domain: external docs Anything related to Vector's external, public documentation label Dec 18, 2024
@roykim98 roykim98 force-pushed the roy/fingerprint-rotation branch from ccdfe60 to ae39cc0 Compare December 18, 2024 06:32
@github-actions github-actions bot added the domain: sources Anything related to the Vector's sources label Dec 18, 2024
@roykim98 roykim98 marked this pull request as ready for review December 18, 2024 07:12
@roykim98 roykim98 requested review from a team as code owners December 18, 2024 07:12
Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @roykim98 ! I left a couple of comments about error handling, but overall this approach seems good to me. I appreciate the added tests.

lib/file-source/src/fingerprinter.rs Outdated Show resolved Hide resolved
lib/file-source/src/fingerprinter.rs Outdated Show resolved Hide resolved
@roykim98
Copy link
Contributor Author

Made some final tests here:

2024-12-20T03:03:53.093248Z  INFO vector::app: Loading configs. paths=["/Users/roy/vector.yaml"]
2024-12-20T03:03:53.322430Z  INFO vector::topology::running: Running healthchecks.
2024-12-20T03:03:53.324317Z  INFO vector::topology::builder: Healthcheck passed.
2024-12-20T03:03:53.326881Z  INFO vector: Vector has started. debug="true" version="0.44.0" arch="aarch64" revision=""
2024-12-20T03:03:53.327409Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2024-12-20T03:03:53.327408Z  INFO source{component_kind="source" component_id=file component_type=file}: vector::sources::file: Starting file server. include=["/Users/roy/sample-data/*.log*"] exclude=[]
2024-12-20T03:03:53.331783Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: file_source::checkpointer: Attempting to read legacy checkpoint files.
2024-12-20T03:04:30.271335Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/Users/roy/sample-data/sample.log
{"file":"/Users/roy/sample-data/sample.log","message":"this is a test","source_type":"file","timestamp":"2024-12-20T03:04:30.273128Z"}
{"file":"/Users/roy/sample-data/sample.log","message":"this is a second line test","source_type":"file","timestamp":"2024-12-20T03:04:30.273833Z"}
{"file":"/Users/roy/sample-data/sample.log","message":"this is a third line test","source_type":"file","timestamp":"2024-12-20T03:04:30.273962Z"}
2024-12-20T03:04:52.859856Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: file_source::file_server: Watched file has been renamed. path="/Users/roy/sample-data/sample.log.gz" old_path="/Users/roy/sample-data/sample.log"
2024-12-20T03:05:19.523732Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/Users/roy/sample-data/test.log" old_path="/Users/roy/sample-data/sample.log.gz"
2024-12-20T03:05:19.524053Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: file_source::file_server: Switching to watch most recently modified file. new_modified_time=SystemTime { tv_sec: 1734663918, tv_nsec: 74963277 } old_modified_time=SystemTime { tv_sec: 1734663869, tv_nsec: 255448000 }
2024-12-20T03:05:21.576312Z  INFO source{component_kind="source" component_id=file component_type=file}:file_server: file_source::file_server: Watched file has been renamed. path="/Users/roy/sample-data/sample.log.gz" old_path="/Users/roy/sample-data/test.log"

Made a 3-line file called sample.log

this is a test
this is a second line test
this is a third line test

Loaded them up in the source, and then compressed it to a new inode, sample.log.gz

Created a similarly named file test.log

this is a test
this is a different test

All had the same fingerprint

sources:
  file:
    type: file
    include: 
      - /Users/roy/sample-data/*.log*
    fingerprint:
      strategy: checksum
      ignored_header_bytes: 0
      lines: 1
    ignore_checkpoints: true  # just so testing is easily repeatable
    data_dir: data/  # make sure this directory exists where you run the pipeline
sinks:
  console:
    type: console
    inputs: ["file"]
    encoding:
      codec: json

@roykim98 roykim98 requested a review from jszwedko December 20, 2024 03:09
Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @roykim98 ! I refactored the check function, but otherwise this looks good to me! I appreciate the thorough testing.

@jszwedko jszwedko enabled auto-merge December 20, 2024 22:47
auto-merge was automatically disabled December 21, 2024 00:49

Head branch was pushed to by a user without write access

@roykim98
Copy link
Contributor Author

One of the workflows was failing due to some whitespaces in the *.cue files, which I just removed, but it revoked the automerge label and build checks. Looks like you may have to re-add the automerge label, sorry for the inconvenience @jszwedko .

Ran make check-fmt locally, it passes now.

@roykim98 roykim98 requested a review from jszwedko December 21, 2024 00:54
@roykim98
Copy link
Contributor Author

# Run the Clippy linter to catch common mistakes.
cargo vdev check rust --clippy
# Ensure all code is properly formatted. Code can be run through `rustfmt` using `cargo fmt` to ensure it is properly formatted.
cargo vdev check fmt
# Ensure the internal metrics that Vector emits conform to standards.
cargo vdev check events
# Ensure the `LICENSE-3rdparty.csv` file is up to date with the licenses each of Vector's dependencies are published under.
cargo vdev check licenses

These additional checks pass locally.

@roykim98
Copy link
Contributor Author

make check-component-docs
cd rust-doc && make docs

Also ran these

@pront
Copy link
Member

pront commented Jan 6, 2025

Apologies if I missed something here, @jszwedko it seems like all your review points were addressed in this one and we are willing to break the existing behavior?

@roykim98
Copy link
Contributor Author

roykim98 commented Jan 6, 2025

  Parse using klog............................................FAILED (expectation)  {
  "file": "klog.go",
  "id": 28133,
  "level": "info",
  "line": 70,
  "message": "hello from klog",
  - "timestamp": "2024-05-05T17:59:40.692994Z"
  + "timestamp": "2025-05-05T17:59:40.692994Z"
}

@pront @jszwedko It looks like one check failed, but I suspect it actually doesn't have anything to do with my PR given the only thing that changed is the year date. There is probably some logic that uses an unmocked timestamp and current year.

@pront
Copy link
Member

pront commented Jan 6, 2025

True. Can you please git merge origin master and commit the changes?

@roykim98 roykim98 force-pushed the roy/fingerprint-rotation branch from c14294c to ef6809f Compare January 6, 2025 18:09
@roykim98
Copy link
Contributor Author

roykim98 commented Jan 6, 2025

@pront I just rebased off the latest master and force-pushed. No conflicts so no new changed lines.

@jszwedko
Copy link
Member

jszwedko commented Jan 6, 2025

Apologies, I still have to circle back to this. I think some the comment you left, @roykim98 , is relevant and the error handling isn't quite right at the moment.

@roykim98
Copy link
Contributor Author

roykim98 commented Jan 6, 2025

@jszwedko

I presume you mean this comment? #22050 (comment)
Regarding:

            fp.seek(SeekFrom::Start(0))?;
            fp.read_exact(&mut magic)?;

If this is the final thing, I can add a quick commit that will change this to:

            fp.seek(SeekFrom::Start(0))?;
            let result = fp.read_exact(&mut magic);

            if result.is_err() {
                fp.seek(SeekFrom::Start(0))?;
                return Err(result.unwrap_err());
            } 

@jszwedko
Copy link
Member

jszwedko commented Jan 7, 2025

@jszwedko

I presume you mean this comment? #22050 (comment) Regarding:

            fp.seek(SeekFrom::Start(0))?;
            fp.read_exact(&mut magic)?;

If this is the final thing, I can add a quick commit that will change this to:

            fp.seek(SeekFrom::Start(0))?;
            let result = fp.read_exact(&mut magic);

            if result.is_err() {
                fp.seek(SeekFrom::Start(0))?;
                return Err(result.unwrap_err());
            } 

Right, yeah, something like that is what I was thinking; however, thinking about it more, I think the caller needs to know if the error occurred during the seek or during the read_exact to know if the file handle is still usable after the check function finishes. Also, however, our case, if the check function errors we abort reading the file and toss away the handle regardless so it is a bit moot. I think instead we could just add a doc comment that if an error is returned the file handle may be in an unusable state.

@jszwedko
Copy link
Member

jszwedko commented Jan 7, 2025

Apologies if I missed something here, @jszwedko it seems like all your review points were addressed in this one and we are willing to break the existing behavior?

Regarding the breaking change, I think it is an appropriate one to make the fingerprinting currently isn't working as expected for compressed input files.

@roykim98
Copy link
Contributor Author

roykim98 commented Jan 8, 2025

@jszwedko Added a doc comment and added the attempt to reset on read error.

Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks @roykim98 !

Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Left a couple of nits.

roykim98 and others added 2 commits January 8, 2025 16:47
@roykim98
Copy link
Contributor Author

roykim98 commented Jan 9, 2025

@pront Addressed your changes, cargo clippy and cargo test both pass

@roykim98
Copy link
Contributor Author

roykim98 commented Jan 9, 2025

@pront Is this good to merge now? Anything else required on my end?

@pront
Copy link
Member

pront commented Jan 9, 2025

@pront Is this good to merge now? Anything else required on my end?

Doing a final check now.

@pront pront enabled auto-merge January 9, 2025 20:00
@pront pront added this pull request to the merge queue Jan 9, 2025
Merged via the queue into vectordotdev:master with commit 68ee4f2 Jan 9, 2025
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: external docs Anything related to Vector's external, public documentation domain: sources Anything related to the Vector's sources
Projects
None yet
Development

Successfully merging this pull request may close these issues.

file source: checksum fingerprint is not correct with gzipped files
5 participants