Skip to content

Conversation

ampleyfly
Copy link

@ampleyfly ampleyfly commented Aug 24, 2025

This makes it possible to create reproducible .tar.gz files without overriding time.time(), by setting the gzip header field mtime to 0.

@ampleyfly ampleyfly requested a review from ethanfurman as a code owner August 24, 2025 15:02
@python-cla-bot
Copy link

python-cla-bot bot commented Aug 24, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@bedevere-app
Copy link

bedevere-app bot commented Aug 24, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the documentation of tarfile.open() & co.

@ampleyfly
Copy link
Author

Do you think I should do something similar to compresslevel, where it is explicitly only allowed in certain modes (involving creating a gzip archive, in this case), or is it better to simply ignore it if not applicable?

@picnixz
Copy link
Member

picnixz commented Aug 24, 2025

I'm actually torn between:

  • having a single parameter for reproducibility
  • allow to set all dynamic parameters separately.

@ampleyfly
Copy link
Author

ampleyfly commented Aug 24, 2025

I just realized my test is not failable, since fobj.mtime is not set until decompression has started.
EDIT: Should be fixed now. Also added test for the stream mode.

@ampleyfly ampleyfly marked this pull request as draft August 24, 2025 17:03
Copy link
Contributor

@maurycy maurycy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ampleyfly @picnixz

Maybe it's better to just allow setting a timestamp (eg: timestamp or mtime)?

Reproducibility is a strong promise that should be confirmed with tests: is it across the same files? the same user? the same machine? are supported compressions fully deterministic always?

Perhaps we could enable the user to create a reproducible file, and document on how.

@ampleyfly ampleyfly force-pushed the tarfile-tar-gzip-mtime-75707 branch from 40264fc to 8087788 Compare August 24, 2025 17:20
@ampleyfly
Copy link
Author

@ampleyfly @picnixz

Maybe it's better to just allow setting a timestamp (eg: timestamp or mtime)?

Reproducibility is a strong promise that should be confirmed with tests: is it across the same files? the same user? the same machine? are supported compressions fully deterministic always?

Good points. That was very naïve of me :)
I think the reproducible option would be nice, but I see now that that would require a lot more work.
Is an mtime option a good idea on its own?

@ampleyfly
Copy link
Author

The GNU tar documentation has some pointers for making "more reproducible" archives (https://www.gnu.org/software/tar/manual/html_section/Reproducibility.html).
It also claims to show how to accomplish reproducible archives with tar + gzip, noting that gzip is reproducible, with the right options.

@ampleyfly ampleyfly force-pushed the tarfile-tar-gzip-mtime-75707 branch from 8087788 to 112aa92 Compare August 24, 2025 17:47
@ampleyfly ampleyfly changed the title gh-75707: tarfile: Add optional open() argument "reprodicuble" gh-75707: tarfile: Add optional open() argument "mtime" Aug 24, 2025
@ampleyfly ampleyfly marked this pull request as ready for review August 24, 2025 18:31
@ampleyfly
Copy link
Author

ampleyfly commented Aug 25, 2025

This PR is now just adding the mtime option.

@ethanfurman
Copy link
Member

Is there a test to ensure that not setting the mtime still functions correctly?

@ampleyfly ampleyfly force-pushed the tarfile-tar-gzip-mtime-75707 branch from 112aa92 to 5c472a0 Compare August 25, 2025 20:19
@ampleyfly
Copy link
Author

I added tests that use the workaround of temporarily setting time.time to set a known value and check that that value ends up in the mtime field of the gzip header.

@ethanfurman
Copy link
Member

That's a fine solution for single-threaded tests, but if test are run in parallel there could eventually be hard-to-reproduce bugs.
Another solution would be to record the time before making the gzip, the time after making the gzip, and then verifying that the gzip header timestamp is somewhere in that range.

@ampleyfly ampleyfly force-pushed the tarfile-tar-gzip-mtime-75707 branch from 5c472a0 to 3bdbfee Compare August 28, 2025 20:43
ampleyfly and others added 2 commits August 28, 2025 22:49
This makes it possible to set the gzip header mtime field without
overriding time.time(), making it useful when creating reproducible
archives.
@ampleyfly ampleyfly force-pushed the tarfile-tar-gzip-mtime-75707 branch from 3bdbfee to 7a7b637 Compare August 28, 2025 20:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants