Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZIP exports to S3 are corrupt #475

Closed
jnm opened this issue Aug 22, 2018 · 9 comments
Closed

ZIP exports to S3 are corrupt #475

jnm opened this issue Aug 22, 2018 · 9 comments
Assignees

Comments

@jnm
Copy link
Member

jnm commented Aug 22, 2018

To reproduce, enter the KC Django shell and run:

from onadata.libs.utils.export_tools import *                                   
stuffs = ' '.join(map(str, range(9 * 1024 * 1024)))                             
print 'megabytes of stuff:', len(stuffs) / 1024 / 1024                          
import random                                                                   
min_chunk = 2 * 1024 * 1024                                                     
max_chunk = 7 * 1024 * 1024                                                     
storage = get_storage_class()()                                                 
filename = 'aaa__test.zip'                                                      
out_f = storage.open(filename, 'wb')                                            
import StringIO                                                                 
sio = StringIO.StringIO(stuffs)                                                 
import zipfile                                                                  
out_zip = zipfile.ZipFile(out_f, 'w', zipfile.ZIP_STORED, allowZip64=True)      
while True:                                                                     
    chunk_size = random.randint(min_chunk, max_chunk)                           
    print 'trying chunk of', chunk_size, 'bytes'                                
    leido = sio.read(chunk_size)                                                
    print '\tread', len(leido), 'bytes'                                         
    if not len(leido):                                                          
        break                                                                   
    out_zip.writestr(leido[:8], leido)                                          
out_zip.close()                                                                 
out_f.seek(0)                                                                   
out_f.close()                                                                   
                                                                                
# test the generated zip                                                        
in_f = storage.open(filename, 'r')                                              
in_zip = zipfile.ZipFile(in_f, 'r', allowZip64=True)                            
# "Read all the files in the archive and check their CRC’s and file headers. Return the name of the first bad file, or else return None."
if in_zip.testzip() is not None:                                                
    print "yep, it's broken!"                                                   
                                                                                
storage.delete(filename) 

Will whittle this down in future comments.

@jnm jnm self-assigned this Aug 22, 2018
@jnm
Copy link
Member Author

jnm commented Aug 22, 2018

@darm0ck 👋

@jnm
Copy link
Member Author

jnm commented Aug 22, 2018

Given that AWS_S3_FILE_BUFFER_SIZE = 50 * 1024 * 1024:

AWS_S3_FILE_BUFFER_SIZE = 50 * 1024 * 1024

...it makes sense that we have to write more than 50 MB to have an issue. The following code creates a valid ZIP:

stuffs = 'F' * 49 * 1024 * 1024                                                 
from onadata.libs.utils.export_tools import *                                                                                                 
storage = get_storage_class()()                                                 
filename = 'aaa__test.zip'                                                      
out_f = storage.open(filename, 'wb')                                            
import StringIO                                                                 
sio = StringIO.StringIO(stuffs)                                                 
import zipfile                                                                  
out_zip = zipfile.ZipFile(out_f, 'w', zipfile.ZIP_STORED, allowZip64=True)                                                                               
out_zip.writestr('funstuff', stuffs)                                            
out_zip.close()                                                                 
out_f.seek(0)                                                                   
out_f.close()                                                                   
                                                                                
# test the generated zip                                                        
in_f = storage.open(filename, 'r')                                              
in_zip = zipfile.ZipFile(in_f, 'r', allowZip64=True)                            
# "Read all the files in the archive and check their CRC’s and file headers. Return the name of the first bad file, or else return None."
if in_zip.testzip() is not None:                                                
    print "yep, it's broken!"                                                   
                                                                                
storage.delete(filename) 

...but changing the first line to stuffs = 'F' * 50 * 1024 * 1024 causes:

error                                     Traceback (most recent call last)
<ipython-input-16-ef11e6098e88> in <module>()
     26 '''                                                                             
     27 out_zip.writestr('funstuff', stuffs)
---> 28 out_zip.close()
     29 out_f.seek(0)
     30 out_f.close()

/usr/lib/python2.7/zipfile.pyc in close(self)
   1362                 endrec = struct.pack(structEndArchive, stringEndArchive,
   1363                                     0, 0, centDirCount, centDirCount,
-> 1364                                     centDirSize, centDirOffset, len(self._comment))
   1365                 self.fp.write(endrec)
   1366                 self.fp.write(self._comment)

error: integer out of range for 'L' format code

ipdb shows that centDirSize is negative here.

@jnm
Copy link
Member Author

jnm commented Aug 22, 2018

Uhh...

> /usr/lib/python2.7/zipfile.py(1326)close()
   1325                         raise
-> 1326                     self.fp.write(centdir)
   1327                     self.fp.write(filename)

ipdb> self.fp.tell()
52428838
ipdb> self.fp
<S3BotoStorageFile: aaa__test.zip>
ipdb> unt
> /usr/lib/python2.7/zipfile.py(1327)close()
   1326                     self.fp.write(centdir)
-> 1327                     self.fp.write(filename)
   1328                     self.fp.write(extra_data)

ipdb> self.fp.tell()
46

@jnm
Copy link
Member Author

jnm commented Aug 22, 2018

😢

In [1]: from onadata.libs.utils.export_tools import *

In [2]: storage = get_storage_class()()

In [3]: out_f = storage.open('aaa__sadness', 'w')

In [4]: out_f.write('x' * 50 * 1024 * 1024)

In [5]: out_f.tell()
Out[5]: 52428800

In [6]: out_f.write('hi')

In [7]: out_f.tell()
Out[7]: 2

@jnm
Copy link
Member Author

jnm commented Aug 23, 2018

I opened jschneier/django-storages#566. For now, I'll try to implement a wrapper class that forbids seek() and implements tell() using the sum of write sizes.

@jnm
Copy link
Member Author

jnm commented Jul 21, 2020

My fork needs to be updated for the Python 3 version of django-storages and boto3

@jnm jnm reopened this Jul 21, 2020
jnm added a commit to kobotoolbox/kpi that referenced this issue Jul 21, 2020
@jnm
Copy link
Member Author

jnm commented Jul 21, 2020

@noliveleger, the fork has been updated: https://github.com/jnm/django-storages/commits/s3boto3_accurate_tell
I'll leave this open, and you can close it once you've added the requirement to your new Django 2.2 branch.

@jnm jnm closed this as completed Sep 8, 2020
noliveleger added a commit to kobotoolbox/kpi that referenced this issue Oct 19, 2020
…oto3-accurate-tell

Make `tell()` accurate when using s3boto3, by…
@jnm
Copy link
Member Author

jnm commented Jun 26, 2021

This has reared its head again with the upgrade to Python 3. I think something about Python's zipfile module has changed—possibly that it attempts to seek back to the directory header to fill in previously-unknown CRC and length information after adding each file. I do notice flag 0x08 being set when the underlying file is not seekable; according to wikipedia:

If the bit at offset 3 (0x08) of the general-purpose flags field is set, then the CRC-32 and file sizes are not known when the header is written. The fields in the local header are filled with zero, and the CRC-32 and size are appended in a 12-byte structure (optionally preceded by a 4-byte signature) immediately after the compressed data: …

I originally said that I'd forbid seek(), but that broke unrelated things and led me to settle on the milder alternative of logging a warning. I will now go back and make seek() raise an AttributeError, which the zipfile module will recognize as indicating an unseekable file.

Internal discussion: https://chat.kobotoolbox.org/#narrow/stream/6-Support/topic/ZIP.20exports.20failing

@jnm jnm reopened this Jun 26, 2021
jnm added a commit that referenced this issue Jun 26, 2021
@jnm
Copy link
Member Author

jnm commented Jun 30, 2021

Not critical anymore; let's do future work in #743.

@jnm jnm closed this as completed Jun 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant