The size parameter for gzip.open().read()



























When working with the gzip library in Python, I often come across code that uses the .read() method in a pattern like this:



with gzip.open(filename) as bytestream:
    bytestream.read(16)
    buf = bytestream.read(
        IMAGE_SIZE * IMAGE_SIZE * num_images * NUM_CHANNELS
    )
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)


While I'm familiar with the context manager pattern, I struggle to grasp what the first line of code inside the with block is actually doing.



This is the documentation for the read() function:




Read at most n characters from stream.



Read from underlying buffer until we have n characters or we hit EOF.
If n is negative or omitted, read until EOF.




If that is the case, the functional role of the first line, bytestream.read(16), would have to be reading, and thus skipping, the first 16 characters, presumably because they act as metadata or a header. However, given some images, how would I know to use 16 as the argument for the read call, instead of, say, 32, 8, or 64?



I recall plenty of times coming across code identical to the above, except that the author uses bytestream.read(8) instead of bytestream.read(16), or just as likely some other value. Digging into the file character by character shows no discernible pattern that would determine the length of the header.



In other words, how does one determine the parameter to use in the read call? Or, how does one know the length of the header in a gzip-compressed file?



My guess was that it has something to do with the bytes, but after searching through the documentation and online references I can't confirm that.



Reproducible details



My hypothesis, after countless hours of troubleshooting, is that the first 16 characters represent some sort of header or metadata. So the first line in that code skips those 16 characters, and the remainder is stored in a variable named buf. However, digging into the data, I found no way to determine why or how the value 16 was chosen. I have read the bytes in character by character, and also tried reading and casting them as np.float, but there is no discernible pattern suggesting the metadata ends at the 16th character and the actual data begins at the 17th.



The following code reads the data from this website and extracts the first 30 characters. Notice that it is indiscernible where the header "ends" (at the 16th byte, apparently, after the second appearance of \x1c) and the data begins:



import gzip
import numpy as np

train_data_filename = 'data_input/train-images-idx3-ubyte.gz'
IMAGE_SIZE = 28
NUM_CHANNELS = 1

def extract_data(filename, num_images):
    with gzip.open(filename) as bytestream:
        first30 = bytestream.read(30)
        return first30

first30 = extract_data(train_data_filename, 10)
print(first30)
# returns: b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'


If we modify the code to cast the bytes as np.float32, so that all values are numeric (floats), again there is no apparent pattern distinguishing where the header / metadata ends and where the data begins.
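Concretely, the cast I tried looks like the sketch below. (To make it self-contained here, I substitute an in-memory stand-in built from the 30 bytes printed above, rather than the actual file.)

```python
import gzip
import io
import numpy as np

# In-memory stand-in for the compressed file: the 16 header-looking bytes
# observed above, followed by 14 zero bytes.
header = b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c'
raw = gzip.compress(header + b'\x00' * 14)

with gzip.open(io.BytesIO(raw)) as bytestream:
    first30 = bytestream.read(30)

# Interpret each byte as an unsigned integer, then cast to float32.
as_floats = np.frombuffer(first30, dtype=np.uint8).astype(np.float32)
print(as_floats[:8])  # first eight values: 0, 0, 8, 3, 0, 0, 234, 96
```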



Any reference or advice would be much appreciated!



































  • It just specifies the buffer size. See this: stackoverflow.com/questions/1035340/…

    – i_th
    Jan 3 at 10:38













  • This is very specific to the individual data format. The folks writing the code presumably knew enough about what they were parsing to make the assumption at hand. Your question doesn't specify anything whatsoever about the data format the code is intended to parse, whereas the code's needs are entirely driven by that format's specification, so... how is this expected to be answerable?

    – Charles Duffy
    Jan 3 at 14:39













  • Only a very small subset of formats (like JSON, or s-expressions, or msgpack) are schema-carrying -- in a large majority of cases, details of which fields exist at which offsets &c. are out-of-band. If you're lucky, there's a formal specification document, or something like a protobuf spec that can be used to automatically generate machine parsers... but whether those things exist isn't something where we could even tell you where to start unless you find documentation from the author of the data describing the format it's in.

    – Charles Duffy
    Jan 3 at 14:44













  • I understand that the value 8 is chosen in a way specific to the problem at hand. But the question arises out of an observably general pattern, so I was really hoping to understand, on a more general level, how, given a gzip-compressed file, an end user can know to read from the file while skipping the first n characters of header / metadata, without the a priori knowledge that the author of the file has.

    – onlyphantom
    Jan 3 at 14:48











  • Why do you expect there to be any difference between deciding how large the header is in a stream that's being decompressed from gzip'd input and deciding how large the header is in any other format? This isn't a gzip header, it's a header within the content itself, and it's subject to that content's file format. Using gzip.open(...).read() will only ever return contents that are (from gzip's perspective) data -- there's no gzip-specific metadata returned at all.

    – Charles Duffy
    Jan 3 at 14:49




















python gzip






edited Jan 3 at 14:09







onlyphantom

















asked Jan 3 at 9:47










2 Answers
From gzip's perspective, everything it's returning to you is data. There is no metadata or gzip-specific header contents prepended to that data stream, so there's no need for any kind of algorithm to figure out how much content gzip is prepending to that stream: The number of bytes it prepends is zero.





Scroll down to the bottom of the page you linked; there's a header titled FILE FORMATS FOR THE MNIST DATABASE.



That format specification tells you exactly what the format is, and thus how many bytes are used for each header. Specifically, the first four items in each file are described as follows:



0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns


Thus, if you want to skip all four of those items, you would take 16 bytes off the top.
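To make that explicit in code rather than skipping 16 bytes blindly, one could parse the four big-endian 32-bit integers directly (a sketch, not from the original snippet; the field layout is the one in the table above):

```python
import gzip
import struct

def read_idx3_header(source):
    """Parse the 16-byte header: four big-endian unsigned 32-bit ints."""
    with gzip.open(source) as bytestream:
        # '>IIII' = four big-endian unsigned 32-bit integers.
        magic, n_images, n_rows, n_cols = struct.unpack('>IIII', bytestream.read(16))
        return magic, n_images, n_rows, n_cols

# For the MNIST training images this would yield (2051, 60000, 28, 28);
# the stream is then positioned at the start of the pixel data.
```

A sanity check such as `assert magic == 2051` then catches reading the wrong file, which a bare bytestream.read(16) silently would not.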






  • thanks a lot, this goes a long way toward clearing some of that up. Can you break down how you calculated "16 bytes off the top"? Really appreciate it.

    – onlyphantom
    Jan 3 at 14:54











  • The items listed are 4 32-bit integers; 8 bits to a byte; so (4*32/8) == 16.

    – Charles Duffy
    Jan 3 at 14:56











  • That's perfect. Thank you.

    – onlyphantom
    Jan 3 at 14:57






  • 1





    The file format looks a bit more complicated than that. The magic number encodes the data type and the number of dimensions of the data. The third byte is 0x08, which the page lists as meaning the data is in unsigned bytes. The fourth byte is 0x03, which means there will be three subsequent dimensions, each a 32-bit big-endian integer (I presume unsigned, but it doesn't say). So in total there will be a 16-byte header. It would seem the code snippet has basically ignored the general file format and focused on a specific subset of it.

    – Dunes
    Jan 3 at 15:09













  • @Dunes so if I understand this correctly, the first four bytes are as you explained in your first two sentences. Then there are 12 extra bytes, because we take 0x03 to mean three subsequent dimensions, each 4 bytes, for a total of 12. 12 + 4 (the first 4) = 16. Is that correct?

    – onlyphantom
    Jan 3 at 15:30
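Building on Dunes's comment, a small sketch (my own code, not from the original post): since byte 3 of the magic number gives the number of dimensions, the header length need not be hard-coded at all.

```python
def idx_header_length(first_four_bytes):
    """Header = 4-byte magic number + one big-endian uint32 per dimension.

    Byte 2 encodes the element type (0x08 = unsigned byte) and byte 3
    the number of dimensions, per the IDX format description.
    """
    ndim = first_four_bytes[3]
    return 4 + 4 * ndim

print(idx_header_length(b'\x00\x00\x08\x03'))  # images file: 16
print(idx_header_length(b'\x00\x00\x08\x01'))  # labels file: 8
```

This would also explain the bytestream.read(8) variant seen elsewhere: the labels file has a single dimension, hence an 8-byte header.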



















From the code snippet, bytestream.read(16) reads, and thereby skips, the first 16 bytes of bytestream. The documentation you quoted says that read() reads at most n characters from the stream; for the binary stream that gzip.open returns by default, each "character" is a single byte, so 16 characters occupy 16 bytes.



See more on chars and bytes https://pymotw.com/3/gzip/#reading-compressed-data
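To illustrate the distinction (my own round-trip sketch, not from the question): gzip.open defaults to binary mode, where read(n) returns n bytes; in text mode ('rt') it returns n decoded characters, which can differ once multi-byte characters are involved.

```python
import gzip
import io

buf = io.BytesIO()
with gzip.open(buf, 'wb') as f:
    f.write('héllo'.encode('utf-8'))  # 'é' occupies 2 bytes in UTF-8

buf.seek(0)
with gzip.open(buf, 'rb') as f:
    print(f.read(2))  # b'h\xc3' -- two raw bytes, cutting 'é' in half

buf.seek(0)
with gzip.open(buf, 'rt', encoding='utf-8') as f:
    print(f.read(2))  # 'hé' -- two decoded characters
```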



The code snippet is primarily interested in the contents of buf, skipping the first 16 bytes of the stream. To understand how to determine the parameter passed to the first bytestream.read() call, i.e. how many bytes of the decompressed image file to skip, we must understand what the rest of the code does. In particular: what file are we reading, and what are we trying to accomplish with the numpy library (saving the images in a 1-D numpy array?).



I am definitely not an expert on image processing, but it seems that bytestream.read(16) is a specific solution to the specific problem of processing one particular compressed image file. Thus, it is hard to tell how to determine how many bytes to skip without seeing more code and understanding more of the logic behind the snippet.






  • The code bytestream.read(16) reads and skips the first 16 characters, based on documentation I read online. Most likely it's trying to skip the metadata such as headers. However, how to determine that parameter to be 16 instead of 8 or 32 is still beyond me. Let me update the question a bit more with added details

    – onlyphantom
    Jan 3 at 13:46












2 Answers
From gzip's perspective, everything it's returning to you is data. There is no metadata or gzip-specific header contents prepended to that data stream, so there's no need for any kind of algorithm to figure out how much content gzip is prepending to that stream: The number of bytes it prepends is zero.





Scroll down to the bottom of the page you linked; there's a header titled FILE FORMATS FOR THE MNIST DATABASE.



That format specification tells you exactly what the format is, and thus how many bytes are used for each header. Specifically, the first four items in each file are described as follows:



0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns


Thus, if you want to skip all four of those items, you would take 16 bytes off the top.
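Rather than skipping the header blindly, the four integers can be unpacked explicitly with the struct module. This is a sketch, not code from the question: `read_mnist_images` is a hypothetical helper, and the big-endian `">IIII"` layout follows the format table above.

```python
import gzip
import struct

def read_mnist_images(path):
    """Parse an MNIST image file per the format table above.

    A sketch: instead of skipping 16 bytes blindly, unpack the four
    big-endian 32-bit integers (magic, #images, #rows, #columns).
    """
    with gzip.open(path, "rb") as f:
        header = f.read(16)  # 4 integers x 4 bytes each = 16 bytes
        magic, n_images, n_rows, n_cols = struct.unpack(">IIII", header)
        if magic != 2051:    # 0x00000803, per the spec
            raise ValueError("not an MNIST image file (magic=%d)" % magic)
        # The pixel data follows: one unsigned byte per pixel.
        buf = f.read(n_images * n_rows * n_cols)
        return n_images, n_rows, n_cols, buf
```

With this, `np.frombuffer(buf, dtype=np.uint8)` gives the pixels exactly as in the question's snippet, but with the image count and dimensions taken from the file instead of hard-coded.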






























  • thanks a lot, this is going a long way to clearing some of that. Can you break it down to how you calculated "16 bytes off the top"? Really appreciate it.

    – onlyphantom
    Jan 3 at 14:54











  • The items listed are 4 32-bit integers; 8 bits to a byte; so (4*32/8) == 16.

    – Charles Duffy
    Jan 3 at 14:56











  • That's perfect. Thank you.

    – onlyphantom
    Jan 3 at 14:57











    The file format looks a bit more complicated than that. The magic number encodes the data type and the number of dimensions of the data. The third byte is 0x08, which the page lists as meaning the data is in unsigned bytes. The fourth byte is 0x03, which means there will be three subsequent dimensions, each a 32-bit big-endian integer (I presume unsigned, but it doesn't say). So in total there will be a 16-byte header. It would seem the code snippet has basically ignored the general file format and focused on a specific subset of it.

    – Dunes
    Jan 3 at 15:09













  • @Dunes so if I understand this correctly, there’s the first four bytes up to how you explained it in the first two sentences. Then 12 extra bytes because we take it that 0x03 means three subsequent dimensions, each 4 bytes for a total of 12. 12+4 (first 4) = 16. Is that correct?

    – onlyphantom
    Jan 3 at 15:30
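Dunes' reading of the magic number generalizes: the IDX container MNIST uses puts the element type in the magic number's third byte and the dimension count in its fourth, followed by one big-endian 32-bit size per dimension. A sketch of that generic header parse (the type-code table is from the MNIST page; `parse_idx_header` is a hypothetical helper):

```python
import struct

# IDX element-type codes as listed on the MNIST page.
IDX_DTYPES = {
    0x08: "unsigned byte", 0x09: "signed byte", 0x0B: "short",
    0x0C: "int", 0x0D: "float", 0x0E: "double",
}

def parse_idx_header(stream):
    """Decode a generic IDX header: magic number, then per-dimension sizes."""
    magic = stream.read(4)
    dtype = IDX_DTYPES[magic[2]]   # 3rd byte: element type (0x08 -> uint8)
    ndim = magic[3]                # 4th byte: number of dimensions (0x03 -> 3)
    dims = struct.unpack(">" + "I" * ndim, stream.read(4 * ndim))
    return dtype, dims             # total header length: 4 + 4*ndim bytes
```

This also explains the `read(8)` variant the question mentions: the MNIST label files have one dimension rather than three, so their header is 4 + 4*1 = 8 bytes.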
















answered Jan 3 at 14:48 by Charles Duffy













From the code snippet, bytestream.read(16) reads (and thereby skips) the first 16 bytes of bytestream. The documentation you quoted says read() reads at most n characters from the stream, but on a binary stream each "character" is a single byte, so 16 characters occupy 16 bytes.



See more on reading compressed binary data: https://pymotw.com/3/gzip/#reading-compressed-data



The code snippet is primarily interested in the contents of buf, skipping the first 16 bytes of the stream. To determine the argument to the first bytestream.read() call, that is, how many bytes of the compressed image file to skip, we must understand what the rest of the code does: in particular, what file is being read and what the numpy calls are trying to accomplish (saving the images in a one-dimensional numpy array?).



I am definitely not an expert on image processing, but it seems that bytestream.read(16) is a solution specific to one particular compressed image format. It is hard to tell how many bytes to skip without seeing more code and understanding the logic behind the snippet.
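The byte-vs-character point can be verified directly: gzip.open defaults to binary mode, so read(n) consumes exactly n bytes of the decompressed stream. A minimal round-trip sketch, using an in-memory BytesIO rather than a real file:

```python
import gzip
import io

# Write 32 known bytes through gzip, then read them back in two chunks.
raw = io.BytesIO()
with gzip.open(raw, "wb") as f:
    f.write(bytes(range(32)))

raw.seek(0)
with gzip.open(raw, "rb") as f:
    head = f.read(16)   # the part a snippet would "skip"
    rest = f.read()     # everything after it

assert isinstance(head, bytes) and len(head) == 16
assert rest == bytes(range(16, 32))
```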














































answered Jan 3 at 12:02 by Alexander Strakhov












