The size parameter for gzip.open().read()



























When working with the gzip library in Python, I often come across code that uses the .read() method in a pattern like this:



with gzip.open(filename) as bytestream:
    bytestream.read(16)
    buf = bytestream.read(
        IMAGE_SIZE * IMAGE_SIZE * num_images * NUM_CHANNELS
    )
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)


While I'm familiar with the context manager pattern, I struggle to grasp what the first line of code inside the with block is actually doing.



This is the documentation for the read() function:




Read at most n characters from stream.



Read from underlying buffer until we have n characters or we hit EOF.
If n is negative or omitted, read until EOF.




If that is the case, the functional role of the first line, bytestream.read(16), would have to be reading, and thus skipping, the first 16 characters, presumably because they act as metadata or a header. However, given some images, how would I know to use 16 as the argument for the read call, instead of, say, 32, 8, or 64?



I recall plenty of times coming across code identical to the above, except that the author uses bytestream.read(8) instead of bytestream.read(16), or just as likely some other value. Digging into the file character by character shows no discernible pattern that would determine the length of the header.



In other words, how does one determine the parameter to use in the read call? Or, how does one know the length of the header in a gzip-compressed file?



My guess was that it has something to do with the bytes, but after searching through the documentation and online references I can't confirm that.



Reproducible details



My hypothesis, after countless hours of troubleshooting, is that the first 16 characters represent some sort of header or metadata. So the first line in that code skips those 16 characters, and the remainder is stored in a variable named buf. However, digging into the data, I found no way to determine why or how the value 16 was chosen. I have read the bytes in character by character, and also tried reading and casting them as np.float, but there is no discernible pattern suggesting the metadata ends at the 16th character and the actual data begins at the 17th.



The following code reads the data from this website and extracts the first 30 characters. Notice that it is indiscernible where the header "ends" (at the 16th byte, apparently, after the second appearance of \x1c) and the data begins:



import gzip
import numpy as np

train_data_filename = 'data_input/train-images-idx3-ubyte.gz'
IMAGE_SIZE = 28
NUM_CHANNELS = 1

def extract_data(filename, num_images):
    with gzip.open(filename) as bytestream:
        first30 = bytestream.read(30)
        return first30

first30 = extract_data(train_data_filename, 10)
print(first30)
# returns: b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'


If we modify the code to cast the bytes as np.float32, so that all values are numeric (floats), again there is no apparent pattern distinguishing where the header / metadata ends and where the data begins.
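Concretely, the cast I tried looks like the sketch below. (To make it self-contained here, I substitute an in-memory stand-in built from the 30 bytes printed above, rather than the actual file.)

```python
import gzip
import io
import numpy as np

# In-memory stand-in for the compressed file: the 16 header-looking bytes
# observed above, followed by 14 zero bytes.
header = b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c'
raw = gzip.compress(header + b'\x00' * 14)

with gzip.open(io.BytesIO(raw)) as bytestream:
    first30 = bytestream.read(30)

# Interpret each byte as an unsigned integer, then cast to float32.
as_floats = np.frombuffer(first30, dtype=np.uint8).astype(np.float32)
print(as_floats[:8])  # first eight values: 0, 0, 8, 3, 0, 0, 234, 96
```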



Any reference or advice would be much appreciated!



































  • It just specifies the buffer size. See this: stackoverflow.com/questions/1035340/…

    – i_th
    Jan 3 at 10:38













  • This is very specific to the individual data format. The folks writing the code presumably knew enough about what they were parsing to make the assumption at hand. Your question doesn't specify anything whatsoever about the data format the code is intended to parse, whereas the code's needs are entirely driven by that format's specification, so... how is this expected to be answerable?

    – Charles Duffy
    Jan 3 at 14:39













  • Only a very small subset of formats (like JSON, or s-expressions, or msgpack) are schema-carrying -- in a large majority of cases, details of which fields exist at which offsets &c. are out-of-band. If you're lucky, there's a formal specification document, or something like a protobuf spec that can be used to automatically generate machine parsers... but whether those things exist isn't something where we could even tell you where to start unless you find documentation from the author of the data describing the format it's in.

    – Charles Duffy
    Jan 3 at 14:44













  • I understand that the value 8 is chosen in a way specific to the problem at hand. But the question arises out of an observably general pattern, so I was really hoping to understand, on a more general level, how, given a gzip-compressed file, an end user can know to read from the file while skipping the first n characters of header / metadata, without the a priori knowledge that the author of the file has.

    – onlyphantom
    Jan 3 at 14:48











  • Why do you expect there to be any difference between deciding how large the header is in a stream that's being decompressed from gzip'd input and deciding how large the header is in any other format? This isn't a gzip header, it's a header within the content itself, and it's subject to that content's file format. Using gzip.open(...).read() will only ever return contents that are (from gzip's perspective) data -- there's no gzip-specific metadata returned at all.

    – Charles Duffy
    Jan 3 at 14:49




















python gzip






edited Jan 3 at 14:09







onlyphantom

















asked Jan 3 at 9:47










2 Answers
From gzip's perspective, everything it's returning to you is data. There is no metadata or gzip-specific header contents prepended to that data stream, so there's no need for any kind of algorithm to figure out how much content gzip is prepending to that stream: The number of bytes it prepends is zero.





Scroll down to the bottom of the page you linked; there's a header titled FILE FORMATS FOR THE MNIST DATABASE.



That format specification tells you exactly what the format is, and thus how many bytes are used for each header. Specifically, the first four items in each file are described as follows:



0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns


Thus, if you want to skip all four of those items, you would take 16 bytes off the top.
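To make that explicit in code rather than skipping 16 bytes blindly, one could parse the four big-endian 32-bit integers directly (a sketch, not from the original snippet; the field layout is the one in the table above):

```python
import gzip
import struct

def read_idx3_header(source):
    """Parse the 16-byte header: four big-endian unsigned 32-bit ints."""
    with gzip.open(source) as bytestream:
        # '>IIII' = four big-endian unsigned 32-bit integers.
        magic, n_images, n_rows, n_cols = struct.unpack('>IIII', bytestream.read(16))
        return magic, n_images, n_rows, n_cols

# For the MNIST training images this would yield (2051, 60000, 28, 28);
# the stream is then positioned at the start of the pixel data.
```

A sanity check such as `assert magic == 2051` then catches reading the wrong file, which a bare bytestream.read(16) silently would not.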






  • thanks a lot, this goes a long way toward clearing some of that up. Can you break down how you calculated "16 bytes off the top"? Really appreciate it.

    – onlyphantom
    Jan 3 at 14:54











  • The items listed are 4 32-bit integers; 8 bits to a byte; so (4*32/8) == 16.

    – Charles Duffy
    Jan 3 at 14:56











  • That's perfect. Thank you.

    – onlyphantom
    Jan 3 at 14:57






  • 1





    The file format looks a bit more complicated than that. The magic number encodes the data type and the number of dimensions of the data. The third byte is 0x08, which the page lists as meaning the data is in unsigned bytes. The fourth byte is 0x03, which means there will be three subsequent dimensions, each a 32-bit big-endian integer (I presume unsigned, but it doesn't say). So in total there will be a 16-byte header. It would seem the code snippet has basically ignored the general file format and focused on a specific subset of it.

    – Dunes
    Jan 3 at 15:09













  • @Dunes so if I understand this correctly, the first four bytes are as you explained in your first two sentences. Then there are 12 extra bytes, because we take 0x03 to mean three subsequent dimensions, each 4 bytes, for a total of 12. 12 + 4 (the first 4) = 16. Is that correct?

    – onlyphantom
    Jan 3 at 15:30
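Building on Dunes's comment, a small sketch (my own code, not from the original post): since byte 3 of the magic number gives the number of dimensions, the header length need not be hard-coded at all.

```python
def idx_header_length(first_four_bytes):
    """Header = 4-byte magic number + one big-endian uint32 per dimension.

    Byte 2 encodes the element type (0x08 = unsigned byte) and byte 3
    the number of dimensions, per the IDX format description.
    """
    ndim = first_four_bytes[3]
    return 4 + 4 * ndim

print(idx_header_length(b'\x00\x00\x08\x03'))  # images file: 16
print(idx_header_length(b'\x00\x00\x08\x01'))  # labels file: 8
```

This would also explain the bytestream.read(8) variant seen elsewhere: the labels file has a single dimension, hence an 8-byte header.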



















From the code snippet, bytestream.read(16) reads, and thereby skips, the first 16 bytes of bytestream. The documentation you quoted says that read() reads at most n characters from the stream; for the binary stream that gzip.open returns by default, each "character" is a single byte, so 16 characters occupy 16 bytes.



See more on chars and bytes https://pymotw.com/3/gzip/#reading-compressed-data
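To illustrate the distinction (my own round-trip sketch, not from the question): gzip.open defaults to binary mode, where read(n) returns n bytes; in text mode ('rt') it returns n decoded characters, which can differ once multi-byte characters are involved.

```python
import gzip
import io

buf = io.BytesIO()
with gzip.open(buf, 'wb') as f:
    f.write('héllo'.encode('utf-8'))  # 'é' occupies 2 bytes in UTF-8

buf.seek(0)
with gzip.open(buf, 'rb') as f:
    print(f.read(2))  # b'h\xc3' -- two raw bytes, cutting 'é' in half

buf.seek(0)
with gzip.open(buf, 'rt', encoding='utf-8') as f:
    print(f.read(2))  # 'hé' -- two decoded characters
```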



The code snippet is primarily interested in the contents of buf, skipping the first 16 bytes of the stream. To understand how to determine the parameter passed to the first bytestream.read() call, i.e. how many bytes of the decompressed image file to skip, we must understand what the rest of the code does. In particular: what file are we reading, and what are we trying to accomplish with the numpy library (saving the images in a 1-D numpy array?).



I am definitely not an expert on image processing, but it seems that bytestream.read(16) is a specific solution to the specific problem of processing one particular compressed image file. Thus, it is hard to tell how to determine how many bytes to skip without seeing more code and understanding more of the logic behind the snippet.






  • The code bytestream.read(16) reads and skips the first 16 characters, based on documentation I read online. Most likely it's trying to skip the metadata such as headers. However, how to determine that parameter to be 16 instead of 8 or 32 is still beyond me. Let me update the question a bit more with added details

    – onlyphantom
    Jan 3 at 13:46












2 Answers
From gzip's perspective, everything it's returning to you is data. There is no metadata or gzip-specific header contents prepended to that data stream, so there's no need for any kind of algorithm to figure out how much content gzip is prepending to that stream: The number of bytes it prepends is zero.





Scroll down to the bottom of the page you linked; there's a header titled FILE FORMATS FOR THE MNIST DATABASE.



That format specification tells you exactly what the format is, and thus how many bytes are used for each header. Specifically, the first four items in each file are described as follows:



0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns


Thus, if you want to skip all four of those items, you would take 16 bytes off the top.
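Rather than skipping the header blindly, the four integers can be unpacked explicitly with the struct module. This is a sketch, not code from the question: `read_mnist_images` is a hypothetical helper, and the big-endian `">IIII"` layout follows the format table above.

```python
import gzip
import struct

def read_mnist_images(path):
    """Parse an MNIST image file per the format table above.

    A sketch: instead of skipping 16 bytes blindly, unpack the four
    big-endian 32-bit integers (magic, #images, #rows, #columns).
    """
    with gzip.open(path, "rb") as f:
        header = f.read(16)  # 4 integers x 4 bytes each = 16 bytes
        magic, n_images, n_rows, n_cols = struct.unpack(">IIII", header)
        if magic != 2051:    # 0x00000803, per the spec
            raise ValueError("not an MNIST image file (magic=%d)" % magic)
        # The pixel data follows: one unsigned byte per pixel.
        buf = f.read(n_images * n_rows * n_cols)
        return n_images, n_rows, n_cols, buf
```

With this, `np.frombuffer(buf, dtype=np.uint8)` gives the pixels exactly as in the question's snippet, but with the image count and dimensions taken from the file instead of hard-coded.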






























  • thanks a lot, this is going a long way to clearing some of that. Can you break it down to how you calculated "16 bytes off the top"? Really appreciate it.

    – onlyphantom
    Jan 3 at 14:54











  • The items listed are 4 32-bit integers; 8 bits to a byte; so (4*32/8) == 16.

    – Charles Duffy
    Jan 3 at 14:56











  • That's perfect. Thank you.

    – onlyphantom
    Jan 3 at 14:57











    The file format looks a bit more complicated than that. The magic number encodes the data type and the number of dimensions of the data. The third byte is 0x08, which the page lists as meaning the data is in unsigned bytes. The fourth byte is 0x03, which means there will be three subsequent dimensions, each a 32-bit big-endian integer (I presume unsigned, but it doesn't say). So in total there will be a 16-byte header. It would seem the code snippet has basically ignored the general file format and focused on a specific subset of it.

    – Dunes
    Jan 3 at 15:09













  • @Dunes so if I understand this correctly, there’s the first four bytes up to how you explained it in the first two sentences. Then 12 extra bytes because we take it that 0x03 means three subsequent dimensions, each 4 bytes for a total of 12. 12+4 (first 4) = 16. Is that correct?

    – onlyphantom
    Jan 3 at 15:30
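Dunes' reading of the magic number generalizes: the IDX container MNIST uses puts the element type in the magic number's third byte and the dimension count in its fourth, followed by one big-endian 32-bit size per dimension. A sketch of that generic header parse (the type-code table is from the MNIST page; `parse_idx_header` is a hypothetical helper):

```python
import struct

# IDX element-type codes as listed on the MNIST page.
IDX_DTYPES = {
    0x08: "unsigned byte", 0x09: "signed byte", 0x0B: "short",
    0x0C: "int", 0x0D: "float", 0x0E: "double",
}

def parse_idx_header(stream):
    """Decode a generic IDX header: magic number, then per-dimension sizes."""
    magic = stream.read(4)
    dtype = IDX_DTYPES[magic[2]]   # 3rd byte: element type (0x08 -> uint8)
    ndim = magic[3]                # 4th byte: number of dimensions (0x03 -> 3)
    dims = struct.unpack(">" + "I" * ndim, stream.read(4 * ndim))
    return dtype, dims             # total header length: 4 + 4*ndim bytes
```

This also explains the `read(8)` variant the question mentions: the MNIST label files have one dimension rather than three, so their header is 4 + 4*1 = 8 bytes.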
















answered Jan 3 at 14:48 by Charles Duffy













From the code snippet, bytestream.read(16) reads (and thereby skips) the first 16 bytes of bytestream. The documentation you quoted says read() reads at most n characters from the stream, but on a binary stream each "character" is a single byte, so 16 characters occupy 16 bytes.



See more on reading compressed binary data: https://pymotw.com/3/gzip/#reading-compressed-data



The code snippet is primarily interested in the contents of buf, skipping the first 16 bytes of the stream. To determine the argument to the first bytestream.read() call, that is, how many bytes of the compressed image file to skip, we must understand what the rest of the code does: in particular, what file is being read and what the numpy calls are trying to accomplish (saving the images in a one-dimensional numpy array?).



I am definitely not an expert on image processing, but it seems that bytestream.read(16) is a solution specific to one particular compressed image format. It is hard to tell how many bytes to skip without seeing more code and understanding the logic behind the snippet.
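The byte-vs-character point can be verified directly: gzip.open defaults to binary mode, so read(n) consumes exactly n bytes of the decompressed stream. A minimal round-trip sketch, using an in-memory BytesIO rather than a real file:

```python
import gzip
import io

# Write 32 known bytes through gzip, then read them back in two chunks.
raw = io.BytesIO()
with gzip.open(raw, "wb") as f:
    f.write(bytes(range(32)))

raw.seek(0)
with gzip.open(raw, "rb") as f:
    head = f.read(16)   # the part a snippet would "skip"
    rest = f.read()     # everything after it

assert isinstance(head, bytes) and len(head) == 16
assert rest == bytes(range(16, 32))
```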














































answered Jan 3 at 12:02 by Alexander Strakhov












