Uncompressing .gz files and storing them in a .tar.gz archive
I have the following problem: I am writing a function that looks for a bunch of .gz
files, uncompresses them, and stores the individually uncompressed files in a bigger .tar.gz
archive. So far, I managed to implement it with the following code, but manually computing the uncompressed file size and setting the TarInfo
size seem rather hackish and I would like to know whether there is a more idiomatic solution to my problem:
import gzip
import os
import pathlib
import tarfile
def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with gzip.open(input_file) as fd:
                tar_info = tarfile.TarInfo(input_file.stem)
                # Seek to the end to learn the uncompressed size, then rewind
                tar_info.size = fd.seek(0, os.SEEK_END)
                fd.seek(0, os.SEEK_SET)
                tar.addfile(tar_info, fd)
Instead of constructing the TarInfo object manually, I tried to create it the following way:
tar_info = tar.gettarinfo(arcname=input_file.stem, fileobj=fd)
However, this function retrieves the path of the original .gz file we opened as fd to compute its size, and thus provides a tar_info.size corresponding to the compressed .gz data rather than to the uncompressed data, which is not what I want. Not setting the tar_info.size parameter at all doesn't work either, because addfile uses that size when passed a file object.
Is there a better, more idiomatic way to achieve this or am I stuck with my current solution?
python python-3.x gzip tarfile
Could you give an example of what you're trying to do? Are all these .gz files stored in a directory? Do you want to merge all these files into one tar.gz file? I'm just verifying that I understand your problem correctly.
– RoadRunner
Jan 2 at 15:12
I've got a directory with .gz files that I try to store individually uncompressed in a .tar.gz file.
– Morwenn
Jan 2 at 15:15
asked Jan 2 at 15:08 – Morwenn
1 Answer
Your approach is the only way to avoid decompressing the file completely to disk or RAM first. After all, you need to know the size ahead of time to add to the tar file, and gzip files don't really know their own decompressed size. The ISIZE header field theoretically provides the decompressed size, but the field was defined back in the 32-bit days, so it's actually the size modulo 2**32; a file that was originally 4 GiB and one that was 0 B would have the same ISIZE. Regardless, Python doesn't expose ISIZE, so even if it were useful, there would be no built-in way to get at it (you can always muck about with manual parsing, but that's not exactly clean or idiomatic).
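For illustration, ISIZE is just the last four bytes of a (single-member) gzip stream, stored little-endian per RFC 1952. A minimal sketch of the manual parsing mentioned above (the helper name is my own, not part of any library):

```python
import gzip
import struct

def gzip_isize(raw: bytes) -> int:
    """Return the ISIZE field of a single-member gzip stream.

    Per RFC 1952, this is the size of the uncompressed data
    modulo 2**32, stored little-endian in the last four bytes.
    """
    return struct.unpack('<I', raw[-4:])[0]

payload = b'x' * 100_000
print(gzip_isize(gzip.compress(payload)))  # 100000
```

Note the caveats: for a concatenated (multi-member) gzip file this reads only the last member's ISIZE, and any stream whose uncompressed size is a multiple of 2**32 apart from another is indistinguishable by this field alone.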
If you want to avoid decompressing the file twice (once to seek forward, once to actually add it to the tar file), at the expense of temporarily decompressing it to disk, you can spool the data through a tempfile.TemporaryFile (without needing to hold the whole file in memory) with a slight tweak:
import shutil
import tempfile
def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with tempfile.TemporaryFile() as tf:
                # Could combine both in one with, but this way we close the
                # gzip file ASAP
                with gzip.open(input_file) as fd:
                    shutil.copyfileobj(fd, tf)
                tar_info = tarfile.TarInfo(input_file.stem)
                tar_info.size = tf.tell()
                tf.seek(0)
                tar.addfile(tar_info, tf)
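As a quick sanity check, one can build a couple of .gz files in a scratch directory, run the function, and read the archive back. The file names and payloads below are made up for the demo, and the function is repeated so the snippet is self-contained:

```python
import gzip
import pathlib
import shutil
import tarfile
import tempfile

def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    # Same approach as above: spool each decompressed file through a
    # temporary file so its size is known before adding it to the tar.
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with tempfile.TemporaryFile() as tf:
                with gzip.open(input_file) as fd:
                    shutil.copyfileobj(fd, tf)
                tar_info = tarfile.TarInfo(input_file.stem)
                tar_info.size = tf.tell()
                tf.seek(0)
                tar.addfile(tar_info, tf)

with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    src = root / 'in'   # keep the output outside the globbed directory
    src.mkdir()
    for name, payload in [('a.txt', b'hello'), ('b.txt', b'world!')]:
        with gzip.open(src / (name + '.gz'), 'wb') as f:
            f.write(payload)
    out = root / 'bundle.tar.gz'
    gather_compressed_files(src, str(out))
    with tarfile.open(out, 'r:gz') as tar:
        print(sorted(tar.getnames()))           # ['a.txt', 'b.txt']
        print(tar.extractfile('a.txt').read())  # b'hello'
```

Writing the output archive to a sibling of the input directory matters here: if bundle.tar.gz lived inside input_dir, the `*.gz` glob would pick it up as an input on a later run.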
Thanks for the answer. It makes me a bit sad that ISIZE can't be used, but I guess we can't have all the toys we want to play with :)
– Morwenn
Jan 3 at 9:55
edited Jan 2 at 15:26
answered Jan 2 at 15:20 – ShadowRanger