Uncompressing .gz files and storing them in a .tar.gz archive

I have the following problem: I am writing a function that looks for a bunch of .gz files, uncompresses them, and stores the individually uncompressed files in a bigger .tar.gz archive. So far I have implemented it with the following code, but manually computing the uncompressed file size and setting the TarInfo size seems rather hackish, and I would like to know whether there is a more idiomatic solution to my problem:



import gzip
import os
import pathlib
import tarfile

def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with gzip.open(input_file) as fd:
                tar_info = tarfile.TarInfo(input_file.stem)
                # Seeking to the end decompresses the whole stream once,
                # which yields the uncompressed size.
                tar_info.size = fd.seek(0, os.SEEK_END)
                # Rewind so addfile() can read (and decompress) the data again.
                fd.seek(0, os.SEEK_SET)
                tar.addfile(tar_info, fd)


I tried to create a TarInfo object the following way instead of manually creating it:



tar_info = tar.gettarinfo(arcname=input_file.stem, fileobj=fd)


However, this function uses the original .gz file that we opened as fd to compute the size, and thus only provides a tar_info.size corresponding to the compressed .gz data and not to the uncompressed data, which is not what I want. Not setting tar_info.size at all doesn't work either, because addfile uses that size to decide how many bytes to read when passed a file object.
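To illustrate the size mismatch, here is a minimal sketch. The helper name sizes_for, the sample file name, and the in-memory tar archive are only assumptions made for the demonstration; they are not part of my real code.

```python
import gzip
import io
import os
import tarfile
import tempfile


def sizes_for(payload: bytes) -> tuple[int, int, int]:
    """Return (size reported by gettarinfo, .gz size on disk, real size).

    Hypothetical helper, written only for this demonstration.
    """
    with tempfile.TemporaryDirectory() as tmp:
        gz_path = os.path.join(tmp, "sample.txt.gz")
        with gzip.open(gz_path, "wb") as out:
            out.write(payload)

        # An in-memory tar archive is enough; nothing is added to it here.
        with tarfile.open(fileobj=io.BytesIO(), mode="w") as tar:
            with gzip.open(gz_path) as fd:
                # gettarinfo() stats fd.fileno(), i.e. the .gz file on disk,
                # so it never sees the decompressed stream.
                info = tar.gettarinfo(arcname="sample.txt", fileobj=fd)
        return info.size, os.path.getsize(gz_path), len(payload)


reported, compressed, uncompressed = sizes_for(b"a" * 100_000)
print(reported == compressed)    # size matches the compressed file
print(reported == uncompressed)  # and therefore not the real payload
```

With a highly compressible payload like this one, the reported size is that of the small .gz file rather than the 100,000 bytes that addfile would need to copy.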



Is there a better, more idiomatic way to achieve this or am I stuck with my current solution?

  • Could you give an example of what you're trying to do? Are all these tar.gz files stored in a directory? Do you want to merge all these files into one tar.gz file? I'm just verifying so I understand your problem correctly.

    – RoadRunner
    Jan 2 at 15:12

  • I've got a directory of .gz files, and I am trying to store each of them, individually uncompressed, in a single .tar.gz file.

    – Morwenn
    Jan 2 at 15:15

python python-3.x gzip tarfile






asked Jan 2 at 15:08
Morwenn

1 Answer

Your approach is the only way to avoid decompressing the file completely to disk or RAM. After all, you need to know the size ahead of time to add the file to the tar archive, and gzip files don't really know their own decompressed size. The ISIZE field (stored in the gzip trailer rather than the header) theoretically provides the decompressed size, but the field was defined back in the 32-bit days, so it's actually the size modulo 2**32; a file that was originally exactly 4 GiB and one that was 0 B would have the same ISIZE. Regardless, Python's gzip module doesn't expose ISIZE, so even if it were useful, there would be no built-in way to do this (you can always muck about with manual parsing, but that's not exactly clean or idiomatic).
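For the curious, the manual parsing could look roughly like the sketch below. It assumes a single-member gzip file with nothing appended after the trailer, and read_isize is a hypothetical helper name, not something the standard library provides. The value is only meaningful when the original data was smaller than 4 GiB.

```python
import gzip
import os
import struct
import tempfile


def read_isize(path: str) -> int:
    """Return the ISIZE trailer field of a gzip file (size modulo 2**32).

    Assumes a single-member gzip file with no trailing data.
    """
    with open(path, "rb") as raw:
        # ISIZE occupies the last 4 bytes of the file.
        raw.seek(-4, os.SEEK_END)
        # It is stored as a little-endian unsigned 32-bit integer.
        (isize,) = struct.unpack("<I", raw.read(4))
    return isize


with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "sample.gz")
    with gzip.open(gz_path, "wb") as out:
        out.write(b"x" * 12345)
    isize = read_isize(gz_path)

print(isize)  # 12345
```

Because the field wraps at 2**32, code relying on it would have to know from some other source that the payload is small enough, which is part of why I would not call this idiomatic.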



If you want to avoid decompressing the file twice (once to seek to the end, once to actually add it to the tar file), you can decompress it to disk instead. A tempfile.TemporaryFile does this without holding the decompressed data in memory, and it only needs a slight tweak to your code:



import gzip
import pathlib
import shutil
import tarfile
import tempfile

def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with tempfile.TemporaryFile() as tf:
                # Could combine both in one with, but this way we close the gzip
                # file ASAP
                with gzip.open(input_file) as fd:
                    shutil.copyfileobj(fd, tf)
                tar_info = tarfile.TarInfo(input_file.stem)
                # After the copy, the file position equals the decompressed size
                tar_info.size = tf.tell()
                tf.seek(0)
                tar.addfile(tar_info, tf)
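If most of the inputs are small, one possible variation (not something the question requires) is tempfile.SpooledTemporaryFile, which keeps data in memory until it exceeds max_size and only then rolls over to disk. The sketch below is self-contained and ends with a small round-trip check; the name gather_spooled, the 1 MiB SPOOL_LIMIT, and the sample files are assumptions chosen for illustration.

```python
import gzip
import pathlib
import shutil
import tarfile
import tempfile

# Assumed threshold: payloads up to 1 MiB stay in RAM, larger ones go to disk.
SPOOL_LIMIT = 1024 * 1024


def gather_spooled(input_dir: pathlib.Path, output_file: str) -> None:
    """Variant of gather_compressed_files using a spooled temporary file."""
    with tarfile.open(output_file, "w:gz") as tar:
        for input_file in sorted(input_dir.glob("*.gz")):
            with tempfile.SpooledTemporaryFile(max_size=SPOOL_LIMIT) as tf:
                with gzip.open(input_file) as fd:
                    shutil.copyfileobj(fd, tf)
                tar_info = tarfile.TarInfo(input_file.stem)
                # The position after copying is the decompressed size.
                tar_info.size = tf.tell()
                tf.seek(0)
                tar.addfile(tar_info, tf)


# Round-trip check on two throwaway inputs.
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp, "in")
    src.mkdir()
    with gzip.open(src / "a.txt.gz", "wb") as out:
        out.write(b"hello")
    with gzip.open(src / "b.txt.gz", "wb") as out:
        out.write(b"world!" * 1000)

    # Keep the archive outside the input directory so the glob skips it.
    archive = str(pathlib.Path(tmp, "out.tar.gz"))
    gather_spooled(src, archive)

    with tarfile.open(archive) as tar:
        members = {m.name: m.size for m in tar.getmembers()}

print(members)
```

Each member should be listed under its name without the .gz suffix and with its decompressed size, which is a quick way to confirm that tar_info.size was set correctly.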





  • Thanks for the answer. It makes me a bit sad that ISIZE can't be used, but I guess we can't have all the toys we want to play with :)

    – Morwenn
    Jan 3 at 9:55

edited Jan 2 at 15:26
answered Jan 2 at 15:20
ShadowRanger