Uncompressing .gz files and storing them in a .tar.gz archive
I have the following problem: I am writing a function that looks for a bunch of .gz files, uncompresses them, and stores the individually uncompressed files in a bigger .tar.gz archive. So far I have managed to implement it with the following code, but manually computing the uncompressed file size and setting the TarInfo size seems rather hackish, and I would like to know whether there is a more idiomatic solution to my problem:



import gzip
import os
import pathlib
import tarfile

def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with gzip.open(input_file) as fd:
                tar_info = tarfile.TarInfo(input_file.stem)
                tar_info.size = fd.seek(0, os.SEEK_END)
                fd.seek(0, os.SEEK_SET)
                tar.addfile(tar_info, fd)
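As a quick sanity check (a throwaway demo of the size trick above, not part of the function), seeking to the end of a gzip.GzipFile opened for reading really does return the decompressed size:

```python
import gzip
import os
import tempfile

# Write a small throwaway .gz file whose decompressed size we know.
payload = b'x' * 1000
with tempfile.NamedTemporaryFile(suffix='.gz', delete=False) as tmp:
    path = tmp.name
with gzip.open(path, 'wb') as fd:
    fd.write(payload)

# seek(0, SEEK_END) on a read-mode GzipFile decompresses through the
# stream and returns the decompressed size, not the on-disk size.
with gzip.open(path) as fd:
    size = fd.seek(0, os.SEEK_END)

assert size == len(payload)          # 1000 decompressed bytes
assert os.path.getsize(path) < size  # the .gz on disk is much smaller
os.remove(path)
```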


Instead of constructing the TarInfo manually, I tried to create it the following way:



tar_info = tar.gettarinfo(arcname=input_file.stem, fileobj=fd)


However, this function uses the path of the original .gz file we opened as fd to compute its size, and thus provides a tar_info.size corresponding to the compressed .gz data rather than to the uncompressed data, which is not what I want. Not setting the tar_info.size parameter at all doesn't work either, because addfile uses that size when passed a file object.



Is there a better, more idiomatic way to achieve this or am I stuck with my current solution?
  • Could you give an example of what you're trying to do? Are all these tar.gz files stored in a directory? Do you want to merge all these files into one tar.gz file? I'm just verifying so I understand your problem correctly.

    – RoadRunner
    Jan 2 at 15:12
  • I've got a directory with .gz files that I try to store individually uncompressed in a .tar.gz file.

    – Morwenn
    Jan 2 at 15:15
python python-3.x gzip tarfile
asked Jan 2 at 15:08
Morwenn
1 Answer
Your approach is the only way to avoid decompressing the file completely to disk or RAM. After all, you need to know the size ahead of time to add it to the tar file, and gzip files don't really know their own decompressed size. The ISIZE header field theoretically provides the decompressed size, but the field was defined back in the 32-bit days, so it's actually the size modulo 2**32; a file that was originally 4 GiB and one that was 0 B would have the same ISIZE. Regardless, Python doesn't expose ISIZE, so even if it were useful, there would be no built-in way to get at it (you can always muck about with manual parsing, but that's not exactly clean or idiomatic).
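For the curious, the ISIZE field mentioned above can be read by hand; this is a sketch under the stated caveats (single-member gzip stream, original size known to be below 4 GiB), not something the gzip module offers:

```python
import gzip
import os
import struct
import tempfile

def gzip_isize(path):
    """Read the gzip ISIZE trailer: the last 4 bytes of the file,
    little-endian, holding the uncompressed size modulo 2**32. Only
    meaningful for a single-member stream originally under 4 GiB."""
    with open(path, 'rb') as f:
        f.seek(-4, os.SEEK_END)
        return struct.unpack('<I', f.read(4))[0]

# Demo on a throwaway file with a known decompressed size.
with tempfile.NamedTemporaryFile(suffix='.gz', delete=False) as tmp:
    path = tmp.name
with gzip.open(path, 'wb') as fd:
    fd.write(b'y' * 5000)

value = gzip_isize(path)
os.remove(path)
assert value == 5000
```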



If you want to avoid decompressing the file twice (once to seek forward, once to actually add it to the tar file), at the expense of decompressing it to disk, you can use a tempfile.TemporaryFile to avoid double decompression (without needing to hold the decompressed data in memory) with a slight tweak:



import shutil
import tempfile

def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with tempfile.TemporaryFile() as tf:
                # Could combine both in one with, but this way we close
                # the gzip file ASAP
                with gzip.open(input_file) as fd:
                    shutil.copyfileobj(fd, tf)
                tar_info = tarfile.TarInfo(input_file.stem)
                tar_info.size = tf.tell()
                tf.seek(0)
                tar.addfile(tar_info, tf)
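A possible variation (an addition of mine, not part of the answer): tempfile.SpooledTemporaryFile works the same way but keeps the decompressed data in memory until it grows past a chosen threshold, so small files never touch the disk:

```python
import gzip
import pathlib
import shutil
import tarfile
import tempfile

def gather_compressed_files(input_dir: pathlib.Path, output_file: str,
                            spool_limit: int = 16 * 1024 * 1024):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            # Buffers in memory, transparently rolling over to a real
            # temporary file once the data exceeds spool_limit bytes.
            with tempfile.SpooledTemporaryFile(max_size=spool_limit) as tf:
                with gzip.open(input_file) as fd:
                    shutil.copyfileobj(fd, tf)
                tar_info = tarfile.TarInfo(input_file.stem)
                tar_info.size = tf.tell()
                tf.seek(0)
                tar.addfile(tar_info, tf)
```

The 16 MiB spool_limit default here is an arbitrary choice; tune it to the sizes you expect. Note that the output archive should not be written into input_dir, or the glob may pick it up.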
  • Thanks for the answer. It makes me a bit sad that ISIZE can't be used, but I guess we can't have all the toys we want to play with :)

    – Morwenn
    Jan 3 at 9:55
edited Jan 2 at 15:26
answered Jan 2 at 15:20
ShadowRanger