Streaming a Parquet file in Python and only downsampling

I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.



Am I wrong to attempt to do this without using a Spark framework?



I have tried using pyarrow and fastparquet, but I get memory errors when trying to read the entire file in. Any tips or suggestions would be greatly appreciated!










python-3.x parquet pyarrow fastparquet

asked Jan 2 at 15:28 by Sjoseph
          2 Answers














          Spark is certainly a viable choice for this task.
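          For instance, a minimal PySpark sketch of that route: read the file, keep a random sample, and collect only the sample into pandas (the file name and the 10% sample fraction are just illustrative assumptions):

          from pyspark.sql import SparkSession

          # Minimal sketch: down-sample a large Parquet file with Spark and collect
          # only the sampled rows into a pandas DataFrame on the driver.
          spark = SparkSession.builder.appName("parquet-downsample").getOrCreate()

          sdf = spark.read.parquet("data.parquet")      # lazy; nothing is loaded yet
          sampled = sdf.sample(False, 0.1, seed=42)     # ~10% sample, without replacement
          pdf = sampled.toPandas()                      # only the sample hits driver memory

          print(len(pdf))
          spark.stop()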



          We're planning to add streaming read logic in pyarrow this year (2019; see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory use issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.
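          A minimal sketch of that approach, assuming a file named data.parquet and a 10% sample fraction (both illustrative): read one row group at a time and keep only a random sample of each chunk.

          import pandas as pd
          import pyarrow.parquet as pq

          # Stream the file one row group at a time, down-sample each chunk,
          # then concatenate the samples into a single pandas DataFrame.
          pf = pq.ParquetFile("data.parquet")

          samples = []
          for i in range(pf.num_row_groups):
              chunk = pf.read_row_group(i).to_pandas()   # only this row group is in memory
              samples.append(chunk.sample(frac=0.1, random_state=42))

          df = pd.concat(samples, ignore_index=True)
          print(df.shape)

          Note that this only helps if the file actually contains more than one row group.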






          answered Jan 2 at 16:15 by Wes McKinney
          • Thank you for the tips! I have queried the file using 'num_row_groups' and my file has only 1 'row_group'. I assume this means I won't have anything to gain by using 'read_row_group'?

            – Sjoseph
            Jan 2 at 17:03











          • No, you will only gain something from this when you write your Parquet files with multiple row groups. When using pyarrow for writing them, you should set the chunk_size argument to the number of rows that fit nicely into RAM. But beware that the smaller you set this argument, the slower reading gets. You're probably best off setting chunk_size=len(table) / 60 so that you get 100 MiB chunks (see the sketch after these comments).

            – Uwe L. Korn
            Jan 2 at 17:31











          • Thank you for the suggestion, but I do not have control of the Parquet file format. I assume my only option is to get set up with pyspark/spark?

            – Sjoseph
            Jan 2 at 17:42
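          For reference, a rough sketch of the writing side mentioned above: rewriting a table into a file with many small row groups so it can later be read back group by group. The file names are illustrative, and the row-group size argument is named row_group_size in current pyarrow (the comment above refers to it as chunk_size).

          import pyarrow.parquet as pq

          # Rewrite the data with roughly 60 row groups so each group can be read separately.
          # Reading the full table here still needs enough RAM for one complete pass.
          table = pq.read_table("data.parquet")
          rows_per_group = max(1, table.num_rows // 60)

          pq.write_table(table, "data_chunked.parquet",
                         row_group_size=rows_per_group)   # called chunk_size in older pyarrow releases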



















          This is not an answer; I'm posting here because this is the only relevant post I could find on Stack Overflow. I'm trying to use the read_row_group function, but Python just exits with code 139. There are no other error messages, and I'm not sure how to fix this.



          from pyarrow.parquet import ParquetFile

          path = "sample.parquet"
          f = ParquetFile(source=path)
          print(f.num_row_groups)  # prints the number of row groups

          # Reading the entire file works:
          df = f.read()

          # Trying to read a single row group crashes:
          row_df = f.read_row_group(0)
          # Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)


          Python version 3.6.3



          pyarrow version 0.11.1






          answered Jan 24 at 18:28 by neghez