Streaming a parquet file in Python and downsampling
I have data in parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.
Am I wrong to attempt to do this without using a Spark framework?
I have tried using pyarrow and fastparquet, but I get memory errors when trying to read the entire file in.
Any tips or suggestions would be greatly appreciated!
python-3.x parquet pyarrow fastparquet
asked Jan 2 at 15:28 by Sjoseph
2 Answers
Spark is certainly a viable choice for this task.
We're planning to add streaming read logic in pyarrow this year (2019, see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory-use issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.
answered Jan 2 at 16:15 by Wes McKinney
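A minimal sketch of that approach, assuming a local file path, a pandas downsampling step, and enough RAM to hold one row group plus the accumulated sample (the path and sample fraction below are placeholders):
import pandas as pd
import pyarrow.parquet as pq

PATH = "big_file.parquet"   # placeholder path
SAMPLE_FRACTION = 0.01      # assumed downsampling rate

pf = pq.ParquetFile(PATH)
pieces = []
for i in range(pf.num_row_groups):
    # Read a single row group into memory as a pyarrow Table, then convert it.
    chunk = pf.read_row_group(i).to_pandas()
    # Down-sample the chunk before keeping it.
    pieces.append(chunk.sample(frac=SAMPLE_FRACTION, random_state=0))

df = pd.concat(pieces, ignore_index=True)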
Thank you for the tips! I have queried the file using 'num_row_groups' and my file has only 1 'row_group'. I assume this means I won't have anything to gain by using 'read_row_group'?
– Sjoseph
Jan 2 at 17:03
No, you will only gain something from this when you write your Parquet files with multiple row groups. When using pyarrow for writing them, you should set the chunk_size argument to the number of rows that fit nicely into RAM. But beware that the smaller you set this argument, the slower reading gets. You're probably best off setting chunk_size=len(table) / 60 so that you get 100 MiB chunks.
– Uwe L. Korn
Jan 2 at 17:31
Thank you for the suggestion, but I do not have control of the parquet file format. I assume my only option is to get set up with pyspark/spark?
– Sjoseph
Jan 2 at 17:42
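For completeness, a rough sketch of what Uwe's comment describes: rewriting a table into Parquet with multiple row groups so that a later reader can process it group by group. This only helps if you control the writer; the table construction below is a placeholder, and in pyarrow the relevant write_table argument is named row_group_size (the chunk_size Uwe refers to), so check the docs of your installed version.
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder table; in practice this is whatever the producer already has in memory.
table = pa.table({"x": list(range(1_000_000))})

# Aim for roughly 60 row groups so a reader can stream the file piecewise.
rows_per_group = max(1, len(table) // 60)
pq.write_table(table, "multi_row_group.parquet", row_group_size=rows_per_group)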
This is not an answer; I'm posting here because this is the only relevant post I can find on Stack Overflow. I'm trying to use the read_row_group function, but Python just exits with code 139. There are no other error messages, and I'm not sure how to fix this.
from pyarrow.parquet import ParquetFile

path = "sample.parquet"
f = ParquetFile(source=path)
print(f.num_row_groups)  # prints the number of row groups

# Reading the entire file works:
df = f.read()

# Trying to read a single row group crashes:
row_df = f.read_row_group(0)
# Output:
# Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Python version 3.6.3
pyarrow version 0.11.1
answered Jan 24 at 18:28 by neghez
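One possible workaround, sketched but not verified against this exact file: fastparquet's ParquetFile exposes iter_row_groups(), which yields one pandas DataFrame per row group and avoids the pyarrow code path that segfaults here (upgrading pyarrow beyond 0.11.1 may also resolve it). The path and sample fraction below are placeholders.
import pandas as pd
from fastparquet import ParquetFile

pf = ParquetFile("sample.parquet")   # placeholder path
pieces = []
for chunk in pf.iter_row_groups():
    # Each chunk is a pandas DataFrame holding one row group.
    pieces.append(chunk.sample(frac=0.01, random_state=0))

df = pd.concat(pieces, ignore_index=True)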