Streaming a Parquet file in Python and only downsampling

I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.



Am I wrong to attempt to do this without using a Spark framework?



I have tried using pyarrow and fastparquet, but I get memory errors when trying to read the entire file in. Any tips or suggestions would be greatly appreciated!










python-3.x parquet pyarrow fastparquet

asked Jan 2 at 15:28 by Sjoseph
          2 Answers














          Spark is certainly a viable choice for this task.
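          For instance, a minimal PySpark sketch of that route: read the file, keep a random sample, and collect only the sample into pandas (the file name and the 10% sample fraction are just illustrative assumptions):

          from pyspark.sql import SparkSession

          # Minimal sketch: down-sample a large Parquet file with Spark and collect
          # only the sampled rows into a pandas DataFrame on the driver.
          spark = SparkSession.builder.appName("parquet-downsample").getOrCreate()

          sdf = spark.read.parquet("data.parquet")      # lazy; nothing is loaded yet
          sampled = sdf.sample(False, 0.1, seed=42)     # ~10% sample, without replacement
          pdf = sampled.toPandas()                      # only the sample hits driver memory

          print(len(pdf))
          spark.stop()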



          We're planning to add streaming read logic in pyarrow this year (2019; see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory use issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.
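          A minimal sketch of that approach, assuming a file named data.parquet and a 10% sample fraction (both illustrative): read one row group at a time and keep only a random sample of each chunk.

          import pandas as pd
          import pyarrow.parquet as pq

          # Stream the file one row group at a time, down-sample each chunk,
          # then concatenate the samples into a single pandas DataFrame.
          pf = pq.ParquetFile("data.parquet")

          samples = []
          for i in range(pf.num_row_groups):
              chunk = pf.read_row_group(i).to_pandas()   # only this row group is in memory
              samples.append(chunk.sample(frac=0.1, random_state=42))

          df = pd.concat(samples, ignore_index=True)
          print(df.shape)

          Note that this only helps if the file actually contains more than one row group.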






          answered Jan 2 at 16:15 by Wes McKinney
          • Thank you for the tips! I have queried the file using 'num_row_groups' and my file has only 1 'row_group'. I assume this means I won't have anything to gain by using 'read_row_group'?

            – Sjoseph
            Jan 2 at 17:03











          • No, you will only gain something from this when you write your Parquet files with multiple row groups. When using pyarrow for writing them, you should set the chunk_size argument to the number of rows that fit nicely into RAM. But beware that the smaller you set this argument, the slower reading gets. You're probably best off setting chunk_size=len(table) / 60 so that you get 100 MiB chunks (see the sketch after these comments).

            – Uwe L. Korn
            Jan 2 at 17:31











          • Thank you for the suggestion, but I do not have control of the Parquet file format. I assume my only option is to get set up with pyspark/spark?

            – Sjoseph
            Jan 2 at 17:42
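          For reference, a rough sketch of the writing side mentioned above: rewriting a table into a file with many small row groups so it can later be read back group by group. The file names are illustrative, and the row-group size argument is named row_group_size in current pyarrow (the comment above refers to it as chunk_size).

          import pyarrow.parquet as pq

          # Rewrite the data with roughly 60 row groups so each group can be read separately.
          # Reading the full table here still needs enough RAM for one complete pass.
          table = pq.read_table("data.parquet")
          rows_per_group = max(1, table.num_rows // 60)

          pq.write_table(table, "data_chunked.parquet",
                         row_group_size=rows_per_group)   # called chunk_size in older pyarrow releases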



















          This is not an answer; I'm posting here because this is the only relevant post I could find on Stack Overflow. I'm trying to use the read_row_group function, but Python just exits with code 139. There are no other error messages, and I'm not sure how to fix this.



          from pyarrow.parquet import ParquetFile

          path = "sample.parquet"
          f = ParquetFile(source=path)
          print(f.num_row_groups)  # prints the number of row groups

          # Reading the entire file works:
          df = f.read()

          # Trying to read a single row group crashes:
          row_df = f.read_row_group(0)
          # Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)


          Python version 3.6.3



          pyarrow version 0.11.1






          answered Jan 24 at 18:28 by neghez