Using Pool to read multiple files in parallel takes forever on Jupyter (Windows)

I want to read 22 files (stored on my hard disk), each with around 300,000 rows, into a single pandas data frame. My current code does this in 15-25 minutes. My initial thought is that I should make it faster by using more CPUs. (Correct me if I am wrong here; even if all CPUs can't read data from the same hard disk at the same time, the data might be spread across different hard disks later on, so this exercise is still useful.)



I found a few posts like this and this and tried the code below.



import os
import pandas as pd
from multiprocessing import Pool

def read_psv(filename):
    """Reads one row of a file (pipe delimited) into a pandas DataFrame."""
    return pd.read_csv(filename,
                       delimiter='|',
                       skiprows=1,    # need this as the first row is junk
                       nrows=1,       # just one row for faster testing
                       encoding="ISO-8859-1",  # need this as well
                       low_memory=False)


files = os.listdir('.')  # getting all files, will use glob later
df1 = pd.concat((read_psv(f) for f in files[0:6]),
                ignore_index=True, axis=0, sort=False)  # takes less than 1 second

pool = Pool(processes=3)
df_list = pool.map(read_psv, files[0:6])  # takes forever
# df2 = pd.concat(df_list, ignore_index=True)  # can't reach this


This takes forever (more than 30-60 minutes; it still hasn't finished by the time I kill the process). I also went through a similar question to mine, but it was of no use.



EDIT: I am using Jupyter on Windows.

python windows pandas jupyter-notebook python-multiprocessing

asked Nov 20 '18 at 14:00 by Gaurav Singhal, edited Nov 21 '18 at 14:26

  • I understood why it keeps running forever. I was using this code on Windows, which requires the Pool to be created inside an if __name__ == '__main__': block; otherwise it raises a runtime error. Please see this for more details: stackoverflow.com/questions/20222534/…

    – Gaurav Singhal
    Nov 21 '18 at 8:50

2 Answers

Your task is I/O-bound; the bottleneck is the hard drive. The CPU only has to do a little work to parse each line of the CSV.

Disk reads are fastest when they are sequential. If you want to read a large file, it's best to let the disk seek to its beginning and then read its bytes sequentially.

If you have multiple large files on the same hard drive and read from them with multiple processes, the disk head has to jump back and forth between them, and each jump takes up to 10 ms.

Multiprocessing can still make your code faster, but you will need to store your files on multiple disks, so that each disk head can focus on reading one file.

Another alternative is to buy an SSD. Seek time is much lower, at around 0.1 ms, and throughput is around 5x higher.
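If everything stays on one disk, a plain sequential loop is usually the simplest baseline. A minimal sketch, assuming the same pipe-delimited files and read options as in the question (the '*.txt' glob pattern is just a placeholder for whatever the real files are called):

import glob
import pandas as pd

def read_psv(filename):
    """Read one pipe-delimited file into a DataFrame (same options as in the question)."""
    return pd.read_csv(filename, delimiter='|', skiprows=1,
                       encoding="ISO-8859-1", low_memory=False)

# Read the files one after another so the disk can stream each file sequentially
files = sorted(glob.glob('*.txt'))  # placeholder pattern; adjust to the real file names
df = pd.concat((read_psv(f) for f in files), ignore_index=True, sort=False)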
answered Nov 20 '18 at 14:48 by fafl (edited Nov 20 '18 at 22:46)

  • Thanks for the answer. My hard disk is actually an SSD. Also, in the future I might be doing this across multiple hard disks. And the code breaks even when I try to read just one row from each of 6 files.

    – Gaurav Singhal
    Nov 21 '18 at 4:16

So the issue is not related to bad performance or getting stuck at I/O. The issue is related to Jupyter and Windows. On Windows we need to guard the Pool initialization with if __name__ == '__main__':. For Jupyter, we need to save the worker function in a separate file and import it into the notebook. Jupyter is also problematic because it does not show the error log by default; I found out about the Windows issue when I ran the code in a Python shell, and about the Jupyter issue when I ran it in an IPython shell. The following posts helped me a lot.

For Jupyter

For Windows Issue
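A minimal sketch of that layout, assuming the worker is saved in a separate module (the file name reader.py is just an illustration):

# reader.py -- the worker lives in its own importable module so child processes can find it
import pandas as pd

def read_psv(filename):
    """Read one pipe-delimited file into a DataFrame (same options as in the question)."""
    return pd.read_csv(filename, delimiter='|', skiprows=1,
                       encoding="ISO-8859-1", low_memory=False)

# main script or notebook cell
import os
from multiprocessing import Pool

import pandas as pd
from reader import read_psv  # imported, not defined inline

if __name__ == '__main__':  # required on Windows, where workers start by re-importing this module
    files = os.listdir('.')
    with Pool(processes=3) as pool:
        df_list = pool.map(read_psv, files[0:6])
    df2 = pd.concat(df_list, ignore_index=True)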
answered Nov 21 '18 at 14:25 by Gaurav Singhal