Using Pool to read multiple files in parallel takes forever on Jupyter (Windows)
I want to read 22 files (stored on my hard disk) with around 300,000 rows each into a single pandas data frame. My code can do it in 15-25 minutes. My initial thought is that I should make it faster using more CPUs (correct me if I am wrong here; even if multiple CPUs can't read from the same hard disk at the same time, we can assume the data might live on different hard disks later on, so this exercise is still useful).
I found a few posts like this and this and tried the code below.
import os
import pandas as pd
from multiprocessing import Pool

def read_psv(filename):
    'reads one row of a file (pipe delimited) into a pandas dataframe'
    return pd.read_csv(filename,
                       delimiter='|',
                       skiprows=1,              # need this as the first row is junk
                       nrows=1,                 # just one row for faster testing
                       encoding="ISO-8859-1",   # need this as well
                       low_memory=False)

files = os.listdir('.')  # getting all files; will use glob later

df1 = pd.concat((read_psv(f) for f in files[0:6]), ignore_index=True, axis=0, sort=False)  # takes less than 1 second

pool = Pool(processes=3)
df_list = pool.map(read_psv, files[0:6])  # takes forever
# df2 = pd.concat(df_list, ignore_index=True)  # can't reach this
This takes forever (more than 30-60 minutes; it never finishes before I kill the process). I also went through a similar question to mine, but it was of no use.
EDIT: I am using Jupyter on Windows.
python windows pandas jupyter-notebook python-multiprocessing
asked Nov 20 '18 at 14:00, edited Nov 21 '18 at 14:26 – Gaurav Singhal
I understood why it keeps on running forever. I was using this code on Windows, and it requires the pool to be created inside an if __name__ == '__main__': clause. Otherwise it creates a runtime error. Please see this for more details: stackoverflow.com/questions/20222534/…
– Gaurav Singhal
Nov 21 '18 at 8:50
2 Answers
Your task is I/O-bound; the bottleneck is the hard drive. The CPU only has to do a little work to parse each line of the CSV.
Disk reads are fastest when they are sequential. If you want to read a large file, it's best to let the disk seek to its beginning and then read all of its bytes in order.
If you have multiple large files on the same hard drive and read from them using multiple processes, the disk head has to jump back and forth between them, and each jump takes up to 10 ms.
Multiprocessing can still make your code faster, but you would need to store the files on multiple disks so that each disk head can focus on reading one file.
Another alternative is to buy an SSD: seek time drops to around 0.1 ms and throughput is roughly 5x higher.
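For reference, a minimal sketch of the single-process, sequential approach this reasoning points to (the '*.psv' glob pattern is an assumption; the read options mirror the reader from the question):

import glob
import pandas as pd

def read_psv(filename):
    # same reader as in the question: pipe-delimited, junk first row
    return pd.read_csv(filename, delimiter='|', skiprows=1,
                       encoding="ISO-8859-1", low_memory=False)

# Read the files one after another so the disk can stream each file
# sequentially, then concatenate everything into a single dataframe.
files = sorted(glob.glob('*.psv'))  # file pattern is an assumption
df = pd.concat((read_psv(f) for f in files), ignore_index=True, sort=False)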
answered Nov 20 '18 at 14:48, edited Nov 20 '18 at 22:46 – fafl
Thanks for the answer. My hard disk is already an SSD. Also, in the future I might be doing this across multiple hard disks. And the code breaks even when I try to read just 6 files with one row each.
– Gaurav Singhal
Nov 21 '18 at 4:16
So the issue is not related to bad performance or getting stuck at I/O. The issue is related to Jupyter and Windows. On Windows we need to guard the code with an if __name__ == '__main__': clause before initializing the Pool. For Jupyter, we need to save the worker function in a separate file and import it into the notebook. Jupyter is also problematic because it does not show the error log by default. I found out about the Windows issue when I ran the code in a plain Python shell, and about the Jupyter issue when I ran it in an IPython shell. The following posts helped me a lot:
For Jupyter
For Windows Issue
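A minimal sketch of this fix, assuming the worker is saved in a hypothetical module named read_worker.py and the data files match a '*.psv' pattern (both names are illustrative, not from the post):

# read_worker.py -- the worker must live in an importable module so that
# child processes spawned on Windows (and from Jupyter) can find it.
import pandas as pd

def read_psv(filename):
    # same reader as in the question: pipe-delimited, junk first row
    return pd.read_csv(filename, delimiter='|', skiprows=1,
                       encoding="ISO-8859-1", low_memory=False)

# main.py -- create the Pool only under the __main__ guard, as required
# on Windows, where worker processes are spawned rather than forked.
import glob
import pandas as pd
from multiprocessing import Pool
from read_worker import read_psv

if __name__ == '__main__':
    files = sorted(glob.glob('*.psv'))  # file pattern is an assumption
    with Pool(processes=3) as pool:
        df_list = pool.map(read_psv, files)
    df = pd.concat(df_list, ignore_index=True)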
answered Nov 21 '18 at 14:25 – Gaurav Singhal