Using Pool to read multiple files in parallel takes forever on Jupyter (Windows)
I want to read 22 files (stored on my hard disk) with around 300,000 rows each into a single pandas data frame. My code can do it in 15-25 minutes. My initial thought is that I should make it faster using more CPUs (correct me if I am wrong here; even if multiple CPUs can't read from the same hard disk at the same time, we can assume the data might live on different hard disks later on, so this exercise is still useful).
I found a few posts like this and this and tried the code below.
import os
import pandas as pd
from multiprocessing import Pool

def read_psv(filename):
    'reads one row of a file (pipe delimited) into a pandas dataframe'
    return pd.read_csv(filename,
                       delimiter='|',
                       skiprows=1,              # need this as the first row is junk
                       nrows=1,                 # just one row for faster testing
                       encoding="ISO-8859-1",   # need this as well
                       low_memory=False)

files = os.listdir('.')  # getting all files; will use glob later

df1 = pd.concat((read_psv(f) for f in files[0:6]), ignore_index=True, axis=0, sort=False)  # takes less than 1 second

pool = Pool(processes=3)
df_list = pool.map(read_psv, files[0:6])  # takes forever
# df2 = pd.concat(df_list, ignore_index=True)  # can't reach this
This takes forever (more than 30-60 minutes; it never finishes before I kill the process). I also went through a similar question to mine, but it was of no use.
EDIT: I am using Jupyter on Windows.
python windows pandas jupyter-notebook python-multiprocessing
asked Nov 20 '18 at 14:00, edited Nov 21 '18 at 14:26 – Gaurav Singhal
I understood why it keeps on running forever. I was using this code on Windows, and it requires the pool to be created inside an if __name__ == '__main__': clause. Otherwise it creates a runtime error. Please see this for more details: stackoverflow.com/questions/20222534/…
– Gaurav Singhal
Nov 21 '18 at 8:50
2 Answers
Your task is I/O-bound; the bottleneck is the hard drive. The CPU only has to do a little work to parse each line of the CSV.
Disk reads are fastest when they are sequential. If you want to read a large file, it's best to let the disk seek to its beginning and then read all of its bytes in order.
If you have multiple large files on the same hard drive and read from them using multiple processes, the disk head has to jump back and forth between them, and each jump takes up to 10 ms.
Multiprocessing can still make your code faster, but you would need to store the files on multiple disks so that each disk head can focus on reading one file.
Another alternative is to buy an SSD: seek time drops to around 0.1 ms and throughput is roughly 5x higher.
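For reference, a minimal sketch of the single-process, sequential approach this reasoning points to (the '*.psv' glob pattern is an assumption; the read options mirror the reader from the question):

import glob
import pandas as pd

def read_psv(filename):
    # same reader as in the question: pipe-delimited, junk first row
    return pd.read_csv(filename, delimiter='|', skiprows=1,
                       encoding="ISO-8859-1", low_memory=False)

# Read the files one after another so the disk can stream each file
# sequentially, then concatenate everything into a single dataframe.
files = sorted(glob.glob('*.psv'))  # file pattern is an assumption
df = pd.concat((read_psv(f) for f in files), ignore_index=True, sort=False)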
answered Nov 20 '18 at 14:48, edited Nov 20 '18 at 22:46 – fafl
Thanks for the answer. My hard disk is already an SSD. Also, in the future I might be doing this across multiple hard disks. And the code breaks even when I try to read just 6 files with one row each.
– Gaurav Singhal
Nov 21 '18 at 4:16
So the issue is not related to bad performance or getting stuck at I/O. The issue is related to Jupyter and Windows. On Windows we need to guard the code with an if __name__ == '__main__': clause before initializing the Pool. For Jupyter, we need to save the worker function in a separate file and import it into the notebook. Jupyter is also problematic because it does not show the error log by default. I found out about the Windows issue when I ran the code in a plain Python shell, and about the Jupyter issue when I ran it in an IPython shell. The following posts helped me a lot:
For Jupyter
For Windows Issue
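A minimal sketch of this fix, assuming the worker is saved in a hypothetical module named read_worker.py and the data files match a '*.psv' pattern (both names are illustrative, not from the post):

# read_worker.py -- the worker must live in an importable module so that
# child processes spawned on Windows (and from Jupyter) can find it.
import pandas as pd

def read_psv(filename):
    # same reader as in the question: pipe-delimited, junk first row
    return pd.read_csv(filename, delimiter='|', skiprows=1,
                       encoding="ISO-8859-1", low_memory=False)

# main.py -- create the Pool only under the __main__ guard, as required
# on Windows, where worker processes are spawned rather than forked.
import glob
import pandas as pd
from multiprocessing import Pool
from read_worker import read_psv

if __name__ == '__main__':
    files = sorted(glob.glob('*.psv'))  # file pattern is an assumption
    with Pool(processes=3) as pool:
        df_list = pool.map(read_psv, files)
    df = pd.concat(df_list, ignore_index=True)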
answered Nov 21 '18 at 14:25 – Gaurav Singhal