Why is pandas' indexing so slow? How to make it faster?
Running this code shows the difference in speed between pandas and a regular Python list:

import pandas as pd
from random import randint
from time import time

ser = pd.Series(range(100))
lst = ser.tolist()

for _ in range(10):
    pandas_time = 0
    list_time = 0
    for _ in range(100000):
        r = randint(0, len(ser) - 1)

        # time a single Series lookup
        t = time()
        ser[r]
        pandas_time += time() - t

        # time the same lookup on the plain list
        t = time()
        lst[r]
        list_time += time() - t
    print(pandas_time, list_time)
The results (10 trials of indexing random elements 100000 times):
Pandas Regular List
0.6404812335968018 0.03125190734863281
0.6560468673706055 0.0
0.5779874324798584 0.01562190055847168
0.5467743873596191 0.015621662139892578
0.6106545925140381 0.004016399383544922
0.5866603851318359 0.029597759246826172
0.7981059551239014 0.016004562377929688
0.8128316402435303 0.013040542602539062
0.5566465854644775 0.021578073501586914
0.6386256217956543 0.00500178337097168
Indexing a pandas Series appears to be 30 to 100 times slower than indexing a Python list. Why is that, and how can we speed it up?
python pandas
asked Jan 1 at 17:17 by Primusa
Because pandas is optimized to index many values at once in a vectorized fashion, not single values like a list. It also supports index alignment, which is often quite useful, but adds significant overhead compared to, say, plain NumPy.
– juanpa.arrivillaga, Jan 1 at 18:31
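A minimal sketch of both points in that comment, assuming the same ser as in the question (the use of np.random.randint and the sizes are illustrative, not from the original post):

import numpy as np
import pandas as pd

ser = pd.Series(range(100))

# Vectorized lookup: one call fetches many values, so the per-call
# overhead is paid once instead of once per element.
idx = np.random.randint(0, len(ser), size=100_000)
many = ser[idx]

# Index alignment: arithmetic between two Series matches values by label,
# not by position -- convenient, but extra machinery a plain array skips.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])
print(a + b)                                          # x=31, y=22, z=13
print(np.array([1, 2, 3]) + np.array([10, 20, 30]))   # [11 22 33], purely positional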
1 Answer
I checked pandas' source code. The __getitem__ implementation of a pandas Series carries a lot of additional business logic compared to a regular Python list, because a Series supports indexing with lists and iterables.
When you index a pandas Series, it:

1. Applies the key if it is callable.
2. Gets the value of the index at that key (sounds simple enough, but keep in mind the index is another pandas object that also has to support more than plain positional lookup).
3. Checks whether the result of step 2 is a scalar.
4. If it is a scalar, returns the result.

These additional steps slow down __getitem__ dramatically compared to a regular Python list.
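For illustration only (this is not the pandas source, just a sketch of the kinds of keys a Series has to be ready for on every single lookup):

import pandas as pd

ser = pd.Series(range(100))

ser[3]                    # scalar key: a single value
ser[[3, 5, 7]]            # list key: a sub-Series
ser[ser > 95]             # boolean mask: a filtered Series
ser[lambda s: s > 95]     # callable key: applied to the Series, then used as a mask

Every one of these goes through the same __getitem__, so even the scalar case pays for the dispatching.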
To work around this, you can work directly with the underlying NumPy array. Here we index through ser.values instead:
import pandas as pd
from random import randint
from time import time

ser = pd.Series(range(100))
lst = ser.tolist()
ser = ser.values   # from here on, ser is the underlying numpy array

for _ in range(10):
    pandas_time = 0
    list_time = 0
    for _ in range(1000000):
        r = randint(0, len(ser) - 1)

        t = time()
        ser[r]
        pandas_time += time() - t

        t = time()
        lst[r]
        list_time += time() - t
    print(pandas_time, list_time)
After 10 trials of indexing 1,000,000 random elements each, we find that using .values is much faster than indexing the pandas Series directly, but still slower than using a Python list:
pd.Series.values Regular List
0.18845057487487793 0.04786252975463867
0.10950899124145508 0.11034011840820312
0.048889875411987305 0.09512066841125488
0.17272686958312988 0.1406867504119873
0.14252233505249023 0.048066139221191406
0.06352949142456055 0.07906699180603027
0.1405477523803711 0.07815265655517578
0.18746685981750488 0.08007645606994629
0.1405184268951416 0.0781564712524414
0.07921838760375977 0.1412496566772461
To summarize: .values is the way to go when you need to index a pandas Series quickly. While .tolist() looks faster here, keep in mind it will only be slightly faster when indexing individual elements; a NumPy array also supports much faster fancy indexing, i.e. pulling out many elements at once, as sketched below.
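A rough sketch of that fancy-indexing point, under the same setup as above (the index array and its size are illustrative):

import numpy as np
import pandas as pd

ser = pd.Series(range(100))
arr = ser.values
lst = ser.tolist()

idx = np.random.randint(0, len(arr), size=1_000_000)

picked_from_array = arr[idx]               # one vectorized call, done in C
picked_from_list = [lst[i] for i in idx]   # one interpreted lookup per element

The array version does all of its work inside NumPy, while the list version pays Python interpreter overhead once per element, which is why .values wins as soon as you stop indexing one element at a time.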
answered Jan 1 at 17:17 by Primusa