Why is pandas' indexing so slow? How to make it faster?
Running this code shows the difference in speed between pandas and a regular Python list:

import pandas as pd
from random import randint
from time import time

ser = pd.Series(range(100))
lst = ser.tolist()

for _ in range(10):
    pandas_time = 0
    list_time = 0
    for _ in range(100000):
        r = randint(0, len(ser) - 1)

        # time a single Series lookup
        t = time()
        ser[r]
        pandas_time += time() - t

        # time the same lookup on the plain list
        t = time()
        lst[r]
        list_time += time() - t
    print(pandas_time, list_time)
The results (10 trials of indexing random elements 100000 times):
Pandas Regular List
0.6404812335968018 0.03125190734863281
0.6560468673706055 0.0
0.5779874324798584 0.01562190055847168
0.5467743873596191 0.015621662139892578
0.6106545925140381 0.004016399383544922
0.5866603851318359 0.029597759246826172
0.7981059551239014 0.016004562377929688
0.8128316402435303 0.013040542602539062
0.5566465854644775 0.021578073501586914
0.6386256217956543 0.00500178337097168
Indexing a pandas Series appears to be 30 to 100 times slower than indexing a Python list. Why is that, and how can we speed it up?
python pandas
asked Jan 1 at 17:17 by Primusa
Because pandas is optimized to index many values at once in a vectorized fashion, not single values like a list. It also supports index alignment, which is often quite useful, but adds significant overhead compared to, say, plain NumPy.
– juanpa.arrivillaga, Jan 1 at 18:31
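A minimal sketch of both points in that comment, assuming the same ser as in the question (the use of np.random.randint and the sizes are illustrative, not from the original post):

import numpy as np
import pandas as pd

ser = pd.Series(range(100))

# Vectorized lookup: one call fetches many values, so the per-call
# overhead is paid once instead of once per element.
idx = np.random.randint(0, len(ser), size=100_000)
many = ser[idx]

# Index alignment: arithmetic between two Series matches values by label,
# not by position -- convenient, but extra machinery a plain array skips.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])
print(a + b)                                          # x=31, y=22, z=13
print(np.array([1, 2, 3]) + np.array([10, 20, 30]))   # [11 22 33], purely positional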
1 Answer
I checked pandas' source code. The __getitem__ implementation of a pandas Series carries a lot of additional business logic compared to a regular Python list, because a Series supports indexing with lists and iterables.
When you index a pandas Series, it:

1. Applies the key if it is callable.
2. Gets the value of the index at that key (sounds simple enough, but keep in mind the index is another pandas object that also has to support more than plain positional lookup).
3. Checks whether the result of step 2 is a scalar.
4. If it is a scalar, returns the result.

These additional steps slow down __getitem__ dramatically compared to a regular Python list.
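For illustration only (this is not the pandas source, just a sketch of the kinds of keys a Series has to be ready for on every single lookup):

import pandas as pd

ser = pd.Series(range(100))

ser[3]                    # scalar key: a single value
ser[[3, 5, 7]]            # list key: a sub-Series
ser[ser > 95]             # boolean mask: a filtered Series
ser[lambda s: s > 95]     # callable key: applied to the Series, then used as a mask

Every one of these goes through the same __getitem__, so even the scalar case pays for the dispatching.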
To work around this, you can work directly with the underlying NumPy array. Here we index through ser.values instead:
import pandas as pd
from random import randint
from time import time

ser = pd.Series(range(100))
lst = ser.tolist()
ser = ser.values   # from here on, ser is the underlying numpy array

for _ in range(10):
    pandas_time = 0
    list_time = 0
    for _ in range(1000000):
        r = randint(0, len(ser) - 1)

        t = time()
        ser[r]
        pandas_time += time() - t

        t = time()
        lst[r]
        list_time += time() - t
    print(pandas_time, list_time)
After 10 trials of indexing 1,000,000 random elements each, we find that using .values is much faster than indexing the pandas Series directly, but still slower than using a Python list:
pd.Series.values Regular List
0.18845057487487793 0.04786252975463867
0.10950899124145508 0.11034011840820312
0.048889875411987305 0.09512066841125488
0.17272686958312988 0.1406867504119873
0.14252233505249023 0.048066139221191406
0.06352949142456055 0.07906699180603027
0.1405477523803711 0.07815265655517578
0.18746685981750488 0.08007645606994629
0.1405184268951416 0.0781564712524414
0.07921838760375977 0.1412496566772461
To summarize: .values is the way to go when you need to index a pandas Series quickly. While .tolist() looks faster here, keep in mind it will only be slightly faster when indexing individual elements; a NumPy array also supports much faster fancy indexing, i.e. pulling out many elements at once, as sketched below.
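A rough sketch of that fancy-indexing point, under the same setup as above (the index array and its size are illustrative):

import numpy as np
import pandas as pd

ser = pd.Series(range(100))
arr = ser.values
lst = ser.tolist()

idx = np.random.randint(0, len(arr), size=1_000_000)

picked_from_array = arr[idx]               # one vectorized call, done in C
picked_from_list = [lst[i] for i in idx]   # one interpreted lookup per element

The array version does all of its work inside NumPy, while the list version pays Python interpreter overhead once per element, which is why .values wins as soon as you stop indexing one element at a time.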
answered Jan 1 at 17:17 by Primusa