Why is pandas' indexing so slow? How to make it faster?












0















Running this code shows the difference in speed between pandas and a regular python list:



ser = pd.Series(range(100))
lst = ser.tolist()

for _ in range(10):
pandas_time = 0
list_time = 0
for _ in range(100000):
r = randint(0, len(ser)-1)
t = time()
ser[r]
pandas_time += time() - t

t = time()
lst[r]
list_time += time() - t

print(pandas_time, list_time)


The results (10 trials of indexing random elements 100000 times):



Pandas             Regular List
0.6404812335968018 0.03125190734863281
0.6560468673706055 0.0
0.5779874324798584 0.01562190055847168
0.5467743873596191 0.015621662139892578
0.6106545925140381 0.004016399383544922
0.5866603851318359 0.029597759246826172
0.7981059551239014 0.016004562377929688
0.8128316402435303 0.013040542602539062
0.5566465854644775 0.021578073501586914
0.6386256217956543 0.00500178337097168


Indexing a pandas series seems to be 30 - 100 times slower than a python list. Why? How can we speed this up?










share|improve this question























  • because pandas is optimized to index many values in a vectorized fashion, not single values like a list. It also supports index alignment, which is often quite useful, but adds significant overhead compared to say, mere numpy.

    – juanpa.arrivillaga
    Jan 1 at 18:31


















0















Running this code shows the difference in speed between pandas and a regular python list:



ser = pd.Series(range(100))
lst = ser.tolist()

for _ in range(10):
pandas_time = 0
list_time = 0
for _ in range(100000):
r = randint(0, len(ser)-1)
t = time()
ser[r]
pandas_time += time() - t

t = time()
lst[r]
list_time += time() - t

print(pandas_time, list_time)


The results (10 trials of indexing random elements 100000 times):



Pandas             Regular List
0.6404812335968018 0.03125190734863281
0.6560468673706055 0.0
0.5779874324798584 0.01562190055847168
0.5467743873596191 0.015621662139892578
0.6106545925140381 0.004016399383544922
0.5866603851318359 0.029597759246826172
0.7981059551239014 0.016004562377929688
0.8128316402435303 0.013040542602539062
0.5566465854644775 0.021578073501586914
0.6386256217956543 0.00500178337097168


Indexing a pandas series seems to be 30 - 100 times slower than a python list. Why? How can we speed this up?










share|improve this question























  • because pandas is optimized to index many values in a vectorized fashion, not single values like a list. It also supports index alignment, which is often quite useful, but adds significant overhead compared to say, mere numpy.

    – juanpa.arrivillaga
    Jan 1 at 18:31
















0












0








0








Running this code shows the difference in speed between pandas and a regular python list:



ser = pd.Series(range(100))
lst = ser.tolist()

for _ in range(10):
pandas_time = 0
list_time = 0
for _ in range(100000):
r = randint(0, len(ser)-1)
t = time()
ser[r]
pandas_time += time() - t

t = time()
lst[r]
list_time += time() - t

print(pandas_time, list_time)


The results (10 trials of indexing random elements 100000 times):



Pandas             Regular List
0.6404812335968018 0.03125190734863281
0.6560468673706055 0.0
0.5779874324798584 0.01562190055847168
0.5467743873596191 0.015621662139892578
0.6106545925140381 0.004016399383544922
0.5866603851318359 0.029597759246826172
0.7981059551239014 0.016004562377929688
0.8128316402435303 0.013040542602539062
0.5566465854644775 0.021578073501586914
0.6386256217956543 0.00500178337097168


Indexing a pandas series seems to be 30 - 100 times slower than a python list. Why? How can we speed this up?










share|improve this question














Running this code shows the difference in speed between pandas and a regular python list:



ser = pd.Series(range(100))
lst = ser.tolist()

for _ in range(10):
pandas_time = 0
list_time = 0
for _ in range(100000):
r = randint(0, len(ser)-1)
t = time()
ser[r]
pandas_time += time() - t

t = time()
lst[r]
list_time += time() - t

print(pandas_time, list_time)


The results (10 trials of indexing random elements 100000 times):



Pandas             Regular List
0.6404812335968018 0.03125190734863281
0.6560468673706055 0.0
0.5779874324798584 0.01562190055847168
0.5467743873596191 0.015621662139892578
0.6106545925140381 0.004016399383544922
0.5866603851318359 0.029597759246826172
0.7981059551239014 0.016004562377929688
0.8128316402435303 0.013040542602539062
0.5566465854644775 0.021578073501586914
0.6386256217956543 0.00500178337097168


Indexing a pandas series seems to be 30 - 100 times slower than a python list. Why? How can we speed this up?







python pandas






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Jan 1 at 17:17









PrimusaPrimusa

7,9812732




7,9812732













  • because pandas is optimized to index many values in a vectorized fashion, not single values like a list. It also supports index alignment, which is often quite useful, but adds significant overhead compared to say, mere numpy.

    – juanpa.arrivillaga
    Jan 1 at 18:31





















  • because pandas is optimized to index many values in a vectorized fashion, not single values like a list. It also supports index alignment, which is often quite useful, but adds significant overhead compared to say, mere numpy.

    – juanpa.arrivillaga
    Jan 1 at 18:31



















because pandas is optimized to index many values in a vectorized fashion, not single values like a list. It also supports index alignment, which is often quite useful, but adds significant overhead compared to say, mere numpy.

– juanpa.arrivillaga
Jan 1 at 18:31







because pandas is optimized to index many values in a vectorized fashion, not single values like a list. It also supports index alignment, which is often quite useful, but adds significant overhead compared to say, mere numpy.

– juanpa.arrivillaga
Jan 1 at 18:31














1 Answer
1






active

oldest

votes


















2














I checked pandas' source code. The __getitem__ implementation in a pandas series has a lot of additional business logic compared to the regular python list, because the pandas series supports indexing with lists and iterables.



When indexing a pandas series the series:




  1. Tries to apply the key if it is callable


  2. Gets the value of the index at that key (sounds simple enough, but keep in mind the index is another pandas object that also has to support more than regular indexing)


  3. Checks if 2) is scalar


  4. If it's scalar return the result



These additional steps slow down the __getitem__ dramatically compared to a regular python list.



To workaround this you can directly work with the underlying numpy array. Here we use ser.values to index instead:



ser = pd.Series(range(100))
lst = ser.tolist()

ser = ser.values

for _ in range(10):
pandas_time = 0
list_time = 0
for _ in range(1000000):
r = randint(0, len(ser)-1)
t = time()
ser[r]
pandas_time += time() - t

t = time()
lst[r]
list_time += time() - t

print(pandas_time, list_time)


After indexing 1000000 random elements 10 times, we find that using .values is much faster than just indexing the pandas series but is still slower than using a python list



pd.Series.values    Regular List
0.18845057487487793 0.04786252975463867
0.10950899124145508 0.11034011840820312
0.048889875411987305 0.09512066841125488
0.17272686958312988 0.1406867504119873
0.14252233505249023 0.048066139221191406
0.06352949142456055 0.07906699180603027
0.1405477523803711 0.07815265655517578
0.18746685981750488 0.08007645606994629
0.1405184268951416 0.0781564712524414
0.07921838760375977 0.1412496566772461


To summarize using .values is the way to go when you need to quickly index a pandas Series. While it looks like .tolist() is faster, keep in mind that it will only be slightly faster when indexing individual elements. Numpy arrays will support much faster fancy indexing like indexing with multiple elements.






share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53997424%2fwhy-is-pandas-indexing-so-slow-how-to-make-it-faster%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    2














    I checked pandas' source code. The __getitem__ implementation in a pandas series has a lot of additional business logic compared to the regular python list, because the pandas series supports indexing with lists and iterables.



    When indexing a pandas series the series:




    1. Tries to apply the key if it is callable


    2. Gets the value of the index at that key (sounds simple enough, but keep in mind the index is another pandas object that also has to support more than regular indexing)


    3. Checks if 2) is scalar


    4. If it's scalar return the result



    These additional steps slow down the __getitem__ dramatically compared to a regular python list.



    To workaround this you can directly work with the underlying numpy array. Here we use ser.values to index instead:



    ser = pd.Series(range(100))
    lst = ser.tolist()

    ser = ser.values

    for _ in range(10):
    pandas_time = 0
    list_time = 0
    for _ in range(1000000):
    r = randint(0, len(ser)-1)
    t = time()
    ser[r]
    pandas_time += time() - t

    t = time()
    lst[r]
    list_time += time() - t

    print(pandas_time, list_time)


    After indexing 1000000 random elements 10 times, we find that using .values is much faster than just indexing the pandas series but is still slower than using a python list



    pd.Series.values    Regular List
    0.18845057487487793 0.04786252975463867
    0.10950899124145508 0.11034011840820312
    0.048889875411987305 0.09512066841125488
    0.17272686958312988 0.1406867504119873
    0.14252233505249023 0.048066139221191406
    0.06352949142456055 0.07906699180603027
    0.1405477523803711 0.07815265655517578
    0.18746685981750488 0.08007645606994629
    0.1405184268951416 0.0781564712524414
    0.07921838760375977 0.1412496566772461


    To summarize using .values is the way to go when you need to quickly index a pandas Series. While it looks like .tolist() is faster, keep in mind that it will only be slightly faster when indexing individual elements. Numpy arrays will support much faster fancy indexing like indexing with multiple elements.






    share|improve this answer




























      2














      I checked pandas' source code. The __getitem__ implementation in a pandas series has a lot of additional business logic compared to the regular python list, because the pandas series supports indexing with lists and iterables.



      When indexing a pandas series the series:




      1. Tries to apply the key if it is callable


      2. Gets the value of the index at that key (sounds simple enough, but keep in mind the index is another pandas object that also has to support more than regular indexing)


      3. Checks if 2) is scalar


      4. If it's scalar return the result



      These additional steps slow down the __getitem__ dramatically compared to a regular python list.



      To workaround this you can directly work with the underlying numpy array. Here we use ser.values to index instead:



      ser = pd.Series(range(100))
      lst = ser.tolist()

      ser = ser.values

      for _ in range(10):
      pandas_time = 0
      list_time = 0
      for _ in range(1000000):
      r = randint(0, len(ser)-1)
      t = time()
      ser[r]
      pandas_time += time() - t

      t = time()
      lst[r]
      list_time += time() - t

      print(pandas_time, list_time)


      After indexing 1000000 random elements 10 times, we find that using .values is much faster than just indexing the pandas series but is still slower than using a python list



      pd.Series.values    Regular List
      0.18845057487487793 0.04786252975463867
      0.10950899124145508 0.11034011840820312
      0.048889875411987305 0.09512066841125488
      0.17272686958312988 0.1406867504119873
      0.14252233505249023 0.048066139221191406
      0.06352949142456055 0.07906699180603027
      0.1405477523803711 0.07815265655517578
      0.18746685981750488 0.08007645606994629
      0.1405184268951416 0.0781564712524414
      0.07921838760375977 0.1412496566772461


      To summarize using .values is the way to go when you need to quickly index a pandas Series. While it looks like .tolist() is faster, keep in mind that it will only be slightly faster when indexing individual elements. Numpy arrays will support much faster fancy indexing like indexing with multiple elements.






      share|improve this answer


























        2












        2








        2







        I checked pandas' source code. The __getitem__ implementation in a pandas series has a lot of additional business logic compared to the regular python list, because the pandas series supports indexing with lists and iterables.



        When indexing a pandas series the series:




        1. Tries to apply the key if it is callable


        2. Gets the value of the index at that key (sounds simple enough, but keep in mind the index is another pandas object that also has to support more than regular indexing)


        3. Checks if 2) is scalar


        4. If it's scalar return the result



        These additional steps slow down the __getitem__ dramatically compared to a regular python list.



        To workaround this you can directly work with the underlying numpy array. Here we use ser.values to index instead:



        ser = pd.Series(range(100))
        lst = ser.tolist()

        ser = ser.values

        for _ in range(10):
        pandas_time = 0
        list_time = 0
        for _ in range(1000000):
        r = randint(0, len(ser)-1)
        t = time()
        ser[r]
        pandas_time += time() - t

        t = time()
        lst[r]
        list_time += time() - t

        print(pandas_time, list_time)


        After indexing 1000000 random elements 10 times, we find that using .values is much faster than just indexing the pandas series but is still slower than using a python list



        pd.Series.values    Regular List
        0.18845057487487793 0.04786252975463867
        0.10950899124145508 0.11034011840820312
        0.048889875411987305 0.09512066841125488
        0.17272686958312988 0.1406867504119873
        0.14252233505249023 0.048066139221191406
        0.06352949142456055 0.07906699180603027
        0.1405477523803711 0.07815265655517578
        0.18746685981750488 0.08007645606994629
        0.1405184268951416 0.0781564712524414
        0.07921838760375977 0.1412496566772461


        To summarize using .values is the way to go when you need to quickly index a pandas Series. While it looks like .tolist() is faster, keep in mind that it will only be slightly faster when indexing individual elements. Numpy arrays will support much faster fancy indexing like indexing with multiple elements.






        share|improve this answer













        I checked pandas' source code. The __getitem__ implementation in a pandas series has a lot of additional business logic compared to the regular python list, because the pandas series supports indexing with lists and iterables.



        When indexing a pandas series the series:




        1. Tries to apply the key if it is callable


        2. Gets the value of the index at that key (sounds simple enough, but keep in mind the index is another pandas object that also has to support more than regular indexing)


        3. Checks if 2) is scalar


        4. If it's scalar return the result



        These additional steps slow down the __getitem__ dramatically compared to a regular python list.



        To workaround this you can directly work with the underlying numpy array. Here we use ser.values to index instead:



        ser = pd.Series(range(100))
        lst = ser.tolist()

        ser = ser.values

        for _ in range(10):
        pandas_time = 0
        list_time = 0
        for _ in range(1000000):
        r = randint(0, len(ser)-1)
        t = time()
        ser[r]
        pandas_time += time() - t

        t = time()
        lst[r]
        list_time += time() - t

        print(pandas_time, list_time)


        After indexing 1000000 random elements 10 times, we find that using .values is much faster than just indexing the pandas series but is still slower than using a python list



        pd.Series.values    Regular List
        0.18845057487487793 0.04786252975463867
        0.10950899124145508 0.11034011840820312
        0.048889875411987305 0.09512066841125488
        0.17272686958312988 0.1406867504119873
        0.14252233505249023 0.048066139221191406
        0.06352949142456055 0.07906699180603027
        0.1405477523803711 0.07815265655517578
        0.18746685981750488 0.08007645606994629
        0.1405184268951416 0.0781564712524414
        0.07921838760375977 0.1412496566772461


        To summarize using .values is the way to go when you need to quickly index a pandas Series. While it looks like .tolist() is faster, keep in mind that it will only be slightly faster when indexing individual elements. Numpy arrays will support much faster fancy indexing like indexing with multiple elements.







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Jan 1 at 17:17









        PrimusaPrimusa

        7,9812732




        7,9812732
































            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53997424%2fwhy-is-pandas-indexing-so-slow-how-to-make-it-faster%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            MongoDB - Not Authorized To Execute Command

            How to fix TextFormField cause rebuild widget in Flutter

            in spring boot 2.1 many test slices are not allowed anymore due to multiple @BootstrapWith