How do I unnest (explode) a column in a pandas DataFrame?












38















I have the following DataFrame where one of the columns is an object (list type cell):



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]:
A B
0 1 [1, 2]
1 2 [1, 2]


My expected output is:



   A  B
0 1 1
1 1 2
3 2 1
4 2 2


What should I do to achieve this?





Related question



pandas: When cell contents are lists, create a row for each element in the list



Good question and answer but only handle one column with list(In my answer the self-def function will work for multiple columns, also the accepted answer is use the most time consuming apply , which is not recommended, check more info When should I ever want to use pandas apply() in my code?)










share|improve this question




















  • 3





    Related, unnesting strings: stackoverflow.com/q/48197234/4909087

    – coldspeed
    Nov 12 '18 at 12:00






  • 1





    Couple of related posts: here, here, here, here, ...

    – Cleb
    Jan 4 at 12:43











  • This question is asked and answered 4 years after the original question and serves no purpose other then to divide possible answers to several questions instead of giving users a single place to find the solution.

    – ImportanceOfBeingErnest
    Feb 1 at 16:45
















38















I have the following DataFrame where one of the columns is an object (list type cell):



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]:
A B
0 1 [1, 2]
1 2 [1, 2]


My expected output is:



   A  B
0 1 1
1 1 2
3 2 1
4 2 2


What should I do to achieve this?





Related question



pandas: When cell contents are lists, create a row for each element in the list



Good question and answer but only handle one column with list(In my answer the self-def function will work for multiple columns, also the accepted answer is use the most time consuming apply , which is not recommended, check more info When should I ever want to use pandas apply() in my code?)










share|improve this question




















  • 3





    Related, unnesting strings: stackoverflow.com/q/48197234/4909087

    – coldspeed
    Nov 12 '18 at 12:00






  • 1





    Couple of related posts: here, here, here, here, ...

    – Cleb
    Jan 4 at 12:43











  • This question is asked and answered 4 years after the original question and serves no purpose other then to divide possible answers to several questions instead of giving users a single place to find the solution.

    – ImportanceOfBeingErnest
    Feb 1 at 16:45














38












38








38


17






I have the following DataFrame where one of the columns is an object (list type cell):



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]:
A B
0 1 [1, 2]
1 2 [1, 2]


My expected output is:



   A  B
0 1 1
1 1 2
3 2 1
4 2 2


What should I do to achieve this?





Related question



pandas: When cell contents are lists, create a row for each element in the list



Good question and answer but only handle one column with list(In my answer the self-def function will work for multiple columns, also the accepted answer is use the most time consuming apply , which is not recommended, check more info When should I ever want to use pandas apply() in my code?)










share|improve this question
















I have the following DataFrame where one of the columns is an object (list type cell):



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]:
A B
0 1 [1, 2]
1 2 [1, 2]


My expected output is:



   A  B
0 1 1
1 1 2
3 2 1
4 2 2


What should I do to achieve this?





Related question



pandas: When cell contents are lists, create a row for each element in the list



Good question and answer but only handle one column with list(In my answer the self-def function will work for multiple columns, also the accepted answer is use the most time consuming apply , which is not recommended, check more info When should I ever want to use pandas apply() in my code?)







python pandas dataframe






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Feb 2 at 0:58







Wen-Ben

















asked Nov 9 '18 at 2:19









Wen-BenWen-Ben

111k83266




111k83266








  • 3





    Related, unnesting strings: stackoverflow.com/q/48197234/4909087

    – coldspeed
    Nov 12 '18 at 12:00






  • 1





    Couple of related posts: here, here, here, here, ...

    – Cleb
    Jan 4 at 12:43











  • This question is asked and answered 4 years after the original question and serves no purpose other then to divide possible answers to several questions instead of giving users a single place to find the solution.

    – ImportanceOfBeingErnest
    Feb 1 at 16:45














  • 3





    Related, unnesting strings: stackoverflow.com/q/48197234/4909087

    – coldspeed
    Nov 12 '18 at 12:00






  • 1





    Couple of related posts: here, here, here, here, ...

    – Cleb
    Jan 4 at 12:43











  • This question is asked and answered 4 years after the original question and serves no purpose other then to divide possible answers to several questions instead of giving users a single place to find the solution.

    – ImportanceOfBeingErnest
    Feb 1 at 16:45








3




3





Related, unnesting strings: stackoverflow.com/q/48197234/4909087

– coldspeed
Nov 12 '18 at 12:00





Related, unnesting strings: stackoverflow.com/q/48197234/4909087

– coldspeed
Nov 12 '18 at 12:00




1




1





Couple of related posts: here, here, here, here, ...

– Cleb
Jan 4 at 12:43





Couple of related posts: here, here, here, here, ...

– Cleb
Jan 4 at 12:43













This question is asked and answered 4 years after the original question and serves no purpose other then to divide possible answers to several questions instead of giving users a single place to find the solution.

– ImportanceOfBeingErnest
Feb 1 at 16:45





This question is asked and answered 4 years after the original question and serves no purpose other then to divide possible answers to several questions instead of giving users a single place to find the solution.

– ImportanceOfBeingErnest
Feb 1 at 16:45












5 Answers
5






active

oldest

votes


















42





+100











As an user with both R and python, I have seen this type of question a couple of times.





In R, they have the built-in function from package tidyr called unnest. But in Python(pandas) there is no built-in function for this type of question.





I know object columns type always make the data hard to convert with a pandas' function. When I received the data like this , the first thing that came to mind was to 'flatten' or unnest the columns .





Method 1
apply + pd.Series (easy to understand but in terms of performance not recommended . )



df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2




Method 2 using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )



df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2


Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .



Solution : join or merge with the index after 'unnest' the single columns



s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B',1),how='left')
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2


If you need the column order exactly the same as before, add reindex at the end.



s.join(df.drop('B',1),how='left').reindex(columns=df.columns)




Method 3 recreate the list



pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2


If more than two columns



s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]




Method 4 using reindex or loc



df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))


Method 5 when the list only contains unique values:



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2


Method 6 using numpy for high performance:



newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2




Method 7 : using base function itertools cycle and chain: Pure python solution just for fun



from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2




Special case (two columns type object)



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]




Self-def function



def unnesting(df, explode):
idx=df.index.repeat(df[explode[0]].str.len())
df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
df1.index=idx
return df1.join(df.drop(explode,1),how='left')

unnesting(df,['B','C'])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2




Summary :



I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer , since he is using numpy and most of the time numpy is faster . I recommend Cpython and numba if speed matters in your case.






share|improve this answer

































    16





    +100









    Option 1



    If all of the sublists in the other column are the same length, numpy can be an efficient option here:



    vals = np.array(df.B.values.tolist())    
    a = np.repeat(df.A, vals.shape[1])

    pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)




       A  B
    0 1 1
    1 1 2
    2 2 1
    3 2 2




    Option 2



    If the sublists have different length, you need an additional step:



    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A, rs)

    pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)




       A  B
    0 1 1
    1 1 2
    2 2 1
    3 2 2




    Option 3



    I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:



    df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
    'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})




       A          B          C  D
    0 1 [1, 2] [1, 2, 3] A
    1 2 [1, 2, 3] [1, 2] B
    2 3 [1] [1, 2] C




    def unnest(df, tile, explode):
    vals = df[explode].sum(1)
    rs = [len(r) for r in vals]
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

    unnest(df, ['A', 'D'], ['B', 'C'])




        A  D B_C
    0 1 A 1
    1 1 A 2
    2 1 A 1
    3 1 A 2
    4 1 A 3
    5 2 B 1
    6 2 B 2
    7 2 B 3
    8 2 B 1
    9 2 B 2
    10 3 C 1
    11 3 C 1
    12 3 C 2




    Functions



    def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

    def wen2(df):
    return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

    def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop('B', 1), how='left')

    def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

    def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

    def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)


    Timings



    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from timeit import timeit

    res = pd.DataFrame(
    index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
    columns=[10, 50, 100, 500, 1000, 5000, 10000],
    dtype=float
    )

    for f in res.index:
    for c in res.columns:
    df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
    df = pd.concat([df]*c)
    stmt = '{}(df)'.format(f)
    setp = 'from __main__ import df, {}'.format(f)
    res.at[f, c] = timeit(stmt, setp, number=50)

    ax = res.div(res.min()).T.plot(loglog=True)
    ax.set_xlabel("N")
    ax.set_ylabel("time (relative)")


    Performance



    enter image description here






    share|improve this answer

































      7














      One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:



      import numpy as np
      import pandas as pd


      def unnest(frame, explode):
      def mesh(values):
      return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

      data = np.vstack(mesh(row) for row in frame[explode].values)
      return pd.DataFrame(data=data, columns=explode)


      df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
      print(unnest(df, ['A', 'B'])) # base
      print()

      df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
      print(unnest(df, ['A', 'B', 'C'])) # multiple columns
      print()

      df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
      'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

      print(unnest(df, ['A', 'B'])) # uneven length lists
      print()
      print(unnest(df, ['D', 'B'])) # different types
      print()


      Output



         A  B
      0 1 1
      1 1 2
      2 2 1
      3 2 2

      A B C
      0 1 1 1
      1 1 2 1
      2 1 1 2
      3 1 2 2
      4 2 3 3
      5 2 4 3
      6 2 3 4
      7 2 4 4

      A B
      0 1 1
      1 1 2
      2 2 1
      3 2 2
      4 2 3
      5 3 1

      D B
      0 A 1
      1 A 2
      2 B 1
      3 B 2
      4 B 3
      5 C 1





      share|improve this answer































        1














        My 5 cents:



        df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

        df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
        ignore_index=True)


        and another 5



        df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

        (pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
        .reset_index(level=1, drop=True)
        .reset_index())


        both resulting in the same



           A  B
        0 1 1
        1 2 1
        2 1 2
        3 2 2





        share|improve this answer

































          -1














          Something pretty not recommended (at least work in this case):



          df=pd.concat([df]*2).sort_index()
          it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
          df['B']=df['B'].apply(lambda x:next(it))


          concat + sort_index + iter + apply + next.



          Now:



          print(df)


          Is:



             A  B
          0 1 1
          0 1 2
          1 2 1
          1 2 2


          If care about index:



          df=df.reset_index(drop=True)


          Now:



          print(df)


          Is:



             A  B
          0 1 1
          1 1 2
          2 2 1
          3 2 2





          share|improve this answer























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53218931%2fhow-do-i-unnest-explode-a-column-in-a-pandas-dataframe%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            5 Answers
            5






            active

            oldest

            votes








            5 Answers
            5






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            42





            +100











            As an user with both R and python, I have seen this type of question a couple of times.





            In R, they have the built-in function from package tidyr called unnest. But in Python(pandas) there is no built-in function for this type of question.





            I know object columns type always make the data hard to convert with a pandas' function. When I received the data like this , the first thing that came to mind was to 'flatten' or unnest the columns .





            Method 1
            apply + pd.Series (easy to understand but in terms of performance not recommended . )



            df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
            Out[463]:
            A B
            0 1 1
            1 1 2
            0 2 1
            1 2 2




            Method 2 using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )



            df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
            df
            Out[465]:
            A B
            0 1 1
            0 1 2
            1 2 1
            1 2 2


            Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .



            Solution : join or merge with the index after 'unnest' the single columns



            s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
            s.join(df.drop('B',1),how='left')
            Out[477]:
            B A
            0 1 1
            0 2 1
            1 1 2
            1 2 2


            If you need the column order exactly the same as before, add reindex at the end.



            s.join(df.drop('B',1),how='left').reindex(columns=df.columns)




            Method 3 recreate the list



            pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
            Out[488]:
            A B
            0 1 1
            1 1 2
            2 2 1
            3 2 2


            If more than two columns



            s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
            s.merge(df,left_on=0,right_index=True)
            Out[491]:
            0 1 A B
            0 0 1 1 [1, 2]
            1 0 2 1 [1, 2]
            2 1 1 2 [1, 2]
            3 1 2 2 [1, 2]




            Method 4 using reindex or loc



            df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
            Out[554]:
            A B
            0 1 1
            0 1 2
            1 2 1
            1 2 2

            #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))


            Method 5 when the list only contains unique values:



            df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
            from collections import ChainMap
            d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
            pd.DataFrame(list(d.items()),columns=df.columns[::-1])
            Out[574]:
            B A
            0 1 1
            1 2 1
            2 3 2
            3 4 2


            Method 6 using numpy for high performance:



            newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
            pd.DataFrame(data=newvalues[0],columns=df.columns)
            A B
            0 1 1
            1 1 2
            2 2 1
            3 2 2




            Method 7 : using base function itertools cycle and chain: Pure python solution just for fun



            from itertools import cycle,chain
            l=df.values.tolist()
            l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
            pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
            A B
            0 1 1
            1 1 2
            2 2 1
            3 2 2




            Special case (two columns type object)



            df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
            df
            Out[592]:
            A B C
            0 1 [1, 2] [1, 2]
            1 2 [3, 4] [3, 4]




            Self-def function



            def unnesting(df, explode):
            idx=df.index.repeat(df[explode[0]].str.len())
            df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
            df1.index=idx
            return df1.join(df.drop(explode,1),how='left')

            unnesting(df,['B','C'])
            Out[609]:
            B C A
            0 1 1 1
            0 2 2 1
            1 3 3 2
            1 4 4 2




            Summary :



            I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer , since he is using numpy and most of the time numpy is faster . I recommend Cpython and numba if speed matters in your case.






            share|improve this answer






























              42





              +100











              As an user with both R and python, I have seen this type of question a couple of times.





              In R, they have the built-in function from package tidyr called unnest. But in Python(pandas) there is no built-in function for this type of question.





              I know object columns type always make the data hard to convert with a pandas' function. When I received the data like this , the first thing that came to mind was to 'flatten' or unnest the columns .





              Method 1
              apply + pd.Series (easy to understand but in terms of performance not recommended . )



              df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
              Out[463]:
              A B
              0 1 1
              1 1 2
              0 2 1
              1 2 2




              Method 2 using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )



              df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
              df
              Out[465]:
              A B
              0 1 1
              0 1 2
              1 2 1
              1 2 2


              Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .



              Solution : join or merge with the index after 'unnest' the single columns



              s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
              s.join(df.drop('B',1),how='left')
              Out[477]:
              B A
              0 1 1
              0 2 1
              1 1 2
              1 2 2


              If you need the column order exactly the same as before, add reindex at the end.



              s.join(df.drop('B',1),how='left').reindex(columns=df.columns)




              Method 3 recreate the list



              pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
              Out[488]:
              A B
              0 1 1
              1 1 2
              2 2 1
              3 2 2


              If more than two columns



              s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
              s.merge(df,left_on=0,right_index=True)
              Out[491]:
              0 1 A B
              0 0 1 1 [1, 2]
              1 0 2 1 [1, 2]
              2 1 1 2 [1, 2]
              3 1 2 2 [1, 2]




              Method 4 using reindex or loc



              df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
              Out[554]:
              A B
              0 1 1
              0 1 2
              1 2 1
              1 2 2

              #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))


              Method 5 when the list only contains unique values:



              df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
              from collections import ChainMap
              d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
              pd.DataFrame(list(d.items()),columns=df.columns[::-1])
              Out[574]:
              B A
              0 1 1
              1 2 1
              2 3 2
              3 4 2


              Method 6 using numpy for high performance:



              newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
              pd.DataFrame(data=newvalues[0],columns=df.columns)
              A B
              0 1 1
              1 1 2
              2 2 1
              3 2 2




              Method 7 : using base function itertools cycle and chain: Pure python solution just for fun



              from itertools import cycle,chain
              l=df.values.tolist()
              l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
              pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
              A B
              0 1 1
              1 1 2
              2 2 1
              3 2 2




              Special case (two columns type object)



              df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
              df
              Out[592]:
              A B C
              0 1 [1, 2] [1, 2]
              1 2 [3, 4] [3, 4]




              Self-def function



              def unnesting(df, explode):
              idx=df.index.repeat(df[explode[0]].str.len())
              df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
              df1.index=idx
              return df1.join(df.drop(explode,1),how='left')

              unnesting(df,['B','C'])
              Out[609]:
              B C A
              0 1 1 1
              0 2 2 1
              1 3 3 2
              1 4 4 2




              Summary :



              I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer , since he is using numpy and most of the time numpy is faster . I recommend Cpython and numba if speed matters in your case.






              share|improve this answer




























                42





                +100







                42





                +100



                42




                +100







                As an user with both R and python, I have seen this type of question a couple of times.





                In R, they have the built-in function from package tidyr called unnest. But in Python(pandas) there is no built-in function for this type of question.





                I know object columns type always make the data hard to convert with a pandas' function. When I received the data like this , the first thing that came to mind was to 'flatten' or unnest the columns .





                Method 1
                apply + pd.Series (easy to understand but in terms of performance not recommended . )



                df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
                Out[463]:
                A B
                0 1 1
                1 1 2
                0 2 1
                1 2 2




                Method 2 using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )



                df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
                df
                Out[465]:
                A B
                0 1 1
                0 1 2
                1 2 1
                1 2 2


                Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .



                Solution : join or merge with the index after 'unnest' the single columns



                s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
                s.join(df.drop('B',1),how='left')
                Out[477]:
                B A
                0 1 1
                0 2 1
                1 1 2
                1 2 2


                If you need the column order exactly the same as before, add reindex at the end.



                s.join(df.drop('B',1),how='left').reindex(columns=df.columns)




                Method 3 recreate the list



                pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
                Out[488]:
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2


                If more than two columns



                s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
                s.merge(df,left_on=0,right_index=True)
                Out[491]:
                0 1 A B
                0 0 1 1 [1, 2]
                1 0 2 1 [1, 2]
                2 1 1 2 [1, 2]
                3 1 2 2 [1, 2]




                Method 4 using reindex or loc



                df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
                Out[554]:
                A B
                0 1 1
                0 1 2
                1 2 1
                1 2 2

                #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))


                Method 5 when the list only contains unique values:



                df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
                from collections import ChainMap
                d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
                pd.DataFrame(list(d.items()),columns=df.columns[::-1])
                Out[574]:
                B A
                0 1 1
                1 2 1
                2 3 2
                3 4 2


                Method 6 using numpy for high performance:



                newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
                pd.DataFrame(data=newvalues[0],columns=df.columns)
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2




                Method 7 : using base function itertools cycle and chain: Pure python solution just for fun



                from itertools import cycle,chain
                l=df.values.tolist()
                l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
                pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2




                Special case (two columns type object)



                df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
                df
                Out[592]:
                A B C
                0 1 [1, 2] [1, 2]
                1 2 [3, 4] [3, 4]




                Self-def function



                def unnesting(df, explode):
                idx=df.index.repeat(df[explode[0]].str.len())
                df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
                df1.index=idx
                return df1.join(df.drop(explode,1),how='left')

                unnesting(df,['B','C'])
                Out[609]:
                B C A
                0 1 1 1
                0 2 2 1
                1 3 3 2
                1 4 4 2




                Summary :



                I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer , since he is using numpy and most of the time numpy is faster . I recommend Cpython and numba if speed matters in your case.






                share|improve this answer

















                As an user with both R and python, I have seen this type of question a couple of times.





                In R, they have the built-in function from package tidyr called unnest. But in Python(pandas) there is no built-in function for this type of question.





                I know object columns type always make the data hard to convert with a pandas' function. When I received the data like this , the first thing that came to mind was to 'flatten' or unnest the columns .





                Method 1
                apply + pd.Series (easy to understand but in terms of performance not recommended . )



                df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
                Out[463]:
                A B
                0 1 1
                1 1 2
                0 2 1
                1 2 2




                Method 2 using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )



                df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
                df
                Out[465]:
                A B
                0 1 1
                0 1 2
                1 2 1
                1 2 2


                Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .



                Solution : join or merge with the index after 'unnest' the single columns



                s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
                s.join(df.drop('B',1),how='left')
                Out[477]:
                B A
                0 1 1
                0 2 1
                1 1 2
                1 2 2


                If you need the column order exactly the same as before, add reindex at the end.



                s.join(df.drop('B',1),how='left').reindex(columns=df.columns)




                Method 3 recreate the list



                pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
                Out[488]:
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2


                If more than two columns



                s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
                s.merge(df,left_on=0,right_index=True)
                Out[491]:
                0 1 A B
                0 0 1 1 [1, 2]
                1 0 2 1 [1, 2]
                2 1 1 2 [1, 2]
                3 1 2 2 [1, 2]




                Method 4 using reindex or loc



                df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
                Out[554]:
                A B
                0 1 1
                0 1 2
                1 2 1
                1 2 2

                #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))


                Method 5 when the list only contains unique values:



                df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
                from collections import ChainMap
                d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
                pd.DataFrame(list(d.items()),columns=df.columns[::-1])
                Out[574]:
                B A
                0 1 1
                1 2 1
                2 3 2
                3 4 2


                Method 6 using numpy for high performance:



                newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
                pd.DataFrame(data=newvalues[0],columns=df.columns)
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2




                Method 7 : using base function itertools cycle and chain: Pure python solution just for fun



                from itertools import cycle,chain
                l=df.values.tolist()
                l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
                pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2




                Special case (two columns type object)



                df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
                df
                Out[592]:
                A B C
                0 1 [1, 2] [1, 2]
                1 2 [3, 4] [3, 4]




                Self-def function



                def unnesting(df, explode):
                idx=df.index.repeat(df[explode[0]].str.len())
                df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
                df1.index=idx
                return df1.join(df.drop(explode,1),how='left')

                unnesting(df,['B','C'])
                Out[609]:
                B C A
                0 1 1 1
                0 2 2 1
                1 3 3 2
                1 4 4 2




                Summary :



                I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer , since he is using numpy and most of the time numpy is faster . I recommend Cpython and numba if speed matters in your case.







                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited Dec 26 '18 at 15:24









                Jack Moody

                640621




                640621










                answered Nov 9 '18 at 2:20









                Wen-BenWen-Ben

                111k83266




                111k83266

























                    16





                    +100









                    Option 1



                    If all of the sublists in the other column are the same length, numpy can be an efficient option here:



                    vals = np.array(df.B.values.tolist())    
                    a = np.repeat(df.A, vals.shape[1])

                    pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)




                       A  B
                    0 1 1
                    1 1 2
                    2 2 1
                    3 2 2




                    Option 2



                    If the sublists have different length, you need an additional step:



                    vals = df.B.values.tolist()
                    rs = [len(r) for r in vals]
                    a = np.repeat(df.A, rs)

                    pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)




                       A  B
                    0 1 1
                    1 1 2
                    2 2 1
                    3 2 2




                    Option 3



                    I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:



                    df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                    'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})




                       A          B          C  D
                    0 1 [1, 2] [1, 2, 3] A
                    1 2 [1, 2, 3] [1, 2] B
                    2 3 [1] [1, 2] C




                    def unnest(df, tile, explode):
                    vals = df[explode].sum(1)
                    rs = [len(r) for r in vals]
                    a = np.repeat(df[tile].values, rs, axis=0)
                    b = np.concatenate(vals.values)
                    d = np.column_stack((a, b))
                    return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

                    unnest(df, ['A', 'D'], ['B', 'C'])




                        A  D B_C
                    0 1 A 1
                    1 1 A 2
                    2 1 A 1
                    3 1 A 2
                    4 1 A 3
                    5 2 B 1
                    6 2 B 2
                    7 2 B 3
                    8 2 B 1
                    9 2 B 2
                    10 3 C 1
                    11 3 C 1
                    12 3 C 2




                    Functions



                    def wen1(df):
                    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

                    def wen2(df):
                    return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

                    def wen3(df):
                    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
                    return s.join(df.drop('B', 1), how='left')

                    def wen4(df):
                    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

                    def chris1(df):
                    vals = np.array(df.B.values.tolist())
                    a = np.repeat(df.A, vals.shape[1])
                    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

                    def chris2(df):
                    vals = df.B.values.tolist()
                    rs = [len(r) for r in vals]
                    a = np.repeat(df.A.values, rs)
                    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)


                    Timings



                    import pandas as pd
                    import matplotlib.pyplot as plt
                    import numpy as np
                    from timeit import timeit

                    res = pd.DataFrame(
                    index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
                    columns=[10, 50, 100, 500, 1000, 5000, 10000],
                    dtype=float
                    )

                    for f in res.index:
                    for c in res.columns:
                    df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                    df = pd.concat([df]*c)
                    stmt = '{}(df)'.format(f)
                    setp = 'from __main__ import df, {}'.format(f)
                    res.at[f, c] = timeit(stmt, setp, number=50)

                    ax = res.div(res.min()).T.plot(loglog=True)
                    ax.set_xlabel("N")
                    ax.set_ylabel("time (relative)")


                    Performance



                    enter image description here






                    share|improve this answer






























                      16





                      +100









                      Option 1



                      If all of the sublists in the other column are the same length, numpy can be an efficient option here:



                      vals = np.array(df.B.values.tolist())    
                      a = np.repeat(df.A, vals.shape[1])

                      pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)




                         A  B
                      0 1 1
                      1 1 2
                      2 2 1
                      3 2 2




                      Option 2



                      If the sublists have different length, you need an additional step:



                      vals = df.B.values.tolist()
                      rs = [len(r) for r in vals]
                      a = np.repeat(df.A, rs)

                      pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)




                         A  B
                      0 1 1
                      1 1 2
                      2 2 1
                      3 2 2




                      Option 3



                      I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:



                      df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                      'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})




                         A          B          C  D
                      0 1 [1, 2] [1, 2, 3] A
                      1 2 [1, 2, 3] [1, 2] B
                      2 3 [1] [1, 2] C




                      def unnest(df, tile, explode):
                      vals = df[explode].sum(1)
                      rs = [len(r) for r in vals]
                      a = np.repeat(df[tile].values, rs, axis=0)
                      b = np.concatenate(vals.values)
                      d = np.column_stack((a, b))
                      return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

                      unnest(df, ['A', 'D'], ['B', 'C'])




                          A  D B_C
                      0 1 A 1
                      1 1 A 2
                      2 1 A 1
                      3 1 A 2
                      4 1 A 3
                      5 2 B 1
                      6 2 B 2
                      7 2 B 3
                      8 2 B 1
                      9 2 B 2
                      10 3 C 1
                      11 3 C 1
                      12 3 C 2




                      Functions



                      def wen1(df):
                      return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

                      def wen2(df):
                      return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

                      def wen3(df):
                      s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
                      return s.join(df.drop('B', 1), how='left')

                      def wen4(df):
                      return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

                      def chris1(df):
                      vals = np.array(df.B.values.tolist())
                      a = np.repeat(df.A, vals.shape[1])
                      return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

                      def chris2(df):
                      vals = df.B.values.tolist()
                      rs = [len(r) for r in vals]
                      a = np.repeat(df.A.values, rs)
                      return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)


                      Timings



                      import pandas as pd
                      import matplotlib.pyplot as plt
                      import numpy as np
                      from timeit import timeit

                      res = pd.DataFrame(
                      index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
                      columns=[10, 50, 100, 500, 1000, 5000, 10000],
                      dtype=float
                      )

                      for f in res.index:
                      for c in res.columns:
                      df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                      df = pd.concat([df]*c)
                      stmt = '{}(df)'.format(f)
                      setp = 'from __main__ import df, {}'.format(f)
                      res.at[f, c] = timeit(stmt, setp, number=50)

                      ax = res.div(res.min()).T.plot(loglog=True)
                      ax.set_xlabel("N")
                      ax.set_ylabel("time (relative)")


                      Performance



                      enter image description here






                      share|improve this answer




























                        16





                        +100







                        16





                        +100



                        16




                        +100





                        Option 1



                        If all of the sublists in the other column are the same length, numpy can be an efficient option here:



                        vals = np.array(df.B.values.tolist())    
                        a = np.repeat(df.A, vals.shape[1])

                        pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)




                           A  B
                        0 1 1
                        1 1 2
                        2 2 1
                        3 2 2




                        Option 2



                        If the sublists have different length, you need an additional step:



                        vals = df.B.values.tolist()
                        rs = [len(r) for r in vals]
                        a = np.repeat(df.A, rs)

                        pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)




                           A  B
                        0 1 1
                        1 1 2
                        2 2 1
                        3 2 2




                        Option 3



                        I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:



                        df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                        'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})




                           A          B          C  D
                        0 1 [1, 2] [1, 2, 3] A
                        1 2 [1, 2, 3] [1, 2] B
                        2 3 [1] [1, 2] C




                        def unnest(df, tile, explode):
                        vals = df[explode].sum(1)
                        rs = [len(r) for r in vals]
                        a = np.repeat(df[tile].values, rs, axis=0)
                        b = np.concatenate(vals.values)
                        d = np.column_stack((a, b))
                        return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

                        unnest(df, ['A', 'D'], ['B', 'C'])




                            A  D B_C
                        0 1 A 1
                        1 1 A 2
                        2 1 A 1
                        3 1 A 2
                        4 1 A 3
                        5 2 B 1
                        6 2 B 2
                        7 2 B 3
                        8 2 B 1
                        9 2 B 2
                        10 3 C 1
                        11 3 C 1
                        12 3 C 2




                        Functions



                        def wen1(df):
                        return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

                        def wen2(df):
                        return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

                        def wen3(df):
                        s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
                        return s.join(df.drop('B', 1), how='left')

                        def wen4(df):
                        return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

                        def chris1(df):
                        vals = np.array(df.B.values.tolist())
                        a = np.repeat(df.A, vals.shape[1])
                        return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

                        def chris2(df):
                        vals = df.B.values.tolist()
                        rs = [len(r) for r in vals]
                        a = np.repeat(df.A.values, rs)
                        return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)


                        Timings



                        import pandas as pd
                        import matplotlib.pyplot as plt
                        import numpy as np
                        from timeit import timeit

                        res = pd.DataFrame(
                        index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
                        columns=[10, 50, 100, 500, 1000, 5000, 10000],
                        dtype=float
                        )

                        for f in res.index:
                        for c in res.columns:
                        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                        df = pd.concat([df]*c)
                        stmt = '{}(df)'.format(f)
                        setp = 'from __main__ import df, {}'.format(f)
                        res.at[f, c] = timeit(stmt, setp, number=50)

                        ax = res.div(res.min()).T.plot(loglog=True)
                        ax.set_xlabel("N")
                        ax.set_ylabel("time (relative)")


                        Performance



                        enter image description here






                        share|improve this answer















                        Option 1



                        If all of the sublists in the other column are the same length, numpy can be an efficient option here:



                        vals = np.array(df.B.values.tolist())    
                        a = np.repeat(df.A, vals.shape[1])

                        pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)




                           A  B
                        0 1 1
                        1 1 2
                        2 2 1
                        3 2 2




                        Option 2



                        If the sublists have different length, you need an additional step:



                        vals = df.B.values.tolist()
                        rs = [len(r) for r in vals]
                        a = np.repeat(df.A, rs)

                        pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)




                           A  B
                        0 1 1
                        1 1 2
                        2 2 1
                        3 2 2




                        Option 3



                        I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:



                        df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                        'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})




                           A          B          C  D
                        0 1 [1, 2] [1, 2, 3] A
                        1 2 [1, 2, 3] [1, 2] B
                        2 3 [1] [1, 2] C




                        def unnest(df, tile, explode):
                        vals = df[explode].sum(1)
                        rs = [len(r) for r in vals]
                        a = np.repeat(df[tile].values, rs, axis=0)
                        b = np.concatenate(vals.values)
                        d = np.column_stack((a, b))
                        return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

                        unnest(df, ['A', 'D'], ['B', 'C'])




                            A  D B_C
                        0 1 A 1
                        1 1 A 2
                        2 1 A 1
                        3 1 A 2
                        4 1 A 3
                        5 2 B 1
                        6 2 B 2
                        7 2 B 3
                        8 2 B 1
                        9 2 B 2
                        10 3 C 1
                        11 3 C 1
                        12 3 C 2




                        Functions



                        def wen1(df):
                        return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

                        def wen2(df):
                        return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

                        def wen3(df):
                        s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
                        return s.join(df.drop('B', 1), how='left')

                        def wen4(df):
                        return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

                        def chris1(df):
                        vals = np.array(df.B.values.tolist())
                        a = np.repeat(df.A, vals.shape[1])
                        return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

                        def chris2(df):
                        vals = df.B.values.tolist()
                        rs = [len(r) for r in vals]
                        a = np.repeat(df.A.values, rs)
                        return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)


                        Timings



                        import pandas as pd
                        import matplotlib.pyplot as plt
                        import numpy as np
                        from timeit import timeit

                        res = pd.DataFrame(
                        index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
                        columns=[10, 50, 100, 500, 1000, 5000, 10000],
                        dtype=float
                        )

                        for f in res.index:
                        for c in res.columns:
                        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                        df = pd.concat([df]*c)
                        stmt = '{}(df)'.format(f)
                        setp = 'from __main__ import df, {}'.format(f)
                        res.at[f, c] = timeit(stmt, setp, number=50)

                        ax = res.div(res.min()).T.plot(loglog=True)
                        ax.set_xlabel("N")
                        ax.set_ylabel("time (relative)")


                        Performance



                        enter image description here







                        share|improve this answer














                        share|improve this answer



                        share|improve this answer








                        edited Nov 9 '18 at 4:15

























                        answered Nov 9 '18 at 2:35









                        user3483203user3483203

                        31.2k82655




                        31.2k82655























                            7














                            One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:



                            import numpy as np
                            import pandas as pd


                            def unnest(frame, explode):
                            def mesh(values):
                            return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

                            data = np.vstack(mesh(row) for row in frame[explode].values)
                            return pd.DataFrame(data=data, columns=explode)


                            df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                            print(unnest(df, ['A', 'B'])) # base
                            print()

                            df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
                            print(unnest(df, ['A', 'B', 'C'])) # multiple columns
                            print()

                            df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                            'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

                            print(unnest(df, ['A', 'B'])) # uneven length lists
                            print()
                            print(unnest(df, ['D', 'B'])) # different types
                            print()


                            Output



                               A  B
                            0 1 1
                            1 1 2
                            2 2 1
                            3 2 2

                            A B C
                            0 1 1 1
                            1 1 2 1
                            2 1 1 2
                            3 1 2 2
                            4 2 3 3
                            5 2 4 3
                            6 2 3 4
                            7 2 4 4

                            A B
                            0 1 1
                            1 1 2
                            2 2 1
                            3 2 2
                            4 2 3
                            5 3 1

                            D B
                            0 A 1
                            1 A 2
                            2 B 1
                            3 B 2
                            4 B 3
                            5 C 1





                            share|improve this answer




























                              7














                              One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:



                              import numpy as np
                              import pandas as pd


                              def unnest(frame, explode):
                              def mesh(values):
                              return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

                              data = np.vstack(mesh(row) for row in frame[explode].values)
                              return pd.DataFrame(data=data, columns=explode)


                              df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                              print(unnest(df, ['A', 'B'])) # base
                              print()

                              df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
                              print(unnest(df, ['A', 'B', 'C'])) # multiple columns
                              print()

                              df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                              'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

                              print(unnest(df, ['A', 'B'])) # uneven length lists
                              print()
                              print(unnest(df, ['D', 'B'])) # different types
                              print()


                              Output



                                 A  B
                              0 1 1
                              1 1 2
                              2 2 1
                              3 2 2

                              A B C
                              0 1 1 1
                              1 1 2 1
                              2 1 1 2
                              3 1 2 2
                              4 2 3 3
                              5 2 4 3
                              6 2 3 4
                              7 2 4 4

                              A B
                              0 1 1
                              1 1 2
                              2 2 1
                              3 2 2
                              4 2 3
                              5 3 1

                              D B
                              0 A 1
                              1 A 2
                              2 B 1
                              3 B 2
                              4 B 3
                              5 C 1





                              share|improve this answer


























                                7












                                7








                                7







                                One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:



                                import numpy as np
                                import pandas as pd


                                def unnest(frame, explode):
                                def mesh(values):
                                return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

                                data = np.vstack(mesh(row) for row in frame[explode].values)
                                return pd.DataFrame(data=data, columns=explode)


                                df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                                print(unnest(df, ['A', 'B'])) # base
                                print()

                                df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
                                print(unnest(df, ['A', 'B', 'C'])) # multiple columns
                                print()

                                df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                                'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

                                print(unnest(df, ['A', 'B'])) # uneven length lists
                                print()
                                print(unnest(df, ['D', 'B'])) # different types
                                print()


                                Output



                                   A  B
                                0 1 1
                                1 1 2
                                2 2 1
                                3 2 2

                                A B C
                                0 1 1 1
                                1 1 2 1
                                2 1 1 2
                                3 1 2 2
                                4 2 3 3
                                5 2 4 3
                                6 2 3 4
                                7 2 4 4

                                A B
                                0 1 1
                                1 1 2
                                2 2 1
                                3 2 2
                                4 2 3
                                5 3 1

                                D B
                                0 A 1
                                1 A 2
                                2 B 1
                                3 B 2
                                4 B 3
                                5 C 1





                                share|improve this answer













                                One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:



                                import numpy as np
                                import pandas as pd


                                def unnest(frame, explode):
                                def mesh(values):
                                return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

                                data = np.vstack(mesh(row) for row in frame[explode].values)
                                return pd.DataFrame(data=data, columns=explode)


                                df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                                print(unnest(df, ['A', 'B'])) # base
                                print()

                                df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
                                print(unnest(df, ['A', 'B', 'C'])) # multiple columns
                                print()

                                df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                                'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

                                print(unnest(df, ['A', 'B'])) # uneven length lists
                                print()
                                print(unnest(df, ['D', 'B'])) # different types
                                print()


                                Output



                                   A  B
                                0 1 1
                                1 1 2
                                2 2 1
                                3 2 2

                                A B C
                                0 1 1 1
                                1 1 2 1
                                2 1 1 2
                                3 1 2 2
                                4 2 3 3
                                5 2 4 3
                                6 2 3 4
                                7 2 4 4

                                A B
                                0 1 1
                                1 1 2
                                2 2 1
                                3 2 2
                                4 2 3
                                5 3 1

                                D B
                                0 A 1
                                1 A 2
                                2 B 1
                                3 B 2
                                4 B 3
                                5 C 1






                                share|improve this answer












                                share|improve this answer



                                share|improve this answer










                                answered Dec 1 '18 at 1:31









                                Daniel MesejoDaniel Mesejo

                                18.4k21432




                                18.4k21432























                                    1














                                    My 5 cents:



                                    df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

                                    df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
                                    ignore_index=True)


                                    and another 5



                                    df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

                                    (pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
                                    .reset_index(level=1, drop=True)
                                    .reset_index())


                                    both resulting in the same



                                       A  B
                                    0 1 1
                                    1 2 1
                                    2 1 2
                                    3 2 2





                                    share|improve this answer






























                                      1














                                      My 5 cents:



                                      df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

                                      df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
                                      ignore_index=True)


                                      and another 5



                                      df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

                                      (pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
                                      .reset_index(level=1, drop=True)
                                      .reset_index())


                                      both resulting in the same



                                         A  B
                                      0 1 1
                                      1 2 1
                                      2 1 2
                                      3 2 2





                                      share|improve this answer




























                                        1












                                        1








                                        1







                                        My 5 cents:



                                        df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

                                        df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
                                        ignore_index=True)


                                        and another 5



                                        df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

                                        (pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
                                        .reset_index(level=1, drop=True)
                                        .reset_index())


                                        both resulting in the same



                                           A  B
                                        0 1 1
                                        1 2 1
                                        2 1 2
                                        3 2 2





                                        share|improve this answer















                                        My 5 cents:



                                        df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

                                        df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
                                        ignore_index=True)


                                        and another 5



                                        df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

                                        (pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
                                        .reset_index(level=1, drop=True)
                                        .reset_index())


                                        both resulting in the same



                                           A  B
                                        0 1 1
                                        1 2 1
                                        2 1 2
                                        3 2 2






                                        share|improve this answer














                                        share|improve this answer



                                        share|improve this answer








                                        edited Dec 11 '18 at 3:50

























                                        answered Dec 11 '18 at 2:05









                                        ayorgoayorgo

                                        1,130714




                                        1,130714























                                            -1














                                            Something pretty not recommended (at least work in this case):



                                            df=pd.concat([df]*2).sort_index()
                                            it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
                                            df['B']=df['B'].apply(lambda x:next(it))


                                            concat + sort_index + iter + apply + next.



                                            Now:



                                            print(df)


                                            Is:



                                               A  B
                                            0 1 1
                                            0 1 2
                                            1 2 1
                                            1 2 2


                                            If care about index:



                                            df=df.reset_index(drop=True)


                                            Now:



                                            print(df)


                                            Is:



                                               A  B
                                            0 1 1
                                            1 1 2
                                            2 2 1
                                            3 2 2





                                            share|improve this answer




























                                              -1














                                              Something pretty not recommended (at least work in this case):



                                              df=pd.concat([df]*2).sort_index()
                                              it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
                                              df['B']=df['B'].apply(lambda x:next(it))


                                              concat + sort_index + iter + apply + next.



                                              Now:



                                              print(df)


                                              Is:



                                                 A  B
                                              0 1 1
                                              0 1 2
                                              1 2 1
                                              1 2 2


                                              If care about index:



                                              df=df.reset_index(drop=True)


                                              Now:



                                              print(df)


                                              Is:



                                                 A  B
                                              0 1 1
                                              1 1 2
                                              2 2 1
                                              3 2 2





                                              share|improve this answer


























                                                -1












                                                -1








                                                -1







                                                Something pretty not recommended (at least work in this case):



                                                df=pd.concat([df]*2).sort_index()
                                                it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
                                                df['B']=df['B'].apply(lambda x:next(it))


                                                concat + sort_index + iter + apply + next.



                                                Now:



                                                print(df)


                                                Is:



                                                   A  B
                                                0 1 1
                                                0 1 2
                                                1 2 1
                                                1 2 2


                                                If care about index:



                                                df=df.reset_index(drop=True)


                                                Now:



                                                print(df)


                                                Is:



                                                   A  B
                                                0 1 1
                                                1 1 2
                                                2 2 1
                                                3 2 2





                                                share|improve this answer













                                                Something pretty not recommended (at least work in this case):



                                                df=pd.concat([df]*2).sort_index()
                                                it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
                                                df['B']=df['B'].apply(lambda x:next(it))


                                                concat + sort_index + iter + apply + next.



                                                Now:



                                                print(df)


                                                Is:



                                                   A  B
                                                0 1 1
                                                0 1 2
                                                1 2 1
                                                1 2 2


                                                If care about index:



                                                df=df.reset_index(drop=True)


                                                Now:



                                                print(df)


                                                Is:



                                                   A  B
                                                0 1 1
                                                1 1 2
                                                2 2 1
                                                3 2 2






                                                share|improve this answer












                                                share|improve this answer



                                                share|improve this answer










                                                answered Nov 9 '18 at 2:40









                                                U9-ForwardU9-Forward

                                                15.3k41439




                                                15.3k41439






























                                                    draft saved

                                                    draft discarded




















































                                                    Thanks for contributing an answer to Stack Overflow!


                                                    • Please be sure to answer the question. Provide details and share your research!

                                                    But avoid



                                                    • Asking for help, clarification, or responding to other answers.

                                                    • Making statements based on opinion; back them up with references or personal experience.


                                                    To learn more, see our tips on writing great answers.




                                                    draft saved


                                                    draft discarded














                                                    StackExchange.ready(
                                                    function () {
                                                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53218931%2fhow-do-i-unnest-explode-a-column-in-a-pandas-dataframe%23new-answer', 'question_page');
                                                    }
                                                    );

                                                    Post as a guest















                                                    Required, but never shown





















































                                                    Required, but never shown














                                                    Required, but never shown












                                                    Required, but never shown







                                                    Required, but never shown

































                                                    Required, but never shown














                                                    Required, but never shown












                                                    Required, but never shown







                                                    Required, but never shown







                                                    Popular posts from this blog

                                                    'app-layout' is not a known element: how to share Component with different Modules

                                                    android studio warns about leanback feature tag usage required on manifest while using Unity exported app?

                                                    WPF add header to Image with URL pettitions [duplicate]