Different groupers for each column with pandas GroupBy

How could I use a multidimensional Grouper, in this case another dataframe, as a Grouper for another dataframe? Can it be done in one step?



My question is essentially regarding how to perform an actual grouping under these circumstances, but to make it more specific, say I want to then transform and take the sum.



Consider for example:



import pandas as pd

df1 = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]})

print(df1)
   a  b
0  1  5
1  2  6
2  3  7
3  4  8

df2 = pd.DataFrame({'a':['A','B','A','B'], 'b':['A','A','B','B']})

print(df2)
   a  b
0  A  A
1  B  A
2  A  B
3  B  B


Then, the expected output would be:



   a   b
0  4  11
1  6  11
2  4  15
3  6  15


Where columns a and b in df1 have been grouped by columns a and b from df2 respectively.

  • can you elaborate on the desired output? not clear what the rule is

    – Yuca
    Jan 15 at 16:08

  • Sure, added a brief explanation. Let me know if still not clear

    – yatu
    Jan 15 at 16:09

  • What do you group by? Your output has the same number of rows and columns as the input.

    – Zoe
    Jan 15 at 16:11

  • So you are grouping rows 1 and 3 in df1 because rows 1 and 3 are grouped in df2, correct?

    – Yuca
    Jan 15 at 16:12

  • Yes, that is correct. The resulting df has the same shape as df1, with the sum of the grouped values.

    – yatu
    Jan 15 at 16:12

5 Answers

Answer by coldspeed (5 votes)

You will have to group each column individually since each column uses a different grouping scheme.



If you want a cleaner version, I would recommend a list comprehension over the column names, and call pd.concat on the resultant series:



pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)

   a   b
0  4  11
1  6  11
2  4  15
3  6  15


Not to say there's anything wrong with using apply as in the other answer, just that I don't like apply, so this is my suggestion :-)





Here are some timeits for your perusal. Even just for your sample data, you will notice the difference in timings.



%%timeit
(df1.stack()
    .groupby([df2.stack().index.get_level_values(level=1), df2.stack()])
    .transform('sum').unstack())

%%timeit
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))

%%timeit
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)

8.99 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.35 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.13 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Not to say apply is slow, but explicit iteration in this case is faster. Additionally, you will notice that the second and third timed solutions scale better as the data grows in length rather than in breadth, since the number of iterations depends on the number of columns.
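
To illustrate that scaling claim, here is a minimal timing sketch of my own (not part of the answer), using an assumed tall, two-column frame; the exact numbers will of course vary by machine:

import numpy as np
import pandas as pd
from timeit import timeit

# Assumed setup: 100,000 rows but only two columns.
n = 100_000
df1 = pd.DataFrame(np.random.rand(n, 2), columns=['a', 'b'])
df2 = pd.DataFrame(np.random.choice(list('AB'), (n, 2)), columns=['a', 'b'])

stacked = lambda: (df1.stack()
                      .groupby([df2.stack().index.get_level_values(1), df2.stack()])
                      .transform('sum').unstack())
applied = lambda: df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
concat = lambda: pd.concat([df1[c].groupby(df2[c]).transform('sum')
                            for c in df1.columns], axis=1)

# apply and concat only iterate over the two columns, so the column count,
# not the row count, drives their Python-level looping overhead.
for name, fn in [('stack', stacked), ('apply', applied), ('concat', concat)]:
    print(name, timeit(fn, number=5))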






  • @ScottBoston I have already upvoted your answer for its simplicity B)

    – coldspeed
    Jan 15 at 17:21

  • Thanks!! Yes, using a list comprehension with pd.concat was what I had in mind, was curious to know whether looping could be avoided. Nice to see other alternatives here too though. And thanks for the timeits :)

    – yatu
    Jan 15 at 17:26

Answer by Scott Boston (7 votes)

Try using apply to apply a lambda function to each column of your dataframe, then use the name of that pd.Series to pick the matching column of the second dataframe to group by:



df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))


Output:



   a   b
0  4  11
1  6  11
2  4  15
3  6  15





  • Nice! Guessing it cannot directly be done using GroupBy rather than applying along columns, right? Nice alternative in any case

    – yatu
    Jan 15 at 16:22

  • No, I don't think you can apply two different groupings to a dataframe based on a column.

    – Scott Boston
    Jan 15 at 16:37

  • Ok, thanks. Will leave it for some time to see if I get any other answers, otherwise will accept.

    – yatu
    Jan 15 at 16:38

Answer by Wen-Ben (4 votes)

Using stack and unstack



df1.stack().groupby([df2.stack().index.get_level_values(level=1), df2.stack()]).transform('sum').unstack()
Out[291]:
   a   b
0  4  11
1  6  11
2  4  15
3  6  15
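
To unpack what the one-liner does, here is the same computation split into steps (my own breakdown, not part of the original answer):

s1 = df1.stack()                    # df1 as a long Series indexed by (row, column)
s2 = df2.stack()                    # the matching Series of group labels
col = s2.index.get_level_values(1)  # the original column name of each element

# Sum within each (column name, label) pair, then pivot back to the wide shape.
summed = s1.groupby([col, s2]).transform('sum')
result = summed.unstack()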





  • Thanks @W-B interesting approach!!

    – yatu
    Jan 15 at 17:30

Answer by user3483203 (2 votes)

I'm going to propose a (mostly) numpythonic solution that uses a scipy sparse matrix to perform a vectorized groupby on the entire DataFrame at once, rather than column by column.





The key to performing this operation efficiently is finding a performant way to factorize the entire DataFrame, while avoiding duplicates in any columns. Since your groups are represented by strings, you can simply concatenate the column
name on the end of each value (since columns should be unique), and then factorize the result, like so [*]



>>> df2 + df2.columns
    a   b
0  Aa  Ab
1  Ba  Ab
2  Aa  Bb
3  Ba  Bb

>>> pd.factorize((df2 + df2.columns).values.ravel())
(array([0, 1, 2, 1, 0, 3, 2, 3], dtype=int64),
 array(['Aa', 'Ab', 'Ba', 'Bb'], dtype=object))




Once we have a unique grouping, we can utilize our scipy.sparse matrix to perform a groupby in a single pass on the flattened arrays, and then use advanced indexing and a reshaping operation to convert the result back to the original shape.



import numpy as np
from scipy import sparse

a = df1.values.ravel()                                    # flattened data
b, _ = pd.factorize((df2 + df2.columns).values.ravel())   # integer group id per element

# One-pass group sums: a CSR matrix with one entry per element, summed down each column.
o = sparse.csr_matrix(
    (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
).sum(0).A1

res = o[b].reshape(df1.shape)                             # broadcast the sums back to the original shape




array([[ 4, 11],
       [ 6, 11],
       [ 4, 15],
       [ 6, 15]], dtype=int64)




Performance



Functions



def gp_chris(f1, f2):
    a = f1.values.ravel()
    b, _ = pd.factorize((f2 + f2.columns).values.ravel())

    o = sparse.csr_matrix(
        (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
    ).sum(0).A1

    return pd.DataFrame(o[b].reshape(f1.shape), columns=f1.columns)


def gp_cs(f1, f2):
    return pd.concat([f1[c].groupby(f2[c]).transform('sum') for c in f1.columns], axis=1)


def gp_scott(f1, f2):
    return f1.apply(lambda x: x.groupby(f2[x.name]).transform('sum'))


def gp_wen(f1, f2):
    return f1.stack().groupby([f2.stack().index.get_level_values(level=1), f2.stack()]).transform('sum').unstack()


Setup



import numpy as np
from scipy import sparse
import pandas as pd
import string
from timeit import timeit
import matplotlib.pyplot as plt

res = pd.DataFrame(
    index=[f'gp_{f}' for f in ('chris', 'cs', 'scott', 'wen')],
    columns=[10, 50, 100, 200, 400],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        df1 = pd.DataFrame(np.random.rand(c, c))
        df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (c, c)))
        df1.columns = df1.columns.astype(str)
        df2.columns = df2.columns.astype(str)

        stmt = '{}(df1, df2)'.format(f)
        setp = 'from __main__ import df1, df2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

plt.show()


Results



[benchmark plot omitted: relative time vs. N on log-log axes]





Validation



df1 = pd.DataFrame(np.random.rand(10, 10))
df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (10, 10)))
df1.columns = df1.columns.astype(str)
df2.columns = df2.columns.astype(str)

v = np.stack([gp_chris(df1, df2), gp_cs(df1, df2), gp_scott(df1, df2), gp_wen(df1, df2)])
print(np.all(v[:-1] == v[1:]))




True


Either we're all wrong or we're all correct :)





[*] There is a possibility that you could get a duplicate value here if, before concatenation, one item already equals another item with its column name appended. However, if this is the case, you shouldn't need to adjust much to fix it.
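
For instance, a minimal sketch of one such fix (my assumption, not from the original answer) is to join each value and its column name with a separator that is assumed never to occur in the data, so the combined keys cannot collide:

# Assumed fix: '\x00' is taken to never appear in df2's values.
sep = '\x00'
keys = df2 + sep + df2.columns            # element-wise string concatenation, per column
b, _ = pd.factorize(keys.values.ravel())  # same factorization as before, collision-free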






  • Thanks for the final validation :-). This approach is pretty interesting and surprisingly fast compared to the others. Also, @W-B's approach is quite a bit faster than the others for larger numbers of columns.

    – yatu
    Jan 15 at 20:45

  • Correct. Both Wen's and my approaches have more overhead at the start (stacking and reshaping), but since we perform our groupby operation a single time, it scales better.

    – user3483203
    Jan 15 at 20:48

  • I personally wouldn't change the accepted answer. This approach has a couple of caveats with factorizing the DataFrame, and it only works for certain operations on a DataFrame; count, for example, wouldn't work. Coldspeed's is much more general.

    – user3483203
    Jan 15 at 20:50

  • Okay, thanks for pointing out that it is not as general. Definitely worth having as a possible solution for whenever it can be applied and performance is an issue.

    – yatu
    Jan 15 at 20:55

Answer by PMende (0 votes)

You could do something like the following:



# Chain two .assign calls (wrapped in parentheses so the chain can span lines),
# overwriting 'a' and 'b' with their group-wise sums.
res = (df1.assign(a=lambda df: df['a'].groupby(df2['a']).transform('sum'))
          .assign(b=lambda df: df['b'].groupby(df2['b']).transform('sum')))


Results:



   a   b
0  4  11
1  6  11
2  4  15
3  6  15




