Different groupers for each column with pandas GroupBy
How could I use a multidimensional Grouper, in this case another dataframe, as a Grouper for another dataframe? Can it be done in one step?
My question is essentially regarding how to perform an actual grouping under these circumstances, but to make it more specific, say I want to then transform and take the sum.
Consider for example:
df1 = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]})
print(df1)
a b
0 1 5
1 2 6
2 3 7
3 4 8
df2 = pd.DataFrame({'a':['A','B','A','B'], 'b':['A','A','B','B']})
print(df2)
a b
0 A A
1 B A
2 A B
3 B B
Then, the expected output would be:
a b
0 4 11
1 6 11
2 4 15
3 6 15
Here, columns a and b in df1 have been grouped by columns a and b from df2, respectively.
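To make the rule concrete on a single column: in df2['a'], rows 0 and 2 carry group 'A' and rows 1 and 3 carry group 'B', so the summed-and-broadcast values for df1['a'] are 1+3=4 and 2+4=6. A minimal per-column sketch of exactly this (the question is whether it can be done for all columns at once):
df1['a'].groupby(df2['a']).transform('sum')
# 0    4
# 1    6
# 2    4
# 3    6
# Name: a, dtype: int64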
python pandas group-by pandas-groupby
can you elaborate on the desired output? not clear what the rule is
– Yuca
Jan 15 at 16:08
Sure, added a brief explanation. Let me know if still not clear
– yatu
Jan 15 at 16:09
What do you group by? Your output has the same number of rows and columns as the input.
– Zoe
Jan 15 at 16:11
So you are grouping rows 1 and 3 in df1 because rows 1 and 3 are grouped in df2, correct?
– Yuca
Jan 15 at 16:12
Yes, that is correct. The resulting df has the same shape as df1, with the sum of the grouped values
– yatu
Jan 15 at 16:12
5 Answers
You will have to group each column individually since each column uses a different grouping scheme.
If you want a cleaner version, I would recommend a list comprehension over the column names, and call pd.concat on the resultant series:
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
a b
0 4 11
1 6 11
2 4 15
3 6 15
Not to say there's anything wrong with using apply as in the other answer, just that I don't like apply, so this is my suggestion :-)
Here are some timeits for your perusal. Just for your sample data, you will notice the difference in timings is obvious.
%%timeit
(df1.stack()
.groupby([df2.stack().index.get_level_values(level=1), df2.stack()])
.transform('sum').unstack())
%%timeit
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
%%timeit
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
8.99 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.35 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.13 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Not to say apply is slow, but explicit iteration in this case is faster. Additionally, you will notice that the second and third timed solutions scale better as length grows relative to breadth, since the number of iterations depends on the number of columns.
@ScottBoston I have already upvoted your answer for its simplicity B)
– coldspeed
Jan 15 at 17:21
Thanks!! Yes, using a list comprehension with pd.concat was what I had in mind; I was curious to know whether looping could be avoided. Nice to see other alternatives here too though. And thanks for the timeits :)
– yatu
Jan 15 at 17:26
Try using apply to apply a lambda function to each column of your dataframe, then use the name of that pd.Series to group by the second dataframe:
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
Output:
a b
0 4 11
1 6 11
2 4 15
3 6 15
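(A minimal sketch of why x.name works here, not part of the original answer: apply hands each column to the lambda as a Series whose .name is that column's label, so df2[x.name] picks out the matching grouper column.)
col = df1['a']          # what the lambda receives for the first column
col.name                # 'a'
df2[col.name].tolist()  # ['A', 'B', 'A', 'B'], the grouper used for column 'a'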
Nice! Guessing it cannot be done directly using groupby rather than applying along columns, right? Nice alternative in any case
– yatu
Jan 15 at 16:22
No, I don't think you can apply two different groupings to a dataframe based on a column.
– Scott Boston
Jan 15 at 16:37
1
Ok, thanks. Will leave it for some time to see if I get any other answers, otherwise will accept
– yatu
Jan 15 at 16:38
Using stack and unstack:
df1.stack().groupby([df2.stack().index.get_level_values(level=1),df2.stack()]).transform('sum').unstack()
Out[291]:
a b
0 4 11
1 6 11
2 4 15
3 6 15
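(Sketch of the intermediates, not from the original answer: stack() turns each frame into a Series with a (row, column) MultiIndex, so the group keys are the column label (index level 1) paired with the stacked values of df2.)
s1 = df1.stack()
s2 = df2.stack()
s1.head(4)
# 0  a    1
#    b    5
# 1  a    2
#    b    6
# dtype: int64
s2.head(4)
# 0  a    A
#    b    A
# 1  a    B
#    b    A
# dtype: object
# Grouping s1 by [column label, df2 value], summing, and unstacking restores the 4x2 shape.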
Thanks @W-B interesting approach!!
– yatu
Jan 15 at 17:30
I'm going to propose a (mostly) numpythonic solution that uses a scipy.sparse matrix to perform a vectorized groupby on the entire DataFrame at once, rather than column by column.
The key to performing this operation efficiently is finding a performant way to factorize the entire DataFrame, while avoiding duplicates in any columns. Since your groups are represented by strings, you can simply concatenate the column name on the end of each value (since columns should be unique), and then factorize the result, like so [*]:
>>> df2 + df2.columns
a b
0 Aa Ab
1 Ba Ab
2 Aa Bb
3 Ba Bb
>>> pd.factorize((df2 + df2.columns).values.ravel())
(array([0, 1, 2, 1, 0, 3, 2, 3], dtype=int64),
array(['Aa', 'Ab', 'Ba', 'Bb'], dtype=object))
Once we have a unique grouping, we can utilize our scipy.sparse matrix to perform a groupby in a single pass on the flattened arrays, and use advanced indexing and a reshaping operation to convert the result back to the original shape.
from scipy import sparse
a = df1.values.ravel()
b, _ = pd.factorize((df2 + df2.columns).values.ravel())
o = sparse.csr_matrix(
    (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
).sum(0).A1
res = o[b].reshape(df1.shape)
array([[ 4, 11],
[ 6, 11],
[ 4, 15],
[ 6, 15]], dtype=int64)
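(For the sum case specifically, a hedged alternative of mine, not the answer's code: np.bincount can replace the sparse matrix, reusing the factorized codes b and the flattened values a from above.)
sums = np.bincount(b, weights=a)      # per-group totals, the same reduction the sparse matrix performs
res_alt = sums[b].reshape(df1.shape)  # broadcast back to each element and restore the shape (float dtype)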
Performance
Functions
def gp_chris(f1, f2):
    a = f1.values.ravel()
    b, _ = pd.factorize((f2 + f2.columns).values.ravel())
    o = sparse.csr_matrix(
        (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
    ).sum(0).A1
    return pd.DataFrame(o[b].reshape(f1.shape), columns=f1.columns)

def gp_cs(f1, f2):
    return pd.concat([f1[c].groupby(f2[c]).transform('sum') for c in f1.columns], axis=1)

def gp_scott(f1, f2):
    return f1.apply(lambda x: x.groupby(f2[x.name]).transform('sum'))

def gp_wen(f1, f2):
    return f1.stack().groupby([f2.stack().index.get_level_values(level=1), f2.stack()]).transform('sum').unstack()
Setup
import numpy as np
from scipy import sparse
import pandas as pd
import string
from timeit import timeit
import matplotlib.pyplot as plt
res = pd.DataFrame(
    index=[f'gp_{f}' for f in ('chris', 'cs', 'scott', 'wen')],
    columns=[10, 50, 100, 200, 400],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        df1 = pd.DataFrame(np.random.rand(c, c))
        df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (c, c)))
        df1.columns = df1.columns.astype(str)
        df2.columns = df2.columns.astype(str)
        stmt = '{}(df1, df2)'.format(f)
        setp = 'from __main__ import df1, df2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Results (log-log plot of relative runtime vs. N for the four functions)
Validation
df1 = pd.DataFrame(np.random.rand(10, 10))
df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (10, 10)))
df1.columns = df1.columns.astype(str)
df2.columns = df2.columns.astype(str)
v = np.stack([gp_chris(df1, df2), gp_cs(df1, df2), gp_scott(df1, df2), gp_wen(df1, df2)])
print(np.all(v[:-1] == v[1:]))
True
Either we're all wrong or we're all correct :)
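(One hedged caveat of my own: the random frames hold floats and the four methods may add them in different orders, so an approximate comparison is slightly more robust than exact equality.)
np.allclose(v[:-1], v[1:])  # tolerant of tiny floating-point round-off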
[*] There is a possibility that you could get a duplicate value here if one item is the concatenation of a column name and another item before concatenation occurs. However, if this is the case, you shouldn't need to adjust much to fix it; one option is sketched below.
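(A sketch of one such adjustment, offered as an assumption rather than the answer's code: join each value and its column name with a separator that cannot occur in the group labels before factorizing.)
keys = (df2.astype(str) + '\0' + df2.columns).values.ravel()  # '\0' assumed absent from the labels
b_safe, _ = pd.factorize(keys)  # collision-free codes, usable in place of b above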
Thanks for the final validation :-). This approach is pretty interesting and surprisingly fast compared to the others. Also, @W-B's approach is quite a bit faster than the others for larger numbers of columns
– yatu
Jan 15 at 20:45
Correct. Both Wen's and my approaches have more overhead at the start (stacking and reshaping), but since we perform our groupby operation a single time, it scales better.
– user3483203
Jan 15 at 20:48
I personally wouldn't change the accepted answer. This approach has a couple of caveats with factorizing the DataFrame, and it only works for certain operations on a DataFrame; for example, count wouldn't work. Coldspeed's is much more general.
– user3483203
Jan 15 at 20:50
Okay thanks for pointing out that it is not as general. Definitely worth having as a possible solution for whenever it can be applied and performance is an issue.
– yatu
Jan 15 at 20:55
You could do something like the following:
res = (
    df1.assign(a=lambda df: df['a'].groupby(df2['a']).transform('sum'))
       .assign(b=lambda df: df['b'].groupby(df2['b']).transform('sum'))
)
Results:
a b
0 4 11
1 6 11
2 4 15
3 6 15