Different groupers for each column with pandas GroupBy
How could I use a multidimensional Grouper, in this case another dataframe, as a Grouper for another dataframe? Can it be done in one step?
My question is essentially regarding how to perform an actual grouping under these circumstances, but to make it more specific, say I want to then transform and take the sum.
Consider for example:
df1 = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]})
print(df1)
a b
0 1 5
1 2 6
2 3 7
3 4 8
df2 = pd.DataFrame({'a':['A','B','A','B'], 'b':['A','A','B','B']})
print(df2)
a b
0 A A
1 B A
2 A B
3 B B
Then, the expected output would be:
a b
0 4 11
1 6 11
2 4 15
3 6 15
Here, columns a and b in df1 have been grouped by columns a and b from df2, respectively.
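To make the rule concrete on a single column: in df2['a'], rows 0 and 2 carry group 'A' and rows 1 and 3 carry group 'B', so the summed-and-broadcast values for df1['a'] are 1+3=4 and 2+4=6. A minimal per-column sketch of exactly this (the question is whether it can be done for all columns at once):
df1['a'].groupby(df2['a']).transform('sum')
# 0    4
# 1    6
# 2    4
# 3    6
# Name: a, dtype: int64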
python pandas group-by pandas-groupby
can you elaborate on the desired output? not clear what the rule is
– Yuca
Jan 15 at 16:08
Sure, added a brief explanation. Let me know if still not clear
– yatu
Jan 15 at 16:09
What do you group by? Your output has the same number of rows and columns as the input.
– Zoe
Jan 15 at 16:11
So you are grouping rows 1 and 3 in df1 because rows 1 and 3 are grouped in df2, correct?
– Yuca
Jan 15 at 16:12
Yes, that is correct. The resulting df has the same shape as df1, with the sum of the grouped values
– yatu
Jan 15 at 16:12
5 Answers
You will have to group each column individually since each column uses a different grouping scheme.
If you want a cleaner version, I would recommend a list comprehension over the column names, and call pd.concat on the resultant series:
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
a b
0 4 11
1 6 11
2 4 15
3 6 15
Not to say there's anything wrong with using apply as in the other answer, just that I don't like apply, so this is my suggestion :-)
Here are some timeits for your perusal. Just for your sample data, you will notice the difference in timings is obvious.
%%timeit
(df1.stack()
.groupby([df2.stack().index.get_level_values(level=1), df2.stack()])
.transform('sum').unstack())
%%timeit
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
%%timeit
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
8.99 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.35 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.13 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Not to say apply is slow, but explicit iteration in this case is faster. Additionally, you will notice that the second and third timed solutions scale better as length grows relative to breadth, since the number of iterations depends on the number of columns.
@ScottBoston I have already upvoted your answer for its simplicity B)
– coldspeed
Jan 15 at 17:21
Thanks!! Yes, using a list comprehension with pd.concat was what I had in mind; I was curious to know whether looping could be avoided. Nice to see other alternatives here too though. And thanks for the timeits :)
– yatu
Jan 15 at 17:26
Try using apply to apply a lambda function to each column of your dataframe, then use the name of that pd.Series to group by the second dataframe:
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
Output:
a b
0 4 11
1 6 11
2 4 15
3 6 15
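(A minimal sketch of why x.name works here, not part of the original answer: apply hands each column to the lambda as a Series whose .name is that column's label, so df2[x.name] picks out the matching grouper column.)
col = df1['a']          # what the lambda receives for the first column
col.name                # 'a'
df2[col.name].tolist()  # ['A', 'B', 'A', 'B'], the grouper used for column 'a'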
Nice! Guessing it cannot be done directly using groupby rather than applying along columns, right? Nice alternative in any case
– yatu
Jan 15 at 16:22
No, I don't think you can apply two different groupings to a dataframe based on a column.
– Scott Boston
Jan 15 at 16:37
1
Ok, thanks. Will leave it for some time to see if I get any other answers, otherwise will accept
– yatu
Jan 15 at 16:38
Using stack and unstack:
df1.stack().groupby([df2.stack().index.get_level_values(level=1),df2.stack()]).transform('sum').unstack()
Out[291]:
a b
0 4 11
1 6 11
2 4 15
3 6 15
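(Sketch of the intermediates, not from the original answer: stack() turns each frame into a Series with a (row, column) MultiIndex, so the group keys are the column label (index level 1) paired with the stacked values of df2.)
s1 = df1.stack()
s2 = df2.stack()
s1.head(4)
# 0  a    1
#    b    5
# 1  a    2
#    b    6
# dtype: int64
s2.head(4)
# 0  a    A
#    b    A
# 1  a    B
#    b    A
# dtype: object
# Grouping s1 by [column label, df2 value], summing, and unstacking restores the 4x2 shape.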
Thanks @W-B interesting approach!!
– yatu
Jan 15 at 17:30
I'm going to propose a (mostly) numpythonic solution that uses a scipy.sparse matrix to perform a vectorized groupby on the entire DataFrame at once, rather than column by column.
The key to performing this operation efficiently is finding a performant way to factorize the entire DataFrame, while avoiding duplicates in any columns. Since your groups are represented by strings, you can simply concatenate the column name on the end of each value (since columns should be unique), and then factorize the result, like so [*]:
>>> df2 + df2.columns
a b
0 Aa Ab
1 Ba Ab
2 Aa Bb
3 Ba Bb
>>> pd.factorize((df2 + df2.columns).values.ravel())
(array([0, 1, 2, 1, 0, 3, 2, 3], dtype=int64),
array(['Aa', 'Ab', 'Ba', 'Bb'], dtype=object))
Once we have a unique grouping, we can utilize our scipy.sparse matrix to perform a groupby in a single pass on the flattened arrays, and use advanced indexing and a reshaping operation to convert the result back to the original shape.
from scipy import sparse
a = df1.values.ravel()
b, _ = pd.factorize((df2 + df2.columns).values.ravel())
o = sparse.csr_matrix(
    (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
).sum(0).A1
res = o[b].reshape(df1.shape)
array([[ 4, 11],
[ 6, 11],
[ 4, 15],
[ 6, 15]], dtype=int64)
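(For the sum case specifically, a hedged alternative of mine, not the answer's code: np.bincount can replace the sparse matrix, reusing the factorized codes b and the flattened values a from above.)
sums = np.bincount(b, weights=a)      # per-group totals, the same reduction the sparse matrix performs
res_alt = sums[b].reshape(df1.shape)  # broadcast back to each element and restore the shape (float dtype)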
Performance
Functions
def gp_chris(f1, f2):
    a = f1.values.ravel()
    b, _ = pd.factorize((f2 + f2.columns).values.ravel())
    o = sparse.csr_matrix(
        (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
    ).sum(0).A1
    return pd.DataFrame(o[b].reshape(f1.shape), columns=f1.columns)

def gp_cs(f1, f2):
    return pd.concat([f1[c].groupby(f2[c]).transform('sum') for c in f1.columns], axis=1)

def gp_scott(f1, f2):
    return f1.apply(lambda x: x.groupby(f2[x.name]).transform('sum'))

def gp_wen(f1, f2):
    return f1.stack().groupby([f2.stack().index.get_level_values(level=1), f2.stack()]).transform('sum').unstack()
Setup
import numpy as np
from scipy import sparse
import pandas as pd
import string
from timeit import timeit
import matplotlib.pyplot as plt
res = pd.DataFrame(
    index=[f'gp_{f}' for f in ('chris', 'cs', 'scott', 'wen')],
    columns=[10, 50, 100, 200, 400],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        df1 = pd.DataFrame(np.random.rand(c, c))
        df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (c, c)))
        df1.columns = df1.columns.astype(str)
        df2.columns = df2.columns.astype(str)
        stmt = '{}(df1, df2)'.format(f)
        setp = 'from __main__ import df1, df2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Results (log-log plot of relative runtime vs. N for the four functions)
Validation
df1 = pd.DataFrame(np.random.rand(10, 10))
df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (10, 10)))
df1.columns = df1.columns.astype(str)
df2.columns = df2.columns.astype(str)
v = np.stack([gp_chris(df1, df2), gp_cs(df1, df2), gp_scott(df1, df2), gp_wen(df1, df2)])
print(np.all(v[:-1] == v[1:]))
True
Either we're all wrong or we're all correct :)
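(One hedged caveat of my own: the random frames hold floats and the four methods may add them in different orders, so an approximate comparison is slightly more robust than exact equality.)
np.allclose(v[:-1], v[1:])  # tolerant of tiny floating-point round-off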
[*] There is a possibility that you could get a duplicate value here if one item is the concatenation of a column name and another item before concatenation occurs. However, if this is the case, you shouldn't need to adjust much to fix it; one option is sketched below.
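(A sketch of one such adjustment, offered as an assumption rather than the answer's code: join each value and its column name with a separator that cannot occur in the group labels before factorizing.)
keys = (df2.astype(str) + '\0' + df2.columns).values.ravel()  # '\0' assumed absent from the labels
b_safe, _ = pd.factorize(keys)  # collision-free codes, usable in place of b above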
Thanks for the final validation :-). This approach is pretty interesting and surprisingly fast compared to the others. Also, @W-B's approach is quite a bit faster than the others for larger numbers of columns
– yatu
Jan 15 at 20:45
Correct. Both Wen's and my approaches have more overhead at the start (stacking and reshaping), but since we perform our groupby operation a single time, it scales better.
– user3483203
Jan 15 at 20:48
I personally wouldn't change the accepted answer. This approach has a couple of caveats with factorizing the DataFrame, and it only works for certain operations on a DataFrame; for example, count wouldn't work. Coldspeed's is much more general.
– user3483203
Jan 15 at 20:50
Okay thanks for pointing out that it is not as general. Definitely worth having as a possible solution for whenever it can be applied and performance is an issue.
– yatu
Jan 15 at 20:55
You could do something like the following:
res = (
    df1.assign(a=lambda df: df['a'].groupby(df2['a']).transform('sum'))
       .assign(b=lambda df: df['b'].groupby(df2['b']).transform('sum'))
)
Results:
a b
0 4 11
1 6 11
2 4 15
3 6 15