How to count number of rows per group (and other statistics) in pandas group by?
I have a data frame df
and I use several columns from it to groupby
:
df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()
In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains number of rows in each group. In other words, I have mean but I also would like to know how many number were used to get these means. For example in the first group there are 8 values and in the second one 10 and so on.
python group-by pandas distinct
add a comment |
I have a data frame df
and I use several columns from it to groupby
:
df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()
In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains number of rows in each group. In other words, I have mean but I also would like to know how many number were used to get these means. For example in the first group there are 8 values and in the second one 10 and so on.
python group-by pandas distinct
add a comment |
I have a data frame df
and I use several columns from it to groupby
:
df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()
In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains number of rows in each group. In other words, I have mean but I also would like to know how many number were used to get these means. For example in the first group there are 8 values and in the second one 10 and so on.
python group-by pandas distinct
I have a data frame df
and I use several columns from it to groupby
:
df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()
In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains number of rows in each group. In other words, I have mean but I also would like to know how many number were used to get these means. For example in the first group there are 8 values and in the second one 10 and so on.
python group-by pandas distinct
python group-by pandas distinct
edited May 30 '18 at 6:37
Pedro M Duarte
11.8k43241
11.8k43241
asked Oct 15 '13 at 15:00
RomanRoman
28.6k125284377
28.6k125284377
add a comment |
add a comment |
3 Answers
3
active
oldest
votes
On groupby
object, the agg
function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
I think you need the column reference to be a list. Do you perhaps mean:df[['col1','col2','col3','col4']].groupby(['col1','col2']).agg(['mean', 'count'])
– rysqui
Dec 17 '14 at 6:14
27
This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.)
– Jaan
Jul 22 '15 at 6:58
10
Please see my answer if you want to get only onecount
column per group.
– Pedro M Duarte
Sep 26 '15 at 19:43
What if I have a separate called Counts and instead of count the rows of the grouped type, I need to add along the column Counts.
– Abhishek Bhatia
Oct 2 '17 at 21:28
add a comment |
Quick Answer:
The simplest way to get row counts per group is by calling .size()
, which returns a Series
:
df.groupby(['col1','col2']).size()
Usually you want this result as a DataFrame
(instead of a Series
) so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate the row counts and other statistics for each group continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
First let's use .size()
to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
Then let's use .size().reset_index(name='counts')
to get the row counts:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
...: .groupby(['col1', 'col2'])
...: .agg({
...: 'col3': ['mean', 'count'],
...: 'col4': ['median', 'min', 'count']
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join
. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
...: counts = gb.size().to_frame(name='counts')
...: (counts
...: .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
...: .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
...: .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['E', 'F'],
...: ['E', 'F'],
...: ['G', 'H']
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
...: )
...:
...: df[['col3', 'col4', 'col5', 'col6']] =
...: df[['col3', 'col4', 'col5', 'col6']].astype(float)
...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN
entries in the mean calculation without telling you about it.
1
Hey, I really like your solution, particularly the last, where you use method chaining. However, since it is often necessary, to apply different aggregation functions to different columns, one could also concat the resulting data frames using pd.concat. This maybe easier to read than subsqeuent chaining
– Quickbeam2k1
Aug 17 '16 at 11:26
4
nice solution,but forIn [5]: counts_df = pd.DataFrame(df.groupby('col1').size().rename('counts'))
, maybe it's better to set the size() as a new column if you'd like to manipulate the dataframe for further analysis,which should becounts_df = pd.DataFrame(df.groupby('col1').size().reset_index(name='counts')
– LancelotHolmes
Feb 28 '17 at 2:35
1
Thanks for the "Including results for more statistics" bit! Since my next search was about flattening the resulting multiindex on columns, I'll link to the answer here: stackoverflow.com/a/50558529/1026
– Nickolay
May 28 '18 at 8:17
Great! Could you please give me a hint how to addisnull
to this query to have it in one column as well?'col4': ['median', 'min', 'count', 'isnull']
– Peter.k
Jan 18 at 10:31
add a comment |
We can easily do it by using groupby and count. But, we should remember to use reset_index().
df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().
reset_index()
This solution works as long as there is no null value in the columns, otherwise it can be misleading (count will be lower than the actual number of observation by group).
– Adrien Pacifico
Jul 9 '18 at 0:59
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f19384532%2fhow-to-count-number-of-rows-per-group-and-other-statistics-in-pandas-group-by%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
On groupby
object, the agg
function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
I think you need the column reference to be a list. Do you perhaps mean:df[['col1','col2','col3','col4']].groupby(['col1','col2']).agg(['mean', 'count'])
– rysqui
Dec 17 '14 at 6:14
27
This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.)
– Jaan
Jul 22 '15 at 6:58
10
Please see my answer if you want to get only onecount
column per group.
– Pedro M Duarte
Sep 26 '15 at 19:43
What if I have a separate called Counts and instead of count the rows of the grouped type, I need to add along the column Counts.
– Abhishek Bhatia
Oct 2 '17 at 21:28
add a comment |
On groupby
object, the agg
function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
I think you need the column reference to be a list. Do you perhaps mean:df[['col1','col2','col3','col4']].groupby(['col1','col2']).agg(['mean', 'count'])
– rysqui
Dec 17 '14 at 6:14
27
This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.)
– Jaan
Jul 22 '15 at 6:58
10
Please see my answer if you want to get only onecount
column per group.
– Pedro M Duarte
Sep 26 '15 at 19:43
What if I have a separate called Counts and instead of count the rows of the grouped type, I need to add along the column Counts.
– Abhishek Bhatia
Oct 2 '17 at 21:28
add a comment |
On groupby
object, the agg
function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
On groupby
object, the agg
function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
edited May 17 '15 at 3:55


Alexander
55.8k1491127
55.8k1491127
answered Oct 15 '13 at 15:49
BoudBoud
19.3k63956
19.3k63956
I think you need the column reference to be a list. Do you perhaps mean:df[['col1','col2','col3','col4']].groupby(['col1','col2']).agg(['mean', 'count'])
– rysqui
Dec 17 '14 at 6:14
27
This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.)
– Jaan
Jul 22 '15 at 6:58
10
Please see my answer if you want to get only onecount
column per group.
– Pedro M Duarte
Sep 26 '15 at 19:43
What if I have a separate called Counts and instead of count the rows of the grouped type, I need to add along the column Counts.
– Abhishek Bhatia
Oct 2 '17 at 21:28
add a comment |
I think you need the column reference to be a list. Do you perhaps mean:df[['col1','col2','col3','col4']].groupby(['col1','col2']).agg(['mean', 'count'])
– rysqui
Dec 17 '14 at 6:14
27
This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.)
– Jaan
Jul 22 '15 at 6:58
10
Please see my answer if you want to get only onecount
column per group.
– Pedro M Duarte
Sep 26 '15 at 19:43
What if I have a separate called Counts and instead of count the rows of the grouped type, I need to add along the column Counts.
– Abhishek Bhatia
Oct 2 '17 at 21:28
I think you need the column reference to be a list. Do you perhaps mean:
df[['col1','col2','col3','col4']].groupby(['col1','col2']).agg(['mean', 'count'])
– rysqui
Dec 17 '14 at 6:14
I think you need the column reference to be a list. Do you perhaps mean:
df[['col1','col2','col3','col4']].groupby(['col1','col2']).agg(['mean', 'count'])
– rysqui
Dec 17 '14 at 6:14
27
27
This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.)
– Jaan
Jul 22 '15 at 6:58
This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.)
– Jaan
Jul 22 '15 at 6:58
10
10
Please see my answer if you want to get only one
count
column per group.– Pedro M Duarte
Sep 26 '15 at 19:43
Please see my answer if you want to get only one
count
column per group.– Pedro M Duarte
Sep 26 '15 at 19:43
What if I have a separate called Counts and instead of count the rows of the grouped type, I need to add along the column Counts.
– Abhishek Bhatia
Oct 2 '17 at 21:28
What if I have a separate called Counts and instead of count the rows of the grouped type, I need to add along the column Counts.
– Abhishek Bhatia
Oct 2 '17 at 21:28
add a comment |
Quick Answer:
The simplest way to get row counts per group is by calling .size()
, which returns a Series
:
df.groupby(['col1','col2']).size()
Usually you want this result as a DataFrame
(instead of a Series
) so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate the row counts and other statistics for each group continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
First let's use .size()
to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
Then let's use .size().reset_index(name='counts')
to get the row counts:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
...: .groupby(['col1', 'col2'])
...: .agg({
...: 'col3': ['mean', 'count'],
...: 'col4': ['median', 'min', 'count']
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join
. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
...: counts = gb.size().to_frame(name='counts')
...: (counts
...: .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
...: .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
...: .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['E', 'F'],
...: ['E', 'F'],
...: ['G', 'H']
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
...: )
...:
...: df[['col3', 'col4', 'col5', 'col6']] =
...: df[['col3', 'col4', 'col5', 'col6']].astype(float)
...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN
entries in the mean calculation without telling you about it.
1
Hey, I really like your solution, particularly the last, where you use method chaining. However, since it is often necessary, to apply different aggregation functions to different columns, one could also concat the resulting data frames using pd.concat. This maybe easier to read than subsqeuent chaining
– Quickbeam2k1
Aug 17 '16 at 11:26
4
nice solution,but forIn [5]: counts_df = pd.DataFrame(df.groupby('col1').size().rename('counts'))
, maybe it's better to set the size() as a new column if you'd like to manipulate the dataframe for further analysis,which should becounts_df = pd.DataFrame(df.groupby('col1').size().reset_index(name='counts')
– LancelotHolmes
Feb 28 '17 at 2:35
1
Thanks for the "Including results for more statistics" bit! Since my next search was about flattening the resulting multiindex on columns, I'll link to the answer here: stackoverflow.com/a/50558529/1026
– Nickolay
May 28 '18 at 8:17
Great! Could you please give me a hint how to addisnull
to this query to have it in one column as well?'col4': ['median', 'min', 'count', 'isnull']
– Peter.k
Jan 18 at 10:31
add a comment |
Quick Answer:
The simplest way to get row counts per group is by calling .size()
, which returns a Series
:
df.groupby(['col1','col2']).size()
Usually you want this result as a DataFrame
(instead of a Series
) so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate the row counts and other statistics for each group continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
First let's use .size()
to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
Then let's use .size().reset_index(name='counts')
to get the row counts:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
...: .groupby(['col1', 'col2'])
...: .agg({
...: 'col3': ['mean', 'count'],
...: 'col4': ['median', 'min', 'count']
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join
. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
...: counts = gb.size().to_frame(name='counts')
...: (counts
...: .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
...: .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
...: .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['E', 'F'],
...: ['E', 'F'],
...: ['G', 'H']
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
...: )
...:
...: df[['col3', 'col4', 'col5', 'col6']] =
...: df[['col3', 'col4', 'col5', 'col6']].astype(float)
...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN
entries in the mean calculation without telling you about it.
1
Hey, I really like your solution, particularly the last, where you use method chaining. However, since it is often necessary, to apply different aggregation functions to different columns, one could also concat the resulting data frames using pd.concat. This maybe easier to read than subsqeuent chaining
– Quickbeam2k1
Aug 17 '16 at 11:26
4
nice solution,but forIn [5]: counts_df = pd.DataFrame(df.groupby('col1').size().rename('counts'))
, maybe it's better to set the size() as a new column if you'd like to manipulate the dataframe for further analysis,which should becounts_df = pd.DataFrame(df.groupby('col1').size().reset_index(name='counts')
– LancelotHolmes
Feb 28 '17 at 2:35
1
Thanks for the "Including results for more statistics" bit! Since my next search was about flattening the resulting multiindex on columns, I'll link to the answer here: stackoverflow.com/a/50558529/1026
– Nickolay
May 28 '18 at 8:17
Great! Could you please give me a hint how to addisnull
to this query to have it in one column as well?'col4': ['median', 'min', 'count', 'isnull']
– Peter.k
Jan 18 at 10:31
add a comment |
Quick Answer:
The simplest way to get row counts per group is by calling .size()
, which returns a Series
:
df.groupby(['col1','col2']).size()
Usually you want this result as a DataFrame
(instead of a Series
) so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate the row counts and other statistics for each group continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
First let's use .size()
to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
Then let's use .size().reset_index(name='counts')
to get the row counts:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
...: .groupby(['col1', 'col2'])
...: .agg({
...: 'col3': ['mean', 'count'],
...: 'col4': ['median', 'min', 'count']
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join
. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
...: counts = gb.size().to_frame(name='counts')
...: (counts
...: .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
...: .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
...: .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['E', 'F'],
...: ['E', 'F'],
...: ['G', 'H']
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
...: )
...:
...: df[['col3', 'col4', 'col5', 'col6']] =
...: df[['col3', 'col4', 'col5', 'col6']].astype(float)
...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN
entries in the mean calculation without telling you about it.
Quick Answer:
The simplest way to get row counts per group is by calling .size()
, which returns a Series
:
df.groupby(['col1','col2']).size()
Usually you want this result as a DataFrame
(instead of a Series
) so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate the row counts and other statistics for each group continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
First let's use .size()
to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
Then let's use .size().reset_index(name='counts')
to get the row counts:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
...: .groupby(['col1', 'col2'])
...: .agg({
...: 'col3': ['mean', 'count'],
...: 'col4': ['median', 'min', 'count']
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join
. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
...: counts = gb.size().to_frame(name='counts')
...: (counts
...: .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
...: .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
...: .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['E', 'F'],
...: ['E', 'F'],
...: ['G', 'H']
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
...: )
...:
...: df[['col3', 'col4', 'col5', 'col6']] =
...: df[['col3', 'col4', 'col5', 'col6']].astype(float)
...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN
entries in the mean calculation without telling you about it.
edited May 30 '18 at 6:35
answered Sep 26 '15 at 19:34
Pedro M DuartePedro M Duarte
11.8k43241
11.8k43241
1
Hey, I really like your solution, particularly the last, where you use method chaining. However, since it is often necessary, to apply different aggregation functions to different columns, one could also concat the resulting data frames using pd.concat. This maybe easier to read than subsqeuent chaining
– Quickbeam2k1
Aug 17 '16 at 11:26
4
nice solution,but forIn [5]: counts_df = pd.DataFrame(df.groupby('col1').size().rename('counts'))
, maybe it's better to set the size() as a new column if you'd like to manipulate the dataframe for further analysis,which should becounts_df = pd.DataFrame(df.groupby('col1').size().reset_index(name='counts')
– LancelotHolmes
Feb 28 '17 at 2:35
1
Thanks for the "Including results for more statistics" bit! Since my next search was about flattening the resulting multiindex on columns, I'll link to the answer here: stackoverflow.com/a/50558529/1026
– Nickolay
May 28 '18 at 8:17
Great! Could you please give me a hint how to addisnull
to this query to have it in one column as well?'col4': ['median', 'min', 'count', 'isnull']
– Peter.k
Jan 18 at 10:31
add a comment |
1
Hey, I really like your solution, particularly the last, where you use method chaining. However, since it is often necessary, to apply different aggregation functions to different columns, one could also concat the resulting data frames using pd.concat. This maybe easier to read than subsqeuent chaining
– Quickbeam2k1
Aug 17 '16 at 11:26
4
nice solution,but forIn [5]: counts_df = pd.DataFrame(df.groupby('col1').size().rename('counts'))
, maybe it's better to set the size() as a new column if you'd like to manipulate the dataframe for further analysis,which should becounts_df = pd.DataFrame(df.groupby('col1').size().reset_index(name='counts')
– LancelotHolmes
Feb 28 '17 at 2:35
1
Thanks for the "Including results for more statistics" bit! Since my next search was about flattening the resulting multiindex on columns, I'll link to the answer here: stackoverflow.com/a/50558529/1026
– Nickolay
May 28 '18 at 8:17
Great! Could you please give me a hint how to addisnull
to this query to have it in one column as well?'col4': ['median', 'min', 'count', 'isnull']
– Peter.k
Jan 18 at 10:31
1
1
Hey, I really like your solution, particularly the last, where you use method chaining. However, since it is often necessary, to apply different aggregation functions to different columns, one could also concat the resulting data frames using pd.concat. This maybe easier to read than subsqeuent chaining
– Quickbeam2k1
Aug 17 '16 at 11:26
Hey, I really like your solution, particularly the last, where you use method chaining. However, since it is often necessary, to apply different aggregation functions to different columns, one could also concat the resulting data frames using pd.concat. This maybe easier to read than subsqeuent chaining
– Quickbeam2k1
Aug 17 '16 at 11:26
4
4
nice solution,but for
In [5]: counts_df = pd.DataFrame(df.groupby('col1').size().rename('counts'))
, maybe it's better to set the size() as a new column if you'd like to manipulate the dataframe for further analysis,which should be counts_df = pd.DataFrame(df.groupby('col1').size().reset_index(name='counts')
– LancelotHolmes
Feb 28 '17 at 2:35
nice solution,but for
In [5]: counts_df = pd.DataFrame(df.groupby('col1').size().rename('counts'))
, maybe it's better to set the size() as a new column if you'd like to manipulate the dataframe for further analysis,which should be counts_df = pd.DataFrame(df.groupby('col1').size().reset_index(name='counts')
– LancelotHolmes
Feb 28 '17 at 2:35
1
1
Thanks for the "Including results for more statistics" bit! Since my next search was about flattening the resulting multiindex on columns, I'll link to the answer here: stackoverflow.com/a/50558529/1026
– Nickolay
May 28 '18 at 8:17
Thanks for the "Including results for more statistics" bit! Since my next search was about flattening the resulting multiindex on columns, I'll link to the answer here: stackoverflow.com/a/50558529/1026
– Nickolay
May 28 '18 at 8:17
Great! Could you please give me a hint how to add
isnull
to this query to have it in one column as well? 'col4': ['median', 'min', 'count', 'isnull']
– Peter.k
Jan 18 at 10:31
Great! Could you please give me a hint how to add
isnull
to this query to have it in one column as well? 'col4': ['median', 'min', 'count', 'isnull']
– Peter.k
Jan 18 at 10:31
add a comment |
We can easily do it by using groupby and count. But, we should remember to use reset_index().
df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().
reset_index()
This solution works as long as there is no null value in the columns, otherwise it can be misleading (count will be lower than the actual number of observation by group).
– Adrien Pacifico
Jul 9 '18 at 0:59
add a comment |
We can easily do it by using groupby and count. But, we should remember to use reset_index().
df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().
reset_index()
This solution works as long as there is no null value in the columns, otherwise it can be misleading (count will be lower than the actual number of observation by group).
– Adrien Pacifico
Jul 9 '18 at 0:59
add a comment |
We can easily do it by using groupby and count. But, we should remember to use reset_index().
df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().
reset_index()
We can easily do it by using groupby and count. But, we should remember to use reset_index().
df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().
reset_index()
edited Nov 27 '17 at 18:17
answered Nov 27 '17 at 9:17


NimeshNimesh
5913
5913
This solution works as long as there is no null value in the columns, otherwise it can be misleading (count will be lower than the actual number of observation by group).
– Adrien Pacifico
Jul 9 '18 at 0:59
add a comment |
This solution works as long as there is no null value in the columns, otherwise it can be misleading (count will be lower than the actual number of observation by group).
– Adrien Pacifico
Jul 9 '18 at 0:59
This solution works as long as there is no null value in the columns, otherwise it can be misleading (count will be lower than the actual number of observation by group).
– Adrien Pacifico
Jul 9 '18 at 0:59
This solution works as long as there is no null value in the columns, otherwise it can be misleading (count will be lower than the actual number of observation by group).
– Adrien Pacifico
Jul 9 '18 at 0:59
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f19384532%2fhow-to-count-number-of-rows-per-group-and-other-statistics-in-pandas-group-by%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown