How to count number of rows per group (and other statistics) in pandas group by?












289 votes















I have a data frame df and I use several columns from it to groupby:


df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()


In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the means, but I would also like to know how many values were used to compute them. For example, the first group has 8 values, the second one 10, and so on.










python group-by pandas distinct

asked Oct 15 '13 at 15:00 by Roman · edited May 30 '18 at 6:37 by Pedro M Duarte

          3 Answers


















          301 votes














          On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:



          df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
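
          For illustration, here is a minimal, self-contained sketch (with made-up data, not taken from the question) showing the shape of that result: agg with a list produces a MultiIndex on the columns, one (column, statistic) pair per aggregation:

          import pandas as pd

          # Hypothetical toy data, only to show the shape of the agg output.
          df = pd.DataFrame({
              'col1': ['A', 'A', 'C'],
              'col2': ['B', 'B', 'D'],
              'col3': [0.2, -1.5, 0.1],
              'col4': [-0.6, -1.0, 0.6],
          })

          out = df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
          print(out)
          # Approximate output:
          #             col3        col4
          #             mean count  mean count
          # col1 col2
          # A    B     -0.65     2  -0.8     2
          # C    D      0.10     1   0.6     1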





          answered Oct 15 '13 at 15:49 by Boud · edited May 17 '15 at 3:55 by Alexander


























          • I think you need the column reference to be a list. Do you perhaps mean: df[['col1','col2','col3','col4']].groupby(['col1','col2']).agg(['mean', 'count']) – rysqui, Dec 17 '14 at 6:14








          • This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.) – Jaan, Jul 22 '15 at 6:58








          • Please see my answer if you want to get only one count column per group. – Pedro M Duarte, Sep 26 '15 at 19:43











          • What if I have a separate column called Counts, and instead of counting the rows of the grouped type, I need to sum along the column Counts? – Abhishek Bhatia, Oct 2 '17 at 21:28



















          675 votes














          Quick Answer:



          The simplest way to get row counts per group is by calling .size(), which returns a Series:



          df.groupby(['col1','col2']).size()




          Usually you want this result as a DataFrame (instead of a Series) so you can do:



          df.groupby(['col1', 'col2']).size().reset_index(name='counts')




          If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.
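
          As an aside (not from the original answer): on newer pandas versions (1.1 or later; a version assumption), DataFrame.value_counts offers an equivalent one-liner, using the df from the detailed example below:

          # Per-group row counts on pandas >= 1.1; assumes the example df defined below.
          # Note: unlike size(), value_counts sorts the groups by count by default.
          df.value_counts(subset=['col1', 'col2']).reset_index(name='counts')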





          Detailed example:



          Consider the following example dataframe:



          In [2]: df
          Out[2]:
            col1 col2  col3  col4  col5  col6
          0    A    B  0.20 -0.61 -0.49  1.49
          1    A    B -1.53 -1.01 -0.39  1.82
          2    A    B -0.44  0.27  0.72  0.11
          3    A    B  0.28 -1.32  0.38  0.18
          4    C    D  0.12  0.59  0.81  0.66
          5    C    D -0.13 -1.65 -1.64  0.50
          6    C    D -1.42 -0.11 -0.18 -0.44
          7    E    F -0.00  1.42 -0.26  1.17
          8    E    F  0.91 -0.47  1.35 -0.34
          9    G    H  1.48 -0.63 -1.14  0.17


          First let's use .size() to get the row counts:



          In [3]: df.groupby(['col1', 'col2']).size()
          Out[3]:
          col1  col2
          A     B       4
          C     D       3
          E     F       2
          G     H       1
          dtype: int64


          Then let's use .size().reset_index(name='counts') to get the counts as a DataFrame with a named column:



          In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
          Out[4]:
            col1 col2  counts
          0    A    B       4
          1    C    D       3
          2    E    F       2
          3    G    H       1





          Including results for more statistics



          When you want to calculate statistics on grouped data, it usually looks like this:



          In [5]: (df
             ...:  .groupby(['col1', 'col2'])
             ...:  .agg({
             ...:      'col3': ['mean', 'count'],
             ...:      'col4': ['median', 'min', 'count']
             ...:  }))
          Out[5]:
                      col4             col3
                    median   min count      mean count
          col1 col2
          A    B    -0.810 -1.32     4 -0.372500     4
          C    D    -0.110 -1.65     3 -0.476667     3
          E    F     0.475 -0.47     2  0.455000     2
          G    H    -0.630 -0.63     1  1.480000     1


          The result above is a little annoying to deal with because of the nested column labels, and also because the row counts are computed on a per-column basis.



          To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:



          In [6]: gb = df.groupby(['col1', 'col2'])
             ...: counts = gb.size().to_frame(name='counts')
             ...: (counts
             ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
             ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
             ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
             ...:  .reset_index()
             ...: )
          Out[6]:
            col1 col2  counts  col3_mean  col4_median  col4_min
          0    A    B       4  -0.372500       -0.810     -1.32
          1    C    D       3  -0.476667       -0.110     -1.65
          2    E    F       2   0.455000        0.475     -0.47
          3    G    H       1   1.480000       -0.630     -0.63







          Footnotes



          The code used to generate the test data is shown below:



          In [1]: import numpy as np
             ...: import pandas as pd
             ...:
             ...: keys = np.array([
             ...:     ['A', 'B'],
             ...:     ['A', 'B'],
             ...:     ['A', 'B'],
             ...:     ['A', 'B'],
             ...:     ['C', 'D'],
             ...:     ['C', 'D'],
             ...:     ['C', 'D'],
             ...:     ['E', 'F'],
             ...:     ['E', 'F'],
             ...:     ['G', 'H']
             ...: ])
             ...:
             ...: df = pd.DataFrame(
             ...:     np.hstack([keys, np.random.randn(10, 4).round(2)]),
             ...:     columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
             ...: )
             ...:
             ...: df[['col3', 'col4', 'col5', 'col6']] = \
             ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)



          Disclaimer:



          If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
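
          To make that caveat concrete, here is a minimal sketch (with hypothetical data, not from this answer) where the per-column count disagrees with the group size:

          import numpy as np
          import pandas as pd

          # Hypothetical data: one NaN in col3 within the single ('A', 'B') group.
          df = pd.DataFrame({
              'col1': ['A', 'A', 'A'],
              'col2': ['B', 'B', 'B'],
              'col3': [1.0, np.nan, 3.0],
          })

          gb = df.groupby(['col1', 'col2'])
          print(gb.size())            # 3 rows in the group
          print(gb['col3'].count())   # 2 non-null values
          print(gb['col3'].mean())    # (1.0 + 3.0) / 2 = 2.0; the NaN is silently dropped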






          answered Sep 26 '15 at 19:34 by Pedro M Duarte · edited May 30 '18 at 6:35





















          • Hey, I really like your solution, particularly the last one, where you use method chaining. However, since it is often necessary to apply different aggregation functions to different columns, one could also concat the resulting data frames using pd.concat. This may be easier to read than subsequent chaining. – Quickbeam2k1, Aug 17 '16 at 11:26
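
          A minimal sketch of that pd.concat alternative (an illustration following the comment, with made-up names and data):

          import pandas as pd

          # Hypothetical toy data.
          df = pd.DataFrame({
              'col1': ['A', 'A', 'C'],
              'col2': ['B', 'B', 'D'],
              'col3': [0.2, -1.5, 0.1],
              'col4': [-0.6, -1.0, 0.6],
          })

          gb = df.groupby(['col1', 'col2'])
          result = pd.concat(
              [
                  gb.size().rename('counts'),
                  gb['col3'].mean().rename('col3_mean'),
                  gb['col4'].median().rename('col4_median'),
              ],
              axis=1,  # align the pieces side by side on the shared group index
          ).reset_index()
          print(result)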






          • Nice solution, but for In [5]: counts_df = pd.DataFrame(df.groupby('col1').size().rename('counts')), maybe it's better to set the size() as a new column if you'd like to manipulate the dataframe for further analysis, which should be counts_df = pd.DataFrame(df.groupby('col1').size().reset_index(name='counts')) – LancelotHolmes, Feb 28 '17 at 2:35








          • Thanks for the "Including results for more statistics" bit! Since my next search was about flattening the resulting multiindex on columns, I'll link to the answer here: stackoverflow.com/a/50558529/1026 – Nickolay, May 28 '18 at 8:17











          • Great! Could you please give me a hint how to add isnull to this query to have it in one column as well? 'col4': ['median', 'min', 'count', 'isnull'] – Peter.k, Jan 18 at 10:31



















          5 votes














          We can easily get the per-group counts by using groupby and count. But we should remember to use reset_index().



          df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().reset_index()
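
          Note that count() counts non-null values per remaining column, so the result here has one count column for each of col3 and col4. A hedged sketch of trimming that down to a single counts column (column names taken from the question; assumes col3 has no nulls, see the comment below):

          out = (df[['col1', 'col2', 'col3', 'col4']]
                 .groupby(['col1', 'col2'])
                 .count()
                 .reset_index()
                 .rename(columns={'col3': 'counts'})[['col1', 'col2', 'counts']])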





          answered Nov 27 '17 at 9:17 by Nimesh · edited Nov 27 '17 at 18:17


























          • This solution works as long as there is no null value in the columns; otherwise it can be misleading (count will be lower than the actual number of observations per group). – Adrien Pacifico, Jul 9 '18 at 0:59










