How to count number of rows per group (and other statistics) in pandas group by?












289 votes















I have a data frame df and I use several columns from it to groupby:


df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()


In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the means, but I would also like to know how many values were used to compute them. For example, the first group has 8 values, the second one 10, and so on.










python group-by pandas distinct

asked Oct 15 '13 at 15:00 by Roman · edited May 30 '18 at 6:37 by Pedro M Duarte

          3 Answers


















          301 votes














          On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:



          df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
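
          For illustration, here is a minimal, self-contained sketch (with made-up data, not taken from the question) showing the shape of that result: agg with a list produces a MultiIndex on the columns, one (column, statistic) pair per aggregation:

          import pandas as pd

          # Hypothetical toy data, only to show the shape of the agg output.
          df = pd.DataFrame({
              'col1': ['A', 'A', 'C'],
              'col2': ['B', 'B', 'D'],
              'col3': [0.2, -1.5, 0.1],
              'col4': [-0.6, -1.0, 0.6],
          })

          out = df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
          print(out)
          # Approximate output:
          #             col3        col4
          #             mean count  mean count
          # col1 col2
          # A    B     -0.65     2  -0.8     2
          # C    D      0.10     1   0.6     1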





          answered Oct 15 '13 at 15:49 by Boud · edited May 17 '15 at 3:55 by Alexander


























          • I think you need the column reference to be a list. Do you perhaps mean: df[['col1','col2','col3','col4']].groupby(['col1','col2']).agg(['mean', 'count']) – rysqui, Dec 17 '14 at 6:14








          • This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.) – Jaan, Jul 22 '15 at 6:58








          • Please see my answer if you want to get only one count column per group. – Pedro M Duarte, Sep 26 '15 at 19:43











          • What if I have a separate column called Counts, and instead of counting the rows of the grouped type, I need to sum along the column Counts? – Abhishek Bhatia, Oct 2 '17 at 21:28



















          675 votes














          Quick Answer:



          The simplest way to get row counts per group is by calling .size(), which returns a Series:



          df.groupby(['col1','col2']).size()




          Usually you want this result as a DataFrame (instead of a Series) so you can do:



          df.groupby(['col1', 'col2']).size().reset_index(name='counts')




          If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.
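
          As an aside (not from the original answer): on newer pandas versions (1.1 or later; a version assumption), DataFrame.value_counts offers an equivalent one-liner, using the df from the detailed example below:

          # Per-group row counts on pandas >= 1.1; assumes the example df defined below.
          # Note: unlike size(), value_counts sorts the groups by count by default.
          df.value_counts(subset=['col1', 'col2']).reset_index(name='counts')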





          Detailed example:



          Consider the following example dataframe:



          In [2]: df
          Out[2]:
            col1 col2  col3  col4  col5  col6
          0    A    B  0.20 -0.61 -0.49  1.49
          1    A    B -1.53 -1.01 -0.39  1.82
          2    A    B -0.44  0.27  0.72  0.11
          3    A    B  0.28 -1.32  0.38  0.18
          4    C    D  0.12  0.59  0.81  0.66
          5    C    D -0.13 -1.65 -1.64  0.50
          6    C    D -1.42 -0.11 -0.18 -0.44
          7    E    F -0.00  1.42 -0.26  1.17
          8    E    F  0.91 -0.47  1.35 -0.34
          9    G    H  1.48 -0.63 -1.14  0.17


          First let's use .size() to get the row counts:



          In [3]: df.groupby(['col1', 'col2']).size()
          Out[3]:
          col1  col2
          A     B       4
          C     D       3
          E     F       2
          G     H       1
          dtype: int64


          Then let's use .size().reset_index(name='counts') to get the counts as a DataFrame with a named column:



          In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
          Out[4]:
            col1 col2  counts
          0    A    B       4
          1    C    D       3
          2    E    F       2
          3    G    H       1





          Including results for more statistics



          When you want to calculate statistics on grouped data, it usually looks like this:



          In [5]: (df
             ...:  .groupby(['col1', 'col2'])
             ...:  .agg({
             ...:      'col3': ['mean', 'count'],
             ...:      'col4': ['median', 'min', 'count']
             ...:  }))
          Out[5]:
                      col4             col3
                    median   min count      mean count
          col1 col2
          A    B    -0.810 -1.32     4 -0.372500     4
          C    D    -0.110 -1.65     3 -0.476667     3
          E    F     0.475 -0.47     2  0.455000     2
          G    H    -0.630 -0.63     1  1.480000     1


          The result above is a little annoying to deal with because of the nested column labels, and also because the row counts are computed on a per-column basis.



          To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:



          In [6]: gb = df.groupby(['col1', 'col2'])
             ...: counts = gb.size().to_frame(name='counts')
             ...: (counts
             ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
             ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
             ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
             ...:  .reset_index()
             ...: )
          Out[6]:
            col1 col2  counts  col3_mean  col4_median  col4_min
          0    A    B       4  -0.372500       -0.810     -1.32
          1    C    D       3  -0.476667       -0.110     -1.65
          2    E    F       2   0.455000        0.475     -0.47
          3    G    H       1   1.480000       -0.630     -0.63







          Footnotes



          The code used to generate the test data is shown below:



          In [1]: import numpy as np
             ...: import pandas as pd
             ...:
             ...: keys = np.array([
             ...:     ['A', 'B'],
             ...:     ['A', 'B'],
             ...:     ['A', 'B'],
             ...:     ['A', 'B'],
             ...:     ['C', 'D'],
             ...:     ['C', 'D'],
             ...:     ['C', 'D'],
             ...:     ['E', 'F'],
             ...:     ['E', 'F'],
             ...:     ['G', 'H']
             ...: ])
             ...:
             ...: df = pd.DataFrame(
             ...:     np.hstack([keys, np.random.randn(10, 4).round(2)]),
             ...:     columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
             ...: )
             ...:
             ...: df[['col3', 'col4', 'col5', 'col6']] = \
             ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)



          Disclaimer:



          If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
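
          To make that caveat concrete, here is a minimal sketch (with hypothetical data, not from this answer) where the per-column count disagrees with the group size:

          import numpy as np
          import pandas as pd

          # Hypothetical data: one NaN in col3 within the single ('A', 'B') group.
          df = pd.DataFrame({
              'col1': ['A', 'A', 'A'],
              'col2': ['B', 'B', 'B'],
              'col3': [1.0, np.nan, 3.0],
          })

          gb = df.groupby(['col1', 'col2'])
          print(gb.size())            # 3 rows in the group
          print(gb['col3'].count())   # 2 non-null values
          print(gb['col3'].mean())    # (1.0 + 3.0) / 2 = 2.0; the NaN is silently dropped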






          answered Sep 26 '15 at 19:34 by Pedro M Duarte · edited May 30 '18 at 6:35





















          • Hey, I really like your solution, particularly the last one, where you use method chaining. However, since it is often necessary to apply different aggregation functions to different columns, one could also concat the resulting data frames using pd.concat. This may be easier to read than subsequent chaining. – Quickbeam2k1, Aug 17 '16 at 11:26
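
          A minimal sketch of that pd.concat alternative (an illustration following the comment, with made-up names and data):

          import pandas as pd

          # Hypothetical toy data.
          df = pd.DataFrame({
              'col1': ['A', 'A', 'C'],
              'col2': ['B', 'B', 'D'],
              'col3': [0.2, -1.5, 0.1],
              'col4': [-0.6, -1.0, 0.6],
          })

          gb = df.groupby(['col1', 'col2'])
          result = pd.concat(
              [
                  gb.size().rename('counts'),
                  gb['col3'].mean().rename('col3_mean'),
                  gb['col4'].median().rename('col4_median'),
              ],
              axis=1,  # align the pieces side by side on the shared group index
          ).reset_index()
          print(result)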






          • Nice solution, but for In [5]: counts_df = pd.DataFrame(df.groupby('col1').size().rename('counts')), maybe it's better to set the size() as a new column if you'd like to manipulate the dataframe for further analysis, which should be counts_df = pd.DataFrame(df.groupby('col1').size().reset_index(name='counts')) – LancelotHolmes, Feb 28 '17 at 2:35








          • Thanks for the "Including results for more statistics" bit! Since my next search was about flattening the resulting multiindex on columns, I'll link to the answer here: stackoverflow.com/a/50558529/1026 – Nickolay, May 28 '18 at 8:17











          • Great! Could you please give me a hint how to add isnull to this query to have it in one column as well? 'col4': ['median', 'min', 'count', 'isnull'] – Peter.k, Jan 18 at 10:31



















          5 votes














          We can easily get the per-group counts by using groupby and count. But we should remember to use reset_index().



          df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().reset_index()
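
          Note that count() counts non-null values per remaining column, so the result here has one count column for each of col3 and col4. A hedged sketch of trimming that down to a single counts column (column names taken from the question; assumes col3 has no nulls, see the comment below):

          out = (df[['col1', 'col2', 'col3', 'col4']]
                 .groupby(['col1', 'col2'])
                 .count()
                 .reset_index()
                 .rename(columns={'col3': 'counts'})[['col1', 'col2', 'counts']])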





          answered Nov 27 '17 at 9:17 by Nimesh · edited Nov 27 '17 at 18:17


























          • This solution works as long as there is no null value in the columns; otherwise it can be misleading (count will be lower than the actual number of observations per group). – Adrien Pacifico, Jul 9 '18 at 0:59










