User Tools

Site Tools


pandas_groupby

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
pandas_groupby [2021/02/10 18:10] – [apply a function on each group] rajupandas_groupby [2024/03/26 22:25] (current) – [groupby slicing] raju
Line 1: Line 1:
 ==== preserve the highest value entries in each group ==== ==== preserve the highest value entries in each group ====
 tags | filter by value tags | filter by value
 +
 Given Given
 <code> <code>
Line 74: Line 75:
  
 Ref:- https://stackoverflow.com/questions/15705630/get-the-rows-which-have-the-max-value-in-groups-using-groupby Ref:- https://stackoverflow.com/questions/15705630/get-the-rows-which-have-the-max-value-in-groups-using-groupby
 +
 +==== preserve the highest odd value in each group ====
 +tags | pandas groupby transform maximum odd number, maxodd
 +
 +Given
 +<code>
 +     Sp  Mt Value  count
 +0   MM1  S1          1
 +1   MM1  S1          2
 +2   MM1  S1    cb      3
 +3   MM2  S2    mk      1
 +4   MM2  S2    bg      2
 +5   MM3  S3   dgd      2
 +6   MM3  S3    rd      3
 +7   MM4  S4    cb      1
 +8   MM4  S4   uyi      3
 +9   MM5  S5          1
 +10  MM6  S6    ea      2
 +11  MM7  S7          3
 +</code>
 +
 +We want
 +<code>
 +     Sp  Mt Value  count
 +2   MM1  S1    cb      3
 +3   MM2  S2    mk      1
 +6   MM3  S3    rd      3
 +8   MM4  S4   uyi      3
 +9   MM5  S5          1
 +11  MM7  S7          3
 +</code>
 +That is get all the rows with highest odd 'count' for each ['Sp', 'Mt'] combination.
 +If there is a group with only even 'count' values, discard it.
 +
 +Solution
 +<code>
 +In [1]:
 +import pandas as pd
 +df = pd.DataFrame({'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM3', 'MM3',
 +                          'MM4', 'MM4', 'MM5', 'MM6', 'MM7'],
 +                   'Mt': ['S1', 'S1', 'S1', 'S2', 'S2', 'S3', 'S3',
 +                          'S4', 'S4', 'S5', 'S6', 'S7'],
 +                   'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd',
 +                             'cb', 'uyi', 'w', 'ea', 't'],
 +                   'count': [1, 2, 3, 1, 2, 2, 3, 1, 3, 1, 2, 3]})
 +df
 +Out[1]:
 +     Sp  Mt Value  count
 +0   MM1  S1          1
 +1   MM1  S1          2
 +2   MM1  S1    cb      3
 +3   MM2  S2    mk      1
 +4   MM2  S2    bg      2
 +5   MM3  S3   dgd      2
 +6   MM3  S3    rd      3
 +7   MM4  S4    cb      1
 +8   MM4  S4   uyi      3
 +9   MM5  S5          1
 +10  MM6  S6    ea      2
 +11  MM7  S7          3
 +
 +In [2]:
 +def max_odd(s):
 +    value = s.loc[s % 2 == 1].max()
 +    return value
 +
 +
 +In [3]:
 +idx = df.groupby(['Sp', 'Mt'])['count'].transform(max_odd) == df['count']
 +df[idx]
 +Out[3]:
 +     Sp  Mt Value  count
 +2   MM1  S1    cb      3
 +3   MM2  S2    mk      1
 +6   MM3  S3    rd      3
 +8   MM4  S4   uyi      3
 +9   MM5  S5          1
 +11  MM7  S7          3
 +</code>
 +
 +Breakdown of how it works:
 +<code>
 +In [4]:
 +df.groupby(['Sp', 'Mt'])['count'].transform(max_odd)
 +Out[4]:
 +0     3.0
 +1     3.0
 +2     3.0
 +3     1.0
 +4     1.0
 +5     3.0
 +6     3.0
 +7     3.0
 +8     3.0
 +9     1.0
 +10    NaN
 +11    3.0
 +Name: count, dtype: float64
 +
 +In [5]:
 +idx = df.groupby(['Sp', 'Mt'])['count'].transform(max_odd) == df['count']
 +idx
 +Out[5]:
 +0     False
 +1     False
 +2      True
 +3      True
 +4     False
 +5     False
 +6      True
 +7     False
 +8      True
 +9      True
 +10    False
 +11     True
 +Name: count, dtype: bool
 +</code>
 ==== level ==== ==== level ====
 If a dataframe has multiple indices but you need to groupby on only of them, use level. So, level=0 groups it on the first index, level=1 on the second index, level=-1 on the last index etc., If a dataframe has multiple indices but you need to groupby on only of them, use level. So, level=0 groups it on the first index, level=1 on the second index, level=-1 on the last index etc.,
Line 119: Line 237:
  
 ==== extract groupby object by key ==== ==== extract groupby object by key ====
 +tags | pandas groupby filter a group
 +
   * groups.get_group(key_value) if grouping on a single column   * groups.get_group(key_value) if grouping on a single column
   * groups.get_group(key_value_tuple) if grouping on multiple columns.   * groups.get_group(key_value_tuple) if grouping on multiple columns.
Line 199: Line 319:
 5  bar  0  6 5  bar  0  6
 </code> </code>
 +
 +==== groupby slicing ====
 +Consider
 +<code>
 +In [1]: 
 +import pandas as pd
 +import numpy as np
 +rand = np.random.RandomState(1)
 +df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
 +                   'B': rand.randn(6),
 +                   'C': rand.randint(0, 20, 6)})
 +
 +In [2]: 
 +df
 +Out[2]: 
 +               C
 +0  foo  1.624345   5
 +1  bar -0.611756  18
 +2  foo -0.528172  11
 +3  bar -1.072969  10
 +4  foo  0.865408  14
 +5  bar -2.301539  18
 +</code>
 +
 +Group by on column 'A'
 +<code>
 +In [3]: 
 +gb = df.groupby(['A'])
 +</code>
 +
 +You can use get_group() to get a single group
 +<code>
 +In [4]: 
 +gb.get_group('foo')
 +Out[4]: 
 +               C
 +0  foo  1.624345   5
 +2  foo -0.528172  11
 +4  foo  0.865408  14
 +</code>
 +
 +You can select different columns using the groupby slicing:
 +<code>
 +In [5]: 
 +gb[['A', 'B']].get_group('foo')
 +Out[5]: 
 +             B
 +0  foo  1.624345
 +2  foo -0.528172
 +4  foo  0.865408
 +
 +In [6]: 
 +gb[['C']].get_group('foo')
 +Out[6]: 
 +    C
 +0   5
 +2  11
 +4  14
 +</code>
 +
 +Ref:
 +  * https://stackoverflow.com/questions/14734533/how-to-access-subdataframes-of-pandas-groupby-by-key
  
 ==== apply a function on each group ==== ==== apply a function on each group ====
Line 244: Line 426:
 </code> </code>
  
-tags | reset_index remove level_1 column, apply function to multiple columns and rename result, groupby apply name the result+tags | reset_index remove level_1 column, apply function to multiple columns and rename result, groupby apply name the result, groupby apply remove level_1 
pandas_groupby.1612980645.txt.gz · Last modified: 2021/02/10 18:10 by raju