--- pandas_groupby [2021/07/06 22:50] – admin
+++ pandas_groupby [2024/03/26 22:25] – [groupby slicing] raju
@@ Line 1: / Line 1: @@
 ==== preserve the highest value entries in each group ====
 tags | filter by value
 Given
 <code>
@@ Line 74: / Line 75: @@
 Ref:- https://stackoverflow.com/questions/15705630/get-the-rows-which-have-the-max-value-in-groups-using-groupby
+==== preserve the highest odd value in each group ====
+tags | pandas groupby transform maximum odd number, maxodd
+Given
+<code>
+     Sp  Mt Value  count
+0   MM1  S1     a      1
+1   MM1  S1     n      2
+2   MM1  S1    cb      3
+3   MM2  S2    mk      1
+4   MM2  S2    bg      2
+5   MM3  S3   dgd      2
+6   MM3  S3    rd      3
+7   MM4  S4    cb      1
+8   MM4  S4   uyi      3
+9   MM5  S5     w      1
+10  MM6  S6    ea      2
+11  MM7  S7     t      3
+</code>
+We want
+<code>
+     Sp  Mt Value  count
+2   MM1  S1    cb      3
+3   MM2  S2    mk      1
+6   MM3  S3    rd      3
+8   MM4  S4   uyi      3
+9   MM5  S5     w      1
+11  MM7  S7     t      3
+</code>
+That is, get all the rows with the highest odd 'count' for each ['Sp', 'Mt'] combination.
+If a group has only even 'count' values (MM6 here), discard it.
+Solution
+<code>
+In [1]:
+import pandas as pd
+df = pd.DataFrame({'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM3', 'MM3',
+                          'MM4', 'MM4', 'MM5', 'MM6', 'MM7'],
+                   'Mt': ['S1', 'S1', 'S1', 'S2', 'S2', 'S3', 'S3',
+                          'S4', 'S4', 'S5', 'S6', 'S7'],
+                   'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd',
+                             'cb', 'uyi', 'w', 'ea', 't'],
+                   'count': [1, 2, 3, 1, 2, 2, 3, 1, 3, 1, 2, 3]})
+df
+Out[1]:
+     Sp  Mt Value  count
+0   MM1  S1     a      1
+1   MM1  S1     n      2
+2   MM1  S1    cb      3
+3   MM2  S2    mk      1
+4   MM2  S2    bg      2
+5   MM3  S3   dgd      2
+6   MM3  S3    rd      3
+7   MM4  S4    cb      1
+8   MM4  S4   uyi      3
+9   MM5  S5     w      1
+10  MM6  S6    ea      2
+11  MM7  S7     t      3
+In [2]:
+def max_odd(s):
+    # Largest odd value in the group; NaN if the group has no odd values
+    return s.loc[s % 2 == 1].max()
+In [3]:
+idx = df.groupby(['Sp', 'Mt'])['count'].transform(max_odd) == df['count']
+df[idx]
+Out[3]:
+     Sp  Mt Value  count
+2   MM1  S1    cb      3
+3   MM2  S2    mk      1
+6   MM3  S3    rd      3
+8   MM4  S4   uyi      3
+9   MM5  S5     w      1
+11  MM7  S7     t      3
+</code>
+Breakdown of how it works:
+<code>
+In [4]:
+df.groupby(['Sp', 'Mt'])['count'].transform(max_odd)
+Out[4]:
+0     3.0
+1     3.0
+2     3.0
+3     1.0
+4     1.0
+5     3.0
+6     3.0
+7     3.0
+8     3.0
+9     1.0
+10    NaN
+11    3.0
+Name: count, dtype: float64
+In [5]:
+idx = df.groupby(['Sp', 'Mt'])['count'].transform(max_odd) == df['count']
+idx
+Out[5]:
+0     False
+1     False
+2      True
+3      True
+4     False
+5     False
+6      True
+7     False
+8      True
+9      True
+10    False
+11     True
+Name: count, dtype: bool
+</code>
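The all-even group (MM6) drops out because max_odd returns NaN for it, and NaN never compares equal to any 'count'. A minimal sketch of that one step, using a small hypothetical frame (the column name 'key' and the values are illustrative assumptions, not the data above):

```python
import pandas as pd

# Hypothetical frame: group 'x' has odd counts, group 'y' has only even counts.
small = pd.DataFrame({'key': ['x', 'x', 'y', 'y'],
                      'count': [1, 3, 2, 4]})


def max_odd(s):
    # Largest odd value in the group; NaN if the group has no odd values.
    return s.loc[s % 2 == 1].max()


# Per-row group maximum of the odd values (NaN for every row of group 'y').
group_max = small.groupby('key')['count'].transform(max_odd)

# NaN == anything is False, so every row of group 'y' is masked out.
mask = group_max == small['count']
kept = small[mask]
```

Only the row with count 3 in group 'x' should survive; no separate step is needed to discard the all-even group.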
 ==== level ====
 If a dataframe has multiple index levels but you need to group by only one of them, use level. So, level=0 groups on the first index level, level=1 on the second, level=-1 on the last, etc.
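A minimal sketch of grouping by index level, assuming a hypothetical two-level index (the level names 'outer' and 'inner', the column 'val' and the values are illustrative only):

```python
import pandas as pd

# Hypothetical frame with a two-level index.
index = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2)], names=['outer', 'inner'])
frame = pd.DataFrame({'val': [10, 20, 30, 40]}, index=index)

# level=0 groups on the first index level ('outer').
by_first = frame.groupby(level=0)['val'].sum()

# level=-1 groups on the last index level ('inner').
by_last = frame.groupby(level=-1)['val'].sum()
```

With only two levels, level=1 and level=-1 refer to the same level here.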
@@ Line 119: / Line 237: @@
 ==== extract groupby object by key ====
+tags | pandas groupby filter a group
   * groups.get_group(key_value) if grouping on a single column
   * groups.get_group(key_value_tuple) if grouping on multiple columns.
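The two cases above can be sketched as follows, assuming a small hypothetical frame (column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical frame with two candidate key columns.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['x', 'x', 'y', 'y'],
                   'C': [1, 2, 3, 4]})

# Grouping on a single column: pass the key value itself.
single = df.groupby('A').get_group('foo')

# Grouping on multiple columns: pass a tuple with one value per key column.
multi = df.groupby(['A', 'B']).get_group(('foo', 'y'))
```

The returned frames keep the original row labels, so the rows can still be located in df afterwards.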
@@ Line 199: / Line 319: @@
   bar  0  6
 </code>
+==== groupby slicing ====
+Consider
+<code>
+In [1]:
+import pandas as pd
+import numpy as np
+rand = np.random.RandomState(1)
+df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
+                   'B': rand.randn(6),
+                   'C': rand.randint(0, 20, 6)})
+In [2]:
+df
+Out[2]:
+     A         B   C
+0  foo  1.624345   5
+1  bar -0.611756  18
+2  foo -0.528172  11
+3  bar -1.072969  10
+4  foo  0.865408  14
+5  bar -2.301539  18
+</code>
+Group by column 'A':
+<code>
+In [3]:
+gb = df.groupby('A')  # scalar key, so get_group() takes a plain string
+</code>
+You can use get_group() to get a single group:
+<code>
+In [4]:
+gb.get_group('foo')
+Out[4]:
+     A         B   C
+0  foo  1.624345   5
+2  foo -0.528172  11
+4  foo  0.865408  14
+</code>
+You can select different columns using the groupby slicing:
+<code>
+In [5]:
+gb[['A', 'B']].get_group('foo')
+Out[5]:
+     A         B
+0  foo  1.624345
+2  foo -0.528172
+4  foo  0.865408
+In [6]:
+gb[['C']].get_group('foo')
+Out[6]:
+    C
+0   5
+2  11
+4  14
+</code>
+Ref:
+  * https://stackoverflow.com/questions/14734533/how-to-access-subdataframes-of-pandas-groupby-by-key
 ==== apply a function on each group ====
@@ Line 245: / Line 427: @@
 tags | reset_index remove level_1 column, apply function to multiple columns and rename result, groupby apply name the result, groupby apply remove level_1
-==== preserve the highest odd value in each group ====