==== preserve the highest value entries in each group ====
tags | filter by value
Given
Sp Mt Value count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 7
8 MM4 S2 uyi 7
Get all the rows with the highest 'count' for each ['Sp', 'Mt'] combination. That is, we want
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
7 MM4 S2 cb 7
8 MM4 S2 uyi 7
Solution
In [1]:
import pandas as pd
df = pd.DataFrame({'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
'count': [3, 2, 5, 8, 10, 1, 2, 7, 7]})
df
Out[1]:
Sp Mt Value count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 7
8 MM4 S2 uyi 7
In [2]:
df.groupby(['Sp', 'Mt'])['count'].transform('max')
Out[2]:
0 3
1 3
2 5
3 8
4 10
5 10
6 7
7 7
8 7
Name: count, dtype: int64
In [3]:
idx = df.groupby(['Sp', 'Mt'])['count'].transform('max') == df['count']
df[idx]
Out[3]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
7 MM4 S2 cb 7
8 MM4 S2 uyi 7
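A sketch of two other common approaches (on a trimmed-down frame) and how they differ on ties: idxmax keeps exactly one row per group, while the transform trick above keeps all tied rows.

```python
import pandas as pd

df = pd.DataFrame({'Sp': ['MM1', 'MM1', 'MM4', 'MM4'],
                   'Mt': ['S1', 'S1', 'S2', 'S2'],
                   'Value': ['a', 'n', 'cb', 'uyi'],
                   'count': [3, 2, 7, 7]})

# idxmax returns one index label per group, so ties (cb/uyi) are broken
one_per_group = df.loc[df.groupby(['Sp', 'Mt'])['count'].idxmax()]

# transform('max') keeps every row that matches the group maximum
all_max = df[df.groupby(['Sp', 'Mt'])['count'].transform('max') == df['count']]
```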
Ref:- https://stackoverflow.com/questions/15705630/get-the-rows-which-have-the-max-value-in-groups-using-groupby
==== preserve the highest odd value in each group ====
tags | pandas groupby transform maximum odd number, maxodd
Given
Sp Mt Value count
0 MM1 S1 a 1
1 MM1 S1 n 2
2 MM1 S1 cb 3
3 MM2 S2 mk 1
4 MM2 S2 bg 2
5 MM3 S3 dgd 2
6 MM3 S3 rd 3
7 MM4 S4 cb 1
8 MM4 S4 uyi 3
9 MM5 S5 w 1
10 MM6 S6 ea 2
11 MM7 S7 t 3
We want
Sp Mt Value count
2 MM1 S1 cb 3
3 MM2 S2 mk 1
6 MM3 S3 rd 3
8 MM4 S4 uyi 3
9 MM5 S5 w 1
11 MM7 S7 t 3
That is, get all the rows with the highest odd 'count' for each ['Sp', 'Mt'] combination.
If a group has only even 'count' values, discard it entirely.
Solution
In [1]:
import pandas as pd
df = pd.DataFrame({'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM3', 'MM3',
'MM4', 'MM4', 'MM5', 'MM6', 'MM7'],
'Mt': ['S1', 'S1', 'S1', 'S2', 'S2', 'S3', 'S3',
'S4', 'S4', 'S5', 'S6', 'S7'],
'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd',
'cb', 'uyi', 'w', 'ea', 't'],
'count': [1, 2, 3, 1, 2, 2, 3, 1, 3, 1, 2, 3]})
df
Out[1]:
Sp Mt Value count
0 MM1 S1 a 1
1 MM1 S1 n 2
2 MM1 S1 cb 3
3 MM2 S2 mk 1
4 MM2 S2 bg 2
5 MM3 S3 dgd 2
6 MM3 S3 rd 3
7 MM4 S4 cb 1
8 MM4 S4 uyi 3
9 MM5 S5 w 1
10 MM6 S6 ea 2
11 MM7 S7 t 3
In [2]:
def max_odd(s):
    value = s.loc[s % 2 == 1].max()
    return value
In [3]:
idx = df.groupby(['Sp', 'Mt'])['count'].transform(max_odd) == df['count']
df[idx]
Out[3]:
Sp Mt Value count
2 MM1 S1 cb 3
3 MM2 S2 mk 1
6 MM3 S3 rd 3
8 MM4 S4 uyi 3
9 MM5 S5 w 1
11 MM7 S7 t 3
Breakdown of how it works:
In [4]:
df.groupby(['Sp', 'Mt'])['count'].transform(max_odd)
Out[4]:
0 3.0
1 3.0
2 3.0
3 1.0
4 1.0
5 3.0
6 3.0
7 3.0
8 3.0
9 1.0
10 NaN
11 3.0
Name: count, dtype: float64
In [5]:
idx = df.groupby(['Sp', 'Mt'])['count'].transform(max_odd) == df['count']
idx
Out[5]:
0 False
1 False
2 True
3 True
4 False
5 False
6 True
7 False
8 True
9 True
10 False
11 True
Name: count, dtype: bool
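Why the even-only group (index 10) drops out: max() over an empty selection is NaN, and NaN never compares equal to anything, so the `==` test is False for every row of that group. A minimal check:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 4])                       # a group with only even values
odd_max = s.loc[s % 2 == 1].max()           # max of an empty selection
assert np.isnan(odd_max)                    # -> NaN
assert not (odd_max == 2) and not (odd_max == 4)  # NaN == anything is False
```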
==== level ====
If a dataframe has multiple indices but you need to groupby on only one of them, use level. So level=0 groups on the first index, level=1 on the second index, level=-1 on the last index, etc.
tags | pandas groupby level=0
$ ipython
Python 3.6.12 |Anaconda, Inc.| (default, Sep 9 2020, 00:29:25) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
import pandas as pd
df = pd.DataFrame([{'year':2000,'team':'A','game':4,'goals':2},
{'year':2000,'team':'B','game':5,'goals':4},
{'year':2001,'team':'B','game':6,'goals':3}
]).set_index(['year', 'team'])
df
Out[1]:
game goals
year team
2000 A 4 2
B 5 4
2001 B 6 3
In [2]:
df.groupby(level=0).size()
Out[2]:
year
2000 2
2001 1
dtype: int64
In [3]:
df.groupby(level=1).size()
Out[3]:
team
A 1
B 2
dtype: int64
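When the index levels are named (as they are here), passing the level name to groupby should be equivalent to the positional level argument; a quick sketch:

```python
import pandas as pd

df = pd.DataFrame([{'year': 2000, 'team': 'A', 'game': 4, 'goals': 2},
                   {'year': 2000, 'team': 'B', 'game': 5, 'goals': 4},
                   {'year': 2001, 'team': 'B', 'game': 6, 'goals': 3}
                   ]).set_index(['year', 'team'])

# groupby(level=0) and groupby('year') give the same result for a named level
by_level = df.groupby(level=0).size()
by_name = df.groupby('year').size()
```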
Ref: https://stackoverflow.com/questions/49859182/understanding-level-0-and-group-keys
==== extract groupby object by key ====
tags | pandas groupby filter a group
* groups.get_group(key_value) if grouping on a single column
* groups.get_group(key_value_tuple) if grouping on multiple columns.
In [1]:
import pandas as pd
df = pd.DataFrame({'a': ['foo', 'bar'] * 4 + ['foo'],
'b': [1, 1, 1, 0, 0, 0, 0, 2, 2],
'c': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df
Out[1]:
a b c
0 foo 1 1
1 bar 1 2
2 foo 1 3
3 bar 0 4
4 foo 0 5
5 bar 0 6
6 foo 0 7
7 bar 2 8
8 foo 2 9
In [2]:
grp1 = df.groupby(['b'])
In [3]:
grp1.get_group(0)
Out[3]:
a b c
3 bar 0 4
4 foo 0 5
5 bar 0 6
6 foo 0 7
In [4]:
grp1.get_group(1)
Out[4]:
a b c
0 foo 1 1
1 bar 1 2
2 foo 1 3
In [5]:
grp2 = df.groupby(['a'])
In [6]:
grp2.get_group('foo')
Out[6]:
a b c
0 foo 1 1
2 foo 1 3
4 foo 0 5
6 foo 0 7
8 foo 2 9
In [7]:
grp2.get_group('bar')
Out[7]:
a b c
1 bar 1 2
3 bar 0 4
5 bar 0 6
7 bar 2 8
In [8]:
grp3 = df.groupby(['a', 'b'])
In [9]:
grp3.get_group(('foo',1))
Out[9]:
a b c
0 foo 1 1
2 foo 1 3
In [10]:
grp3.get_group(('bar',0))
Out[10]:
a b c
3 bar 0 4
5 bar 0 6
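Besides get_group(), you can iterate over the groupby object directly; each iteration yields a (key, sub-dataframe) pair, with tuple keys when grouping on multiple columns. A sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 4 + ['foo'],
                   'b': [1, 1, 1, 0, 0, 0, 0, 2, 2],
                   'c': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# Collect the size of each ('a', 'b') group by iterating over the groups
sizes = {key: len(group) for key, group in df.groupby(['a', 'b'])}
```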
==== groupby slicing ====
Consider
In [1]:
import pandas as pd
import numpy as np
rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
'B': rand.randn(6),
'C': rand.randint(0, 20, 6)})
In [2]:
df
Out[2]:
A B C
0 foo 1.624345 5
1 bar -0.611756 18
2 foo -0.528172 11
3 bar -1.072969 10
4 foo 0.865408 14
5 bar -2.301539 18
Group by on column 'A'
In [3]:
gb = df.groupby(['A'])
You can use get_group() to get a single group
In [4]:
gb.get_group('foo')
Out[4]:
A B C
0 foo 1.624345 5
2 foo -0.528172 11
4 foo 0.865408 14
You can select different columns using groupby slicing:
In [5]:
gb[['A', 'B']].get_group('foo')
Out[5]:
A B
0 foo 1.624345
2 foo -0.528172
4 foo 0.865408
In [6]:
gb[['C']].get_group('foo')
Out[6]:
C
0 5
2 11
4 14
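Note that slicing with single brackets instead of double brackets gives a SeriesGroupBy, so get_group() returns a Series rather than a one-column DataFrame. A sketch (grouping on the scalar 'A' here rather than the list ['A']):

```python
import numpy as np
import pandas as pd

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': rand.randn(6),
                   'C': rand.randint(0, 20, 6)})
gb = df.groupby('A')

# gb['C'] is a SeriesGroupBy, so get_group returns a Series, not a DataFrame
c_foo = gb['C'].get_group('foo')
```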
Ref:
* https://stackoverflow.com/questions/14734533/how-to-access-subdataframes-of-pandas-groupby-by-key
==== apply a function on each group ====
Note:- think of a better function for black_ops. The current function does not really require a groupby; the same column operations can be done directly on the top-level dataframe.
In [1]:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
'a': ['foo', 'foo', 'bar', 'bar', 'baz', 'baz', 'qux', 'qux'],
'b': np.random.randn(8),
'c': np.random.randn(8),
'd': np.random.rand(8)})
df
Out[1]:
a b c d
0 foo 0.496714 -0.469474 0.611853
1 foo -0.138264 0.542560 0.139494
2 bar 0.647689 -0.463418 0.292145
3 bar 1.523030 -0.465730 0.366362
4 baz -0.234153 0.241962 0.456070
5 baz -0.234137 -1.913280 0.785176
6 qux 1.579213 -1.724918 0.199674
7 qux 0.767435 -0.562288 0.514234
In [2]:
def black_ops(df):
    return df['b'] + df['c'] + df['d']
In [3]:
df.groupby('a').apply(black_ops).reset_index(name='blackops').drop('level_1', axis=1)
Out[3]:
a blackops
0 bar 0.476415
1 bar 1.423662
2 baz 0.463879
3 baz -1.362241
4 foo 0.639093
5 foo 0.543790
6 qux 0.053969
7 qux 0.719382
tags | reset_index remove level_1 column, apply function to multiple columns and rename result, groupby apply name the result, groupby apply remove level_1
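One candidate for a "better function" (per the note above) is a within-group demean: subtracting the group mean genuinely depends on which group each row belongs to, so it cannot be done as a plain column operation on the top-level dataframe. A sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'a': ['foo', 'foo', 'bar', 'bar'],
                   'b': np.random.randn(4)})

# Subtract each group's mean from its members; the result aligns with df's
# index, so it can be assigned back as a new column
def demean(s):
    return s - s.mean()

df['b_demeaned'] = df.groupby('a')['b'].transform(demean)
```

By construction, the demeaned values sum to (approximately) zero within each group.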