get_the_first_non_null_value_in_each_column
This is an old revision of the document!
Table of Contents
Get the first non null value in each column
Task
Get the first non null value in each column
Corner cases:
- If a column is all NaNs, return a NaN.
For example, given
jim joe jolie jack 0 1.0 NaN NaN 0 NaN 2.0 NaN 1 3.0 NaN NaN 1 NaN 4.0 NaN
We want
jim joe jolie jack 0 1.0 2.0 NaN 1 3.0 4.0 NaN
Solution
$ ipython In [1]: import pandas as pd import numpy as np df = pd.DataFrame({'jim': [0, 0, 1, 1], 'joe': [1, np.nan, 3, np.nan], 'jolie': [np.nan, 2, np.nan, 4], 'jack': [np.nan, np.nan, np.nan, np.nan]}) df Out[1]: jim joe jolie jack 0 0 1.0 NaN NaN 1 0 NaN 2.0 NaN 2 1 3.0 NaN NaN 3 1 NaN 4.0 NaN In [2]: def get_first_non_nan(s): values = s.loc[~s.isnull()] value = values.iloc[0] if not values.empty else np.nan return value In [3]: df.groupby('jim').agg(get_first_non_nan) Out[3]: joe jolie jack jim 0 1.0 2.0 NaN 1 3.0 4.0 NaN In [4]: df.groupby('jim').agg(get_first_non_nan).reset_index() Out[4]: jim joe jolie jack 0 0 1.0 2.0 NaN 1 1 3.0 4.0 NaN
meta
used | Python 3.9.4 and IPython 7.22.0
demonstrates | apply a function on each column of a dataframe after doing a groupby
Get the first value not equal to a number
Use case
One downside of using np.nan to denote missing values is that an integer column of a dataframe gets “promoted” to a floating point column even if there is a single np.nan in it.
A work around is to use a specific integer to denote missing values. For example if we expect all integers to be positive, we can use -9999 to denote a missing value. Let's call this special integer NAN_INT.
Task
Get the first value not equal to NAN_INT
Corner cases:
- If a column is all NAN_INTs, return a NAN_INT
For example, give
jim joe jolie jack 0 1 -9999 -9999 0 -9999 2 -9999 1 3 -9999 -9999 1 -9999 4 -9999
We want
jim joe jolie jack 0 1 2 -9999 1 3 4 -9999
Solution
$ ipython Python 3.9.4 (default, Apr 9 2021, 11:43:21) [MSC v.1916 64 bit (AMD64)] Type 'copyright', 'credits' or 'license' for more information IPython 7.22.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: import pandas as pd import numpy as np NAN_INT = -9999 df = pd.DataFrame({'jim': [0, 0, 1, 1], 'joe': [1, NAN_INT, 3, NAN_INT], 'jolie': [NAN_INT, 2, NAN_INT, 4], 'jack': [NAN_INT, NAN_INT, NAN_INT, NAN_INT]}) df Out[1]: jim joe jolie jack 0 0 1 -9999 -9999 1 0 -9999 2 -9999 2 1 3 -9999 -9999 3 1 -9999 4 -9999 In [2]: df.dtypes Out[2]: jim int64 joe int64 jolie int64 jack int64 dtype: object In [3]: def get_first_non_missing(s): NAN_INT = -9999 values = s.loc[(s != NAN_INT)] value = values.iloc[0] if not values.empty else NAN_INT return value In [4]: df.groupby('jim').agg(get_first_non_missing) Out[4]: joe jolie jack jim 0 1 2 -9999 1 3 4 -9999 In [5]: df.groupby('jim').agg(get_first_non_missing).reset_index() Out[5]: jim joe jolie jack 0 0 1 2 -9999 1 1 3 4 -9999
get_the_first_non_null_value_in_each_column.1631743102.txt.gz · Last modified: 2021/09/15 21:58 by raju