User Tools

Site Tools


get_the_first_non_null_value_in_each_column

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
get_the_first_non_null_value_in_each_column [2021/09/15 21:28] – [meta] rajuget_the_first_non_null_value_in_each_column [2021/09/15 22:03] (current) – [Use case] raju
Line 1: Line 1:
 ===== Get the first non null value in each column ===== ===== Get the first non null value in each column =====
 ==== Task ==== ==== Task ====
-Get the first non null value in each column+Get the first non null value in each column.
  
 Corner cases: Corner cases:
Line 66: Line 66:
  
 ==== meta ==== ==== meta ====
-used | Python 3.9.4 and IPython 7.22.0+Used | Python 3.9.4 and IPython 7.22.0
  
 demonstrates | apply a function on each column of a dataframe after doing a groupby demonstrates | apply a function on each column of a dataframe after doing a groupby
 +
 +===== Get the first value not equal to a number =====
 +==== Use case ====
 +One downside of using np.nan to denote a missing value is that an integer column of a
 +dataframe gets "promoted" to a floating point column even if there is a single np.nan in it.
 +
 +One work around is to use a specific integer to denote a missing value. For example, if we expect
 +all integers to be positive, we can use -9999 to denote a missing value. Let's call this
 +special integer NAN_INT.
 +
 +==== Task ====
 +Get the first value not equal to NAN_INT
 +
 +Corner cases:
 +  * If a column is all NAN_INTs, return a NAN_INT
 +
 +For example, give
 +<code>
 +   jim   joe  jolie  jack
 +          -9999 -9999
 +     0 -9999      2 -9999
 +          -9999 -9999
 +     1 -9999      4 -9999
 +</code>
 +We want
 +<code>
 +   jim  joe  jolie  jack
 +        1      2 -9999
 +        3      4 -9999
 +</code>
 +==== Solution ====
 +<code>
 +$ ipython
 +
 +In [1]:
 +import pandas as pd
 +import numpy as np
 +NAN_INT = -9999
 +df = pd.DataFrame({'jim': [0, 0, 1, 1],
 +                   'joe': [1, NAN_INT, 3, NAN_INT],
 +                   'jolie': [NAN_INT, 2, NAN_INT, 4],
 +                   'jack': [NAN_INT, NAN_INT, NAN_INT, NAN_INT]})
 +df
 +Out[1]:
 +   jim   joe  jolie  jack
 +0    0      -9999 -9999
 +1    0 -9999      2 -9999
 +2    1      -9999 -9999
 +3    1 -9999      4 -9999
 +
 +In [2]:
 +df.dtypes
 +Out[2]:
 +jim      int64
 +joe      int64
 +jolie    int64
 +jack     int64
 +dtype: object
 +
 +In [3]:
 +def get_first_non_missing(s):
 +    NAN_INT = -9999
 +    values = s.loc[(s != NAN_INT)]
 +    value = values.iloc[0] if not values.empty else NAN_INT
 +    return value
 +
 +
 +In [4]:
 +df.groupby('jim').agg(get_first_non_missing)
 +Out[4]:
 +     joe  jolie  jack
 +jim
 +0      1      2 -9999
 +1      3      4 -9999
 +
 +In [5]:
 +df.groupby('jim').agg(get_first_non_missing).reset_index()
 +Out[5]:
 +   jim  joe  jolie  jack
 +0    0    1      2 -9999
 +1    1    3      4 -9999
 +</code>
 +==== meta ====
 +Used | Python 3.9.4, IPython 7.22.0
 +
get_the_first_non_null_value_in_each_column.1631741321.txt.gz · Last modified: 2021/09/15 21:28 by raju