User Tools

Site Tools


pandas_dataframe

Tasks

dummy

convert all categorical columns to string columns

df.loc[:, df.dtypes == "category"] = df.select_dtypes(
    ["category"]
    ).apply(lambda x: x.astype("object"))

Ref:- https://stackoverflow.com/a/56944992

sum if columns exist

tags | sum if column exists

$ ipython
In [1]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID1':[15.3, 12.1, 13.2, 10.0, np.nan],
                   'ID2':[7.0, 7.7, np.nan, 11.3, np.nan],
                   'ID5':[10, 15, 3.1, 2.2, np.nan]})
df
Out[1]:
    ID1   ID2   ID5
0  15.3   7.0  10.0
1  12.1   7.7  15.0
2  13.2   NaN   3.1
3  10.0  11.3   2.2
4   NaN   NaN   NaN

In [2]:
List_ID = ['ID1','ID2','ID3']
cols = df.columns[df.columns.isin(List_ID)]
cols
Out[2]:
Index(['ID1', 'ID2'], dtype='object')

In [3]:
res = df[cols].sum(axis=1)
res
Out[3]:
0    22.3
1    19.8
2    13.2
3    21.3
4     0.0
dtype: float64

Using Python 3.9.4, IPython 7.22.0, pandas 1.2.4, numpy 1.20.1

Ref:- https://stackoverflow.com/questions/38700848/adding-columns-if-they-exist-in-the-dataframe-pandas

get unique values in a series as another series

  • Series.unique() returns a numpy.ndarray
  • Series.drop_duplicates() returns a Series
    • Series.drop_duplicates(keep='first') retains the first occurrence of any duplicates, keep='last' retains the last occurrence, and keep=False retains NONE of the duplicates. keep='first' is the default.
In [1]:
import pandas as pd
df = pd.DataFrame({
    'a': [1,2,3,3],
    'b': ['foo', 'foo', 'bar', 'bar']})
df
Out[1]:
   a    b
0  1  foo
1  2  foo
2  3  bar
3  3  bar

In [2]:
df['a'].unique()
Out[2]:
array([1, 2, 3], dtype=int64)

In [3]:
type(df['a'].unique())
Out[3]:
numpy.ndarray

In [4]:
df['a'].drop_duplicates()
Out[4]:
0    1
1    2
2    3
Name: a, dtype: int64

In [5]:
type(df['a'].drop_duplicates())
Out[5]:
pandas.core.series.Series

In [6]:
df['a'].drop_duplicates(keep='last')
Out[6]:
0    1
1    2
3    3
Name: a, dtype: int64

In [7]:
type(df['a'].drop_duplicates(keep='last'))
Out[7]:
pandas.core.series.Series

Ref:- https://github.com/pandas-dev/pandas/issues/1923#issuecomment-398217427

clean up column names

# Strip out the white space from both ends of the column names
df.columns = df.columns.str.strip()
# Remove white space from both ends of the column names
# -> convert them to lower case
# -> replace space with an underscore
# -> remove open and close parenthesis
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')

round a single column

How to round a single column in pandas without affecting other columns? For example, given

df:

  item  value1  value2
0    a    1.12     1.3
1    a    1.50     2.5
2    a    0.10     0.0
3    b    3.30    -1.0
4    b    4.80    -1.0

How to get

  item  value1  value2
0    a     1.0     1.3
1    a     2.0     2.5
2    a     0.0     0.0
3    b     3.0    -1.0
4    b     5.0    -1.0

Solution:

df['value1'] = df['value1'].round()

Details:

In [1]: 
import pandas as pd
df = pd.DataFrame({'item': ['a', 'a', 'a', 'b', 'b'],
                   'value1': [1.12, 1.50, 0.10, 3.30, 4.80],
                   'value2': [1.3, 2.5, 0.0, -1.0, -1.0]})
df
Out[1]: 
  item  value1  value2
0    a    1.12     1.3
1    a    1.50     2.5
2    a    0.10     0.0
3    b    3.30    -1.0
4    b    4.80    -1.0

In [2]: 
df['value1'] = df['value1'].round()
df
Out[2]: 
  item  value1  value2
0    a     1.0     1.3
1    a     2.0     2.5
2    a     0.0     0.0
3    b     3.0    -1.0
4    b     5.0    -1.0

relative percentage difference

Code:

% cat rel_pct_diff.py
import pandas as pd
import numpy as np
print('Using pandas', pd.__version__, ', numpy', np.__version__)
df = pd.DataFrame({'old': [2, 1, 0, 5, 0], 'new': [2.1, 1.1, 0.1, 4.9, 0]})
print('orignal df')
print(df)
df['rel_pct_diff1'] = ((df['new'] / df['old']) - 1) * 100
df['rel_pct_diff2'] = ((df['new'].divide(
    df['old'].where(df['old'] != 0, np.nan))) - 1) * 100
print(df)

Run:

% python ./rel_pct_diff.py
Using pandas 1.1.3 , numpy 1.19.1
orignal df
   old  new
0    2  2.1
1    1  1.1
2    0  0.1
3    5  4.9
4    0  0.0
   old  new  rel_pct_diff1  rel_pct_diff2
0    2  2.1            5.0            5.0
1    1  1.1           10.0           10.0
2    0  0.1            inf            NaN
3    5  4.9           -2.0           -2.0
4    0  0.0            NaN            NaN

how to do XYZ

map one column to another

% ipython
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: mapping = pd.DataFrame({'day': ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'], 'daynum': range(1, 8)})
   ...: print(mapping)
   day  daynum
0  Sun       1
1  Mon       2
2  Tue       3
3  Wed       4
4  Thu       5
5  Fri       6
6  Sat       7

In [2]: df = pd.DataFrame({'x': [5, 2, 7, 0, 1, 8], 'y': [59, 63, 66, 61, 46, 66]})
   ...: print(df)
   x   y
0  5  59
1  2  63
2  7  66
3  0  61
4  1  46
5  8  66

In [3]: df['z'] = df['x'].map(mapping.set_index('daynum')['day'])
   ...: print(df)
   x   y    z
0  5  59  Thu
1  2  63  Mon
2  7  66  Sat
3  0  61  NaN
4  1  46  Sun
5  8  66  NaN

See also: https://stackoverflow.com/questions/46049658/mapping-columns-from-one-dataframe-to-another-to-create-a-new-column - shows other possible solutions

start column numbers from 1

By default, column numbers start from 0.

% ipython
Python 3.8.3 (default, Jul  2 2020, 16:21:59)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: import numpy as np
   ...: n = np.ones(10).reshape(2,5)
   ...: print(n)
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]

In [2]: df = pd.DataFrame(n)
   ...: print(df)
     0    1    2    3    4
0  1.0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0  1.0

To change them to start from 1

In [3]: df.columns = range(1, df.shape[1]+1)
   ...: print(df)
     1    2    3    4    5
0  1.0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0  1.0

where df.shape[1] gives the number of columns in a dataframe.

In [4]: print(df.shape[1])
5

read everything as string

convert all columns in a dataframe to string

df.astype(str)

Example:

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': [648, 435], 'b': [175, 389]})
   ...: print(df)
     a    b
0  648  175
1  435  389

In [2]: df.dtypes
Out[2]:
a    int64
b    int64
dtype: object

In [3]: df2 = df.astype(str)
   ...: df2.dtypes
Out[3]:
a    object
b    object
dtype: object

In [4]: print(df2)
     a    b
0  648  175
1  435  389

dollars two decimal places

Use float_format=“%.2f”

df.to_csv(file_path, float_format="%.2f")

convert two columns of dataframe into a dictionary

df.set_index('id')['value'].to_dict()

Example

In [1]:
import pandas as pd
df = pd.DataFrame({'id': [0, 1, 2], 'value': [1.7, 5.2, 4.4]})
df
Out[1]:
   id  value
0   0    1.7
1   1    5.2
2   2    4.4

In [2]:
df.set_index('id')['value'].to_dict()
Out[2]:
{0: 1.7, 1: 5.2, 2: 4.4}

Ref:- https://stackoverflow.com/questions/18695605/python-pandas-dataframe-to-dictionary

number of days between two YYYYMMDD string columns

tags | number of days between two dates, convert YYYYMMDD to date

a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
df['days'] = (a_dt - b_dt).dt.days

See also:

Select columns from a dataframe only if they exist

df[df.columns.intersection(set(['list', 'of', 'cols']))]

Example:

$ ipython
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
import pandas as pd
df = pd.DataFrame(columns=[1,2,3,4])
df
Out[1]:
Empty DataFrame
Columns: [1, 2, 3, 4]
Index: []

In [2]:
df[df.columns.intersection(set([1, 2, 2, 5]))]
Out[2]:
Empty DataFrame
Columns: [1, 2]
Index: []

In [3]:
pd.__version__
Out[3]:
'1.2.1'

See also:- https://stackoverflow.com/questions/43537166/select-columns-from-dataframe-on-condition-they-exist

migrate sql case when else statements to build a new column

initialize column with strings that range from 1 to N

To set it to '1', '2', …, 'N'

df['foo'] = [str(x) for x in range(1, 1 + df.shape[0])]

To set it to 'A_1', 'A_2', …, 'A_N'

df['foo'] = ['A_' + str(x) for x in range(1, 1 + df.shape[0])]

Missing values

number of missing values in a dataframe

  • df.isna().sum().sum() - total number of missing values in the entire dataframe.
  • df.isna().sum() - number of missing values per column
  • df.isna() - boolean dataframe with True if value is missing, false otherwise.
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [3, None, 5, 7], 'B': [np.nan, 5, None, 4]})
df
Out[1]:
     A    B
0  3.0  NaN
1  NaN  5.0
2  5.0  NaN
3  7.0  4.0

In [2]:
df.isna().sum().sum()
Out[2]:
3

In [3]:
df.isna().sum()
Out[3]:
A    1
B    2
dtype: int64

In [4]:
df.isna()
Out[4]:
       A      B
0  False   True
1   True  False
2  False   True
3  False  False

fill missing values using a lookup table

Q. Fill the NaNs in

                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor                 NaN
4         Media Manager I                 NaN

using the lookup table

                 title       industry
0  Executive Secretary         Retail
1        Senior Editor  Manufacturing
2      Media Manager I         Health

so that we get

                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor       Manufacturing
4         Media Manager I              Health

Ans: Get the missing locations using isna() and then use map with set_index.

% ipython
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: 
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'title': ['Executive Secretary', 'Administrative Officer',
              'Recruiting Manager', 'Senior Editor', 'Media Manager I'],
    'industry': ['Health', 'Financial Services', 'Property', np.nan, np.nan]})
df
Out[1]: 
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor                 NaN
4         Media Manager I                 NaN

In [2]: 
lookup = pd.DataFrame({
    'title': ['Executive Secretary', 'Senior Editor', 'Media Manager I'],
    'industry': ['Retail', 'Manufacturing', 'Health']})
lookup
Out[2]: 
                 title       industry
0  Executive Secretary         Retail
1        Senior Editor  Manufacturing
2      Media Manager I         Health

In [3]: 
missing = df['industry'].isna()

In [4]: 
df.loc[missing, 'industry'] = df.loc[missing, 'title'].map(
    lookup.set_index('title')['industry'])
df
Out[4]: 
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor       Manufacturing
4         Media Manager I              Health

Ref:- https://stackoverflow.com/questions/64438066/how-can-i-fillna-based-on-the-columns-from-another-dataframe/ - original question. Here, I changed the column names for brevity.

Sample usage

read_csv usage

  • Read a csv file that has no headers and interpret all columns as strings
pd.read_csv(input_file, dtype=str, sep='|', header=None)
  • drop duplicate lines
pd.read_csv(input_file).drop_duplicates()

using astype

df = df.astype({'col_a': np.int64,
                'col_b': np.int64})
df['col_a'] = df['col_a'].astype(np.int64)

stackoverflow questions I came across

normalize data within each group

find many to one mappings

<place holder>

pivot and sum

convert pivot table output to dataframe

Sample code

table = df[cols_of_interest].pivot_table(values=['amount'], index=['foo', 'bar'],
                                         columns=['class_name'], aggfunc=np.sum)
table.columns = table.columns.droplevel(0)    # removes the values from level 0
table.columns.name = None                     # removes the columns as the column name
table = table.reset_index()                   # changes index to columns

Ref:-

normalize data

How to convert

$ cat data.csv
node_id,sim_id,portfolio_risk_total
11,1,10
11,2,20
11,3,30
11,4,40
22,1,100
22,2,200
22,3,300
22,4,400
33,1,1000
33,2,2000
33,3,3000
33,4,4000

to

$ cat out.csv
sim_id,11,22,33
1,10,100,1000
2,20,200,2000
3,30,300,3000
4,40,400,4000

Answer:

The technical term for this is ‘pivoting’. You are converting “stacked” or “record” format data into a normalized form. Below is how to do it. For more complex and related operations, see https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html .

In [1]:
import pandas as pd
df = pd.read_csv('data.csv')
df
Out[1]:
        node_id  sim_id  portfolio_risk_total
0            11       1                    10
1            11       2                    20
2            11       3                    30
3            11       4                    40
4            22       1                   100
5            22       2                   200
6            22       3                   300
7            22       4                   400
8            33       1                  1000
9            33       2                  2000
10           33       3                  3000
11           33       4                  4000

In [2]:
table = df.pivot(index='sim_id', columns='node_id', values='portfolio_risk_total')
table
Out[2]:
node_id      11   22    33
sim_id
1            10  100  1000
2            20  200  2000
3            30  300  3000
4            40  400  4000

In [3]:
table.columns.name = None
table = table.reset_index()
table
Out[3]:
   sim_id  11   22    33
0       1  10  100  1000
1       2  20  200  2000
2       3  30  300  3000
3       4  40  400  4000

In [4]:
table.to_csv('out.csv', index=False)

dummy

user guide

url title notes
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html Reshaping and pivot tables tags: stacked or record format, denormalized to normalized data
pandas_dataframe.txt · Last modified: 2022/03/18 23:29 by raju