===== Creating a dataframe =====

==== Create a dataframe from list of lists ====

  >>> import pandas as pd
  >>> data = [[0, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
  >>> df = pd.DataFrame(data)
  >>> df
     0  1  2  3
  0  0  1  0  1
  1  1  0  1  1
  2  0  1  1  1

You can also specify a list of columns for the dataframe.

  >>> columns = ['a', 'b', 'c', 'd']
  >>> df = pd.DataFrame(data, columns=columns)
  >>> df
     a  b  c  d
  0  0  1  0  1
  1  1  0  1  1
  2  0  1  1  1

To specify the index

  >>> df = pd.DataFrame([[1, 2, 3, 4, 15, 6], ['1', '2', '3', '4', 'F']], index=['CT', 'NY'])
  >>> df
      0  1  2  3   4    5
  CT  1  2  3  4  15  6.0
  NY  1  2  3  4   F  NaN

To make each list into a column, use zip

  % ipython
  Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
  IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          # create lists
          l1 = ["Amar", "Barsha", "Carlos", "Tanmay", "Misbah"]
          l2 = ["Alpha", "Bravo", "Charlie", "Tango", "Mike"]
          l3 = [23, 25, 22, 27, 29]
          l4 = [69, 54, 73, 70, 74]
          # create the dataframe
          team = pd.DataFrame(list(zip(l1, l2, l3, l4)),
                              columns=['Name', 'Code', 'Age', 'Weight'])
          print(team)
       Name     Code  Age  Weight
  0    Amar    Alpha   23      69
  1  Barsha    Bravo   25      54
  2  Carlos  Charlie   22      73
  3  Tanmay    Tango   27      70
  4  Misbah     Mike   29      74

Ref:-
  * https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea on zip from here.

tags | row by row

==== Create a dataframe by splitting strings ====

Given a list of strings, the idea here is to create a dataframe by splitting them into multiple columns.

Load the data into a pandas series

  % ipython
  Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
  IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          strings = ['Netflix 100', 'Costco 200', 'Walmart 500', 'Costco 500']
          s = pd.Series(strings)
          print(s)
  0    Netflix 100
  1     Costco 200
  2    Walmart 500
  3     Costco 500
  dtype: object

Create a dataframe by splitting the strings into multiple columns

  In [2]: df = s.str.split(expand=True)
          print(df)
           0    1
  0  Netflix  100
  1   Costco  200
  2  Walmart  500
  3   Costco  500

Name the columns. str.split gives everything as strings. Change them to numbers as needed.

  In [3]: df.columns = ['company', 'value']
          df['value'] = df['value'].astype('float')
          print(df)
     company  value
  0  Netflix  100.0
  1   Costco  200.0
  2  Walmart  500.0
  3   Costco  500.0

See also:
  * I used this trick to answer a question on stackoverflow - https://stackoverflow.com/a/75212176/6305733
  * https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html
  * https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html

also demonstrates | assign column names to a dataframe

==== Create a dataframe from a series of lists ====

tags | convert series with lists to dataframe

  df = pd.DataFrame(s.to_list())

For example

  In [1]: import pandas as pd
          s = pd.Series([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
          s
  Out[1]:
  0       [1, 2, 3]
  1    [4, 5, 6, 7]
  2          [8, 9]
  dtype: object

  In [2]: df = pd.DataFrame(s.to_list())
          df
  Out[2]:
     0  1    2    3
  0  1  2  3.0  NaN
  1  4  5  6.0  7.0
  2  8  9  NaN  NaN

If every list has the same number of elements, np.vstack() can also be used; otherwise it will fail.
For example

  In [5]: s
  Out[5]:
  0       [1, 2, 3]
  1    [4, 5, 6, 7]
  2          [8, 9]
  dtype: object

  In [6]: import numpy as np
          df = pd.DataFrame(np.vstack(s))
  ---------------------------------------------------------------------------
  ValueError                                Traceback (most recent call last)
  Cell In[6], line 2
        1 import numpy as np
  ----> 2 df = pd.DataFrame(np.vstack(s))

  File <__array_function__ internals>:200, in vstack(*args, **kwargs)

  File ~\AppData\Local\conda\conda\envs\py311\Lib\site-packages\numpy\core\shape_base.py:296, in vstack(tup, dtype, casting)
      294 if not isinstance(arrs, list):
      295     arrs = [arrs]
  --> 296 return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)

  File <__array_function__ internals>:200, in concatenate(*args, **kwargs)

  ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 4

But with lists of equal length, both approaches give the same result:

  In [10]: s = pd.Series([[1, 2, 3], [4, 5, 6]])
           s
  Out[10]:
  0    [1, 2, 3]
  1    [4, 5, 6]
  dtype: object

  In [11]: import numpy as np
           df = pd.DataFrame(np.vstack(s))
           df
  Out[11]:
     0  1  2
  0  1  2  3
  1  4  5  6

  In [12]: df = pd.DataFrame(s.to_list())
           df
  Out[12]:
     0  1  2
  0  1  2  3
  1  4  5  6

See also:
  * https://stackoverflow.com/questions/45901018/convert-pandas-series-of-lists-to-dataframe

==== Create a dataframe from a bunch of variables ====

  import pandas as pd
  df = pd.DataFrame({
      'key': ['var1', 'var2', 'var3'],
      'value': [var1, var2, var3]
  })

For example

  $ ipython
  In [1]: year = 2023; month = 6; date = 15

  In [2]: import pandas as pd
          df = pd.DataFrame({
              'key': ['year', 'month', 'date'],
              'value': [year, month, date]
          })

  In [3]: df
  Out[3]:
       key  value
  0   year   2023
  1  month      6
  2   date     15

  In [4]: df.dtypes
  Out[4]:
  key      object
  value     int64
  dtype: object

It works even if the variables are not of the same type.

  In [5]: year = 2023; month = 'June'; date = 15

  In [6]: df = pd.DataFrame({
              'key': ['year', 'month', 'date'],
              'value': [year, month, date]
          })

  In [7]: df
  Out[7]:
       key value
  0   year  2023
  1  month  June
  2   date    15

  In [8]: df.dtypes
  Out[8]:
  key      object
  value    object
  dtype: object

Tested with Python 3.11.3, IPython 8.12.0

===== selection related =====

==== split columns ====

  * [[https://nbviewer.jupyter.org/github/KamarajuKusumanchi/notebooks/blob/master/pandas/Separate%20positive%20and%20negative%20values.ipynb | Separate positive and negative values (nbviewer.jupyter.org/github/KamarajuKusumanchi)]]
    * description | split a column into two columns based on whether the values are positive or negative, default value, longs and shorts; a minimal sketch of the idea is below
  * [[https://nbviewer.jupyter.org/github/KamarajuKusumanchi/notebooks/blob/master/pandas/dataframe%20split%20column%20string%20into%20multiple%20columns.ipynb | Split string column into multiple columns (nbviewer.jupyter.org/github/KamarajuKusumanchi)]]
    * tags | uses [[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html | pandas.Series.str.split]]
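Here is a minimal sketch of the positive/negative split, assuming made-up column names foo/longs/shorts with 0 as the default value; the notebook above has the full treatment.

  import pandas as pd

  # 'foo', 'longs', 'shorts' are made-up names for illustration
  df = pd.DataFrame({'foo': [1.5, -2.0, 3.0, -0.5]})
  # keep positive values in one column, negative values in another,
  # filling the other side with the default value 0
  df['longs'] = df['foo'].where(df['foo'] > 0, 0)
  df['shorts'] = df['foo'].where(df['foo'] < 0, 0)
  print(df)
  #    foo  longs  shorts
  # 0  1.5    1.5     0.0
  # 1 -2.0    0.0    -2.0
  # 2  3.0    3.0     0.0
  # 3 -0.5    0.0    -0.5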
==== lookup value ====

To pick the first value in column 'A' for rows where column 'B' is FOO

  df.loc[df['B'] == 'FOO', 'A'].iloc[0]

Example:

  $ ipython
  In [1]: import pandas as pd
          df = pd.DataFrame({'A': ['p1', 'p2', 'p3', 'p4'], 'B': [1, 3, 3, 2]})
          print(df)
      A  B
  0  p1  1
  1  p2  3
  2  p3  3
  3  p4  2

  In [2]: df.loc[df['B'] == 3, 'A']
  Out[2]:
  1    p2
  2    p3
  Name: A, dtype: object

  In [3]: df.loc[df['B'] == 3, 'A'].iloc[0]
  Out[3]: 'p2'

search tags | value of one column when another column equals something

Ref:- https://stackoverflow.com/questions/36684013/extract-column-value-based-on-another-column-in-pandas

===== Series related =====

==== view the index, type, length and name of the series ====

  s.index
  s.dtype
  s.size
  s.name

==== unique types found in an object series ====

First, use the ''.apply'' method with the ''type'' function to get back a Series that has the type of every member. Then //chain// the ''.unique'' method onto the result.

  s.apply(type).unique()
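A small demo on a made-up mixed-type series:

  In [1]: import pandas as pd
          s = pd.Series([1, 'a', 3.5, 'b'])
          s.apply(type).unique()
  Out[1]: array([<class 'int'>, <class 'str'>, <class 'float'>], dtype=object)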
===== Tasks =====

  * [[extract last field from each row of a column]]
  * [[map one column to another]]
  * [[show all the rows and columns]]
    * tags | max_rows, max_columns, set_option, all rows, all columns
  * [[print all characters in a cell]]
    * tags | max_colwidth, set_option
  * [[set index from 0 to N]]
  * [[print dataframe without index]]
    * tags | pretty print dataframe
  * [[Add dates]]
  * [[Add commas]]
  * [[Get first element of series if not empty]]
  * [[Order columns alphabetically]]
  * [[Get the first non null value in each column]]
  * [[Convert a dictionary of dataframes to a big dataframe]]
  * [[Convert string to date]]
  * [[print hundredths]]

===== data transformation related =====

==== matches to coplays ====

Consider the dataframe

     match_id player_id
  0         0         a
  1         0         b
  2         0         c
  3         1         a
  4         1         b
  5         2         c

which shows the players that played in a given match. For example, it shows that [a, b, c] played in match 0, [a, b] played in match 1, and c alone played in match 2. We want to get a coplay count

    player_id1 player_id2  size
  0          a          a     2
  1          a          b     2
  2          a          c     1
  3          b          a     2
  4          b          b     2
  5          b          c     1
  6          c          a     1
  7          c          b     1
  8          c          c     2

which shows
  * the number of games each player played with another player when player_id1 $ \neq $ player_id2 (ex:- a and b played in 2 games, a and c played in 1 game)
  * the number of games a player played if player_id1 $ == $ player_id2

[[matches to coplays | Solution]]

===== difference between =====

  * [[astype vs. to_numeric]]

===== convert stuff =====

==== convert column names to lower case and replace space with underscore ====

  df.columns = df.columns.str.lower().str.replace(' ', '_')

Example:

  $ ipython
  Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
  Type 'copyright', 'credits' or 'license' for more information
  IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          df = pd.DataFrame(
              [['2023-03-02', '2023-02-28', 3.20, 3.30],
               ['2022-12-08', '2022-11-30', 3.14, 3.10]],
              columns=["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
          df
  Out[1]:
    Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

  In [2]: df.columns = df.columns.str.lower().str.replace(' ', '_')

  In [3]: df
  Out[3]:
    announcement_date fiscal_quarter_end  estimated_eps  actual_eps
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

==== convert column names to lower case ====

Approach 1:

  df.columns = df.columns.str.lower()

Approach 2:

  df.rename(columns=lambda x: x.lower(), inplace=True)

Notes:
  * I prefer approach 1 - simple syntax, easy to remember.

Use case: While merging data from two dataframes using DataFrame.merge(), I ended up with two columns with the same name but differing in case (ex: foo from df1, FOO from df2). This caused problems when I tried to upload the data into a hadoop cluster since hive is not case sensitive. As a workaround, I converted the column names in df2 to lower case and then merged using pd.merge(df1, df2, ..., suffixes=('_df1', '_df2')). The resulting dataframe will now have foo_df1, foo_df2 columns.

Example (using approach 1):

  $ ipython
  Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
  Type 'copyright', 'credits' or 'license' for more information
  IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          df = pd.DataFrame(
              [['2023-03-02', '2023-02-28', 3.20, 3.30],
               ['2022-12-08', '2022-11-30', 3.14, 3.10]],
              columns=["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
          df
  Out[1]:
    Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

  In [2]: df.columns = df.columns.str.lower()

  In [3]: df
  Out[3]:
    announcement date fiscal quarter end  estimated eps  actual eps
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

Example (using approach 2):

  $ ipython
  Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
  IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          df = pd.DataFrame(
              [['2023-03-02', '2023-02-28', 3.20, 3.30],
               ['2022-12-08', '2022-11-30', 3.14, 3.10]],
              columns=["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
          df
  Out[1]:
    Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

  In [2]: df.rename(columns=lambda x: x.lower(), inplace=True)

  In [3]: df
  Out[3]:
    announcement date fiscal quarter end  estimated eps  actual eps
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

==== convert all categorical columns to string columns ====

  df.loc[:, df.dtypes == "category"] = df.select_dtypes(
      ["category"]
  ).apply(lambda x: x.astype("object"))

Ref:- https://stackoverflow.com/a/56944992
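Note that on newer pandas versions, assignment through ''.loc'' can happen in place and may leave the dtype as ''category''. A variant that sidesteps this by reassigning the columns wholesale (example data is made up):

  import pandas as pd

  df = pd.DataFrame({
      'grade': pd.Categorical(['a', 'b', 'a']),  # category dtype
      'score': [90, 80, 85],                     # int64, left untouched
  })

  # pick out the categorical columns and reassign them as object columns
  cat_cols = df.select_dtypes(["category"]).columns
  df[cat_cols] = df[cat_cols].astype("object")
  print(df.dtypes)  # grade becomes object, score stays int64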
==== convert all columns in a dataframe to string ====

  df.astype(str)

Example:

  In [1]: import pandas as pd
          df = pd.DataFrame({'a': [648, 435], 'b': [175, 389]})
          print(df)
       a    b
  0  648  175
  1  435  389

  In [2]: df.dtypes
  Out[2]:
  a    int64
  b    int64
  dtype: object

  In [3]: df2 = df.astype(str)
          df2.dtypes
  Out[3]:
  a    object
  b    object
  dtype: object

  In [4]: print(df2)
       a    b
  0  648  175
  1  435  389

==== convert two columns of dataframe into a dictionary ====

  df.set_index('id')['value'].to_dict()

Example

  In [1]: import pandas as pd
          df = pd.DataFrame({'id': [0, 1, 2], 'value': [1.7, 5.2, 4.4]})
          df
  Out[1]:
     id  value
  0   0    1.7
  1   1    5.2
  2   2    4.4

  In [2]: df.set_index('id')['value'].to_dict()
  Out[2]: {0: 1.7, 1: 5.2, 2: 4.4}

Ref:- https://stackoverflow.com/questions/18695605/python-pandas-dataframe-to-dictionary

==== convert % to numbers ====

tags | convert percent symbol in strings to numbers

  In [2]: import pandas as pd
          df = pd.DataFrame({'foo': ['1.8%', '-2.5%', '0.7%', '3.2%']})
          df
  Out[2]:
      foo
  0  1.8%
  1 -2.5%
  2  0.7%
  3  3.2%

  In [3]: df['foo'] = df['foo'].str.rstrip('%').astype(float)/100

  In [4]: df
  Out[4]:
       foo
  0  0.018
  1 -0.025
  2  0.007
  3  0.032

Tested using python 3.10.9, ipython 8.8.0, and pandas 1.5.2

Ref:- https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv

===== dummy =====

==== json to dataframe ====

^ From          ^ To            ^ Use                                          ^
| JSON          | DataFrame     | pandas.read_json()                           |
| DataFrame     | JSON          | df.to_json() or df.to_json(orient='records') |
| JSON          | Python object | json.loads()                                 |
| Python object | JSON          | json.dumps()                                 |

Sample code: [[https://nbviewer.jupyter.org/github/KamarajuKusumanchi/notebooks/blob/master/pandas/json%20to%20dataframe.ipynb | nbviewer/github/KamarajuKusumanchi]]

Ref:-
  * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
  * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
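A minimal round-trip sketch (the records are made up; newer pandas versions want file-like input for literal JSON strings, hence the StringIO wrapper):

  from io import StringIO
  import pandas as pd

  raw = '[{"name": "Costco", "value": 200}, {"name": "Walmart", "value": 500}]'
  df = pd.read_json(StringIO(raw))   # JSON -> DataFrame
  print(df)
  #       name  value
  # 0   Costco    200
  # 1  Walmart    500

  print(df.to_json(orient='records'))  # DataFrame -> JSON
  # [{"name":"Costco","value":200},{"name":"Walmart","value":500}]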
==== sum if columns exist ====

tags | sum if column exists

  $ ipython
  In [1]: import numpy as np
          import pandas as pd
          df = pd.DataFrame({'ID1': [15.3, 12.1, 13.2, 10.0, np.nan],
                             'ID2': [7.0, 7.7, np.nan, 11.3, np.nan],
                             'ID5': [10, 15, 3.1, 2.2, np.nan]})
          df
  Out[1]:
      ID1   ID2   ID5
  0  15.3   7.0  10.0
  1  12.1   7.7  15.0
  2  13.2   NaN   3.1
  3  10.0  11.3   2.2
  4   NaN   NaN   NaN

  In [2]: List_ID = ['ID1', 'ID2', 'ID3']
          cols = df.columns[df.columns.isin(List_ID)]
          cols
  Out[2]: Index(['ID1', 'ID2'], dtype='object')

  In [3]: res = df[cols].sum(axis=1)
          res
  Out[3]:
  0    22.3
  1    19.8
  2    13.2
  3    21.3
  4     0.0
  dtype: float64

Using Python 3.9.4, IPython 7.22.0, pandas 1.2.4, numpy 1.20.1

Ref:- https://stackoverflow.com/questions/38700848/adding-columns-if-they-exist-in-the-dataframe-pandas

==== get unique values in a series as another series ====

  * Series.unique() returns a numpy.ndarray
  * Series.drop_duplicates() returns a Series
  * Series.drop_duplicates(keep='first') retains the first occurrence of any duplicates, keep='last' retains the last occurrence, and keep=False retains NONE of the duplicates. keep='first' is the default.

  In [1]: import pandas as pd
          df = pd.DataFrame({'a': [1, 2, 3, 3], 'b': ['foo', 'foo', 'bar', 'bar']})
          df
  Out[1]:
     a    b
  0  1  foo
  1  2  foo
  2  3  bar
  3  3  bar

  In [2]: df['a'].unique()
  Out[2]: array([1, 2, 3], dtype=int64)

  In [3]: type(df['a'].unique())
  Out[3]: numpy.ndarray

  In [4]: df['a'].drop_duplicates()
  Out[4]:
  0    1
  1    2
  2    3
  Name: a, dtype: int64

  In [5]: type(df['a'].drop_duplicates())
  Out[5]: pandas.core.series.Series

  In [6]: df['a'].drop_duplicates(keep='last')
  Out[6]:
  0    1
  1    2
  3    3
  Name: a, dtype: int64

  In [7]: type(df['a'].drop_duplicates(keep='last'))
  Out[7]: pandas.core.series.Series

Ref:- https://github.com/pandas-dev/pandas/issues/1923#issuecomment-398217427

==== clean up column names ====

  # Strip out the white space from both ends of the column names
  df.columns = df.columns.str.strip()

  # Remove white space from both ends of the column names
  # -> convert them to lower case
  # -> replace space with an underscore
  # -> remove open and close parenthesis
  df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')

==== round a single column ====

How to round a single column in pandas without affecting other columns? For example, given

    item  value1  value2
  0    a    1.12     1.3
  1    a    1.50     2.5
  2    a    0.10     0.0
  3    b    3.30    -1.0
  4    b    4.80    -1.0

how to get

    item  value1  value2
  0    a     1.0     1.3
  1    a     2.0     2.5
  2    a     0.0     0.0
  3    b     3.0    -1.0
  4    b     5.0    -1.0

Solution:

  df['value1'] = df['value1'].round()

Details:

  In [1]: import pandas as pd
          df = pd.DataFrame({'item': ['a', 'a', 'a', 'b', 'b'],
                             'value1': [1.12, 1.50, 0.10, 3.30, 4.80],
                             'value2': [1.3, 2.5, 0.0, -1.0, -1.0]})
          df
  Out[1]:
    item  value1  value2
  0    a    1.12     1.3
  1    a    1.50     2.5
  2    a    0.10     0.0
  3    b    3.30    -1.0
  4    b    4.80    -1.0

  In [2]: df['value1'] = df['value1'].round()
          df
  Out[2]:
    item  value1  value2
  0    a     1.0     1.3
  1    a     2.0     2.5
  2    a     0.0     0.0
  3    b     3.0    -1.0
  4    b     5.0    -1.0

==== relative percentage difference ====

Code:

  % cat rel_pct_diff.py
  import pandas as pd
  import numpy as np
  print('Using pandas', pd.__version__, ', numpy', np.__version__)
  df = pd.DataFrame({'old': [2, 1, 0, 5, 0],
                     'new': [2.1, 1.1, 0.1, 4.9, 0]})
  print('original df')
  print(df)
  df['rel_pct_diff1'] = ((df['new'] / df['old']) - 1) * 100
  df['rel_pct_diff2'] = ((df['new'].divide(
      df['old'].where(df['old'] != 0, np.nan))) - 1) * 100
  print(df)

Run:

  % python ./rel_pct_diff.py
  Using pandas 1.1.3 , numpy 1.19.1
  original df
     old  new
  0    2  2.1
  1    1  1.1
  2    0  0.1
  3    5  4.9
  4    0  0.0
     old  new  rel_pct_diff1  rel_pct_diff2
  0    2  2.1            5.0            5.0
  1    1  1.1           10.0           10.0
  2    0  0.1            inf            NaN
  3    5  4.9           -2.0           -2.0
  4    0  0.0            NaN            NaN

==== how to do XYZ ====

  * https://stackoverflow.com/questions/51705595/pandas-map-values-from-one-column-to-another-column - shows how to link the previous position of a vehicle with its current position. The solution uses groupby + shift + (bfill or fillna); see the sketch below.
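A minimal sketch of that groupby + shift idea (the column names and data are made up for illustration):

  import pandas as pd

  df = pd.DataFrame({
      'vehicle': ['v1', 'v1', 'v1', 'v2', 'v2'],
      'position': [10, 20, 30, 5, 15],
  })
  # previous position of each vehicle; the first row per vehicle has no
  # previous position, so fill it from the current value
  df['prev_position'] = (df.groupby('vehicle')['position']
                           .shift(1)
                           .fillna(df['position']))
  print(df)
  #   vehicle  position  prev_position
  # 0      v1        10           10.0
  # 1      v1        20           10.0
  # 2      v1        30           20.0
  # 3      v2         5            5.0
  # 4      v2        15            5.0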
==== start column numbers from 1 ====

By default, column numbers start from 0.

  % ipython
  Python 3.8.3 (default, Jul 2 2020, 16:21:59)
  Type 'copyright', 'credits' or 'license' for more information
  IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
     ...: import numpy as np
     ...: n = np.ones(10).reshape(2,5)
     ...: print(n)
  [[1. 1. 1. 1. 1.]
   [1. 1. 1. 1. 1.]]

  In [2]: df = pd.DataFrame(n)
     ...: print(df)
       0    1    2    3    4
  0  1.0  1.0  1.0  1.0  1.0
  1  1.0  1.0  1.0  1.0  1.0

To change them to start from 1

  In [3]: df.columns = range(1, df.shape[1]+1)
     ...: print(df)
       1    2    3    4    5
  0  1.0  1.0  1.0  1.0  1.0
  1  1.0  1.0  1.0  1.0  1.0

where df.shape[1] gives the number of columns in a dataframe.

  In [4]: print(df.shape[1])
  5

==== read everything as string ====

  pd.read_csv(file_path, dtype=str)

Ref:- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

==== dollars two decimal places ====

Use float_format="%.2f"

  df.to_csv(file_path, float_format="%.2f")

==== number of days between two YYYYMMDD string columns ====

tags | number of days between two dates, convert YYYYMMDD to date

  a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
  b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
  df['days'] = (a_dt - b_dt).dt.days

See also:
  * https://docs.microsoft.com/en-us/sql/t-sql/functions/datediff-transact-sql?view=sql-server-ver15 - similar operation in sql. {tags | sql datediff dd}
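For example, with made-up YYYYMMDD strings:

  In [1]: import pandas as pd
          df = pd.DataFrame({'a': ['20230615', '20230101'],
                             'b': ['20230601', '20221215']})
          a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
          b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
          df['days'] = (a_dt - b_dt).dt.days
          df
  Out[1]:
            a         b  days
  0  20230615  20230601    14
  1  20230101  20221215    17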
==== Select columns from a dataframe only if they exist ====

  df[df.columns.intersection(set(['list', 'of', 'cols']))]

Example:

  $ ipython
  Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
  Type 'copyright', 'credits' or 'license' for more information
  IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          df = pd.DataFrame(columns=[1,2,3,4])
          df
  Out[1]:
  Empty DataFrame
  Columns: [1, 2, 3, 4]
  Index: []

  In [2]: df[df.columns.intersection(set([1, 2, 2, 5]))]
  Out[2]:
  Empty DataFrame
  Columns: [1, 2]
  Index: []

  In [3]: pd.__version__
  Out[3]: '1.2.1'

See also:- https://stackoverflow.com/questions/43537166/select-columns-from-dataframe-on-condition-they-exist

==== migrate sql case when else statements to build a new column ====

Use np.select(); see the sketch after the references.

Ref:-
  * https://stackoverflow.com/questions/49228596/pandas-case-when-default-in-pandas - contains an example and other possible approaches
  * https://stackoverflow.com/a/57392776
  * https://numpy.org/doc/stable/reference/generated/numpy.select.html
  * https://www.w3schools.com/sql/sql_case.asp
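A minimal sketch of the CASE WHEN ... ELSE pattern with np.select() (the column name and grade labels are made up):

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({'score': [95, 72, 55, 88]})
  # SQL equivalent:
  #   CASE WHEN score >= 90 THEN 'A'
  #        WHEN score >= 70 THEN 'B'
  #        ELSE 'C' END
  # np.select evaluates the conditions in order; the first match wins
  conditions = [df['score'] >= 90, df['score'] >= 70]
  choices = ['A', 'B']
  df['grade'] = np.select(conditions, choices, default='C')
  print(df)
  #    score grade
  # 0     95     A
  # 1     72     B
  # 2     55     C
  # 3     88     B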
==== initialize column with strings that range from 1 to N ====

To set it to '1', '2', ..., 'N'

  df['foo'] = [str(x) for x in range(1, 1 + df.shape[0])]

To set it to 'A_1', 'A_2', ..., 'A_N'

  df['foo'] = ['A_' + str(x) for x in range(1, 1 + df.shape[0])]

==== excel shows alternate blank lines in csv files written using to_csv() ====

Call to_csv() with lineterminator="\n". For example

  df.to_csv(path, index=False, lineterminator="\n")

tags | to_csv files have an extra line ending character at the end, to_csv line endings

Ref:- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

===== Missing values =====

==== number of missing values in a dataframe ====

  * df.isna().sum().sum() - total number of missing values in the entire dataframe
  * df.isna().sum() - number of missing values per column
  * df.isna() - boolean dataframe with True if a value is missing, False otherwise

  In [1]: import pandas as pd
          import numpy as np
          df = pd.DataFrame({'A': [3, None, 5, 7], 'B': [np.nan, 5, None, 4]})
          df
  Out[1]:
       A    B
  0  3.0  NaN
  1  NaN  5.0
  2  5.0  NaN
  3  7.0  4.0

  In [2]: df.isna().sum().sum()
  Out[2]: 3

  In [3]: df.isna().sum()
  Out[3]:
  A    1
  B    2
  dtype: int64

  In [4]: df.isna()
  Out[4]:
         A      B
  0  False   True
  1   True  False
  2  False   True
  3  False  False

==== fill missing values using a lookup table ====

Q. Fill the NaNs in

                      title            industry
  0     Executive Secretary              Health
  1  Administrative Officer  Financial Services
  2      Recruiting Manager            Property
  3           Senior Editor                 NaN
  4         Media Manager I                 NaN

using the lookup table

                   title       industry
  0  Executive Secretary         Retail
  1        Senior Editor  Manufacturing
  2      Media Manager I         Health

so that we get

                      title            industry
  0     Executive Secretary              Health
  1  Administrative Officer  Financial Services
  2      Recruiting Manager            Property
  3           Senior Editor       Manufacturing
  4         Media Manager I              Health

Ans: Get the missing locations using isna() and then use map with set_index.

  % ipython
  Python 3.8.5 (default, Sep 4 2020, 07:30:14)
  Type 'copyright', 'credits' or 'license' for more information
  IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          import numpy as np
          df = pd.DataFrame({
              'title': ['Executive Secretary', 'Administrative Officer',
                        'Recruiting Manager', 'Senior Editor', 'Media Manager I'],
              'industry': ['Health', 'Financial Services', 'Property', np.nan, np.nan]})
          df
  Out[1]:
                      title            industry
  0     Executive Secretary              Health
  1  Administrative Officer  Financial Services
  2      Recruiting Manager            Property
  3           Senior Editor                 NaN
  4         Media Manager I                 NaN

  In [2]: lookup = pd.DataFrame({
              'title': ['Executive Secretary', 'Senior Editor', 'Media Manager I'],
              'industry': ['Retail', 'Manufacturing', 'Health']})
          lookup
  Out[2]:
                   title       industry
  0  Executive Secretary         Retail
  1        Senior Editor  Manufacturing
  2      Media Manager I         Health

  In [3]: missing = df['industry'].isna()

  In [4]: df.loc[missing, 'industry'] = df.loc[missing, 'title'].map(
              lookup.set_index('title')['industry'])
          df
  Out[4]:
                      title            industry
  0     Executive Secretary              Health
  1  Administrative Officer  Financial Services
  2      Recruiting Manager            Property
  3           Senior Editor       Manufacturing
  4         Media Manager I              Health

Ref:- https://stackoverflow.com/questions/64438066/how-can-i-fillna-based-on-the-columns-from-another-dataframe/ - original question. Here, I changed the column names for brevity.

===== Sample usage =====

==== read_csv usage ====

  * Read a csv file that has no headers and interpret all columns as strings: ''pd.read_csv(input_file, dtype=str, sep='|', header=None)''
  * Drop duplicate lines: ''pd.read_csv(input_file).drop_duplicates()''

==== using astype ====

  df = df.astype({'col_a': np.int64, 'col_b': np.int64})
  df['col_a'] = df['col_a'].astype(np.int64)

===== stackoverflow questions I came across =====

==== normalize data within each group ====

  * https://stackoverflow.com/questions/46419180/pandas-normalize-within-the-group - see the sketch below
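A minimal sketch of one way to do it, using groupby + transform. The normalization here divides each value by its group's sum; the linked question discusses variants. The data is made up.

  import pandas as pd

  df = pd.DataFrame({'group': ['x', 'x', 'y', 'y'],
                     'value': [1.0, 3.0, 2.0, 8.0]})
  # divide each value by the sum of its group; transform('sum') broadcasts
  # the per-group sum back onto the original rows
  df['normalized'] = df['value'] / df.groupby('group')['value'].transform('sum')
  print(df)
  #   group  value  normalized
  # 0     x    1.0        0.25
  # 1     x    3.0        0.75
  # 2     y    2.0        0.20
  # 3     y    8.0        0.80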
==== find many to one mappings ====

  * https://stackoverflow.com/questions/24005064/finding-rows-in-a-pandas-dataframe-with-columns-that-violate-a-one-to-one-mappin

==== links ====

===== pivot related =====

==== pivot and sum ====

Call pivot_table() with aggfunc=np.sum

Ref:-
  * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html - contains examples
  * https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html

==== convert pivot table output to dataframe ====

Sample code

  table = df[cols_of_interest].pivot_table(values=['amount'],
                                           index=['foo', 'bar'],
                                           columns=['class_name'],
                                           aggfunc=np.sum)
  table.columns = table.columns.droplevel(0)  # removes the values from level 0
  table.columns.name = None                   # removes the columns' name
  table = table.reset_index()                 # changes index to columns

Ref:-
  * https://stackoverflow.com/questions/43756052/transform-pandas-pivot-table-to-regular-dataframe - where I found the answer
  * https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html

==== normalize data ====

How to convert

  $ cat data.csv
  node_id,sim_id,portfolio_risk_total
  11,1,10
  11,2,20
  11,3,30
  11,4,40
  22,1,100
  22,2,200
  22,3,300
  22,4,400
  33,1,1000
  33,2,2000
  33,3,3000
  33,4,4000

to

  $ cat out.csv
  sim_id,11,22,33
  1,10,100,1000
  2,20,200,2000
  3,30,300,3000
  4,40,400,4000

Answer: The technical term for this is 'pivoting'. You are converting "stacked" or "record" format data into a normalized form. Below is how to do it. For more complex and related operations, see https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html .

  In [1]: import pandas as pd
          df = pd.read_csv('data.csv')
          df
  Out[1]:
      node_id  sim_id  portfolio_risk_total
  0        11       1                    10
  1        11       2                    20
  2        11       3                    30
  3        11       4                    40
  4        22       1                   100
  5        22       2                   200
  6        22       3                   300
  7        22       4                   400
  8        33       1                  1000
  9        33       2                  2000
  10       33       3                  3000
  11       33       4                  4000

  In [2]: table = df.pivot(index='sim_id', columns='node_id', values='portfolio_risk_total')
          table
  Out[2]:
  node_id  11   22    33
  sim_id
  1        10  100  1000
  2        20  200  2000
  3        30  300  3000
  4        40  400  4000

  In [3]: table.columns.name = None
          table = table.reset_index()
          table
  Out[3]:
     sim_id  11   22    33
  0       1  10  100  1000
  1       2  20  200  2000
  2       3  30  300  3000
  3       4  40  400  4000

  In [4]: table.to_csv('out.csv', index=False)

===== pandas documentation links I came across =====

==== dummy ====

^ function                ^ url                                                                                      ^ notes ^
| pandas.Series.str.split | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html | |
| pandas.Series.str.join  | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.join.html#pandas.Series.str.join | |

^ function                 ^ url                                                                                       ^ notes ^
| pandas.DataFrame.fillna  | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html  | |
| pandas.DataFrame.replace | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html | |
| pandas.DataFrame.rename  | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html  | |
| pandas.DataFrame.notna   | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html   | |

==== user guide ====

^ url                                                                    ^ title                      ^ notes ^
| https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html | Reshaping and pivot tables | tags: stacked or record format, denormalized to normalized data |