===== Creating a dataframe =====

==== Create a dataframe from list of lists ====

  >>> import pandas as pd
  >>> data = [[0, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
  >>> df = pd.DataFrame(data)
  >>> df
     0  1  2  3
  0  0  1  0  1
  1  1  0  1  1
  2  0  1  1  1

You can also specify a list of columns for the dataframe.

  >>> columns = ['a', 'b', 'c', 'd']
  >>> df = pd.DataFrame(data, columns=columns)
  >>> df
     a  b  c  d
  0  0  1  0  1
  1  1  0  1  1
  2  0  1  1  1

To specify the index

  >>> df = pd.DataFrame([[1, 2, 3, 4, 15, 6], ['1', '2', '3', '4', 'F']], index=['CT', 'NY'])
  >>> df
      0  1  2  3   4    5
  CT  1  2  3  4  15  6.0
  NY  1  2  3  4   F  NaN

To make each list into a column, use zip

  % ipython
  Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
  IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          # create lists
          l1 = ["Amar", "Barsha", "Carlos", "Tanmay", "Misbah"]
          l2 = ["Alpha", "Bravo", "Charlie", "Tango", "Mike"]
          l3 = [23, 25, 22, 27, 29]
          l4 = [69, 54, 73, 70, 74]
          # create the dataframe
          team = pd.DataFrame(list(zip(l1, l2, l3, l4)),
                              columns=['Name', 'Code', 'Age', 'Weight'])
          print(team)
       Name     Code  Age  Weight
  0    Amar    Alpha   23      69
  1  Barsha    Bravo   25      54
  2  Carlos  Charlie   22      73
  3  Tanmay    Tango   27      70
  4  Misbah     Mike   29      74

Ref:-
  * https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea on zip from here.

tags | row by row

==== Create a dataframe by splitting strings ====

Given a list of strings, the idea here is to create a dataframe by splitting them into multiple columns.

Load the data into a pandas series

  % ipython
  Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
  IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          strings = ['Netflix 100', 'Costco 200', 'Walmart 500', 'Costco 500']
          s = pd.Series(strings)
          print(s)
  0    Netflix 100
  1     Costco 200
  2    Walmart 500
  3     Costco 500
  dtype: object

Create a dataframe by splitting the strings into multiple columns

  In [2]: df = s.str.split(expand=True)
          print(df)
           0    1
  0  Netflix  100
  1   Costco  200
  2  Walmart  500
  3   Costco  500

Name the columns. str.split gives everything as strings. Change them to numbers as needed.

  In [3]: df.columns = ['company', 'value']
          df['value'] = df['value'].astype('float')
          print(df)
     company  value
  0  Netflix  100.0
  1   Costco  200.0
  2  Walmart  500.0
  3   Costco  500.0

See also:
  * I used this trick to answer a question on stackoverflow - https://stackoverflow.com/a/75212176/6305733
  * https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html
  * https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html

also demonstrates | assign column names to a dataframe

==== Create a dataframe from a series of lists ====

tags | convert series with lists to dataframe

  df = pd.DataFrame(s.to_list())

For example

  In [1]: import pandas as pd
          s = pd.Series([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
          s
  Out[1]:
  0       [1, 2, 3]
  1    [4, 5, 6, 7]
  2          [8, 9]
  dtype: object

  In [2]: df = pd.DataFrame(s.to_list())
          df
  Out[2]:
     0  1    2    3
  0  1  2  3.0  NaN
  1  4  5  6.0  7.0
  2  8  9  NaN  NaN

If every list has the same number of elements, np.vstack() can also be used; otherwise it will fail.
For example

  In [5]: s
  Out[5]:
  0       [1, 2, 3]
  1    [4, 5, 6, 7]
  2          [8, 9]
  dtype: object

  In [6]: import numpy as np
          df = pd.DataFrame(np.vstack(s))
  ---------------------------------------------------------------------------
  ValueError                                Traceback (most recent call last)
  Cell In[6], line 2
        1 import numpy as np
  ----> 2 df = pd.DataFrame(np.vstack(s))

  File <__array_function__ internals>:200, in vstack(*args, **kwargs)

  File ~\AppData\Local\conda\conda\envs\py311\Lib\site-packages\numpy\core\shape_base.py:296, in vstack(tup, dtype, casting)
      294 if not isinstance(arrs, list):
      295     arrs = [arrs]
  --> 296 return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)

  File <__array_function__ internals>:200, in concatenate(*args, **kwargs)

  ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 4

But with lists of equal length, both approaches give the same result:

  In [10]: s = pd.Series([[1, 2, 3], [4, 5, 6]])
           s
  Out[10]:
  0    [1, 2, 3]
  1    [4, 5, 6]
  dtype: object

  In [11]: import numpy as np
           df = pd.DataFrame(np.vstack(s))
           df
  Out[11]:
     0  1  2
  0  1  2  3
  1  4  5  6

  In [12]: df = pd.DataFrame(s.to_list())
           df
  Out[12]:
     0  1  2
  0  1  2  3
  1  4  5  6

See also:
  * https://stackoverflow.com/questions/45901018/convert-pandas-series-of-lists-to-dataframe

==== Create a dataframe from a bunch of variables ====

  import pandas as pd
  df = pd.DataFrame({
      'key': ['var1', 'var2', 'var3'],
      'value': [var1, var2, var3]
  })

For example

  $ ipython
  In [1]: year = 2023; month = 6; date = 15

  In [2]: import pandas as pd
          df = pd.DataFrame({
              'key': ['year', 'month', 'date'],
              'value': [year, month, date]
          })

  In [3]: df
  Out[3]:
       key  value
  0   year   2023
  1  month      6
  2   date     15

  In [4]: df.dtypes
  Out[4]:
  key      object
  value     int64
  dtype: object

It works even if the variables are not of the same type.

  In [5]: year = 2023; month = 'June'; date = 15

  In [6]: df = pd.DataFrame({
              'key': ['year', 'month', 'date'],
              'value': [year, month, date]
          })

  In [7]: df
  Out[7]:
       key value
  0   year  2023
  1  month  June
  2   date    15

  In [8]: df.dtypes
  Out[8]:
  key      object
  value    object
  dtype: object

Tested with Python 3.11.3, IPython 8.12.0

===== selection related =====

==== split columns ====

  * [[https://nbviewer.jupyter.org/github/KamarajuKusumanchi/notebooks/blob/master/pandas/Separate%20positive%20and%20negative%20values.ipynb | Separate positive and negative values (nbviewer.jupyter.org/github/KamarajuKusumanchi)]]
    * description | split a column into two columns based on whether the values are positive or negative, default value, longs and shorts; a minimal sketch of the idea is below
  * [[https://nbviewer.jupyter.org/github/KamarajuKusumanchi/notebooks/blob/master/pandas/dataframe%20split%20column%20string%20into%20multiple%20columns.ipynb | Split string column into multiple columns (nbviewer.jupyter.org/github/KamarajuKusumanchi)]]
    * tags | uses [[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html | pandas.Series.str.split]]
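Here is a minimal sketch of the positive/negative split, assuming made-up column names foo/longs/shorts with 0 as the default value; the notebook above has the full treatment.

  import pandas as pd

  # 'foo', 'longs', 'shorts' are made-up names for illustration
  df = pd.DataFrame({'foo': [1.5, -2.0, 3.0, -0.5]})
  # keep positive values in one column, negative values in another,
  # filling the other side with the default value 0
  df['longs'] = df['foo'].where(df['foo'] > 0, 0)
  df['shorts'] = df['foo'].where(df['foo'] < 0, 0)
  print(df)
  #    foo  longs  shorts
  # 0  1.5    1.5     0.0
  # 1 -2.0    0.0    -2.0
  # 2  3.0    3.0     0.0
  # 3 -0.5    0.0    -0.5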
==== lookup value ====

To pick the first value in column 'A' for rows where column 'B' is FOO

  df.loc[df['B'] == 'FOO', 'A'].iloc[0]

Example:

  $ ipython
  In [1]: import pandas as pd
          df = pd.DataFrame({'A': ['p1', 'p2', 'p3', 'p4'], 'B': [1, 3, 3, 2]})
          print(df)
      A  B
  0  p1  1
  1  p2  3
  2  p3  3
  3  p4  2

  In [2]: df.loc[df['B'] == 3, 'A']
  Out[2]:
  1    p2
  2    p3
  Name: A, dtype: object

  In [3]: df.loc[df['B'] == 3, 'A'].iloc[0]
  Out[3]: 'p2'

search tags | value of one column when another column equals something

Ref:- https://stackoverflow.com/questions/36684013/extract-column-value-based-on-another-column-in-pandas

===== Series related =====

==== view the index, type, length and name of the series ====

  s.index
  s.dtype
  s.size
  s.name

==== unique types found in an object series ====

First, use the ''.apply'' method with the ''type'' function to get back a Series that has the type of every member. Then //chain// the ''.unique'' method onto the result.

  s.apply(type).unique()
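A small demo on a made-up mixed-type series:

  In [1]: import pandas as pd
          s = pd.Series([1, 'a', 3.5, 'b'])
          s.apply(type).unique()
  Out[1]: array([<class 'int'>, <class 'str'>, <class 'float'>], dtype=object)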
===== Tasks =====

  * [[extract last field from each row of a column]]
  * [[map one column to another]]
  * [[show all the rows and columns]]
    * tags | max_rows, max_columns, set_option, all rows, all columns
  * [[print all characters in a cell]]
    * tags | max_colwidth, set_option
  * [[set index from 0 to N]]
  * [[print dataframe without index]]
    * tags | pretty print dataframe
  * [[Add dates]]
  * [[Add commas]]
  * [[Get first element of series if not empty]]
  * [[Order columns alphabetically]]
  * [[Get the first non null value in each column]]
  * [[Convert a dictionary of dataframes to a big dataframe]]
  * [[Convert string to date]]
  * [[print hundredths]]

===== data transformation related =====

==== matches to coplays ====

Consider the dataframe

     match_id player_id
  0         0         a
  1         0         b
  2         0         c
  3         1         a
  4         1         b
  5         2         c

which shows the players that played in a given match. For example, it shows that [a, b, c] played in match 0, [a, b] played in match 1, and c alone played in match 2. We want to get a coplay count

    player_id1 player_id2  size
  0          a          a     2
  1          a          b     2
  2          a          c     1
  3          b          a     2
  4          b          b     2
  5          b          c     1
  6          c          a     1
  7          c          b     1
  8          c          c     2

which shows
  * the number of games each player played with another player when player_id1 $ \neq $ player_id2 (ex:- a and b played in 2 games, a and c played in 1 game)
  * the number of games a player played if player_id1 $ == $ player_id2

[[matches to coplays | Solution]]

===== difference between =====

  * [[astype vs. to_numeric]]

===== convert stuff =====

==== convert column names to lower case and replace space with underscore ====

  df.columns = df.columns.str.lower().str.replace(' ', '_')

Example:

  $ ipython
  Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
  Type 'copyright', 'credits' or 'license' for more information
  IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          df = pd.DataFrame(
              [['2023-03-02', '2023-02-28', 3.20, 3.30],
               ['2022-12-08', '2022-11-30', 3.14, 3.10]],
              columns=["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
          df
  Out[1]:
    Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

  In [2]: df.columns = df.columns.str.lower().str.replace(' ', '_')

  In [3]: df
  Out[3]:
    announcement_date fiscal_quarter_end  estimated_eps  actual_eps
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

==== convert column names to lower case ====

Approach 1:

  df.columns = df.columns.str.lower()

Approach 2:

  df.rename(columns=lambda x: x.lower(), inplace=True)

Notes:
  * I prefer approach 1 - simple syntax, easy to remember.

Use case: While merging data from two dataframes using DataFrame.merge(), I ended up with two columns with the same name but differing in case (ex: foo from df1, FOO from df2). This caused problems when I tried to upload the data into a hadoop cluster since hive is not case sensitive. As a workaround, I converted the column names in df2 to lower case and then merged using pd.merge(df1, df2, ..., suffixes=('_df1', '_df2')). The resulting dataframe will now have foo_df1, foo_df2 columns.

Example (using approach 1):

  $ ipython
  Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
  Type 'copyright', 'credits' or 'license' for more information
  IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          df = pd.DataFrame(
              [['2023-03-02', '2023-02-28', 3.20, 3.30],
               ['2022-12-08', '2022-11-30', 3.14, 3.10]],
              columns=["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
          df
  Out[1]:
    Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

  In [2]: df.columns = df.columns.str.lower()

  In [3]: df
  Out[3]:
    announcement date fiscal quarter end  estimated eps  actual eps
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

Example (using approach 2):

  $ ipython
  Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
  IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          df = pd.DataFrame(
              [['2023-03-02', '2023-02-28', 3.20, 3.30],
               ['2022-12-08', '2022-11-30', 3.14, 3.10]],
              columns=["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
          df
  Out[1]:
    Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

  In [2]: df.rename(columns=lambda x: x.lower(), inplace=True)

  In [3]: df
  Out[3]:
    announcement date fiscal quarter end  estimated eps  actual eps
  0        2023-03-02         2023-02-28           3.20         3.3
  1        2022-12-08         2022-11-30           3.14         3.1

==== convert all categorical columns to string columns ====

  df.loc[:, df.dtypes == "category"] = df.select_dtypes(
      ["category"]
  ).apply(lambda x: x.astype("object"))

Ref:- https://stackoverflow.com/a/56944992
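Note that on newer pandas versions, assignment through ''.loc'' can happen in place and may leave the dtype as ''category''. A variant that sidesteps this by reassigning the columns wholesale (example data is made up):

  import pandas as pd

  df = pd.DataFrame({
      'grade': pd.Categorical(['a', 'b', 'a']),  # category dtype
      'score': [90, 80, 85],                     # int64, left untouched
  })

  # pick out the categorical columns and reassign them as object columns
  cat_cols = df.select_dtypes(["category"]).columns
  df[cat_cols] = df[cat_cols].astype("object")
  print(df.dtypes)  # grade becomes object, score stays int64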
==== convert all columns in a dataframe to string ====

  df.astype(str)

Example:

  In [1]: import pandas as pd
          df = pd.DataFrame({'a': [648, 435], 'b': [175, 389]})
          print(df)
       a    b
  0  648  175
  1  435  389

  In [2]: df.dtypes
  Out[2]:
  a    int64
  b    int64
  dtype: object

  In [3]: df2 = df.astype(str)
          df2.dtypes
  Out[3]:
  a    object
  b    object
  dtype: object

  In [4]: print(df2)
       a    b
  0  648  175
  1  435  389

==== convert two columns of dataframe into a dictionary ====

  df.set_index('id')['value'].to_dict()

Example

  In [1]: import pandas as pd
          df = pd.DataFrame({'id': [0, 1, 2], 'value': [1.7, 5.2, 4.4]})
          df
  Out[1]:
     id  value
  0   0    1.7
  1   1    5.2
  2   2    4.4

  In [2]: df.set_index('id')['value'].to_dict()
  Out[2]: {0: 1.7, 1: 5.2, 2: 4.4}

Ref:- https://stackoverflow.com/questions/18695605/python-pandas-dataframe-to-dictionary

==== convert % to numbers ====

tags | convert percent symbol in strings to numbers

  In [2]: import pandas as pd
          df = pd.DataFrame({'foo': ['1.8%', '-2.5%', '0.7%', '3.2%']})
          df
  Out[2]:
      foo
  0  1.8%
  1 -2.5%
  2  0.7%
  3  3.2%

  In [3]: df['foo'] = df['foo'].str.rstrip('%').astype(float)/100

  In [4]: df
  Out[4]:
       foo
  0  0.018
  1 -0.025
  2  0.007
  3  0.032

Tested using python 3.10.9, ipython 8.8.0, and pandas 1.5.2

Ref:- https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv

===== dummy =====

==== json to dataframe ====

^ From          ^ To            ^ Use                                          ^
| JSON          | DataFrame     | pandas.read_json()                           |
| DataFrame     | JSON          | df.to_json() or df.to_json(orient='records') |
| JSON          | Python object | json.loads()                                 |
| Python object | JSON          | json.dumps()                                 |

Sample code: [[https://nbviewer.jupyter.org/github/KamarajuKusumanchi/notebooks/blob/master/pandas/json%20to%20dataframe.ipynb | nbviewer/github/KamarajuKusumanchi]]

Ref:-
  * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
  * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
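A minimal round-trip sketch (the records are made up; newer pandas versions want file-like input for literal JSON strings, hence the StringIO wrapper):

  from io import StringIO
  import pandas as pd

  raw = '[{"name": "Costco", "value": 200}, {"name": "Walmart", "value": 500}]'
  df = pd.read_json(StringIO(raw))   # JSON -> DataFrame
  print(df)
  #       name  value
  # 0   Costco    200
  # 1  Walmart    500

  print(df.to_json(orient='records'))  # DataFrame -> JSON
  # [{"name":"Costco","value":200},{"name":"Walmart","value":500}]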
==== sum if columns exist ====

tags | sum if column exists

  $ ipython
  In [1]: import numpy as np
          import pandas as pd
          df = pd.DataFrame({'ID1': [15.3, 12.1, 13.2, 10.0, np.nan],
                             'ID2': [7.0, 7.7, np.nan, 11.3, np.nan],
                             'ID5': [10, 15, 3.1, 2.2, np.nan]})
          df
  Out[1]:
      ID1   ID2   ID5
  0  15.3   7.0  10.0
  1  12.1   7.7  15.0
  2  13.2   NaN   3.1
  3  10.0  11.3   2.2
  4   NaN   NaN   NaN

  In [2]: List_ID = ['ID1', 'ID2', 'ID3']
          cols = df.columns[df.columns.isin(List_ID)]
          cols
  Out[2]: Index(['ID1', 'ID2'], dtype='object')

  In [3]: res = df[cols].sum(axis=1)
          res
  Out[3]:
  0    22.3
  1    19.8
  2    13.2
  3    21.3
  4     0.0
  dtype: float64

Using Python 3.9.4, IPython 7.22.0, pandas 1.2.4, numpy 1.20.1

Ref:- https://stackoverflow.com/questions/38700848/adding-columns-if-they-exist-in-the-dataframe-pandas

==== get unique values in a series as another series ====

  * Series.unique() returns a numpy.ndarray
  * Series.drop_duplicates() returns a Series
  * Series.drop_duplicates(keep='first') retains the first occurrence of any duplicates, keep='last' retains the last occurrence, and keep=False retains NONE of the duplicates. keep='first' is the default.

  In [1]: import pandas as pd
          df = pd.DataFrame({'a': [1, 2, 3, 3], 'b': ['foo', 'foo', 'bar', 'bar']})
          df
  Out[1]:
     a    b
  0  1  foo
  1  2  foo
  2  3  bar
  3  3  bar

  In [2]: df['a'].unique()
  Out[2]: array([1, 2, 3], dtype=int64)

  In [3]: type(df['a'].unique())
  Out[3]: numpy.ndarray

  In [4]: df['a'].drop_duplicates()
  Out[4]:
  0    1
  1    2
  2    3
  Name: a, dtype: int64

  In [5]: type(df['a'].drop_duplicates())
  Out[5]: pandas.core.series.Series

  In [6]: df['a'].drop_duplicates(keep='last')
  Out[6]:
  0    1
  1    2
  3    3
  Name: a, dtype: int64

  In [7]: type(df['a'].drop_duplicates(keep='last'))
  Out[7]: pandas.core.series.Series

Ref:- https://github.com/pandas-dev/pandas/issues/1923#issuecomment-398217427

==== clean up column names ====

  # Strip out the white space from both ends of the column names
  df.columns = df.columns.str.strip()

  # Remove white space from both ends of the column names
  # -> convert them to lower case
  # -> replace space with an underscore
  # -> remove open and close parenthesis
  df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')

==== round a single column ====

How to round a single column in pandas without affecting other columns? For example, given

    item  value1  value2
  0    a    1.12     1.3
  1    a    1.50     2.5
  2    a    0.10     0.0
  3    b    3.30    -1.0
  4    b    4.80    -1.0

how to get

    item  value1  value2
  0    a     1.0     1.3
  1    a     2.0     2.5
  2    a     0.0     0.0
  3    b     3.0    -1.0
  4    b     5.0    -1.0

Solution:

  df['value1'] = df['value1'].round()

Details:

  In [1]: import pandas as pd
          df = pd.DataFrame({'item': ['a', 'a', 'a', 'b', 'b'],
                             'value1': [1.12, 1.50, 0.10, 3.30, 4.80],
                             'value2': [1.3, 2.5, 0.0, -1.0, -1.0]})
          df
  Out[1]:
    item  value1  value2
  0    a    1.12     1.3
  1    a    1.50     2.5
  2    a    0.10     0.0
  3    b    3.30    -1.0
  4    b    4.80    -1.0

  In [2]: df['value1'] = df['value1'].round()
          df
  Out[2]:
    item  value1  value2
  0    a     1.0     1.3
  1    a     2.0     2.5
  2    a     0.0     0.0
  3    b     3.0    -1.0
  4    b     5.0    -1.0

==== relative percentage difference ====

Code:

  % cat rel_pct_diff.py
  import pandas as pd
  import numpy as np
  print('Using pandas', pd.__version__, ', numpy', np.__version__)
  df = pd.DataFrame({'old': [2, 1, 0, 5, 0],
                     'new': [2.1, 1.1, 0.1, 4.9, 0]})
  print('original df')
  print(df)
  df['rel_pct_diff1'] = ((df['new'] / df['old']) - 1) * 100
  df['rel_pct_diff2'] = ((df['new'].divide(
      df['old'].where(df['old'] != 0, np.nan))) - 1) * 100
  print(df)

Run:

  % python ./rel_pct_diff.py
  Using pandas 1.1.3 , numpy 1.19.1
  original df
     old  new
  0    2  2.1
  1    1  1.1
  2    0  0.1
  3    5  4.9
  4    0  0.0
     old  new  rel_pct_diff1  rel_pct_diff2
  0    2  2.1            5.0            5.0
  1    1  1.1           10.0           10.0
  2    0  0.1            inf            NaN
  3    5  4.9           -2.0           -2.0
  4    0  0.0            NaN            NaN

==== how to do XYZ ====

  * https://stackoverflow.com/questions/51705595/pandas-map-values-from-one-column-to-another-column - shows how to link the previous position of a vehicle with its current position. The solution uses groupby + shift + (bfill or fillna); see the sketch below.
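A minimal sketch of that groupby + shift idea (the column names and data are made up for illustration):

  import pandas as pd

  df = pd.DataFrame({
      'vehicle': ['v1', 'v1', 'v1', 'v2', 'v2'],
      'position': [10, 20, 30, 5, 15],
  })
  # previous position of each vehicle; the first row per vehicle has no
  # previous position, so fill it from the current value
  df['prev_position'] = (df.groupby('vehicle')['position']
                           .shift(1)
                           .fillna(df['position']))
  print(df)
  #   vehicle  position  prev_position
  # 0      v1        10           10.0
  # 1      v1        20           10.0
  # 2      v1        30           20.0
  # 3      v2         5            5.0
  # 4      v2        15            5.0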
==== start column numbers from 1 ====

By default, column numbers start from 0.

  % ipython
  Python 3.8.3 (default, Jul 2 2020, 16:21:59)
  Type 'copyright', 'credits' or 'license' for more information
  IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
     ...: import numpy as np
     ...: n = np.ones(10).reshape(2,5)
     ...: print(n)
  [[1. 1. 1. 1. 1.]
   [1. 1. 1. 1. 1.]]

  In [2]: df = pd.DataFrame(n)
     ...: print(df)
       0    1    2    3    4
  0  1.0  1.0  1.0  1.0  1.0
  1  1.0  1.0  1.0  1.0  1.0

To change them to start from 1

  In [3]: df.columns = range(1, df.shape[1]+1)
     ...: print(df)
       1    2    3    4    5
  0  1.0  1.0  1.0  1.0  1.0
  1  1.0  1.0  1.0  1.0  1.0

where df.shape[1] gives the number of columns in a dataframe.

  In [4]: print(df.shape[1])
  5

==== read everything as string ====

  pd.read_csv(file_path, dtype=str)

Ref:- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

==== dollars two decimal places ====

Use float_format="%.2f"

  df.to_csv(file_path, float_format="%.2f")

==== number of days between two YYYYMMDD string columns ====

tags | number of days between two dates, convert YYYYMMDD to date

  a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
  b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
  df['days'] = (a_dt - b_dt).dt.days

See also:
  * https://docs.microsoft.com/en-us/sql/t-sql/functions/datediff-transact-sql?view=sql-server-ver15 - similar operation in sql. {tags | sql datediff dd}
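For example, with made-up YYYYMMDD strings:

  In [1]: import pandas as pd
          df = pd.DataFrame({'a': ['20230615', '20230101'],
                             'b': ['20230601', '20221215']})
          a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
          b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
          df['days'] = (a_dt - b_dt).dt.days
          df
  Out[1]:
            a         b  days
  0  20230615  20230601    14
  1  20230101  20221215    17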
==== Select columns from a dataframe only if they exist ====

  df[df.columns.intersection(set(['list', 'of', 'cols']))]

Example:

  $ ipython
  Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
  Type 'copyright', 'credits' or 'license' for more information
  IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          df = pd.DataFrame(columns=[1,2,3,4])
          df
  Out[1]:
  Empty DataFrame
  Columns: [1, 2, 3, 4]
  Index: []

  In [2]: df[df.columns.intersection(set([1, 2, 2, 5]))]
  Out[2]:
  Empty DataFrame
  Columns: [1, 2]
  Index: []

  In [3]: pd.__version__
  Out[3]: '1.2.1'

See also:- https://stackoverflow.com/questions/43537166/select-columns-from-dataframe-on-condition-they-exist

==== migrate sql case when else statements to build a new column ====

Use np.select(); see the sketch after the references.

Ref:-
  * https://stackoverflow.com/questions/49228596/pandas-case-when-default-in-pandas - contains an example and other possible approaches
  * https://stackoverflow.com/a/57392776
  * https://numpy.org/doc/stable/reference/generated/numpy.select.html
  * https://www.w3schools.com/sql/sql_case.asp
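A minimal sketch of the CASE WHEN ... ELSE pattern with np.select() (the column name and grade labels are made up):

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({'score': [95, 72, 55, 88]})
  # SQL equivalent:
  #   CASE WHEN score >= 90 THEN 'A'
  #        WHEN score >= 70 THEN 'B'
  #        ELSE 'C' END
  # np.select evaluates the conditions in order; the first match wins
  conditions = [df['score'] >= 90, df['score'] >= 70]
  choices = ['A', 'B']
  df['grade'] = np.select(conditions, choices, default='C')
  print(df)
  #    score grade
  # 0     95     A
  # 1     72     B
  # 2     55     C
  # 3     88     B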
==== initialize column with strings that range from 1 to N ====

To set it to '1', '2', ..., 'N'

  df['foo'] = [str(x) for x in range(1, 1 + df.shape[0])]

To set it to 'A_1', 'A_2', ..., 'A_N'

  df['foo'] = ['A_' + str(x) for x in range(1, 1 + df.shape[0])]

==== excel shows alternate blank lines in csv files written using to_csv() ====

Call to_csv() with lineterminator="\n". For example

  df.to_csv(path, index=False, lineterminator="\n")

tags | to_csv files have an extra line ending character at the end, to_csv line endings

Ref:- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

===== Missing values =====

==== number of missing values in a dataframe ====

  * df.isna().sum().sum() - total number of missing values in the entire dataframe
  * df.isna().sum() - number of missing values per column
  * df.isna() - boolean dataframe with True if a value is missing, False otherwise

  In [1]: import pandas as pd
          import numpy as np
          df = pd.DataFrame({'A': [3, None, 5, 7], 'B': [np.nan, 5, None, 4]})
          df
  Out[1]:
       A    B
  0  3.0  NaN
  1  NaN  5.0
  2  5.0  NaN
  3  7.0  4.0

  In [2]: df.isna().sum().sum()
  Out[2]: 3

  In [3]: df.isna().sum()
  Out[3]:
  A    1
  B    2
  dtype: int64

  In [4]: df.isna()
  Out[4]:
         A      B
  0  False   True
  1   True  False
  2  False   True
  3  False  False

==== fill missing values using a lookup table ====

Q. Fill the NaNs in

                      title            industry
  0     Executive Secretary              Health
  1  Administrative Officer  Financial Services
  2      Recruiting Manager            Property
  3           Senior Editor                 NaN
  4         Media Manager I                 NaN

using the lookup table

                   title       industry
  0  Executive Secretary         Retail
  1        Senior Editor  Manufacturing
  2      Media Manager I         Health

so that we get

                      title            industry
  0     Executive Secretary              Health
  1  Administrative Officer  Financial Services
  2      Recruiting Manager            Property
  3           Senior Editor       Manufacturing
  4         Media Manager I              Health

Ans: Get the missing locations using isna() and then use map with set_index.

  % ipython
  Python 3.8.5 (default, Sep 4 2020, 07:30:14)
  Type 'copyright', 'credits' or 'license' for more information
  IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

  In [1]: import pandas as pd
          import numpy as np
          df = pd.DataFrame({
              'title': ['Executive Secretary', 'Administrative Officer',
                        'Recruiting Manager', 'Senior Editor', 'Media Manager I'],
              'industry': ['Health', 'Financial Services', 'Property', np.nan, np.nan]})
          df
  Out[1]:
                      title            industry
  0     Executive Secretary              Health
  1  Administrative Officer  Financial Services
  2      Recruiting Manager            Property
  3           Senior Editor                 NaN
  4         Media Manager I                 NaN

  In [2]: lookup = pd.DataFrame({
              'title': ['Executive Secretary', 'Senior Editor', 'Media Manager I'],
              'industry': ['Retail', 'Manufacturing', 'Health']})
          lookup
  Out[2]:
                   title       industry
  0  Executive Secretary         Retail
  1        Senior Editor  Manufacturing
  2      Media Manager I         Health

  In [3]: missing = df['industry'].isna()

  In [4]: df.loc[missing, 'industry'] = df.loc[missing, 'title'].map(
              lookup.set_index('title')['industry'])
          df
  Out[4]:
                      title            industry
  0     Executive Secretary              Health
  1  Administrative Officer  Financial Services
  2      Recruiting Manager            Property
  3           Senior Editor       Manufacturing
  4         Media Manager I              Health

Ref:- https://stackoverflow.com/questions/64438066/how-can-i-fillna-based-on-the-columns-from-another-dataframe/ - original question. Here, I changed the column names for brevity.

===== Sample usage =====

==== read_csv usage ====

  * Read a csv file that has no headers and interpret all columns as strings: ''pd.read_csv(input_file, dtype=str, sep='|', header=None)''
  * Drop duplicate lines: ''pd.read_csv(input_file).drop_duplicates()''

==== using astype ====

  df = df.astype({'col_a': np.int64, 'col_b': np.int64})
  df['col_a'] = df['col_a'].astype(np.int64)

===== stackoverflow questions I came across =====

==== normalize data within each group ====

  * https://stackoverflow.com/questions/46419180/pandas-normalize-within-the-group - see the sketch below
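A minimal sketch of one way to do it, using groupby + transform. The normalization here divides each value by its group's sum; the linked question discusses variants. The data is made up.

  import pandas as pd

  df = pd.DataFrame({'group': ['x', 'x', 'y', 'y'],
                     'value': [1.0, 3.0, 2.0, 8.0]})
  # divide each value by the sum of its group; transform('sum') broadcasts
  # the per-group sum back onto the original rows
  df['normalized'] = df['value'] / df.groupby('group')['value'].transform('sum')
  print(df)
  #   group  value  normalized
  # 0     x    1.0        0.25
  # 1     x    3.0        0.75
  # 2     y    2.0        0.20
  # 3     y    8.0        0.80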
==== find many to one mappings ====

  * https://stackoverflow.com/questions/24005064/finding-rows-in-a-pandas-dataframe-with-columns-that-violate-a-one-to-one-mappin

==== links ====

===== pivot related =====

==== pivot and sum ====

Call pivot_table() with aggfunc=np.sum

Ref:-
  * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html - contains examples
  * https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html

==== convert pivot table output to dataframe ====

Sample code

  table = df[cols_of_interest].pivot_table(values=['amount'],
                                           index=['foo', 'bar'],
                                           columns=['class_name'],
                                           aggfunc=np.sum)
  table.columns = table.columns.droplevel(0)  # removes the values from level 0
  table.columns.name = None                   # removes the columns' name
  table = table.reset_index()                 # changes index to columns

Ref:-
  * https://stackoverflow.com/questions/43756052/transform-pandas-pivot-table-to-regular-dataframe - where I found the answer
  * https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html

==== normalize data ====

How to convert

  $ cat data.csv
  node_id,sim_id,portfolio_risk_total
  11,1,10
  11,2,20
  11,3,30
  11,4,40
  22,1,100
  22,2,200
  22,3,300
  22,4,400
  33,1,1000
  33,2,2000
  33,3,3000
  33,4,4000

to

  $ cat out.csv
  sim_id,11,22,33
  1,10,100,1000
  2,20,200,2000
  3,30,300,3000
  4,40,400,4000

Answer: The technical term for this is 'pivoting'. You are converting "stacked" or "record" format data into a normalized form. Below is how to do it. For more complex and related operations, see https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html .

  In [1]: import pandas as pd
          df = pd.read_csv('data.csv')
          df
  Out[1]:
      node_id  sim_id  portfolio_risk_total
  0        11       1                    10
  1        11       2                    20
  2        11       3                    30
  3        11       4                    40
  4        22       1                   100
  5        22       2                   200
  6        22       3                   300
  7        22       4                   400
  8        33       1                  1000
  9        33       2                  2000
  10       33       3                  3000
  11       33       4                  4000

  In [2]: table = df.pivot(index='sim_id', columns='node_id', values='portfolio_risk_total')
          table
  Out[2]:
  node_id  11   22    33
  sim_id
  1        10  100  1000
  2        20  200  2000
  3        30  300  3000
  4        40  400  4000

  In [3]: table.columns.name = None
          table = table.reset_index()
          table
  Out[3]:
     sim_id  11   22    33
  0       1  10  100  1000
  1       2  20  200  2000
  2       3  30  300  3000
  3       4  40  400  4000

  In [4]: table.to_csv('out.csv', index=False)

===== pandas documentation links I came across =====

==== dummy ====

^ function                ^ url                                                                                      ^ notes ^
| pandas.Series.str.split | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html | |
| pandas.Series.str.join  | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.join.html#pandas.Series.str.join | |

^ function                 ^ url                                                                                       ^ notes ^
| pandas.DataFrame.fillna  | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html  | |
| pandas.DataFrame.replace | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html | |
| pandas.DataFrame.rename  | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html  | |
| pandas.DataFrame.notna   | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html   | |

==== user guide ====

^ url                                                                    ^ title                      ^ notes ^
| https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html | Reshaping and pivot tables | tags: stacked or record format, denormalized to normalized data |