User Tools

Site Tools


pandas_dataframe

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
Last revisionBoth sides next revision
pandas_dataframe [2023/03/13 15:52] – [convert all categorical columns to string columns] adminpandas_dataframe [2023/09/07 21:44] – [split columns] raju
Line 59: Line 59:
   * https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea on zip from here.   * https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea on zip from here.
  
 +tags | row by row
 ==== Create a dataframe by splitting strings ==== ==== Create a dataframe by splitting strings ====
 Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns. Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns.
Line 114: Line 115:
  
 also demonstrates | assign column names to a dataframe also demonstrates | assign column names to a dataframe
 +
 +==== Create a dataframe from a series of lists ====
 +tags | convert series with lists to dataframe
 +<code>
 +df = pd.DataFrame(s.to_list())
 +</code>
 +For example
 +<code>
 +In [1]:
 +import pandas as pd
 +s = pd.Series([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
 +s
 +Out[1]:
 +0       [1, 2, 3]
 +1    [4, 5, 6, 7]
 +2          [8, 9]
 +dtype: object
 +
 +In [2]:
 +df = pd.DataFrame(s.to_list())
 +df
 +Out[2]:
 +    1    2    3
 +0  1  2  3.0  NaN
 +1  4  5  6.0  7.0
 +2  8  9  NaN  NaN
 +</code>
 +
 +If the number of elements in each list is same, np.vstack() can be used but otherwise it will not work. For example
 +<code>
 +In [5]:
 +s
 +Out[5]:
 +0       [1, 2, 3]
 +1    [4, 5, 6, 7]
 +2          [8, 9]
 +dtype: object
 +
 +In [6]:
 +import numpy as np
 +df = pd.DataFrame(np.vstack(s))
 +---------------------------------------------------------------------------
 +ValueError                                Traceback (most recent call last)
 +Cell In[6], line 2
 +      1 import numpy as np
 +----> 2 df = pd.DataFrame(np.vstack(s))
 +
 +File <__array_function__ internals>:200, in vstack(*args, **kwargs)
 +
 +File ~\AppData\Local\conda\conda\envs\py311\Lib\site-packages\numpy\core\shape_base.py:296, in vstack(tup, dtype, casting)
 +    294 if not isinstance(arrs, list):
 +    295     arrs = [arrs]
 +--> 296 return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)
 +
 +File <__array_function__ internals>:200, in concatenate(*args, **kwargs)
 +
 +ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 4
 +</code>
 +
 +But with
 +<code>
 +In [10]:
 +s = pd.Series([[1, 2, 3], [4, 5, 6]])
 +s
 +Out[10]:
 +0    [1, 2, 3]
 +1    [4, 5, 6]
 +dtype: object
 +
 +In [11]:
 +import numpy as np
 +df = pd.DataFrame(np.vstack(s))
 +df
 +Out[11]:
 +    1  2
 +0  1  2  3
 +1  4  5  6
 +
 +In [12]:
 +df = pd.DataFrame(s.to_list())
 +df
 +Out[12]:
 +    1  2
 +0  1  2  3
 +1  4  5  6
 +</code>
 +
 +See also:
 +  * https://stackoverflow.com/questions/45901018/convert-pandas-series-of-lists-to-dataframe
 +
 +==== Create a dataframe from a bunch of variables ====
 +<code>
 +import pandas as pd
 +df = pd.DataFrame({
 +  'key': ['var1', 'var2', 'var3'],
 +  'value': [var1, var2, var3]
 +})
 +</code>
 +
 +For example
 +<code>
 +$ ipython
 +
 +In [1]:
 +year = 2023; month = 6; date = 15
 +
 +In [2]:
 +import pandas as pd
 +df = pd.DataFrame({
 +  'key': ['year', 'month', 'date'],
 +  'value': [year, month, date]
 +})
 +
 +In [3]:
 +df
 +Out[3]:
 +     key  value
 +0   year   2023
 +1  month      6
 +2   date     15
 +
 +In [4]:
 +df.dtypes
 +Out[4]:
 +key      object
 +value     int64
 +dtype: object
 +</code>
 +
 +It works even if the variables are not of the same type.
 +<code>
 +In [5]:
 +year = 2023; month = 'June'; date = 15
 +
 +In [6]:
 +df = pd.DataFrame({
 +  'key': ['year', 'month', 'date'],
 +  'value': [year, month, date]
 +})
 +
 +In [7]:
 +df
 +Out[7]:
 +     key value
 +0   year  2023
 +1  month  June
 +2   date    15
 +
 +In [8]:
 +df.dtypes
 +Out[8]:
 +key      object
 +value    object
 +dtype: object
 +</code>
 +
 +Tested with Python 3.11.3, IPython 8.12.0
  
 ===== selection related ===== ===== selection related =====
Line 125: Line 283:
 tags | uses [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html pandas.Series.str.split] tags | uses [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html pandas.Series.str.split]
  
 +==== lookup value ====
 +To pick the first value in column 'A' for rows where column B is FOO
 +<code>
 +df.loc[df['B'] == 'FOO', 'A'].iloc[0]
 +</code>
 +
 +Example:
 +<code>
 +$ ipython
 +In [1]:
 +import pandas as pd
 +df = pd.DataFrame({'A': ['p1', 'p2', 'p3', 'p4'], 'B': [1, 3, 3, 2]})
 +print(df)
 +    A  B
 +0  p1  1
 +1  p2  3
 +2  p3  3
 +3  p4  2
 +
 +In [2]:
 +df.loc[df['B'] == 3, 'A']
 +Out[2]:
 +1    p2
 +2    p3
 +Name: A, dtype: object
 +
 +In [3]:
 +df.loc[df['B'] == 3, 'A'].iloc[0]
 +Out[3]:
 +'p2'
 +</code>
  
 +search tags | value of one column when another column equals something
 ===== Series related ===== ===== Series related =====
 ==== view the index, type, length and name of the series ==== ==== view the index, type, length and name of the series ====
Line 193: Line 383:
 [[ matches to coplays | Solution ]] [[ matches to coplays | Solution ]]
  
-====== difference between =====+===== difference between =====
   * [[astype vs. to_numeric]]   * [[astype vs. to_numeric]]
  
  
 ===== convert stuff ===== ===== convert stuff =====
 +
 +==== convert column names to lower case and replace space with underscore ====
 +<code>
 +df.columns = df.columns.str.lower().str.replace(' ', '_')
 +</code>
 +
 +Example:
 +<code>
 +$ ipython
 +Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
 +Type 'copyright', 'credits' or 'license' for more information
 +IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.
 +
 +In [1]:
 +import pandas as pd
 +df = pd.DataFrame(
 +    [['2023-03-02', '2023-02-28', 3.20, 3.30],
 +     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
 +    columns = ["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
 +df
 +Out[1]:
 +  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
 +0        2023-03-02         2023-02-28           3.20         3.3
 +1        2022-12-08         2022-11-30           3.14         3.1
 +
 +In [2]:
 +df.columns = df.columns.str.lower().str.replace(' ', '_')
 +
 +In [3]:
 +df
 +Out[3]:
 +  announcement_date fiscal_quarter_end  estimated_eps  actual_eps
 +0        2023-03-02         2023-02-28           3.20         3.3
 +1        2022-12-08         2022-11-30           3.14         3.1
 +</code>
  
 ==== convert column names to lower case ==== ==== convert column names to lower case ====
 +Approach 1:
 +<code>
 +df.columns = df.columns.str.lower()
 +</code>
  
 Approach 2: Approach 2:
Line 206: Line 435:
 </code> </code>
  
-Example:+Notes: 
 +  * I prefer approach1 - simple syntax, easy to remember. 
 + 
 +Use case: While merging data from two data frames using DataFrame.merge(), I ended up with two columns with same name but differing in case (ex: foo from df1, FOO from df2). This caused problems when I tried to upload data into a hadoop cluster since hive is not case sensitive. As a work around, I converted the column names in df2 to lower case and then merged using pd.merge(df1, df2, ..., suffixes = ('_df1', '_df2')). The resulting data frame will now have foo_df1, foo_df2 columns. 
 + 
 +Example (using approach 1): 
 +<code> 
 +$ ipython 
 +Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)] 
 +Type 'copyright', 'credits' or 'license' for more information 
 +IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help. 
 + 
 +In [1]: 
 +import pandas as pd 
 +df = pd.DataFrame( 
 +    [['2023-03-02', '2023-02-28', 3.20, 3.30], 
 +     ['2022-12-08', '2022-11-30', 3.14, 3.10]], 
 +    columns = ["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"]) 
 +df 
 +Out[1]: 
 +  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS 
 +0        2023-03-02         2023-02-28           3.20         3.3 
 +1        2022-12-08         2022-11-30           3.14         3.1 
 + 
 +In [2]: 
 +df.columns = df.columns.str.lower() 
 + 
 +In [3]: 
 +df 
 +Out[3]: 
 +  announcement date fiscal quarter end  estimated eps  actual eps 
 +0        2023-03-02         2023-02-28           3.20         3.3 
 +1        2022-12-08         2022-11-30           3.14         3.1 
 +</code> 
 + 
 +Example (using approach 2):
 <code> <code>
 $ ipython $ ipython
Line 234: Line 498:
 1        2022-12-08         2022-11-30           3.14         3.1 1        2022-12-08         2022-11-30           3.14         3.1
 </code> </code>
 +
  
 ==== convert all categorical columns to string columns ==== ==== convert all categorical columns to string columns ====
pandas_dataframe.txt · Last modified: 2023/09/07 21:45 by raju