User Tools

Site Tools


pandas_dataframe

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
pandas_dataframe [2023/03/13 15:52] – [convert all categorical columns to string columns] adminpandas_dataframe [2023/09/07 21:45] (current) – [lookup value] raju
Line 59: Line 59:
   * https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea on zip from here.   * https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea on zip from here.
  
 +tags | row by row
 ==== Create a dataframe by splitting strings ==== ==== Create a dataframe by splitting strings ====
 Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns. Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns.
Line 114: Line 115:
  
 also demonstrates | assign column names to a dataframe also demonstrates | assign column names to a dataframe
 +
 +==== Create a dataframe from a series of lists ====
 +tags | convert series with lists to dataframe
 +<code>
 +df = pd.DataFrame(s.to_list())
 +</code>
 +For example
 +<code>
 +In [1]:
 +import pandas as pd
 +s = pd.Series([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
 +s
 +Out[1]:
 +0       [1, 2, 3]
 +1    [4, 5, 6, 7]
 +2          [8, 9]
 +dtype: object
 +
 +In [2]:
 +df = pd.DataFrame(s.to_list())
 +df
 +Out[2]:
 +    1    2    3
 +0  1  2  3.0  NaN
 +1  4  5  6.0  7.0
 +2  8  9  NaN  NaN
 +</code>
 +
 +If the number of elements in each list is same, np.vstack() can be used but otherwise it will not work. For example
 +<code>
 +In [5]:
 +s
 +Out[5]:
 +0       [1, 2, 3]
 +1    [4, 5, 6, 7]
 +2          [8, 9]
 +dtype: object
 +
 +In [6]:
 +import numpy as np
 +df = pd.DataFrame(np.vstack(s))
 +---------------------------------------------------------------------------
 +ValueError                                Traceback (most recent call last)
 +Cell In[6], line 2
 +      1 import numpy as np
 +----> 2 df = pd.DataFrame(np.vstack(s))
 +
 +File <__array_function__ internals>:200, in vstack(*args, **kwargs)
 +
 +File ~\AppData\Local\conda\conda\envs\py311\Lib\site-packages\numpy\core\shape_base.py:296, in vstack(tup, dtype, casting)
 +    294 if not isinstance(arrs, list):
 +    295     arrs = [arrs]
 +--> 296 return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)
 +
 +File <__array_function__ internals>:200, in concatenate(*args, **kwargs)
 +
 +ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 4
 +</code>
 +
 +But with
 +<code>
 +In [10]:
 +s = pd.Series([[1, 2, 3], [4, 5, 6]])
 +s
 +Out[10]:
 +0    [1, 2, 3]
 +1    [4, 5, 6]
 +dtype: object
 +
 +In [11]:
 +import numpy as np
 +df = pd.DataFrame(np.vstack(s))
 +df
 +Out[11]:
 +    1  2
 +0  1  2  3
 +1  4  5  6
 +
 +In [12]:
 +df = pd.DataFrame(s.to_list())
 +df
 +Out[12]:
 +    1  2
 +0  1  2  3
 +1  4  5  6
 +</code>
 +
 +See also:
 +  * https://stackoverflow.com/questions/45901018/convert-pandas-series-of-lists-to-dataframe
 +
 +==== Create a dataframe from a bunch of variables ====
 +<code>
 +import pandas as pd
 +df = pd.DataFrame({
 +  'key': ['var1', 'var2', 'var3'],
 +  'value': [var1, var2, var3]
 +})
 +</code>
 +
 +For example
 +<code>
 +$ ipython
 +
 +In [1]:
 +year = 2023; month = 6; date = 15
 +
 +In [2]:
 +import pandas as pd
 +df = pd.DataFrame({
 +  'key': ['year', 'month', 'date'],
 +  'value': [year, month, date]
 +})
 +
 +In [3]:
 +df
 +Out[3]:
 +     key  value
 +0   year   2023
 +1  month      6
 +2   date     15
 +
 +In [4]:
 +df.dtypes
 +Out[4]:
 +key      object
 +value     int64
 +dtype: object
 +</code>
 +
 +It works even if the variables are not of the same type.
 +<code>
 +In [5]:
 +year = 2023; month = 'June'; date = 15
 +
 +In [6]:
 +df = pd.DataFrame({
 +  'key': ['year', 'month', 'date'],
 +  'value': [year, month, date]
 +})
 +
 +In [7]:
 +df
 +Out[7]:
 +     key value
 +0   year  2023
 +1  month  June
 +2   date    15
 +
 +In [8]:
 +df.dtypes
 +Out[8]:
 +key      object
 +value    object
 +dtype: object
 +</code>
 +
 +Tested with Python 3.11.3, IPython 8.12.0
  
 ===== selection related ===== ===== selection related =====
Line 125: Line 283:
 tags | uses [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html pandas.Series.str.split] tags | uses [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html pandas.Series.str.split]
  
 +==== lookup value ====
 +To pick the first value in column 'A' for rows where column B is FOO
 +<code>
 +df.loc[df['B'] == 'FOO', 'A'].iloc[0]
 +</code>
 +
 +Example:
 +<code>
 +$ ipython
 +In [1]:
 +import pandas as pd
 +df = pd.DataFrame({'A': ['p1', 'p2', 'p3', 'p4'], 'B': [1, 3, 3, 2]})
 +print(df)
 +    A  B
 +0  p1  1
 +1  p2  3
 +2  p3  3
 +3  p4  2
 +
 +In [2]:
 +df.loc[df['B'] == 3, 'A']
 +Out[2]:
 +1    p2
 +2    p3
 +Name: A, dtype: object
 +
 +In [3]:
 +df.loc[df['B'] == 3, 'A'].iloc[0]
 +Out[3]:
 +'p2'
 +</code>
 +
 +search tags | value of one column when another column equals something
 +
 +Ref:- https://stackoverflow.com/questions/36684013/extract-column-value-based-on-another-column-in-pandas
  
 ===== Series related ===== ===== Series related =====
Line 193: Line 386:
 [[ matches to coplays | Solution ]] [[ matches to coplays | Solution ]]
  
-====== difference between =====+===== difference between =====
   * [[astype vs. to_numeric]]   * [[astype vs. to_numeric]]
  
  
 ===== convert stuff ===== ===== convert stuff =====
 +
 +==== convert column names to lower case and replace space with underscore ====
 +<code>
 +df.columns = df.columns.str.lower().str.replace(' ', '_')
 +</code>
 +
 +Example:
 +<code>
 +$ ipython
 +Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
 +Type 'copyright', 'credits' or 'license' for more information
 +IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.
 +
 +In [1]:
 +import pandas as pd
 +df = pd.DataFrame(
 +    [['2023-03-02', '2023-02-28', 3.20, 3.30],
 +     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
 +    columns = ["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
 +df
 +Out[1]:
 +  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
 +0        2023-03-02         2023-02-28           3.20         3.3
 +1        2022-12-08         2022-11-30           3.14         3.1
 +
 +In [2]:
 +df.columns = df.columns.str.lower().str.replace(' ', '_')
 +
 +In [3]:
 +df
 +Out[3]:
 +  announcement_date fiscal_quarter_end  estimated_eps  actual_eps
 +0        2023-03-02         2023-02-28           3.20         3.3
 +1        2022-12-08         2022-11-30           3.14         3.1
 +</code>
  
 ==== convert column names to lower case ==== ==== convert column names to lower case ====
 +Approach 1:
 +<code>
 +df.columns = df.columns.str.lower()
 +</code>
  
 Approach 2: Approach 2:
Line 206: Line 438:
 </code> </code>
  
-Example:+Notes: 
 +  * I prefer approach1 - simple syntax, easy to remember. 
 + 
 +Use case: While merging data from two data frames using DataFrame.merge(), I ended up with two columns with same name but differing in case (ex: foo from df1, FOO from df2). This caused problems when I tried to upload data into a hadoop cluster since hive is not case sensitive. As a work around, I converted the column names in df2 to lower case and then merged using pd.merge(df1, df2, ..., suffixes = ('_df1', '_df2')). The resulting data frame will now have foo_df1, foo_df2 columns. 
 + 
 +Example (using approach 1): 
 +<code> 
 +$ ipython 
 +Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)] 
 +Type 'copyright', 'credits' or 'license' for more information 
 +IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help. 
 + 
 +In [1]: 
 +import pandas as pd 
 +df = pd.DataFrame( 
 +    [['2023-03-02', '2023-02-28', 3.20, 3.30], 
 +     ['2022-12-08', '2022-11-30', 3.14, 3.10]], 
 +    columns = ["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"]) 
 +df 
 +Out[1]: 
 +  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS 
 +0        2023-03-02         2023-02-28           3.20         3.3 
 +1        2022-12-08         2022-11-30           3.14         3.1 
 + 
 +In [2]: 
 +df.columns = df.columns.str.lower() 
 + 
 +In [3]: 
 +df 
 +Out[3]: 
 +  announcement date fiscal quarter end  estimated eps  actual eps 
 +0        2023-03-02         2023-02-28           3.20         3.3 
 +1        2022-12-08         2022-11-30           3.14         3.1 
 +</code> 
 + 
 +Example (using approach 2):
 <code> <code>
 $ ipython $ ipython
Line 234: Line 501:
 1        2022-12-08         2022-11-30           3.14         3.1 1        2022-12-08         2022-11-30           3.14         3.1
 </code> </code>
 +
  
 ==== convert all categorical columns to string columns ==== ==== convert all categorical columns to string columns ====
pandas_dataframe.1678722723.txt.gz · Last modified: 2023/03/13 15:52 by admin