Differences

This shows you the differences between two versions of the page.

--- pandas_dataframe [2023/03/13 15:52] – [convert all categorical columns to string columns] admin
+++ pandas_dataframe [2023/09/07 21:44] – [split columns] raju
@@ Line 59: / Line 59: @@
   * https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea on zip from here.
+tags | row by row
 ==== Create a dataframe by splitting strings ====
 Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns.
@@ Line 114: / Line 115: @@
 also demonstrates | assign column names to a dataframe
+==== Create a dataframe from a series of lists ====
+tags | convert series with lists to dataframe
+<code>
+df = pd.DataFrame(s.to_list())
+</code>
+For example
+<code>
+In [1]:
+import pandas as pd
+s = pd.Series([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
+s
+Out[1]:
+       [1, 2, 3]
+    [4, 5, 6, 7]
+          [8, 9]
+dtype: object
+In [2]:
+df = pd.DataFrame(s.to_list())
+df
+Out[2]:
+  1    2    3
+  1  2  3.0  NaN
+  4  5  6.0  7.0
+  8  9  NaN  NaN
+</code>
+If the number of elements in each list is same, np.vstack() can be used but otherwise it will not work. For example
+<code>
+In [5]:
+s
+Out[5]:
+       [1, 2, 3]
+    [4, 5, 6, 7]
+          [8, 9]
+dtype: object
+In [6]:
+import numpy as np
+df = pd.DataFrame(np.vstack(s))
+---------------------------------------------------------------------------
+ValueError                                Traceback (most recent call last)
+Cell In[6], line 2
+import numpy as np
+----> 2 df = pd.DataFrame(np.vstack(s))
+File <__array_function__ internals>:200, in vstack(*args, **kwargs)
+File ~\AppData\Local\conda\conda\envs\py311\Lib\site-packages\numpy\core\shape_base.py:296, in vstack(tup, dtype, casting)
+if not isinstance(arrs, list):
+     arrs = [arrs]
+--> 296 return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)
+File <__array_function__ internals>:200, in concatenate(*args, **kwargs)
+ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 4
+</code>
+But with
+<code>
+In [10]:
+s = pd.Series([[1, 2, 3], [4, 5, 6]])
+s
+Out[10]:
+    [1, 2, 3]
+    [4, 5, 6]
+dtype: object
+In [11]:
+import numpy as np
+df = pd.DataFrame(np.vstack(s))
+df
+Out[11]:
+  1  2
+  1  2  3
+  4  5  6
+In [12]:
+df = pd.DataFrame(s.to_list())
+df
+Out[12]:
+  1  2
+  1  2  3
+  4  5  6
+</code>
+See also:
+  * https://stackoverflow.com/questions/45901018/convert-pandas-series-of-lists-to-dataframe
+==== Create a dataframe from a bunch of variables ====
+<code>
+import pandas as pd
+df = pd.DataFrame({
+  'key': ['var1', 'var2', 'var3'],
+  'value': [var1, var2, var3]
+})
+</code>
+For example
+<code>
+$ ipython
+In [1]:
+year = 2023; month = 6; date = 15
+In [2]:
+import pandas as pd
+df = pd.DataFrame({
+  'key': ['year', 'month', 'date'],
+  'value': [year, month, date]
+})
+In [3]:
+df
+Out[3]:
+     key  value
+   year   2023
+  month      6
+   date     15
+In [4]:
+df.dtypes
+Out[4]:
+key      object
+value     int64
+dtype: object
+</code>
+It works even if the variables are not of the same type.
+<code>
+In [5]:
+year = 2023; month = 'June'; date = 15
+In [6]:
+df = pd.DataFrame({
+  'key': ['year', 'month', 'date'],
+  'value': [year, month, date]
+})
+In [7]:
+df
+Out[7]:
+     key value
+   year  2023
+  month  June
+   date    15
+In [8]:
+df.dtypes
+Out[8]:
+key      object
+value    object
+dtype: object
+</code>
+Tested with Python 3.11.3, IPython 8.12.0
 ===== selection related =====
@@ Line 125: / Line 283: @@
 tags | uses [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html pandas.Series.str.split]
+==== lookup value ====
+To pick the first value in column 'A' for rows where column B is FOO
+<code>
+df.loc[df['B'] == 'FOO', 'A'].iloc[0]
+</code>
+Example:
+<code>
+$ ipython
+In [1]:
+import pandas as pd
+df = pd.DataFrame({'A': ['p1', 'p2', 'p3', 'p4'], 'B': [1, 3, 3, 2]})
+print(df)
+    A  B
+  p1  1
+  p2  3
+  p3  3
+  p4  2
+In [2]:
+df.loc[df['B'] == 3, 'A']
+Out[2]:
+    p2
+    p3
+Name: A, dtype: object
+In [3]:
+df.loc[df['B'] == 3, 'A'].iloc[0]
+Out[3]:
+'p2'
+</code>
+search tags | value of one column when another column equals something
 ===== Series related =====
 ==== view the index, type, length and name of the series ====
@@ Line 193: / Line 383: @@
 [[ matches to coplays | Solution ]]
-====== difference between =====
+===== difference between =====
   * [[astype vs. to_numeric]]
 ===== convert stuff =====
+==== convert column names to lower case and replace space with underscore ====
+<code>
+df.columns = df.columns.str.lower().str.replace(' ', '_')
+</code>
+Example:
+<code>
+$ ipython
+Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
+Type 'copyright', 'credits' or 'license' for more information
+IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.
+In [1]:
+import pandas as pd
+df = pd.DataFrame(
+    [['2023-03-02', '2023-02-28', 3.20, 3.30],
+     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
+    columns = ["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
+df
+Out[1]:
+  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
+        2023-03-02         2023-02-28           3.20         3.3
+        2022-12-08         2022-11-30           3.14         3.1
+In [2]:
+df.columns = df.columns.str.lower().str.replace(' ', '_')
+In [3]:
+df
+Out[3]:
+  announcement_date fiscal_quarter_end  estimated_eps  actual_eps
+        2023-03-02         2023-02-28           3.20         3.3
+        2022-12-08         2022-11-30           3.14         3.1
+</code>
 ==== convert column names to lower case ====
+Approach 1:
+<code>
+df.columns = df.columns.str.lower()
+</code>
 Approach 2:
@@ Line 206: / Line 435: @@
 </code>
-Example:
+Notes:
+  * I prefer approach1 - simple syntax, easy to remember.
+Use case: While merging data from two data frames using DataFrame.merge(), I ended up with two columns with same name but differing in case (ex: foo from df1, FOO from df2). This caused problems when I tried to upload data into a hadoop cluster since hive is not case sensitive. As a work around, I converted the column names in df2 to lower case and then merged using pd.merge(df1, df2, ..., suffixes = ('_df1', '_df2')). The resulting data frame will now have foo_df1, foo_df2 columns.
+Example (using approach 1):
+<code>
+$ ipython
+Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
+Type 'copyright', 'credits' or 'license' for more information
+IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.
+In [1]:
+import pandas as pd
+df = pd.DataFrame(
+    [['2023-03-02', '2023-02-28', 3.20, 3.30],
+     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
+    columns = ["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
+df
+Out[1]:
+  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
+        2023-03-02         2023-02-28           3.20         3.3
+        2022-12-08         2022-11-30           3.14         3.1
+In [2]:
+df.columns = df.columns.str.lower()
+In [3]:
+df
+Out[3]:
+  announcement date fiscal quarter end  estimated eps  actual eps
+        2023-03-02         2023-02-28           3.20         3.3
+        2022-12-08         2022-11-30           3.14         3.1
+</code>
+Example (using approach 2):
 <code>
 $ ipython
@@ Line 234: / Line 498: @@
         2022-12-08         2022-11-30           3.14         3.1
 </code>
 ==== convert all categorical columns to string columns ====