Differences

This shows you the differences between two versions of the page.

--- pandas_dataframe [2023/03/13 15:59] – [convert column names to lower case] admin
+++ pandas_dataframe [2023/07/21 22:16] – [Create a dataframe from list of lists] raju
@@ Line 59: / Line 59: @@
   * https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea on zip from here.
+tags | row by row
 ==== Create a dataframe by splitting strings ====
 Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns.
@@ Line 114: / Line 115: @@
 also demonstrates | assign column names to a dataframe
+==== Create a dataframe from a series of lists ====
+tags | convert series with lists to dataframe
+<code>
+df = pd.DataFrame(s.to_list())
+</code>
+For example
+<code>
+In [1]:
+import pandas as pd
+s = pd.Series([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
+s
+Out[1]:
+       [1, 2, 3]
+    [4, 5, 6, 7]
+          [8, 9]
+dtype: object
+In [2]:
+df = pd.DataFrame(s.to_list())
+df
+Out[2]:
+  1    2    3
+  1  2  3.0  NaN
+  4  5  6.0  7.0
+  8  9  NaN  NaN
+</code>
+If the number of elements in each list is same, np.vstack() can be used but otherwise it will not work. For example
+<code>
+In [5]:
+s
+Out[5]:
+       [1, 2, 3]
+    [4, 5, 6, 7]
+          [8, 9]
+dtype: object
+In [6]:
+import numpy as np
+df = pd.DataFrame(np.vstack(s))
+---------------------------------------------------------------------------
+ValueError                                Traceback (most recent call last)
+Cell In[6], line 2
+import numpy as np
+----> 2 df = pd.DataFrame(np.vstack(s))
+File <__array_function__ internals>:200, in vstack(*args, **kwargs)
+File ~\AppData\Local\conda\conda\envs\py311\Lib\site-packages\numpy\core\shape_base.py:296, in vstack(tup, dtype, casting)
+if not isinstance(arrs, list):
+     arrs = [arrs]
+--> 296 return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)
+File <__array_function__ internals>:200, in concatenate(*args, **kwargs)
+ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 4
+</code>
+But with
+<code>
+In [10]:
+s = pd.Series([[1, 2, 3], [4, 5, 6]])
+s
+Out[10]:
+    [1, 2, 3]
+    [4, 5, 6]
+dtype: object
+In [11]:
+import numpy as np
+df = pd.DataFrame(np.vstack(s))
+df
+Out[11]:
+  1  2
+  1  2  3
+  4  5  6
+In [12]:
+df = pd.DataFrame(s.to_list())
+df
+Out[12]:
+  1  2
+  1  2  3
+  4  5  6
+</code>
+See also:
+  * https://stackoverflow.com/questions/45901018/convert-pandas-series-of-lists-to-dataframe
+==== Create a dataframe from a bunch of variables ====
+<code>
+import pandas as pd
+df = pd.DataFrame({
+  'key': ['var1', 'var2', 'var3'],
+  'value': [var1, var2, var3]
+})
+</code>
+For example
+<code>
+$ ipython
+In [1]:
+year = 2023; month = 6; date = 15
+In [2]:
+import pandas as pd
+df = pd.DataFrame({
+  'key': ['year', 'month', 'date'],
+  'value': [year, month, date]
+})
+In [3]:
+df
+Out[3]:
+     key  value
+   year   2023
+  month      6
+   date     15
+In [4]:
+df.dtypes
+Out[4]:
+key      object
+value     int64
+dtype: object
+</code>
+It works even if the variables are not of the same type.
+<code>
+In [5]:
+year = 2023; month = 'June'; date = 15
+In [6]:
+df = pd.DataFrame({
+  'key': ['year', 'month', 'date'],
+  'value': [year, month, date]
+})
+In [7]:
+df
+Out[7]:
+     key value
+   year  2023
+  month  June
+   date    15
+In [8]:
+df.dtypes
+Out[8]:
+key      object
+value    object
+dtype: object
+</code>
+Tested with Python 3.11.3, IPython 8.12.0
 ===== selection related =====
@@ Line 193: / Line 351: @@
 [[ matches to coplays | Solution ]]
-====== difference between =====
+===== difference between =====
   * [[astype vs. to_numeric]]
@@ Line 199: / Line 357: @@
 ===== convert stuff =====
-==== convert column names to lower case ====
+==== convert column names to lower case and replace space with underscore ====
-Approach 1:
 <code>
-df.columns = df.columns.str.lower()
+df.columns = df.columns.str.lower().str.replace(' ', '_')
 </code>
@@ Line 225: / Line 382: @@
 In [2]:
-df.columns = df.columns.str.lower()
+df.columns = df.columns.str.lower().str.replace(' ', '_')
 In [3]:
 df
 Out[3]:
-  announcement date fiscal quarter end  estimated eps  actual eps
+  announcement_date fiscal_quarter_end  estimated_eps  actual_eps
         2023-03-02         2023-02-28           3.20         3.3
         2022-12-08         2022-11-30           3.14         3.1
+</code>
+==== convert column names to lower case ====
+Approach 1:
+<code>
+df.columns = df.columns.str.lower()
 </code>
@@ Line 240: / Line 403: @@
 </code>
-Example:
+Notes:
+  * I prefer approach1 - simple syntax, easy to remember.
+Use case: While merging data from two data frames using DataFrame.merge(), I ended up with two columns with same name but differing in case (ex: foo from df1, FOO from df2). This caused problems when I tried to upload data into a hadoop cluster since hive is not case sensitive. As a work around, I converted the column names in df2 to lower case and then merged using pd.merge(df1, df2, ..., suffixes = ('_df1', '_df2')). The resulting data frame will now have foo_df1, foo_df2 columns.
+Example (using approach 1):
+<code>
+$ ipython
+Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
+Type 'copyright', 'credits' or 'license' for more information
+IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.
+In [1]:
+import pandas as pd
+df = pd.DataFrame(
+    [['2023-03-02', '2023-02-28', 3.20, 3.30],
+     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
+    columns = ["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
+df
+Out[1]:
+  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
+        2023-03-02         2023-02-28           3.20         3.3
+        2022-12-08         2022-11-30           3.14         3.1
+In [2]:
+df.columns = df.columns.str.lower()
+In [3]:
+df
+Out[3]:
+  announcement date fiscal quarter end  estimated eps  actual eps
+        2023-03-02         2023-02-28           3.20         3.3
+        2022-12-08         2022-11-30           3.14         3.1
+</code>
+Example (using approach 2):
 <code>
 $ ipython
@@ Line 269: / Line 467: @@
 </code>
-Use case: While merging data from two data frames using DataFrame.merge(), I ended up with two columns with same name but differing in case (ex: foo from df1, FOO from df2). This caused problems when I tried to upload data into a hadoop cluster since hive is not case sensitive. As a work around, I converted the column names in df2 to lower case and then merged using pd.merge(df1, df2, ..., suffixes = ('_df1', '_df2')). The resulting data frame will now have foo_df1, foo_df2 columns.
 ==== convert all categorical columns to string columns ====