User Tools

Site Tools


pandas_dataframe

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
Last revisionBoth sides next revision
pandas_dataframe [2023/03/13 16:15] – [convert column names to lower case] adminpandas_dataframe [2023/09/07 21:44] – [split columns] raju
Line 59: Line 59:
   * https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea on zip from here.   * https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea on zip from here.
  
 +tags | row by row
 ==== Create a dataframe by splitting strings ==== ==== Create a dataframe by splitting strings ====
 Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns. Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns.
Line 114: Line 115:
  
 also demonstrates | assign column names to a dataframe also demonstrates | assign column names to a dataframe
 +
 +==== Create a dataframe from a series of lists ====
 +tags | convert series with lists to dataframe
 +<code>
 +df = pd.DataFrame(s.to_list())
 +</code>
 +For example
 +<code>
 +In [1]:
 +import pandas as pd
 +s = pd.Series([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
 +s
 +Out[1]:
 +0       [1, 2, 3]
 +1    [4, 5, 6, 7]
 +2          [8, 9]
 +dtype: object
 +
 +In [2]:
 +df = pd.DataFrame(s.to_list())
 +df
 +Out[2]:
 +    1    2    3
 +0  1  2  3.0  NaN
 +1  4  5  6.0  7.0
 +2  8  9  NaN  NaN
 +</code>
 +
 +If the number of elements in each list is same, np.vstack() can be used but otherwise it will not work. For example
 +<code>
 +In [5]:
 +s
 +Out[5]:
 +0       [1, 2, 3]
 +1    [4, 5, 6, 7]
 +2          [8, 9]
 +dtype: object
 +
 +In [6]:
 +import numpy as np
 +df = pd.DataFrame(np.vstack(s))
 +---------------------------------------------------------------------------
 +ValueError                                Traceback (most recent call last)
 +Cell In[6], line 2
 +      1 import numpy as np
 +----> 2 df = pd.DataFrame(np.vstack(s))
 +
 +File <__array_function__ internals>:200, in vstack(*args, **kwargs)
 +
 +File ~\AppData\Local\conda\conda\envs\py311\Lib\site-packages\numpy\core\shape_base.py:296, in vstack(tup, dtype, casting)
 +    294 if not isinstance(arrs, list):
 +    295     arrs = [arrs]
 +--> 296 return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)
 +
 +File <__array_function__ internals>:200, in concatenate(*args, **kwargs)
 +
 +ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 4
 +</code>
 +
 +But with
 +<code>
 +In [10]:
 +s = pd.Series([[1, 2, 3], [4, 5, 6]])
 +s
 +Out[10]:
 +0    [1, 2, 3]
 +1    [4, 5, 6]
 +dtype: object
 +
 +In [11]:
 +import numpy as np
 +df = pd.DataFrame(np.vstack(s))
 +df
 +Out[11]:
 +    1  2
 +0  1  2  3
 +1  4  5  6
 +
 +In [12]:
 +df = pd.DataFrame(s.to_list())
 +df
 +Out[12]:
 +    1  2
 +0  1  2  3
 +1  4  5  6
 +</code>
 +
 +See also:
 +  * https://stackoverflow.com/questions/45901018/convert-pandas-series-of-lists-to-dataframe
 +
 +==== Create a dataframe from a bunch of variables ====
 +<code>
 +import pandas as pd
 +df = pd.DataFrame({
 +  'key': ['var1', 'var2', 'var3'],
 +  'value': [var1, var2, var3]
 +})
 +</code>
 +
 +For example
 +<code>
 +$ ipython
 +
 +In [1]:
 +year = 2023; month = 6; date = 15
 +
 +In [2]:
 +import pandas as pd
 +df = pd.DataFrame({
 +  'key': ['year', 'month', 'date'],
 +  'value': [year, month, date]
 +})
 +
 +In [3]:
 +df
 +Out[3]:
 +     key  value
 +0   year   2023
 +1  month      6
 +2   date     15
 +
 +In [4]:
 +df.dtypes
 +Out[4]:
 +key      object
 +value     int64
 +dtype: object
 +</code>
 +
 +It works even if the variables are not of the same type.
 +<code>
 +In [5]:
 +year = 2023; month = 'June'; date = 15
 +
 +In [6]:
 +df = pd.DataFrame({
 +  'key': ['year', 'month', 'date'],
 +  'value': [year, month, date]
 +})
 +
 +In [7]:
 +df
 +Out[7]:
 +     key value
 +0   year  2023
 +1  month  June
 +2   date    15
 +
 +In [8]:
 +df.dtypes
 +Out[8]:
 +key      object
 +value    object
 +dtype: object
 +</code>
 +
 +Tested with Python 3.11.3, IPython 8.12.0
  
 ===== selection related ===== ===== selection related =====
Line 125: Line 283:
 tags | uses [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html pandas.Series.str.split] tags | uses [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html pandas.Series.str.split]
  
 +==== lookup value ====
 +To pick the first value in column 'A' for rows where column B is FOO
 +<code>
 +df.loc[df['B'] == 'FOO', 'A'].iloc[0]
 +</code>
  
 +Example:
 +<code>
 +$ ipython
 +In [1]:
 +import pandas as pd
 +df = pd.DataFrame({'A': ['p1', 'p2', 'p3', 'p4'], 'B': [1, 3, 3, 2]})
 +print(df)
 +    A  B
 +0  p1  1
 +1  p2  3
 +2  p3  3
 +3  p4  2
 +
 +In [2]:
 +df.loc[df['B'] == 3, 'A']
 +Out[2]:
 +1    p2
 +2    p3
 +Name: A, dtype: object
 +
 +In [3]:
 +df.loc[df['B'] == 3, 'A'].iloc[0]
 +Out[3]:
 +'p2'
 +</code>
 +
 +search tags | value of one column when another column equals something
 ===== Series related ===== ===== Series related =====
 ==== view the index, type, length and name of the series ==== ==== view the index, type, length and name of the series ====
Line 193: Line 383:
 [[ matches to coplays | Solution ]] [[ matches to coplays | Solution ]]
  
-====== difference between =====+===== difference between =====
   * [[astype vs. to_numeric]]   * [[astype vs. to_numeric]]
  
  
 ===== convert stuff ===== ===== convert stuff =====
 +
 +==== convert column names to lower case and replace space with underscore ====
 +<code>
 +df.columns = df.columns.str.lower().str.replace(' ', '_')
 +</code>
 +
 +Example:
 +<code>
 +$ ipython
 +Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
 +Type 'copyright', 'credits' or 'license' for more information
 +IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.
 +
 +In [1]:
 +import pandas as pd
 +df = pd.DataFrame(
 +    [['2023-03-02', '2023-02-28', 3.20, 3.30],
 +     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
 +    columns = ["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
 +df
 +Out[1]:
 +  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
 +0        2023-03-02         2023-02-28           3.20         3.3
 +1        2022-12-08         2022-11-30           3.14         3.1
 +
 +In [2]:
 +df.columns = df.columns.str.lower().str.replace(' ', '_')
 +
 +In [3]:
 +df
 +Out[3]:
 +  announcement_date fiscal_quarter_end  estimated_eps  actual_eps
 +0        2023-03-02         2023-02-28           3.20         3.3
 +1        2022-12-08         2022-11-30           3.14         3.1
 +</code>
  
 ==== convert column names to lower case ==== ==== convert column names to lower case ====
pandas_dataframe.txt · Last modified: 2023/09/07 21:45 by raju