Pandas Feature Engineering: Creating New DataFrame Columns

IE6400 Foundations for Data Analytics Engineering ¶ Fall 2023 ¶ Module 1: Pandas Library - Part 2 - Lab 7 ¶ The Column Creation of Pandas DataFrame ¶ When working on the Data Analytics, one of the most important tasks is feature engineering: creating new features from the data. This lab shows different ways to create those new features from existing columns. Part I: Basic Operations ¶ Exercise 7.1 Creating a new column using the operators ¶ In [1]: import pandas as pd df = pd.read_csv('tax_dataset.csv') In [2]: # Displaying using Jupyter Notebook df # For non-Jupyter environments # print(df) Out[2]: ID birthdate firstname lastname gender bruto netto 0 1 20/8/2004 Anna Adams F 300 250 1 2 15/12/1990 Alex Grant NaN 3200 1900 2 3 13/1/1959 Lea Fox F 5050 2750 3 4 6/5/1978 John Adams M 4575 2540 4 5 29/7/2010 Sam Baker M 0 0 In [3]: # multiplication with a scalar df['netto_times_2'] = df['netto'] * 2 In [4]: df Out[4]: ID birthdate firstname lastname gender bruto netto netto_times_2 0 1 20/8/2004 Anna Adams F 300 250 500 1 2 15/12/1990 Alex Grant NaN 3200 1900 3800 2 3 13/1/1959 Lea Fox F 5050 2750 5500

ID birthdate firstname lastname gender bruto netto netto_times_2 3 4 6/5/1978 John Adams M 4575 2540 5080 4 5 29/7/2010 Sam Baker M 0 0 0 In [5]: # subtracting two columns df['tax'] = df['bruto'] - df['netto'] In [6]: df Out[6]: ID birthdate firstname lastname gender bruto netto netto_times_2 tax 0 1 20/8/2004 Anna Adams F 300 250 500 50 1 2 15/12/1990 Alex Grant NaN 3200 1900 3800 1300 2 3 13/1/1959 Lea Fox F 5050 2750 5500 2300 3 4 6/5/1978 John Adams M 4575 2540 5080 2035 4 5 29/7/2010 Sam Baker M 0 0 0 0 In [7]: # Customizing the text df['fullname'] = df['firstname'] + ' ' + df['lastname'] In [8]: df Out[8]: ID birthdate firstname lastname gender bruto netto netto_ times_ 2 tax fullname 0 1 20/8/2004 Anna Adams F 300 250 500 50 Anna Adams 1 2 15/12/199 0 Alex Grant NaN 3200 1900 3800 1300 Alex Grant 2 3 13/1/1959 Lea Fox F 5050 2750 5500 2300 Lea Fox 3 4 6/5/1978 John Adams M 4575 2540 5080 2035 John Adams 4 5 29/7/2010 Sam Baker M 0 0 0 0 Sam

ID birthdate firstname lastname gender bruto netto netto_ times_ 2 tax fullname Baker Exercise 7.2 Extracting information from a date column ¶ Dates can contain valuable information. In our example dataframe, we can calculate the age of a person or extract the year of birth. Another option is to calculate the days since a date. In [9]: import pandas as pd # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Displaying using Jupyter Notebook df Out[9]: ID birthdate firstname lastname gender bruto netto 0 1 20/8/2004 Anna Adams F 300 250 1 2 15/12/1990 Alex Grant NaN 3200 1900 2 3 13/1/1959 Lea Fox F 5050 2750 3 4 6/5/1978 John Adams M 4575 2540 4 5 29/7/2010 Sam Baker M 0 0 In [10]: # Querying only birthdate field df = df.loc[:,['birthdate']] # Slicing only birthdate column # converting the dtype df['birthdate'] = pd.to_datetime(df['birthdate'], dayfirst=True) # DD/MM/YYYY df Out[10]: birthdate 0 2004-08-20 1 1990-12-15 2 1959-01-13 3 1978-05-06

Your preview ends here

Eager to read complete document? Join bartleby learn and gain access to the full version

Access to all documents
Unlimited textbook solutions
24/7 expert homework help

birthdate 4 2010-07-29 In [11]: # extracting the year from a date df['year_of_birth'] = pd.DatetimeIndex(df['birthdate']).year In [12]: # Displaying using Jupyter Notebook df Out[12]: birthdate year_of_birth 0 2004-08-20 2004 1 1990-12-15 1990 2 1959-01-13 1959 3 1978-05-06 1978 4 2010-07-29 2010 In [13]: # calculating the age of a person (Option 1) now = pd.Timestamp.now() df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int') M8[Y] changes the date to be the year. In [14]: # Displaying using Jupyter Notebook df Out[14]: birthdate year_of_birth age 0 2004-08-20 2004 19 1 1990-12-15 1990 32 2 1959-01-13 1959 64 3 1978-05-06 1978 45 4 2010-07-29 2010 13 In [15]: # calculating the age of a person (Option 2) now = pd.Timestamp.now() df['today'] = now.year df['age_option2'] = (df['today'] - df['year_of_birth']).astype(int)

In [16]: # Displaying using Jupyter Notebook df Out[16]: birthdate year_of_birth age today age_option2 0 2004-08-20 2004 19 2023 19 1 1990-12-15 1990 32 2023 33 2 1959-01-13 1959 64 2023 64 3 1978-05-06 1978 45 2023 45 4 2010-07-29 2010 13 2023 13 In [17]: # calculating days since date now = pd.Timestamp.now() df['days_since_birth'] = df.apply(lambda row: (now - row['birthdate']).days, axis=1) In [18]: # Displaying using Jupyter Notebook df[['birthdate', 'year_of_birth', 'age', 'days_since_birth']].head() Out[18]: birthdate year_of_birth age days_since_birth 0 2004-08-20 2004 19 6973 1 1990-12-15 1990 32 11970 2 1959-01-13 1959 64 23629 3 1978-05-06 1978 45 16576 4 2010-07-29 2010 13 4804 Part II: Conditions and Functions ¶ Exercise 7.3 Conditional column creation ¶ In [19]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Calculating age field df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True) now = pd.Timestamp.now() df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int')

# Slicing only age and gender df = df.loc[:,['age','gender']] # Displaying using Jupyter Notebook df Out[19]: age gender 0 19 F 1 32 NaN 2 64 F 3 45 M 4 13 M In [20]: df['child'] = np.where(df['age'] < 18, 1, 0) In [21]: # Displaying using Jupyter Notebook df Out[21]: age gender child 0 19 F 0 1 32 NaN 0 2 64 F 0 3 45 M 0 4 13 M 1 In [22]: df['male'] = np.where(df['gender'] == 'M', 1, 0) In [23]: # Displaying using Jupyter Notebook df Out[23]: age gender child male 0 19 F 0 0 1 32 NaN 0 0

Your preview ends here

Eager to read complete document? Join bartleby learn and gain access to the full version

Access to all documents
Unlimited textbook solutions
24/7 expert homework help

age gender child male 2 64 F 0 0 3 45 M 0 1 4 13 M 1 1 Exercise 7.4 Apply an existing function to a column ¶ If you want to use an existing function and apply this function to a column, df.apply is option. For example, you want to transform a numerical column using the np.log1p function. The log1p( ) method returns log(1+number) In [24]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Slicing only bruto field df = df.loc[:,['bruto']] # Displaying using Jupyter Notebook df Out[24]: bruto 0 300 1 3200 2 5050 3 4575 4 0 In [25]: # applying an existing function to a column df['log1p_bruto'] = df['bruto'].apply(np.log1p) In [26]: df Out[26]: bruto log1p_bruto 0 300 5.707110

bruto log1p_bruto 1 3200 8.071219 2 5050 8.527342 3 4575 8.428581 4 0 0.000000 Exercise 7.5 Apply and lambda combined ¶ In the exercise 4.1, we subtracted the values of the bruto and netto columns. It is easy to use basic operators, but you can also use apply combined with a lambda function: In [27]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Slicing only bruto and netto fields df = df.loc[:,['bruto', 'netto']] # Displaying using Jupyter Notebook df Out[27]: bruto netto 0 300 250 1 3200 1900 2 5050 2750 3 4575 2540 4 0 0 In [28]: df['tax'] = df.apply(lambda row: row.bruto - row.netto, axis=1) In [29]: df Out[29]: bruto netto tax 0 300 250 50

bruto netto tax 1 3200 1900 1300 2 5050 2750 2300 3 4575 2540 2035 4 0 0 0 Exercise 7.6 Create a custom function (and apply) ¶ Sometimes you have multiple conditions and you want to apply a function to multiple columns at the same time. It can be done by using a custom made function, and applying this function to your dataframe. Let’s create age groups in our dataframe. In [30]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Calculating age field df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True) now = pd.Timestamp.now() df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int') # Slicing only age field df = df.loc[:,['age']] # Displaying using Jupyter Notebook df Out[30]: age 0 19 1 32 2 64 3 45 4 13 In [31]: # create a function to define age groups def age_groups(row): if row['age'] < 18: return 0

Your preview ends here

Eager to read complete document? Join bartleby learn and gain access to the full version

Access to all documents
Unlimited textbook solutions
24/7 expert homework help

elif row['age'] >= 18 and row['age'] < 30: return 1 elif row['age'] >= 30 and row['age'] < 60: return 2 else: return 3 In [32]: # apply to dataframe, use axis=1 to apply the function to every row df['age_groups'] = df.apply(age_groups, axis=1) In [33]: df Out[33]: age age_groups 0 19 1 1 32 2 2 64 3 3 45 2 4 13 0 You can easily use multiple columns and multiple conditions with this way of column creation. If you want to use age and bruto income to interpret salaries: In [34]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Calculating age field df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True) now = pd.Timestamp.now() df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int') # Slicing only age and bruto fields df = df.loc[:,['age','bruto']] # Displaying using Jupyter Notebook df Out[34]: age bruto 0 19 300 1 32 3200

age bruto 2 64 5050 3 45 4575 4 13 0 In [35]: def age_salary(row): if row['bruto'] / row['age'] > 100: return 'high salary' elif row['bruto'] / row['age'] <= 100 and row['bruto'] / row['age'] > 50: return 'medium salary' elif row['bruto'] / row['age'] < 50 and row['bruto'] / row['age'] > 0: return 'low salary' else: return 'no salary' In [36]: df['salary_age_relation'] = df.apply(age_salary, axis=1) In [37]: df.head() Out[37]: age bruto salary_age_relation 0 19 300 low salary 1 32 3200 medium salary 2 64 5050 medium salary 3 45 4575 high salary 4 13 0 no salary Exercise 7.7 Multiple conditions (vectorized solution) ¶ The solution in the previous exercise works, but might not be the best. If you are looking for a more efficient solution (e.g. if you deal with a large dataset), you can specify your conditions in a list and use np.select In [38]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Calculating age field df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True) now = pd.Timestamp.now() df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int')

# Slicing only age and bruto fields df = df.loc[:,['age','bruto']] # Displaying using Jupyter Notebook df Out[38]: age bruto 0 19 300 1 32 3200 2 64 5050 3 45 4575 4 13 0 In [39]: conditions = [df['bruto'] / df['age'] > 100, (df['bruto'] / df['age'] <= 100) & (df['bruto'] / df['age'] > 50), (df['bruto'] / df['age'] < 50) & (df['bruto'] / df['age'] > 0)] In [40]: outputs = ['high salary', 'medium salary', 'low salary'] In [41]: df['salary_age_relation'] = np.select(conditions, outputs, 'no salary') In [42]: df Out[42]: age bruto salary_age_relation 0 19 300 low salary 1 32 3200 medium salary 2 64 5050 medium salary 3 45 4575 high salary 4 13 0 no salary This gives the same results as the previous exercise, but with better performance. Revised Date: September 8th, 2023 In [ ]:

Your preview ends here

Eager to read complete document? Join bartleby learn and gain access to the full version

Access to all documents
Unlimited textbook solutions
24/7 expert homework help

IE6400_Day6_Lab7

Related Documents