IE6400_Day6_Lab7

html

School

Northeastern University *

*We aren’t endorsed by this school

Course

6400

Subject

Industrial Engineering

Date

Feb 20, 2024

Type

html

Pages

12

Uploaded by ColonelStraw13148

Report
IE6400 Foundations for Data Analytics Engineering Fall 2023 Module 1: Pandas Library - Part 2 - Lab 7 The Column Creation of Pandas DataFrame When working on the Data Analytics, one of the most important tasks is feature engineering: creating new features from the data. This lab shows different ways to create those new features from existing columns. Part I: Basic Operations Exercise 7.1 Creating a new column using the operators In [1]: import pandas as pd df = pd.read_csv('tax_dataset.csv') In [2]: # Displaying using Jupyter Notebook df # For non-Jupyter environments # print(df) Out[2]: ID birthdate firstname lastname gender bruto netto 0 1 20/8/2004 Anna Adams F 300 250 1 2 15/12/1990 Alex Grant NaN 3200 1900 2 3 13/1/1959 Lea Fox F 5050 2750 3 4 6/5/1978 John Adams M 4575 2540 4 5 29/7/2010 Sam Baker M 0 0 In [3]: # multiplication with a scalar df['netto_times_2'] = df['netto'] * 2 In [4]: df Out[4]: ID birthdate firstname lastname gender bruto netto netto_times_2 0 1 20/8/2004 Anna Adams F 300 250 500 1 2 15/12/1990 Alex Grant NaN 3200 1900 3800 2 3 13/1/1959 Lea Fox F 5050 2750 5500
ID birthdate firstname lastname gender bruto netto netto_times_2 3 4 6/5/1978 John Adams M 4575 2540 5080 4 5 29/7/2010 Sam Baker M 0 0 0 In [5]: # subtracting two columns df['tax'] = df['bruto'] - df['netto'] In [6]: df Out[6]: ID birthdate firstname lastname gender bruto netto netto_times_2 tax 0 1 20/8/2004 Anna Adams F 300 250 500 50 1 2 15/12/1990 Alex Grant NaN 3200 1900 3800 1300 2 3 13/1/1959 Lea Fox F 5050 2750 5500 2300 3 4 6/5/1978 John Adams M 4575 2540 5080 2035 4 5 29/7/2010 Sam Baker M 0 0 0 0 In [7]: # Customizing the text df['fullname'] = df['firstname'] + ' ' + df['lastname'] In [8]: df Out[8]: ID birthdate firstname lastname gender bruto netto netto_ times_ 2 tax fullname 0 1 20/8/2004 Anna Adams F 300 250 500 50 Anna Adams 1 2 15/12/199 0 Alex Grant NaN 3200 1900 3800 1300 Alex Grant 2 3 13/1/1959 Lea Fox F 5050 2750 5500 2300 Lea Fox 3 4 6/5/1978 John Adams M 4575 2540 5080 2035 John Adams 4 5 29/7/2010 Sam Baker M 0 0 0 0 Sam
ID birthdate firstname lastname gender bruto netto netto_ times_ 2 tax fullname Baker Exercise 7.2 Extracting information from a date column Dates can contain valuable information. In our example dataframe, we can calculate the age of a person or extract the year of birth. Another option is to calculate the days since a date. In [9]: import pandas as pd # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Displaying using Jupyter Notebook df Out[9]: ID birthdate firstname lastname gender bruto netto 0 1 20/8/2004 Anna Adams F 300 250 1 2 15/12/1990 Alex Grant NaN 3200 1900 2 3 13/1/1959 Lea Fox F 5050 2750 3 4 6/5/1978 John Adams M 4575 2540 4 5 29/7/2010 Sam Baker M 0 0 In [10]: # Querying only birthdate field df = df.loc[:,['birthdate']] # Slicing only birthdate column # converting the dtype df['birthdate'] = pd.to_datetime(df['birthdate'], dayfirst=True) # DD/MM/YYYY df Out[10]: birthdate 0 2004-08-20 1 1990-12-15 2 1959-01-13 3 1978-05-06
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
birthdate 4 2010-07-29 In [11]: # extracting the year from a date df['year_of_birth'] = pd.DatetimeIndex(df['birthdate']).year In [12]: # Displaying using Jupyter Notebook df Out[12]: birthdate year_of_birth 0 2004-08-20 2004 1 1990-12-15 1990 2 1959-01-13 1959 3 1978-05-06 1978 4 2010-07-29 2010 In [13]: # calculating the age of a person (Option 1) now = pd.Timestamp.now() df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int') M8[Y] changes the date to be the year. In [14]: # Displaying using Jupyter Notebook df Out[14]: birthdate year_of_birth age 0 2004-08-20 2004 19 1 1990-12-15 1990 32 2 1959-01-13 1959 64 3 1978-05-06 1978 45 4 2010-07-29 2010 13 In [15]: # calculating the age of a person (Option 2) now = pd.Timestamp.now() df['today'] = now.year df['age_option2'] = (df['today'] - df['year_of_birth']).astype(int)
In [16]: # Displaying using Jupyter Notebook df Out[16]: birthdate year_of_birth age today age_option2 0 2004-08-20 2004 19 2023 19 1 1990-12-15 1990 32 2023 33 2 1959-01-13 1959 64 2023 64 3 1978-05-06 1978 45 2023 45 4 2010-07-29 2010 13 2023 13 In [17]: # calculating days since date now = pd.Timestamp.now() df['days_since_birth'] = df.apply(lambda row: (now - row['birthdate']).days, axis=1) In [18]: # Displaying using Jupyter Notebook df[['birthdate', 'year_of_birth', 'age', 'days_since_birth']].head() Out[18]: birthdate year_of_birth age days_since_birth 0 2004-08-20 2004 19 6973 1 1990-12-15 1990 32 11970 2 1959-01-13 1959 64 23629 3 1978-05-06 1978 45 16576 4 2010-07-29 2010 13 4804 Part II: Conditions and Functions Exercise 7.3 Conditional column creation In [19]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Calculating age field df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True) now = pd.Timestamp.now() df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int')
# Slicing only age and gender df = df.loc[:,['age','gender']] # Displaying using Jupyter Notebook df Out[19]: age gender 0 19 F 1 32 NaN 2 64 F 3 45 M 4 13 M In [20]: df['child'] = np.where(df['age'] < 18, 1, 0) In [21]: # Displaying using Jupyter Notebook df Out[21]: age gender child 0 19 F 0 1 32 NaN 0 2 64 F 0 3 45 M 0 4 13 M 1 In [22]: df['male'] = np.where(df['gender'] == 'M', 1, 0) In [23]: # Displaying using Jupyter Notebook df Out[23]: age gender child male 0 19 F 0 0 1 32 NaN 0 0
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
age gender child male 2 64 F 0 0 3 45 M 0 1 4 13 M 1 1 Exercise 7.4 Apply an existing function to a column If you want to use an existing function and apply this function to a column, df.apply is option. For example, you want to transform a numerical column using the np.log1p function. The log1p( ) method returns log(1+number) In [24]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Slicing only bruto field df = df.loc[:,['bruto']] # Displaying using Jupyter Notebook df Out[24]: bruto 0 300 1 3200 2 5050 3 4575 4 0 In [25]: # applying an existing function to a column df['log1p_bruto'] = df['bruto'].apply(np.log1p) In [26]: df Out[26]: bruto log1p_bruto 0 300 5.707110
bruto log1p_bruto 1 3200 8.071219 2 5050 8.527342 3 4575 8.428581 4 0 0.000000 Exercise 7.5 Apply and lambda combined In the exercise 4.1, we subtracted the values of the bruto and netto columns. It is easy to use basic operators, but you can also use apply combined with a lambda function: In [27]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Slicing only bruto and netto fields df = df.loc[:,['bruto', 'netto']] # Displaying using Jupyter Notebook df Out[27]: bruto netto 0 300 250 1 3200 1900 2 5050 2750 3 4575 2540 4 0 0 In [28]: df['tax'] = df.apply(lambda row: row.bruto - row.netto, axis=1) In [29]: df Out[29]: bruto netto tax 0 300 250 50
bruto netto tax 1 3200 1900 1300 2 5050 2750 2300 3 4575 2540 2035 4 0 0 0 Exercise 7.6 Create a custom function (and apply) Sometimes you have multiple conditions and you want to apply a function to multiple columns at the same time. It can be done by using a custom made function, and applying this function to your dataframe. Let’s create age groups in our dataframe. In [30]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Calculating age field df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True) now = pd.Timestamp.now() df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int') # Slicing only age field df = df.loc[:,['age']] # Displaying using Jupyter Notebook df Out[30]: age 0 19 1 32 2 64 3 45 4 13 In [31]: # create a function to define age groups def age_groups(row): if row['age'] < 18: return 0
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
elif row['age'] >= 18 and row['age'] < 30: return 1 elif row['age'] >= 30 and row['age'] < 60: return 2 else: return 3 In [32]: # apply to dataframe, use axis=1 to apply the function to every row df['age_groups'] = df.apply(age_groups, axis=1) In [33]: df Out[33]: age age_groups 0 19 1 1 32 2 2 64 3 3 45 2 4 13 0 You can easily use multiple columns and multiple conditions with this way of column creation. If you want to use age and bruto income to interpret salaries: In [34]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Calculating age field df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True) now = pd.Timestamp.now() df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int') # Slicing only age and bruto fields df = df.loc[:,['age','bruto']] # Displaying using Jupyter Notebook df Out[34]: age bruto 0 19 300 1 32 3200
age bruto 2 64 5050 3 45 4575 4 13 0 In [35]: def age_salary(row): if row['bruto'] / row['age'] > 100: return 'high salary' elif row['bruto'] / row['age'] <= 100 and row['bruto'] / row['age'] > 50: return 'medium salary' elif row['bruto'] / row['age'] < 50 and row['bruto'] / row['age'] > 0: return 'low salary' else: return 'no salary' In [36]: df['salary_age_relation'] = df.apply(age_salary, axis=1) In [37]: df.head() Out[37]: age bruto salary_age_relation 0 19 300 low salary 1 32 3200 medium salary 2 64 5050 medium salary 3 45 4575 high salary 4 13 0 no salary Exercise 7.7 Multiple conditions (vectorized solution) The solution in the previous exercise works, but might not be the best. If you are looking for a more efficient solution (e.g. if you deal with a large dataset), you can specify your conditions in a list and use np.select In [38]: import pandas as pd import numpy as np # Load CSV to DF df = pd.read_csv('tax_dataset.csv') # Calculating age field df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True) now = pd.Timestamp.now() df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int')
# Slicing only age and bruto fields df = df.loc[:,['age','bruto']] # Displaying using Jupyter Notebook df Out[38]: age bruto 0 19 300 1 32 3200 2 64 5050 3 45 4575 4 13 0 In [39]: conditions = [df['bruto'] / df['age'] > 100, (df['bruto'] / df['age'] <= 100) & (df['bruto'] / df['age'] > 50), (df['bruto'] / df['age'] < 50) & (df['bruto'] / df['age'] > 0)] In [40]: outputs = ['high salary', 'medium salary', 'low salary'] In [41]: df['salary_age_relation'] = np.select(conditions, outputs, 'no salary') In [42]: df Out[42]: age bruto salary_age_relation 0 19 300 low salary 1 32 3200 medium salary 2 64 5050 medium salary 3 45 4575 high salary 4 13 0 no salary This gives the same results as the previous exercise, but with better performance. Revised Date: September 8th, 2023 In [ ]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help