IE6400_Day6_Lab7
html
keyboard_arrow_up
School
Northeastern University *
*We aren’t endorsed by this school
Course
6400
Subject
Industrial Engineering
Date
Feb 20, 2024
Type
html
Pages
12
Uploaded by ColonelStraw13148
IE6400 Foundations for Data Analytics
Engineering
¶
Fall 2023
¶
Module 1: Pandas Library - Part 2 - Lab 7
¶
The Column Creation of Pandas DataFrame
¶
When working on the Data Analytics, one of the most important tasks is feature engineering: creating new features from the data. This lab shows different ways to create those new features from existing columns.
Part I: Basic Operations
¶
Exercise 7.1 Creating a new column using the operators
¶
In [1]:
import pandas as pd
df = pd.read_csv('tax_dataset.csv')
In [2]:
# Displaying using Jupyter Notebook
df
# For non-Jupyter environments # print(df) Out[2]:
ID birthdate
firstname lastname gender bruto netto
0
1
20/8/2004
Anna
Adams
F
300
250
1
2
15/12/1990 Alex
Grant
NaN
3200
1900
2
3
13/1/1959
Lea
Fox
F
5050
2750
3
4
6/5/1978
John
Adams
M
4575
2540
4
5
29/7/2010
Sam
Baker
M
0
0
In [3]:
# multiplication with a scalar
df['netto_times_2'] = df['netto'] * 2
In [4]:
df
Out[4]:
ID birthdate
firstname lastname gender bruto netto netto_times_2
0
1
20/8/2004
Anna
Adams
F
300
250
500
1
2
15/12/1990 Alex
Grant
NaN
3200
1900
3800
2
3
13/1/1959
Lea
Fox
F
5050
2750
5500
ID birthdate
firstname lastname gender bruto netto netto_times_2
3
4
6/5/1978
John
Adams
M
4575
2540
5080
4
5
29/7/2010
Sam
Baker
M
0
0
0
In [5]:
# subtracting two columns
df['tax'] = df['bruto'] - df['netto']
In [6]:
df
Out[6]:
ID birthdate
firstname lastname gender bruto netto netto_times_2
tax
0
1
20/8/2004
Anna
Adams
F
300
250
500
50
1
2
15/12/1990 Alex
Grant
NaN
3200
1900
3800
1300
2
3
13/1/1959
Lea
Fox
F
5050
2750
5500
2300
3
4
6/5/1978
John
Adams
M
4575
2540
5080
2035
4
5
29/7/2010
Sam
Baker
M
0
0
0
0
In [7]:
# Customizing the text
df['fullname'] = df['firstname'] + ' ' + df['lastname']
In [8]:
df
Out[8]:
ID birthdate firstname lastname gender bruto netto
netto_
times_
2
tax
fullname
0
1
20/8/2004 Anna
Adams
F
300
250
500
50
Anna Adams
1
2
15/12/199
0
Alex
Grant
NaN
3200
1900
3800
1300
Alex Grant
2
3
13/1/1959
Lea
Fox
F
5050
2750
5500
2300
Lea Fox
3
4
6/5/1978
John
Adams
M
4575
2540
5080
2035
John Adams
4
5
29/7/2010 Sam
Baker
M
0
0
0
0
Sam
ID birthdate firstname lastname gender bruto netto
netto_
times_
2
tax
fullname
Baker
Exercise 7.2 Extracting information from a date column
¶
Dates can contain valuable information. In our example dataframe, we can calculate the age of a person or extract the year of birth. Another option is to calculate the days since a date.
In [9]:
import pandas as pd
# Load CSV to DF
df = pd.read_csv('tax_dataset.csv')
# Displaying using Jupyter Notebook
df
Out[9]:
ID birthdate
firstname lastname gender bruto netto
0
1
20/8/2004
Anna
Adams
F
300
250
1
2
15/12/1990 Alex
Grant
NaN
3200
1900
2
3
13/1/1959
Lea
Fox
F
5050
2750
3
4
6/5/1978
John
Adams
M
4575
2540
4
5
29/7/2010
Sam
Baker
M
0
0
In [10]:
# Querying only birthdate field
df = df.loc[:,['birthdate']] # Slicing only birthdate column
# converting the dtype
df['birthdate'] = pd.to_datetime(df['birthdate'], dayfirst=True) # DD/MM/YYYY
df
Out[10]:
birthdate
0
2004-08-20
1
1990-12-15
2
1959-01-13
3
1978-05-06
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
birthdate
4
2010-07-29
In [11]:
# extracting the year from a date
df['year_of_birth'] = pd.DatetimeIndex(df['birthdate']).year
In [12]:
# Displaying using Jupyter Notebook
df
Out[12]:
birthdate
year_of_birth
0
2004-08-20 2004
1
1990-12-15 1990
2
1959-01-13 1959
3
1978-05-06 1978
4
2010-07-29 2010
In [13]:
# calculating the age of a person (Option 1)
now = pd.Timestamp.now()
df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int')
M8[Y] changes the date to be the year.
In [14]:
# Displaying using Jupyter Notebook
df
Out[14]:
birthdate
year_of_birth age
0
2004-08-20 2004
19
1
1990-12-15 1990
32
2
1959-01-13 1959
64
3
1978-05-06 1978
45
4
2010-07-29 2010
13
In [15]:
# calculating the age of a person (Option 2)
now = pd.Timestamp.now()
df['today'] = now.year
df['age_option2'] = (df['today'] - df['year_of_birth']).astype(int)
In [16]:
# Displaying using Jupyter Notebook
df
Out[16]:
birthdate
year_of_birth age today age_option2
0
2004-08-20 2004
19
2023
19
1
1990-12-15 1990
32
2023
33
2
1959-01-13 1959
64
2023
64
3
1978-05-06 1978
45
2023
45
4
2010-07-29 2010
13
2023
13
In [17]:
# calculating days since date
now = pd.Timestamp.now()
df['days_since_birth'] = df.apply(lambda row: (now - row['birthdate']).days, axis=1)
In [18]:
# Displaying using Jupyter Notebook
df[['birthdate', 'year_of_birth', 'age', 'days_since_birth']].head()
Out[18]:
birthdate
year_of_birth age days_since_birth
0
2004-08-20 2004
19
6973
1
1990-12-15 1990
32
11970
2
1959-01-13 1959
64
23629
3
1978-05-06 1978
45
16576
4
2010-07-29 2010
13
4804
Part II: Conditions and Functions
¶
Exercise 7.3 Conditional column creation
¶
In [19]:
import pandas as pd
import numpy as np
# Load CSV to DF
df = pd.read_csv('tax_dataset.csv')
# Calculating age field
df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True)
now = pd.Timestamp.now()
df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int')
# Slicing only age and gender
df = df.loc[:,['age','gender']]
# Displaying using Jupyter Notebook
df
Out[19]:
age gender
0
19
F
1
32
NaN
2
64
F
3
45
M
4
13
M
In [20]:
df['child'] = np.where(df['age'] < 18, 1, 0)
In [21]:
# Displaying using Jupyter Notebook
df
Out[21]:
age gender child
0
19
F
0
1
32
NaN
0
2
64
F
0
3
45
M
0
4
13
M
1
In [22]:
df['male'] = np.where(df['gender'] == 'M', 1, 0)
In [23]:
# Displaying using Jupyter Notebook
df
Out[23]:
age gender child male
0
19
F
0
0
1
32
NaN
0
0
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
age gender child male
2
64
F
0
0
3
45
M
0
1
4
13
M
1
1
Exercise 7.4 Apply an existing function to a column
¶
If you want to use an existing function and apply this function to a column, df.apply is option. For example, you want to transform a numerical column using the np.log1p function.
The log1p( ) method returns log(1+number)
In [24]:
import pandas as pd
import numpy as np
# Load CSV to DF
df = pd.read_csv('tax_dataset.csv')
# Slicing only bruto field
df = df.loc[:,['bruto']]
# Displaying using Jupyter Notebook
df
Out[24]:
bruto
0
300
1
3200
2
5050
3
4575
4
0
In [25]:
# applying an existing function to a column
df['log1p_bruto'] = df['bruto'].apply(np.log1p)
In [26]:
df
Out[26]:
bruto log1p_bruto
0
300
5.707110
bruto log1p_bruto
1
3200
8.071219
2
5050
8.527342
3
4575
8.428581
4
0
0.000000
Exercise 7.5 Apply and lambda combined
¶
In the exercise 4.1, we subtracted the values of the bruto and netto columns. It is easy
to use basic operators, but you can also use apply combined with a lambda function:
In [27]:
import pandas as pd
import numpy as np
# Load CSV to DF
df = pd.read_csv('tax_dataset.csv')
# Slicing only bruto and netto fields
df = df.loc[:,['bruto', 'netto']]
# Displaying using Jupyter Notebook
df
Out[27]:
bruto netto
0
300
250
1
3200
1900
2
5050
2750
3
4575
2540
4
0
0
In [28]:
df['tax'] = df.apply(lambda row: row.bruto - row.netto, axis=1)
In [29]:
df
Out[29]:
bruto netto
tax
0
300
250
50
bruto netto
tax
1
3200
1900
1300
2
5050
2750
2300
3
4575
2540
2035
4
0
0
0
Exercise 7.6 Create a custom function (and apply)
¶
Sometimes you have multiple conditions and you want to apply a function to multiple columns at the same time. It can be done by using a custom made function, and applying this function to your dataframe. Let’s create age groups in our dataframe.
In [30]:
import pandas as pd
import numpy as np
# Load CSV to DF
df = pd.read_csv('tax_dataset.csv')
# Calculating age field
df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True)
now = pd.Timestamp.now()
df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int')
# Slicing only age field
df = df.loc[:,['age']]
# Displaying using Jupyter Notebook
df
Out[30]:
age
0
19
1
32
2
64
3
45
4
13
In [31]:
# create a function to define age groups
def age_groups(row):
if row['age'] < 18:
return 0
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
elif row['age'] >= 18 and row['age'] < 30:
return 1
elif row['age'] >= 30 and row['age'] < 60:
return 2
else:
return 3
In [32]:
# apply to dataframe, use axis=1 to apply the function to every row
df['age_groups'] = df.apply(age_groups, axis=1)
In [33]:
df
Out[33]:
age age_groups
0
19
1
1
32
2
2
64
3
3
45
2
4
13
0
You can easily use multiple columns and multiple conditions with this way of column creation. If you want to use age and bruto income to interpret salaries:
In [34]:
import pandas as pd
import numpy as np
# Load CSV to DF
df = pd.read_csv('tax_dataset.csv')
# Calculating age field
df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True)
now = pd.Timestamp.now()
df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int')
# Slicing only age and bruto fields
df = df.loc[:,['age','bruto']]
# Displaying using Jupyter Notebook
df
Out[34]:
age bruto
0
19
300
1
32
3200
age bruto
2
64
5050
3
45
4575
4
13
0
In [35]:
def age_salary(row):
if row['bruto'] / row['age'] > 100:
return 'high salary'
elif row['bruto'] / row['age'] <= 100 and row['bruto'] / row['age'] > 50:
return 'medium salary'
elif row['bruto'] / row['age'] < 50 and row['bruto'] / row['age'] > 0:
return 'low salary'
else:
return 'no salary'
In [36]:
df['salary_age_relation'] = df.apply(age_salary, axis=1)
In [37]:
df.head()
Out[37]:
age bruto salary_age_relation
0
19
300
low salary
1
32
3200
medium salary
2
64
5050
medium salary
3
45
4575
high salary
4
13
0
no salary
Exercise 7.7 Multiple conditions (vectorized solution)
¶
The solution in the previous exercise works, but might not be the best. If you are looking for a more efficient solution (e.g. if you deal with a large dataset), you can specify your conditions in a list and use np.select
In [38]:
import pandas as pd
import numpy as np
# Load CSV to DF
df = pd.read_csv('tax_dataset.csv')
# Calculating age field
df['birthdate'] = pd.to_datetime(df.birthdate, dayfirst=True)
now = pd.Timestamp.now()
df['age'] = (now - df['birthdate']).astype('m8[Y]').astype('int')
# Slicing only age and bruto fields
df = df.loc[:,['age','bruto']]
# Displaying using Jupyter Notebook
df
Out[38]:
age bruto
0
19
300
1
32
3200
2
64
5050
3
45
4575
4
13
0
In [39]:
conditions = [df['bruto'] / df['age'] > 100, (df['bruto'] / df['age'] <= 100) & (df['bruto'] / df['age'] > 50), (df['bruto'] / df['age'] < 50) & (df['bruto'] / df['age'] > 0)]
In [40]:
outputs = ['high salary', 'medium salary', 'low salary']
In [41]:
df['salary_age_relation'] = np.select(conditions, outputs, 'no salary')
In [42]:
df
Out[42]:
age bruto salary_age_relation
0
19
300
low salary
1
32
3200
medium salary
2
64
5050
medium salary
3
45
4575
high salary
4
13
0
no salary
This gives the same results as the previous exercise, but with better performance.
Revised Date: September 8th, 2023
In [ ]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help