accy570lab_week7
html
keyboard_arrow_up
School
University of Illinois, Urbana Champaign *
*We aren’t endorsed by this school
Course
570
Subject
Information Systems
Date
Dec 6, 2023
Type
html
Pages
12
Uploaded by HighnessLoris3604
Week 7 Lab
¶
•
File path •
Load data to pandas DataFrame •
DataFrame attributes File Path
¶
•
Absolute path
•
Windows C:\Users\Linden
....
\data.csv
•
Mac: /Users/Linden/.../data.csv
•
Relative path
•
data.csv in currcent foler(same as the notebook file): data.csv
•
data.csv in a subfolder in current folder: subfolder/data.csv
Always use / in file path
¶
Useful notebook magics
¶
•
%pwd
: print current folder as absolute path •
%ls
: list files in current folder In [1]:
%pwd
Out[1]:
'C:\\Users\\0903i\\Downloads'
In [2]:
%ls
Volume in drive C is OS
Volume Serial Number is 7415-2B2A
Directory of C:\Users\0903i\Downloads
10/05/2023 08:03 PM <DIR> .
09/28/2023 11:58 AM <DIR> ..
10/05/2023 08:01 PM <DIR> .ipynb_checkpoints
10/05/2023 07:51 PM <DIR> __MACOSX
10/03/2023 05:02 PM 430,568 517 Midterm Project Team 6 - CVS.xlsx
10/05/2023 08:03 PM 48,732 accy570lab_week7.ipynb
09/26/2023 07:49 PM 36,984 After Class Exercises 8.ipynb
09/13/2023 11:00 PM 77,259 AIDLandlordLeasingAgent.pdf
10/05/2023 10:55 AM 23,948,849 appstore_games.csv
09/30/2023 12:31 PM 57,741 Companies Cross Listed in the US.xlsx
10/05/2023 07:51 PM <DIR> data
09/30/2023 12:31 PM 90,775 fortune1000.csv
09/30/2023 12:46 PM 16,437,737 IowaLiquorSales2018.csv
10/05/2023 07:51 PM 17,558 PilgrimBank.xlsx
10/05/2023 12:29 PM 63,958 ProblemSet2_soomink4_thingyuh2-1.ipynb
10/05/2023 08:02 PM 316,891 ProblemSet2_soomink4_thingyuh2-1.twbx
10/05/2023 07:51 PM 214 sales.csv
10/05/2023 07:51 PM 61,194 titanic.csv
13 File(s) 41,588,460 bytes
5 Dir(s) 890,413,084,672 bytes free
In [3]:
%ls data
Volume in drive C is OS
Volume Serial Number is 7415-2B2A
Directory of C:\Users\0903i\Downloads\data
10/05/2023 07:51 PM <DIR> .
10/05/2023 08:03 PM <DIR> ..
10/13/2021 02:24 PM <DIR> .ipynb_checkpoints
10/05/2023 07:51 PM 186,693 COST_WMT_stock.csv
10/05/2023 07:51 PM 61,194 titanic.csv
2 File(s) 247,887 bytes
3 Dir(s) 890,404,212,736 bytes free
Exercise 1
•
Find your current working dirctory with notebook magic %pwd. •
List all files in your current working directory with notebook magic %ls. In [4]:
%pwd
Out[4]:
'C:\\Users\\0903i\\Downloads'
In [6]:
%ls
Volume in drive C is OS
Volume Serial Number is 7415-2B2A
Directory of C:\Users\0903i\Downloads
10/05/2023 08:03 PM <DIR> .
09/28/2023 11:58 AM <DIR> ..
10/05/2023 08:01 PM <DIR> .ipynb_checkpoints
10/05/2023 07:51 PM <DIR> __MACOSX
10/03/2023 05:02 PM 430,568 517 Midterm Project Team 6 - CVS.xlsx
10/05/2023 08:03 PM 48,732 accy570lab_week7.ipynb
09/26/2023 07:49 PM 36,984 After Class Exercises 8.ipynb
09/13/2023 11:00 PM 77,259 AIDLandlordLeasingAgent.pdf
10/05/2023 10:55 AM 23,948,849 appstore_games.csv
09/30/2023 12:31 PM 57,741 Companies Cross Listed in the US.xlsx
10/05/2023 07:51 PM <DIR> data
09/30/2023 12:31 PM 90,775 fortune1000.csv
09/30/2023 12:46 PM 16,437,737 IowaLiquorSales2018.csv
10/05/2023 07:51 PM 17,558 PilgrimBank.xlsx
10/05/2023 12:29 PM 63,958 ProblemSet2_soomink4_thingyuh2-1.ipynb
10/05/2023 08:02 PM 316,891 ProblemSet2_soomink4_thingyuh2-1.twbx
10/05/2023 07:51 PM 214 sales.csv
10/05/2023 07:51 PM 61,194 titanic.csv
13 File(s) 41,588,460 bytes
5 Dir(s) 890,388,803,584 bytes free
Access Data Files
¶
•
Best practice:
•
Put data file in same folder as notebook. •
Put data file in subfolder like 'data' •
Never use absolute path
•
Always use relative path
•
Use forward slash / in file path string
In [7]:
import pandas as pd
df = pd.read_csv('titanic.csv')
df.head()
Out[7]:
PassengerId Survived Pclass
Name
Sex
Age SibSp Parch
Ticket
Fare
0
1
0
3
Braund, Mr. Owen Harris
male
22.0 1
0
A/5 21171
7.2500
1
2
1
1
Cumings, Mrs. John Bradley (Florence Briggs Th...
female 38.0 1
0
PC 17599
71.2833
2
3
1
3
Heikkinen,
Miss. Laina
female 26.0 0
0
STON/O2
. 3101282
7.9250
3
4
1
1
Futrelle, Mrs. Jacques Heath (Lily May Peel)
female 35.0 1
0
113803
53.1000
4
5
0
3
Allen, Mr. William Henry
male
35.0 0
0
373450
8.0500
In [8]:
df = pd.read_csv('data/titanic.csv')
df.head()
Out[8]:
PassengerId Survived Pclass
Name
Sex
Age SibSp Parch
Ticket
Fare
0
1
0
3
Braund, Mr. Owen Harris
male
22.0 1
0
A/5 21171
7.2500
1
2
1
1
Cumings, Mrs. John Bradley (Florence Briggs female 38.0 1
0
PC 17599
71.2833
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
PassengerId Survived Pclass
Name
Sex
Age SibSp Parch
Ticket
Fare
Th...
2
3
1
3
Heikkinen,
Miss. Laina
female 26.0 0
0
STON/O2
. 3101282
7.9250
3
4
1
1
Futrelle, Mrs. Jacques Heath (Lily May Peel)
female 35.0 1
0
113803
53.1000
4
5
0
3
Allen, Mr. William Henry
male
35.0 0
0
373450
8.0500
Exercise 2
•
Load COST_WMT_stock.csv in data folder to a DataFrame df_stock
. •
Display first five rows of df_stock with head() method. In [13]:
import pandas as pd
df_stock = pd.read_csv('COST_WMT_stock.csv')
df_stock.head()
Out[13]:
Date
COST_Adj
Close
COST_Vo
lume
COST_ch
ange
COST_UpO
rDown
WMT_Adj
Close
WMT_Vol
ume
WMT_cha
nge
WMT
D
0
10/1/
2009
43.190258 5097900
NaN
NaN
38.208164 22467100 NaN
NaN
1
10/2/
2009
43.795193 5760800
0.604935 Up
38.270535 15093800 0.062371
Up
2
10/5/
2009
44.113171 4480100
0.317978 Up
38.254951 14217500 -0.015584 Down
3
10/6/
2009
44.927494 5865700
0.814323 Up
38.582447 17582800 0.327496
Up
4
10/7/
2009
45.757320 11117300 0.829826 Up
38.590244 11620200 0.007797
Up
DataFrame manipulations
¶
•
Load data
•
pd.read_csv()
•
pd.read_excel() and more (try tab completion) •
Get intuition of data
•
df.head() •
df.tail() •
df.sample() •
df.info() In [14]:
df.head()
Out[14]:
PassengerId Survived Pclass
Name
Sex
Age SibSp Parch
Ticket
Fare
0
1
0
3
Braund, Mr. Owen Harris
male
22.0 1
0
A/5 21171
7.2500
1
2
1
1
Cumings, Mrs. John Bradley (Florence Briggs Th...
female 38.0 1
0
PC 17599
71.2833
2
3
1
3
Heikkinen,
Miss. Laina
female 26.0 0
0
STON/O2
. 3101282
7.9250
3
4
1
1
Futrelle, Mrs. Jacques Heath (Lily May Peel)
female 35.0 1
0
113803
53.1000
4
5
0
3
Allen, Mr. William Henry
male
35.0 0
0
373450
8.0500
In [15]:
df.tail()
Out[15]:
PassengerId Survived Pclass
Name
Sex
Age SibSp Parch Ticket
Fare C
886
887
0
2
Montvila, Rev. Juozas
male
27.0 0
0
211536 13.00 N
887
888
1
1
Graham, Miss. Margaret female 19.0 0
0
112053 30.00 B
PassengerId Survived Pclass
Name
Sex
Age SibSp Parch Ticket
Fare C
Edith
888
889
0
3
Johnston, Miss. Catherine
Helen "Carrie"
female NaN 1
2
W./C. 6607
23.45 N
889
890
1
1
Behr, Mr. Karl Howell
male
26.0 0
0
111369 30.00 C
890
891
0
3
Dooley, Mr. Patrick
male
32.0 0
0
370376 7.75
N
In [16]:
df.sample(5)
Out[16]:
PassengerId Survived Pclass
Name
Sex
Age SibSp Parch Ticket
Fare
185
186
0
1
Rood, Mr.
Hugh Roscoe
male
NaN 0
0
113767 50.0000
76
77
0
3
Staneff, Mr. Ivan
male
NaN 0
0
349208 7.8958
306
307
1
1
Fleming, Miss. Margaret
female NaN 0
0
17421
110.883
646
647
0
3
Cor, Mr. Liudevit
male
19.0 0
0
349231 7.8958
141
142
1
3
Nysten, Miss. Anna Sofia
female 22.0 0
0
347081 7.7500
In [17]:
df.sample(5)
Out[17]:
PassengerId Survived Pclass
Name
Sex
Age SibSp Parch Ticket
Fare
618
619
1
2
Becker, Miss. female 4.0
2
1
230136 39.00
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
PassengerId Survived Pclass
Name
Sex
Age SibSp Parch Ticket
Fare
Marion Louise
379
380
0
3
Gustafsson,
Mr. Karl Gideon
male
19.0 0
0
347069 7.775
369
370
1
1
Aubart, Mme. Leontine Pauline
female 24.0 0
0
PC 17477
69.30
342
343
0
2
Collander, Mr. Erik Gustaf
male
28.0 0
0
248740 13.00
586
587
0
2
Jarvis, Mr. John Denzil
male
47.0 0
0
237565 15.00
In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64
6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64
10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Exercise 3
•
Load sales.csv
to a DataFrame df_sales
with pandas function read_csv()
. •
Take a peek at the DataFrame with DataFrame method head()
(or tail, or sample)
•
Get basic DataFrame information with DataFrame method info()
. •
Which column has missing values? (
You don't need to answer this question in the notebook
) In [28]:
df_sales = pd.read_csv('sales.csv')
df_sales.head()
Out[28]:
Item
Sales($)
Sale Date
Notes
0
milk
$3.79
1/5/2020
2%
1
fruit
$2.99
2/5/2020
NaN
2
meat
$15.99
3/5/2020
Beaf
3
chocolate $5.99
4/5/2020
Black
4
Milk
($3.79)
13/5/2020
Return
In [25]:
df_sales.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
# Column Non-Null Count Dtype --- ------ -------------- ----- 0 Item 7 non-null object
1 Sales($) 7 non-null object
2 Sale Date 7 non-null object
3 Notes 4 non-null object
dtypes: object(4)
memory usage: 356.0+ bytes
Exercise 4
In the following code cell:
•
import pandas as pd •
type pd.read_
then press TAB key. (
Just try it, don't leave un-completed code in the notebook
) •
print pd.read_excel() help document using help() or ? •
Load data from PilgrimBank.xlsx
to df_pb
using pb.read_excel(). The excel file has two sheets, the data is in 'DataSet' sheet. •
Display the first 5 rows of df_pb
. In [33]:
import pandas as pd
pd.read_excel?
Out[33]:
<function pandas.io.excel._base.read_excel(io, sheet_name: 'str | int | list[IntStrT] |
None' = 0, *, header: 'int | Sequence[int] | None' = 0, names: 'list[str] | None' = None, index_col: 'int | Sequence[int] | None' = None, usecols: 'int | str | Sequence[int] | Sequence[str] | Callable[[str], bool] | None' = None, squeeze: 'bool | None' = None, dtype: 'DtypeArg | None' = None, engine: "Literal['xlrd', 'openpyxl', 'odf', 'pyxlsb'] | None" = None, converters: 'dict[str, Callable] | dict[int, Callable]
| None' = None, true_values: 'Iterable[Hashable] | None' = None, false_values: 'Iterable[Hashable] | None' = None, skiprows: 'Sequence[int] | int | Callable[[int], object] | None' = None, nrows: 'int | None' = None, na_values=None, keep_default_na:
'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, parse_dates: 'list | dict | bool' = False, date_parser: 'Callable | None' = None, thousands: 'str | None' = None, decimal: 'str' = '.', comment: 'str | None' = None, skipfooter: 'int' = 0, convert_float: 'bool | None' = None, mangle_dupe_cols: 'bool' = True, storage_options: 'StorageOptions' = None) -> 'DataFrame | dict[IntStrT, DataFrame]'>
In [31]:
df_pb=pd.read_excel('PilgrimBank.xlsx',sheet_name='DataSet')
df_pb.head()
Out[31]:
ID 9age 9Inc 9AllDem 9District 9Tenure 9Profit 0Profit 9Online 0Online 9Billpay
0
1
NaN
NaN 0
1200
6.33
21
NaN
0
NaN
0
1
2
6.0
3.0
1
1200
29.50
-6
-32.0
0
0.0
0
2
3
5.0
5.0
1
1100
26.41
-49
-22.0
1
1.0
0
3
4
NaN
NaN 0
1200
2.25
-4
NaN
0
NaN
0
4
5
2.0
9.0
1
1200
9.91
-61
-4.0
0
0.0
0
DataFrame Attributes
¶
•
df.index •
df.columns •
df.shape In [34]:
df.head()
Out[34]:
PassengerId Survived Pclass
Name
Sex
Age SibSp Parch
Ticket
Fare
0
1
0
3
Braund, Mr. Owen Harris
male
22.0 1
0
A/5 21171
7.2500
1
2
1
1
Cumings, Mrs. John Bradley (Florence Briggs Th...
female 38.0 1
0
PC 17599
71.2833
2
3
1
3
Heikkinen,
Miss. Laina
female 26.0 0
0
STON/O2
. 3101282
7.9250
3
4
1
1
Futrelle, Mrs. Jacques Heath female 35.0 1
0
113803
53.1000
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
PassengerId Survived Pclass
Name
Sex
Age SibSp Parch
Ticket
Fare
(Lily May Peel)
4
5
0
3
Allen, Mr. William Henry
male
35.0 0
0
373450
8.0500
In [35]:
df.index
Out[35]:
RangeIndex(start=0, stop=891, step=1)
In [36]:
df.columns
Out[36]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
In [37]:
df.columns = [c.upper() for c in df.columns]
df.head()
Out[37]:
PASSENGERID SURVIVED PCLASS
NAME
SEX
AGE SIBSP PARCH TICKET
F
0
1
0
3
Braund, Mr. Owen Harris
male
22.0 1
0
A/5 21171
7.
1
2
1
1
Cumings, Mrs. John Bradley (Florence Briggs Th...
female 38.0 1
0
PC 17599
71
2
3
1
3
Heikkinen,
Miss. Laina
female 26.0 0
0
STON/O2
. 3101282
7.
3
4
1
1
Futrelle, Mrs. Jacques Heath (Lily May Peel)
female 35.0 1
0
113803
53
4
5
0
3
Allen, Mr. William male
35.0 0
0
373450
8.
PASSENGERID SURVIVED PCLASS
NAME
SEX
AGE SIBSP PARCH TICKET
F
Henry
In [38]:
df.shape
Out[38]:
(891, 12)
In [39]:
df.shape[0]
Out[39]:
891
Exercise 5
•
Find number of rows and columns of df_sales
with DataFrame attribute shape. •
Display column names of df_sales
with DataFrame attribute columns. In [41]:
df_sales.shape
Out[41]:
(7, 4)
In [43]:
df_sales.columns
Out[43]:
Index(['Item', 'Sales($)', 'Sale Date', 'Notes'], dtype='object')
In [44]:
df_sales.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
# Column Non-Null Count Dtype --- ------ -------------- ----- 0 Item 7 non-null object
1 Sales($) 7 non-null object
2 Sale Date 7 non-null object
3 Notes 4 non-null object
dtypes: object(4)
memory usage: 356.0+ bytes
In [46]:
df_sales.columns= ['Item', 'Sales', 'Sale Date', 'Notes'] df_sales.head()
Out[46]:
Item
Sales
Sale Date
Notes
0
milk
$3.79
1/5/2020
2%
1
fruit
$2.99
2/5/2020
NaN
2
meat
$15.99 3/5/2020
Beaf
3
chocolate $5.99
4/5/2020
Black
Item
Sales
Sale Date
Notes
4
Milk
($3.79) 13/5/2020
Return
In [ ]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help