Foundations of Data Science_Week 5 Notes

School: Toronto Metropolitan University
Course: 830
Subject: Information Systems
Date: Oct 30, 2023
Type: docx
Uploaded by CaptainTreeNewt33

Foundations of Data Science Mar 5, 2023

Module 4: Pandas

- To open a CSV file in Python, use pd.read_csv(); for an Excel workbook, use pd.read_excel()
- head() or tail() shows the first 5 or last 5 rows of the table
  o Leaving the () blank defaults to 5. Putting a number in the () sets how many rows to show
- describe(include='all') gives summary details for every column of the table (a quick lens on what is in the data set)
- To isolate a specific column, put the column name in square brackets. Example: customers['id']. To isolate multiple columns, pass a list of names separated by commas, which means double square brackets. Example: customers[['id', 'province']]
- The output looks different for one column compared to multiple columns because one column is a Series and multiple columns are a DataFrame
- ___.loc searches by label. Example: ___.loc[4, 'id'] -> the value in the row labelled 4, in column id
- ___.iloc searches by position (row and column numbers, counted from 0). Example: ___.iloc[10, 3] -> the value in row position 10, column position 3
- loc and iloc accept ranges (slices). loc ranges include the end label; iloc ranges exclude the end position. Less-than or greater-than comparisons are written as conditions instead, as in the province example further down
- inplace=True -> the change is made to the data frame itself (permanent). With inplace=False, which is the default, the original is left untouched and a changed copy is returned, so the change is temporary unless you assign the result
- The index is the set of row labels, displayed as the leftmost column
- The rows, under the index, are represented by axis=0
- The column names are represented by axis=1
- The drop() function removes a row or column from the data frame. By default it looks in the index (row labels) unless you specify axis=1
- sort_index(axis=1) -> organizes the data frame alphabetically by column name
- To isolate all the rows that match a specific value, put a condition in square brackets. Example: customers[customers['province'] == 'ON'] -> in the customers data frame, we want all the data where the province is ON
- You can combine multiple conditions using and '&', or '|', not '~', and not equal '!='. Wrap each condition in its own parentheses
- The groupby function is very similar to a pivot table. Example, to group by province and customer category: customers.groupby(['province', 'cust_category']).count() -> in the customers data frame, this shows how many non-missing values each column has for every combination of province and customer category
- We can use the aggregate function to calculate various details of the data frame; we can pass either a list of functions or a dictionary that maps each column to its functions. Example -> customers.groupby('province').aggregate({'num_products': ['mean', 'min', 'max', 'sum'], 'len_relationship': 'mean'}) -> the output has num_products with the four columns listed above for each province, and another column, len_relationship, with the average for each province
- When we pass multiple variables, we need a list, so we need the square brackets [ ]
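The selection, filtering, and grouping steps in these notes can be sketched as follows. This is a minimal illustration, not the course's own code: the column names come from the notes, but the sample values are invented and the data frame is built in memory instead of being read from a file.

```python
import pandas as pd

# Hypothetical sample data: column names follow the notes, values are made up.
customers = pd.DataFrame({
    "id": [101, 102, 103, 104, 105],
    "province": ["ON", "QC", "ON", "BC", "ON"],
    "cust_category": ["A", "B", "A", "A", "B"],
    "num_products": [2, 1, 4, 3, 1],
    "len_relationship": [5.0, 2.0, 8.0, 3.0, 1.0],
})

first_two = customers.head(2)            # first 2 rows
last_two = customers.tail(2)             # last 2 rows

ids = customers["id"]                    # one column -> Series
subset = customers[["id", "province"]]   # list of columns -> DataFrame

by_label = customers.loc[4, "id"]        # row labelled 4, column "id"
by_position = customers.iloc[1, 3]       # second row, fourth column (counted from 0)

# Boolean filtering: each condition sits in its own parentheses.
ontario = customers[customers["province"] == "ON"]
on_multi = customers[(customers["province"] == "ON") & (customers["num_products"] > 1)]

# drop returns a changed copy; customers itself keeps the column
# because inplace=True was not passed.
no_category = customers.drop("cust_category", axis=1)

# Non-missing counts per (province, cust_category) combination.
counts = customers.groupby(["province", "cust_category"]).count()

# A dictionary maps each column to the function(s) applied to it.
summary = customers.groupby("province").aggregate(
    {"num_products": ["mean", "min", "max", "sum"], "len_relationship": "mean"}
)
print(summary)
```

Because the default index here is 0 to 4, the label 4 used with loc happens to be the last row; with a different index, loc and iloc would no longer point at the same rows.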

- The unstack function takes a second-level index variable and makes it a set of columns across the table
- merge and join are two functions you can use to combine data sets
- When merging data sets, you need a column to merge on (a column holding the same information in both sets, so the data can merge using that as a guide)
- merged_data = pd.merge()
  o First comes the table we want on the left, then the second data set; then we identify the names of the columns we are using to merge; lastly, how we will merge
  o Example: merged_data = pd.merge(customers, transactions, left_on='id', right_on='customer_id', how='inner')
  o We are merging the tables customers and transactions. The columns that hold the same info are id from customers (the left table, because it was listed first and we want its information to display first, with the second table's info afterwards) and customer_id from the transactions table. The merge method is inner, which keeps only rows with a match in both tables
- Note: the order of the tables in the function sets the order of the columns in the new table
- pd.merge() combines two data sets at a time; to combine more, merge the result with the next table
- The join function matches rows on the index, so the index on both tables needs to match for the data sets to be joined (title, values, data type, etc.)
- join is a bit easier to use, because when the indexes match we don't need to specify left and right columns
- Using the join function, you may need to rename the index so they match
- Example: joined_data = customers.join(transactions) -> here we are joining the transactions data set to the customers data set
- When creating a new column, we assign it a value and the column will populate. Example: joined_data['spender_type'] = np.nan -> here we add the spender_type column to the joined_data table and set it equal to NaN; it is automatically added at the end of the table
- You can add specific values to a new column in your table. Once you create your function, you can isolate the column and apply it. Example: joined_data['spender_type'] = joined_data['txn_total'].apply(spender_type) -> here we modify the spender_type column in our joined_data data set using the function we made; each value in the txn_total column is passed to the function, and the result goes into spender_type
- You can create a new column and its calculation in one line (without having to create the column first). Example: joined_data['proportion_of_1000'] = joined_data['txn_total'] / 1000 -> here we add a new column called proportion_of_1000, which equals txn_total / 1000
- The fillna() function fills areas where there is no data (empty cells) with a value you specify
- To add columns together, you can create a new column and add values from two already existing columns. Example: joined_data['doubled_transactions'] = joined_data['txn_total'] + joined_data['txn_total']
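The unstack, merge, join, and new-column steps above might look like this in practice. It is a sketch under stated assumptions: the sample values are invented, and the spender_type helper (with its 1000 threshold and its labels) is hypothetical, since the notes do not show the function they used.

```python
import pandas as pd

# Hypothetical sample data: column names follow the notes, values are made up.
customers = pd.DataFrame({
    "id": [1, 2, 3],
    "province": ["ON", "QC", "ON"],
    "cust_category": ["A", "B", "B"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "txn_total": [250.0, 1200.0, 90.0],
})

# unstack moves the second index level (cust_category) into columns.
sizes = customers.groupby(["province", "cust_category"]).size()
wide = sizes.unstack()

# Inner merge keeps only ids present in both tables (1 and 2 here).
merged_data = pd.merge(customers, transactions,
                       left_on="id", right_on="customer_id", how="inner")

# join matches rows on the index, so both tables are indexed by customer id.
# The index is renamed to "id" to follow the notes' advice about matching titles.
customers_idx = customers.set_index("id")
transactions_idx = transactions.set_index("customer_id").rename_axis("id")
joined_data = customers_idx.join(transactions_idx)

# Hypothetical helper: threshold and labels are assumptions, not from the notes.
def spender_type(total):
    if pd.isna(total):
        return "unknown"
    return "high" if total >= 1000 else "low"

joined_data["spender_type"] = joined_data["txn_total"].apply(spender_type)
joined_data["proportion_of_1000"] = joined_data["txn_total"] / 1000
joined_data["doubled_transactions"] = joined_data["txn_total"] + joined_data["txn_total"]

# fillna returns a copy in which missing txn_total values become 0.
filled = joined_data.fillna({"txn_total": 0})
print(filled)
```

Customer 3 has no transaction, so join (a left join by default) leaves its txn_total missing, while the inner merge drops that customer entirely.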

Here we are creating a new column, called doubled_transactions, and its values are the sum of the two columns listed from the table (in this case txn_total added to itself)
- 3 types of missing data
  1. Data missing completely at random: no underlying pattern as to why the data point is missing
  2. Data missing at random: there is a pattern as to why the data point is missing, and it can be explained by other data we do have
  3. Data missing not at random: there is a pattern as to why the data point is missing, and it depends on the missing value itself
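One way to start looking for a pattern in missing data is to count the gaps and compare how often they occur across groups. The sketch below assumes a small invented table with a province column and a txn_total column; it only describes where values are missing and cannot by itself decide which of the three types applies.

```python
import numpy as np
import pandas as pd

# Hypothetical data: txn_total is missing for some rows.
data = pd.DataFrame({
    "province": ["ON", "ON", "QC", "QC", "BC", "BC"],
    "txn_total": [250.0, np.nan, 90.0, np.nan, 40.0, 75.0],
})

# Number of missing values in each column.
missing_per_column = data.isna().sum()

# Share of rows with a missing txn_total, per province.
missing_rate_by_province = data["txn_total"].isna().groupby(data["province"]).mean()

print(missing_per_column)
print(missing_rate_by_province)
```

A missing rate that differs a lot between groups suggests the gaps are related to other columns, which is a reason to doubt that the data is missing completely at random.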
