Foundations of Data Science: Week 5 Notes

School

Toronto Metropolitan University

Course

830

Subject

Information Systems

Date

Oct 30, 2023

Uploaded by CaptainTreeNewt33

Foundations of Data Science
Mar 5, 2023

Module 4: Pandas

- To open an Excel file in Python, use pd.read_excel(); for a CSV file, use pd.read_csv()
- head() shows the first rows of the table and tail() shows the last rows
  o Leaving the () blank defaults to 5 rows. Putting a number in the () specifies how many rows to show
- Use describe(include='all') to get summary details for every column (a quick lens on what is in the data set)
- To isolate a specific column, put the name of the column in square brackets. To isolate multiple columns, pass a list of column names separated by commas
  Example: customers['id'] for one column, customers[['id', 'province']] for two columns
- The output looks different for one column compared to multiple columns because one column is a Series and multiple columns are a DataFrame
- ___.loc selects by label
  Example: ___.loc[4, 'id'] -> the value in column id for the row whose index label is 4
- ___.iloc selects by position (row and column numbers, counted from 0)
  Example: ___.iloc[10, 3] -> the value at row position 10 and column position 3
- loc and iloc accept ranges (slices). A less-than or greater-than condition is not a range; it belongs in a Boolean filter like the province example below, which loc also accepts
- inplace=True applies the change to the data frame itself; inplace=False (the default) leaves the original unchanged and returns a modified copy
- The index is the leftmost column shown in the output; it holds the row labels
- The rows, listed under the index, are represented by axis=0
- The column names, to the right of the index, are represented by axis=1
- The drop() function removes a row or column from the data frame. By default it looks in the index (rows) unless axis=1 is specified
- sort_index(axis=1) -> orders the data frame alphabetically by column name
- To isolate all the rows that match a specific value, put a condition in square brackets
  Example: customers[customers['province'] == 'ON'] -> all the rows of the customers data frame where the province is ON
- Conditions can be combined and negated: and '&', or '|', not '~', not equal '!='. Wrap each condition in parentheses when combining them
- The groupby function is very similar to a pivot table
  Example: to group by province and customer category
  customers.groupby(['province', 'cust_category']).count() -> for each combination of province and customer category, the number of non-missing values in every other column
- The aggregate function calculates summary details of the data frame; we can pass a list of functions, or a dictionary that maps column names to functions
  Example: customers.groupby('province').aggregate({'num_products': ['mean', 'min', 'max', 'sum'], 'len_relationship': 'mean'}) -> the output has the four columns listed above for num_products for each province, and another column with the average len_relationship for each province
- When we are passing multiple values, we need a list, so we need the square brackets [ ]
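The selection, filtering, and grouping notes above can be tried on a small table. This is a minimal sketch: the customers rows below are invented for illustration, and only the column names come from the notes.

```python
import pandas as pd

# Hypothetical stand-in for the customers table (values invented for illustration).
customers = pd.DataFrame(
    {
        "id": [101, 102, 103, 104, 105, 106],
        "province": ["ON", "ON", "BC", "BC", "QC", "ON"],
        "cust_category": ["A", "B", "A", "A", "B", "A"],
        "num_products": [2, 5, 1, 3, 4, 6],
        "len_relationship": [1.5, 3.0, 2.0, 4.5, 0.5, 6.0],
    }
)

first_two = customers.head(2)                 # first 2 rows
last_five = customers.tail()                  # last 5 rows (the default)
summary = customers.describe(include="all")   # summary of every column

id_series = customers["id"]                   # one column -> Series
id_prov = customers[["id", "province"]]       # list of columns -> DataFrame

by_label = customers.loc[4, "id"]             # row labelled 4, column "id"
by_position = customers.iloc[4, 0]            # fifth row, first column
label_slice = customers.loc[1:3, "id"]        # labels 1 to 3, end included
position_slice = customers.iloc[1:3, 0]       # positions 1 and 2, end excluded

# Boolean filters; each condition is wrapped in parentheses when combined.
ontario = customers[customers["province"] == "ON"]
big_on = customers[(customers["province"] == "ON") & (customers["num_products"] > 2)]
not_on = customers[~(customers["province"] == "ON")]

no_first_row = customers.drop(0)                        # axis=0 default: drop row label 0
no_category = customers.drop("cust_category", axis=1)   # axis=1: drop a column
sorted_cols = customers.sort_index(axis=1)              # columns in alphabetical order

counts = customers.groupby(["province", "cust_category"]).count()
stats = customers.groupby("province").aggregate(
    {"num_products": ["mean", "min", "max", "sum"], "len_relationship": "mean"}
)
print(stats)
```

None of these calls changes customers itself, because inplace is left at its default of False; each one returns a new object.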
- The unstack function takes a second-level index variable and spreads it across the table as columns
- Merge and join are two functions you can use to combine data sets
- When merging data sets, you need a column to merge on (a column holding the same information in both sets, so the rows can be matched using it as a guide)
- merged_data = pd.merge()
  o The first argument is the table we want on the left, then the second data set; then we name the columns we are using to merge; lastly we say how to merge
  Example: merged_data = pd.merge(customers, transactions, left_on='id', right_on='customer_id', how='inner')
  o We are merging the tables customers and transactions
  o The columns that hold the same information are id from customers (the left table, because it was listed first and we want its information displayed first, with the second table's information afterwards) and customer_id from transactions
  o How we want to merge: the inner method, which keeps only the rows whose key appears in both tables
- Note: the order of the tables in the function determines the order of the columns in the new table
- pd.merge() combines only two data sets at a time
- The join function matches rows on the index, so both tables need to be indexed by the same key for the data sets to be joined
- Join is a bit easier to use because, when both tables share the same index, we do not need to specify left and right columns
- Using the join function, you may need to set or rename the index so the two tables match
- For join to work as expected, the index of both tables should agree in name, data type, etc.
- Example: joined_data = customers.join(transactions) -> here we are joining the transactions data set to the customers data set
- When creating a new column, assign it a value and the column will populate with that value
  Example: joined_data['spender_type'] = np.nan -> adds the spender_type column to the end of the joined_data table, with every value set to NaN
- You can add specific values to a new column in your table. Once you create your function, you can apply it to a column and store the result
  Example: joined_data['spender_type'] = joined_data['txn_total'].apply(spender_type) -> fills the spender_type column of joined_data by applying the function we made to each value in the txn_total column
- You can create a new column and its calculation in one line (without having to create the column first)
  Example: joined_data['proportion_of_1000'] = joined_data['txn_total'] / 1000 -> adds a new column called proportion_of_1000, equal to txn_total divided by 1000
- The fillna() function puts a specified value wherever data is missing (empty cells)
- To add columns together, create a new column and add the values of two existing columns
  Example: joined_data['doubled_transactions'] = joined_data['txn_total'] + joined_data['txn_total']
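The merge, join, and new-column steps above can be sketched end to end. The table contents are invented, and the spender_type rule (a cutoff of 1000) is a hypothetical placeholder, since the notes do not say what the course's function does.

```python
import numpy as np
import pandas as pd

# Hypothetical tables; only the column names come from the notes.
customers = pd.DataFrame(
    {"id": [1, 2, 3], "province": ["ON", "BC", "QC"], "cust_category": ["A", "B", "A"]}
)
transactions = pd.DataFrame(
    {"customer_id": [1, 2, 4], "txn_total": [250.0, 1200.0, 90.0]}
)

# unstack: move the second grouping level (cust_category) into columns.
wide = customers.groupby(["province", "cust_category"])["id"].count().unstack()

# Inner merge: only ids present in both tables (1 and 2) survive.
merged_data = pd.merge(
    customers, transactions, left_on="id", right_on="customer_id", how="inner"
)

# join matches on the index, so both tables are indexed by the customer id first.
# rename_axis gives the right-hand index the same name, as the notes suggest.
joined_data = customers.set_index("id").join(
    transactions.set_index("customer_id").rename_axis("id")
)

# New column filled with a single value.
joined_data["spender_type"] = np.nan


def spender_type(txn_total):
    """Hypothetical rule: the 1000 cutoff is an assumption, not from the notes."""
    if pd.isna(txn_total):
        return np.nan
    return "high" if txn_total >= 1000 else "low"


joined_data["spender_type"] = joined_data["txn_total"].apply(spender_type)
joined_data["proportion_of_1000"] = joined_data["txn_total"] / 1000
joined_data["doubled_transactions"] = joined_data["txn_total"] + joined_data["txn_total"]

# fillna returns a copy with the gaps replaced; joined_data itself is unchanged.
filled = joined_data["txn_total"].fillna(0)
print(joined_data)
```

Customer 3 has no transaction, so join (a left join by default) keeps that row with a missing txn_total, while the inner merge drops it.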
- In the doubled_transactions example, we are creating a new column called doubled_transactions, and its values are the sum of the two columns listed from the table
- 3 types of missing data
  1. Data missing completely at random: there is no underlying pattern as to why the data point is missing
  2. Data missing at random: there is a pattern as to why the data point is missing, and it can be explained by other data that was observed
  3. Data missing not at random: there is a pattern as to why the data point is missing, and it relates to the missing value itself
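A rough first look at missing data is to count the gaps and see whether they cluster by another column. This is only a sketch on invented values; a pattern like the one below hints that the data is not missing completely at random, but it cannot by itself tell the second type from the third.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in txn_total (values invented for illustration).
data = pd.DataFrame(
    {
        "province": ["ON", "ON", "BC", "BC", "QC", "QC"],
        "txn_total": [250.0, np.nan, 90.0, 310.0, np.nan, np.nan],
    }
)

# How many values are missing in the column.
missing_count = data["txn_total"].isna().sum()

# Share of missing values within each province: uneven shares suggest a pattern.
missing_share_by_province = data["txn_total"].isna().groupby(data["province"]).mean()

# One simple way to fill the gaps: use the mean of the observed values.
filled_mean = data["txn_total"].fillna(data["txn_total"].mean())

print(missing_count)
print(missing_share_by_province)
```

Filling with the mean is one option among several; whether it is appropriate depends on which of the three types of missing data applies.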