





Question

2a) Shuffle Data
An important part of the data science process is to make sure that your dataset is not skewed or biased. Let us look at the various category values in news_df for example.
Run the two cells below to look at the category breakdown.
In [26]: # A breakdown in first 15000 values
news_df['category'][:15000].value_counts()
Out[26]:
computer        4389
recreational    3957
science         2754
politics        2079
religion        1439
misc             382
Name: category, dtype: int64
In [27]: # breakdown from 15000 to the end of df
news_df['category'][15000:].value_counts()
Out[27]:
science     1180
religion     973
misc         578
politics     546
computer     454
Name: category, dtype: int64
As you can see in the above results, there is no data point for the recreational category (numerical category 2) after 15000 datapoints in the current data. If we were to split the data into training and test sets naively, we might end up with 0 datapoints for the recreational category in the test set, and thus our analysis would remain biased.
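To make the risk concrete, here is a minimal sketch on a small, hypothetical frame (toy_df and its row counts are invented for illustration and are not the real news_df). Rows are grouped by category, as they appear to be in the unshuffled data, and a naive positional split leaves one category out of the test set entirely.

```python
import pandas as pd

# Hypothetical toy frame: rows are grouped by category, like the unshuffled data.
toy_df = pd.DataFrame(
    {"category": ["recreational"] * 6 + ["computer"] * 6 + ["science"] * 4}
)

# Naive split by position: first 10 rows for training, remaining 6 for testing.
train = toy_df[:10]
test = toy_df[10:]

# The test set contains no 'recreational' rows at all.
print(train["category"].value_counts())
print(test["category"].value_counts())
```

Shuffling the rows before splitting makes it far less likely that a whole category ends up on one side of the split.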
Randomly shuffle the dataframe using the sample() method in pandas such that the resulting dataset has the same number of rows as news_df but is randomly shuffled. Store this back into news_df.
Also, provide random_state=random_state as an argument to the method, to use the value defined in random_state at the start of the assignment.
And be sure to reset the indices in the resulting news_df: do this such that there is no index column in the resulting dataframe (see the drop parameter in the reset_index() method).
In [28]: rs = 369
new_df = df.sample(frac=1, random_state=rs).reset_index(drop=True)
news_df.head()
NameError
Traceback (most recent call last)
<ipython-input-28-07c3689ee598> in <module>
      1 rs = 369
----> 2 new_df = df.sample(frac=1, random_state=rs).reset_index(drop=True)
      3 news_df.head()
NameError: name 'df' is not defined
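The NameError comes from calling sample() on df, a name that was never defined in the notebook; the frame is called news_df, and the instructions ask for the result to be stored back into news_df rather than into a new variable. A minimal sketch of the shuffle under those instructions follows. The small news_df built here is a hypothetical stand-in for the real data, and random_state is set to 369 only because that is the value used in the cell above; the assignment defines the actual value earlier.

```python
import pandas as pd

# Assumption: 369 mirrors the value in the cell above; the assignment
# defines random_state at its start.
random_state = 369

# Hypothetical stand-in for the real news_df loaded earlier in the notebook.
news_df = pd.DataFrame({
    "subject": [f"post {i}" for i in range(8)],
    "category": ["recreational"] * 4 + ["science"] * 4,
})

# frac=1 draws every row exactly once, i.e. a full shuffle.
# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old index
# instead of keeping it as an extra column.
news_df = news_df.sample(frac=1, random_state=random_state).reset_index(drop=True)

print(news_df.head())
```

Because the result is assigned back to news_df, later cells such as news_df.head() and the asserts see the shuffled frame.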
In [ ]: assert news_df.shape == (18738, 6)
# if you fail the following assert
# go back and be sure you followed instructions
assert news_df.loc[0, 'subject'] == 'Re: Is 988-1MB/sec. HD transfer slow for 486Dx-58 EISA with Ultrastor24F

2b) Numerical Labels
We need to convert the category labels to numerical labels. Use the following mapping to convert the values in category into numerical labels. Store the numeric values in a new column called 'category_num'.
• politics -> 1
• recreational -> 2
• computer -> 3
• religion -> 4
• science -> 5
• misc -> 6
Hint: you can use the replace() method from pandas.
In [29]: category = ['politics', 'recreational', 'computer', 'religion', 'science', 'misc']
news_df = pd.DataFrame({'news': news, 'category': category})
news_df
NameError
Traceback (most recent call last)
<ipython-input-29-a4639ead8fbB> in <module>
      1 category = ['politics', 'recreational', 'computer', 'religion', 'science', 'misc']
----> 2 news_df = pd.DataFrame({'news': news, 'category': category})
      3 news_df
NameError: name 'news' is not defined
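This NameError arises because the cell builds a brand-new frame from a variable news that does not exist; even if it ran, it would overwrite news_df instead of adding a column to it. A minimal sketch of the labelling step, which keeps the existing frame and adds category_num, is shown below. The small news_df is a hypothetical stand-in for the shuffled data, category_to_num is an illustrative name, and map() is used here in place of the hinted replace().

```python
import pandas as pd

# Hypothetical stand-in for the shuffled news_df from part 2a.
news_df = pd.DataFrame({
    "category": ["science", "politics", "misc", "computer", "religion", "recreational"],
})

# Mapping taken from the assignment text.
category_to_num = {
    "politics": 1,
    "recreational": 2,
    "computer": 3,
    "religion": 4,
    "science": 5,
    "misc": 6,
}

# map() looks each label up in the dict; a label missing from the dict
# would come back as NaN rather than being left unchanged.
news_df["category_num"] = news_df["category"].map(category_to_num)

print(news_df)
```

One reason to prefer map() in a sketch like this is that a misspelled or unexpected label shows up as NaN, so it is easy to spot before the asserts run.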
In [ ]: assert set(np.unique(news_df['category_num'])) == {1, 2, 3, 4, 5, 6}
assert sum(news_df['category_num'] == 2) == 3956

2c) Convert Text data into vector

