We need to convert the category labels to numerical labels. Use the following mapping to convert the values in category into numerical labels. Store the numeric values in a new column called category_num. • politics -> 1 • recreational -> 2 • computer -> 3 • religion -> 4 • science -> 5 • misc -> 6 Hint: you can use the replace() method from pandas. category = ['politics', 'recreational', 'computer', 'religion', 'science', 'misc'] news_df = pd.DataFrame({'news': news, 'category': category}) news_df

Question
2a) Shuffle Data
An important part of the data science process is to make sure that your dataset is not skewed or biased. Let us look at the various category values in
news_df for example.
Run the two cells below to look at the category breakdown.
In [26]: # A breakdown of the first 15000 values
news_df['category'][:15000].value_counts()
Out[26]:
computer        4389
recreational    3957
science         2754
politics        2079
religion        1439
misc             382
Name: category, dtype: int64
In [27]: # breakdown from 15000 to the end of df
news_df['category'][15000:].value_counts()
Out[27]:
science     1180
religion     973
misc         578
politics     546
computer     454
Name: category, dtype: int64
As you can see in the above results, there is no data point for the recreational category (numerical category 2) after the first 15000 data points in the current data. If we
were to split the data into training and test sets naively, we might end up with 0 data points for the recreational category in the test set, and our analysis would
then be biased.
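To make the problem concrete, here is a minimal sketch on a small, hypothetical DataFrame (toy_df and the split point are made up for illustration and are not the assignment's data). The rows are ordered so that one category appears only near the top, and a positional split then leaves that category out of the test portion entirely.

```python
import pandas as pd

# Hypothetical toy data: "recreational" only occurs in the first rows,
# mimicking the ordering problem described above.
toy_df = pd.DataFrame({
    "category": (
        ["recreational"] * 4
        + ["computer"] * 3
        + ["science"] * 2
        + ["computer"]
        + ["science"] * 2
    )
})

# Naive positional split: first 8 rows for training, the rest for testing.
split = 8
train = toy_df.iloc[:split]
test = toy_df.iloc[split:]

# Categories seen in training that never appear in the test portion.
missing = set(train["category"]) - set(test["category"])
print("Missing from test set:", sorted(missing))
print(test["category"].value_counts())
```

Shuffling the rows before splitting removes this dependence on the original row order.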
Randomly shuffle the dataframe using the sample() method in pandas such that the resulting dataset has the same number of rows as news_df but
randomly shuffled. Store this back into news_df.
Also, provide random_state=random_state as an argument to the method, to use the value defined in random_state at the start of the assignment.
And be sure to reset the indices in the resulting news_df, such that there is no index column in the resulting dataframe (see the drop
parameter in the reset_index() method).
In [28]: rs = 369
new_df = df.sample(frac=1, random_state=rs).reset_index(drop=True)
news_df.head()
NameError                                 Traceback (most recent call last)
<ipython-input-28-07c3689ee598> in <module>
      1 rs = 369
----> 2 new_df = df.sample(frac=1, random_state=rs).reset_index(drop=True)
      3 news_df.head()
NameError: name 'df' is not defined
In [ ]: assert news_df.shape == (18738, 6)
# if you fail the following assert
# go back and be sure you followed instructions
assert news_df.loc[0, 'subject'] == 'Re: Is 988-1MB/sec. HD transfer slow for 486Dx-58 EISA with Ultrastor24F
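The NameError in cell In [28] arises because no variable called df exists in the notebook; the DataFrame is named news_df. The cell also assigns the result to new_df, so even with the name fixed, news_df itself would stay unshuffled. Below is a minimal sketch of a corrected call. It uses a small hypothetical stand-in for news_df and assumes random_state holds the seed defined at the start of the assignment (369 is taken from the cell above and may not be the assignment's actual value).

```python
import pandas as pd

# Assumption: the assignment defines random_state near the top of the notebook.
random_state = 369

# Hypothetical stand-in for the real news_df, which is loaded earlier in the notebook.
news_df = pd.DataFrame({
    "subject": ["a", "b", "c", "d", "e", "f"],
    "category": ["computer", "computer", "recreational",
                 "science", "politics", "misc"],
})
n_rows_before = len(news_df)

# frac=1 samples every row exactly once (a full shuffle);
# reset_index(drop=True) renumbers rows 0..n-1 without keeping the old
# index as an extra column.
news_df = (
    news_df.sample(frac=1, random_state=random_state)
    .reset_index(drop=True)
)

print(news_df.head())
```

Without drop=True, reset_index() would add the old row labels as a column named index, which would change the number of columns that the shape assert checks.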
2b) Numerical Labels
We need to convert the category labels to numerical labels. Use the following mapping to convert the values in category into numerical labels. Store the
numeric values in a new column called 'category_num'.
• politics -> 1
• recreational -> 2
• computer -> 3
• religion -> 4
• science -> 5
• misc -> 6
Hint: you can use the replace() method from pandas.
In [29]: category = ['politics', 'recreational', 'computer', 'religion', 'science', 'misc']
news_df = pd.DataFrame({'news': news, 'category': category})
news_df
NameError                                 Traceback (most recent call last)
<ipython-input-29-a4639ead8fbB> in <module>
      1 category = ['politics', 'recreational', 'computer', 'religion', 'science', 'misc']
----> 2 news_df = pd.DataFrame({'news': news, 'category': category})
      3 news_df
NameError: name 'news' is not defined
In [ ]: assert set(np.unique(news_df['category_num'])) == {1, 2, 3, 4, 5, 6}
assert sum(news_df['category_num'] == 2) == 3956
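The NameError in cell In [29] comes from the undefined name news. More importantly, that cell would overwrite news_df with a new six-row frame instead of adding a column to the existing one. A minimal sketch of one possible approach follows: keep news_df as it is and derive category_num from its category column with a dictionary. The hint mentions replace(); this sketch uses Series.map instead, a swapped-in alternative that returns integers directly when every label has a key, and replace() accepts the same dictionary. The small news_df here is a hypothetical stand-in and category_to_num is an illustrative name.

```python
import pandas as pd

# Mapping taken from the task statement.
category_to_num = {
    "politics": 1,
    "recreational": 2,
    "computer": 3,
    "religion": 4,
    "science": 5,
    "misc": 6,
}

# Hypothetical stand-in for the real news_df.
news_df = pd.DataFrame({
    "category": ["computer", "misc", "politics", "recreational",
                 "religion", "science", "recreational"],
})

# Look up each label in the dictionary and store the result in a new column.
news_df["category_num"] = news_df["category"].map(category_to_num)

print(news_df)
```

One difference to keep in mind: map() yields NaN for any label missing from the dictionary, whereas replace() leaves such values unchanged, so a misspelled category shows up differently depending on which method is used.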
2c) Convert Text Data into Vector