We need to convert the category labels to numerical labels. Use the following mapping to convert the values in category into numerical labels. Store the numeric values in a new column called category_num.
• politics -> 1
• recreational -> 2
• computer -> 3
• religion -> 4
• science -> 5
• misc -> 6
Hint: you can use the replace() method from pandas.
category = ['politics', 'recreational', 'computer', 'religion', 'science', 'misc']
news_df = pd.DataFrame({'news': news, 'category': category})
news_df
2a) Shuffle Data
An important part of the data science process is to make sure that your dataset is not skewed or biased. Let us look at the various category values in
news_df for example.
Run the two cells below to look at the category breakdown.
In [26]: # A breakdown in first 15000 values
         news_df['category'][:15000].value_counts()
Out[26]: computer        4389
         recreational    3957
         science         2754
         politics        2079
         religion        1439
         misc             382
         Name: category, dtype: int64
In [27]: # breakdown from 15000 to the end of df
         news_df['category'][15000:].value_counts()
Out[27]: science     1180
         religion     973
         misc         578
         politics     546
         computer     454
         Name: category, dtype: int64
As you can see in the above results, there is no data point for the recreational category (numerical category 2) after the first 15000 data points in the current data. If we
were to split the data into training and test sets naively, we might end up with 0 data points for the recreational category in the test set, and thus our analysis would
remain biased.
Randomly shuffle the dataframe using the sample() method in pandas such that the resulting dataset has the same number of rows as news_df but
randomly shuffled. Store this back into news_df.
Also, provide random_state=random_state as an argument to the method, to use the value defined in random_state at the start of the assignment.
And be sure to reset the indices in the resulting news_df: do this such that there is no index column in the resulting dataframe (see the drop
parameter in the reset_index() method).
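To see why row order matters, here is a minimal sketch on a toy frame. The frame `toy_df`, its values, and the split position are made up for illustration and are not taken from the assignment data. It splits an ordered frame at a fixed position and checks which categories never reach the test portion.

```python
import pandas as pd

# Hypothetical toy data: rows are grouped by category, as in an unshuffled dataset.
toy_df = pd.DataFrame({
    "category": ["recreational"] * 4 + ["computer"] * 3 + ["science"] * 3
})

# Naive split: the first 6 rows for training, the remaining rows for testing.
train_counts = toy_df["category"][:6].value_counts()
test_counts = toy_df["category"][6:].value_counts()

# Categories that appear in the training portion but never in the test portion.
missing_in_test = set(train_counts.index) - set(test_counts.index)
print(missing_in_test)
```

Any category reported here would be invisible to an evaluation on the test portion, which is the situation the shuffle is meant to avoid.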
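The effect of the drop parameter can be checked on a small frame. This is only a sketch: `toy_df` and the seed 0 are assumptions chosen for illustration, not values from the assignment.

```python
import pandas as pd

# Hypothetical four-row frame standing in for the real data.
toy_df = pd.DataFrame({"subject": ["a", "b", "c", "d"]})

# frac=1 samples every row exactly once, so the result is a full shuffle.
shuffled = toy_df.sample(frac=1, random_state=0)

# Without drop, the old row labels are kept as a new column named 'index'.
with_index_col = shuffled.reset_index()
# With drop=True, the old row labels are discarded.
without_index_col = shuffled.reset_index(drop=True)

print(list(with_index_col.columns))
print(list(without_index_col.columns))
```

In both cases the row labels are renumbered from 0; only the first variant adds an extra column, which would change the shape of the frame.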
In [28]: rs = 369
         new_df=df.sample(frac=1, random_state=rs).reset_index(drop=True)
         news_df.head()
NameError                                 Traceback (most recent call last)
<ipython-input-28-07c3689ee598> in <module>
      1 rs = 369
----> 2 new_df=df.sample(frac=1, random_state=rs).reset_index(drop=True)
      3 news_df.head()
NameError: name 'df' is not defined
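The traceback reports that no variable called `df` exists. The instructions refer to the frame as `news_df`, and the attempt also stores its result in `new_df` rather than back into `news_df`. A minimal sketch of a corrected cell follows, assuming `news_df` is already loaded in the notebook. The small stand-in frame is hypothetical so that the snippet runs on its own, and 369 is simply the value from the attempt; the assignment asks for the `random_state` variable defined at its start.

```python
import pandas as pd

# Hypothetical stand-in for the real news_df, so the snippet is self-contained.
news_df = pd.DataFrame({
    "subject": ["s0", "s1", "s2", "s3", "s4"],
    "category": ["computer", "computer", "science", "politics", "misc"],
})
original_rows = len(news_df)

random_state = 369  # value from the attempt; the assignment defines its own

# Shuffle every row, then renumber the rows without keeping the old index as a column.
news_df = news_df.sample(frac=1, random_state=random_state).reset_index(drop=True)

print(news_df.head())
```

Assigning the result back to `news_df` matters because the later assert cells inspect `news_df`, not a separately named copy.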
In [ ]: assert news_df.shape == (18738, 6)
        # if you fail the following assert
        # go back and be sure you followed instructions
        assert news_df.loc[0, 'subject'] == 'Re: Is 988-1MB/sec. HD transfer slow for 486Dx-58 EISA with Ultrastor24F
2b) Numerical Labels
We need to convert the category labels to numerical labels. Use the following mapping to convert the values in category into numerical labels. Store the
numeric values in a new column called 'category_num'.
• politics -> 1
• recreational -> 2
• computer -> 3
• religion -> 4
• science -> 5
• misc -> 6
Hint: you can use the replace() method from pandas.
In [29]: category = ['politics', 'recreational', 'computer', 'religion', 'science', 'misc']
         news_df = pd.DataFrame({'news': news, 'category': category})
         news_df
NameError                                 Traceback (most recent call last)
<ipython-input-29-a4639ead8fbB> in <module>
      1 category = ['politics', 'recreational', 'computer', 'religion', 'science', 'misc']
----> 2 news_df = pd.DataFrame({'news': news, 'category': category})
      3 news_df
NameError: name 'news' is not defined
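The error arises because `news` was never defined. The task does not call for building a new frame at all, only for adding a column to the existing `news_df`. Below is a hedged sketch of one way to do that. The stand-in frame is hypothetical, and the sketch uses `Series.map` with a dictionary as a swapped-in alternative; the hint's `replace()` accepts the same dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real news_df, so the snippet is self-contained.
news_df = pd.DataFrame({
    "category": ["politics", "recreational", "computer",
                 "religion", "science", "misc", "recreational"]
})

# Mapping taken from the task statement.
category_to_num = {
    "politics": 1, "recreational": 2, "computer": 3,
    "religion": 4, "science": 5, "misc": 6,
}

# Look each label up in the dictionary and store the result as integers.
news_df["category_num"] = news_df["category"].map(category_to_num).astype(int)

print(news_df)
```

With `map`, a label that is missing from the dictionary becomes NaN, and the `astype(int)` call then raises an error, which makes a misspelled category name visible immediately.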
In [ ]: assert set(np.unique(news_df['category_num'])) == {1,2,3,4,5,6}
        assert sum(news_df['category_num'] == 2) == 3956
2c) Convert Text data into vector