Question
Question 2

This question is about word co-occurrences, collocations and distributional similarity.

Throughout this question, reference will be made to the sample of English stored in text1 (Lewis Carroll's Alice in Wonderland) - a sample of which is output below.

    ###Run this cell. Do not change the code in this cell
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.corpus import gutenberg

    def get_rawtext(filename='carroll-alice.txt'):
        text = gutenberg.raw(filename)
        return text

    def get_text(filename='carroll-alice.txt'):
        text = gutenberg.raw(filename)
        sentences = sent_tokenize(text)
        tokenized = [word_tokenize(sent.lower()) for sent in sentences]
        normalised = [["Nth" if (token.endswith(("nd", "st", "th")) and token[:-2].isdigit()) else token for token in sent] for sent in tokenized]
        normalised = [["NUM" if token.isdigit() else token for token in sent] for sent in normalised]
        filtered = [[word for word in sent if word.isalpha()] for sent in normalised]
        return filtered

    text1 = get_text()
    text1[:10]

a) Explain what each step in the get_text() function does,
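As a rough illustration of the last three steps of get_text(), the sketch below applies the same list comprehensions to a small hand-written input. The helper name `normalise_and_filter` and the sample tokens are assumptions made for this example; they stand in for the output of `word_tokenize(sent.lower())` so the snippet runs without downloading any NLTK data, and they are not taken from the actual corpus.

```python
# Hypothetical, hand-tokenized sentences standing in for the lowercased
# output of word_tokenize; they are not drawn from carroll-alice.txt.
tokenized = [
    ["alice", "first", "read", "it", ",", "on", "the", "4th", "of", "may", "1865", "."],
    ["the", "3rd", "chapter", "!"],
]


def normalise_and_filter(tokenized):
    """Apply the same three comprehensions that get_text() uses after tokenizing."""
    # Step 1: digits followed by "nd", "st" or "th" (e.g. "4th") become "Nth".
    # A word like "first" is untouched because "fir" is not a digit string.
    normalised = [
        ["Nth" if (token.endswith(("nd", "st", "th")) and token[:-2].isdigit()) else token
         for token in sent]
        for sent in tokenized
    ]
    # Step 2: tokens made up only of digits (e.g. "1865") become "NUM".
    normalised = [
        ["NUM" if token.isdigit() else token for token in sent]
        for sent in normalised
    ]
    # Step 3: keep purely alphabetic tokens; punctuation is dropped, and so is
    # "3rd", since "rd" is not one of the suffixes checked in step 1.
    filtered = [[word for word in sent if word.isalpha()] for sent in normalised]
    return filtered


result = normalise_and_filter(tokenized)
print(result)
```

The placeholders "Nth" and "NUM" survive the final filter because they are alphabetic strings, whereas mixed tokens that were not replaced, such as "3rd", are removed along with punctuation.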