# imports (the NLTK corpora 'stopwords' and 'wordnet' must already be downloaded)
import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import TreebankWordTokenizer
from nltk.corpus import stopwords

# define stemmer
stemmer = SnowballStemmer('english')

# tokenise data ('data' is the raw text loaded in an earlier cell)
tokeniser = TreebankWordTokenizer()
tokens = tokeniser.tokenize(data)

# define lemmatiser
lemmatizer = WordNetLemmatizer()
# bag of words
def bag_of_words_count(words, word_dict=None):
    """Take a list of words and return a dictionary with each word as a
    key and the number of times that word appeared as the value."""
    if word_dict is None:  # avoid the mutable-default-argument pitfall
        word_dict = {}
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1
    return word_dict
# remove stopwords
tokens_less_stopwords = [word for word in tokens if word not in stopwords.words('english')]
# create bag of words
bag_of_words = bag_of_words_count(tokens_less_stopwords)
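As a sanity check, the hand-rolled counter above behaves like `collections.Counter`. The toy token list below is made up for illustration; the function body mirrors the definition above (with a safe default argument):

```python
from collections import Counter

def bag_of_words_count(words, word_dict=None):
    # mirrors the definition above, with a None default instead of a shared dict
    if word_dict is None:
        word_dict = {}
    for word in words:
        word_dict[word] = word_dict.get(word, 0) + 1
    return word_dict

toy_tokens = ['natural', 'language', 'processing', 'language']
assert bag_of_words_count(toy_tokens) == dict(Counter(toy_tokens))
```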
Use the stemmer and lemmatizer functions (defined in the cells above) from the relevant library to find the stem and lemma of the nth word in the token list.
Function Specifications:
- Should take a list as input and return a dict type as output.
- The dictionary should have the keys 'original', 'stem' and 'lemma', with the corresponding values being the nth word transformed in that way.
How many stopwords are in the text in total?
Hint: you can use the NLTK stopwords list.
Function Specifications:
- Should take a list as input.
- The number of stopwords should be returned as an int.