Homework 2: n-gram language models and text classification
Tanmay Agarwal
Due date: October 10, 2023

For this assignment you will need to answer 10 QA-style questions. Each question is worth 10 points.

1. What is the purpose of tokenizing text into sentences and words?

Answer: Tokenization is a foundational step in natural language processing (NLP) and text analysis. It segments raw text into sentences or individual words (tokens), which can then be processed, analyzed, and understood by computers and algorithms. Key uses of tokenization include text understanding, feature extraction, text cleaning, text preprocessing, and counting and statistics. The resulting words or sentences are then used as features to build models with machine learning algorithms such as the Naive Bayes classifier or logistic regression, for applications such as sentiment analysis.
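A minimal sketch of sentence and word tokenization, assuming the NLTK library is installed and its Punkt tokenizer models have been downloaded (the sample text is made up for illustration):

```python
# A minimal tokenization sketch using NLTK (assumes: pip install nltk).
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models (newer NLTK versions may need "punkt_tab")

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization splits text into units. Sentences first, then words."

sentences = sent_tokenize(text)                 # segment into sentences
words = [word_tokenize(s) for s in sentences]   # segment each sentence into words

print(sentences)
# ['Tokenization splits text into units.', 'Sentences first, then words.']
print(words[0])
# ['Tokenization', 'splits', 'text', 'into', 'units', '.']
```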
2. What are n-gram language models and how are they useful in NLP?

Answer: N-gram language models are statistical models used in NLP to predict the probability of a word given the context of the previous n-1 words. The "n" in an n-gram model is the number of words considered in each context. N-grams are widely used in NLP for text prediction, text generation, and language understanding. Some of their use cases are listed below:

a. They are helpful in text generation, since they can predict the likelihood of a word or a sequence of words occurring in a sentence.
b. A very popular use case of n-gram models is machine translation. They help determine the probabilistic relationship between words in the source and target languages, which helps select the best translation of a sentence.
c. Another widely used application is suggestion or autocompletion of sentences: the model determines the word most likely to follow the previous set of words.
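A minimal bigram (n = 2) model sketch in plain Python, over a made-up toy corpus; the probabilities are maximum-likelihood estimates with no smoothing:

```python
# A minimal bigram language model sketch (toy corpus, no smoothing).
# P(w_i | w_{i-1}) is estimated as count(w_{i-1} w_i) / count(w_{i-1}).
from collections import Counter, defaultdict

corpus = [
    "i love natural language processing",
    "i love machine learning",
    "i study language models",
]

context_counts = Counter()
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        context_counts[prev] += 1
        bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate of P(curr | prev)."""
    return bigram_counts[prev][curr] / context_counts[prev] if context_counts[prev] else 0.0

print(bigram_prob("i", "love"))        # 2/3: "love" follows "i" in 2 of 3 sentences
print(bigram_prob("love", "machine"))  # 1/2
```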
3. What is the naive Bayes assumption and how does it relate to text classification?

Answer: Naive Bayes is a generative classifier: the model learns how the features of each class are distributed during training, and assigns a probability to a new object based on how likely its features are under each class. The naive Bayes assumption is that the features used to classify the data are conditionally independent, given the class label. In the context of text classification, this assumption translates into the idea that the occurrence of each word (or term) in a document is independent of the occurrences of any other word, given the class label. Formally, P(c | w1, ..., wn) is proportional to P(c) * P(w1 | c) * ... * P(wn | c).

4. What are some of the advantages and disadvantages of naive Bayes classifiers compared to logistic regression?

Answer:
Advantages:
a. The biggest advantage of the naive Bayes algorithm for NLP applications is that it is very simple and easy to implement.
b. It is also computationally efficient: it requires few resources and can be trained quickly, making it suitable for large datasets and real-time applications.
c. The naive Bayes algorithm can handle high-dimensional data quite easily.

Disadvantages:
a. One major disadvantage of the naive Bayes classifier is its assumption that each word in the text is independent of the others, which is often not the case in real text. This can lead to suboptimal performance in cases where word order and context matter significantly.
b. Naive Bayes models are relatively simple and thus sometimes unable to capture complex relationships or semantic nuances in the text that require a deeper understanding of the language. For example, a "." can be a full stop but can also be a decimal point, a distinction the model cannot always make, leading to errors in the predicted values.
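A minimal sketch, assuming scikit-learn is installed, that trains a multinomial naive Bayes classifier on toy bag-of-words counts (the tiny dataset is made up for illustration):

```python
# A minimal naive Bayes text-classification sketch using scikit-learn
# (assumes scikit-learn is installed; the toy dataset is made up).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "the rocket launch was delayed",       # space
    "nasa announced a new space mission",  # space
    "the team scored in the final game",   # sports
    "the players won the championship",    # sports
]
labels = ["space", "space", "sports", "sports"]

# Bag-of-words counts; each word is treated as an independent feature,
# which mirrors the naive Bayes conditional-independence assumption.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

clf = MultinomialNB()
clf.fit(X, labels)

test = vectorizer.transform(["the space mission was a success"])
print(clf.predict(test))        # expected: ['space']
print(clf.predict_proba(test))  # per-class probabilities
```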
5. What do we mean by "features" in the context of text classification? Give some examples of features that might be useful for distinguishing different newsgroup topics.

Answer: Features are the set of attributes used to train a machine learning model to classify text into one or more classes. Features are essential for training because they help the model distinguish different characteristics of the text. There are many ways to create features that might be useful for distinguishing different newsgroup topics. Some of them are listed below:

a. Bag-of-words: we can create a vocabulary of tokenized words from the given set of texts on various newsgroup topics and use each tokenized word as an independent feature to classify different newsgroups.
b. N-gram features: we can use bigram and trigram models, and the frequency of specific n-grams, to predict different newsgroups. For example, the bigram "space exploration" can be used as a feature to predict a newsgroup related to space research.
c. TF-IDF (term frequency-inverse document frequency): TF-IDF scores portray the relevance of certain terms, which serve as characteristics that identify different documents. This can be used to classify different sets of newsgroups.

6. What is the purpose of a test set in machine learning? Why do we need separate training and test sets?

Answer: The purpose of a test set in machine learning is to apply the trained model to data it has never encountered. We need separate training and test sets because each has a distinct purpose: the training set is used to fit the model so that it gives accurate results on similar data, while the test set is used to evaluate the model's performance with appropriate metrics that measure the gap between predicted and actual outcomes.

7. What metrics could you use to evaluate the performance of a text classification model? Define accuracy and any other relevant metrics.

Answer: For evaluating a text classification model, we often prefer precision or recall over accuracy. If the dataset is biased towards one particular class, then even a mediocre classifier will achieve high accuracy simply by identifying the majority class correctly, even if it misclassifies the other class or classes.

Accuracy: the ratio of correct predictions to the total number of items in the dataset, (TP + TN) / (TP + TN + FP + FN).
Precision: the ratio of true positive predictions to all positive predictions made by the model, TP / (TP + FP).
Recall: also called the true positive rate; the ratio of true positives to all actual positives in the dataset, TP / (TP + FN).
F1 score: the harmonic mean of precision and recall, 2 * (precision * recall) / (precision + recall).

8. How could you determine which features are most important or indicative for a logistic regression text classification model?

Answer: To determine which features are most important or indicative for a logistic regression text classification model, we can do the following (a combined sketch covering questions 5-8 follows this list):

a. First, create a bag-of-words representation of all the tokenized words to build a vocabulary.
b. Use this representation as the feature set, together with the labeled target variable, to train the logistic regression model.
c. After training the model, inspect the coefficients associated with each feature and sort them. These coefficients indicate how strongly each word influences the classification.
d. The magnitude of each coefficient tells us the strength of the word's influence, and its sign (positive or negative) tells us the direction of that influence.
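A minimal end-to-end sketch tying together questions 5-8, assuming scikit-learn and a made-up toy dataset: it extracts TF-IDF features, splits the data into training and test sets, trains logistic regression, reports metrics, and sorts the learned coefficients:

```python
# A minimal end-to-end sketch for questions 5-8 (assumes scikit-learn;
# the tiny dataset is made up, so the numbers are only illustrative).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

docs = [
    "nasa launched a rocket into orbit", "the space probe reached mars",
    "astronauts boarded the space station", "the satellite entered orbit",
    "the team won the final game", "the striker scored two goals",
    "fans cheered the championship match", "the players trained all season",
]
labels = ["space"] * 4 + ["sports"] * 4

# Q5: TF-IDF features; each vocabulary term becomes one feature column.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Q6: hold out a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Q7: evaluate with accuracy, precision, recall, and F1.
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Q8: sort coefficients to find the most indicative words. Negative
# weights push towards the first class, positive weights towards the second.
terms = vectorizer.get_feature_names_out()
order = np.argsort(clf.coef_[0])
print("most indicative of", clf.classes_[0], ":", terms[order[:3]])
print("most indicative of", clf.classes_[1], ":", terms[order[-3:]])
```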
9. What is the bag-of-words representation and what are some of its limitations for text classification?

Answer: The bag-of-words representation is a simple yet effective way to convert text data into a numerical format that can be used in machine learning and text analysis tasks. In the bag-of-words model, a text document is tokenized into words that are treated independently of each other. The key idea is to create a vocabulary of words from the set of documents and then represent each document as a vector of word frequencies over this vocabulary. This representation is widely used in text classification applications such as spam detection, sentiment analysis, and topic categorization. Some of its limitations are:

a. Bag-of-words does not consider the order of the words in the text but treats them as independent. As a result, the context of a sentence is lost.
b. The vocabulary can be large, greatly increasing the number of features and producing very high-dimensional data. Algorithms that use the bag-of-words representation may therefore require substantial computation to train.
c. Because the vocabulary is fixed, any tokens outside the vocabulary are ignored, which can result in wrong outputs.

10. What is overfitting in machine learning? How could you tell if your text classification model is overfitting the training data? Describe two ways to reduce overfitting.

Answer: Overfitting happens when a model fits every data point in the training dataset so closely that the error on the training data is negligible, but the model fails to generalize: when it predicts the target variable on the test dataset, the error is much larger, because the model is tuned to the training data rather than to the underlying pattern. We can tell that a model is overfitting if the error on the training dataset is negligible while the error on the test dataset is large; a minimal version of this check is sketched after the list below.

There are various techniques to reduce overfitting of machine learning models:
a. Hyperparameter tuning: hyperparameters such as the learning rate or batch size control how the model is trained on the training dataset. Altering these values changes how closely the model fits the data; in particular, increasing the regularization strength penalizes extreme weights and makes the model more general.
b. Data augmentation: a popular and effective technique for making the model more general by enlarging the training data. The exact techniques depend on the type of data and the application. Some ways to augment data are:
   i. Upsampling or downsampling the data of a particular class.
   ii. If the data consists of images, creating more data by tilting the images by a few degrees so that the model is trained on a new and more diverse dataset.
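A minimal sketch of the train-versus-test comparison described above, assuming scikit-learn and synthetic made-up data; a training score far above the test score is the classic symptom of overfitting, and raising regularization (lowering C in LogisticRegression) is one way to shrink the gap:

```python
# A minimal overfitting check: compare training accuracy with test
# accuracy (assumes scikit-learn; the synthetic data is made up).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic high-dimensional data with few samples, which invites overfitting.
X, y = make_classification(n_samples=60, n_features=500, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

for C in (100.0, 0.1):  # large C = weak regularization, small C = strong
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    # A training score far above the test score signals overfitting;
    # stronger regularization (smaller C) usually narrows the gap.
    print(f"C={C}: train={train_acc:.2f} test={test_acc:.2f}")
```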