Homework 6C

University of Arkansas, Computer Science 4143
Dec 6, 2023
Homework 5 Problem 3

1. In a decision tree, the feature used for the first split at the root node is typically chosen based on:
   (a) Random selection
   (b) Feature with the highest information gain
   (c) Feature with the lowest entropy
   (d) Feature with the highest standard deviation

2. A decision tree with greater depth and more complex splits is likely to:
   (a) Overfit the training data
   (b) Underfit the training data
   (c) Generalize well to unseen data
   (d) Perform better only on categorical data

3. What is the maximum depth of a decision tree with 7 leaf nodes? (The depth of a decision tree is defined as the length of the longest path from the root node to a leaf node in the tree.)
   (a) 6
   (b) 7
   (c) 8
   (d) It depends on the dataset

4. What is the primary advantage of using decision trees for classification and regression?
   (a) Simplicity and interpretability
   (b) Computational efficiency
   (c) Ability to handle missing data
   (d) High resistance to overfitting
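For reference on question 1, the sketch below (an illustration, not part of the original assignment; the entropy and information_gain helper names are my own) shows how information gain is computed for a candidate split; the root split is the one that maximizes this quantity. For question 3, note that a tree with L leaves reaches its maximum depth, L − 1, when it degenerates into a chain, so 7 leaves allow a depth of 6.

    import numpy as np

    def entropy(labels):
        # Shannon entropy (in bits) of a 1-D array of class labels.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(labels, left_mask):
        # Entropy reduction achieved by splitting `labels` by a boolean mask.
        n = len(labels)
        left, right = labels[left_mask], labels[~left_mask]
        child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        return entropy(labels) - child

    # Toy labels: a perfect split gains 1 bit; an uninformative split gains 0.
    y = np.array([0, 0, 0, 1, 1, 1])
    perfect = np.array([True, True, True, False, False, False])
    useless = np.array([True, True, False, False, True, True])
    print(information_gain(y, perfect))  # 1.0
    print(information_gain(y, useless))  # 0.0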
Problem 4

The titanic train.csv dataset on the Blackboard Learn system contains information about passengers who were on the Titanic. We would like to train a logistic regression classifier to predict whether a passenger survived. Detailed information about this dataset can be found at https://www.kaggle.com/competitions/titanic/data. Complete the following tasks:

1. Download the dataset, load it into Python, and preview it. Then, extract features and labels from the dataset, and describe how you identified the features and what they are:

Features: Passenger Class (Pclass), Sex, Age, Sibling/Spouse (SibSp), Parent/Child (Parch), Fare, and Embarked.
Label: Survived.

2. Perform appropriate preprocessing to make the data suitable for logistic regression. Please describe your data preprocessing steps in your submission. (Hint: To deal with categorical variables, please check out sklearn.preprocessing.LabelEncoder. To remove NaNs, please use df.dropna().)

Data preprocessing:

df.dropna(): Before calling df.dropna(), I first used df.isna().sum() to display how many missing values the data has per feature. I then used df.dropna() to remove the rows with missing values, which could otherwise distort the trained model's performance.

sklearn.preprocessing.LabelEncoder: I used LabelEncoder to transform the categorical columns during preprocessing. The encoder is fitted to the given labels and transforms them into numerical representations via fit_transform(). The output includes the original category labels as well as the encoded numerical values produced by the LabelEncoder.
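A minimal sketch of these preprocessing steps, assuming the Kaggle column names listed above (the file name titanic_train.csv and the choice to encode only Sex and Embarked are assumptions, not details from the original submission):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # Load the Titanic training data and preview it (file name assumed).
    df = pd.read_csv("titanic_train.csv")
    print(df.head())

    # Show how many missing values each column has before dropping them.
    print(df.isna().sum())

    # Keep the chosen features plus the label, then drop rows containing NaNs.
    cols = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Survived"]
    df = df[cols].dropna()

    # Encode the categorical columns as integers.
    for col in ["Sex", "Embarked"]:
        df[col] = LabelEncoder().fit_transform(df[col])

    X = df.drop(columns="Survived")
    y = df["Survived"]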
3. Split the entire dataset into a training dataset and a test dataset of reasonable sizes. Train a logistic regression model using the training dataset and evaluate its performance on the test set. In your submission, please attach the generated classification report.

Classification report with precision, recall, f1-score, and support:
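A sketch of how the split, the model, and the report could be produced, continuing from the preprocessing sketch above (the 80/20 split, random_state=42, and max_iter=1000 are assumptions, not values taken from the original submission):

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    # Hold out 20% of the rows for testing; the assignment leaves the ratio open.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # max_iter is raised so the solver converges on the unscaled features.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Prints per-class precision, recall, f1-score, and support.
    print(classification_report(y_test, model.predict(X_test)))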
Problem 5

Answer the following questions.

1. In the figure below, you may see two classes (blue circles and green squares), a separating line (solid line), and two margins (dashed lines). Which of the data points are called support vectors?

Points 5 and 7 are the support vectors; they lie on the margins and are equidistant from the separating hyperplane.

2. Which of the following statements are true? Select all answers that are true.
   (a) For a p-dimensional feature space, there exists a p-dimensional "plane" that cuts the feature space into two halves.
   (b) For a p-dimensional feature space, there exists a (p − 1)-dimensional "plane" that cuts the feature space into two halves.
   (c) If a hyperplane, β₀ + β₁x₁ + β₂x₂ = 0, divides the 2-dimensional feature space into two halves, the upper half is represented by β₀ + β₁x₁ + β₂x₂ < 0.
   (d) If a hyperplane, β₀ + β₁x₁ + β₂x₂ = 0, divides the 2-dimensional feature space into two halves, the lower half is represented by β₀ + β₁x₁ + β₂x₂ < 0.
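To see which half-plane carries which sign in statements (c) and (d), one can simply evaluate β₀ + β₁x₁ + β₂x₂ at sample points. A minimal sketch with made-up coefficients (the line x₂ = x₁, i.e. β₀ = 0, β₁ = −1, β₂ = 1, is purely illustrative):

    import numpy as np

    # Hypothetical hyperplane x2 = x1, i.e. beta0 = 0, beta1 = -1, beta2 = 1.
    beta0, beta1, beta2 = 0.0, -1.0, 1.0

    def side(x1, x2):
        # The sign of beta0 + beta1*x1 + beta2*x2 says which half-plane (x1, x2) is in.
        return np.sign(beta0 + beta1 * x1 + beta2 * x2)

    print(side(0.0, 1.0))  #  1.0: a point above the line (upper half)
    print(side(1.0, 0.0))  # -1.0: a point below the line (lower half)

For this choice of coefficients the upper half is positive and the lower half negative; negating all three βs describes the same line but swaps the signs, so which inequality names which half depends on the sign convention of the coefficients.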
Figure 1: The figure for Question 3 in Problem 5.

3. Which of the following statements are true? Select all answers that are true.
   (a) In Figure 1, the "margin" is given by the line A.
   (b) In Figure 1, the "margin" is given by the line B.
   (c) The maximal margin classifier minimizes the margin of the separating hyperplane.
   (d) The support vectors must have equal distances from the maximal margin hyperplane.

4. In Figure 1, if the Maximal Margin Classifier can be found, which data points are most likely to become the support vectors?

Points 1 and 6 are most likely to become the support vectors because they are the closest to the separating hyperplane.

5. Which of the following statements is true?
   (a) The Maximal Margin Classifier is not robust against adding or deleting data points.
   (b) The Maximal Margin Classifier may not always exist.
   (c) The Support Vector Classifier allows some points to be on the wrong side of the margin, but not on the wrong side of the separating hyperplane.
   (d) The Support Vector Classifier allows some points to be on the wrong side of the margin, or on the wrong side of the separating hyperplane.
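A minimal sklearn sketch on made-up, linearly separable data showing how the support vectors of a (near-)maximal margin classifier can be read off; with a very large C, SVC approximates the hard-margin classifier. The data points here are invented for illustration and are not the points in Figure 1:

    import numpy as np
    from sklearn.svm import SVC

    # Made-up, linearly separable toy data: three points per class.
    X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # class 0
                  [4.0, 4.0], [4.5, 3.0], [5.0, 4.5]])  # class 1
    y = np.array([0, 0, 0, 1, 1, 1])

    # A very large C approximates the hard-margin (maximal margin) classifier.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    print(clf.support_vectors_)        # the training points lying on the margins
    print(clf.coef_, clf.intercept_)   # hyperplane coefficients and intercept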