Challenge: Improve Your Spam Classifier [Ungraded]

You can improve your classifier in two ways:

Feature Extraction: Modify the function extract_features_challenge(). This function takes in an email (a list of lines in an email) and a feature dimension B, and should output a feature vector of dimension B. The autograder will pass in both arguments. The naive feature extraction from above is provided as an example.

Model Training: Modify the function train_spam_filter_challenge(). This function takes in training data xTr, yTr and should output a weight vector w and a bias term b for classification. Predictions are computed exactly as demonstrated in the previous cell. An initial implementation using gradient descent and logistic regression is provided.

Your model will be trained on the same training set as above (loaded by load_spam_data()), but its accuracy will be measured on a test set of emails that is hidden from you.
feature_dimension = 512

def extract_features_challenge(email, B=feature_dimension):
    '''
    Returns a vector representation for email. The vector is of length B.

    Input:
        email: list of lines in an email
        B: number of dimensions of output vector

    Output:
        B-dimensional vector
    '''
    # initialize all-zeros feature vector
    v = np.zeros(B)
    email = ' '.join(email)
    # naive whitespace tokenization; breaks for non-ascii characters
    tokens = email.split()
    for token in tokens:
        v[hash(token) % B] = 1
    # YOUR CODE HERE
    return v
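One direction worth exploring is replacing the binary presence features above with hashed token counts. Two caveats about the starter code motivate this sketch: the built-in `hash()` is randomized for strings across Python processes (via `PYTHONHASHSEED`), so features need not be reproducible between runs, and setting `v[...] = 1` discards how often a token appears. The sketch below is a minimal alternative under those assumptions; the name `extract_features_counts` and the choice of md5 hashing and log-scaled counts are mine, not part of the assignment.

```python
import hashlib
import numpy as np

def extract_features_counts(email, B=512):
    """Hashed bag-of-words using term counts instead of 0/1 presence.

    Uses a stable hash (md5) so the token -> bucket mapping is identical
    across Python processes, unlike the string-randomized built-in hash().
    """
    v = np.zeros(B)
    text = ' '.join(email).lower()
    for token in text.split():
        # map the token deterministically to one of B buckets
        idx = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16) % B
        v[idx] += 1  # count occurrences rather than mark presence
    # log-scale counts so very frequent tokens do not dominate the vector
    return np.log1p(v)
```

This keeps the same signature shape as extract_features_challenge(), so it could be dropped in directly if this direction pans out on the self-test.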
-------
def train_spam_filter_challenge(xTr, yTr):
    '''
    Train a model on training data xTr and labels yTr, and return a weight vector and bias term.

    Input:
        xTr: data matrix of shape n x d
        yTr: n-dimensional vector of data labels (+1 or -1)

    Output:
        w, b
        w: d-dimensional weight vector
        b: scalar bias term
    '''
    n, d = xTr.shape
    max_iter = 100
    alpha = 1e-5
    w, b, losses = logistic_regression_grader(xTr, yTr, max_iter, alpha)
    # YOUR CODE HERE
    return w, b
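If you want to tune the training beyond the two hyperparameters passed to logistic_regression_grader() (whose internals are not shown here), one option is to write your own gradient-descent loop. The sketch below is a minimal full-batch logistic regression for +1/-1 labels; the function name `train_logistic_gd`, the added L2 penalty `lmbda`, and the default hyperparameters are my assumptions, not the grader's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_gd(xTr, yTr, max_iter=1000, alpha=0.1, lmbda=0.0):
    """Minimal logistic regression trained with full-batch gradient descent.

    Assumes labels in {+1, -1} and minimizes the mean logistic loss
    log(1 + exp(-y * (x.w + b))) plus an optional L2 penalty lmbda*||w||^2.
    """
    n, d = xTr.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(max_iter):
        margins = yTr * (xTr @ w + b)
        # derivative of log(1 + exp(-m)) w.r.t. the margin m is -sigmoid(-m)
        coeff = -yTr * sigmoid(-margins)          # shape (n,)
        grad_w = xTr.T @ coeff / n + 2 * lmbda * w
        grad_b = coeff.mean()
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b
```

With your own loop you can experiment with a larger learning rate than the starter alpha=1e-5, more iterations, or regularization, and return the resulting (w, b) from train_spam_filter_challenge().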
------
def challenge_selftest():
    xTr, yTr, cTr = load_spam_data(extract_features_challenge, feature_dimension, train_url)
    w, b = train_spam_filter_challenge(xTr, yTr)
    xTe, yTe, cTe = load_spam_data(extract_features_challenge, feature_dimension, test_url)
    scoresTe = sigmoid_grader(xTe @ w + b)
    preds = (scoresTe > 0.5).astype(int)
    preds[preds != 1] = -1
    pos_ind = (yTe == 1)
    neg_ind = (yTe == -1)
    pos_acc = np.mean(yTe[pos_ind] == preds[pos_ind])
    neg_acc = np.mean(yTe[neg_ind] == preds[neg_ind])
    # balanced accuracy: average of the per-class accuracies
    test_accuracy = 0.5 * pos_acc + 0.5 * neg_acc
    scoresTr = sigmoid_grader(xTr @ w + b)
    preds_Tr = (scoresTr > 0.5).astype(int)
    preds_Tr[preds_Tr != 1] = -1
    training_accuracy = np.mean(preds_Tr == yTr)
    return training_accuracy, test_accuracy

training_acc, test_acc = challenge_selftest()
print("Your features and model achieved training accuracy: {:.2f}% and test accuracy: {:.2f}%".format(training_acc * 100, test_acc * 100))