Assignment#2 111123

Department of Electrical & Computer Engineering (ECE), Concordia University
APPLIED MACHINE LEARNING & EVOLUTIONARY ALGORITHMS
COEN 432 (6321): Fall 23/24
Assignment 2 (Project): Prediction of Ribozyme Activity
(deadline: 1 Dec @ 23hr55 via Moodle)

I. Problem Description.

Your program shall accept data from a file of instances. It is up to you to correctly use these instances (or a large enough subset of them) to train and test (validate) one machine learning (ML) algorithm of your choice (such as an EA, kNN, or a Decision Tree), one that you think is appropriate for the problem at hand, a problem distinguished by:
(1) Instances with features that include both categorical and rational (numerical) values, and a target variable whose value is in [0, 1];
(2) A large search space due to the large number of features;
(3) A large number of instances with target variable values = 0 or ~ 0;
(4) A complex relationship between the values of the features and the value of the target variable.

Each instance has the form of an RNA sequence (of a hammerhead ribozyme), followed by a (linearized form of its) base-pairing probability matrix, or BPPM, and concluding with an activity level (in [0, 1]). All instances are of equal length and there are no missing values. However, all variables suffer from a small amount of Gaussian noise.

The computation of validation accuracy must be done on no fewer than 10000 instances in total, using 10-fold cross-validation. The results must compare validation accuracy to training accuracy in order to identify cases of under- or over-fitting.

The dataset format is: number; sequence; 60x60 matrix (converted to list form); activity level\n

II. Description of Ideal Solution.
An ideal solution is an ML algorithm (implemented as a stand-alone program) that, upon independent testing (by the marker) using random subsets of the file of instances, returns very low RSS (residual sum of squares, i.e., sum of squared errors); that is, it predicts the activity level of unseen ribozymes with a high degree of accuracy, on average. This entails that your program must be runnable by the marker, who should be able to (1) inspect your code for correct implementation and in-line documentation, (2) run your program with ease, and (3) train it on a number of instances, randomly chosen from the original (large) instances file, then assess the accuracy of the trained ML model on other instances from the same instances file.

III. Submission Instructions.

(For undergraduates) You must submit a ZIP file that includes one folder named “Assignment2”. In this folder, you should have one brief report (PDF) and one folder containing your program.

(For graduates) You must submit a ZIP file that includes one folder named “Assignment2”. In this folder, you should have one brief report (PDF), one project report (PDF), and one folder containing your program.

Name your ZIP file exactly as “Assignment2”. ONLY SUBMIT ONE ZIP FILE PER INDIVIDUAL/TEAM. The program must be in Python (preferable), C++, or Java. Place the names and IDs of all team members on the first line of each file (commented). The program must be able to read any number of instances (for training purposes) from the original instances file, train the ML model, and test it on other instances.

Based on your ML results, you need to write a brief report. The report must contain one paragraph of analysis and graphs (consider using two methods among Precision, Recall, F1 Score, and ROC-AUC as evaluation metrics; reference: https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec).

If the submission is in Python, include all your .py files in a folder called “Ass2Python”. If the submission is in C++, you must submit your .cpp and header files; include them in a single folder named “Ass2C++”. If the submission is in Java, include all your .java files in a folder called “Ass2Java”.

IV. Marking Scheme.

If the submitted program does not run, then A2/Project won’t be marked (you’ll receive 0); if it runs, then A2/Project will be marked using the following criteria as a guide:

Correct Implementation of Machine Learning Algorithm + Validation Accuracy Methodology (30 + 30 = 60 points)
    Excellent (100%): Perfectly correct implementation and evaluation (no errors)
    Good (80%): Minor errors that do not invalidate the algorithm or its results
    Satisfactory (60%): Significant errors that do compromise the correctness of the ML or its results
    Unsatisfactory (< 40%): Unsatisfactory implementation and/or invalid assessment of results

In-line Documentation (20 points)
    Excellent (100%): Very clear
    Good (80%): Sufficient
    Satisfactory (60%): Unclear
    Unsatisfactory (< 40%): Missing

Software Usability (20 points)
    Excellent (100%): Easy
    Good (80%): Minor challenges
    Satisfactory (60%): Requires a special effort to run
    Unsatisfactory (< 40%): Unusable (by a non-techie)

V. Graduate Students (additional requirements and report).

In addition to the deliverables required for A2, you are required to carry out an evolutionary algorithm (EA) based feature selection experiment OR a hyper-parameter optimization experiment (via grid search), for the purpose of optimizing the generalization performance of your ML model. You need to show that you have in fact been able to improve validation accuracy using one of these two approaches (feature selection or parameter optimization).
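For illustration only, the grid-search option mentioned above might be sketched as follows. This is a minimal sketch, not a required or endorsed implementation: the function name `grid_search`, the `validation_error` callback, and the toy parameters `k` and `weights` are all hypothetical, and the callback is assumed to return a mean cross-validated error where lower is better.

```python
from itertools import product


def grid_search(param_grid, validation_error):
    """Evaluate every parameter combination; return (best_params, results).

    param_grid: dict mapping a parameter name to a list of candidate values.
    validation_error: hypothetical callable that takes a dict of parameters
        and returns a validation error (lower is better), e.g. the mean
        error over 10 cross-validation folds.
    """
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, validation_error(params)))
    # The best combination is the one with the smallest validation error.
    best_params, _ = min(results, key=lambda item: item[1])
    return best_params, results


if __name__ == "__main__":
    # Toy stand-in for a real cross-validation run (an assumption made
    # purely so that the sketch can be executed on its own).
    def toy_error(p):
        penalty = 0.0 if p["weights"] == "distance" else 1.0
        return (p["k"] - 5) ** 2 + penalty

    grid = {"k": [1, 3, 5, 7], "weights": ["uniform", "distance"]}
    best, table = grid_search(grid, toy_error)
    print("best:", best)
    for params, err in table:
        print(params, err)
```

The `results` list holds one entry per combination, so it can be laid out as the grid of validation errors that a parameter-optimization report would typically show.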
As such, your deliverables must also include a short report (one PDF file) of at most 3-4 pages, comprising the following sections:
(1) Problem description: a one-paragraph description of the problem.
(2) Methods: a high-level flowchart of the two processes of model building (training) and model evaluation (computation of validation accuracy), with pseudo-code descriptions of the main parts of the flowchart. See https://www.codecademy.com/article/pseudocode-and-flowcharts. Do not copy and paste parts of your code. 1-2 pages are sufficient.
(3) Results & Conclusions: presentation of the results, the original training and validation accuracies, and a figure that exhibits either (a) the temporal progress of best fitness (= validation accuracy) of the population of machine learning models (representing different feature sets) or (b), in the case of parameter optimization, a grid of the various validation accuracies for the different combinations of parameter values. Every result should be discussed meaningfully but briefly. 1-2 pages are sufficient.

Please make sure that your report has a title and the names of the authors with IDs. The report must be readable (sound English) and formatted following a well-known standard (IEEE, https://www.ieee.org/conferences/publishing/templates.html).

VI. Timeliness.

Up to 2 days of delay in submission will not incur any penalty. However, if you take any longer than 2 days, we will not mark your submission. The same deadline applies to both A1 and Project.

For clarifications of the content of the assignment or the submission procedure, e-mail the TA (zhiyangdeng.30@gmail.com).

VII. Ethics.

All submissions must comply with the University’s principles of academic integrity. Everyone (individual/team) must submit a signed “expectation of originality” form, available at: https://www.concordia.ca/ginacody/students/academic-services/expectation-of-originality.html.
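As an illustration of the instance format and the 10-fold cross-validation described in Section I, one possible sketch in plain Python is shown below. It is not a reference solution. It assumes that the four fields are separated by semicolons and that the linearized matrix is a comma-separated list (possibly in brackets), which the handout does not spell out. The helper names and the `fit`/`predict` callbacks are hypothetical. Errors are reported as RSS divided by the number of instances, so that training and validation folds of different sizes can be compared.

```python
import random


def parse_instance(line):
    """Split one 'number; sequence; matrix; activity' line into typed fields."""
    number, sequence, matrix, activity = [f.strip() for f in line.strip().split(";")]
    # Assumption: the linearized 60x60 matrix is a comma-separated list,
    # optionally wrapped in square brackets.
    bppm = [float(v) for v in matrix.strip("[]").split(",") if v.strip()]
    return int(number), sequence, bppm, float(activity)


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, val_idx); each index is used for validation once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def rss(y_true, y_pred):
    """Residual sum of squares (sum of squared errors)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def cross_validate(X, y, fit, predict, k=10):
    """Return one (training error, validation error) pair per fold.

    fit(X_train, y_train) -> model and predict(model, X) -> list of floats
    are hypothetical callbacks wrapping whichever ML algorithm is chosen.
    Each error is RSS divided by the number of instances it was computed on.
    """
    scores = []
    for train, val in k_fold_indices(len(X), k):
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_va, y_va = [X[i] for i in val], [y[i] for i in val]
        model = fit(X_tr, y_tr)
        train_err = rss(y_tr, predict(model, X_tr)) / len(y_tr)
        val_err = rss(y_va, predict(model, X_va)) / len(y_va)
        scores.append((train_err, val_err))
    return scores


if __name__ == "__main__":
    # Synthetic data and a mean-value baseline, used only to run the sketch.
    rng = random.Random(1)
    X = [[rng.random()] for _ in range(200)]
    y = [min(1.0, max(0.0, x[0] + rng.gauss(0, 0.05))) for x in X]
    fit = lambda X_tr, y_tr: sum(y_tr) / len(y_tr)
    predict = lambda model, Xs: [model] * len(Xs)
    for fold, (tr, va) in enumerate(cross_validate(X, y, fit, predict), 1):
        print(f"fold {fold}: train MSE={tr:.4f}  validation MSE={va:.4f}")
```

A validation error well above the training error on most folds would point to over-fitting, while two similarly high values would point to under-fitting.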
