Question

To compute the conditional probabilities you need to determine unigram and
bigram counts first (you can do this in a single pass through a file if you do things
carefully) and store them in a Binary Search Tree (BST). After that, you can compute
the conditional probabilities.
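As a rough sketch only (not part of the assignment text; the names Node and insert_or_increment are illustrative assumptions), a count-keeping BST node in C might look like the following, with one tree for unigrams and a second tree whose keys are the two words of a bigram joined by a space:

    #include <stdlib.h>
    #include <string.h>

    /* One node per distinct key: a single word, or the two words of a  */
    /* bigram joined by a space. Error checking is omitted for brevity. */
    typedef struct Node {
        char *key;           /* the unigram or bigram text          */
        long count;          /* how many times the key has occurred */
        struct Node *left;
        struct Node *right;
    } Node;

    /* Insert the key if it is absent, otherwise bump its count; returns the root. */
    static Node *insert_or_increment(Node *root, const char *key) {
        if (root == NULL) {
            Node *n = malloc(sizeof *n);
            n->key = strdup(key);
            n->count = 1;
            n->left = n->right = NULL;
            return n;
        }
        int cmp = strcmp(key, root->key);
        if (cmp < 0)
            root->left = insert_or_increment(root->left, key);
        else if (cmp > 0)
            root->right = insert_or_increment(root->right, key);
        else
            root->count++;
        return root;
    }

With one such tree for unigrams and a second for bigrams, every word read from the file triggers one call for the word itself and, when there is a preceding word, one call for the pair “previous current”, so both kinds of counts come out of a single pass.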
Input files
Test files can be found at http://www.gutenberg.org/ebooks/. For example, search for “Mark Twain,” then click on any of his books and download the “Plain Text UTF-8” format.
You should also test your program on other input files for which you can hand-compute the correct answer.
Output files
Your program must accept the name of an input file as a command line argument. Let's call this file name fn. Your program must then produce as output the following set of files (a sketch of deriving the output names from fn appears after this list):
• Your program must write the unigram counts to a file named fn.uni in which
each unigram is listed on a separate line, and each line contains just the
unigram and its count (an integer), separated by a single space.
• Your program must write the bigram counts to a file named fn.bi in which
each bigram is listed on a separate line, and each line contains just the
bigram and its count (an integer), separated by a single space.
• Your program must write the conditional probabilities to a file named fn.cp,
reported in the form P(WORD(k)|WORD(k-1)) = p, where p is the conditional
probability of WORD(k) given WORD(k-1).
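As mentioned above, here is a minimal sketch of deriving the three output names from the command-line argument; make_output_name is a hypothetical helper, not something required by the assignment, and the real program would replace the printf with the BST traversals that write the count and probability lines:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Build "<fn><suffix>" in freshly allocated memory, e.g. "book.txt.uni". */
    static char *make_output_name(const char *fn, const char *suffix) {
        char *name = malloc(strlen(fn) + strlen(suffix) + 1);
        strcpy(name, fn);
        strcat(name, suffix);
        return name;
    }

    int main(int argc, char *argv[]) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <input-file>\n", argv[0]);
            return 1;
        }
        char *uni_name = make_output_name(argv[1], ".uni");
        char *bi_name  = make_output_name(argv[1], ".bi");
        char *cp_name  = make_output_name(argv[1], ".cp");

        /* ... count unigrams and bigrams, then write one "key count" line */
        /* per entry to fn.uni and fn.bi, and one "P(...) = p" line to fn.cp. */
        printf("%s %s %s\n", uni_name, bi_name, cp_name);

        free(uni_name); free(bi_name); free(cp_name);
        return 0;
    }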
Notes
• You may use any BST implementation found online at your own risk (provided the source of this code is cited properly). A 50-mark bonus will be given if you implement the BST yourself (only the functionality needed to complete your project).
• Your program should accept file name(s) as command line argument(s) (no
hard-coded file names in your code).
• Your code will be tested using the latest version of the GNU C compiler on a Unix-based operating system. If you do not have a Unix-based machine, you may want to use the Windows Subsystem for Linux or VirtualBox, at your own responsibility!
• Your code must be well commented. When writing your comments, focus on what the code does at a high level; for example, describe the main steps of an algorithm, not every detail (that is what the code is for).
• A ReadMe.txt file (including instructions on how to compile and run your program, along with any known problems) must be submitted.

Submission details: Make a directory and name it using your first and last names in camel case, for example haniGirgis. Please name the main file, where execution starts, hw2.cpp. Compress your directory with the zip utility, and submit the compressed file using Blackboard.
Introduction
Suppose you had a document (perhaps a newspaper article, or an unattributed
manuscript for a book), and you were interested in knowing who wrote it. One way
to try to determine the authorship of the anonymous document is by comparing
properties of the anonymous document with properties of known documents, and
seeing if there is enough similarity to make a judgment of authorship.
Some simple properties one might use to distinguish different authors include:
• Vocabulary (i.e. the set of words an author uses).
• Word frequencies (i.e. the frequencies with which an author uses words).
• Bigram frequencies (i.e. the frequencies of two consecutive words).
• Bigram probabilities (i.e. the probability that one word follows another word).
Terminology
• A unigram is a sequence of words of length one (i.e. a single word).
• A bigram is a sequence of words of length two.
• The conditional probability of an event E2 given another event E1, written p(E2|E1), is the probability that E2 will occur given that event E1 has already occurred.
We write p(w(k)|w(k-1)) for the conditional probability of a word w in position k, w(k), given the immediately preceding word, w(k-1). You determine the conditional probabilities by determining unigram counts (the number of times each word appears, written c(w(k))), bigram counts (the number of times each pair of words appears, written c(w(k-1) w(k))), and then dividing each bigram count by the unigram count of the first word in the bigram:
p(WORD(k)|WORD(k-1)) = c(WORD(k-1) WORD(k)) / c(WORD(k-1))
For example, if the word "time" occurs seven times in a text, and "time of" occurs three times, then the probability of "of" occurring after "time" is 3/7.
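One implementation detail worth flagging (a generic C illustration, not part of the assignment): the two counts are integers, so at least one of them must be converted to a floating-point type before dividing, otherwise 3/7 would evaluate to 0:

    #include <stdio.h>

    int main(void) {
        /* Counts taken from the "time" / "time of" example above. */
        long unigram_count = 7;   /* c("time")    */
        long bigram_count  = 3;   /* c("time of") */

        /* Cast before dividing: integer division would give 0 here. */
        double p = (double)bigram_count / (double)unigram_count;
        printf("P(of|time) = %.4f\n", p);   /* prints 0.4286 */
        return 0;
    }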
Project description
In this project, you will determine conditional probabilities of bigrams. To do this, you will write a C program that reads in a file of text and produces the three output files described under “Output files” above.
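For orientation only, here is a minimal sketch of the single pass described earlier. It assumes whitespace-splitting via fscanf as the tokenization rule (the assignment's own rules for punctuation and case take precedence), and the insert_or_increment calls mentioned in the comments refer to the hypothetical BST helper sketched near the top of this question:

    #include <stdio.h>
    #include <string.h>

    #define MAX_WORD 256

    int main(int argc, char *argv[]) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <input-file>\n", argv[0]);
            return 1;
        }
        FILE *in = fopen(argv[1], "r");
        if (in == NULL) {
            perror(argv[1]);
            return 1;
        }

        char word[MAX_WORD], prev[MAX_WORD] = "";
        while (fscanf(in, "%255s", word) == 1) {
            /* The real program would call, on the unigram tree:           */
            /*     unigrams = insert_or_increment(unigrams, word);         */
            /* and, when prev is non-empty, build "prev word" and call the */
            /* same helper on the bigram tree.                             */
            if (prev[0] != '\0')
                printf("bigram: %s %s\n", prev, word);
            printf("unigram: %s\n", word);
            strcpy(prev, word);
        }
        fclose(in);
        return 0;
    }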