
Database System Concepts
7th Edition
ISBN:9780078022159
Author:Abraham Silberschatz Professor, Henry F. Korth, S. Sudarshan
Publisher:McGraw-Hill Education
Question

I urgently need help understanding how to approach this program. It is to be written in C++, but I am lost as to how to start and what the assignment is even asking. I would highly appreciate any help. Thank you.

Simple Data Compression
CS 10C Programming Assignment
Huffman coding is used to compress data. The idea is straightforward: represent the more
common, longer strings with shorter ones via a basic translation matrix. The translation matrix
is easily computed from the data itself by counting tokens and sorting them by frequency.
For example, in a well-known corpus used in Natural Language Processing called the "Brown"
corpus (see nltk.org), the top-20 most frequent tokens (words or punctuation marks) are
listed below with their frequencies and codes. The word "and", for example, requires
writing three characters. However, if I encode it differently, say, using the word "5" (yes, I
called "5" a word on purpose), then I save having to write two extra characters! Note that the
word "and" is so frequent that I save those two extra characters many times over!
Token           Frequency   Code
the             62713       1
[not legible]   58334       2
[not legible]   49346       3
of              36080       4
and             27932       5
to              25732       6
a               21881       7
in              19536       8
that            10237       9
is              10011       10
was             9777        11
for             8841        12
[not legible]   8837        13
[not legible]   8789        14
The             7258        15
with            7012        16
it              6723        17
as              6706        18
he              6566        19
his             6466        20
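To make the savings argument concrete, here is a minimal sketch that computes how many characters are saved when a token is replaced by its code at every occurrence. The helper name charsSaved is hypothetical and not part of the assignment; it assumes codes are written as decimal digits, as in the table.

```cpp
#include <cassert>
#include <string>

// Characters saved by writing `code` instead of `token` at every occurrence.
// Hypothetical helper for illustration; the result is negative when the
// decimal code is longer than the token it replaces.
long long charsSaved(const std::string& token, int code, long long frequency) {
    assert(code > 0);  // codes in the table start at 1
    long long tokenLen = static_cast<long long>(token.size());
    long long codeLen = static_cast<long long>(std::to_string(code).size());
    return (tokenLen - codeLen) * frequency;
}
```

With the table's numbers, charsSaved("and", 5, 27932) is 2 x 27932 = 55864 characters, while "his" with the two-digit code 20 saves only 1 x 6466 = 6466, and "a" with code 7 saves nothing. This is why the shortest codes go to the most frequent tokens.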
So the steps of Huffman coding are relatively straightforward:
1. Pass through the data once, collecting a list of token-frequency counts.
2. Sort the token-frequency counts by frequency, in descending order.
3. Assign codes to tokens using a simple counter, for example by incrementing over the
integers; this is just to keep things simple.
4. Store the new mapping (token -> code) in a hashtable called "encoder".
5. Store the reverse mapping (code -> token) in a hashtable called "decoder".
6. Pass through the data a second time. This time, replace all tokens with their codes.
Now, be amazed at how much you've shrunk your data!
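The six steps above can be sketched as follows. This is only an outline under stated assumptions: tokens are taken to be whitespace-separated, ties in frequency are broken alphabetically, and std::map is used purely as a stand-in for the "encoder" and "decoder" so the example stays short. The assignment requires replacing it with your own hashtable. The function names tokenize, buildCodebook, and encode are hypothetical, and the lambdas assume C++14 or later. Note also that this rank-based scheme is simpler than textbook Huffman coding, which builds a binary tree of variable-length bit codes.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split text into whitespace-separated tokens (an assumed tokenization).
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok) tokens.push_back(tok);
    return tokens;
}

// Steps 1-5: count, sort by descending frequency, assign codes 1, 2, 3, ...
// std::map is a stand-in only; swap in your own hashtable for the assignment.
void buildCodebook(const std::vector<std::string>& tokens,
                   std::map<std::string, int>& encoder,
                   std::map<int, std::string>& decoder) {
    encoder.clear();
    decoder.clear();

    std::map<std::string, long long> counts;             // step 1: count
    for (const auto& t : tokens) ++counts[t];

    std::vector<std::pair<std::string, long long>> ranked(counts.begin(),
                                                          counts.end());
    std::stable_sort(ranked.begin(), ranked.end(),       // step 2: sort
        [](const auto& a, const auto& b) { return a.second > b.second; });

    int code = 1;                                        // step 3: counter
    for (const auto& entry : ranked) {
        encoder[entry.first] = code;                     // step 4: token -> code
        decoder[code] = entry.first;                     // step 5: code -> token
        ++code;
    }
    assert(encoder.size() == decoder.size());
}

// Step 6: replace each token with its code, separated by single spaces.
std::string encode(const std::vector<std::string>& tokens,
                   const std::map<std::string, int>& encoder) {
    std::string out;
    for (const auto& t : tokens) {
        if (!out.empty()) out += ' ';
        out += std::to_string(encoder.at(t));
    }
    return out;
}
```

For the input "the cat and the dog and the bird", the counts are the=3, and=2, and 1 each for bird, cat, and dog, so the codes become the=1, and=2, bird=3, cat=4, dog=5, and the encoded text is "1 4 2 1 5 2 1 3". Original spacing and line breaks are not preserved by this sketch.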
Delivery Notes:
(1) Implement your own hashtable from scratch; you are not allowed to use existing hash
table libraries.
(2) To be useful, your output should include the coded data as well as the decoder (code ->
token) mapping file.
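For delivery note (1), one possible starting point is a small hashtable with separate chaining, sketched below. Everything here is an assumption made for illustration: the class name StringIntTable, the fixed bucket count with no resizing, and the choice of the FNV-1a hash function. It maps strings to integers, which fits the "encoder"; a matching int-to-string table for the "decoder" would follow the same pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Minimal string -> int hashtable using separate chaining.
// Fixed bucket count and no resizing: assumptions that keep the sketch short.
class StringIntTable {
public:
    explicit StringIntTable(std::size_t bucketCount = 1024)
        : buckets_(bucketCount, nullptr), size_(0) {
        assert(bucketCount > 0);
    }

    ~StringIntTable() {
        for (Node* head : buckets_) {
            while (head != nullptr) {
                Node* next = head->next;
                delete head;
                head = next;
            }
        }
    }

    // Copying is disabled because the table owns raw pointers.
    StringIntTable(const StringIntTable&) = delete;
    StringIntTable& operator=(const StringIntTable&) = delete;

    // Insert a new key, or overwrite the value of an existing one.
    void put(const std::string& key, int value) {
        std::size_t i = indexFor(key);
        for (Node* n = buckets_[i]; n != nullptr; n = n->next) {
            if (n->key == key) { n->value = value; return; }
        }
        buckets_[i] = new Node{key, value, buckets_[i]};  // push to chain front
        ++size_;
    }

    // Return true and write the value to `out` if the key is present.
    bool get(const std::string& key, int& out) const {
        for (const Node* n = buckets_[indexFor(key)]; n != nullptr; n = n->next) {
            if (n->key == key) { out = n->value; return true; }
        }
        return false;
    }

    std::size_t size() const { return size_; }

private:
    struct Node {
        std::string key;
        int value;
        Node* next;
    };

    // 64-bit FNV-1a hash over the bytes of the key, reduced to a bucket index.
    std::size_t indexFor(const std::string& key) const {
        unsigned long long h = 14695981039346656037ULL;
        for (unsigned char c : key) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return static_cast<std::size_t>(h % buckets_.size());
    }

    std::vector<Node*> buckets_;
    std::size_t size_;
};
```

A caller would create a table, call put("the", 1) for each token, and later call get("the", code) during the second pass. std::vector is used only as the bucket array; if your instructor counts that as a library container, a plain dynamically allocated array of Node pointers would work the same way.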
Now GZIP all that and watch it shrink immensely!
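For delivery note (2), the decoder has to be saved alongside the coded data so the text can be restored later. The sketch below shows one way to do that, assuming a hypothetical file format of one "code<TAB>token" line per entry in code order. Because the codes are consecutive integers starting at 1, this example keeps the decoder in a std::vector indexed by code minus one instead of a hashtable; that is a simplification, not what the assignment asks for. The names writeDecoder, readDecoder, and decode are made up for illustration, and the functions take streams so they work with files (std::ofstream, std::ifstream) as well as in-memory strings.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Write the decoder as one "code<TAB>token" line per entry.
// tokensByRank[i] is the token whose code is i + 1 (codes start at 1).
void writeDecoder(std::ostream& out, const std::vector<std::string>& tokensByRank) {
    for (std::size_t i = 0; i < tokensByRank.size(); ++i) {
        out << (i + 1) << '\t' << tokensByRank[i] << '\n';
    }
}

// Read the mapping back; assumes the lines are in code order 1, 2, 3, ...
// and that tokens contain no whitespace.
std::vector<std::string> readDecoder(std::istream& in) {
    std::vector<std::string> tokensByRank;
    std::size_t code = 0;
    std::string token;
    while (in >> code >> token) {
        assert(code == tokensByRank.size() + 1);
        tokensByRank.push_back(token);
    }
    return tokensByRank;
}

// Turn a space-separated string of codes back into space-separated tokens.
// An unknown code makes at() throw std::out_of_range.
std::string decode(const std::string& coded,
                   const std::vector<std::string>& tokensByRank) {
    std::istringstream in(coded);
    std::string out;
    std::size_t code = 0;
    while (in >> code) {
        if (!out.empty()) out += ' ';
        out += tokensByRank.at(code - 1);
    }
    return out;
}
```

With the decoder entries the, and, bird, cat, dog (codes 1 through 5), decoding "1 4 2 1 5 2 1 3" gives back "the cat and the dog and the bird". A round trip like this is a quick way to check that the coded data and the mapping file agree before compressing them.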