zyBooks_exercise_2023f
School
University of British Columbia
Course
404
Subject
Computer Science
Date
Jan 9, 2024
Pages
18
Uploaded by JudgePenguinMaster491
CPSC 404: zyBooks Exercise and Questions
“Database Systems with SQL”
Due Date: Sunday, November 12, 2023 before 23:59 (near midnight), with a 10% penalty per half day late (i.e., 10% penalty of the online component if you’re late for that; 10% penalty of the written component if you’re late for that)
Last Update: Nov. 4, 2023 @ 16:15
• November 4, 2023 @ 16:15: Clarification for Part B’s Q2(a)(i)
• October 16, 2023 @ 15:30: Created/Posted
You have the choice of doing either:
1. The zyBooks online exercise with its online participation and challenge activities (the two chapters we’re interested in are quite do-able and the activities are laid out like a tutorial), and you also need to answer the questions found in this document.
2. The SQL Server DBA lab exercise (posted on Canvas).
Don’t do both. Either of these will make up 10% of your course grade. Check the course outline on Canvas for how to register for the zyBooks course. It costs about $64 USD (roughly $90 CAD on a credit card). Note that option (2) above is free, and is hosted by UBC’s Department of Computer Science.
This zyBooks exercise involves 2 parts: 5% of your overall grade will be for Part A (zyBooks online activities), and 5% will be for Part B (written answers to questions to be submitted on Canvas).
If you are using your name or CWL ID or CS ID when registering for zyBooks, that’s fine; we can easily identify you and we can download your zyBooks points. Don’t worry, we will send you e-mail if we have any doubt. zyBooks won’t lose a record of your work on their site. However, if you are using a pseudonym (fake name to maintain privacy), then when you upload your written answers to Canvas Assignments, include a note that says what pseudonym or e-mail address you used for zyBooks, so that we can credit you with the points for successfully completing the online exercises. zyBooks will track your online completion of the Participation Activities and Challenge Activities for the various sections. After the due date, we will transfer these points to the Canvas gradebook.
In the following questions, where we have written “1-2 sentences” or “1 paragraph” (for example), you can always provide more sentences if you wish. Note that where we have written “1 paragraph”, we expect 2-4 sentences (or more, if you wish), and not just a few words.
There are 2 database papers that tie in to the zyBooks material that you will need to read. Both are found in the ACM Digital Library. If you’re on a UBC VPN (or accessing it from a UBC computer), you get them for free. Just click on the link to get access (it’ll pop up a CWL sign-in if you’re not already connected) and download the PDF copy of each.
The 2 papers are:
1. [ACM Inroads paper about DB education, NoSQL, etc.] Goldweber, Mikey; Wei, Min; Aly, Sherif; Raj, Rajendra K.; and Mokbel, Mohamed. “The 2022 Undergraduate Database Course in Computer Science: What to Teach”. ACM Inroads, Volume 13, Number 3, September 2022, pp. 16-21.
• https://dl.acm.org/doi/10.1145/3549545
2. [CACM paper about The Seattle Report on Database Research] Abadi, Daniel; Ailamaki, Anastasia; et al. (actually 33 authors). “The Seattle Report on Database Research”, Communications of the ACM, Volume 65, Issue 8, August 2022, pp. 72-79.
• https://dl.acm.org/doi/10.1145/3524284
• This strategic research report comes out every 5 years, and it is the output from a large panel of database researchers that discusses research trends, needs, and challenges in database systems and data management. Its authors include our textbook’s co-author Raghu Ramakrishnan, Turing Award winner Mike Stonebraker (a database person who won this Nobel-like prize), ARIES crash-recovery inventor C. Mohan of IBM, other database textbook authors, etc. The report discusses some of the topics mentioned in your zyBooks activities including data lakes, big data, machine learning, data science, data integration, cloud services for databases, key-value DBs, wide-column DBs, scaling, etc.
• This 2022 CACM paper is an update of the original 2018 Seattle Report that includes additional commentary and progress since then. So, it’s kind of a “greatest hits” type of paper, and it will let us work with the latest findings.
Part A: The zyBooks reading, participation activities, and challenge activities
• You need to read the materials, follow the animations, and answer the multiple choice, short answer, drag-and-drop, etc. questions. There are no questions whose answers are not (reasonably) found within the zyBooks materials. After you answer, they even tell you which answers are wrong, and you get to repeat the questions. Correct your wrong answers to get full credit. We urge you to take the activities seriously because these are good topics in data management. A future employer may be happy that you have some understanding of these topics. We will track your activities, and award you points for successfully completing the activities. Those marks will be imported into the Canvas Gradebook.
We will only do 2 chapters in the zyBooks course materials. In particular, we will focus on these chapters and the following major sections within the chapters:
• Chapter 7: Database Architectures … but only the following sections
1. MySQL Architecture
2. Cloud Databases
• A comparison of on-premise services, IaaS, PaaS, and SaaS
3. Distributed (and Parallel) Databases
4. Replicated Databases
5. n/a (skip Data Warehouses, now part of CPSC 304)
6. n/a (skip Data Warehouse Design)
7. Other Databases, and this includes: Data Lakes, Embedded Databases, Federated Databases, and In-Memory Databases
• Chapter 9: NoSQL Databases and Big Data
1. Big Data Databases
2. Key-Value Databases
3. Wide-Column Databases
4. Document Databases
5. Graph Databases
6. MongoDB
Part B: Questions and Answers
• There are a series of questions on the following pages. We’re looking for good answers to each, but they don’t have to be bullet-proof explanations (but they need to be correct, not just a “good effort”). You can answer these questions as you go along in zyBooks’ activities; or, you can save them for afterwards. Some questions require some additional reading, namely the ACM Inroads paper and the CACM paper about The Seattle Report.
• Be sure to answer the questions on your own, and not copy someone else’s work.
• The use of ChatGPT is not allowed for this assignment.
Here are the questions:
1. Cloud Databases
a. Hand in screenshots of your completed Challenge Activity 7.2.1, for two different sets of questions. (You might have to make multiple attempts, and there are different sets of questions presented when you re-take it.)
This is actually a nice summary and comparison of on-premise services, IaaS, PaaS,
and SaaS.
You might want to refer to Participation Activity 7.2.2 when doing this.
Be careful to read the instructions (they can change between attempts).
(At the very bottom.)
2. Parallel Databases
a. For this question, use the sailing database that we have been discussing in our lectures, pre-class exercises, in-class exercises, and sample questions from the assignments and exams. Provide one example of a query that falls into each of the 3 categories in zyBooks’ Participation Activity 7.3.1. So, there will be 3 queries in all. If you wish to provide the actual SQL, that’s fine (you don’t have to), but for this question, we want you to explain the query in words, and briefly justify why it fits into that category. (Use 1-2 paragraphs for each.)
i. In other words, there are 3 memory architectures described and we want you to provide/explain a sailing query that would perform best under that architecture, but not very well on the other two architectures.
1. Shared Memory Computer:
Query Example: "Find the average age of sailors who have reserved a specific type of boat."
Explanation:
In a shared-memory computer:
• Complex operations requiring frequent and fast memory accesses are efficient because all processors access the same memory.
• The given query involves joining Sailors and Reserves and then calculating the average. Shared memory enables fast, shared access to data.
In a shared-storage computer:
• Although the processors can handle disk operations efficiently, the lack of shared memory can slow down operations that require frequent inter-processor communication or access to common data sets. Calculating averages over joined tables requires more complex coordination between processors because each processor uses its own memory instead of a common memory space.
In a shared-nothing computer:
• Each processor operates independently using its own memory and storage. This independence can lead to inefficiencies in queries that require frequent access to, and combination of, data from multiple sources. The overhead of distributing, processing, and aggregating data across different nodes can be significant compared to a shared-memory environment.
2. Shared Storage Computer:
Query Example: "List all sailors and their corresponding boat reservations, sorted by date."
Explanation:
In a shared-storage computer:
• In a shared-storage computer, each processor has its own memory but shares storage. This is ideal for tasks involving many disk read/write operations. This query requires accessing and merging large amounts of data from two tables, which a shared-storage system can handle efficiently. Independent memory in this architecture avoids contention, while shared disk access enables efficient data retrieval.
In a shared-memory computer:
• While processors can quickly access common memory, they may be bottlenecked by disk I/O operations. Reading and writing large amounts of data on disk becomes less efficient due to potential I/O contention between multiple processors.
In a shared-nothing computer:
• Shared-nothing architectures excel at parallel processing but may be less effective for tasks that require frequent simultaneous access to shared storage resources. This query involves bulk reading and sorting data from shared storage, which may be less efficient in a shared-nothing system due to the overhead of data distribution and subsequent aggregation.
3. Shared-Nothing Computer:
Query Example: "Identify sailors who have not made any reservations in the past year." Also consider separating the computers by geographical location.
Explanation:
In a shared-nothing computer:
• In a shared-nothing computer, each processor has its own memory and storage, so it excels at independently processing large data sets in parallel. This query involves filtering and cross-referencing large data sets that can be efficiently distributed across multiple nodes in a shared-nothing architecture. Each node can process a portion of the data, significantly reducing processing time.
In a shared-memory computer:
• While a shared-memory system allows fast access to a common memory pool, it may not be as efficient for tasks that can be easily parallelized across independent data sets. This query benefits from being split and processed across multiple nodes, a scenario that is not well suited to shared-memory systems because their focus is on shared data processing rather than distributed, independent data processing.
In a shared-storage computer:
• Shared-storage systems are well suited for tasks involving a large number of disk operations but may have difficulty handling highly parallelized tasks that can be distributed across multiple nodes. They will be relatively poor at distributing and processing this query efficiently due to their reliance on central storage resources.
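The shared-nothing reasoning above can be illustrated with a toy simulation in plain JavaScript. This is only a sketch under stated assumptions: the `sailors` and `reserves` arrays, the `sid` and `day` fields, the three-node setup, the cutoff date, and the `nodeFor` and `localInactiveSailors` helpers are all hypothetical, and a real parallel DBMS does this work inside the engine. Both tables are hash-partitioned on `sid`, so each node can find its own sailors without recent reservations using only its local data.

```javascript
// Toy shared-nothing simulation: every "node" owns a private slice of both tables.
// All names and data here are hypothetical illustrations.
const NODES = 3;
const nodeFor = (sid) => sid % NODES; // hash partitioning on the sailor id

const sailors = [
  { sid: 1, sname: "Ann" },
  { sid: 2, sname: "Raj" },
  { sid: 3, sname: "Mei" },
  { sid: 4, sname: "Tom" },
];
const reserves = [
  { sid: 1, bid: 101, day: "2023-06-01" },
  { sid: 3, bid: 102, day: "2021-03-15" }, // older than the cutoff used below
  { sid: 4, bid: 103, day: "2023-09-30" },
];

// Co-partition both tables on sid, so matching rows land on the same node.
const nodes = Array.from({ length: NODES }, () => ({ sailors: [], reserves: [] }));
for (const s of sailors) nodes[nodeFor(s.sid)].sailors.push(s);
for (const r of reserves) nodes[nodeFor(r.sid)].reserves.push(r);

// Each node runs the anti-join on its local data only (no shared memory or disk).
// ISO date strings compare correctly as plain strings.
function localInactiveSailors(node, cutoff) {
  const recent = new Set(
    node.reserves.filter((r) => r.day >= cutoff).map((r) => r.sid)
  );
  return node.sailors.filter((s) => !recent.has(s.sid));
}

// The coordinator only has to concatenate the per-node results.
const inactive = nodes
  .flatMap((node) => localInactiveSailors(node, "2022-11-12"))
  .map((s) => s.sname)
  .sort();

console.log(inactive);
```

With this sample data the result contains Mei (whose only reservation is older than the cutoff) and Raj (who has no reservations). The design point is that no node ever needs another node's rows, which is why co-partitioning on the join key suits a shared-nothing layout.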
3. Big Data, Key-Value Databases, Wide-Column Databases, and Document Databases
In each of the following cases, use the zyBooks content/explanations to answer the questions:
a. What is meant by the term sharding and how does it differ from partitioning? (Use 2-3 sentences.)
i. Sharding splits data sets across multiple machines, which enables horizontal scaling and supports large volumes of data. In comparison, partitioning splits data sets across multiple files on a single machine.
b. Think back to the principles found in our Hash-Based Indexes unit. When assigning documents to shards, under what circumstances would you use a hash function instead of a range function? (Use 1-2 sentences.)
i. With a hash function, hash buckets can be stored on different machines, enabling horizontal scaling. With a range function, each shard contains a contiguous range of shard key values.
ii. So, if we need data to be evenly distributed across shards and do not need to maintain any order between documents, we can use a hash function instead of a range function when assigning documents to shards.
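The trade-off between hash and range assignment can be seen in a small JavaScript sketch. Everything in it is an assumption made for illustration: the shard count, the range boundaries, the sample keys, and the simple string hash in `hashShard` are hypothetical and are not how any particular database computes shard placement.

```javascript
// Two hypothetical ways to map a shard key to one of N shards.
const SHARDS = 3;

// Hash-based: a simple string hash spreads neighbouring keys across shards.
function hashShard(key) {
  let h = 0;
  for (const ch of String(key)) {
    h = (h * 31 + ch.codePointAt(0)) % 1000003;
  }
  return h % SHARDS;
}

// Range-based: each shard owns a contiguous range of key values.
// The boundaries below are made up for this example.
const upperBounds = [1000, 2000, Infinity];
function rangeShard(key) {
  return upperBounds.findIndex((bound) => key < bound);
}

// Sequential keys, e.g. newly issued order numbers.
const keys = [2001, 2002, 2003, 2004, 2005, 2006];
const countPerShard = (assign) =>
  keys.reduce((counts, k) => {
    counts[assign(k)] += 1;
    return counts;
  }, new Array(SHARDS).fill(0));

console.log("range:", countPerShard(rangeShard)); // every key lands on the last shard
console.log("hash: ", countPerShard(hashShard));
```

Sequential keys all fall into one range shard, creating a hot spot, while the hash function scatters them. The cost is that a range scan over neighbouring keys would have to visit several hash shards.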
c. Explain the difference between a row in a wide column database and a row in a relational table. (Use 1-2 sentences.)
i. In a wide-column database, rows are physically sorted on the key, and each shard contains rows with a range of key values. Different rows may have different sets of columns. In a relational table, however, each row contains the same fixed set of columns defined by the table schema.
d. How do timestamps work in a wide-column DB, and why are they needed? (Use 2-3 sentences.)
i. In a wide-column database, a timestamp stores the date and time when a version of a value was created. Timestamps are needed to access old versions of data: wide-column storage allows multiple versions of a value in the same column for a given row, so users can retain versions of the data over time.
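The versioning idea can be sketched in JavaScript as follows. This is a minimal in-memory model under stated assumptions: the `WideRow` class, its `put` and `get` methods, the integer timestamps, and the sample e-mail values are all hypothetical and do not correspond to the API of any real wide-column database.

```javascript
// Hypothetical in-memory model of one wide-column row:
// column name -> list of { ts, value } versions, newest first.
class WideRow {
  constructor() {
    this.columns = new Map();
  }

  // Writing never overwrites; it adds a new timestamped version.
  put(column, value, ts) {
    const versions = this.columns.get(column) ?? [];
    versions.push({ ts, value });
    versions.sort((a, b) => b.ts - a.ts);
    this.columns.set(column, versions);
  }

  // Read the newest version at or before `asOf` (defaults to the latest).
  get(column, asOf = Infinity) {
    const versions = this.columns.get(column) ?? [];
    const hit = versions.find((v) => v.ts <= asOf);
    return hit ? hit.value : undefined;
  }
}

const row = new WideRow();
row.put("email", "ann@old.example", 100);
row.put("email", "ann@new.example", 200);

console.log(row.get("email"));      // latest version
console.log(row.get("email", 150)); // version that was current at ts = 150
```

Because each write keeps its timestamp, a reader can ask for the value as of an earlier time, which is the "access old versions" capability described in the answer. A row that never wrote a given column simply has no entry for it.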
4. Graph Databases
Let us draw upon some of your CPSC 304 background.
a. Vertices in property graphs, and tables in relational databases, serve different structural purposes; but in what way might they conceptually converge? Provide a specific example or diagram from the zyBooks or CPSC 304/404 course material that helps illustrate this similarity or connection. (Use 1-2 sentences.)
i. A vertex is like an entity in a relational database.
ii. Vertices can have properties.
iii. Entities can have attributes.
b. Edges provide relationships in property graphs. Thinking back to CPSC 304, what would be a similar analogy in CPSC 304 material? (Use 1-2 sentences.)
i. From the CPSC 304 material, a relationship in a relational database is a similar concept to an edge in a property graph. A relationship links one entity to another, establishing relationships between different tables. An edge can have properties in a property graph, and a relationship can have attributes in a relational database.
c. Property graphs allow vertices and edges to have properties. Contrast this with the columnar data of relational databases. In particular, explain a scenario where the flexibility of property graphs offers a significant advantage over traditional relational databases. In your answer, reference any relevant zyBooks material that you've encountered. (Use 1 paragraph.)
i. Graph databases are a type of NoSQL database and have a flexible schema. They focus on highly connected data with an intrinsic need for relationship analysis. New vertices, edges, and attributes can be added at any time, and relationships are important in their own right. Attribute names are stored together with attribute values, so vertices and edges with the same label can have different attribute names.
ii. Relational databases need a fixed schema. Adding a new column or relationship may require schema changes or leave tables containing many null values.
iii. A scenario where property graphs are advantageous is modeling complex and irregular relationships, when retrieving the data is more important than storing it and when the data is highly connected. An example is a knowledge graph, where each edge between vertices may have different attributes.
iv. For example, between vertices representing database knowledge, one connection might contain attributes such as "from CPSC 304", "A based on B", and "together as improvement", while another connection might only have "from CPSC 404". In a relational database, representing this often requires multiple tables with many null-valued columns to cover all properties, or a complex schema with many joined tables.
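The flexibility argument above can be made concrete with a short JavaScript sketch. It assumes a hypothetical in-memory property graph: the `vertices` map, the `edges` array, the `RELATES_TO` label, the property names, and the `neighbours` helper are invented for illustration and are not taken from zyBooks or from any graph database's API.

```javascript
// Hypothetical property graph: vertices and edges each carry their own property bag.
const vertices = new Map([
  ["btree", { label: "Topic", name: "B+ tree" }],
  ["hash", { label: "Topic", name: "Hash index" }],
  ["sharding", { label: "Topic", name: "Sharding" }],
]);

// Edges with the same label may have different property names.
const edges = [
  { from: "hash", to: "sharding", label: "RELATES_TO",
    props: { course: "CPSC 404", note: "hash function assigns shards" } },
  { from: "btree", to: "hash", label: "RELATES_TO",
    props: { course: "CPSC 404" } },
];

// Adding a new property later needs no schema change.
edges[1].props.comparedOn = "range queries";

// Follow outgoing edges from a vertex, returning neighbour names and edge properties.
function neighbours(id) {
  return edges
    .filter((e) => e.from === id)
    .map((e) => ({ name: vertices.get(e.to).name, ...e.props }));
}

console.log(neighbours("btree"));
```

In a relational design, the `note` and `comparedOn` properties would each need a column that is null for most relationship rows, or a separate table. Here each edge stores only the properties it actually has.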
5. MongoDB
MongoDB is currently one of the top-ranked NoSQL database companies. It is a company on the New York Stock Exchange. It has a $25 billion market capitalization (number of shares outstanding × current stock price, as of October 13, 2023), double what it was about a year ago. So, it looks like there is significant commercial interest in, or at least “high expectations” for, NoSQL document databases.
Similar to the automobile and student databases in the MongoDB exercises on zyBooks, create a different database (on paper by typing out your MongoDB commands). Choose a small DB of your own on a topic that interests you (e.g., a hobby; music; books; movies; sports (perhaps one of: hockey, football, basketball, baseball, etc.)). Don’t copy it from elsewhere. Create your own from scratch. Do the following:
a. Insert 5 rows into the database subject to the following criteria:
1. There are at least 5 fields for each record.
2. At least one of the fields is numeric, at least one is a string, and at least one is an array.
use books
db.books.insertOne({
  title: 'Great Gatsby',
  author: 'Bob',
  publicationYear: 1949,
  genres: ['Novel'],
  rating: 8.5
})
db.books.insertOne({
  title: 'Political',
  author: 'John',
  publicationYear: 1956,
  genres: ['Political fiction'],
  rating: 9.0
})
db.books.insertOne({
  title: 'a bird',
  author: 'Lee',
  publicationYear: 1949,
  genres: ['Novel'],
  rating: 8.7
})
db.books.insertOne({
  title: 'Pride',
  author: 'Amy',
  publicationYear: 1949,
  genres: ['Novel'],
  rating: 8.8
})
db.books.insertOne({
  title: 'The Hobbit',
  author: 'Amy',
  publicationYear: 1937,
  genres: ['Fantasy', "Children's literature"],
  rating: 9.2
})
b. Using some of the query operators shown in zyBooks Table 9.6.1 (not query operators from elsewhere), create 4 different, non-trivial MongoDB queries that search your database and return the following results. Each query must satisfy a different one of the following 4 constraints, thus making up a total of 4 queries in all:
1. Return 1 record, using a single condition in the MongoDB search clause (similar to having 1 WHERE-clause condition in an SQL statement)
db.books.find({ author: { $eq: 'Lee' } })
// Returns the book whose title is "a bird"
2. Return 1 record, using 2 or more conditions in the search clause
db.books.find({ $and: [ { author: { $eq: 'Amy' } }, { publicationYear: { $eq: 1937 } } ] })
// Returns the book whose title is "The Hobbit"
3. Return many records, using 1 condition in the search clause
db.books.find({ author: { $eq: 'Amy' } })
// Returns the books titled "Pride" and "The Hobbit"
4. Return many records, using 2 or more conditions in the search clause
db.books.find({ $and: [ { author: { $in: ['Amy', 'John'] } }, { publicationYear: { $lte: 1949 } } ] })
// Returns the books titled "Pride" and "The Hobbit"
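To sanity-check which titles the four queries above should return without a running MongoDB server, the matching logic can be imitated in plain JavaScript. This is only a sketch under assumptions: the `matches` and `find` helpers and the `ops` table are hypothetical, they handle just `$eq`, `$in`, `$lte`, and `$and`, and `$in` is given an array of values, which is the form MongoDB expects. It is not MongoDB's implementation.

```javascript
// A deliberately tiny matcher that supports only $eq, $in, $lte, and $and.
// It is an illustration, not MongoDB's actual implementation.
const books = [
  { title: "Great Gatsby", author: "Bob", publicationYear: 1949, genres: ["Novel"], rating: 8.5 },
  { title: "Political", author: "John", publicationYear: 1956, genres: ["Political fiction"], rating: 9.0 },
  { title: "a bird", author: "Lee", publicationYear: 1949, genres: ["Novel"], rating: 8.7 },
  { title: "Pride", author: "Amy", publicationYear: 1949, genres: ["Novel"], rating: 8.8 },
  { title: "The Hobbit", author: "Amy", publicationYear: 1937, genres: ["Fantasy", "Children's literature"], rating: 9.2 },
];

// One small predicate per supported operator.
const ops = {
  $eq: (value, arg) => value === arg,
  $in: (value, arg) => arg.includes(value),
  $lte: (value, arg) => value <= arg,
};

// A document matches when every field condition (or every $and branch) holds.
function matches(doc, query) {
  return Object.entries(query).every(([field, cond]) => {
    if (field === "$and") return cond.every((sub) => matches(doc, sub));
    return Object.entries(cond).every(([op, arg]) => ops[op](doc[field], arg));
  });
}

const find = (query) => books.filter((b) => matches(b, query)).map((b) => b.title);

console.log(find({ author: { $eq: "Lee" } }));
console.log(find({ $and: [{ author: { $eq: "Amy" } }, { publicationYear: { $eq: 1937 } }] }));
console.log(find({ author: { $eq: "Amy" } }));
console.log(find({ $and: [{ author: { $in: ["Amy", "John"] } }, { publicationYear: { $lte: 1949 } }] }));
```

Against the five sample documents, the four calls return "a bird"; "The Hobbit"; "Pride" and "The Hobbit"; and again "Pride" and "The Hobbit". In the last query, John's book is excluded because its publication year is 1956.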
c. Using some of the update operators shown in zyBooks Table 9.6.2 (not update operators from elsewhere), create 3 different update statements, each of which satisfies a different one of these 3 cases, thus making up 3 update statements in all:
1. Update no record (for 1 of the update statements)
db.books.updateOne({ author: 'NOTEXIST' }, { $rename: { name: 'Oliver' } })
2. Update 1 record (for 1 of the update statements)
db.books.updateOne({ author: 'Lee' }, { $inc: { publicationYear: 2 } })
3. Update 2 or more records (for 1 of the update statements)
db.books.updateMany({ author: 'Amy' }, { $set: { rating: 9.9 } })
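The three cases differ in how many documents a call touches: a filter that matches nothing, an update of only the first match, and an update of every match. The JavaScript sketch below imitates that behaviour on an in-memory array. It is an assumption-laden toy: the `shelf` data, the `sameAuthor` predicate, and these local `updateOne` and `updateMany` functions (which support only a `$set`-style assignment) are hypothetical stand-ins, not MongoDB's implementation.

```javascript
// Toy versions of updateOne/updateMany with a $set-only update, to show
// how many documents each call touches. Not MongoDB's real implementation.
const shelf = [
  { title: "Pride", author: "Amy", rating: 8.8 },
  { title: "The Hobbit", author: "Amy", rating: 9.2 },
  { title: "a bird", author: "Lee", rating: 8.7 },
];

const sameAuthor = (author) => (doc) => doc.author === author;

function updateOne(docs, predicate, set) {
  const doc = docs.find(predicate); // only the first match
  if (doc) Object.assign(doc, set);
  return doc ? 1 : 0;
}

function updateMany(docs, predicate, set) {
  const hits = docs.filter(predicate); // every match
  hits.forEach((doc) => Object.assign(doc, set));
  return hits.length;
}

const none = updateOne(shelf, sameAuthor("NOTEXIST"), { rating: 1 });
const one = updateOne(shelf, sameAuthor("Amy"), { rating: 9.9 });
const many = updateMany(shelf, sameAuthor("Amy"), { rating: 9.9 });

console.log({ none, one, many });
```

The counts come out as 0, 1, and 2: the unknown author matches nothing, the single-document call stops at the first of Amy's two books, and the many-document call reaches both. Lee's book is never modified.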
6. More on zyBooks topics, CPSC 304, CPSC 404, and the ACM Inroads paper:
The following questions deal with the ACM Inroads paper by Goldweber, et al., mentioned near the top of this document, and found in the ACM Digital Library:
a. Point out some things in the article that you disagree with. Justify your answer. (Use 2 paragraphs.)
i. Rajendra points out that “the notion of database as the persistent store of valuable data should be introduced into CS curriculum as soon as feasible without initially requiring a full database course.” I disagree for two reasons. First, models built around the fundamental concept of an external persistent store, like LINQ, are not popular in industry, so they may not help students who want to become software engineers or database engineers. Second, for a junior CS student, introducing database knowledge too early may cause confusion when students do not yet have a good understanding of computer engineering, data structures and algorithms, and computer systems. Knowledge in those areas is tightly connected; some storage concepts in external persistent stores depend on material from the computer systems course, such as the difference between memory and disk storage. Therefore, it would be better to introduce the database course later and directly cover SQL and relational databases. Learning so much material across multiple classes at the same time puts a lot of pressure on students and leads to poor learning results.
ii. Min Wei believes that undergraduates need to be proficient in the basic knowledge he mentioned. He believes that basic knowledge is difficult to master, while the skills required for work are easy to master. I don't quite agree. Mastering theory does require more investment than practice and is more difficult, but mastering the skills needed for a job isn't easy either, and not every undergraduate needs to master a particular field. He notes that 4 employees of a big company who graduated five years ago could not answer his interview questions. Although the knowledge points he mentioned are very important, the fact that undergraduates have not mastered them does not mean that they cannot perform as engineers; after all, these four have pretty good jobs at a big company. Each graduate's job is different, and much basic knowledge is easily forgotten at work because it is rarely applied. Moreover, different jobs emphasize different applications. It is often enough to be able to apply knowledge proficiently, and what the work requires is the ability to keep learning.
iii. Mikey Goldweber mentioned that every CS course should incorporate ethical issues. However, I disagree. There is no need to address ethical issues individually or cover them in every course. It is enough for UBC to cover this topic in CPSC 310 because the relevant content is standardized and similar in every field. If it is covered in every course, it will bring extra work to students and distract them. We expect to be able to focus on the course material during the course.
b. Which of the 5 co-authors of the ACM Inroads paper is most supportive of teaching NoSQL in an undergraduate database or data management course? Justify your answer with examples from the article. (Use 1-2 paragraphs.)
• Mikey Goldweber is the biggest supporter of teaching NoSQL. He mentioned that "over the past few years he has been pruning away relational algebra and SQL, B+ trees, etc. to get a bigger NoSQL space". He believes that comparing and contrasting NoSQL and SQL is the best way to understand the internals of relational and non-relational databases.
• At the same time, some of the other authors did not mention NoSQL at all. Others mentioned NoSQL but did not single out its importance, putting it on the same level as other supplementary topics.
c. In terms of the zyBooks’ Chapters 7 and 9 topics (use the topic name and the zyBooks’ 3-number format like 7.3.1, for example, that is closest to the location of the topic) … what topics did most of the 5 ACM Inroads co-authors tend to agree should be included in the undergraduate curriculum? (Use 1-2 paragraphs.)
1. As highlighted in zyBooks section 9.1.6, the authors agree on the importance of incorporating SQL and NoSQL databases into database education. Mohamed, Mikey, Min, and Rajendra all emphasized the need for comprehensive learning covering both database types due to their relevance in different domains. Furthermore, as in zyBooks sections 7.2.1 and 7.2.2, the importance of cloud services and cloud databases in modern database courses is emphasized. Mohamed noted the practical applications of both technologies in data analytics, while Rajendra, Sherif, and Min recognized the growing importance of cloud-based solutions such as AWS DynamoDB and Azure CosmosDB. Therefore, database education should be broad, covering both traditional and modern database systems and technologies.
d. Suppose we do not have 1-2 extra weeks in CPSC 304 for more material, and suppose we had to remove some content to make room for some of the zyBooks’ Chapter 7 or 9 material, so that the kinds of topics still fit nicely with some of the general learning goals of CPSC 304. Which topics would you remove from CPSC 304 to make room, which topics (mentioned in zyBooks) would you add, and why do you think those topics can be removed? Justify your answer. (Use 2-3 paragraphs.)
Based on the conclusion from the previous question, we want to add material on cloud databases and NoSQL. In the CPSC 304 Data Warehousing unit, we would replace Star vs. Snowflake Schemas and Microsoft’s SQL Server and SQL Server Analysis Services with cloud databases. We would remove those topics because they are more focused on the application level. When we already have the Oracle SQL project, this additional application knowledge is unnecessary.
In the SQL unit, we would replace Embedded vs. Dynamic SQL with NoSQL. Embedded vs. Dynamic SQL is not that fundamental, and it is relatively difficult when we have not yet mastered SQL statements. Because we have just covered SQL at that point, it is also an appropriate place to add NoSQL.
7. More on zyBooks topics, CPSC 304, CPSC 404, and the 2022 CACM Seattle Report paper mentioned above:
Every 5 years, the database research community meets to discuss where we are heading in the database world. If you have not already done so, read the paper (described near the top of this document) titled “The Seattle Report on Database Research” by Abadi, et al. Then, after reading it, answer these questions:
a. According to the authors, what are the most important research problems that need to be addressed and worked on over the next few years? (Use 2 paragraphs.)
Database engine hardware:
• Adapt to heterogeneous computing environments: Because of hardware developments such as GPUs, FPGAs, and RDMA, database engines can take advantage of stack bypass. Although new SSDs are emerging, adopting a new generation of SSDs will cripple the memory system. NVRAM enables persistence and low latency, which also affects the database engine.
• Effectively manage data lakes:
b. Think back to our storage hierarchy in Chapter 9 of the textbook, and our set of PowerPoint slides on Disks. What hardware components mentioned in the paper are missing from our storage hierarchy triangle/pyramid? Where should they go? You may need to break down the layers of the hierarchy into finer resolution to include those components. (Use 1-2 paragraphs, and possibly a diagram.)
GPU and FPGA: They are not traditional storage hardware, but they are important for processing data at high speed. They play an important role in fast parallel tasks and compute-intensive work like DNNs. So, they could be placed between the CPU registers and the L1 cache: slower than registers but faster than the L1 cache.
High-speed SSD: It is faster than a traditional HDD but slower than main memory. So, it should be placed between main memory and local disks.
NVRAM: It is like an SSD, but NVRAM has a smaller storage capacity than an SSD and is implemented at the chip or module level. It is also slower than DRAM. So, it belongs between DRAM and local disks.
c. The paper discussed the use of data lakes. Data lakes are very popular in the data management space of the tech sector. Snowflake is a company that specializes in data lakes and data warehouses. It is listed on the New York Stock Exchange, and has a $52 billion market capitalization (i.e., # of shares outstanding × the share price, as of October 13, 2023), about the same value as about a year ago. Explain how a data scientist might use a data lake. In answering this question, think about what a data scientist does, and also reflect on the Seattle Report. Mention some specific examples from the Seattle Report in your answer.
Data scientists use data lake like a central service to store and manage data
in various formats. Because data lake offers flexible environment to perform
analytical tasks. Data Scientists usually works on data preparing, model building
and data analysis.
•
Data Pipelines:
Data scientists uses data lake to form complex data pipeline
which contains several stages and many participants. And they can access and
store different data sources to support the analytical tasks. Data Science
Notebooks contain different frameworks like Jupiter, Spark, and Zeppelin.
o
For example, a team may extract data from the data lake, another team
builds models on the data extracted, and users can access the data and
built models by an interactive dashboard. The paper talked about this
multi-stage pipeline and emphasis on the role the data lake plays in
sourcing data from heterogeneous data collection.
•
Visualization
Data Analysis:
Data scientists can use approximation and
progressive visualization techniques in a large size of the dataset.
17
o
The paper motioned about the use of approximation visualization in
query to reduce latency and analysis rapidly. These would help with
exploratory data analysis.
•
Data Quality and consistency:
data lake offers ability to maintain the data
quality and governance. So, data scientists can enforce schemas, validate data,
and transactions support to make sure the result is accurate.
o On page 74, the paper emphasises the importance of maintaining data governance guardrails within the data lake to preserve data consistency and quality.
• Serverless Support: Data scientists can use a data lake to work with large datasets in the cloud. A data lake supports scalable data exploration, which is especially useful in a cloud environment. They can use Apache Spark for large-scale data processing tasks.
o The paper mentions the potential of databases that have abundant resources in the cloud, and highlights the importance of solving open-ended problems in scalable data exploration.
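The approximation and progressive visualization idea from the list above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method described in the Seattle Report: the function names `progressive_mean` and `sample_mean`, the chunk size, and the synthetic `readings` column are all hypothetical. The sketch shows two ways to get an early answer over a large column: a running mean that is refined chunk by chunk, and a mean estimated from a random sample.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def progressive_mean(values: Iterable[float], chunk_size: int) -> Iterator[Tuple[int, float]]:
    """Yield (rows_seen, running_mean) after each chunk, so a chart can refresh early."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        if count % chunk_size == 0:
            yield count, total / count
    # Emit a final estimate if the last chunk was only partly filled.
    if count and count % chunk_size != 0:
        yield count, total / count


def sample_mean(values: List[float], sample_size: int, seed: int = 0) -> float:
    """Estimate the mean from a uniform random sample instead of a full scan."""
    if not values or sample_size <= 0:
        raise ValueError("need a non-empty list and a positive sample size")
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    return sum(sample) / len(sample)


# Hypothetical data standing in for a large numeric column read from the lake.
readings = [float(i % 100) for i in range(10_000)]
random.Random(42).shuffle(readings)

# Each tuple is an estimate that a dashboard could plot while the scan continues.
estimates = list(progressive_mean(readings, chunk_size=2_500))
quick_guess = sample_mean(readings, sample_size=500)
```

The early estimates trade accuracy for latency; the last progressive estimate equals the exact mean because by then every row has been seen.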
d. Think back to the “pink sheet” on “What is a database?” On that sheet, we discussed major database components. The left hand side of the sheet dealt with the data, and the right hand side dealt with indexes. Suppose we were to have a second “pink sheet” (one page) that deals with data lakes and their various input sources (specific NoSQL systems and other DB sources). What are the data sources mentioned in the Seattle paper and in the zyBooks materials that should be included on this sheet? Sketch a representation (maybe a graph, or a bunch of Word boxes and arrows, or a handwritten sketch) that relates the data sources that might go into a data lake. In your diagram, also list some processes that transform the content from the data lake into usable output. See the paper and zyBooks for some of these processes or transformations. (Besides providing your “sketch”, use several paragraphs to explain the content of your sketch, so that the marker can understand your work.)
1. We have many data sources. They can be NoSQL systems such as MongoDB, API sources, IoT devices, and cloud services. We need some software to package the data from these sources so that we can store it all in the same place. The book says that "data in a data lake is not cleansed, integrated, or restructured", so when we use a data lake we do not need to reconcile the different data types before storing them.
2. The data lake itself can run on AWS. According to the Seattle report, AWS serverless functions can provide flexible elasticity.
3. The data is mainly stored in the data lake, and the central storage can be disk-based.
4. Data management is connected to both storage and data processing. This draws on the catalog from CPSC 404, where metadata is very useful and can save computing costs.
5. Data processing is connected to the data lake. We use Apache tools (such as Apache Spark) and ETL to process the dataset.
6. In data analytics, users can use popular BI systems and ML pipelines.
7. The output is the result of data processing. It can be the result of visualization via ML
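The flow in the numbered points above (sources, raw storage, catalog, processing, output) can be sketched as a small in-memory model. This is a minimal sketch under stated assumptions, not a real data lake or any vendor's API: the `DataLake` class, the `parse` and `etl` helpers, the object keys, and the sample temperature records are all hypothetical. It shows raw payloads being stored untouched, a catalog holding the metadata, and an ETL step that interprets each format only when the data is read.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DataLake:
    """Toy in-memory lake: raw payloads plus a metadata catalog."""
    objects: Dict[str, str] = field(default_factory=dict)             # key -> raw payload
    catalog: Dict[str, Dict[str, str]] = field(default_factory=dict)  # key -> metadata

    def ingest(self, key: str, payload: str, source: str, fmt: str) -> None:
        # Store as-is: no cleansing, integration, or restructuring on write.
        self.objects[key] = payload
        self.catalog[key] = {"source": source, "format": fmt}

    def keys_with_format(self, fmt: str) -> List[str]:
        # Only the catalog is consulted, so no payload is parsed to locate data.
        return [k for k, meta in self.catalog.items() if meta["format"] == fmt]


def parse(payload: str, fmt: str) -> List[Dict[str, Any]]:
    """Schema-on-read: interpret the raw payload at processing time."""
    if fmt == "json":
        return json.loads(payload)
    if fmt == "csv":
        header, *lines = payload.strip().splitlines()
        names = header.split(",")
        return [dict(zip(names, line.split(","))) for line in lines]
    raise ValueError(f"unsupported format: {fmt}")


def etl(lake: DataLake) -> List[Dict[str, Any]]:
    """Extract every object, coerce it to one shape, and return rows for BI or ML."""
    rows: List[Dict[str, Any]] = []
    for key, meta in lake.catalog.items():
        for record in parse(lake.objects[key], meta["format"]):
            rows.append({
                "device": str(record["device"]),
                "temp_c": float(record["temp_c"]),  # raises if the value is not numeric
                "source": meta["source"],
            })
    return rows


lake = DataLake()
lake.ingest("mongo/readings.json", '[{"device": "a1", "temp_c": 21.5}]',
            source="MongoDB export", fmt="json")
lake.ingest("iot/readings.csv", "device,temp_c\nb2,19.0\nb3,20.5\n",
            source="IoT feed", fmt="csv")
rows = etl(lake)
```

Keeping the format in the catalog rather than inside each payload is the design choice that lets different data types coexist in one store: the lake never needs to understand a payload until a processing step asks for it.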