zyBooks_exercise_2023f

School: University of British Columbia
Course: 404
Subject: Computer Science
Date: Jan 9, 2024
Type: pdf
Pages: 18

CPSC 404: zyBooks Exercise and Questions
“Database Systems with SQL”

Due Date: Sunday, November 12, 2023 before 23:59 (near midnight)
10% penalty per half day late (i.e., 10% penalty of the online component if you’re late for that; 10% penalty of the written component if you’re late for that)

Last Update: Nov. 4, 2023 @ 16:15
  November 4, 2023 @ 16:15: Clarification for Part B’s Q2(a)(i)
  October 16, 2023 @ 15:30: Created/Posted

You have the choice of doing either:
1. The zyBooks online exercise with its online participation and challenge activities (the two chapters we’re interested in are quite doable and the activities are laid out like a tutorial), and you also need to answer the questions found in this document.
2. The SQL Server DBA lab exercise (posted on Canvas).
Don’t do both. Either of these will make up 10% of your course grade.

Check the course outline on Canvas for how to register for the zyBooks course. It costs about $64 USD (roughly $90 CAD on a credit card). Note that option (2) above is free, and is hosted by UBC’s Department of Computer Science.

This zyBooks exercise involves 2 parts: 5% of your overall grade will be for Part A (zyBooks online activities), and 5% will be for Part B (written answers to questions to be submitted on Canvas).

If you are using your name or CWL ID or CS ID when registering for zyBooks, that’s fine; we can easily identify you and we can download your zyBooks points, and, don’t worry, we will send you e-mail if we have any doubt. zyBooks won’t lose a record of your work on their site. However, if you are using a pseudonym (fake name to maintain privacy), then when you upload your written answers to Canvas Assignments, include a note that says what pseudonym or e-mail address you used for zyBooks, so that we can credit you with the points for successfully completing the online exercises.

zyBooks will track your online completion of the Participation Activities and Challenge Activities for the various sections.
After the due date, we will transfer these points to the Canvas gradebook.

In the following questions, where we have written “1-2 sentences” or “1 paragraph” (for example), you can always provide more sentences, if you wish. Note that where we have written “1 paragraph”, we expect 2-4 sentences (or more, if you wish), and not just a few words.
There are 2 database papers that tie in to the zyBooks material that you will need to read. Both are found in the ACM Digital Library. If you’re on a UBC VPN (or accessing it from a UBC computer), you get them for free. Just click on the link to get access (it’ll pop up a CWL sign-in if you’re not already connected) and download the PDF copy of each. The 2 papers are:

1. [ACM Inroads paper about DB education, NoSQL, etc.] Goldweber, Mikey; Wei, Min; Aly, Sherif; Raj, Rajendra K.; and Mokbel, Mohamed. “The 2022 Undergraduate Database Course in Computer Science: What to Teach”. ACM Inroads, Volume 13, Number 3, September 2022, pp. 16-21. https://dl.acm.org/doi/10.1145/3549545

2. [CACM paper about The Seattle Report on Database Research] Abadi, Daniel; Ailamaki, Anastasia; et al. (actually 33 authors). “The Seattle Report on Database Research”, Communications of the ACM, Volume 65, Issue 8, August 2022, pp. 72-79. https://dl.acm.org/doi/10.1145/3524284

This strategic research report comes out every 5 years, and it is the output from a large panel of database researchers that discusses research trends, needs, and challenges in database systems and data management. Its authors include our textbook’s co-author Raghu Ramakrishnan, Turing Award winner Mike Stonebraker (a database person who won this Nobel-like prize), ARIES crash-recovery inventor C. Mohan of IBM, other database textbook authors, etc. The report discusses some of the topics mentioned in your zyBooks activities including data lakes, big data, machine learning, data science, data integration, cloud services for databases, key-value DBs, wide-column DBs, scaling, etc. This 2022 CACM paper is an update of the original 2018 Seattle Report that includes additional commentary and progress since then. So, it’s kind of a “greatest hits” type of paper, and it will let us work with the latest findings.

Part A: The zyBooks reading, participation activities, and challenge activities

You need to read the materials, follow the animations, and answer the multiple choice, short answer, drag-and-drop, etc. questions. There are no questions whose answers are not (reasonably) found within the zyBooks materials. After you answer, they even tell you which answers are wrong, and you get to repeat the questions. Correct your wrong answers to get full credit.

We urge you to take the activities seriously because these are good topics in data management. A future employer may be happy that you have some understanding of these topics. We will track your activities, and award you points for successfully completing the activities. Those marks will be imported into the Canvas Gradebook.
We will only do 2 chapters in the zyBooks course materials. In particular, we will focus on these chapters and the following major sections within the chapters:

Chapter 7: Database Architectures, but only the following sections:
1. MySQL Architecture
2. Cloud Databases: a comparison of on-premise services, IaaS, PaaS, and SaaS
3. Distributed (and Parallel) Databases
4. Replicated Databases
5. n/a (skip Data Warehouses; now part of CPSC 304)
6. n/a (skip Data Warehouse Design)
7. Other Databases, and this includes: Data Lakes, Embedded Databases, Federated Databases, and In-Memory Databases

Chapter 9: NoSQL Databases and Big Data
1. Big Data Databases
2. Key-Value Databases
3. Wide-Column Databases
4. Document Databases
5. Graph Databases
6. MongoDB

Part B: Questions and Answers

There are a series of questions on the following pages. We’re looking for good answers to each, but they don’t have to be bullet-proof explanations (but they need to be correct, not just a “good effort”). You can answer these questions as you go along in zyBooks’ activities; or, you can save them for afterwards. Some questions require some additional reading, namely the ACM Inroads paper and the CACM paper about The Seattle Report. Be sure to answer the questions on your own, and not copy someone else’s work. The use of ChatGPT is not allowed for this assignment.

Here are the questions:

1. Cloud Databases
a. Hand in screenshots of your completed Challenge Activity 7.2.1, for two different sets of questions. (You might have to make multiple attempts, and there are different sets of questions presented when you re-take it.)
This is actually a nice summary and comparison of on-premise services, IaaS, PaaS, and SaaS. You might want to refer to Participation Activity 7.2.2 when doing this. Be careful to read the instructions (they can change between attempts).

(at the very bottom)

2. Parallel Databases
a. For this question, use the sailing database that we have been discussing in our lectures, pre-class exercises, in-class exercises, and sample questions from the assignments and exams. Provide one example of a query that falls into each of the 3 categories in zyBooks’ Participation Activity 7.3.1. So, there will be 3 queries in all. If you wish to provide the actual SQL, that’s fine (you don’t have to), but for this question, we want you to explain the query in words, and briefly justify why it fits into that category. (Use 1-2 paragraphs for each.)
i. In other words, there are 3 memory architectures described and we want you to provide/explain a sailing query that would perform best under that architecture, but not very well on the other two architectures.

1. Shared-Memory Computer
Query example: "Find the average age of sailors who have reserved a specific type of boat."
Explanation:
In a shared-memory computer: Complex operations requiring frequent and fast memory accesses are efficient because all processors access the same memory. The given query involves joining Sailors and Reserves and then calculating the average. Shared memory gives every processor fast access to the same intermediate data.
In a shared-storage computer: Although the processors can handle disk operations efficiently, the lack of shared memory can slow down operations that require frequent inter-processor communication or access to common data sets. Calculating an average over joined tables requires more coordination between processors because each processor uses its own memory instead of a common memory space.
In a shared-nothing computer: Each processor operates independently using its own memory and storage. This independence can lead to inefficiencies in queries that require frequent access to, and combination of, data from multiple sources. The overhead of distributing, processing, and aggregating data across different nodes can be significant compared to a shared-memory environment.

2. Shared-Storage Computer
Query example: "List all sailors and their corresponding boat reservations, sorted by date."
Explanation:
In a shared-storage computer: Each processor has its own memory but shares storage, which is ideal for tasks involving many disk read/write operations. This query requires accessing and merging large amounts of data from two tables, which a shared-storage system can handle efficiently. Independent memory in this architecture avoids memory contention, while shared disk access enables efficient data retrieval.
In a shared-memory computer: While processors can quickly access common memory, they may be bottlenecked by disk I/O. Reading large amounts of data from disk and writing it back becomes less efficient due to potential I/O contention between multiple processors.
In a shared-nothing computer: Shared-nothing architectures excel at parallel processing but may be less effective for tasks that require frequent simultaneous access to shared storage resources. Our query involves bulk reading and sorting of data, which may be less efficient in a shared-nothing system due to the overhead of data distribution and subsequent aggregation.

3. Shared-Nothing Computer
Query example: "Identify sailors who have not made any reservations in the past year." Also consider the case where the computers are in separate geographical locations.
Explanation:
In a shared-nothing computer: Each processor has its own memory and storage, so these systems excel at independently processing large data sets in parallel. This query involves filtering and cross-referencing large data sets that can be efficiently distributed across multiple nodes in a shared-nothing architecture. Each node can process a portion of the data, significantly reducing processing time.
In a shared-memory computer: While a shared-memory system allows fast access to a common memory pool, it may not be as efficient for tasks that can be easily parallelized across independent data sets. Our query benefits from being split and processed across multiple nodes, a scenario that is not well suited to shared-memory systems because their focus is on shared data processing rather than distributed, independent data processing.
In a shared-storage computer: Shared-storage systems are well suited to tasks involving a large number of disk operations but may have difficulty handling highly parallelized tasks that can be distributed across multiple nodes. They will be relatively poor at distributing and processing this query efficiently due to their reliance on a central storage resource.

3. Big Data, Key-Value Databases, Wide-Column Databases, and Document Databases
In each of the following cases, use the zyBooks content/explanations to answer the questions:
a. What is meant by the term sharding and how does it differ from partitioning? (Use 2-3 sentences.)
i. Sharding splits data sets across multiple machines, which enables horizontal scaling and supports large volumes of data. In comparison, partitioning splits data sets across multiple files on a single machine.

b. Think back to the principles found in our Hash-Based Indexes unit. When assigning documents to shards, under what circumstances would you use a hash function instead of a range function? (Use 1-2 sentences.)
i. Hash buckets can be stored on different machines, enabling horizontal scaling. With a range function, each shard contains a contiguous range of shard key values.
ii. So, if we need data to be evenly distributed across shards and do not need to maintain any order between documents, we can use a hash function instead of a range function when assigning documents to shards.

c. Explain the difference between a row in a wide column database and a row in a relational table. (Use 1-2 sentences.)
i. In a wide-column database, rows are physically sorted on the key, each shard contains rows with a range of key values, and different rows may have different sets of columns. In a relational table, however, every row contains the same set of columns, defined by the table's fixed schema.

d. How do timestamps work in a wide-column DB, and why are they needed? (Use 2-3 sentences.)
i. In a wide-column database, a timestamp stores the date and time when a version of a value was created. We need timestamps to access old versions of data. This feature is needed because wide-column storage keeps multiple versions of the data in the same column for a given row, so users can retain multiple versions of data over time.

4. Graph Databases
Let us draw upon some of your CPSC 304 background.
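Returning briefly to question 3(b) above, the hash-versus-range trade-off can be made concrete with a small sketch. This is only an illustration under stated assumptions: the shard count, the range boundaries, and the helper names (`rangeShard`, `hashShard`, `countPerShard`) are hypothetical and do not come from zyBooks or MongoDB, and the "hash" is a deliberately simple stand-in (key modulo shard count) rather than a production hash function.

```javascript
// Hypothetical setup for illustration only: 4 shards, integer shard keys.
const NUM_SHARDS = 4;

// Range sharding: each shard owns a contiguous range of key values.
// Shards 0..2 hold keys below 25, 50, and 75; shard 3 holds keys >= 75.
const RANGE_UPPER_BOUNDS = [25, 50, 75];

function rangeShard(key) {
  const i = RANGE_UPPER_BOUNDS.findIndex((upper) => key < upper);
  return i === -1 ? RANGE_UPPER_BOUNDS.length : i;
}

// Hash sharding: a toy stand-in hash (key modulo shard count).
// Neighbouring keys land on different shards, so key order is not preserved.
function hashShard(key) {
  return key % NUM_SHARDS;
}

// Count how many keys each shard receives under a given assignment function.
function countPerShard(keys, assign) {
  const counts = new Array(NUM_SHARDS).fill(0);
  for (const key of keys) counts[assign(key)] += 1;
  return counts;
}

// A skewed workload: ten consecutive "recent" keys, 90 through 99.
const recentKeys = Array.from({ length: 10 }, (_, i) => 90 + i);

const rangeCounts = countPerShard(recentKeys, rangeShard);
const hashCounts = countPerShard(recentKeys, hashShard);

console.log("range:", rangeCounts); // every key lands on the last shard
console.log("hash: ", hashCounts);  // keys are spread over all four shards
```

With this skewed workload, range assignment sends every key to one shard, while the toy hash spreads them out. The flip side is that a range query such as "keys 90 to 99" touches a single shard under range assignment but every shard under hash assignment, which is why hashing fits best when ordering between documents does not matter.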
a. Vertices in property graphs, and tables in relational databases, serve different structural purposes; but in what way might they conceptually converge? Provide a specific example or diagram from the zyBooks or CPSC 304/404 course material that helps illustrate this similarity or connection. (Use 1-2 sentences.)
i. A vertex is like an entity in a relational database.
ii. Vertices can have properties.
iii. Entities can have attributes.

b. Edges provide relationships in property graphs. Thinking back to CPSC 304, what would be a similar analogy in CPSC 304 material? (Use 1-2 sentences.)
i. In the CPSC 304 material, a relationship in a relational database is a similar concept to an edge in a property graph. A relationship links one entity to another, establishing connections between different tables. An edge can have properties in a property graph, just as a relationship can have attributes in a relational database.

c. Property graphs allow vertices and edges to have properties. Contrast this with the columnar data of relational databases. In particular, explain a scenario where the flexibility of property graphs offers a significant advantage over traditional relational databases. In your answer, reference any relevant zyBooks material that you've encountered. (Use 1 paragraph.)
i. Graph databases are a type of NoSQL database and have a flexible schema. They focus on highly connected data with an intrinsic need for relationship analysis. New vertices, edges, and attributes can be added at any time, and relationships are first-class data in their own right. Attribute names are stored together with attribute values, so vertices and edges with the same label can have different attribute names.
ii. Relational databases need a fixed schema. Adding a new column or relationship may require schema changes or produce tables containing many null values.
iii. Property graphs are advantageous for modeling complex and irregular relationships, when retrieving the data is more important than storing it, and when data is highly connected, as in knowledge graphs, where each edge between vertices may have different attributes.
iv. For example, in a graph of database-knowledge vertices, one connection might contain attributes such as "from CPSC 304", "A based on B", and "together as improvement", while another connection might only have "from CPSC 404". In a relational database, representing this often requires multiple tables with many null-valued columns to cover all properties, or a complex schema with many joined tables.

5. MongoDB

MongoDB is currently one of the top-ranked NoSQL database companies. It is listed on the Nasdaq stock exchange. It has a $25 billion market capitalization (number of shares outstanding × current stock price, as of October 13, 2023), double what it was about a year ago. So, it looks like there is significant commercial interest in, or at least high expectations for, NoSQL document databases.
Similar to the automobile and student databases in the MongoDB exercises on zyBooks, create a different database (on paper by typing out your MongoDB commands). Choose a small DB of your own on a topic that interests you (e.g., a hobby; music; books; movies; sports (perhaps one of: hockey, football, basketball, baseball, etc.)). Don’t copy it from elsewhere. Create your own from scratch. Do the following:

a. Insert 5 rows into the database subject to the following criteria:
1. There are at least 5 fields for each record.
2. At least one of the fields is numeric, at least one is a string, and at least one is an array.

    use books
    db.books.insertOne({ title: 'Great Gatsby', author: 'Bob', publicationYear: 1949, genres: ['Novel'], rating: 8.5 })
    db.books.insertOne({ title: 'Political', author: 'John', publicationYear: 1956, genres: ['Political fiction'], rating: 9.0 })
    db.books.insertOne({ title: 'a bird', author: 'Lee', publicationYear: 1949, genres: ['Novel'], rating: 8.7 })
    db.books.insertOne({ title: 'Pride', author: 'Amy', publicationYear: 1949, genres: ['Novel'], rating: 8.8 })
    db.books.insertOne({ title: 'The Hobbit', author: 'Amy', publicationYear: 1937, genres: ['Fantasy', "Children's literature"], rating: 9.2 })

b. Using some of the query operators shown in zyBooks Table 9.6.1 (not query operators from elsewhere), create 4 different, non-trivial MongoDB queries that search your database and return the following results. Each query must satisfy a different one of the following 4 constraints, thus making up a total of 4 queries in all:

1. Return 1 record, using a single condition in the MongoDB search clause (similar to having 1 WHERE-clause condition in an SQL statement)

    db.books.find({ author: { $eq: 'Lee' } })
    // Returns the book whose title is 'a bird'

2. Return 1 record, using 2 or more conditions in the search clause

    db.books.find({ $and: [ { author: { $eq: 'Amy' } }, { publicationYear: { $eq: 1937 } } ] })
    // Returns the book whose title is 'The Hobbit'

3. Return many records, using 1 condition in the search clause

    db.books.find({ author: { $eq: 'Amy' } })
    // Returns the books whose titles are 'The Hobbit' and 'Pride'

4. Return many records, using 2 or more conditions in the search clause

    db.books.find({ $and: [ { author: { $in: ['Amy', 'John'] } }, { publicationYear: { $lte: 1949 } } ] })
    // Returns the books whose titles are 'The Hobbit' and 'Pride'

c. Using some of the update operators shown in zyBooks Table 9.6.2 (not update operators from elsewhere), create 3 different update statements, each of which satisfies a different one of these 3 cases, thus making up 3 update statements in all:

1. Update no record (for 1 of the update statements)

    db.books.updateOne({ author: 'NOTEXIST' }, { $rename: { name: 'Oliver' } })

2. Update 1 record (for 1 of the update statements)

    db.books.updateOne({ author: 'Lee' }, { $inc: { publicationYear: 2 } })

3. Update 2 or more records (for 1 of the update statements)

    // updateMany is needed here: updateOne changes at most one matching document
    db.books.updateMany({ author: 'Amy' }, { $set: { rating: 9.9 } })

6. More on zyBooks topics, CPSC 304, CPSC 404, and the ACM Inroads paper:
The following questions deal with the ACM Inroads paper by Goldweber, et al., mentioned near the top of this document, and found in the ACM Digital Library:

a. Point out some things in the article that you disagree with. Justify your answer. (Use 2 paragraphs.)
i. Rajendra points out that “the notion of database as the persistent store of valuable data should be introduced into CS curriculum as soon as feasible without initially requiring a full database course.” I disagree with that for two reasons. First, models built around the fundamental concept of an external persistent store, like LINQ, are not popular in industry, so they may not help students who want to become software engineers or database engineers in their careers. Second, for a junior-year CS student, introducing database knowledge too early may cause more confusion when we still do not have a good understanding of computer engineering, data structures and algorithms, and computer systems. The knowledge in those areas is tightly connected, and some storage knowledge about external persistent stores may depend on material learned in the computer systems course, like the difference between memory and disk storage. Therefore, it would be better to introduce the database course not so early and to directly cover SQL and relational databases. Learning so much
knowledge and multiple classes at the same time will put a lot of learning pressure on students and lead to poor learning results.
ii. Min Wei believes that undergraduates need to be proficient in the basic knowledge he mentioned. He believes that basic knowledge is difficult to master, but the skills required for work are easy to master. I don't quite agree. Compared with practice, mastering theory does require more investment and is more difficult. But mastering the skills needed for a job isn't easy either, and that doesn't mean every undergraduate student needs to master a particular field. Four employees of a big company who graduated five years ago could not answer his interview questions. Although the knowledge points he mentioned are very important, the fact that those graduates have not mastered them does not mean that they cannot perform in the field of engineering; after all, these four have pretty good jobs at a big company. This is because each graduate's area of work is different, and much basic knowledge is easily forgotten at work because it is rarely applied. Moreover, different jobs emphasize different applications. It is often enough to be able to apply knowledge proficiently, and what the work requires is the ability to continue learning.
iii. Mikey Goldweber mentioned that every CS course should incorporate ethical issues. However, I disagree with that. There is no need to address ethical issues individually or cover them in every course. It is enough for UBC to cover this topic in CPSC 310 because the relevant content is standardized and similar in every field. If it is covered in every course, it will bring extra work to students and distract them. We expect to be able to focus on the course material during the course.

b. Which of the 5 co-authors of the ACM Inroads paper is most supportive of teaching NoSQL in an undergraduate database or data management course? Justify your answer with examples from the article. (Use 1-2 paragraphs.)
Mikey Goldweber is the biggest supporter of NoSQL teaching. He mentioned that "over the past few years he has been pruning away relational algebra and SQL, B+ trees, etc. to get a bigger NoSQL space". He believes that comparing and contrasting NoSQL and SQL is the best way to understand the internals of relational and non-relational databases. By contrast, some of the other authors did not mention NoSQL at all. Others mentioned NoSQL but did not single out its importance, placing it on the same level as other supplementary knowledge.

c. In terms of the zyBooks Chapters 7 and 9 topics (use the topic name and the zyBooks’ 3-number format like 7.3.1, for example, that is closest to the location of the topic), what topics did most of the 5 ACM Inroads co-authors tend to agree should be included in the undergraduate curriculum? (Use 1-2 paragraphs.)
1. As highlighted in zyBooks section 9.1.6, the authors agree on the importance of incorporating SQL and NoSQL databases into database education. Mohamed, Mikey, Min, and Rajendra all emphasized the need for comprehensive learning covering both database types due to their relevance in different domains. Furthermore, zyBooks sections 7.2.2 and 7.2.1 emphasize the importance of cloud services and cloud databases in modern database courses. Mohamed noted the practical applications of both technologies in data analytics, while Rajendra, Sherif, and Min recognized the growing importance of cloud-based solutions such as AWS DynamoDB and Azure CosmosDB. Therefore, database education should be broad, covering both traditional and modern database systems and technologies.

d. Suppose we do not have 1-2 extra weeks in CPSC 304 for more material, and suppose we had to remove some content to make room for some of the zyBooks Chapter 7 or 9 material, so that the kinds of topics still fit nicely with some of the general learning goals of CPSC 304. Which topics would you remove from CPSC 304 to make room, which topics (mentioned in zyBooks) would you add, and why do you think those topics can be removed? Justify your answer. (Use 2-3 paragraphs.)
Based on the conclusion from the previous question, we want to add cloud databases and NoSQL.
In the CPSC 304 Data Warehousing unit, we would replace Star vs. Snowflake Schemas and Microsoft’s SQL Server and SQL Server Analysis Services with cloud databases. We would delete that content because Star vs. Snowflake Schemas and Microsoft's SQL Server and SQL Server Analysis Services are more focused on the application level. Since we already have the Oracle SQL project, this additional application knowledge is unnecessary.
In the SQL unit, we would replace Embedded vs. Dynamic SQL with NoSQL. This is because Embedded vs. Dynamic SQL is relatively less fundamental. When we have not yet become fluent in writing SQL statements, this material is relatively difficult. And because we have just covered SQL at that point, it is an appropriate place to add NoSQL.

7. More on zyBooks topics, CPSC 304, CPSC 404, and the 2022 CACM Seattle Report paper mentioned above:
Every 5 years, the database research community meets to discuss where we are heading in the database world. If you have not already done so, read the paper (described near the top of this document) titled “The Seattle Report on Database Research” by Abadi, et al. Then, after reading it, answer these questions:

a. According to the authors, what are the most important research problems that need to be addressed and worked on over the next few years? (Use 2 paragraphs.)
Database engines and hardware (adapting to heterogeneous computing environments): Because of developments in hardware such as GPUs, FPGAs, and RDMA, database engines can take advantage of stack bypass. New SSDs are emerging, and adopting the new generation of SSDs may erode some of the benefits of in-memory systems. NVRAM offers persistence and low latency, which also affects database engine design.
Effectively managing data lakes.

b. Think back to our storage hierarchy in Chapter 9 of the textbook, and our set of PowerPoint slides on Disks. What hardware components mentioned in the paper are missing from our storage hierarchy triangle/pyramid? Where should they go? You may need to break down the layers of the hierarchy into finer resolution to include those components. (Use 1-2 paragraphs, and possibly a diagram.)
GPU and FPGA: They are not traditional storage hardware, but they are important for processing data at high speed. They play an important role in fast parallel tasks and compute-intensive work like DNNs. So, they could be placed below CPU registers (slower) but above the L1 cache (faster).
High-speed SSD: It is faster than a traditional HDD but slower than main memory. So, it should be placed between main memory and local disks.
NVRAM: It is like an SSD, but NVRAM has a smaller storage capacity than an SSD and is implemented at the chip or module level. It is also slower than DRAM. So, it belongs between DRAM and local disks.

c. The paper discussed the use of data lakes. Data lakes are very popular in the data management space of the tech sector. Snowflake is a company that specializes in data lakes and data warehouses. It is listed on the New York Stock Exchange, and has a $52 billion market capitalization (i.e., # of shares outstanding × the share price, as of October 13, 2023), about the same value as about a year ago. Explain how a data scientist might use a data lake. In answering this question, think about what a data scientist does, and also reflect on the Seattle Report. Mention some specific examples from the Seattle Report in your answer.
Data scientists use a data lake as a central service to store and manage data in various formats, because a data lake offers a flexible environment for analytical tasks. Data scientists usually work on data preparation, model building, and data analysis.
• Data pipelines: Data scientists use a data lake to form complex data pipelines that contain several stages and many participants. They can access and store different data sources to support the analytical tasks. Data science notebooks and frameworks such as Jupyter, Spark, and Zeppelin support this work.
  o For example, one team may extract data from the data lake, another team builds models on the extracted data, and users access the data and the models through an interactive dashboard. The paper discusses this multi-stage pipeline and emphasizes the role the data lake plays in sourcing data from heterogeneous data collections.
• Visualization and data analysis: Data scientists can use approximation and progressive visualization techniques on large datasets.
  o The paper mentions the use of approximate visualization in queries to reduce latency and enable rapid analysis. These techniques help with exploratory data analysis.
• Data quality and consistency: A data lake offers the ability to maintain data quality and governance. So, data scientists can enforce schemas, validate data, and use transaction support to make sure the results are accurate.
  o On page 74, the paper emphasizes the importance of maintaining data governance guardrails within the data lake to preserve data consistency and quality.
• Serverless support: Data scientists can use a data lake to deal with large datasets in the cloud, because a data lake supports scalable data exploration, which is especially useful in a cloud environment. They can use Apache Spark for large-scale data processing tasks.
  o The paper mentions the potential for databases with abundant resources in the cloud and highlights the importance of solving open-ended problems in scalable data exploration.

d. Think back to the “pink sheet” on “What is a database?” On that sheet, we discussed major database components. The left hand side of the sheet dealt with the data, and the right hand side dealt with indexes. Suppose we were to have a second “pink sheet” (one page) that deals with data lakes and its various input sources (specific NoSQL systems and other DB sources). What are the data sources mentioned in the Seattle paper and in the zyBooks materials that should be included on this sheet? Sketch a representation (maybe a graph, or a bunch of Word boxes and arrows, or a handwritten sketch) that relates the data sources that might go into a data lake. In your diagram, also list some processes that transform the content from the data lake into usable output. See the paper and zyBooks for some of these processes or transformations. (Besides providing your “sketch”, use several paragraphs to explain the content of your sketch, so that the marker can understand your work.)
1. We have many data sources: NoSQL systems like MongoDB, API sources, IoT devices, and cloud services. We need some software to package the data from these sources, and then we can store them in the same place. As the book says, "data in a data lake is not cleansed, integrated, or restructured", so when we use a data lake we do not need to worry about different data types coexisting.
2. The data lake itself can run on AWS. According to the Seattle Report, serverless functions such as those on AWS can provide flexible elasticity.
3. The data will be mainly stored in the data lake, and the central storage can be a disk.
4. Data management is connected to storage and data processing. This draws on the catalog from CPSC 404, where the metadata is very useful and can save computing costs.
5. Data processing is connected to the data lake. We use Apache tools and ETL to process the dataset.
6. In data analytics, users can use popular BI systems and ML pipelines.
7. The output is the result of data processing. It can be, for example, a visualization of ML results.
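
The flow described in points 1 to 7 can be illustrated with a very small sketch. Everything here is an assumption made for illustration: the record shapes, the `readers` map, and the helper names (`buildCatalog`, `transform`) are hypothetical and are not taken from the Seattle Report, zyBooks, or any real data-lake product. The sketch only shows the idea that raw records land in the lake unchanged, a catalog keeps metadata about them, and structure is applied later by a processing step.

```javascript
// Hypothetical raw records: stored in the "lake" exactly as each source produced them.
const lake = [
  { source: "mongodb", payload: { title: "The Hobbit", rating: 9.2 } },
  { source: "iot", payload: "sensor-7,21.5" },
  { source: "api", payload: { name: "Pride", score: "8.8" } },
];

// Catalog (metadata): how many raw records came from each source.
function buildCatalog(records) {
  const catalog = {};
  for (const r of records) {
    catalog[r.source] = (catalog[r.source] || 0) + 1;
  }
  return catalog;
}

// Processing step: one reader per source maps its raw payload to a common shape.
// Structure is applied only when the data is read, not when it is stored.
const readers = {
  mongodb: (p) => ({ label: p.title, value: p.rating }),
  api: (p) => ({ label: p.name, value: Number(p.score) }),
  iot: (p) => {
    const [id, reading] = p.split(",");
    return { label: id, value: Number(reading) };
  },
};

function transform(records) {
  return records.map((r) => readers[r.source](r.payload));
}

const catalog = buildCatalog(lake);
const rows = transform(lake);

console.log(catalog); // metadata about the lake's contents
console.log(rows);    // uniform rows ready for BI tools or an ML pipeline
```

Keeping one reader per source means a new source can be added to the lake without touching the stored data, at the cost of pushing cleaning work into the processing stage.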