Practice Midterm 2 - With Answers

University of Pennsylvania, CIS 5500 (Computer Science)
Feb 20, 2024 · 15 pages · uploaded by momoshen17
CIS 5500: Database and Information Systems
Practice Midterm 2

Question 1 (20 points, General)

For each of the following questions, select the (one) correct answer.

1. (2 points) If every transaction is well-formed and uses two-phase locking, then any legal interleaving of such transactions is:
   A) serial
   B) free of conflicts
   C) serializable
   D) deadlock-free

2. (2 points) Suppose that T1 has a higher priority than T2, and that T1 holds a shared lock on A when T2 requests a shared lock on A. Which of the following would happen?
   A) Under WOUND-WAIT: T2 will wait
   B) Under WAIT-DIE: T2 will die
   C) If deadlock detection is used: an edge will be added from T2 to T1 in the WAIT-FOR graph.
   D) T2 will be granted the lock under any deadlock detection/prevention strategy

3. (2 points) Which of the following statements are true about isolation levels?
   A) Using transactions with higher isolation levels (SERIALIZABLE being highest) decreases undesirable phenomena and the amount of parallelism.
   B) A read-write transaction cannot use isolation level READ COMMITTED.
   C) A transaction at isolation level REPEATABLE READ ensures that rerunning the same SQL query twice in the transaction gives the same result.
   D) In isolation level READ UNCOMMITTED, locks on updated data are held until the end of the transaction.
4. (2 points) Suppose you have a relation R over which an Alternative 1 index (called I1) has been set up over attribute A and an Alternative 2 index (called I2) is set up over attribute B. Which of the following is true?
   A) The data records are sorted over B.
   B) The data entries in I1 are the data records.
   C) The data entries in I1 contain a pointer to the data records.
   D) I2 is a sparse index.

5. (2 points) Which of the following statements is true about B+ trees?
   A) The number of page I/O's to find data entry k* is log_F(N), where F is the fanout and N is the number of leaf nodes.
   B) A B+ tree index on a composite search key <sid, year> is useful for an SQL query whose selection condition is (year == 'SOPHOMORE').
   C) Every search key value occurring in an index entry page (non-leaf node) must occur in some data entry page (leaf node).
   D) The smallest key value on every data entry page (leaf node) appears somewhere on an index page.

6. (2 points) Which of the following is NOT true about NoSQL solutions?
   A) There are multiple types of NoSQL solutions.
   B) The schema of data is flexible.
   C) They scale by partitioning data and distributing computation over a cluster of machines.
   D) The replication of data helps updates run faster.

7. (2 points) Which of the following statements is NOT true about relational algebra?
   A) It is better suited for query optimization than SQL.
   B) For all relations R, S, T, the result of R ⋈ (S ⋈ T) is the same as (R ⋈ S) ⋈ T, where ⋈ denotes natural join.
   C) Projection and selection commute with each other.
   D) Pushing projections down can help reduce the size of intermediate query results, i.e. the number of bytes required to store the results.
8. (2 points) Suppose that a relation R fits on N = 100 pages, and that there are B = 4 buffers of memory being used. Which of the following is true about sorting R using External Merge Sort?
   A) At the end of pass 0, each page of R is sorted internally but not relative to any other page.
   B) At the end of pass 0 there are 34 sorted runs of 3 pages.
   C) After pass 0, 4 sorted runs are merged during each pass.
   D) R can be sorted in 3 passes after pass 0 (4 passes total).

9. (2 points) What is the most dominant cost in query processing?
   A) The time to read/write a page from disk into buffered memory.
   B) The time to operate on data in buffered memory, e.g. to calculate the join between tuples in memory.
   C) The time it takes the page to be located on disk.

10. (2 points) Which of the following statements is NOT true about EXPLAIN?
   A) It is a profiling tool that shows how an SQL query will be executed.
   B) The number of rows and time indicated at each operation are accurate since the query is executed and the numbers are recorded.
   C) The number of rows and time indicated at each operation are estimates that may be quite different from those obtained when the query is executed.
   D) It can help the user to identify redundant operations in an SQL query.
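The pass count asked about in question 8 can be checked mechanically. Below is a minimal sketch (not part of the exam) of the standard External Merge Sort arithmetic: pass 0 produces ceil(N/B) sorted runs of B pages each, and every subsequent pass does a (B-1)-way merge.

```python
import math

def merge_sort_passes(n_pages: int, n_buffers: int) -> int:
    """Total passes for External Merge Sort over n_pages using n_buffers."""
    runs = math.ceil(n_pages / n_buffers)         # sorted runs after pass 0
    passes = 1                                    # count pass 0 itself
    while runs > 1:
        runs = math.ceil(runs / (n_buffers - 1))  # (B-1)-way merge per pass
        passes += 1
    return passes

print(merge_sort_passes(100, 4))  # -> 4: pass 0 plus 3 merge passes (25 -> 9 -> 3 -> 1)
```

With N = 100 and B = 4, pass 0 yields 25 sorted runs of 4 pages (so options A and B are false), each merge pass combines B-1 = 3 runs (so C is false), and 3 merge passes finish the sort, which is why D is the answer.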
Question 2 (8 points, Serializability)

Consider the following schedule S of three transactions, ignoring for now the "?" in the last row:

T1 T2 T3 READ(B) WRITE(A) WRITE(B) WRITE(A) READ(A) WRITE(B) WRITE(A) ?

1. (2 points) Is S serial? Prove your answer.

It is not serial because the actions of transaction T2 are not consecutive.

2. (4 points) Prove that S is serializable, and give an equivalent serial schedule. (Recall that the "?" is to be ignored.)

The precedence graph of S has vertices T1, T2, T3 and edges T3 -> T2, T2 -> T1, and T3 -> T1. It has no cycles, so S is serializable. An equivalent serial schedule is T3 -> T2 -> T1 (a topological ordering of the vertices of the precedence graph).

3. (2 points) Now consider the "?" in the last line of S. Give an action that replaces the "?" (i.e. a READ or WRITE on some data item by T3) and makes the resulting schedule NOT serializable.

Add READ(B) or WRITE(B). This adds an edge T2 -> T3 to the precedence graph, creating a cycle.
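The precedence-graph test used in this answer can be sketched in code. This is an illustrative sketch, not part of the exam: each step is a (transaction, action, item) triple, the small example schedule is made up, and an edge Ti -> Tj is added whenever an action of Ti conflicts with a later action of Tj (same item, at least one write).

```python
from graphlib import TopologicalSorter, CycleError

def precedence_edges(schedule):
    """Edges Ti -> Tj: Ti's action conflicts with a later action of Tj."""
    edges = set()
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and 'W' in (ai, aj):
                edges.add((ti, tj))
    return edges

def serial_order(schedule):
    """An equivalent serial order, or None if the graph has a cycle."""
    graph = {}  # TopologicalSorter wants node -> set of predecessors
    for u, v in precedence_edges(schedule):
        graph.setdefault(v, set()).add(u)
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None  # cycle => not conflict-serializable

s = [('T3', 'W', 'A'), ('T2', 'R', 'A'), ('T2', 'W', 'B'), ('T1', 'R', 'B')]
print(serial_order(s))  # -> ['T3', 'T2', 'T1']
```

Replacing the "?" with a conflicting action, as in part 3, corresponds to adding a back edge that makes `serial_order` return None.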
Question 3 (12 points, Locking)

Now consider the following schedule, in which time is indicated in the first column for clarity:

Time  T1         T2         T3
1     READ1(A)
2                           READ3(A)
3                           WRITE3(A)
4                READ2(B)
5                WRITE2(B)
6                WRITE2(A)
7     WRITE1(B)

Suppose all transactions are well-formed, use strict locking, and attempt to acquire each lock in the highest mode needed just before the first action on the data item (e.g. T1 requests an XLOCK(B) at time 7, just before the WRITE(B)). Assume that deadlock prevention is NOT being used (e.g. neither wound-wait nor wait-die).

1. (3 points) When is the earliest point in the schedule at which a deadlock will be detected by a deadlock detection-and-resolution algorithm? Explain your answer using the wait-for graph.

The earliest that a deadlock will be detected is at time 7 (WRITE1(B)), when T1 requests an XLOCK on B, which is held by T2. The wait-for graph has vertices T1, T2, T3 and edges T1 -> T2, T3 -> T1, and T2 -> T1. It has a cycle involving vertices T1 and T2 (edges T1 -> T2 and T2 -> T1). The presence of the cycle enables the algorithm to detect that a deadlock involving T1 and T2 has occurred, and one of the two transactions would be aborted.

2. (3 points) Suppose now that WAIT-DIE deadlock prevention was being used, and assume that a transaction's start time determines its priority (so T1 is highest and T2 is lowest). When is the earliest point in the given schedule that a transaction would be aborted, and which transaction would be aborted?
Transaction T3 is the first to be aborted, at time 2 (READ3(A)), when T3 requests an XLOCK on A, which is held by T1; under WAIT-DIE the younger T3 dies.

3. (6 points) Recall the isolation levels Read Committed (RC) and Repeatable Read (RR), and how they are implemented using locking. We call RR a "higher" isolation level than RC. Can the given schedule be realized using these isolation levels for each transaction (T1, T2, T3)? If so, give the highest isolation levels for each transaction and show the resulting schedule with lock requests and releases. If not, briefly explain why.

The schedule can be realized. The highest isolation levels are RC for T1, RR for T2, and RR for T3. The resulting schedule, with lock requests and releases:

1: SLOCK1(A), READ1(A), REL1(A)
2: XLOCK3(A), READ3(A)
3: WRITE3(A), REL3(A)
4: XLOCK2(B), READ2(B)
5: WRITE2(B)
6: XLOCK2(A), WRITE2(A), REL2(A, B)
7: XLOCK1(B), WRITE1(B), REL1(B)

(T1 releases its read lock on A immediately because it runs at RC, which holds read locks only for the duration of the read.)
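The WAIT-DIE decision in part 2 follows a simple rule that can be sketched as code. This is an illustrative sketch, with start times read off the schedule (T1 starts at time 1, T3 at time 2, T2 at time 4), so T1 is oldest and highest priority.

```python
# WAIT-DIE: an older (higher-priority) requester waits for the lock holder;
# a younger requester is aborted ("dies") and later restarts.
start_time = {'T1': 1, 'T3': 2, 'T2': 4}  # smaller start time = older = higher priority

def wait_die(requester: str, holder: str) -> str:
    if start_time[requester] < start_time[holder]:
        return 'wait'  # older transaction is allowed to wait
    return 'die'       # younger requester is aborted

print(wait_die('T3', 'T1'))  # -> 'die': T3 is younger than T1, matching the answer
```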
Question 4 (10 points, B+ Trees)

1. (2 points) Is the following B+ tree valid? Justify your answer.

Invalid. The data entry 16 is less than the index entry 17 but is in the subtree to the right of it. This violates the requirement that p_i points to the subtree of data entries whose keys are >= k_i.

2. (2 points) Is the following B+ tree valid? Justify your answer.

Valid. Each node is at least half full, and the requirement that p_i points to the subtree of data entries whose keys are >= k_i and < k_{i+1} is met. The fact that there is an index entry with key 9 but no corresponding data entry does not make the tree invalid, as the data entry could have existed and been deleted without modifying the index entry.

3. (3 points) Consider the following B+ tree, in which the leaf pages (data entry pages) must contain a minimum of two data entries, and non-leaf pages (index entry pages) must contain a minimum of one key (and two pointers).
Show the tree after deleting 26*.

The tree decreases in height. Leaves L1 and L2 remain the same, leaf L3 disappears, and L4 becomes (27, 29, 30). Index nodes I2 and I3 disappear, and the root (I1) now contains (13, 27) with pointers to L1, L2, and L4.

4. (3 points) Consider again the same initial B+ tree. Show the tree after inserting 6* and 7*.

Add a new leaf between L1 and L2 (call it L1.5), modify L1 to contain (5, 6), and modify L1.5 to hold (7, 8). Modify I2 to hold (7, 13), where the pointer associated with 7 points to L1.5.
Question 5 (20 points, Query Processing)

Formula 1 (F1) is coming to Las Vegas this year. The world champion, Vettel, has decided to create a list of all drivers who are to race in F1, their personal information, and their winning stats throughout their career, to see his odds of winning again. His database contains tables with the following schema:

Driver(DriverId, Name, Height, Weight, Age, Nationality)
Wins(TrackName, CountryName, DriverId)
  FOREIGN KEY (DriverId) REFERENCES Driver(DriverId)

The Driver table contains personal information about all drivers who are to race in Las Vegas in the coming months. The primary key is DriverId (indicated in bold). The Wins table documents the wins by the drivers. TrackName refers to the name of the track they won on, and CountryName refers to the country the track was in.

1. (5 points) Write a relational algebra query which returns the names of all British drivers who have won on some track in the USA. The query must be in a format where the selections and projections are pushed down as far as possible to the base relations.

Π_{Name}((Π_{DriverId, Name} σ_{Nationality='British'} Driver) ⋈_{DriverId} (Π_{DriverId} σ_{CountryName='USA'} Wins))

2. (5 points) Consider the following query:

SELECT DriverId, Name
FROM Driver
WHERE Age > 20 AND Age <= 30
ORDER BY Name

Assume that Driver has 30,000 tuples and that 10 tuples can fit per page, so the relation fits on 3,000 pages. It is sorted on Name, over which there is an Alternative 1 index. Describe 1) how the query is evaluated, and 2) what the cost is.

1. Since the index does not match the query, Driver is scanned. When a match is found (Age > 20 and Age <= 30), the DriverId and Name fields are returned. Since the relation is ordered on Name, the result comes back ordered by Name. No duplicate removal is required. Therefore only one pass over Driver is necessary.
2. Cost: 3,000 I/Os.

Note: For the following questions, you are not allowed to use a calculator. If you cannot do the math by hand, you may provide unevaluated formulas.

3. (5 points) For the following, you can assume:
- Driver fits on 3,000 pages and is sorted on Name
- Wins fits on 100 pages and is not sorted
- There are 7 buffers

Vettel decides that he wants to create a joined version of Driver and Wins so that he can have a single table over which to perform data analysis:

Driver ⋈_{DriverId} Wins

Give the 1) best and 2) worst cost of doing this using a Block Nested Loops join.

With 7 buffers: 5 buffers for the outer block, 1 buffer for the inner, and 1 buffer for the output.

Worst (Driver as outer): 3,000 + (3,000/5) * 100 = 3,000 + 600 * 100 = 63,000 I/Os
Best (Wins as outer): 100 + (100/5) * 3,000 = 100 + 20 * 3,000 = 60,100 I/Os

4. (5 points) Recall:

Driver ⋈_{DriverId} Wins
- Driver fits on 3,000 pages and is sorted on Name
- Wins fits on 100 pages and is not sorted

How many buffers would be needed to implement the join using Hash Join? What is the cost using this minimum number of buffers?

The minimum number of buffers B for performing Hash Join satisfies 100 < (B-1)(B-2), so B = 12:
B = 12: 100 < 11 * 10 = 110
B = 11: 100 > 10 * 9 = 90

Cost: read each relation and create a partitioned version of each; then, using one scan over each partition, compute the join: 3(3,000 + 100) = 9,300 I/Os.
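The arithmetic in parts 3 and 4 can be double-checked with a short script. This is a sketch assuming the usual textbook cost formulas: Block Nested Loops with B buffers reads the outer relation once plus the inner relation once per outer block of B-2 pages, and (Grace) Hash Join needs (B-1)(B-2) to exceed the page count of the smaller relation.

```python
import math

def bnl_cost(outer: int, inner: int, buffers: int) -> int:
    """Block Nested Loops I/O cost: scan outer once, inner once per block."""
    block = buffers - 2  # 1 buffer reserved for inner, 1 for output
    return outer + math.ceil(outer / block) * inner

def hash_join_min_buffers(smaller: int) -> int:
    """Smallest B such that the build partitions fit: (B-1)(B-2) > pages."""
    b = 3
    while (b - 1) * (b - 2) < smaller:
        b += 1
    return b

print(bnl_cost(3000, 100, 7))      # -> 63000 I/Os with Driver as outer (worst)
print(bnl_cost(100, 3000, 7))      # -> 60100 I/Os with Wins as outer (best)
print(hash_join_min_buffers(100))  # -> 12 buffers
print(3 * (3000 + 100))            # -> 9300 I/Os for the hash join itself
```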
Question 6 (15 points, MongoDB)

Consider the following MongoDB collection of the NY Times bestseller list.

books = [
  { "_id": 1, "title": "Fourth Wing",
    "author": {"first": "Rebecca", "last": "Yarros"},
    "month": 5, "year": 2023,
    "publisher": "Entangled: Red Tower Books", "rating": 1,
    "formats": [{"type": "Kindle", "price": 14.99},
                {"type": "Audible", "credit": 1},
                {"type": "Hardcover", "price": 19.52},
                {"type": "Paperback", "price": 28.55}]
  },
  { "_id": 2, "title": "Too Late",
    "author": {"first": "Colleen", "last": "Hoover"},
    "month": 6, "year": 2023,
    "publisher": "Grand Central Publishing", "rating": 2,
    "formats": [{"type": "Kindle", "price": 12.99},
                {"type": "Paperback", "price": 12.94}]
  },
  { "_id": 4, "title": "It Ends with Us",
    "author": {"first": "Colleen", "last": "Hoover"},
    "month": 8, "year": 2016,
    "publisher": "Atria", "rating": 4,
    "formats": [{"type": "Kindle", "price": 11.99},
                {"type": "Hardcover", "price": 11.82}]
  },
  { "_id": 5, "title": "It Starts with Us",
    "author": {"first": "Colleen", "last": "Hoover"},
    "month": 10, "year": 2022,
    "publisher": "Atria", "rating": 5,
    "formats": [{"type": "Kindle", "price": 13.99},
                {"type": "Paperback", "price": 10.49}]
  }
]

1. (4 points) State in English what the following query does, and show what it returns using the given instance.

db.books.aggregate([
  { $match: {year: {$gte: 2020}}},
  { $group: { _id: "$author", count: { $sum: 1 } } },
  { $match: {count: {$gte: 2}}}
])

For each author who has published two or more books since 2020 (inclusive), print their name and the total number of books they have published since 2020.

[ { _id: { first: "Colleen", last: "Hoover" }, count: 2 } ]

2. (4 points) Write a query that returns the title of books that were published in 2023 and are available both in Kindle and hardcover (potentially among other formats). The output on the given instance would be:

[{title: "Fourth Wing"}]

db.books.find(
  {year: 2023, "formats.type": {$all: ["Kindle", "Hardcover"]}},
  {_id: 0, title: 1}
)

And potentially others…

3. (7 points) Write an aggregation query that, for each publisher, returns the average price of Kindle books that they publish. The expected output on the given instance is:

[
  { _id: "Entangled: Red Tower Books", avg_price: 14.99 },
  { _id: "Grand Central Publishing", avg_price: 12.99 },
  { _id: "Atria", avg_price: 12.99 }
]

db.books.aggregate([
  { $unwind: "$formats" },
  { $match: { "formats.type": "Kindle" } },
  { $group: { _id: "$publisher", avg_price: {$avg: "$formats.price"}}}
])

Question 7 (15 points, Neo4j)

Consider the following graph database, a picture of which is shown below:

CREATE (:User{name:"Alice"})-[:LAST_POST]->
  (:Post{topic:"science"})-[:PREV_POST]->(:Post{topic:"fitness"})
CREATE (:User{name:"Bob"})-[:LAST_POST]->(:Post{topic:"art"})
CREATE (:User{name:"Charles"})-[:LAST_POST]->
  (:Post{topic:"science"})-[:PREV_POST]->
  (:Post{topic:"art"})-[:PREV_POST]->(:Post{topic:"health"})

MATCH (p1:User{name:"Bob"}), (p2:User{name:"Alice"})
CREATE (p1)-[:KNOWS]->(p2)

MATCH (p1:User{name:"Alice"}), (p2:User{name:"Charles"})
CREATE (p1)-[:KNOWS]->(p2)

MATCH (p1:User{name:"Bob"}), (p2:User{name:"Charles"})
CREATE (p1)-[:KNOWS]->(p2)
Write Cypher queries for each of the following.

1. (5 points) Return the name of all people who know Charles and whose last post was about "science".

MATCH (p:User)-[:LAST_POST]->(q1:Post{topic:"science"}),
      (p)-[:KNOWS]->(:User{name:"Charles"})
RETURN p.name

2. (5 points) Return the name of all people whose last post was about science and some previous post was about health.

MATCH (p:User)-[:LAST_POST]->(q1:Post{topic:"science"})-[:PREV_POST*]->(q2:Post{topic:"health"})
RETURN p.name

3. (5 points) Return a list of topics of all last posts. Eliminate duplicates.

MATCH (p:User)-[:LAST_POST]->(q1:Post)
RETURN COLLECT(DISTINCT q1.topic)