Objective: Experimentation with a Secondary Index over a non-ordering, non-key attribute and Aggregate Query Execution Planning.

Assume the relation CITIZEN(ID, Tax-Code, Salary, Age) storing information about UK citizens’ tax codes and
annual salaries. There are r = 60,000,000 records. Each attribute has the same size: 128 bytes. The relation is
stored in a file sorted by the salary attribute. The block size is B = 1024 bytes and any pointer in the system
has a size of 128 bytes. The salary attribute is assumed to be uniformly distributed across tuples. There are 6,000
distinct salary values and there are 10,000 different tax-code values. The tax-code and age are assumed to be
statistically independent of salary. A data scientist, who has built a Secondary Index over the Tax Code (non-
ordering, non-key attribute), is interested in the specific Tax Code '1234L'. The data scientist analysed the
distribution of the Tax Code attribute and noted the following:
• Let X be the number of citizens with Tax Code 1234L per data block. That is, given a random block, there
are X citizens with Tax Code 1234L in it.
• P(X ≥ 1) = 0.5, i.e., the probability that a block contains at least one citizen with Tax Code 1234L is 50%.
In other words, when we pick a block at random, the probability of finding at least one citizen with
Tax Code 1234L in it is 0.5.
If there are b data blocks in the file and we are asked to retrieve those citizens with Tax Code 1234L, then
ideally, we expect to access b/2 blocks (ideal case). However, in reality, we do not know where these blocks
are! If we use the naïve solution (scan the whole file) to retrieve all those citizens, then we need to access b
blocks.
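
To make these costs concrete, the figures given above can be turned into numbers. The short Python sketch below derives the record size, the blocking factor and b from the exercise's parameters; the variable names are illustrative only.

import math

# Figures taken directly from the exercise statement
r = 60_000_000            # number of CITIZEN records
attribute_size = 128      # bytes per attribute (ID, Tax-Code, Salary, Age)
num_attributes = 4
B = 1024                  # block size in bytes

record_size = num_attributes * attribute_size   # 4 * 128 = 512 bytes per record
bfr = B // record_size                          # blocking factor: 2 records per block
b = math.ceil(r / bfr)                          # 30,000,000 data blocks

ideal_cost = b / 2   # blocks expected to hold at least one match, since P(X >= 1) = 0.5
naive_cost = b       # full file scan

print(f"record size = {record_size} B, bfr = {bfr}, b = {b:,} blocks")
print(f"ideal cost = {ideal_cost:,.0f} block accesses; naive cost = {naive_cost:,} block accesses")

With these figures, the ideal case corresponds to about 15,000,000 block accesses and the naive scan to 30,000,000.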
Q2.1 The data scientist claims that the expected cost using the Secondary Index should be between
b/2 and b. What is the expected cost of retrieving the citizens with Tax Code 1234L using the Secondary Index?
How much bigger is this cost compared to the ideal case?
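
One way to build intuition for why the index-based cost lands between b/2 and b is a small simulation. The sketch below assumes, purely for illustration, that each of the bfr = 2 records in a block matches Tax Code 1234L independently with a probability q chosen so that P(X ≥ 1) = 0.5, and that every index pointer is followed with one data-block access (the cost of reading the index blocks themselves is ignored here); neither modelling choice is given in the exercise.

import random

bfr = 2                          # records per block (derived above)
q = 1 - 0.5 ** (1 / bfr)         # per-record match probability so that P(X >= 1) = 0.5
b = 100_000                      # scaled-down block count, for simulation speed only

random.seed(42)
matches_per_block = [sum(random.random() < q for _ in range(bfr)) for _ in range(b)]

ideal_cost = sum(1 for x in matches_per_block if x >= 1)   # distinct blocks holding matches
index_cost = sum(matches_per_block)                        # one data-block access per pointer
naive_cost = b                                             # full file scan

print(f"ideal : {ideal_cost / b:.3f} * b")
print(f"index : {index_cost / b:.3f} * b")
print(f"naive : {naive_cost / b:.3f} * b")

Under this assumed model the index-based figure comes out at roughly 0.59·b, i.e., between the ideal b/2 and the naive b, which is consistent with the data scientist's claim.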
Q2.2 The data scientist is asked for a query processing plan for the aggregation query:
SELECT Salary, AVG(Age)
FROM CITIZEN
WHERE TaxCode = ‘1234L’
GROUP BY Salary
If the data management system devotes 100,000 blocks of RAM (approx. 102 MB, each block of 1,024 bytes) to
executing the query, and 5,000 blocks to storing the results of the query, help the scientist by providing a
query execution plan. Describe your plan (e.g., steps, methods, ideas) and report the corresponding expected
number of block accesses for your proposed solution. How much memory would you need to store the result of
the aggregate query?
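
One plausible plan (a sketch, not the only correct answer): use the secondary index to collect the record pointers for Tax Code '1234L', fetch the matching records, and aggregate them in memory with a hash table keyed on Salary, keeping a running sum and count of Age per group; since there are at most 6,000 distinct Salary values, the partial aggregates fit easily within the 100,000 RAM blocks. The Python sketch below shows only the grouping/averaging step on toy data; the function name and the (salary, age) input format are illustrative, not part of the exercise.

from collections import defaultdict

def average_age_by_salary(matching_records):
    """Hash aggregation: matching_records is an iterable of (salary, age) pairs
    for the citizens whose Tax Code is '1234L'."""
    sums = defaultdict(lambda: [0, 0])        # salary -> [sum of ages, count]
    for salary, age in matching_records:
        acc = sums[salary]
        acc[0] += age
        acc[1] += 1
    # At most 6,000 distinct salaries, so this table easily fits in memory.
    return {salary: s / n for salary, (s, n) in sums.items()}

# Toy data, purely for illustration
rows = [(30_000, 40), (30_000, 50), (45_000, 33)]
print(average_age_by_salary(rows))            # {30000: 45.0, 45000: 33.0}

For the result size: assuming each output row stores one 128-byte Salary value plus a 128-byte average, the at most 6,000 result rows occupy about 6,000 × 256 B ≈ 1.5 MB, i.e., roughly 1,500 blocks of 1,024 bytes, comfortably within the 5,000 blocks reserved for the result.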