Quiz: Final Exam

School: University of Cincinnati, Main Campus

Course: PROBABILIT

Subject: Computer Science

Date: Jan 9, 2024

Type: pdf

Pages: 9

Final Exam
Started: 27 Nov at 19:42
Quiz instructions

Question 1 (1 pt)
You run the following aggregation on a Delta table: df.count()
You go to the SQL tab and see the following: What does 'Local Table Scan' mean?
- It was able to cover the query from statistics in Metadata
- It didn't have to do a Shuffle/Exchange
- It pulled data from the Delta cache on the Worker's local SSD drive
- It had to read the data files to get the local/global count

Question 2 (1 pt)
You are joining a 125GB DataFrame to a 250GB DataFrame. You are concerned about the Shuffle Partition size being too large, so you do a SAMPLE and find the Shuffle Partitions are evenly distributed. To ensure the Shuffle Partition count is optimal, the best thing to do would be:
- Enable DPP
- Do nothing. The Catalyst Optimizer will take care of it
- Set the Shuffle Partition count manually to a larger number
- Ensure AQE is enabled

Question 3 (1 pt)
You execute the following query: df.repartition(10).write.partitionBy("DayOfWeek").parquet("/tmp/repart/")
How many Disk partitions will be created?
- 7
- The Driver will do a 'best effort' to create roughly 128MB files
- 10
- A number equal to the number of Cores on the Cluster

Question 4 (1 pt)
Which of the following Complex data types can be variable length? Select all that apply.
- All of the above
- Map
- Structs
- Array
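The sketch below revisits the write from Question 3. It is a minimal PySpark example with a made-up DataFrame and column values; the point it illustrates is that partitionBy() drives the on-disk directory (Disk partition) layout, while repartition(10) only changes the in-memory partitioning.

    # Minimal sketch of the Question 3 write; the rows and values are made up.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "Mon"), (2, "Tue"), (3, "Mon"), (4, "Wed")],
        ["id", "DayOfWeek"],
    )

    # partitionBy("DayOfWeek") creates one directory per distinct DayOfWeek value;
    # repartition(10) only sets the number of in-memory partitions, which affects
    # how many files land inside each directory, not how many directories exist.
    df.repartition(10).write.mode("overwrite").partitionBy("DayOfWeek").parquet("/tmp/repart/")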
Question 5 (1 pt)
You write a DataFrame to a file but want only 1 file created. Which command below would best accomplish this?
- copsDF.write.format("parquet").save("/opt/cops_dir/", partition = 1)
- copsDF.coalesce(1).write.format("parquet").save("/opt/cops_dir/")
- copsDF.repartition(1).write.format("parquet").save("/opt/cops_dir/")
- copsDF.write.format("parquet").save("/opt/cops_dir/", 1)

Question 6 (1 pt)
You load a large Delta file into a DataFrame as shown (see below). In Line 2, where is the count being read from?
- From the Delta cache
- The Delta file statistics
- From the Spark cache
- From the 'Shuffle Write' file(s)

Question 7 (1 pt)
Pick all valid places to see the number of Cores in your Cluster.
- Type 'sc.defaultParallelism'
- Hovering over the Cluster name on any Notebook
- Spark UI -> Executors tab
- Spark UI -> Environment tab

Question 8 (1 pt)
The '_delta_log' directory hosts: (Select all that are true)
- Parquet files
- Bloom Filter Indexes
- crc files
- JSON files

Question 9 (1 pt)
You execute a Spark job with an Inner Join and notice it did a SortMergeJoin. Pick one True statement concerning this query.
- The Shuffle for each Join key was a Many-to-1 partition event
- The Shuffle for each Join key was a 1-to-Many partition event
- The Shuffle for each Join key was a 1-to-1 partition event
- The Shuffle for each Join key was a Many-to-Many partition event

Question 10 (1 pt)
You've got 100GB of RAM across all your Worker nodes. What's the maximum amount of RAM you have for executing a query?
- 75GB
- 100GB
- 50GB
- 90GB
- 60GB
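Question 5's single-file write is easy to sketch. Assuming copsDF is an existing DataFrame (as in the answer choices), collapsing the data to one in-memory partition before the write makes the writer emit a single part-file:

    # copsDF is assumed to be an existing DataFrame, as in Question 5.
    # coalesce(1) is a narrow operation (no full shuffle) that leaves a single
    # in-memory partition, so the parquet writer produces one part-file.
    copsDF.coalesce(1).write.format("parquet").save("/opt/cops_dir/")

    # repartition(1) would also yield one file, but it forces a full shuffle first.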
Question 11 (1 pt)
You create a Table, View, TempView and GlobalTempView on your Cluster. You then issue a 'SHOW TABLES' command. Which objects are visible?
- Table only
- Table, View and TempView only
- All four objects (Table, View, TempView, GlobalTempView)
- Table and View only

Question 12 (1 pt)
Select all wide events that cause a Shuffle.
- union
- repartition
- dropDuplicates
- sort
- distinct

Question 13 (1 pt)
Your boss wants the Streaming job to aggregate Sales every hour. What is the best outputMode to use?
- Append
- Complete
- Total
- Update

Question 14 (1 pt)
You create a TempView named 'tv_dept'. Which query below will work?
- spark.sql("tv_dept").show()
- spark.view("tv_dept").show()
- spark.tempView("tv_dept").show()
- None of the above queries are valid
- spark.table("tv_dept").show()

Question 15 (1 pt)
You issue a Spark query. What is the state of the RAM contents after the query returns the answer set?
- The result set will stay in RAM until a default timeout occurs
- New queries will evict older result sets from RAM on an 'as needed' basis
- The RAM contents are flushed after the output of the query
- Spark uses an LRU algorithm to determine which queries to flush when RAM reaches an 80% full threshold
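Here is a quick sketch of the TempView round trip from Question 14, using a made-up dept DataFrame: spark.table() resolves the registered view by name, and the plain SQL route works as well.

    # Made-up DataFrame registered under the view name from Question 14.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    dept_df = spark.createDataFrame([(10, "IT"), (20, "HR")], ["dept_id", "dept_name"])
    dept_df.createOrReplaceTempView("tv_dept")

    spark.table("tv_dept").show()                  # resolves the temp view by name
    spark.sql("SELECT * FROM tv_dept").show()      # querying it through SQL also works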
Question 16 (1 pt)
From the Spark UI, you notice the 'Storage' tab has this entry (see below). Select all answers that are true.
- You cannot tell if the cache() or persist() function was used
- The Storage size would have been greater if it had used Serialized
- A portion of this object was evicted since only a fraction is cached
- You cannot tell if this was Python- or Scala-initiated

Question 17 (1 pt)
You issue a 'createOrReplaceTempView' statement. Which statement below is true?
- The TempView can only be viewed and queried in the session in which it was created
- Any other Spark client can view the TempView via spark.sql("show tables").show() but will not be able to query it
- If logged into another application, a Hive client can both view and query the TempView
- If you create a new session in the Databricks UI, you can query the TempView

Question 18 (1 pt)
You start your Spark Streaming job but forget to define a Trigger. What will be the default behavior?
- The Job will spawn, but without a Trigger, no data will stream
- The Job will run without a Trigger only if 'writeStream.format' is set to 'console'
- The Job will run in near real-time as data streams in
- An ERROR message will be displayed as the Job is unable to spawn

Question 19 (1 pt)
Pick all True statements concerning the MAP data type.
- MAP data types can be unordered
- Key-value pairs are considered MAP data types
- The Key and the Value can be different data types
- The 'explode' function will create a new row for every Key-Value pair in a MAP

Question 20 (1 pt)
You suspect the Memory Partition size is too small for your query. Which Spark UI tab confirms this?
- Stages tab
- Environment tab
- Job tab
- Executors tab

Question 21 (1 pt)
While in Scala, you run the following statements without error:
val df1 = spark.table("emp")
df1.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
Pick 2 True statements below.
- The DataFrame is not actually cached at this point
- The PERSIST statement is equivalent to the df1.cache() statement
- The PERSIST statement has a Storage level = Serialized (as opposed to Deserialized-Raw)
- The Persist statement saves 2 replicas by default
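For Question 21, here is the PySpark analogue of those Scala statements (the 'emp' table is assumed to already exist). The behavior it illustrates is that persist() and cache() are lazy: nothing is materialized until an action runs.

    # PySpark analogue of the Scala snippet in Question 21; 'emp' is assumed
    # to be an existing table. persist()/cache() only mark the plan for caching;
    # an action is what actually populates the cache.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.table("emp")
    df1.persist(StorageLevel.MEMORY_ONLY)   # lazy: no data cached yet
    df1.count()                             # action: the cache is filled here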
Question 22 (1 pt)
Delta is the preferred file format of Spark for these reasons. Select all answers that are true.
- The Columnar format allows for less Disk I/O when projecting a minority of Columns in a query
- It is possible to answer a query using Delta metadata files exclusively
- A human-readable file format means text editors can read the data if needed
- Delta supports compression, which can allow for speedier queries

Question 23 (1 pt)
You wish to see if Predicate Pushdown really worked. Where would you go to confirm this in the Spark UI?
- Jobs tab -> Event Timeline
- SQL tab -> Details section
- Stages tab -> Summary Metrics
- Executors tab -> Heap Histogram

Question 24 (1 pt)
You read in a DataFrame and notice it consists of five 2GB Memory Partitions. Pick the 2 most likely causes why this size was not the default.
- 'maxPartitionBytes' set to a larger value than the default
- The file format may be unsplittable
- Shuffle Partitions set to a small number, so you get larger Partitions than normal
- The .csv files from disk consisted of 5 files of approximately 2GB each
- 'openCostInBytes' set to a higher value than the default

Question 25 (1 pt)
You have 5 Worker nodes, and each Worker node has 1 Executor and 4 Cores. You then do a count on a 100GB DataFrame. Select all answers that are true.
- In all likelihood, one Core will be involved in the Global count
- In all likelihood, all Cores will be involved in the Local count and the Driver will do the Global count
- In all likelihood, all Cores will be involved in the Local count

Question 26 (1 pt)
Pick all events which are eagerly evaluated.
- count() on a DataFrame
- Schema check
- summary() on a DataFrame
- Caching an SQL table

Question 27 (1 pt)
The 'emp' SQL table resides on HDFS (Hadoop Distributed File System) and consists of 4 128MB data blocks. You have 6 CPUs in your Spark cluster and 12 Executors (each with 1 Core). You code the below to load the table into a DataFrame:
df1 = spark.table("emp")
By default, how many partitions are in the 'df1' DataFrame?
- 200
- 6
- 12
- 4
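Question 27 can be checked empirically. Below is a minimal sketch, assuming the 'emp' table from the question is reachable from your session: it loads the table and prints how many memory partitions Spark actually created.

    # Assumes the 'emp' table from Question 27 exists in the metastore.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.table("emp")
    print(df1.rdd.getNumPartitions())   # number of memory partitions Spark created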
Question 28 (1 pt)
You run a Streaming query that is aggregating Sales data by region every minute into a bar chart. The data in the stream is averaging about 100 files/minute and 400MB/minute. You notice performance is slow. What can you do to speed up performance?
- Decrease Shuffle Partitions to match the number of Cores
- repartition() the data prior to the aggregation
- Set Shuffle Partitions to match the number of Files
- Assign a hard-coded schema

Question 29 (1 pt)
You issue the following 3 statements in your query:
df = spark.read.csv("/FileStore/stuff/")
df.cache()
df.first()
What happens?
- One partition will be cached in RAM
- One row will be cached in RAM
- All partitions will be cached in RAM
- Nothing happens. The item is not cached unless it is followed by an action

Question 30 (1 pt)
A 'Broadcast' hint has the following characteristic:
- Stores the entire DataFrame in each Executor
- Stores the entire DataFrame in each Partition
- Stores the entire DataFrame in the Driver
- Stores the entire DataFrame in each Core of an Executor

Question 31 (1 pt)
From a Spark notebook, valid ways to influence the number of files written to disk. Select all answers that are true.
- Change the 'spark.sql.files.maxPartitionBytes' and 'openCostInBytes' configuration
- 'repartition()' or 'coalesce()' followed by a write()
- The 'optimize' command on an SQL table
- The Auto Optimize configuration on an SQL table

Question 32 (1 pt)
The difference between an ARRAY and a STRUCT is/are: Select all that are true.
- Arrays can be fixed in number of elements but Structs can be variable or fixed
- Arrays can use the 'explode' function but Structs cannot
- Arrays use the same data type and Structs can use different data types
- Arrays use Index Numbers and Structs use sub-columns to pluck out children

Question 33 (1 pt)
Select all True statements about Dynamic Partition Pruning (DPP) when doing a Join.
- Must have at least one Partitioned table/DataFrame
- Must have a WHERE/FILTER clause in at least one of the tables
- DPP may invoke a BroadcastHashJoin
- Can confirm the number of Partitions read in the SQL tab
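A small PySpark sketch of the Broadcast hint from Question 30, using made-up fact and dimension DataFrames: the small table is shipped whole to every Executor, and the physical plan should show a BroadcastHashJoin.

    # Made-up DataFrames illustrating the Broadcast hint from Question 30.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    large_df = spark.range(10_000_000).withColumnRenamed("id", "cust_id")   # big 'fact' side
    small_df = spark.createDataFrame([(0, "bronze"), (1, "gold")], ["cust_id", "tier"])

    # broadcast() ships the small DataFrame in full to every Executor, so the
    # large side does not need to be shuffled for the join.
    joined = large_df.join(broadcast(small_df), "cust_id")
    joined.explain()   # the physical plan should show BroadcastHashJoin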
Question 34 (1 pt)
You have the 5 Partitions below (see graphic). Which Partitions will AQE split into smaller ones due to Join Skew (assume all defaults are in place)?
- Partition 1 and 2
- Partition 1
- All 5 Partitions
- Partition 1, 2 and 3
- None of the Partitions

Question 35 (1 pt)
Concerning Disk Partitioning, select all true statements.
- Partitioning column values are not in the data files, but rather in the Directory structure
- It is considered a 'Later' Pushed filter
- It is file independent. Works with CSV, JSON, Delta, Parquet, etc.
- It can be slow to query if you don't have a WHERE clause with the Partitioning column

Question 36 (1 pt)
The goal of AQE's Coalesce Partitions is:
- Partitions are a multiple of the number of Executors
- Partitions that are divisible by the number of Cores in the cluster
- Makes Partitions a minimal size threshold
- Give the least amount of Iterations with the best performance

Question 37 (1 pt)
The advantage of joining 2 Bucketed tables is:
- Spark will choose a more performant Join strategy
- Like Buckets (Join column values) are guaranteed to be on the same Core
- Join columns are pre-sorted by default, so this step is skipped
- Only have to Shuffle 1 Table

Question 38 (1 pt)
You create a UDF since you cannot find any built-in higher-order Spark functions to do your custom transformation. Which of the below will typically be the best UDF performer?
- Pandas UDF
- Scala UDF
- forEach() function in either language using a UDF
- Python UDF
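Question 38 contrasts UDF types. The hedged sketch below shows the Pandas (vectorized) UDF option with a made-up doubling function; it processes whole pandas Series batches rather than one row at a time, which is why it usually beats a plain Python UDF (pandas and pyarrow must be installed).

    # Made-up Pandas UDF illustrating the vectorized option in Question 38.
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()

    @pandas_udf(LongType())
    def double_it(values: pd.Series) -> pd.Series:
        # operates on a whole batch (pandas Series) at once
        return values * 2

    spark.range(5).select(double_it("id").alias("doubled")).show()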
Question 39 (1 pt)
The advantage of AQE's Broadcast Hash Join is: Select all answers that are true.
- You no longer have to drop a Broadcast hint
- It avoids sorting the join column on both Tables
- Only the smaller table is Shuffled
- You don't have to Shuffle the larger table

Question 40 (1 pt)
Predicate Pushdown examples include: Select all answers that are True.
- Querying a Partitioned table that has a Partitioning column in the SELECT clause
- Using a Z-Order or Bloom Filter column in a WHERE clause
- Using a WHERE clause to load only the data needed into the Executor's RAM
- Using the Delta file format to skip data

Question 41 (1 pt)
Disadvantages of Python UDFs include: Select all answers that are True.
- More Scheduling pressure on the Driver
- Inter-Process Communication overhead
- Row-at-a-time processing if not using Pandas
- Black-box operations that make Optimization difficult
- Increased Java Garbage Collection

Question 42 (1 pt)
You are working as a consultant for the Social Security Administration. You want the fastest possible performance when querying an individual record. What technique would you suggest?
- Salt the SS number
- Define the SS number as a Primary Index
- Define the SS number as a Bloom filter
- Define the SS number as a ZOrder column

Question 43 (1 pt)
You see the following in the Details of the SQL tab (see below). This means Predicate Pushdown was:
- Best effort only and not guaranteed to have occurred
- Completed on a Partitioned table
- Successful on an RDBMS source
- Accomplished via a Late Filter
- Successful on Cloud object storage

Question 44 (1 pt)
On small datasets, the Driver may forgo the default Partition size and instead: Select all answers that are True.
- Match # of files = # of Partitions
- Create 1 Partition
- Match # of Cores = # of Partitions
- Match # of Executors = # of Partitions
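Two of the options in Question 42 refer to Delta data-skipping features. Purely as an illustration (the table and column names are invented, and this assumes a Databricks / Delta Lake environment where the OPTIMIZE ... ZORDER BY command is available), Z-Ordering on the lookup column would look roughly like this:

    # Hypothetical Delta table and column names; assumes Databricks / Delta Lake
    # where OPTIMIZE ... ZORDER BY is supported.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Cluster the table's data files by SSN so point lookups can skip most files.
    spark.sql("OPTIMIZE ssa_records ZORDER BY (ssn)")

    # A single-record lookup can then prune files using the collected statistics.
    spark.sql("SELECT * FROM ssa_records WHERE ssn = '123-45-6789'").show()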
Question 45 (1 pt)
You notice in the Spark UI that the SQL tab has a number of entries with no Job IDs. What are these events?
- Keep-alives letting the Driver know Executors are still active and accepting Tasks
- Watchdog events
- Housekeeping carried out by Executors
- Driver events such as Schema lookups