Laboratory 4 WINTER
Conestoga College, Information Systems 2000
Apr 3, 2024
Laboratory Exercise 5 - HIVE

In this lab you will build on the explanations provided during the Hive theory class and experience the MapReduce process through Hive.

Exercise 1 - Run the queries below to obtain the maximum price of stocks

IMPORTANT: yourfirstname is not the name of the table. You must use your own first name for this exercise.

1. Create a table in Hive:

CREATE EXTERNAL TABLE IF NOT EXISTS yourfirstname (
    exch STRING,
    symbol STRING,
    ymd STRING,
    price_open FLOAT,
    price_high FLOAT,
    price_low FLOAT,
    price_close FLOAT,
    volume INT,
    price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hirw/input/stocks';

2. Verify that the table was created:

SHOW TABLES;
3. Select rows with a limit of 25:

SELECT * FROM yourfirstname LIMIT 25;

Hive is a data-warehouse infrastructure built on top of Hadoop; tables in Hive are typically given descriptive names that correspond to the data they store. The SELECT * query retrieves all columns from the table, and the LIMIT 25 clause restricts the output to the first 25 rows.
Capture one screenshot once the job runs - Screenshot 1

4. Create a query that displays the maximum closing price of each stock, grouped by symbol:

SELECT symbol, MAX(price_close) max_close
FROM yourfirstname
GROUP BY symbol;

This query retrieves data from the yourfirstname table in Hive. For every distinct value in the symbol column, it computes the maximum closing price (MAX(price_close)). The GROUP BY clause groups the result set by the symbol column, producing a succinct summary of the highest closing price for each unique symbol. Queries of this kind are helpful when retrieving summary statistics for different symbols in financial or similar datasets.

Capture one screenshot once the job runs - Screenshot 2 (PLEASE RETAIN THE SCREENSHOT EVEN IF AN ERROR OCCURS)
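The GROUP BY aggregation above can be previewed locally before running it as a Hive job. A minimal sketch using Python's sqlite3 (not Hive, but the SQL shape is the same; the symbols and prices are made-up sample data):

```python
import sqlite3

# SQLite stand-in for the Hive table (Hive would run this as a MapReduce job).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (symbol TEXT, price_close REAL)")
conn.executemany("INSERT INTO stocks VALUES (?, ?)",
                 [("AAPL", 10.0), ("AAPL", 12.5), ("IBM", 8.0), ("IBM", 7.5)])

# One row per symbol, carrying that symbol's maximum closing price.
result = dict(conn.execute(
    "SELECT symbol, MAX(price_close) AS max_close FROM stocks GROUP BY symbol"))
print(result)  # {'AAPL': 12.5, 'IBM': 8.0}
```

Each distinct symbol contributes exactly one output row, which is why the aggregate and the GROUP BY column must appear together in the SELECT list.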
Exercise 2 - Elaborate some query modifications

1. Based on Step 4 of Exercise 1, modify the query to obtain the minimum closing price:

SELECT symbol, MIN(price_close) min_close
FROM yourfirstname
GROUP BY symbol;

This Hive query also reads from the yourfirstname table, but determines the minimum closing price (MIN(price_close)) for each distinct value in the symbol column. The GROUP BY clause groups the rows by the unique values in symbol, so the result is a clear summary of the minimum closing prices for the various symbols. This makes the query especially helpful for summarising and evaluating financial or time-series data. Hive is a Hadoop-based data-warehousing system that runs SQL-like queries as distributed jobs over large datasets.

Capture one screenshot once the job runs - Screenshot 3
2. Delete the table you created for Exercises 1 and 2:

DROP TABLE yourfirstname;

Exercise 3 - Full Case Study

Access the Hive terminal and follow the instructions for each case study below.

Case Study: Sales Data Analysis

In this lab you will work with a dataset containing sales data for a company. The dataset includes information such as the transaction ID, product name, quantity sold, and sales amount. Your goal is to perform basic data-querying tasks using Apache Hive to gain insights from the sales data.

Scenario: You have been provided with a dataset that contains sales data for a company, including the transaction ID, product name, quantity sold, and sales amount for each transaction. The company wants to analyze the sales data to identify the top-selling products, calculate the total sales amount, and find the average quantity sold.

Important >> The database name should be unique, otherwise this is not going to work!

Lab Steps:

1. Create a database: start by creating a new database named yourfirstname to store the sales data.

Answer:
CREATE DATABASE IF NOT EXISTS yourfirstname;

2. Switch to the database with the USE command:

Answer:
USE yourfirstname;
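The CREATE DATABASE / USE pair has a loose local analogue in SQLite, where a database is simply a file and "using" it means connecting to it. A sketch, purely to illustrate the idea (the file name is a hypothetical stand-in for yourfirstname):

```python
import os
import sqlite3
import tempfile

# In SQLite a database is a file; connecting both creates it if missing
# (like CREATE DATABASE IF NOT EXISTS) and makes it current (like USE).
path = os.path.join(tempfile.mkdtemp(), "yourfirstname.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (x INTEGER)")  # tables now land in this database
conn.close()
print(os.path.exists(path))  # True
```

In Hive, by contrast, a database is a namespace in the metastore backed by an HDFS directory, which is why the lab insists the name must be unique.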
3. Create a table: create a table named sales with appropriate columns to hold the sales data.

Answer:
CREATE TABLE IF NOT EXISTS sales (
    transaction_id INT,
    product_name STRING,
    quantity_sold INT,
    sales_amount FLOAT
);

4. Insert data: insert the provided sales records into the table to populate it.

Answer:
INSERT INTO sales VALUES
(1, 'Product A', 10, 100.0),
(2, 'Product B', 5, 75.0),
(3, 'Product C', 8, 120.0),
(4, 'Product A', 15, 200.0),
(5, 'Product D', 3, 50.0),
(6, 'Product C', 12, 180.0),
(7, 'Product B', 6, 90.0),
(8, 'Product A', 9, 150.0),
(9, 'Product D', 4, 60.0),
(10, 'Product B', 7, 105.0);

5. Basic data exploration: perform basic queries to explore the sales data, such as selecting all records, selecting specific columns, and filtering records based on certain conditions.

Answers:

Select all records:
SELECT * FROM sales;
This query retrieves every record from the sales table, giving a full view of the dataset with all available columns.

Select specific columns (transaction_id and sales_amount):
SELECT transaction_id, sales_amount FROM sales;
This query returns only the transaction_id and sales_amount columns from the sales table, offering a narrowed view that highlights those two attributes.

Filter records where quantity_sold is greater than 5:
SELECT * FROM sales WHERE quantity_sold > 5;
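The create, insert, and filter steps above can be rehearsed locally. A minimal sketch with sqlite3 using the same ten records (SQLite type names differ slightly from Hive's STRING/FLOAT):

```python
import sqlite3

# Local stand-in for the Hive sales table, loaded with the lab's records.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    transaction_id INTEGER, product_name TEXT,
    quantity_sold INTEGER, sales_amount REAL)""")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    (1, 'Product A', 10, 100.0), (2, 'Product B', 5, 75.0),
    (3, 'Product C', 8, 120.0), (4, 'Product A', 15, 200.0),
    (5, 'Product D', 3, 50.0), (6, 'Product C', 12, 180.0),
    (7, 'Product B', 6, 90.0), (8, 'Product A', 9, 150.0),
    (9, 'Product D', 4, 60.0), (10, 'Product B', 7, 105.0)])

# Filter: transactions where more than 5 units were sold.
high_volume = [r[0] for r in conn.execute(
    "SELECT transaction_id FROM sales WHERE quantity_sold > 5")]
print(sorted(high_volume))  # [1, 3, 4, 6, 7, 8, 10]
```

Seven of the ten transactions pass the quantity_sold > 5 filter, which matches what the Hive job should return on this data.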
This query returns only the rows of the sales table whose quantity_sold value is greater than 5, which helps isolate and examine transactions with a larger quantity sold.

Capture one screenshot once the job runs - Screenshot 4
6. Sorting data: sort the sales data in ascending or descending order based on a specific column.

Answers:

Sort by sales_amount in descending order:
SELECT * FROM sales ORDER BY sales_amount DESC;
This query retrieves all rows from the sales table and sorts them by the sales_amount column in descending order, making it easy to review sales from the highest amount to the lowest.
Sort by quantity_sold in ascending order:
SELECT * FROM sales ORDER BY quantity_sold ASC;
This query selects all rows from the sales table and sorts them by the quantity_sold column in ascending order, providing an ordered view of the dataset by increasing quantities sold.
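Both ORDER BY variants can be checked on a small sample. A sketch with sqlite3, using three hypothetical rows so the expected orderings are easy to verify by eye:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    transaction_id INTEGER, quantity_sold INTEGER, sales_amount REAL)""")
conn.executemany("INSERT INTO sales VALUES (?,?,?)",
                 [(1, 10, 100.0), (2, 5, 75.0), (3, 8, 120.0)])

# Highest sales amount first (DESC) ...
by_amount = [r[0] for r in conn.execute(
    "SELECT transaction_id FROM sales ORDER BY sales_amount DESC")]
# ... and smallest quantity first (ASC).
by_quantity = [r[0] for r in conn.execute(
    "SELECT transaction_id FROM sales ORDER BY quantity_sold ASC")]
print(by_amount, by_quantity)  # [3, 1, 2] [2, 3, 1]
```

Note that in Hive, ORDER BY forces a single reducer to produce a total order; SORT BY only orders within each reducer, which matters for large datasets.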
Capture one screenshot once the job runs - Screenshot 5

7. Aggregating data: use aggregate functions like COUNT, SUM, MIN, and MAX to calculate various metrics on the sales data.

Answers:

Calculate the total sales amount:
SELECT SUM(sales_amount) AS total_sales_amount FROM sales;
This query adds up the values in the sales_amount column of the sales table. The result, aliased as total_sales_amount, is the total sales amount across the dataset.

Calculate the average quantity sold:
SELECT AVG(quantity_sold) AS average_quantity_sold FROM sales;
This query computes the mean of the values in the quantity_sold column of the sales table. The result, aliased as average_quantity_sold, shows the average quantity sold per transaction in the dataset.
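On the ten records inserted in Step 4, the two aggregates have exact values that are worth knowing before you run the job: the sales amounts sum to 1130.0, and the quantities (79 units over 10 transactions) average 7.9. A sqlite3 sketch confirming the arithmetic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quantity_sold INTEGER, sales_amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    (10, 100.0), (5, 75.0), (8, 120.0), (15, 200.0), (3, 50.0),
    (12, 180.0), (6, 90.0), (9, 150.0), (4, 60.0), (7, 105.0)])

# SUM over sales_amount and AVG over quantity_sold, in one pass.
total, avg_qty = conn.execute(
    "SELECT SUM(sales_amount), AVG(quantity_sold) FROM sales").fetchone()
print(total, avg_qty)  # 1130.0 7.9
```

If your Hive job returns different numbers, the INSERT in Step 4 most likely did not load all ten rows.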
Capture one screenshot once the job runs - Screenshot 6

8. Truncate table: delete all the data from the sales table while keeping the table structure intact.

Answer:
TRUNCATE TABLE sales;
In Hive, the TRUNCATE TABLE statement removes all of the contents of the sales table while preserving its structure. For large datasets it can perform better than deleting rows with a DELETE statement, because it simply clears out the table's data.

9. Dropping a table: drop the sales table to clean up the database.

Answer:
DROP TABLE sales;
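The difference between truncating and dropping can be demonstrated locally. SQLite has no TRUNCATE statement, so this sketch uses DELETE without a WHERE clause as the closest analogue (rows go, table stays), then DROP TABLE to remove the table itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (transaction_id INTEGER)")
conn.execute("INSERT INTO sales VALUES (1)")

# Analogue of TRUNCATE TABLE sales; -- data gone, structure intact.
conn.execute("DELETE FROM sales")
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

# DROP TABLE sales; -- the table definition itself is removed.
conn.execute("DROP TABLE sales")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(remaining, tables)  # 0 []
```

After the DELETE the table still exists and can be queried (returning zero rows); after the DROP any query against sales would raise an error.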
The DROP TABLE command permanently removes the sales table from the database; both the table structure and the data it contained are deleted. This is an essential step in decluttering the database and reclaiming disk space.

10. Delete the database.

Answer:
DROP DATABASE yourfirstname CASCADE;
In Hive, the DROP DATABASE statement deletes an entire database. The CASCADE keyword removes the database together with all of its tables. This step permanently deletes the database and its contents, freeing resources in the Hive environment. Use CASCADE with caution, because it drops every table in the database.

*** Important: You should retain the seven screenshots in a single Word file. Make sure the following happens:
1. All 7 screenshots must be in a single Word file;
2. Each screenshot must have a short description;
3. Screenshots must follow the execution order above;
4. Put only one screenshot on each page, alongside its description;
5. Insert page numbers;
6. All database names must follow the script provided above (yourfirstname means you must use your own first name as the name);
7. If you submit and the name does not correspond to your first name, the exercise will not be considered for grading.