MIS 6346 BIG DATA
Assignment 2 - Nov. 6, 2023
Shalini Vijayaraghavan / SXV220089
Instructor: Dr. Jerry F. Perez
Table of Contents

Lab Chapter 6: Create and Populate Tables in Impala or Hive
    Create and Query a Table in Impala or Hive
    Use Sqoop to Import Directly into Hive or Impala
Lab Chapter 7: Select a Format for a Data File
Lab Chapter 8: Partition Data in Impala or Hive
Lab Chapter 9: Collect Web Server Logs with Flume
    Create an HDFS Directory for Flume to save ingested data
    Create a local directory for web server log output
    Configure Flume
    Run the Agent
    Simulate Apache web server output
Lab Chapter 6: Create and Populate Tables in Impala or Hive

Create and Query a Table in Impala or Hive

1. I decided to use Impala, so I started the required services by typing the following commands in the terminal window:

    $ sudo service zookeeper-server start
    $ sudo service hive-server2 start

2. The Hue page was then opened in Chrome by clicking on the Hue icon, which took me to the Hue homepage. The login credentials were:

    Login ID: training
    Password: training
3. Then I opened the query editor menu and selected "Hive" from the pull-down menu, which took me to the Hive query editor.

4. In the query editor pane, I entered a SQL command to create a table for the webpage data imported in the previous exercise:

    CREATE EXTERNAL TABLE webpage
        (page_id SMALLINT, name STRING, assoc_files STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/loudacre/webpage';
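As an optional sanity check not part of the original steps, the table's LOCATION can also be listed from the command line to confirm that the previously imported data files are where the table expects them:

    $ hdfs dfs -ls /loudacre/webpage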
5. The "Execute" button was pressed to run the query. To see the table that was just created, I pressed the "refresh" button next to "Database" on the left-hand side of the page. I could then find the webpage table with the columns page_id, name, and assoc_files.
6. The "New Query" button was pressed to execute a test query (sketched below), and I could see the results by clicking on the "Results" tab of the panel.

7. Then I previewed the sample data by clicking on the preview-sample-data icon next to the table name.
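The test query itself appears only as a screenshot in the original; a minimal query matching the description, offered here as an assumption, would be:

    SELECT * FROM webpage LIMIT 10;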
Use Sqoop to Import Directly into Hive or Impala

8. Next, I went to the terminal and imported the device table directly into the Hive metastore. The command I entered to do this was:

    $ sqoop import \
        --connect jdbc:mysql://localhost/loudacre \
        --username training --password training \
        --fields-terminated-by '\t' \
        --table device \
        --hive-import

9. Using the HDFS command line, I went to the data location /user/hive/warehouse/device to view the imported data files (listing command sketched below). This is the default Hive warehouse directory, where the imported data was placed.
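The exact listing command is not shown in the original; the standard way to inspect that warehouse directory would be:

    $ hdfs dfs -ls /user/hive/warehouse/device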
10. I opened the part-m-00000 file to view the contents that were copied into the Hive warehouse.

11. Because I am using Impala, I refreshed the Impala metadata cache by entering the following command in the Hue Impala query editor:

    INVALIDATE METADATA
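For context: INVALIDATE METADATA reloads metadata for all tables, which is appropriate here because the device table is new to Impala. When only the data files of a table Impala already knows about have changed, the cheaper per-table statement is:

    REFRESH device;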
12. To see all the columns of the device table, I ran a test query:

    SELECT * FROM device LIMIT 10;
Lab Chapter 7: Select a Format for a Data File

1. First, I changed to the Lab 7 exercise directory using the following command in the terminal window:

    $ cd $DEV1/exercises/data-format/

2. Then I imported the accounts table in Avro data format using the following command:

    $ sqoop import \
        --connect jdbc:mysql://localhost/loudacre \
        --username training --password training \
        --table accounts \
        --target-dir /loudacre/accounts_avro \
        --null-non-string '\\N' \
        --as-avrodatafile
3. Then I went to the file browser and viewed the files Sqoop had imported into HDFS.

4. Then I viewed the contents of the file part-m-00000.avro.
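Avro files are binary, so the raw view in Hue is only partially readable. Assuming the avro-tools wrapper is available (as it typically is on the Cloudera training VM), the file can also be inspected from the command line after copying it locally:

    $ hdfs dfs -get /loudacre/accounts_avro/part-m-00000.avro .
    $ avro-tools getschema part-m-00000.avro          # print the embedded Avro schema
    $ avro-tools tojson part-m-00000.avro | head      # view the records as JSON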
5. I could see the file called sqoop_import_accounts.avsc in the current directory.

6. Then I viewed the contents of the sqoop_import_accounts.avsc file. Below is a screenshot of the file view:
7. Then I navigated to the loudacre directory in the Hue file browser and clicked the upload button. A dialog box with an option to select files appeared.

8. I chose the file sqoop_import_accounts.avsc to upload.
9. Now I can see that the file has been uploaded to the loudacre directory in HDFS.

10. In Impala, I then created the table accounts_avro using this query:

    CREATE EXTERNAL TABLE accounts_avro
    STORED AS AVRO
    LOCATION '/loudacre/accounts_avro'
    TBLPROPERTIES
        ('avro.schema.url'='hdfs:/loudacre/sqoop_import_accounts.avsc');
11. Then I ran a test query to check whether the table was created:

    SELECT * FROM accounts_avro LIMIT 10

12. Then, optionally, I used the DESCRIBE FORMATTED command to list the columns and data types of the accounts_avro table created from the Avro schema.
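The full statement for step 12, matching the command named above, is simply:

    DESCRIBE FORMATTED accounts_avro;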
Lab Chapter 8: Partition Data in Impala or Hive

1. First, I created a new empty table in Hive using a CREATE EXTERNAL TABLE statement:

    CREATE EXTERNAL TABLE accounts_by_areacode (
        acct_num INT,
        first_name STRING,
        last_name STRING,
        phone_number STRING)
    PARTITIONED BY (areacode STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/loudacre/accounts_by_areacode';
2. To populate the new table, I needed to extract the area code from each phone number, so I executed a SELECT query to do this (shown in the screenshot).

3. Then I used that SELECT statement in an INSERT INTO TABLE command to copy the specified columns to the new table, dynamically partitioning by area code. See the screenshot, and the sketch of both statements below:
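The original shows these two statements only as screenshots. A sketch of what they likely looked like, assuming the source table is accounts and the area code is the first three digits of phone_number (both of which are assumptions here):

    -- Extract the area code (assumed: first three digits of phone_number)
    SELECT acct_num, first_name, last_name, phone_number,
           SUBSTR(phone_number, 1, 3) AS areacode
    FROM accounts;

    -- Reuse the SELECT to populate the table with dynamic partitioning;
    -- the partition column must come last in the SELECT list.
    -- (If run in Hive rather than Impala, this may also require:
    --  SET hive.exec.dynamic.partition.mode=nonstrict;)
    INSERT INTO TABLE accounts_by_areacode PARTITION (areacode)
    SELECT acct_num, first_name, last_name, phone_number,
           SUBSTR(phone_number, 1, 3) AS areacode
    FROM accounts;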
4. Then I ran a test query to check that the table was populated:

    SELECT * FROM accounts_by_areacode LIMIT 10;

We can see in the screenshot that the table was populated properly.

5. Using Hue, I confirmed that the directory structure of the accounts_by_areacode table includes partition subdirectories, using the query:

    SHOW PARTITIONS accounts_by_areacode;
Lab Chapter 9: Collect Web Server Logs with Flume

Create an HDFS Directory for Flume to save ingested data

1. I created a directory in HDFS called /loudacre/weblogs to store the data files that Flume ingests, using the following command in the terminal:

    $ hdfs dfs -mkdir /loudacre/weblogs
2. Now, in the file browser in Hue, I can see the new directory named weblogs under /loudacre.

Create a local directory for web server log output

3. Next I created the spool directory into which the web log simulator will store data files for Flume to ingest: /flume/weblogs_spooldir on the local filesystem (the command is sketched after step 4).

4. To view the created directory, I navigated to /flume/weblogs_spooldir. See the screenshot below:
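The original does not show the command used to create the spool directory; since /flume sits at the root of the local filesystem, it presumably required sudo. A sketch, with the exact form an assumption:

    $ sudo mkdir -p /flume/weblogs_spooldir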
5. Then I gave all users permission to write to the /flume/weblogs_spooldir directory by running the following command:

    $ sudo chmod a+w -R /flume
Configure Flume

Run the Agent

6. Then I changed to the ~/training_materials/dev1/exercises/flume directory using the command:

    $ cd ~/training_materials/dev1/exercises/flume
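The Flume configuration reviewed in the "Configure Flume" step appears only as screenshots in the original. As a reference, a minimal sketch of a spooling-directory-to-HDFS configuration in Flume's properties format, with the agent name (agent1) and all component names being assumptions:

    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Spooling-directory source watching the local spool directory
    agent1.sources.src1.type = spooldir
    agent1.sources.src1.spoolDir = /flume/weblogs_spooldir
    agent1.sources.src1.channels = ch1

    # In-memory channel
    agent1.channels.ch1.type = memory

    # HDFS sink writing into the directory created earlier
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = /loudacre/weblogs
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.channel = ch1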
7. Then I started the Flume agent using the configuration that was created (command sketched below).

8. I then saw the message "web server log source started" in the command line.
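The agent-start command appears only as a screenshot in the original. A sketch using the standard flume-ng CLI, where the configuration file name (spooldir.conf) and agent name (agent1) are assumptions:

    $ flume-ng agent --conf /etc/flume-ng/conf \
        --conf-file spooldir.conf \
        --name agent1 \
        -Dflume.root.logger=INFO,console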
Simulate Apache web server output

9. I opened a separate terminal window and changed to the exercise directory. I then ran the provided script to place the weblog files in the /flume/weblogs_spooldir directory (a sketch of the invocation follows below).

10. Then I went back to the terminal that was running the Flume agent. The following output was observed; it gave information about the files Flume was putting into HDFS:

11. Then I pressed CTRL+C to terminate the process and got the output below.
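The script itself is not named in the text (it appears only in a screenshot), so the invocation below is a hypothetical reconstruction: both the script name and its argument are assumptions.

    $ cd ~/training_materials/dev1/exercises/flume
    $ ./copy_move_weblogs.sh /flume/weblogs_spooldir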
12. Then I opened the Hue browser and navigated to the /loudacre/weblogs directory, where I could see the files newly imported into HDFS by Flume.
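Equivalently, the ingested files could be listed from the command line (not part of the original steps):

    $ hdfs dfs -ls /loudacre/weblogs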