SRT411: Digital Data Analysis
Winter 2023

SRT411: Lab 04 (5%)
Understanding Large Data Sets and From Data Preparation to Visualization

Objective

The main goal of this lab is to look at raw data, understand its structure, and try to discern some meaning from it. Once you understand the nature of the data, you can begin to think about what questions you might ask and how to answer them from the collected data, and finally create Kibana visualizations for the data set.

What to do

Download a data set from any one of the following sources. Pick one smaller than 2 GB because of hardware restrictions, preferably related to cyber security (data breaches, anomaly detection, network traffic, etc.).

1. Kaggle Datasets
2. The UCI Machine Learning Repository (various large datasets)
3. AWS Datasets
4. Curated list of open datasets

Choose a data set that interests you and investigate it. Make sure the data set is from either 2018 or 2019; we will not be looking at older data sets.

Tasks

Task 0: Create a report

1. Create a Word document and write the details of your lab completion in it. This will serve as proof that the lab was satisfactorily completed.
2. Each heading should be a task (Task 1, Task 2, etc.), with screenshots and descriptions that prove the task was completed satisfactorily. Fill these headings out as you complete the lab.

Task 1: Describe the Data

Include answers to the following questions in your report.

- Describe what data is contained in each observation in the file.
- Describe each variable (name, age, address, height, port, etc.) and the range of possible values for each variable.
- List the data type of each variable (factor, number, string, etc.).
- How might each element (variable, field) of the data be interesting?
- Design the research question.
- Which fields will you use in your data analysis to answer the research question?

Task 2: Data Conversion

- Ingest the data into Elasticsearch using Logstash. You need to use a configuration file.
- Clean the data if necessary (deal with missing and incorrect values).
- Add appropriate filters in the configuration file for data type conversions.
- Set the timestamp if the data is time based.
- Run Logstash using the following command: logstash -f [path to conf file]
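Before writing the Logstash filters for Task 2, it can help to check which rows and fields will convert cleanly. The following is a minimal sketch in Python, not part of the required Logstash workflow. The sample rows, the column names (`timestamp`, `src_port`, `bytes`, `protocol`), the timestamp format, and the assumption that timestamps are in UTC are all hypothetical and should be replaced with the details of your own data set.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical sample data; the columns and values are assumptions for illustration.
SAMPLE_CSV = """timestamp,src_port,bytes,protocol
2019-03-01 12:00:05,443,1500,TCP
2019-03-01 12:00:09,,320,udp
not-a-date,80,abc,tcp
"""


def to_int(value):
    """Convert a string to int, returning None for missing or invalid values."""
    try:
        return int(value.strip())
    except ValueError:
        return None


def clean_row(row):
    """Return a cleaned copy of the row, or None if the timestamp is unusable."""
    try:
        parsed = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None  # a time-based record without a valid timestamp is dropped
    return {
        # Assumes the source timestamps are in UTC; emit ISO 8601.
        "timestamp": parsed.replace(tzinfo=timezone.utc).isoformat(),
        "src_port": to_int(row["src_port"]),
        "bytes": to_int(row["bytes"]),
        "protocol": row["protocol"].strip().lower(),
    }


def clean_csv(text):
    """Clean every row of CSV text and keep only the rows that survive."""
    reader = csv.DictReader(io.StringIO(text))
    cleaned = (clean_row(row) for row in reader)
    return [row for row in cleaned if row is not None]


rows = clean_csv(SAMPLE_CSV)
for row in rows:
    print(row)
```

The same decisions (which fields become integers, which timestamp format to parse, what to do with rows that fail) then carry over to the type-conversion and timestamp filters in the Logstash configuration file.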
Task 3: Indexing

Load the data into Elasticsearch.

- Index mapping: Check the data structure of your index and the respective index settings. Take a screenshot and add it to the report.
- Index template: Read about index templates and create one that can be reused in different projects. Explain the purpose and use of index templates in the report.

While you are working on this lab, if your machine gets low on memory, read about index lifecycle management and try either deleting some indexes or freezing them.

Task 4: Searching

After indexing your data, it is time to perform search operations. Elasticsearch uses a JSON-based query language (Query DSL) for performing complex queries. There are different query options available. Go through them and pick the two that are most appropriate for your data set. They should give some meaningful information about your data. Write the queries in the Kibana console and take screenshots to include in the report.

Task 5: Visualization

- Visualize the data. You can shape your data using a variety of charts, tables, and maps, such as line, pie, bar, or area charts, maps, and markdown widgets. Create at least four different types of visualizations for this lab.
- Add the visualizations to a dashboard.
- Include the individual visualizations in your report, along with an explanation of how you created them and what information can be extracted from them.
- Inspect the data behind a visualization using the DISCOVER feature of Kibana.
- Filter and query the data in a visualization (use the Kibana Query Language, KQL). Take screenshots of your query and the result set. Save the searches.
- Calculate the mean, mode, median, standard deviation, and variance for one or two columns from your data set, and show how these values can be useful and what information or analysis can be based on them.
- State the biggest finding based on the data set.
- What have you learned from the data, and where are you going to use it?
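The statistics asked for in Task 5 can be cross-checked outside Kibana. The following is a minimal sketch using Python's standard `statistics` module. The `values` list is hypothetical and stands in for one numeric column of your data set.

```python
import statistics

# Hypothetical values standing in for one numeric column (for example, bytes per record).
values = [200, 400, 400, 400, 500, 500, 700, 900]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    # Population formulas divide by n: use them when the column is the whole data set.
    "population_variance": statistics.pvariance(values),
    "population_stdev": statistics.pstdev(values),
    # Sample formulas divide by n - 1: use them when the column is a sample.
    "sample_variance": statistics.variance(values),
    "sample_stdev": statistics.stdev(values),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

When interpreting the results, a mean that sits well above the median suggests a few large values are pulling the average up, and a large standard deviation relative to the mean indicates widely spread values. State in the report whether you used the population or the sample formulas, since the two give different results.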
** You need to update your Kibana license to the free 30-day trial version. With the trial license you can save the dashboards as images and include them in your report instead of screen captures.

Demonstration & Deliverables

Submit the following:
1) Written report in PDF format.
2) Data set in CSV format.
3) Logstash conf file.

Demonstrate the following:
1) Show that the data has been ingested and the index pattern has been created properly.
2) Run the queries and show their results.
3) Show the dashboard, all the visuals created, and the use of the DISCOVER feature in Kibana.
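For the Task 4 queries that are run during the demonstration, it can be convenient to draft the JSON bodies before pasting them into the Kibana console. The following is a minimal sketch in Python that builds a `match` query and a `range` query and prints them in console form. The index name `lab4-data` and the field names `protocol` and `bytes` are hypothetical; use the names from your own index mapping, and choose the query types that suit your data.

```python
import json

# Hypothetical field names; replace them with fields from your own index mapping.
match_query = {"query": {"match": {"protocol": "tcp"}}}

range_query = {
    "query": {
        "range": {
            "bytes": {"gte": 1000, "lte": 5000}
        }
    }
}


def console_request(index, body):
    """Format a search request the way it is typed in the Kibana console."""
    return f"GET {index}/_search\n{json.dumps(body, indent=2)}"


# "lab4-data" is a hypothetical index name.
print(console_request("lab4-data", match_query))
print()
print(console_request("lab4-data", range_query))
```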