AD571_Lecture_4

School: Boston University
Course: 571
Subject: Information Systems
Date: Oct 30, 2023


Lecture 4 – Data Wrangling and Visualization

Learning Objectives

After you complete this lecture, you will be familiar with the following concepts:
- Data wrangling capabilities
- Importance of familiarity with the data
- Exploratory and explanatory analysis
- Data visualizations and communications

Data Wrangling

The ability to wrangle our data allows us to get more value from the analysis. Data wrangling is the way we retrieve, evaluate, and pre-process data into a usable format. Although specialists complete most of the pre-processing of raw data for the analyst, getting the data into the correct format on a database, the wrangling process is ongoing and nonlinear. This nonlinearity is demonstrated in chapter 9 of Wickham’s text. Notice that the transformation step can happen multiple times, as it does in the work done to complete the term project using the NYC Real Estate Data.

Analysis often requires different forms and sources of data, as well as different types of variables (numeric, character, calculated, date, etc.). Wrangling also includes dealing with missing values. Many forms of wrangling will be required for the cases and data structure requirements in this course as we work with various analytics techniques in descriptive, predictive, and prescriptive methodologies.

Figure 4.1: Demonstration of Data Wrangling Process

Before we can begin modeling, visualizing, and storytelling with data, one of the most important steps is data wrangling, or data preparation. Data preparation ensures that the data is clean and ready for analysis. Even when there is a specific goal for the analytics, most of the time may be spent manipulating the data. Data wrangling is also known as transformation and manipulation.

For students in this class, RStudio will be used for cleaning and manipulating the data in preparation for analysis and interpretation. Students will use a package called “Tidyverse” for the data wrangling process. Tidyverse 1.3.0 bundles multiple packages within a single package for ease of use. The packages installed along with Tidyverse include: broom, cli, dbplyr, dplyr, ggplot2, haven, hms, httr, jsonlite, lubridate, magrittr, modelr, pillar, purrr, readr, readxl, reprex, rlang, rstudioapi, rvest, stringr, tibble, tidyr, and xml2. From this extensive list, we will need only a select few. The most important packages for manipulating the NYC Real Estate data will be dplyr, tidyr, lubridate, magrittr, ggplot2, and stringr.

To complete the data wrangling process, we will use dbReadTable() to extract the tables on the connected database into data frames. After the tables have been extracted, Tidyverse functions are used to join them using pipes (%>%). Where necessary, columns will be cleaned to remove extra spaces, and records may need to be filtered with the filter() function to remove inaccurate data from the frame. Such records may be ones with values such as 0 and 10 for the recorded sale price or square footage. Additionally, some columns may need to be created with the mutate() function. Further data manipulation using the group_by() and ungroup() functions is performed throughout the work with the data as well.
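The wrangling steps described above can be sketched in R as follows. This is a minimal illustration, not the course's actual solution: the table contents, the column names (sale_price, gross_sqft, neighborhood_name, and so on), and the cutoff of 10 used to drop placeholder values are all assumptions made for the example. In the course, the two data frames would come from dbReadTable() on the connected database rather than being typed in by hand.

```r
library(dplyr)    # join, filter, mutate, group_by, summarise, and the %>% pipe
library(stringr)  # str_trim for removing extra spaces

# Hypothetical stand-ins for tables that would normally be read with
# DBI::dbReadTable(); every name and value here is illustrative only.
sales <- data.frame(
  sale_id         = 1:6,
  neighborhood_id = c(1, 1, 2, 2, 2, 1),
  sale_price      = c(500000, 0, 750000, 10, 900000, 600000),
  gross_sqft      = c(1000, 800, 1500, 1200, 0, 1000),
  stringsAsFactors = FALSE
)
neighborhoods <- data.frame(
  neighborhood_id   = c(1, 2),
  neighborhood_name = c("  CHELSEA ", "ASTORIA  "),
  stringsAsFactors = FALSE
)

cleaned <- sales %>%
  # Join the two tables on their shared key
  left_join(neighborhoods, by = "neighborhood_id") %>%
  # Remove leading and trailing spaces from the text column
  mutate(neighborhood_name = str_trim(neighborhood_name)) %>%
  # Assumption: prices or areas of 0 or 10 are placeholders, not real values
  filter(sale_price > 10, gross_sqft > 10) %>%
  # Create a new calculated column
  mutate(price_per_sqft = sale_price / gross_sqft)

# Summarise per neighborhood, then drop the grouping
summary_by_neighborhood <- cleaned %>%
  group_by(neighborhood_name) %>%
  summarise(
    n_sales            = n(),
    avg_price_per_sqft = mean(price_per_sqft)
  ) %>%
  ungroup()

print(summary_by_neighborhood)
```

Calling ungroup() at the end keeps later steps from silently operating per group, which is one reason group_by() and ungroup() tend to appear in pairs.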
Data Visualization and Business Intelligence

The goal of data visualization is to make sense of large amounts of data and gain the insights needed for a competitive advantage in the business environment. Data visualization focuses on historical and current data, which places it in the descriptive analytics category. As defined by Evans, data visualization is the process of displaying data in a meaningful fashion to provide insights that will support better decision-making, provide managers with better analysis capabilities that reduce reliance on IT professionals, and improve collaboration and information sharing. Visualizing data improves communication across different functions and levels of the business and makes it possible for analysts to notice patterns and relationships throughout the process. Visualization is the first step toward more advanced analytics methodologies such as predictive and prescriptive analytics. Once we know how the data looks, this knowledge can guide us in choosing the best possible models in the future.

Background of Data

Understanding the background of your data helps to put everything into perspective. It is important to know the root source of the data before it arrived in the database, so that we can think about it in context. Knowing what actually happened to cause the data to be collected and entered into the records makes that data useful. One of the main benefits of having context for the data is that we are able to ask and answer the question: What else does this mean?

For example, if an analyst has data about website traffic and knows that the website sells products, then data collected about page visits for a product can answer questions about revenue even when we don’t have revenue data within reach. The product page with the highest level of traffic is likely to also have a high revenue share. This is an example of answering: What else does this mean?

Context can change the perspective with which you analyze, interpret, and represent the data to others in the communication process. Although it is standard practice, it bears repeating that we should be certain the source of our data is trustworthy, clean, and collected properly. In the process of interpretation, conversations with subject matter experts can help to change the perceptions of the analyst and open up new directions for visualization. We can also apply the 5 W’s to the background of our data to improve the perspective:
- Who collected the data (whether man or machine)? This can also align with how the data was collected.
- What does the data represent?
- Where does the data come from geographically, and can it be generalized to other locations? Is it prone to seasonality or specific geo-trends?
- When was the data collected, and will the insight still hold true today?
- Why was the data collected, and is there possibly a specific agenda push, budget request, additional bias, or specific outlook involved that can change the truths we are looking at?

Visualization of data is neither a purely creative process nor a purely technical one. Visualizing data well requires an understanding of what it means in the real world and how it should be interpreted as we navigate the complexity and uncertainty of our data.

Techniques for Storytelling with Data

Dashboarding

Dashboarding is a descriptive analytics process of visualizing important data for decision-making and maintaining a “pulse” on metrics important to strategic implementation and corrective action management.
The Big Book of Dashboards defines a dashboard as a visual display of data used to monitor conditions and/or facilitate understanding. This understanding, in turn, helps us to make decisions and leads to additional analysis. Although there are various types and forms of dashboards, one important requirement is that a dashboard be relevant to the people who need to make decisions. The dashboard is the big-picture perspective for analysis and can be considered a preliminary step leading to closer examination of the data.

Drilldown

The process of data drilldown is used to examine a construct in more detail. For example, measuring the construct of product performance will require an analyst to explore the revenue, profitability, and trends of specific products by drilling down into the annual or quarterly sales. Once a construct is connected to an indicator on a dashboard, the analyst gains the ability to evaluate information in increasing detail. Drilldown offers the benefit of going from a general overview to a specific examination. This allows the analyst to tell a detailed story with increasing relevance for
