Destiny Denson DAT640 Final Project

Destiny Denson Southern New Hampshire University DAT 640: Final Project 1/14/2024
I. Organizational Background
A. The Insurance Company (TIC) wants to identify potential customers for caravan insurance. Its current method of reaching potential clients through unsolicited (junk) emails is unreliable and costly. To better understand who is likely to buy a policy, TIC has decided to participate in a data mining competition whose goals are to predict which customers are potential caravan insurance clients and to explain why they purchase a policy. TIC uses real-world business data provided by Sentient Machine Research as a benchmark for its data mining efforts. By applying predictive analytics tools to the available data, TIC seeks to estimate the likelihood that a customer holds a caravan insurance policy and to determine the factors that influence the purchasing decision. This approach allows TIC to better understand potential customers and optimize its marketing strategies.
B. To achieve this goal, TIC benchmarks its data mining on real-world business data provided by the Dutch data mining company Sentient Machine Research. “Sentient Information Systems provides the Data Detective data mining software suite and provides services for developing dedicated business intelligence and predictive analytics applications” (Sentient, n.d.). The organization uses the available data to estimate the probability that a customer holds a caravan insurance policy and to flag customers whose probability exceeds a cutoff. Costs and benefits can determine that cutoff, which also helps TIC understand the reasons for holding a caravan policy and the differences between policyholders and other customers.
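The cost-and-benefit cutoff described above can be made concrete with a small calculation. The sketch below is written in plain Python rather than the R and Rattle tooling this report uses, and the contact cost, profit per policy, and probability scores are hypothetical placeholders, not TIC figures. It assumes a customer is worth contacting when the expected profit (probability times profit per policy) is at least the cost of the contact.

```python
def break_even_probability(contact_cost, profit_per_policy):
    """Smallest purchase probability at which contacting a customer pays off.

    Assumption: contacting is worthwhile when
    probability * profit_per_policy >= contact_cost.
    """
    if profit_per_policy <= 0:
        raise ValueError("profit_per_policy must be positive")
    return contact_cost / profit_per_policy


def select_customers(scores, cutoff):
    """Return the indices of customers whose predicted probability meets the cutoff."""
    return [i for i, p in enumerate(scores) if p >= cutoff]


# Hypothetical figures for illustration only (not taken from the TIC data).
cutoff = break_even_probability(contact_cost=2.0, profit_per_policy=40.0)
scores = [0.01, 0.08, 0.05, 0.30, 0.02]  # made-up model probabilities
targets = select_customers(scores, cutoff)
print(cutoff, targets)
```

Raising the assumed contact cost or lowering the assumed profit moves the cutoff up, so fewer customers are contacted.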
II. Data Set
A. The data set used in this analysis comes from the Insurance Company (TIC) Benchmark, provided by Sentient Machine Research. It consists of real-world business data intended for predicting potential customers for caravan insurance. The data set has 84 variables and 5822 observations and is split into training and testing data (TICDATA2000.txt). The breadth of these variables allows for a comprehensive exploration of customer behavior and characteristics.
B. R’s summary command and other descriptive statistics functions reveal the central tendency and dispersion of each variable and help identify potential outliers. Summarizing the data is a foundational step for grasping its key characteristics and informs the subsequent predictive analytics strategies.
C. Data visualizations further improve our understanding of the data set. Histograms, boxplots, and scatterplots created in R let us visualize distributions, identify patterns, and detect potential anomalies. These techniques give a more intuitive view of the data and support informed decision-making.
In summary, the data set from the TIC Benchmark is a valuable resource for predicting caravan insurance policy purchasers. Its many variables and observations provide a rich foundation for developing predictive analytic strategies. By employing statistical summaries and data visualizations, we can unravel the underlying
characteristics, setting the stage for a robust analysis aligned with the organization’s goal of optimizing their marketing approach.
III. Data Visualizations
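The statistical summaries and histograms described in the Data Set section can be illustrated with a short sketch. It is written in plain Python for illustration, whereas the report itself relies on R; the column values are made up and do not come from TICDATA2000.txt. The `summarize` helper is a hypothetical name that returns the same six statistics R’s summary command reports for a numeric column, using the "inclusive" quantile method, which should correspond to R’s default.

```python
import statistics
from collections import Counter


def summarize(values):
    """Return min, quartiles, mean, and max for a numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


def text_histogram(values):
    """Return one text bar per distinct value, sorted by value."""
    counts = Counter(values)
    return [f"{v:>3} | {'#' * counts[v]}" for v in sorted(counts)]


# Made-up values standing in for one coded variable (illustration only).
column = [1, 2, 2, 3, 3, 3, 4, 8]
summary = summarize(column)
print(summary)
for line in text_histogram(column):
    print(line)
```

A value far from the third quartile, such as the 8 in this made-up column, is the kind of potential outlier the summary and histogram are meant to surface.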
Predictive Algorithms:
A. Explain your organization’s selective structure, interaction, and relationship specifications: Selecting a predictive algorithm requires understanding the organization’s data and the nature of the predictive task. To predict potential customers for caravan insurance, factors such as customer demographics, historical data, and interactions with the insurance company need to be considered. Tree-based methods such as Random Forest are suitable for capturing complex interactions and nonlinear relationships in such data. TIC spends significant resources on sending emails to identify potential customers for caravan insurance, but much of this effort is wasted because many recipients are uninterested and discard the emails. TIC therefore wants to better understand “who would be interested in buying a caravan insurance policy and why?”
B. Recommend a predictive algorithm: Random Forest is recommended for its ability to handle complex datasets with numerous variables and provide robust predictions. In predicting caravan insurance purchasers, where multiple factors may influence the decision, Random Forest’s ensemble
learning approach is advantageous. TIC aims to predict potential customers for caravan insurance and understand their buying behavior. Before a predictive algorithm is selected, descriptive analysis is conducted to understand previous customer behavior. Predictive algorithms such as random forests, decision trees, and regression models are then used to identify trends and patterns in customer behavior and predict future outcomes. Based on this analysis, the random forest model is recommended for its ability to handle complex datasets and provide accurate predictions.
C. Determine the tools that would facilitate implementation: The ‘randomForest’ package in R implements Random Forest. RStudio, an integrated development environment (IDE) for R, supports coding and visualization while building, testing, and evaluating the predictive algorithm. TIC uses Rattle, an open-source GUI for R, as its data analysis tool. Rattle supports data mining tasks including statistical and visual summaries, machine learning model building, and performance evaluation, and its user-friendly interface makes it easy to explore and analyze data when developing predictive models.
II. Model Optimization:
A. Analyze commonly used approaches and methods: Random Forest and Logistic Regression are chosen for their complementary strengths. Random Forest is effective at capturing complex relationships, while Logistic Regression is well suited to binary classification tasks such as predicting whether a customer purchases an insurance policy. Performance measures such as confusion matrices, ROC curves, and risk charts are used for model optimization and evaluation. These techniques help assess the accuracy and reliability of
predictive models and guide the selection of the most effective algorithms. Model evaluation involves assessing performance metrics on training and testing datasets using techniques like cross-validation. Continuous feedback methods are implemented to monitor and improve model performance over time. The expected result is continuous improvement in predictive model accuracy and reliability, leading to more effective marketing strategies for caravan insurance.
B. Evaluate the reliability of the predictive model structures: Reliability is assessed by measuring performance metrics (accuracy, precision, recall) on both training and testing datasets. Cross-validation techniques, such as k-fold cross-validation, help ensure that the models generalize well to different subsets of the data, enhancing their reliability.
C. Describe the steps for implementing a continuous feedback method: Implementing a continuous feedback method involves setting up automated processes for model retraining and validation. This includes monitoring the model’s performance over time, adapting to changes in customer behavior, and ensuring the model remains accurate and reliable as new data becomes available.
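The monitoring step of the continuous feedback method can be sketched as a simple check that flags the model for retraining when recent accuracy drifts below a baseline. This is a minimal illustration in plain Python, not the report’s R or Rattle workflow. The function names, the tolerance of 0.05, the three-batch window, and the batch accuracies are all assumptions made for the example; only the 0.72 baseline echoes the accuracy figure quoted in this report.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def needs_retraining(batch_accuracies, baseline, tolerance=0.05, window=3):
    """Flag retraining when the mean accuracy of the last `window` batches
    falls more than `tolerance` below the baseline.

    The tolerance and window defaults are illustrative assumptions.
    """
    if len(batch_accuracies) < window:
        return False  # not enough new data to judge yet
    recent = batch_accuracies[-window:]
    return sum(recent) / window < baseline - tolerance


baseline = 0.72  # accuracy figure quoted in this report
history = [0.73, 0.71, 0.70]  # made-up accuracies on new batches
print(needs_retraining(history, baseline))

history += [0.66, 0.64, 0.62]  # made-up drift in customer behavior
print(needs_retraining(history, baseline))
```

Averaging over a window rather than reacting to a single batch keeps one unusual batch from triggering a retrain.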
D. Document the expected results: The expected result of using the scoring engine is continuous improvement in predictive model accuracy and reliability over time. Through regular evaluations and adjustments based on the feedback loop, the organization can achieve optimized predictive analytics models aligned with its goals, leading to more effective marketing strategies for caravan insurance. According to the risk chart, our model has an accuracy rate of 72%. The ROC curve is another way to measure the performance of a classification model. It plots the true positive rate (TPR) against the false positive rate (FPR) as the cutoff applied to the classifier’s predicted probability varies, and the area under the curve is then calculated. The larger the area under the curve, the better the model can distinguish between the classes. The figure shows that the model’s accuracy is 71%.
Upon thorough consideration of the research question, “Who would be interested in purchasing a caravan insurance policy and for what reasons?” I have opted to use the random forest prediction algorithm. Documenting the work as reproducible research is crucial for successfully addressing the Caravan research problem. Computational reproducibility is achieved when subsequent researchers can replicate the project’s final outcomes, including essential quantitative results, tables, and figures, using only a set of files and written instructions.
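The ROC curve and area under the curve described in section D can be computed by hand on a small example. The sketch below is plain Python for illustration (the report’s figures were produced with R tooling), and the labels and scores are made up rather than model output. It sweeps the cutoff from the highest score downward, records an (FPR, TPR) point at each distinct score, and applies the trapezoidal rule for the area.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the cutoff over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sort by descending score; lowering the cutoff admits one case at a time.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last case sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels (1 = holds a policy) and predicted probabilities.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.3, 0.1]
curve = roc_points(y_true, scores)
print(curve)
print(round(auc(curve), 3))
```

An area of 0.5 corresponds to ranking customers at random, and an area of 1.0 means every policyholder is scored above every non-policyholder.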
Therefore, I will employ the random forest predictive algorithm to predict potential customers interested in a caravan insurance policy. Random forest is chosen because of the substantial size and dimensionality of our dataset: it handles large datasets with many dimensions well, and its accuracy can be checked through cross-validation.
III. Pilot Plan: Once the model is developed, the pilot program will begin, using both the training and test datasets for the initial evaluation and optimization phase. The pilot program serves as a trial run for our proposed solution and helps determine whether adjustments are necessary. Throughout the pilot, the company can assess the model and analyze various factors, including whether a customer holds a caravan insurance policy. The CRISP-DM procedure can be repeated with the new data produced by the final model until satisfactory results are achieved.
IV. Presentation: By employing the predictive model, specifically Random Forest, TIC gains the ability to anticipate customer responses and capitalize on cross-selling opportunities. This capability helps companies attract, retain, and expand their most profitable customer base, thereby enhancing overall organizational effectiveness. While developing a model, TIC has the flexibility to consider various predictive algorithms, but selecting the most suitable one is paramount for optimizing future outcomes. Given TIC’s objectives, developing a random forest model and then evaluating and refining it using an error matrix and the ROC curve is the recommended approach for predicting the company’s future outcomes.
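The error matrix mentioned above, together with the accuracy, precision, and recall measures used earlier in this report, can be illustrated with a short calculation. The sketch is plain Python rather than Rattle’s built-in evaluation, the function names are hypothetical, and the labels are made up, not results from the TIC model.

```python
def error_matrix(y_true, y_pred):
    """Count TP, FP, TN, FN for a binary problem where 1 means 'buys a policy'."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


def metrics(counts):
    """Derive accuracy, precision, and recall from the error matrix counts."""
    total = sum(counts.values())
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "precision": counts["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": counts["tp"] / actual_pos if actual_pos else 0.0,
    }


# Made-up actual and predicted labels for ten customers (illustration only).
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
matrix = error_matrix(y_true, y_pred)
print(matrix)
print(metrics(matrix))
```

Because buyers are usually a small share of customers in this kind of problem, precision and recall are worth reading alongside accuracy: a model that predicts "no purchase" for everyone can still score a high accuracy.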