Credit Card Fraud Detection

Student's Name:
Student's ID:
Date:
University Name:
Abstract

Over the past few years, the frequency of credit card fraud across the globe has been increasing steadily alongside emerging technologies, and incidents of credit card fraud can directly lead to substantial financial and reputational losses. This research on the topic "Credit Card Fraud Detection" focuses on understanding credit card fraud and providing insights into how credit card issuers can continue to improve their security measures to prevent fraudulent use efficiently. Several factors are associated with the topic, such as the most effective machine learning algorithms for fraud detection, the most predictive features of a credit card transaction, ethical and effective fraud detection strategies, and the limitations of machine learning algorithms in credit card fraud detection. In this study, a credit card fraud detection dataset is selected. The selected dataset is imbalanced, containing 492 fraudulent transactions out of a total of 284,807 transactions; the positive class accounts for 0.172% of all transactions. The study gathers the relevant information through a secondary data collection method drawing on a range of secondary sources. The paper also delivers a recommendation section, which identifies effective and innovative strategies associated with the selected topic. Logistic regression, random forest, KNN, SVC and decision tree models all achieve substantial precision, recall and F1 scores. Based on the requirements, demands and accuracy of the models, logistic regression is identified as the best algorithm for the classification task. For this study, a Gaussian NB model is proposed; the proposed model achieves a notable accuracy and recall value compared to the other models.
Table of Contents

Chapter 1: Introduction
1.1 Research Background
1.2 Research Rationale
1.3 Research Problem
1.4 Research Aim and Objectives
1.5 Research Question
1.6 Research Scope
Chapter 2: Literature Review
2.1 Overview
2.2 Theoretical Framework
2.2.1 Effective Machine Learning Algorithms for Fraud Detection
2.2.2 Most Predictive Features of a Credit Card Transaction
2.2.3 Limitations of Machine Learning Algorithms in Credit Card Fraud Detection
2.2.4 Ethical and Effective Fraud Detection Strategies
2.3 Conceptual Framework
2.4 Literature Gap
2.6 Analysis of the Problem
2.7 Summary
Chapter 3: Research Methodology
3.1 Overview
3.2 Research Philosophy
3.3 Research Approach
3.4 Research Design
3.5 Research Strategy
3.6 Data Collection Method
3.7 Data Analysis Method
3.8 Ethical Considerations
3.9 Research Limitations
3.10 Summary
Chapter 4: Artifact Design and Implementation
4.1 Design
4.2 Evaluation Metric
4.3 Data Splitting
4.4 Implementation
4.5 Result
4.6 Proposed Algorithm
Chapter 5: Critical Evaluation
Chapter 6: Conclusion and Recommendation
6.1 Conclusion
6.2 Recommendation
6.3 Future Scope
References
Appendix
List of Figures

Figure.1: Credit Card Fraud Detection Using Random Forest
Figure.2: Conceptual Framework of "Credit Card Fraud Detection"
Figure.3: Confusion Matrix
Figure.4: Importing Dataset
Figure.5: Data Distribution of Amount and Time
Figure.6: Correlation Matrix
Figure.7: Negative Correlation of Variables
Figure.8: Data Distribution of Variables
Figure.9: Result of Logistic Regression
Figure.10: Result of SVC
Figure.11: Result of Random Forest
Figure.12: Result of Decision Tree
Figure.13: Result of KNN
Figure.14: Comparison of Models with Proposed Model
Figure.15: Result of GaussianNB
Chapter 1: Introduction

1.1 Research Background

Credit card fraud can be considered a particular kind of financial fraud in which an illegal transaction is generated with the help of a stolen or fake credit card. It is a widespread problem that has been affecting the entire financial industry for many years. Fraudulent transactions can lead to major financial losses for both businesses and customers, eroding trust in the whole financial system. Over the past few years, there has been a major increase in the number of credit card transactions generated, and as an outcome, the number of deceptive transactions has also risen significantly. This has led to a growing interest in credit card fraud detection, which focuses on developing effective ways of identifying and preventing fraudulent and threatening transactions (Ileberi, Sun and Wang, 2022). With the development of different kinds of machine learning algorithms, researchers have started to explore the effectiveness of machine learning algorithms in detecting credit card fraud. Generally, machine learning algorithms are trained on historical data in order to recognize patterns and generate predictions about new transactions. The most commonly utilized machine learning algorithms for credit card fraud detection are neural networks, decision trees, and support vector machines.

1.2 Research Rationale

Detecting credit card fraud by utilizing machine learning algorithms can be considered critical in ensuring the financial security of both businesses and individuals. Fraudulent credit card transactions can directly lead to major financial losses, which can have long-term effects on the credit card holders as well as the credit card issuer. In addition, fraud can also damage the reputation and business value of financial institutions and erode the confidence of customers in electronic payment systems (Bin Sulaiman, Schetinin and Sant, 2022). On the other hand, the development of an effective system of credit card fraud detection has major economic implications. By stopping fake transactions, businesses and financial institutions can reduce their losses and boost profits directly. In addition, eliminating or decreasing the frequency of credit card fraud can play a vital role in
improving the confidence of customers in electronic payment systems, resulting in wider acceptance of cashless transactions.

1.3 Research Problem

Credit card fraud can be considered a major issue in the whole financial industry, leading to considerable financial losses for both customers and merchants. One of the most significant challenges associated with credit card fraud detection is the constant development of innovative and advanced fraudulent techniques by criminals. Fraudsters across the world generally use sophisticated techniques to cover their activities, which makes it more difficult for traditional fraud detection techniques to keep up (Alharbi et al., 2022). Another major challenge in credit card fraud detection is the high rate of false positives generated by fraud detection algorithms. False positives occur when a legitimate transaction is flagged as fraudulent, leading to frustration and inconvenience for consumers. False positives can also lead directly to lost revenue for merchants, as consumers may abandon their purchases if their transactions are declined. On the other hand, detecting credit card fraud using machine learning algorithms can also raise issues of its own. Machine learning algorithms require a vast amount of data in order to detect credit card fraud accurately and reliably (Bin Sulaiman, Schetinin and Sant, 2022). It can be very complicated and difficult for organizations to deliver adequate, high-quality data to the machine learning model, which can directly lead to inaccurate and misleading outcomes.

1.4 Research Aim and Objectives

The aim of this research is to contribute to the understanding of credit card fraud and to provide insights into how credit card issuers can continue to improve their security measures to prevent fraudulent use. Based on this aim, the research paper has also formed several research objectives, which will help the study develop an effective theory of the research topic.

To identify the most effective machine learning algorithms for fraud detection.
To determine the most predictive features of a credit card transaction.
To develop ethical and effective fraud detection strategies.
To explore the limitations of machine learning algorithms in credit card fraud detection.

1.5 Research Question

The study has also addressed some research questions, which will help it address all of its research objectives more efficiently and successfully. The research questions are:

How can the research identify the most effective machine learning algorithms for fraud detection?
How can the most predictive features of a credit card transaction be determined?
How can ethical and effective fraud detection strategies be developed?
How can the research explore the limitations of machine learning algorithms in credit card fraud detection?

1.6 Research Scope

The purpose of this research is to build a successful credit card fraud detection system that is capable of identifying and preventing fraudulent transactions. The study will play a major role in identifying accurate machine learning algorithms for detecting credit card fraud. The research will also focus on determining the most predictive features of a credit card transaction (Saheed, Baba and Raji, 2022). The study will also help in identifying the limitations of credit card fraud detection using machine learning algorithms, which can help financial institutions build effective strategies.
Chapter 2: Literature Review

2.1 Overview

According to Ileberi, Sun and Wang (2022), credit card fraud can be considered an important concern for both financial institutions and consumers. Fraudulent activities can directly lead to significant legal disputes, financial losses, and reputational damage. Credit card companies employ a variety of methods to detect and prevent fraud. These methods include machine learning algorithms, real-time monitoring of card transactions for suspicious patterns, recognizing irregular spending behaviors of customers, and setting limits on card usage to prevent unusual purchases. Another significant and common technique is to use data analytics to identify potential fraud patterns. On the other hand, Abdulghani, Uçan and Alheeti (2021) stated that credit card organizations also provide guidance and education to their customers in order to help them guard their financial and personal information. By employing all of these strategies, credit card companies can better defend their consumers from fraud while reducing the chances of experiencing reputational damage and financial losses. This section of the literature review discusses different factors associated with the topic "credit card fraud detection using machine learning" through multiple existing theories.

2.2 Theoretical Framework

2.2.1 Effective Machine Learning Algorithms for Fraud Detection

Fraud detection can be considered one of the most critical applications of machine learning. Based on the theory by Naveen and Diwan (2020), machine learning algorithms are increasingly being utilized for detecting fraud because they are capable of analyzing huge amounts of data accurately and quickly, making them suitable for identifying any kind of fraudulent behavior. With the help of machine learning algorithms, it is possible to identify patterns in the data that may be indicative of fraud, such as anomalies in user behavior and unusual transactions. Machine learning algorithms can also be utilized to build predictive models that estimate the probability of fraud occurring based on historical data. Based on the viewpoint of Dang et al. (2021), real-time monitoring is another major area that is significantly
focused on by machine learning algorithms in detecting fraud. Machine learning algorithms can be utilized to monitor transactions in real time and flag any doubtful activity as it takes place. They can also be trained to decrease the number of false positives, which helps minimize the number of legitimate transactions that are incorrectly flagged as deceitful. Machine learning algorithms continuously learn from new data, which allows them to improve and adapt over time. Many effective algorithms are widely utilized in detecting fraud, including Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, and Neural Networks.

Decision Trees

Saheed, Baba and Raji (2022) stated that decision trees can be considered one of the most popular machine learning algorithms for fraud detection, as they are very easy to interpret and understand. Decision trees generally work by splitting the data into smaller subsets according to a variety of features and generating a prediction from the final subset. Every subset corresponds to a node in the tree, and the final subsets are known as the leaves of the tree. The decision tree algorithm works by selecting the predictor variable that best splits the data at every node. The splitting criterion can be based on a range of measures, such as entropy, information gain, or the Gini index. Based on the thesis by Ileberi, Sun and Wang (2022), the main aim is to generate partitions that are as pure as possible, indicating that they hold mostly one class of the response variable. After the tree is constructed, it can be utilized to create predictions on new data by following the path from the root to the appropriate leaf node. The forecast at the leaf node is the majority class of the training data that falls inside that partition. The decision tree is popular for having quite a few beneficial aspects in fraud detection. According to Abdulghani, Uçan and Alheeti (2021), one of the major advantages is that it is easy to visualize and interpret, making it easy to appreciate the decision-making procedure of the model. This can also be very useful for explaining the model to non-technical stakeholders or for recognizing particular features that are significant for detecting fraud. Another major advantage of a decision tree is that
it can handle both continuous and categorical predictor variables and can automatically handle missing data. This makes it a powerful and flexible algorithm for fraud detection. Ahmad et al. (2023) noted that using decision trees in fraud detection also has some major limitations, such as overfitting and an inability to handle change. Decision trees can be utilized in credit card fraud detection to identify doubtful transactions based on features such as location, transaction amount, and time of day. By using decision trees in the credit card fraud detection process, financial institutions can accurately and quickly identify different kinds of fraudulent transactions and take suitable action to prevent losses.

Random Forest

According to Zioviris, Kolomvatsos and Stamoulis (2022), random forests are a specific kind of ensemble learning method that can be utilized in fraud detection. Random forests are created by combining numerous decision trees, each trained on a random subset of the predictor variables and a random subset of the data. The final prediction is typically generated by aggregating the forecasts of all the trees. Using random forests in fraud detection has several beneficial aspects. Random forests are capable of handling both continuous and categorical predictor variables and can automatically handle missing data, which makes them powerful and flexible algorithms for detecting fraudulent activities. Based on the viewpoint of Ahmad et al. (2023), random forests can be very effective in reducing the chances of overfitting, which can be a concerning problem with decision trees. By constructing numerous trees and aggregating their forecasts, random forests reduce the variance of the model and improve its generalization performance.
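As a brief illustration of the two tree-based approaches described above, the following is a minimal sketch using scikit-learn on synthetic data. The feature names (amount, hour, merchant_distance_km), the rule used to label the synthetic "fraud" cases, and all parameter choices are illustrative assumptions, not the study's configuration.

# Minimal sketch: fit a decision tree and a random forest on synthetic
# transaction-like data (hypothetical features and labels).
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

rng = np.random.default_rng(42)
n = 5000
X = pd.DataFrame({
    "amount": rng.exponential(80, n),               # transaction amount
    "hour": rng.integers(0, 24, n),                 # time of day
    "merchant_distance_km": rng.exponential(10, n), # distance from home
})
# Synthetic rare "fraud" label driven by large, late-night transactions.
y = ((X["amount"] > 250) & (X["hour"] < 6)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# A single tree splits on whichever feature best reduces Gini impurity at each node.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# A random forest aggregates many trees, each trained on a bootstrap sample with
# random feature subsets, which lowers the variance of a single tree.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

print("tree fraud recall:  ", recall_score(y_test, tree.predict(X_test)))
print("forest fraud recall:", recall_score(y_test, forest.predict(X_test)))

On real transaction data the same trade-off generally holds: the forest tends to generalize better than any single tree, at the cost of extra computation.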
Figure.1: Credit Card Fraud Detection Using Random Forest
(Source: Shukur and Kurnaz, 2019)

One significant limitation of using random forests in fraud detection is that they can be comparatively more expensive than decision trees, particularly when the size of the data and the number of trees are large. Dang et al. (2021) identified that, in credit card fraud detection, random forests can be utilized to identify suspicious transactions according to features such as the location, transaction amount, and time of day. The random forest can also incorporate extra features such as the credit score or the transaction history of the customer.

Logistic Regression

According to the thesis of Ileberi, Sun and Wang (2022), logistic regression can be considered one of the most significant and widely used machine learning algorithms in fraud detection. Logistic regression is a statistical method utilized for binary classification problems, where the response variable is categorical with only two different outcomes, typically represented as 1 or 0. The logistic regression model aims to estimate the likelihood of an event happening from one or more predictor variables. The model uses the logistic function to transform the linear predictor into a probability. The logistic function is an S-shaped curve that ranges between 0 and 1 and is described as follows:

p = 1 / (1 + e^(-z))
where z is the linear predictor, p is the probability of the positive outcome, and e is the mathematical constant approximately equal to 2.718. Saheed, Baba and Raji (2022) stated that the logistic regression model estimates the values of the regression coefficients that maximize the likelihood of the observed data. This is generally performed using maximum likelihood estimation, which finds the values of the regression coefficients that maximize the likelihood function. Shukur and Kurnaz (2019) said that logistic regression can be utilized for an extensive range of binary classification problems, such as disease diagnosis, spam filtering, and fraud detection. One of the beneficial aspects of logistic regression is that it is an interpretable model that can be easily understood by non-experts. Moreover, logistic regression is also capable of handling both continuous and categorical predictor variables.

Support Vector Machine

According to the viewpoint of Zioviris, Kolomvatsos and Stamoulis (2022), the support vector machine (SVM) can be considered a powerful algorithm that can be utilized for both multi-class and binary classification. The support vector machine finds the best boundary between two classes, and it is very helpful for fraud detection as it is capable of handling complex data. SVMs work by finding the best hyperplane that separates the data into classes. In the case of fraud detection, SVMs can be trained on historical data in order to learn the patterns of deceitful behavior and then utilized to classify new transactions as either legitimate or fraudulent. Esenogho et al. (2022) stated that SVMs are capable of handling high-dimensional data, which is helpful in fraud detection, where there are often a lot of different features to be considered. SVMs can handle both non-linear and linear decision boundaries, making them versatile and able to capture complicated patterns in the data. Techniques such as kernel functions and regularization are utilized by SVMs in order to prevent overfitting and improve the generalization performance of the model.
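For concreteness, the following is a small self-contained sketch of the logistic function quoted earlier, showing that a fitted scikit-learn LogisticRegression produces its probabilities by applying exactly this function to its linear predictor; an RBF-kernel SVM is fitted on the same synthetic data for comparison. The data and all settings are invented for illustration and are not results from this study.

# The logistic function p = 1 / (1 + e^(-z)) and a LogisticRegression fit on
# synthetic, purely illustrative data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def logistic(z):
    # Map a linear predictor z onto a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(logistic(0.0))   # 0.5  -> a zero linear predictor gives even odds
print(logistic(2.0))   # ~0.88
print(logistic(-2.0))  # ~0.12

# Two illustrative features; the learned intercept and coefficients define the
# linear predictor z = b0 + b1*x1 + b2*x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1.5).astype(int)

clf = LogisticRegression().fit(X, y)
z = clf.intercept_ + X[:5] @ clf.coef_.ravel()   # linear predictor for 5 rows
print(logistic(z))                               # matches the model's own probabilities
print(clf.predict_proba(X[:5])[:, 1])

# For comparison, an RBF-kernel support vector machine on the same data; the
# kernel lets it learn a non-linear decision boundary.
svm = SVC(kernel="rbf").fit(X, y)
print(svm.score(X, y))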
2.2.2 Most Predictive Features of a Credit Card Transaction

Based on the statement of Al-Shabi (2019), credit card transactions generate a large amount of data that can be utilized to identify different kinds of fraudulent activity. The capability of analyzing and predicting the characteristics of credit card transactions can play a major role in helping financial institutions detect fraud and make better choices in terms of risk management. Some of the most predictive features of credit card transactions are mentioned in the following. The transaction amount can be considered one of the most critical features associated with credit card transactions. It represents the amount of money spent by a cardholder, and it can deliver insights into their spending habits. In addition, the transaction amount can be used to identify outliers or suspicious transactions, which may indicate fraud. On the other hand, Rtayli and Enneya (2020) stated that the merchant category code (MCC) can be considered a four-digit code assigned to merchants that categorizes the kind of business they generally operate. The date and time of the transaction is another significant characteristic of credit card transactions that helps in detecting credit card fraud. The date and time of a transaction can deliver insights into the spending habits of the cardholder, such as whether they make purchases during particular times of the week or day, and it can help financial institutions address fraud by flagging unusual or suspicious transaction dates or times. Based on the thesis of Bin Sulaiman, Schetinin and Sant (2022), the location of the cardholder can deliver insights into their spending habits, such as whether the user makes purchases abroad or in their home country. It can also help financial institutions detect fraud by identifying suspicious locations, such as locations that are known for high fraud levels. The location of the merchant is another major characteristic of credit card transactions that can be used by banks and other financial services in detecting credit card fraud. The location of the merchant can also be very effective in
delivering insights into how consumers use their cards, such as whether they make purchases at national or local retailers. It can help financial institutions detect fraud by recognizing suspicious or unusual merchant locations. Gupta, Lohani and Manchanda (2021) stated that the type of credit card used in the transaction provides insights into the cardholder's spending habits and creditworthiness. For instance, the use of a platinum card may indicate that the cardholder generally has a high credit score and spends more money than someone with a regular card. The type of credit card can also help financial institutions detect fraud by identifying suspicious or unusual card types. On the other hand, the currency of the transaction can help the financial institution identify whether the purchase has been made in a foreign currency or the home currency. Based on the thesis by Saheed, Baba and Raji (2022), the status of the transaction is another major characteristic of credit card transactions, which can deliver insights into creditworthiness and account balance. Transaction frequency also helps financial firms avoid fraudulent credit card activities.
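To make the feature discussion above concrete, here is a small sketch of deriving a few such features (time of day, a foreign-purchase flag, transaction frequency, and an amount outlier score) from a raw transaction log with pandas. The column names and values are invented for illustration; the study's own Kaggle dataset instead exposes anonymized PCA components V1 to V28 plus Time and Amount, so this kind of feature engineering is only possible where raw transaction details are available.

# Hypothetical raw transaction log; every column name here is an assumption.
import pandas as pd

tx = pd.DataFrame({
    "card_id": ["c1", "c1", "c2", "c1", "c2"],
    "timestamp": pd.to_datetime([
        "2023-01-01 09:15", "2023-01-01 23:40", "2023-01-02 03:30",
        "2023-01-02 10:05", "2023-01-02 10:20"]),
    "amount": [25.0, 900.0, 15.0, 40.0, 60.0],
    "merchant_country": ["US", "RU", "US", "US", "US"],
    "home_country": ["US", "US", "US", "US", "US"],
})

tx = tx.sort_values("timestamp").set_index("timestamp")
tx["hour"] = tx.index.hour                                        # time-of-day feature
tx["is_foreign"] = tx["merchant_country"] != tx["home_country"]   # location feature

# Transaction frequency: how many transactions this card made in the last 24 hours.
tx["tx_last_24h"] = tx.groupby("card_id")["amount"].transform(
    lambda s: s.rolling("24h").count())

# Amount outlier score relative to the card's own spending history.
tx["amount_zscore"] = tx.groupby("card_id")["amount"].transform(
    lambda s: (s - s.mean()) / (s.std() + 1e-9))

print(tx.reset_index())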
2.2.3 Limitations of Machine Learning Algorithms in Credit Card Fraud Detection

According to the statement of Alharbi et al. (2022), credit card fraud can be considered one of the significant problems in the whole financial industry, leading to billions of dollars in losses every year. Traditional methodologies of detecting fraud rely on rule-based systems, which have limited effectiveness in detecting sophisticated and new kinds of fraud. Different kinds of machine learning algorithms have shown promise in detecting fraudulent transactions with high accuracy. However, there are still quite a few limitations to their use in credit card fraud detection, which are addressed in the following.

Data Quality and Availability

According to the thesis by Bin Sulaiman, Schetinin and Sant (2022), data availability and quality can be considered one of the most significant limitations associated with using machine learning algorithms in fraud detection. The efficiency and effectiveness of machine learning algorithms depend significantly on the quantity and quality of the available data. In the case of credit card fraud detection, data is frequently imbalanced and scarce, with deceptive transactions making up a small percentage of the entire data. This imbalance can directly result in bias in the machine learning algorithms, where they become better at detecting non-fraudulent or normal transactions than deceitful ones. In addition, fraudsters are continuously evolving their tools and techniques, which means that algorithms trained on historical data may not succeed in detecting new kinds of fraud. Thus, the data utilized to train the algorithms must be continuously updated and validated to make sure that they stay effective and successful.

Interpretability

Based on the viewpoint of Esenogho et al. (2022), interpretability can be considered another major and concerning issue associated with using machine learning algorithms in fraud detection. Many of the state-of-the-art algorithms, including deep learning models and neural networks, are black boxes, which makes it hard to understand how they arrived at their decisions. This lack of transparency can make it difficult for fraud analysts to explain why a specific transaction was marked as deceptive. It can also make it difficult to recognize and correct any biases or errors in the algorithm.

Time and Cost

According to Alharbi et al. (2022), implementing machine learning algorithms for credit card fraud detection can be time-consuming and costly. Training and building the machine learning models requires adequate resources and expertise, and the data utilized needs to be constantly validated, updated, and cleaned. On the other hand, the algorithms need to be continuously updated and monitored in order to make sure that they remain successful and efficient against evolving and new kinds of fraud. The time and cost involved in updating and maintaining the machine learning algorithms can be prohibitively high for some businesses.

Overfitting
Overfitting can be considered one of the most concerning issues associated with machine learning algorithms in credit card fraud detection. Based on the viewpoint of Arya and Sastry (2020), overfitting generally takes place when the machine learning algorithm becomes very complicated, fitting very closely to the training data. This can lead to the algorithm being excessively sensitive to noise in the data, which can result in false positives. The issue of overfitting can be difficult to notice, as the algorithm may appear to be working well on the training data; when tested on new data, however, the algorithm may perform poorly, producing inaccurate outcomes.

Imbalanced Data

According to the statement of Bin Sulaiman, Schetinin and Sant (2022), credit card fraud is a comparatively rare event, with only a small percentage of transactions being fraudulent. This creates an imbalanced dataset, which may directly result in imprecise outcomes. Machine learning algorithms may be biased towards the majority class, leading to a high rate of false negatives. This means that the machine learning algorithm may fail to detect fraudulent transactions, causing significant financial losses for the card issuer and cardholder.

Concept Drift

Based on the viewpoint of Jain, Agrawal and Kumar (2020), concept drift takes place when the underlying distribution of the data changes over time. In the case of detecting credit card fraud, this can happen when fraudsters change their tactics or when new types of fraud emerge. If the selected machine learning algorithm is not retrained on new data, it may not be capable of detecting such changes, resulting in false negatives. On the other hand, concept drift can also be very challenging to notice, as it may take place slowly over time.
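As a hedged sketch of one common response to the concept drift problem described above, the snippet below monitors the fraud-class recall of a deployed model on each new batch of labelled transactions and refits on a recent window when recall degrades. The synthetic batch generator, the recall threshold, and the window size are illustrative assumptions, not the study's procedure.

# Monitor recall on incoming labelled batches and retrain when drift is suspected.
from collections import deque
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

def make_batch(rng, n=2000, shift=0.0):
    # Synthetic labelled batch; `shift` gradually moves the fraud pattern to imitate drift.
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] + shift * X[:, 1] > 2.0).astype(int)
    return X, y

rng = np.random.default_rng(0)
window = deque(maxlen=5)            # keep only the most recent labelled batches
X0, y0 = make_batch(rng)
window.append((X0, y0))
model = LogisticRegression().fit(X0, y0)

for step in range(1, 8):
    X, y = make_batch(rng, shift=0.4 * step)   # the fraud pattern drifts over time
    recall = recall_score(y, model.predict(X), zero_division=0)
    window.append((X, y))
    if recall < 0.8:                # drift suspected: refit on recent data only
        Xw = np.vstack([b[0] for b in window])
        yw = np.concatenate([b[1] for b in window])
        model = LogisticRegression().fit(Xw, yw)
        print(f"step {step}: recall {recall:.2f} -> retrained on recent window")
    else:
        print(f"step {step}: recall {recall:.2f}")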
Adversarial Attacks

Based on the thesis by Shirgave et al. (2019), fraudsters may try to bypass machine learning algorithms by manipulating the input data or generating fraudulent transactions that are specifically structured to avoid detection; this is known as an adversarial attack. Adversarial attacks can be very hard to detect, as fraudsters may utilize a range of techniques to avoid detection. In addition, these attacks can be used to manipulate the algorithm, resulting in false negatives or false positives.

2.2.4 Ethical and Effective Fraud Detection Strategies

Utilizing machine learning for fraud detection comes with some specific ethical considerations. For instance, false positives can directly result in innocent customers having their transactions rejected, while false negatives can lead to fraud going unnoticed and causing major financial loss. Therefore, it is essential to find the correct balance between ethical considerations and accuracy. Some ethical and effective credit card fraud detection strategies using machine learning algorithms are addressed in the following. It is very important to select the ideal machine learning algorithms in order to accomplish a desired and successful outcome. The machine learning algorithms need to be selected based on the circumstances, data type, and objectives. Based on the thesis by Shirgave et al. (2019), analyzing a range of data sources, such as location data, transaction history, and behavioral patterns, can play a major role in detecting fraudulent activities. By combining different kinds of data sources, machine learning algorithms are more capable of precisely identifying fraudulent behavior. On the other hand, Hussein et al. (2021) stated that real-time monitoring of credit card transactions can rapidly recognize doubtful activity and stop fraudulent charges from being accepted. Real-time monitoring can also allow banks and credit card organizations to take action rapidly against suspected fraud. Moreover, ethical guidelines need to be developed in order to make sure that fraud detection strategies are utilized ethically and in compliance with appropriate regulations and laws. Guidelines need to be developed with input from ethical and legal experts and need to consider data protection and customer privacy. According to the statement of Jain, Agrawal and Kumar (2020), fraud detection strategies should typically focus on reducing or minimizing false positives and making sure that legitimate transactions are not rejected. Providing transparency is the other major strategy in detecting credit card fraud. Customers need to be given transparent and clear information about the fraud detection strategies in use and how these strategies are utilized in order to protect their financial interests.
The customers should also be kept informed about how their personal data is secured and utilized. Delivering fraud education can be considered one of the most effective strategies for banks and financial organizations in preventing credit card fraud. Based on the article by Zioviris, Kolomvatsos and Stamoulis (2022), educating customers about fraud detection and prevention can help them protect themselves from different kinds of fraud and identify deceitful activity. Educational resources can include online videos, tutorials, and articles on how to guard confidential information and avoid common fraud scams.

2.3 Conceptual Framework

Figure.2: Conceptual Framework of "Credit Card Fraud Detection"
(Source: By Author)
(The figure groups the concepts discussed above: credit card fraud detection using machine learning algorithms — decision trees, support vector machine, random forest, logistic regression; limitations of machine learning algorithms in credit card fraud detection — data quality and availability, interpretability, time and cost, overfitting, imbalanced data, adversarial attacks; and strategies for effective fraud detection — selecting the ideal machine learning algorithm, analyzing several data sources, real-time monitoring, focusing on false positives, and delivering fraud education.)

This conceptual framework showcases different kinds of machine learning algorithms, such as decision trees, SVM, logistic regression, and random forest, which can be utilized by banks and other financial institutions in detecting suspicious
activities in credit card transactions. The conceptual framework also represents the limitations of machine learning based credit card fraud detection, which can help firms identify the ideal strategies to overcome those limitations.

2.4 Literature Gap

This section on the literature gap aims to identify the areas associated with the research topic which have not been discussed in this research paper. In this paper, only the utilization of machine learning algorithms for detecting credit card fraud has been discussed. Deep learning algorithms can also be very effective in helping banks and other financial institutions detect credit card fraud, but they have not been discussed in this paper. The study also has not discussed or analyzed the step-by-step process followed by financial services in detecting credit card fraud using machine learning algorithms. Discussing the other methods that can be utilized to detect credit card fraud could play a significant role in improving the quality and effectiveness of the study. All of these areas prevent the theory presented here from being more precise on the chosen research topic.

2.6 Analysis of the Problem

Credit card fraud is considered to be a huge problem in today's financial world. It leads to a considerable amount of financial loss for companies as well as users. Financial transactions have shifted towards online platforms, which makes the detection of fraudulent transactions necessary. Machine learning is one of the emerging methods that provides an effective solution for handling fraudulent credit card transactions. Moreover, a machine learning model provides substantial benefits that address several challenges in terms of the security of online transactions. A key advantage of using a machine learning model is its ability to process large amounts of data. Traditional fraud detection systems cannot detect fraud due to the innovative and new techniques of fraudsters that bypass the rules designed to catch credit card fraud. A machine learning model learns from new data and detects new and complex patterns of credit card fraud. Anomaly detection as well as supervised learning can determine normal
spending behavior, which helps to detect suspicious transactions. This method also has some drawbacks, such as false positives. The machine learning model learns from past data, which helps it classify fraudulent transactions based on past behavior; this can also cause legitimate transactions to be declined, which can lead to the dissatisfaction of customers (Shirgave et al., 2019). The most significant challenge associated with credit card fraud is the continuing development of new and advanced fraudulent techniques that are hard to detect. Credit card fraud offenders use multiple techniques based on high-end technology to commit these fraudulent activities. All around the globe, attackers use sophisticated methods to commit such crimes, which are hard to detect using simple technology. Traditional methods of detection therefore fail when facing such innovative fraudulent tactics used by offenders. Another major challenge regarding fraud detection is the high level of false positives often generated by the detection algorithms. False positives occur when a legitimate transaction is flagged as fraudulent, which leads to a loss of customer trust in the detection method. This leads to frustration and distrust among credit card consumers. These false positives may also lead to a loss of revenue for the company or the merchants. Hence, machine learning is an effective way to approach credit card fraud detection, producing fewer errors and yielding higher accuracy compared to traditional methods. Machine learning algorithms utilize huge amounts of data and estimate fraud occurrences in an effective manner. Organizations may nevertheless find it hard to obtain accurate results from the machine learning process, which can lead to inaccurate outcomes (Stamoulis, 2022). The machine learning approach to this problem is powerful and provides significant protection against fraudulent credit card transactions. This method adopts new and innovative techniques based on large amounts of real-time data. Some of the major challenges faced by credit card fraud detection methods are problems of data availability, as consumer data or information related to credit card transactions is often private. Unclassified data is another major challenge faced by fraud detection systems, as not all fraudulent transactions or attempts are successfully caught by the system or reported. Scammers and various adversary groups often use adaptive techniques against the machine learning models and algorithms, resulting in disruption of the model and failure of the fraud
detection mechanism. Another challenge often faced by the ML system is the huge amount of data that needs to be processed on a regular basis. The model does not always keep up with the speed and accuracy needed to detect the large number of fake transactions taking place (Uçan and Alheeti, 2021). Data imbalance is another major problem: the machine learning model finds it challenging to detect the relatively few fraudulent transactions within the entire set of transaction data available. The machine learning model must be fast enough to detect suspicious activities or anomalies in transactions quickly, and hence its overall efficiency needs to be increased. The privacy of the customer is often at stake, since the transaction data needs to be available for the model to use in finding fraudulent activities. Hence, protection of user privacy is another major challenge often faced by the ML model during credit card fraud detection. The model must rely on a trustworthy source to cross-check the data obtained for the model training process. The complex algorithms used in the model are tough to decode but are also prone to hacking or attacks by scammers, so the security of the model must be ensured by the developers (Shirgave et al., 2019). The model must be adaptable to new changes so that attackers find it difficult to scam the same model every time.

2.7 Summary

This section of the literature review discussed different kinds of factors associated with the topic of "credit card fraud detection". The study has identified and discussed the machine learning algorithms that can help in detecting different kinds of credit card fraud and reduce the chances of facing huge financial loss. The machine learning algorithms that can help credit card users and credit card issuers prevent credit card fraud are logistic regression, decision trees, support vector machines, and random forests. This study also analyzed the role of all these algorithms in credit card fraud detection, which helps the study identify the appropriate algorithm to use. The most predictive features of a credit card transaction have also been addressed and discussed in this study. These predictive features include the time and date of the transaction, the amount of the transaction, the location of the user, the currency of the transaction, the frequency of transactions, and many more. All of these features play a major role in signaling to the organization that credit card fraud may have taken place. There are different kinds of limitations associated with the process of
detecting credit card fraud using machine learning algorithms, and these limitations can prevent the organization from being fully capable of resisting different kinds of fraudulent credit card activities. This section also discussed different kinds of effective and beneficial strategies that help banks and other financial institutions avoid the chances of facing huge financial losses due to credit card fraud.
Chapter 3: Research Methodology

3.1 Overview

The research methodology section discusses the way in which the research would be carried out so that the aim, objectives, and research questions of the paper can be addressed successfully (Melnikovas, 2018). Through the Saunders Research Onion, the various decisions that are important for developing a research methodology for this paper are described (Orth and Maçada, 2021). In this manner, the scholar can ensure the reliability and validity of the outcomes of the research.

3.2 Research Philosophy

In research, there are basically four types of philosophies that are mainly used: pragmatism, positivism, realism, and interpretivism (Žukauskas, Vveinhardt and Andriukaitienė, 2018). In the following research, the scholar chose to use the interpretivism research philosophy so that the important elements identified through the study could be interpreted successfully. In this manner, the research scholar would be able to integrate human interests into the outcome of the study. The use of this philosophy is thus helpful for focusing on meaning so that the different aspects of the research problem can be reflected (Williamson, 2021).

3.3 Research Approach

In academic research studies, the research approach is regarded as the general plan and procedure for conducting the research study. The three main research approaches used while conducting research are the deductive approach, the inductive approach, and the abductive approach (Snyder, 2019). In the following research, considering the nature of the study, the deductive approach is used. The deductive research approach focuses on testing theory and hypotheses with the help of empirical data (Gupta and Gupta, 2022). The hypothesis of this study concerns the detection of credit card fraud, and this approach helps to test the hypothesis against the collected credit card transactions using multiple algorithms (Tamminen and Poucher, 2020).
3.4 Research Design

The research design acts as a framework of research methods and techniques that helps the research scholar carry out the study successfully. The different types of research design mainly used in a research study are descriptive, experimental, correlational, diagnostic, and explanatory designs (Pandey and Pandey, 2021). In this research, the scholar intended to use the experimental research design so that cause and effect in credit card fraud detection can be established. In this manner, the scholar would be able to identify and analyze the impact of the independent variables on the dependent variable (Patel and Patel, 2019).

3.5 Research Strategy

The research strategy states the overall plan for conducting the research study. By determining the research strategy, the important aspects of the research, such as planning, executing, and monitoring the overall research, can be handled successfully (Kumar, 2018). For the following research, the scholar intended to use a quantitative strategy so that numerical data could be collected from different sources with an emphasis on objective measurement (Snyder, 2019).

3.6 Data Collection Method

In the following research the scholar chose to collect the data from secondary sources. From a secondary source, a dataset would be collected for credit card fraud detection. The dataset would be collected from the Kaggle website (kaggle.com, 2023). The selected dataset has real-world relevance, as credit card fraud is one of the critical concerns for financial institutions. This dataset is relatively imbalanced compared to datasets used in other studies of credit card fraud detection: it has a small number of fraudulent cases compared to the overall data, whereas in other studies the legitimate and fraudulent transactions are more balanced. The selected dataset is therefore challenging for traditional algorithms, and it was selected in order to provide another solution for credit card fraud detection.

3.7 Data Analysis Method

It is important to identify the appropriate tools that can be used for analyzing the credit card fraud-related data for improved further analysis. For the analysis of the data collected from the Kaggle website, various machine learning algorithms would be used. The different machine learning algorithms that had been chosen to be used for the analysis process
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
are logistic regression, support vector classifier, random forest classifier, Gaussian Process Classifier, and the k-Naïve classifier ( Kumar, 2018). Apart from this, the data is analyzed with the help of statistical method such as descriptive statistics, and correlation matrix that helps to determine the relationship as well as the insight information of variables of the credit card fraud detection. 3.8 Ethical Considerations Ethical considerations are assumed to be a set of principles that should be followed by the research scholar while conducting the research so that the study can be guided in the correct direction. It helps in maintaining the validity and integrity of the research paper ( Mohajan, 2018). The various ethical considerations that had been followed in the research are destroying the data after it is being used in the research so that it does not get misused in any manner, avoiding any plagiarism, protecting the rights of the participants, causing no physical, psychological or mental harm to any people, checking the authenticity of the data before using it in the research to avoid any falsification of data ( Mukherjee, 2019). 3.9 Research Limitations It is not possible for the research scholar to include each and every factor associated with the research in a single paper. The research limitations help in identifying the areas of the study that had not been addressed in the study ( Williamson, 2021). The following research does not comprise any qualitative data, due to which a gap in the theoretical aspect of the study is being observed. Apart from that, the paper had also not included the primary data collection method, such as an interview or survey, due to which the study lacks any real-time data. These two are the major limitation of the methodological section. 3.10 Summary The following section successfully discussed all the aspects that are important to carry out research successfully in the correct direction, such as philosophy, approach, design, strategy, data collection methods, etc. By abiding by these step-by-step approaches, the scholar would be able to address the aim and objectives of the paper successfully and thus solve the research question. 21
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Chapter 4: Artifact Design and Implementation 4.1 Design The design of the machine learning model is the step-by-step method. The data visualization is a critical step of designing the machine learning model which helps to provide critical information about the selected dataset. The first step of designing the machine learning model is the understanding of the problem. Before starting the model. It is essential to understand the purpose of the model which helps to select the proper algorithms for this model. For this study, the purpose of the machine learning model is to classify the credit card fraud with the help of different algorithms. In this step, several independent and dependent variables are determined which helps to provide the most important variable for the classification. In order to build the credit card fraud detection model, the understanding of the industry operations needs to be analysed for developing better model ( Guo et al., 2019). 4.1.1 Experimental Setup For the development of the machine learning model, the Google Colab platform is used. This platform helps to write the code in python which is supported by all the web browsers and this platform can be accessed remotely. For this model, Python 3.10 version is used in which the libraries can be added or removed with the help of Google Colab. For this experiment, 6 GB NVIDA VRAM is used as well as 16 GB RAM is used which helps to provide efficient experimental environment. 4.1.2 Selected Dataset It is essential for organization to determine the fraudulent of the transaction of credit card so that the consumers not charged for the items which are not purchased by them. The selected dataset contains the transactions that are made by the customers of credit card of the European region. The dataset shows the transactions of two days which have more than 492 fraudulent of the total transaction amounts of 284,807. The selected dataset is unbalanced which has the positive class of 0.17% of all the transactions. This dataset contains several variables which is the outcome of the PCA transformation. This dataset does not provide the background of the variables. The features of the dataset from V1 to V28 are the PCA and two variables are transformed with the 22
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
help of PCA which are time and amount. The time contains the transaction time and amount is the transaction amount. The feature class of the dataset is the response variable which has the value 1 and 0. The class of the dataset is imbalance and accuracy of this data can be measured with the help of AUC curve and the confusion matrix (kaggle.com, 2023). 4.1.3 Used Algorithms For this machine learning model, logistic regression, support vector classification, random forest classification, decision tree, and KNN. Logistic Regression The logistic regression is the statistical method which is used for the classification and prediction. This method estimates the probability of the occurring of the event based on the independent variable. The outcome of this method is the probability and dependent variable is bound from 0 and 1. In this method, the logit transformation is applied and the probability of success is divided with the help of probability of failure. The logistic function is based on Logit(pi) = 1/ (1+ exp(-pi)) ( Yang and Shami, 2020). Support Vector Machine The support vector classification is the supervised algorithm which is used for regression and classification. The objective of this method is to determine the hyperplane in the N-dimension of the space which helps to classifies the data points. The dimension of the hyperplanes is dependent on the feature numbers. The kernel is the function of SVM which takes the low- dimensional inputs and transforms it to the higher dimensional space. It converts the non- separable problems to the separable problems which is essential for non-linear problems ( Sarker, 2021). The primary advantage of using this method is that it is efficient in terms of memory as it utilizes the subset of the training set in the decision function. 23
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Random Forest The random forest is the supervised machine learning algorithm which is highly used for the classification as well as regression. This method contains several trees and number of trees are considered as the robustness of this model. The accuracy of this method is based on the number of trees which helps to provide problem-solving capabilities. The random forest contains several decision trees based on the several subsets of the dataset which helps to improve predictive accuracy of the selected dataset. Decision Tree The decision tree is the supervised method which is used for the classification problems and it is preferred for solving the classification issues. This method is tree structured classifier and the internal node shows the feature of the dataset. The branches of the tree show the decision rules and leaf node shows the outcome of the tree. The performance of this method is based on the feature of the dataset ( Ray, 2019). Proposed Model Gaussian NB The gaussianNB is the supervised algorithm which is the special type of Naïve Bayes algorithm. This method is used when the feature has the continuous values and the all the features needs to follow the gaussian distribution. This method is based on the Bayes theorem which assumes that all the features are independent to each other. The KNN method is non-parametric method which is a supervised learning classifier. The prediction as well as classification are made based on the grouping of the individual data. The class label of this model is assigned based on the majority of the data ( Ho et al., 2021). 24
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
4.2 Evaluation Metric The performance evaluation of the machine learning model is one of the essential steps to build an effective ML model. In order to analyze the performance as well as the quality of the model different types of metrics are used which are known as performance metrics or evaluation metrics. These metrices helps to understand the performance of the model based on the provided dataset. With the help of these metrics, the performance of the model can be improved by tuning the hypermeters. Each machine learning model has the purpose to generalize the data and the performance metrics determine the generalization of the model on the dataset. The performance of each model is evaluated with the help of various evaluation metrices such as accuracy, F1 score, Recall, and Precision (Saheed, Baba and Raji, 2022). 1. The accuracy provides the accurate classification value with the help of measuring the ratio of predicted value and the total number of instances. This metric is essential while the false positive and the false negative value is similar (Alharbi et al., 2022). 2. The precision value shows the ratio of correct predictions of positive values in terms of the total predicted positive values. Higher precision value shows the low rate of false negative instances (Bin Sulaiman, Schetinin and Sant, 2022). 3. The recall value provides the current predicted positive amount regarding the total number of positive amounts in the dataset. The high recall value provides the most actual positive instances of the dataset (Saheed, Baba and Raji, 2022). 4. The F1-score provides the average value of precision and recall value. It provides the balanced analysis of the performance of the model (Alharbi et al., 2022). The confusion matrix of machine learning is known as the error which is a table that provides the virilization of the performance of a specific algorithm. Each row of this matrix represents the actual class while the column presents the instances in the predicted class. The true positive shows the positive value which is true whereas the true negative value predicts the negative value that is true. The false positive value is the type-1 error value which predicts the positive value which is false. The false negative value is the type-2 error which predicts the negative value which is false. In figure 3, the confusion matrix with the actual values and predicted values (Bin Sulaiman, Schetinin and Sant, 2022). 25
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Figure.3: Confusion Matrix 4.3 Data Splitting The data splitting is another crucial part of machine learning model which improves the overall performance. This method divides the data into two subsets for the learning, and validation of the ML model. The data splitting ensures the creation of the model for the data as well as the process. The training model is used to train the developed models. This aspect is used to estimate different parameters in order to compare the performance of different. The test data is compared to check the final model in terms of correct outcome. For the data splitting, the random sampling is used. This method protects the process the data modelling from bias towards various characteristics of data. The dataset is split into two sets such as training dataset and test dataset. The data models use the data splitting for training the model based on dataset. The training data is added to the model in order to update the training parameters. After the training phase, the data from test set is measure in the comparison of handling new observation of the model. The training dataset has the 80% of the total number of dataset whereas the training dataset has the 20% of the total data. For the splitting the random state 42 is used that helps to make sure the reproducibility with the help of the solving the random state seed at the level 42 (Naveen and Diwan, 2020). 26
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
4.4 Implementation 4.4.1 Data Importing Dataset import is one of the critical aspects of the machine learning model. For import the dataset panda library is used. The head () function helps to provide the first few data of the dataset which is loaded to the data frame. Figure.4: Importing Dataset 4.4.2 Data Visualization The data visualization is the critical part of the machine learning model which helps to represent the variables with the help of graphical presentation such as charts, graphs, heatmaps, etc. This process helps to determine the insight information of the dataset which is essential for the feature selection for the model. Apart from this, this method helps to provide the dependency between the variables which helps to determine the target variable of the model ( Sarker, 2021). Figure.5: Data Distribution of Amount and Time The above graph shows the data distribution of time and amount of the credit card transactions. The amount of transaction has high density in 0-5000 and the amount of the density is more than 0.0015. On the other hand, time has the highest density in 2000-8000 and 12000 to 16000. The highest density amount is 1. Based on the above graph, these variables have some outliers. 27
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Figure.6: Correlation Matrix The above figure shows the correlation between the variables. Amount has the negative relationship with V1 with the value of -0.228 and this variable has the negative connection with V2, V3 and V5 with the values of -0.53, -0.21 and -0.38. The V4 has the positive relationship with amount with the value of 0.99. The V6 has the positive connection with amount with the relationship amount of 0.21. Based on the above, all the variables are related to each other in both positive and negative manner. 28
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Figure.7: Negative Correlation of variables The above graph shows the negative correlation of variables. The V17 has the negative correlation of -0.5 and V14 has the negative correlation of higher than -0.5. The V12 has the negative correlation of -0.5 and V10 has the negative value of more than -0.5. Thus, all the variables are related to each other by negative manner. Figure.8: Data Distribution of Variables From the above graph, the V14 has the highest distribution from -20 to -5 and the highest value of the data is more than 0.10. In the case of V12, the distribution is from -20 to 0 and the highest value is more than 0.08. The V10 has the distribution value of -25 to 5 and the highest value is 0.12. Thus, all the variables have the normal distribution. 29
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
4.2.3 Handling Imbalanced Dataset The selected dataset is imbalanced because it has 0.17% fraud values and 99.83% non-fraud values. With the imbalanced of the dataset, the result of the model can become biased. In order to solve this issue, StratifiedShuffleSplit is used. This function helps to manage the imbalanced dataset with the help of class distribution in the original dataset which is maintained in trained and test split. The shuffling is the part of this function which helps to split the data with the help of randomness before the splitting and the dataset is not influenced by inherent. This function helps to remove the bias which is introduced by the under sampling and the train as well as test set shows the original distribution of the class (Uçan and Alheeti, 2021). 4.5 Result Name Precisio n Recal l Accurac y F1 Logistic Regression 0.956 0.936 0.947 0.94 6 SVC 0.944 0.914 0.931 0.92 9 Random Forest 0.92 0.95 0.93 0.93 Decision Tree 0.94 0.91 0.93 0.92 Table.1: Model Validation Based on the above table, the logistic regression, and random forest have the significant amount of recall, precision and F1 score. Based on the requirement and accuracy of the model, logistic regression is the best algorithm for this classification. 30
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Logistic Regression The testing of the model is done with the help of accuracy and confusion matrix which helps to provide the overall performance of the model. In the figure 9, the performance of the logistic regression is shown. The accuracy of the model is 95% and the precision ratio of the model has the value of 0.956 which helps to correctly predict the positive values by more than 95%. The recall value identifies the actual positive values by 93.5%. From the confusion matrix, the true positive value is 92 and the false positive value is 4. The false negative value is 6 and true positive value is 87. Figure.9: Result of Logistic Regression 31
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Support Vector Classifier In the figure 10, the overall performance of the support vector machine is provided. The accuracy of this model is 0.97 which indicates that this model can successfully classifies 97% of the total prediction. The recall value of the SVC is 0.97 for the class of 0 whereas the value of this aspect for class 1 is 91. On the other hand, the F1 score for the SVM is 0.94 for class 0 and 0.94 for class 1. Thus, this model can accurately predict the positive instances by 91% and negative instances for 97%. Based on the confusion matrix, this model has 93 true negative values and 3 false positive values. The false negative value is 8 and true positive value is 85. Figure.10: Result of SVC 32
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Random Forest Classifier From the figure 11, the accuracy of the algorithm is 0.9365 which shows that this model can classify more than 93% of the total data correctly. In the case of precision value, the random forest has the value of 0.93 for class 0 and 0.95 for class 1. Moreover, this model has the recall amount of 0.95 for majority class and 0.92 for minority class. On the other hand, the F1- score of this model has the value of 0.94 for class 0 and 0.93 for class 1. From the confusion matrix, this algorithm has 93 true negative and 3 false positive. It has 8 false negative value and 85 true positive values. Figure.11: Result of Random Forest 33
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Decision Tree Based on the figure 11, the accuracy of the above model is 0.9312 which indicates that this model can classify the total data by more than 93%. The precision value of this model is 0.92 for class 0 and 0.94 for class 1. On the other hand, the recall value of this model is 0.95 for class 0 and 0.91 for class 1. Based on the f1 score, the value of class 0 is 0.93 and value of class 1 is 0.93. From the confusion matrix, the true negative is 91 with false positive of 5. It has 7 false negative and 86 true positive. Figure.12: Result of Decision Tree 34
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
KNN Based on the figure 13, The accuracy of the model is 94.17 which is significantly higher. Thus, this model can accurately predict the value by more than 94%. From the precision score, the 0 class has the value of 0.92 and class 1 has the value of 0.94. The recall value of this model is 0.95 for class 0 and 0.91 for the class 1. From the F1 score, the class 0 has the value of 0.93 and class 1 has the value of 0.93. From the confusion matrix, false positive value is 3 with the true negative value of 93. The true positive value of the model is 85 with the false negative value of 8. Figure.13: Result of KNN 35
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
36
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
4.6 Proposed Algorithm The above graph shows the performance of all algorithms along with the propose algorithm Gaussian NB. From the above, the recall value is significantly high for this algorithm which helps to provide accurate positive instances compared to other models used in this study. From the figure 14, the gaussian NB has the recall value of 0.96 which is the highest value compare to other models used in this study. Moreover, the logistic regression has the highest accuracy followed by gaussian NB. The LR has the highest precision value. The decision tree as well as the random forest have the lowest accuracy compared other models used in this project. Logist ic Regr ession SVC Random Fore st Decisi on Tree Pro posed Method (Gaussian NB) 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 Precision Recall Accuracy F1 Figure.14: Comparison of Models with Proposed Model 37
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Gaussian NB From the figure 14, the accuracy of the model is 94.17 which is significantly high. The precision value of the model is 0.99 for class 0 and 0.99 for class 1. The recall value of this model is 0.99 for class 0 and 0.89 for class 1. In the case of F1 score, the value of class 0 is 0.95 and value of class 1 is 0.94. Thus, this model can accurately predict the positive value by 90% and negative instances by 99%. Thus, the performance of this model is significantly high compared to another model used in this project. From the confusion matrix, the true negative value is 93 with false positive value of 3. The true positive value is 85 with false negative value of 8. Figure.15: Result of GaussianNB 38
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Chapter 5: Critical Evaluation From the section of the literature review, it can be stated that different kinds of machine learning algorithms can perform a major role in the process of helping financial institutions in the process of sensing different kinds of credit card fraudulent activities. It can play a major role in helping the nacks and other financial services to maintain their financial stability and reputational image. Different machine learning algorithms that have a major role in the process of performing fraud detection are decision trees, random forests, and support vector machines (Esenogho et al., 2022). Support Vector Machine specifically focuses on predicting credit card fraud by the process of generating a decision boundary that helps in separating non-fraudulent and fraudulent transactions according to the features. It also plays a major role in classifying new transactions by the process of measuring which side of the border they fall on, performing a significant role in identifying potential fraud cases with the highest accuracy. Decision tree plays a vital role in predicting credit card fraud detection by the process of structuring a hierarchical structure of decision nodes according to the transaction features. It utilizes a sequence of if-else conditions in order to steer through the tree and stamp a non-fraud or fraud label to every transaction. The way followed by a transaction directly results in the last prediction, generating the decision trees effectual for the process of credit card fraud detection (Zioviris, Kolomvatsos, and Stamoulis, 2022). On the other side, Random Forest aims to credit card fraud detection by the process of combining several decision trees. It helps in generating a collection of trees, each instructing on a random subset of the data with the substitute. All trees separately predict non-fraud or fraud for the transactions, and the final forecast is typically determined through majority voting or averaging across all the trees. Based on the analysis section it can be stated that among different machine learning algorithms, the logistic regression can be considered the most effective algorithm in the process of detecting a variety of credit card frauds. The accuracy level of Logistic Regression in predicting credit card fraud can be considered almost 95%. LR is more likely to have better accuracy in the process of detecting different kinds of credit card frauds in comparison with SVM, decision trees and other algorithms as they focus on combining numerous decision trees, helping to eliminate or reduce overfitting and capturing an extensive range of patterns. They can also play a vital role in 39
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
leveraging the diversity of trees in order to mitigate individual tree errors and biases, leading to a more accurate and robust prediction by the process of aggregating the averages or votes across the ensemble. After discussing the literature review section and analysis section, it can be stated that the LRis considered the most effective machine learning algorithm with the highest accuracy level compared to the other machine learning algorithms. Thus, financial institutions can utilize the use of LR in order to detect different kinds of credit card fraud in an efficient way. According to the literature review section it can be stated that there are different kinds of predictive features of credit card transactions, which can play a vital role in helping machine learning algorithms to detect credit card fraud in a more effective and fast way. The total amount of transactions is one of the most significant features related to credit card transactions. These particular features can be utilized to identify suspicious transactions or outliers, which may point to fraud. The date and time of the transaction can be considered another major predictive feature associated with credit card transactions. If the time and date of the transaction are unusual it can directly point to credit card fraud, helping the cardholders and card issues to avoid the chances of experiencing different kinds of credit card fraud (Rtayli and Enneya, 2020). On the other side, the location of the cardholder can play a major role in delivering insights into the habits of their spending, which can be considered very effective and beneficial in helping financial institutions to detect fraudulent activities associated with credit card transactions. The currency of the transaction is also one of the most significant predictive features of a credit card transaction. The currency can help credit card holders and the credit card issuer to identify the area of the transaction as if the transaction is generated within the home country or outside of the country (Bin Sulaiman, Schetinin, and Sant, 2022). The frequency of the transaction is another vital aspect that can be considered effective in the process of sensing the chance of facing fraud in credit card transactions. Based on the analysis section it can be considered that most of the banks and financial institutions focus on detecting different kinds of fraud in the period of credit card transactions two most common features, such as the amount of transaction, and transaction frequency. With the help of measuring transaction frequency, the cardholders and card issues are capable of identifying if there are any kinds of fraudulent activities taking place. Machine learning 40
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
algorithms have gathered insights into the normal frequency of using the credit card, which can play a significant role in the process of identifying any abnormal and unusual transaction in an efficient and successful way. After analyzing the literature review chapter and analysis section it can be considered that banks and other financial institutions can consider the most predictive features of a Credit Card Transaction in the process of predicting credit card fraud. Based on the literature review section it can be stated that there are different limitations associated with the machine learning algorithms in the process of detecting credit card frauds. Some of these major limitations are Data Quality and Availability, Interpretability, Time and cost, Overfitting, Imbalanced Data, Concept Drift, Adversarial Attacks, and more. All these limitations can play a significant role in the process of damaging the ability of machine learning algorithm in credit card fraud detection (Shirgave et al., 2019). It can be considered very important to deliver high quality and adequate amount of information to the chosen machine learning algorithms in order to accomplish the desired outcomes. Sometimes, it can be considered as challenging to deliver high quality data to the machine learning model, which can directly lead to inappropriate and poor results. On the other side, the entire process of utilizing machine learning models in detecting credit card frauds can be considered very costly and time-consuming. Selecting, training, building and maintaining machine learning model also require enough expertise and resources, and gathered information required to be regularly validated, updated and cleaned. All these can be considered very difficult to perform. According to the analysis section it can be considered that it is very important to reduce the limitation of machine learning algorithms in the process of detecting credit card frauds in order to accomplish a more reliable and successful outcome (Hussein et al., 2021). This section also played a major role in describing those different kinds of limitation can perform a vital role in the process of damaging the ability of machine learning model from predicting more accurate prediction about credit card frauds. After discussing the literature review section and discussion chapter it can finally be stated that different kinds of limitations associated with the machine learning algorithms in the process of 41
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
predicting credit card frauds played a major role in encouraging the financial institution to incorporate multiple effective strategies in order to obtain the ideal and expected predictions. According to the literature review section it can finally be stated that there are multiple Ethical and Effective Fraud Detection Strategies that help machine learning to identify credit card fraud in a more efficient way. The major and effective strategies for credit card fraud detection are selecting the ideal machine learning algorithm, analyzing multiple data sources, monitoring in real-time, focusing on false positives, and delivering fraud education to the users and employees within the organizations. Based on the analysis section it can be asserted that the employees and users need to be aware of all the precautions that can help them in the process of avoiding the situation of affecting or being infected by different kinds of credit card fraud. After discussing the literature review and review it can be stated that the idea of incorporating multiple strategies that can help in preventing credit card fraud can help the financial institution to avoid the chances of facing huge financial loss and reputational loss. 42
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Chapter 6: Conclusion and Recommendation 6.1 Conclusion With the growing technology, the fraudulent activities associated with credit card transactions are also increasing day by day. Credit card fraud can directly lead to reputational ad financial damages. Thus, it is very important for credit card users to avoid the chances of acing different kinds of credit card fraud in a successful way. Different kinds of machine learning algorithms can play a significant role in the process of predicting any kinds of fraudulent activities associated with credit card transactions, which can play a major role in helping the financial institution to avoid credit card fraud. This particular study on the topic “Credit Card Fraud Detection” has specifically focused on contributing to the understanding of credit card fraud and providing insights into how credit card issuers can continue to improve their security measures to prevent fraudulent use. The introduction section has played a major role in structuring effective research objectives, which played a significant role in the process of helping the whole study to discuss the research topic in a more efficient and precise way. The study has discussed different kinds of machine learning algorithms that can be utilized in the process of detecting a variety of credit card frauds. The algorithms that have been discussed in this study are random forest, decision tree, support vector machine, and more. It is very important to select the ideal machine learning model according to the situation in order to achieve the expected prediction. All these three machine learning algorithms follow their own way in the process of predicting different kinds of fraud. On the other side, the study has also focused on addressing the most predictive features of credit card transactions, which can play a significant role in the process of forecasting the chances of credit card fraud. A range of limitations of machine learning algorithms in the process of credit card fraud detection has also been discussed in this research paper. These limitations may play a major role in the process of affecting the machine learning algorithms’ ability to detect credit card fraud. Some of these limitations are data availability and quality, interpretability, cost and time, overfitting, imbalanced data, concept drift, and adversarial attack. Credit card users and credit card issuers should successfully implement different kinds of strategies in order to avoid the chances of experiencing different kinds of fraudulent activities related to credit card transactions. The section on the literature gap helped the study to identify 43
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
the significant areas that have not been discussed in this specific research paper. Interpretivism research philosophy is utilized in this paper in order to address the research objectives in a more effective and precise way. The secondary data collection method is considered in this study and the secondary data sources that have been utilized in the process of developing this research paper are newspapers, journals, reports, magazines, books, and websites. The quantitative data that has been gathered through the secondary data collection method is analyzed with the help of a thematic tool. The illegal transactions made using credit cards are considered to be credit card fraud. These fraudulent actions must be dealt by proper monitoring of the transactions and using an effective fraud detection system using machine learning process. Various kinds of machine learning processes and related algorithms have helped companies as well as users to detect fraudulence and safeguard their interests. More extended research needs to be done in this respect to improve the efficiency of the machine learning systems in credit card detection. 6.2 Recommendation One of the most efficient approaches for credit card fraud detection can be considered the idea of implementing machine learning algorithms. By the process of leveraging historical transaction data, such algorithms can be trained about anomalies and patterns associated with deceptive activities. Different kinds of features such as time, location, transaction amount, and user behavior can be utilized in order to train the ML models that can precisely identify possible fraud. Moreover, the idea of incorporating anomaly detection techniques and real-time monitoring can improve the system's capability of identifying and preventing fraudulent transactions, providing a strong defense against credit card fraud. 6.3 Future Scope Besides different kinds of machine learning algorithms, there are different kinds of advanced and innovative technologies that can play a major role in the process of detecting credit card fraud. Future studies on the topic of “credit card fraud detection” can focus on discussing different kinds of innovative technologies rather than the machine learning model in credit card fraud detection. On the other side, the future research paper could also focus on using both primary and secondary data collection methods in order to develop the quality and reliability of the entire research paper. 44
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
References Abdulghani, A.Q., Uçan, O.N. and Alheeti, K.M.A., 2021, December. Credit card fraud detection using XGBoost algorithm. In 2021 14th International Conference on Developments in eSystems Engineering (DeSE) (pp. 487-492). IEEE. Ahmad, H., Kasasbeh, B., Aldabaybah, B. and Rawashdeh, E., 2023. Class balancing framework for credit card fraud detection based on clustering and similarity-based selection (SBS). International Journal of Information Technology , 15 (1), pp.325-333. Alharbi, A., Alshammari, M., Okon, O.D., Alabrah, A., Rauf, H.T., Alyami, H. and Meraj, T., 2022. A novel text2IMG mechanism of credit card fraud detection: a deep learning approach. Electronics , 11 (5), p.756. Al-Shabi, M.A., 2019. Credit card fraud detection using autoencoder model in unbalanced datasets. Journal of Advances in Mathematics and Computer Science , 33 (5), pp.1-16. Arya, M. and Sastry G, H., 2020. DEAL–‘Deep Ensemble ALgorithm’framework for credit card fraud detection in real-time data stream with Google TensorFlow. Smart Science , 8 (2), pp.71-83. Baier, L., Jöhren, F. and Seebacher, S., 2019, June. Challenges in the Deployment and Operation of Machine Learning in Practice. In ECIS (Vol. 1). Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R., Moura, J.M. and Eckersley, P., 2020, January. Explainable machine learning in deployment. In Proceedings of the 2020 conference on fairness, accountability, and transparency (pp. 648-657). Bin Sulaiman, R., Schetinin, V. and Sant, P., 2022. Review of Machine Learning Approach on Credit Card Fraud Detection. Human-Centric Intelligent Systems , 2 (1-2), pp.55-68. Dang, T.K., Tran, T.C., Tuan, L.M. and Tiep, M.V., 2021. Machine Learning Based on Resampling Approaches and Deep Reinforcement Learning for Credit Card Fraud Detection Systems. Applied Sciences , 11 (21), p.10004. Elshawi, R., Maher, M. and Sakr, S., 2019. Automated machine learning: State-of-the-art and open challenges. arXiv preprint arXiv:1906.02287 . 45
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Esenogho, E., Mienye, I.D., Swart, T.G., Aruleba, K. and Obaido, G., 2022. A neural network ensemble with feature engineering for improved credit card fraud detection. IEEE Access , 10 , pp.16400-16407. Gevorkyan, M.N., Demidova, A.V., Demidova, T.S. and Sobolev, A.A., 2019. Review and comparative analysis of machine learning libraries for machine learning. Discrete and Continuous Models and Applied Computational Science, 27(4), pp.305-315. Guo, Q., Chen, S., Xie, X., Ma, L., Hu, Q., Liu, H., Liu, Y., Zhao, J. and Li, X., 2019, November. An empirical study towards characterizing deep learning development and deployment across different frameworks and platforms. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) (pp. 810-822). IEEE. Gupta, A. and Gupta, N., 2022. Research methodology . SBPD Publications. Gupta, A., Lohani, M.C. and Manchanda, M., 2021. Financial fraud detection using naive bayes algorithm in highly imbalance data set. Journal of Discrete Mathematical Sciences and Cryptography , 24 (5), pp.1559-1572. Ho, W.K., Tang, B.S. and Wong, S.W., 2021. Predicting property prices with machine learning algorithms. Journal of Property Research , 38 (1), pp.48-70. Hussein, A.S., Khairy, R.S., Najeeb, S.M.M. and ALRikabi, H.T., 2021. Credit Card Fraud Detection Using Fuzzy Rough Nearest Neighbor and Sequential Minimal Optimization with Logistic Regression. International Journal of Interactive Mobile Technologies , 15 (5). Ileberi, E., Sun, Y. and Wang, Z., 2022. A machine learning based credit card fraud detection using the GA algorithm for feature selection. Journal of Big Data , 9 (1), pp.1-17. Jain, V., Agrawal, M. and Kumar, A., 2020, June. Performance analysis of machine learning algorithms in credit cards fraud detection. In 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions)(ICRITO) (pp. 86-88). IEEE. 46
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
kaggle.com, 2023. Credit Card Fraud Detection. Anonymized credit card transactions labeled as fraudulent or genuine [Online]. Available at: https://www.kaggle.com/datasets/mlg- ulb/creditcardfraud . [Assessed On: 19/5/2023] Kumar, R., 2018. Research methodology: A step-by-step guide for beginners . Sage. Melnikovas, A., 2018. Towards an explicit research methodology: Adapting research onion model for futures studies. Journal of futures Studies , 23 (2), pp.29-44. Min, Q., Lu, Y., Liu, Z., Su, C. and Wang, B., 2019. Machine learning based digital twin framework for production optimization in petrochemical industry. International Journal of Information Management , 49 , pp.502-519. Mohajan, H.K., 2018. Qualitative research methodology in social sciences and related subjects. Journal of economic development, environment and people , 7 (1), pp.23-48. Mukherjee, S.P., 2019. A guide to research methodology: An overview of research problems, tasks and methods. Naveen, P. and Diwan, B., 2020, October. Relative Analysis of ML Algorithm QDA, LR and SVM for Credit Card Fraud Detection Dataset. In 2020 Fourth International Conference on I- SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC) (pp. 976-981). IEEE. Orth, C.D.O. and Maçada, A.C.G., 2021. Corporate fraud and relationships: a systematic literature review in the light of research onion. Journal of Financial Crime , 28 (3), pp.741-764. Pandey, P. and Pandey, M.M., 2021. Research methodology tools and techniques . Bridge Center. Patel, M. and Patel, N., 2019. Exploring Research Methodology. International Journal of Research and Review , 6 (3), pp.48-55. Ray, S., 2019, February. A quick review of machine learning algorithms. In 2019 International conference on machine learning, big data, cloud and parallel computing (COMITCon) (pp. 35- 39). IEEE. 47
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Rtayli, N. and Enneya, N., 2020. Enhanced credit card fraud detection based on SVM-recursive feature elimination and hyper-parameters optimization. Journal of Information Security and Applications , 55 , p.102596. Saheed, Y.K., Baba, U.A. and Raji, M.A., 2022. Big Data Analytics for Credit Card Fraud Detection Using Supervised Machine Learning Models. In Big Data Analytics in the Insurance Market (pp. 31-56). Emerald Publishing Limited. Sarker, I.H., 2021. Machine learning: Algorithms, real-world applications and research directions. SN computer science , 2 (3), p.160. Shirgave, S., Awati, C., More, R. and Patil, S., 2019. A review on credit card fraud detection using machine learning. International Journal of Scientific & technology research , 8 (10), pp.1217-1220. Shukur, H.A. and Kurnaz, S., 2019. Credit card fraud detection using machine learning methodology. International Journal of Computer Science and Mobile Computing , 8 (3), pp.257- 260. Silaparasetty, N. and Silaparasetty, N., 2020. The tensorflow machine learning library. Machine Learning Concepts with Python and the Jupyter Notebook Environment: Using Tensorflow 2.0, pp.149-171. Snyder, H., 2019. Literature review as a research methodology: An overview and guidelines. Journal of business research , 104 , pp.333-339. Suresh, H. and Guttag, J., 2021. A framework for understanding sources of harm throughout the machine learning life cycle. In Equity and access in algorithms, mechanisms, and optimization (pp. 1-9). Tamminen, K.A. and Poucher, Z.A., 2020. Research philosophies. In The Routledge international encyclopedia of sport and exercise psychology (pp. 535-549). Routledge. Williamson, T., 2021. The philosophy of philosophy . John Wiley & Sons. 48
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Yang, L. and Shami, A., 2020. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing , 415 , pp.295-316. Zioviris, G., Kolomvatsos, K. and Stamoulis, G., 2022. Credit card fraud detection using a deep learning multistage model. The Journal of Supercomputing , 78 (12), pp.14571-14596. Žukauskas, P., Vveinhardt, J. and Andriukaitienė, R., 2018. Philosophy and paradigm of scientific research. Management culture and corporate social responsibility , 121 , p.139. 49
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Appendix Python Code from google.colab import drive drive.mount('/content/drive') import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) import tensorflow as tf import matplotlib.pyplot as plt import seaborn as sns from sklearn.manifold import TSNE from sklearn.decomposition import PCA, TruncatedSVD import matplotlib.patches as mpatches import time # Classifier Libraries from sklearn.linear_model import LogisticRegression from sklearn.svm import SVC from sklearn.neighbors import KNeighborsClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.naive_bayes import GaussianNB import collections 50
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
# Other Libraries from sklearn.model_selection import train_test_split from sklearn.pipeline import make_pipeline from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline from imblearn.over_sampling import SMOTE from imblearn.under_sampling import NearMiss from imblearn.metrics import classification_report_imbalanced from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report from collections import Counter from sklearn.model_selection import KFold, StratifiedKFold import warnings warnings.filterwarnings("ignore") df = pd.read_csv('/content/drive/MyDrive/creditcard/creditcard.csv') df.head() df.describe() df.isnull().sum().max() print('No Frauds', round(df['Class'].value_counts()[0]/len(df) * 100,2), '% of the dataset') print('Frauds', round(df['Class'].value_counts()[1]/len(df) * 100,2), '% of the dataset') fig, ax = plt.subplots(1, 2, figsize=(18,4)) 51
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
amount_val = df['Amount'].values time_val = df['Time'].values sns.distplot(amount_val, ax=ax[0], color='r') ax[0].set_title('Distribution of Transaction Amount', fontsize=14) ax[0].set_xlim([min(amount_val), max(amount_val)]) sns.distplot(time_val, ax=ax[1], color='b') ax[1].set_title('Distribution of Transaction Time', fontsize=14) ax[1].set_xlim([min(time_val), max(time_val)]) plt.show() from sklearn.preprocessing import StandardScaler, RobustScaler # RobustScaler is less prone to outliers. std_scaler = StandardScaler() rob_scaler = RobustScaler() df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1,1)) df['scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1,1)) df.drop(['Time','Amount'], axis=1, inplace=True) scaled_amount = df['scaled_amount'] 52
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
scaled_time = df['scaled_time'] df.drop(['scaled_amount', 'scaled_time'], axis=1, inplace=True) df.insert(0, 'scaled_amount', scaled_amount) df.insert(1, 'scaled_time', scaled_time) # Amount and Time are Scaled! df.head() from sklearn.model_selection import train_test_split from sklearn.model_selection import StratifiedShuffleSplit print('No Frauds', round(df['Class'].value_counts()[0]/len(df) * 100,2), '% of the dataset') print('Frauds', round(df['Class'].value_counts()[1]/len(df) * 100,2), '% of the dataset') X = df.drop('Class', axis=1) y = df['Class'] sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False) for train_index, test_index in sss.split(X, y): 53
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
print("Train:", train_index, "Test:", test_index) original_Xtrain, original_Xtest = X.iloc[train_index], X.iloc[test_index] original_ytrain, original_ytest = y.iloc[train_index], y.iloc[test_index] # Turn into an array original_Xtrain = original_Xtrain.values original_Xtest = original_Xtest.values original_ytrain = original_ytrain.values original_ytest = original_ytest.values # See if both the train and test label distribution are similarly distributed train_unique_label, train_counts_label = np.unique(original_ytrain, return_counts=True) test_unique_label, test_counts_label = np.unique(original_ytest, return_counts=True) print('-' * 100) print('Label Distributions: \n') print(train_counts_label/ len(original_ytrain)) print(test_counts_label/ len(original_ytest)) df = df.sample(frac=1) # amount of fraud classes 492 rows. fraud_df = df.loc[df['Class'] == 1] 54
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
non_fraud_df = df.loc[df['Class'] == 0][:492] normal_distributed_df = pd.concat([fraud_df, non_fraud_df]) # Shuffle dataframe rows new_df = normal_distributed_df.sample(frac=1, random_state=42) new_df.head() new_df_corr = df.corr() new_df_corr masking = np.triu(new_df_corr) plt.figure(figsize = (25, 15)) plt.title('Correlation Matrix') sns.heatmap(new_df_corr, cmap = 'viridis', annot = True, mask = masking, linecolor = 'white', linewidths = 0.5, fmt = '.3f') new_df.corr()["Class"] f, axes = plt.subplots(ncols=4, figsize=(20,4)) # Negative Correlations with our Class (The lower our feature value the more likely it will be a fraud transaction) sns.boxplot(x="Class", y="V17", data=new_df, ax=axes[0]) axes[0].set_title('V17 vs Class Negative Correlation') 55
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
sns.boxplot(x="Class", y="V14", data=new_df, ax=axes[1]) axes[1].set_title('V14 vs Class Negative Correlation') sns.boxplot(x="Class", y="V12", data=new_df, ax=axes[2]) axes[2].set_title('V12 vs Class Negative Correlation') sns.boxplot(x="Class", y="V10", data=new_df, ax=axes[3]) axes[3].set_title('V10 vs Class Negative Correlation') plt.show() from scipy.stats import norm f, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(20, 6)) v14_fraud_dist = new_df['V14'].loc[new_df['Class'] == 1].values sns.distplot(v14_fraud_dist,ax=ax1, fit=norm, color='#FB8861') ax1.set_title('V14 Distribution \n (Fraud Transactions)', fontsize=14) 56
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
v12_fraud_dist = new_df['V12'].loc[new_df['Class'] == 1].values sns.distplot(v12_fraud_dist,ax=ax2, fit=norm, color='#56F9BB') ax2.set_title('V12 Distribution \n (Fraud Transactions)', fontsize=14) v10_fraud_dist = new_df['V10'].loc[new_df['Class'] == 1].values sns.distplot(v10_fraud_dist,ax=ax3, fit=norm, color='#C5B3F9') ax3.set_title('V10 Distribution \n (Fraud Transactions)', fontsize=14) plt.show() plt.figure(figsize = (10, 8)) plt.pie(df['Class'].value_counts(),labels=['No Fraud','Fraud'], autopct='%1.1f%%', explode = (0.0, 0.1),startangle=50 ,colors = ['yellow','red'], shadow = True) plt.legend(title = "Class", loc = 'lower right') plt.show() v14_fraud = new_df['V14'].loc[new_df['Class'] == 1].values q25, q75 = np.percentile(v14_fraud, 25), np.percentile(v14_fraud, 75) print('Quartile 25: {} | Quartile 75: {}'.format(q25, q75)) v14_iqr = q75 - q25 print('iqr: {}'.format(v14_iqr)) 57
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
v14_cut_off = v14_iqr * 1.5 v14_lower, v14_upper = q25 - v14_cut_off, q75 + v14_cut_off print('Cut Off: {}'.format(v14_cut_off)) print('V14 Lower: {}'.format(v14_lower)) print('V14 Upper: {}'.format(v14_upper)) outliers = [x for x in v14_fraud if x < v14_lower or x > v14_upper] print('Feature V14 Outliers for Fraud Cases: {}'.format(len(outliers))) print('V10 outliers:{}'.format(outliers)) new_df = new_df.drop(new_df[(new_df['V14'] > v14_upper) | (new_df['V14'] < v14_lower)].index) print('----' * 44) # -----> V12 removing outliers from fraud transactions v12_fraud = new_df['V12'].loc[new_df['Class'] == 1].values q25, q75 = np.percentile(v12_fraud, 25), np.percentile(v12_fraud, 75) v12_iqr = q75 - q25 v12_cut_off = v12_iqr * 1.5 v12_lower, v12_upper = q25 - v12_cut_off, q75 + v12_cut_off 58
print('V12 Lower: {}'.format(v12_lower))
print('V12 Upper: {}'.format(v12_upper))
outliers = [x for x in v12_fraud if x < v12_lower or x > v12_upper]
print('V12 outliers: {}'.format(outliers))
print('Feature V12 Outliers for Fraud Cases: {}'.format(len(outliers)))
new_df = new_df.drop(new_df[(new_df['V12'] > v12_upper) | (new_df['V12'] < v12_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(new_df)))
print('----' * 44)

# Removing outliers from feature V10
v10_fraud = new_df['V10'].loc[new_df['Class'] == 1].values
q25, q75 = np.percentile(v10_fraud, 25), np.percentile(v10_fraud, 75)
v10_iqr = q75 - q25
v10_cut_off = v10_iqr * 1.5
v10_lower, v10_upper = q25 - v10_cut_off, q75 + v10_cut_off
print('V10 Lower: {}'.format(v10_lower))
print('V10 Upper: {}'.format(v10_upper))
outliers = [x for x in v10_fraud if x < v10_lower or x > v10_upper]
print('V10 outliers: {}'.format(outliers))
print('Feature V10 Outliers for Fraud Cases: {}'.format(len(outliers)))
new_df = new_df.drop(new_df[(new_df['V10'] > v10_upper) | (new_df['V10'] < v10_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(new_df)))

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 6))
colors = ['#B3F9C5', '#f9c5b3']

# Boxplots with outliers removed
# Feature V14
sns.boxplot(x="Class", y="V14", data=new_df, ax=ax1, palette=colors)
ax1.set_title("V14 Feature \n Reduction of outliers", fontsize=14)
ax1.annotate('Fewer extreme \n outliers', xy=(0.98, -17.5), xytext=(0, -12), arrowprops=dict(facecolor='black'), fontsize=14)

# Feature V12
sns.boxplot(x="Class", y="V12", data=new_df, ax=ax2, palette=colors)
ax2.set_title("V12 Feature \n Reduction of outliers", fontsize=14)
ax2.annotate('Fewer extreme \n outliers', xy=(0.98, -17.3), xytext=(0, -12),
             arrowprops=dict(facecolor='black'),
             fontsize=14)

# Feature V10
sns.boxplot(x="Class", y="V10", data=new_df, ax=ax3, palette=colors)
ax3.set_title("V10 Feature \n Reduction of outliers", fontsize=14)
ax3.annotate('Fewer extreme \n outliers', xy=(0.95, -16.5), xytext=(0, -12), arrowprops=dict(facecolor='black'), fontsize=14)
plt.show()

X = new_df.drop('Class', axis=1)
y = new_df['Class']

# T-SNE Implementation
t0 = time.time()
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("T-SNE took {:.2} s".format(t1 - t0))

# PCA Implementation
t0 = time.time()
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("PCA took {:.2} s".format(t1 - t0))

# TruncatedSVD Implementation
t0 = time.time()
X_reduced_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=42).fit_transform(X.values)
t1 = time.time()
print("Truncated SVD took {:.2} s".format(t1 - t0))

X = new_df.drop('Class', axis=1)
y = new_df['Class']

from sklearn.model_selection import train_test_split

# Train/test split of the undersampled (balanced) data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Helper that prints the evaluation metrics and plots the confusion matrix for a set of predictions
def perform(y_pred):
    print("Precision : ", precision_score(y_test, y_pred))
    print("Recall : ", recall_score(y_test, y_pred))
    print("Accuracy : ", accuracy_score(y_test, y_pred))
    print("F1 Score : ", f1_score(y_test, y_pred))
    print('')
    print(confusion_matrix(y_test, y_pred), '\n')
    print(classification_report(y_test, y_pred))
    cm = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, y_pred), display_labels=['No Fraud', 'Fraud'])
    cm.plot()

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report, PrecisionRecallDisplay, RocCurveDisplay

model_lr = LogisticRegression()
model_lr.fit(x_train, y_train)
y_pred_lr = model_lr.predict(x_test)
perform(y_pred_lr)

model_svc = SVC()
model_svc.fit(x_train, y_train)
y_pred_svc = model_svc.predict(x_test)
perform(y_pred_svc)

model_rf = RandomForestClassifier()
model_rf.fit(x_train, y_train)
y_pred_rf = model_rf.predict(x_test)
perform(y_pred_rf)

model_dt = DecisionTreeClassifier()
model_dt.fit(x_train, y_train)
y_pred_dt = model_dt.predict(x_test)
perform(y_pred_dt)

model_nb = GaussianNB()
model_nb.fit(x_train, y_train)
y_pred_nb = model_nb.predict(x_test)
perform(y_pred_nb)

model_knn = KNeighborsClassifier(n_neighbors=3)
model_knn.fit(x_train, y_train)
y_pred_knn = model_knn.predict(x_test)
perform(y_pred_knn)

fig, ax = plt.subplots()
plt.title('Precision-Recall Curve')
PrecisionRecallDisplay.from_predictions(y_test, y_pred_nb, name='Gaussian Naive Bayes', ax=ax, color='red')
PrecisionRecallDisplay.from_predictions(y_test, y_pred_knn, name='KNeighborsClassifier', ax=ax, color='pink')
PrecisionRecallDisplay.from_predictions(y_test, y_pred_lr, name='Logistic Regression', ax=ax, color='blue')
PrecisionRecallDisplay.from_predictions(y_test, y_pred_dt, name='Decision Tree', ax=ax, color='brown')
PrecisionRecallDisplay.from_predictions(y_test, y_pred_rf, name='Random Forest', ax=ax, color='yellow')
PrecisionRecallDisplay.from_predictions(y_test, y_pred_svc, name='SVC', ax=ax, color='orange')
plt.legend(loc='best', fontsize='6.8')

fig, ax = plt.subplots()
plt.title('ROC Curve')
RocCurveDisplay.from_predictions(y_test, y_pred_nb, name='Gaussian Naive Bayes', ax=ax, color='red')
RocCurveDisplay.from_predictions(y_test, y_pred_knn, name='KNeighborsClassifier', ax=ax, color='pink')
RocCurveDisplay.from_predictions(y_test, y_pred_lr, name='Logistic Regression', ax=ax, color='blue')
RocCurveDisplay.from_predictions(y_test, y_pred_dt, name='Decision Tree', ax=ax, color='brown')
RocCurveDisplay.from_predictions(y_test, y_pred_rf, name='Random Forest', ax=ax, color='yellow')
RocCurveDisplay.from_predictions(y_test, y_pred_svc, name='SVC', ax=ax, color='orange')
plt.plot([0, 1], [0, 1], linestyle="--", color='black')
plt.legend(loc='best', fontsize='6.8')
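The precision-recall and ROC figures above compare the six classifiers visually. As an optional, minimal sketch (not part of the original listing), the same hard predictions can also be summarised with one ROC-AUC value per model using scikit-learn's roc_auc_score; the dictionary and variable names below are illustrative only, and an AUC computed from 0/1 labels is coarser than one computed from predicted probabilities.

# Optional sketch: numeric ROC-AUC summary of the predictions plotted above.
# Assumes y_test and the y_pred_* arrays from the listing are still in scope.
from sklearn.metrics import roc_auc_score

predictions = {
    'Gaussian Naive Bayes': y_pred_nb,
    'KNeighborsClassifier': y_pred_knn,
    'Logistic Regression': y_pred_lr,
    'Decision Tree': y_pred_dt,
    'Random Forest': y_pred_rf,
    'SVC': y_pred_svc,
}

# AUC from hard 0/1 labels; predict_proba scores would give a finer-grained ranking.
auc_scores = {name: roc_auc_score(y_test, pred) for name, pred in predictions.items()}
for name, score in sorted(auc_scores.items(), key=lambda item: item[1], reverse=True):
    print('{:<22s} ROC AUC: {:.3f}'.format(name, score))

Sorting the scores in this way gives a quick numeric ranking that can be read alongside the plotted curves.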