Clustering
K-Means Clustering with Wine
Nicholas Arndt-Kohlway
DATA630: Machine Learning (2215)
Professor Ami Gates
Introduction
K-Means clustering is used in this assignment to determine the factors that lead to higher- and lower-quality wines. The insights from this analysis will identify common characteristics among red wines and how those characteristics lead to differing qualities. For example, which common characteristics affect the final quality of a wine? How can manufacturers ensure a higher quality of wine from the inputs used? And crucially, which red wines are statistical anomalies that might distort this analysis? It is therefore necessary to identify any outliers that affect the overall result of the clustering technique.
For background, red wine is an alcoholic beverage produced by fermenting the juice of dark grapes. The difference between red wine and white wine is the type of grape used; red wine uses a dark-skinned grape whereas white wine uses a light-skinned grape. Pressed grape juice is infused and fermented with the dark grape skins to add color, flavor, and tannin to the wine. The alcohol is produced when yeast is introduced to convert the sugars in the grapes to ethanol and carbon dioxide.
There are four characteristics of wine: color, tannin, flavor, and acid. The colors in red wine vary from a deep purple to a light pink, depending on the grapes and the age of the wine. Tannins come from the skins, seeds, and even the stems of the grapes. Tannins add texture, structure, and ageability to the wine. They determine the dryness of the wine and soften over time, which makes red wine best consumed after a few years of aging. “Different grape varieties produce aromas of fruits, flowers, herbs, spices, and earthy characteristics. For example, Pinot Noir tends to have raspberry, cherry, and forest floor notes while Cabernet Sauvignon generally boasts notes of cassis, licorice, and wet gravel”. Acid provides freshness and structure by acting as a preservative. The acidity produces tart and sour notes that balance the sweet and bitter tannin components.
K-Means clustering is the chosen method for this analysis since it groups wines with similar characteristics, which helps determine which characteristics significantly affect the quality of the red wine. It also helps identify outliers that may have a negative impact on the quality of the clustering. These outliers may also indicate which red wines are excellent or poor. From this, winemakers can identify which ingredients to use to create a better-quality wine and increase profitability.
Analysis
This dataset was provided by the UC Irvine Machine Learning Repository through Paulo Cortez at the University of Minho in Guimarães, Portugal. The dataset covers red variants of the Portuguese “Vinho Verde” wine. It includes the following variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality. Quality is the dependent variable in the dataset while the rest are independent.
Fixed acidity had a minimum value of 4.6 and a maximum value of 15.9 grams per liter (g/L). Volatile acidity had a minimum value of 0.12 and a maximum value of 1.58 grams per liter (g/L). Citric acid had a minimum value of 0 and a maximum value of 1 gram per liter (g/L). Residual sugar had a minimum value of 0.9 and a maximum value of 15.5 grams per liter (g/L). Chlorides had a minimum value of 0.012 and a maximum value of 0.611 grams per liter (g/L). Free sulfur dioxide had a minimum value of 1 and a maximum value of 72 milligrams per liter (mg/L). Total sulfur dioxide had a minimum value of 6 and a maximum value of 289 milligrams per liter (mg/L). Density had a minimum value of 0.9901 and a maximum value of 1.0037 grams per milliliter (g/mL). pH had a minimum value of 2.74 and a maximum value of 4.01. Sulphates had a minimum value of 0.33 and a maximum value of 2 grams per liter (g/L). Alcohol had a minimum value of 8.4% and a maximum value of 14.9% by volume. Quality had a minimum value of 3 and a maximum value of 8.
Figure 1: The mean quality of wines is low with a score of 5.636 out of 10.
Figure 2: Quality 5 and 6 wines make up over 66% of the dataset.
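The minimum, maximum, and mean values and the quality shares described above could be reproduced along the following lines. This is only a sketch: the three rows are a hypothetical sample rather than the real dataset, and the helper names `summarize` and `quality_share` are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical three-row sample; the real dataset has 11 predictors plus quality.
rows = [
    {"alcohol": 9.4, "pH": 3.51, "quality": 5},
    {"alcohol": 9.8, "pH": 3.20, "quality": 5},
    {"alcohol": 11.2, "pH": 3.16, "quality": 6},
]

def summarize(rows):
    """Return {column: (min, mean, max)} for every column in the rows."""
    return {
        col: (
            min(r[col] for r in rows),
            mean(r[col] for r in rows),
            max(r[col] for r in rows),
        )
        for col in rows[0]
    }

def quality_share(rows, levels):
    """Fraction of rows whose quality score is one of the given levels."""
    return sum(r["quality"] in levels for r in rows) / len(rows)

summary = summarize(rows)
share_5_or_6 = quality_share(rows, {5, 6})
```

On the full dataset, `summary["quality"]` would give the range and mean shown in Figure 1, and `share_5_or_6` the proportion shown in Figure 2.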
To start the data preprocessing, it was essential to check for missing values, of which there were none. However, the citric acid variable did contain values of 0. Upon further inspection this is plausible, so the values were kept and the data preprocessing consisted of scaling the 11 independent variables.
K-Means clustering groups the instances in the dataset by their common characteristics. Starting from an initial set of cluster centers, each instance is placed into the cluster that has the nearest mean, the means are then recomputed, and the algorithm iterates to reduce the within-cluster sum of squares as efficiently as possible. For this assignment three different models were created with k-values of 3, 4, and 5. Outlier detection and the within-cluster sums of squares are used to assess the fit of each model.
Figure 3: Chlorides and Sulphates have exceedingly high levels after scaling.
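The scaling and clustering steps described above can be sketched as follows. This is a minimal illustration rather than the code used for the assignment: it assumes z-score scaling with the sample standard deviation, plain Lloyd-style iterations with randomly sampled starting centroids, and a small hypothetical two-column dataset. The function names are chosen for this example only.

```python
import math
import random

def zscore_columns(rows):
    """Scale every column to mean 0 and sample standard deviation 1.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (len(col) - 1))
        for col, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style K-Means; returns (labels, centroids, passes)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random starting centroids
    labels = None
    for passes in range(1, max_iter + 1):
        # Assignment step: each instance joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # no instance moved, so the model has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids, passes

def within_ss(points, labels, centroids):
    """Total within-cluster sum of squares for a fitted model."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Hypothetical data: two well-separated groups measured on very different scales.
raw = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0], [5.0, 30.0], [5.2, 31.0], [4.8, 29.0]]
scaled = zscore_columns(raw)
labels, centroids, passes = kmeans(scaled, k=2)
total_ss = within_ss(scaled, labels, centroids)
```

Scaling first keeps a variable with a wide numeric range, such as total sulfur dioxide, from dominating the distance calculation. The pass count returned here includes the final pass that confirms nothing changed, so it need not match the iteration count reported by other software.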
Result
The K=4 model shown in the cluster plot in Figure 6 took 4 iterations to converge.
From Figure 6 it is apparent that cluster 2 has an outlier in instance 152, which elongates the cluster along component 1. All 4 clusters are elongated along component 1; however, this could be attributable to the scaling of the horizontal axis. There are also outliers in cluster 1, namely instances 1435 and 1436. The two components in this plot account for 44.69% of the variability in the dataset.
The variable means below are in scaled units, while quality is on its original scale. Cluster 2 has a significantly higher mean fixed acidity of 1.278 and citric acid of 1.161. Cluster 1 has a high mean for free sulfur dioxide of 1.115 and total sulfur dioxide of 1.356. Cluster 4 has the highest average quality of 6.266; it also has the highest average alcohol content of 1.249. While the difference in quality among the clusters is small, the difference in alcohol content is not. The higher alcohol content might create a drier wine, which is a characteristic sought after in red wine. The alcohol content could also increase intoxication, which makes the wine more enjoyable to consumers. It is also important to note that cluster 2 had the second highest average quality. The characteristics of cluster 2 include higher sulphate and density levels than the other clusters. It also has the second highest average alcohol content. The sulphate levels add flavor and body to the wine, which can also create higher alcohol levels. These properties can contribute to higher quality ratings compared to the other red wines.
Another important factor is the number of instances in each cluster. Cluster 3 had the lowest average quality; however, it had the most instances with 559. Meanwhile, cluster 4 had the highest average quality, yet it had the fewest instances with only 327. Cluster 3 is more compact than the other clusters considering that it has significantly more instances and a sum of squares of only 2674.193, which is the second lowest among the clusters. It is beaten only by cluster 4, which contains over 200 fewer instances.
Figure 7 identifies the top 5 outliers from the cluster plot in Figure 6. Instance 152 in cluster 2 has much higher levels of citric acid, chlorides, and sulphates, and a noticeably lower pH, than the other outliers in the dataset. Instances 1435 and 1436 are both outliers and identical in their values. These two instances have a lower alcohol content than the other outliers. Meanwhile, they have higher levels of residual sugar, free sulfur dioxide, total sulfur dioxide, and density. These two instances also had the highest quality among the top 5 outliers with a rating of 6.
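The per-cluster comparison above (number of instances and average quality) can be sketched as follows, assuming that quality is left out of the clustering and averaged per cluster afterwards. The labels and quality scores are hypothetical, and `cluster_summary` is a name chosen for this example.

```python
from collections import defaultdict

def cluster_summary(labels, quality):
    """Return {cluster: (number of instances, mean quality)}."""
    groups = defaultdict(list)
    for label, score in zip(labels, quality):
        groups[label].append(score)
    return {
        label: (len(scores), sum(scores) / len(scores))
        for label, scores in sorted(groups.items())
    }

# Hypothetical cluster assignments and quality scores for six wines.
labels = [1, 1, 2, 2, 2, 3]
quality = [5, 6, 5, 5, 6, 7]
summary = cluster_summary(labels, quality)

# Cluster with the highest average quality in this toy example.
best = max(summary, key=lambda label: summary[label][1])
```

Because quality plays no part in forming the clusters under this assumption, differences in mean quality between clusters reflect differences in the chemical variables.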
Figure 4: Cluster 4 has the highest average quality with 6.266 and cluster 3 has the lowest with 5.284.
Figure 5: Clusters 3 and 4 have a significant number of instances of higher quality wines in the 6 to 8 range.
Figure 6: Instances 152, 1436, and 1435 are obvious outliers in clusters 1 and 2.
Figure 7: Outliers 1435 and 1436 are identical and have the highest quality of 6 among the top 5 outliers.
The K=3 model shown in the cluster plot in Figure 10 took 4 iterations to converge.
From Figure 10 there are outliers in cluster 2, namely instances 152, 1435, and 1436. The two components in this plot account for 44.69% of the variability in the dataset. Cluster 1 has a significantly higher mean fixed acidity of 1.250 and citric acid of 1.136. Cluster 1 also has higher levels of residual sugar, chlorides, density, and sulphates with means of 0.249, 0.476, 0.889, and 0.685, respectively. Cluster 3 has the highest average quality of 6.247; it also has the highest average alcohol content of 1.189. Cluster 3 had lower levels of acidity, residual sugar, chlorides, total sulfur dioxide, and density. However, cluster 3 also had the highest pH. It is also important to note that cluster 1 had the second highest average quality with a value of 5.856. The characteristics of cluster 1 include higher fixed acidity, citric acid, residual sugar, chlorides, density, and sulphates. All these variables add character and flavor to the wine that many connoisseurs look for. However, even though cluster 1 had higher levels of these variables, it still only placed second in terms of quality.
Another important factor is the number of instances in each cluster. Cluster 3 had the highest average quality; however, it had the fewest instances with 388. Meanwhile, cluster 2 had the lowest average quality, yet it had the most instances with 815. Cluster 2 is more compact than the other clusters considering that it has significantly more instances and a sum of squares of only 5619.64, which is only slightly higher than that of cluster 1. Cluster 1 had a sum of squares of 5034.602 with less than half the instances of cluster 2 (396 instances). Cluster 3 had a sum of squares of 3020.224 with 388 instances.
Figure 11 identifies the top 5 outliers from the cluster plot in Figure 10. The outliers when K=3 are the same as when K=4, except that instance 481 from K=4 is replaced by instance 1245 when K=3. When comparing instances 481 and 1245, it is evident that 1245 has lower values for most variables aside from volatile acidity, free sulfur dioxide, total sulfur dioxide, pH, and alcohol. This corresponds to a higher quality of 6 for instance 1245 compared to 5 for instance 481. From Figure 9 it is easy to notice that cluster 2 had a significant number of instances with a quality of 5. Cluster 3 had the most wines with a quality of 8, at 12 instances. Cluster 3 also had the largest number of higher quality wines, with 213 wines at quality 6 and 119 wines at quality 7.
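The compactness comparison for K=3 can be checked with simple arithmetic on the cluster sizes and sums of squares reported above. Dividing each sum of squares by its cluster size is an assumed way of making clusters of different sizes comparable; it is not a measure taken from the original analysis.

```python
# Cluster sizes and within-cluster sums of squares reported above for K=3.
k3 = {
    1: {"size": 396, "within_ss": 5034.602},
    2: {"size": 815, "within_ss": 5619.64},
    3: {"size": 388, "within_ss": 3020.224},
}

def ss_per_instance(clusters):
    """Within-cluster sum of squares divided by the number of instances."""
    return {c: v["within_ss"] / v["size"] for c, v in clusters.items()}

total_instances = sum(v["size"] for v in k3.values())
total_within_ss = sum(v["within_ss"] for v in k3.values())
per_instance = ss_per_instance(k3)

# Cluster with the smallest average squared distance to its centroid.
tightest = min(per_instance, key=per_instance.get)
```

Per instance, cluster 2 comes out at roughly 6.9, cluster 3 at roughly 7.8, and cluster 1 at roughly 12.7, which agrees with the statement that cluster 2 is the most compact despite having the largest raw sum of squares.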
Figure 8: Cluster 3 has the highest quality with an average of 6.247.
Figure 9: Cluster 3 had the most instances of higher quality wine with 213 instances of quality 6, 119 instances of quality 7, and 12 instances of quality 8.
Figure 10: Instances 1435, 1436, and 152 are obvious outliers in cluster 1.
Figure 11: 1245 is now an outlier and matches the quality of instances 1435 and 1436.
The K=5 model shown in the cluster plot in Figure 14 took 3 iterations to converge.
From Figure 14 there are outliers in clusters 1 and 2, namely instances 152, 1436, and 1475. The two components in this plot account for 44.69% of the variability in the dataset. Cluster 5 has a significantly higher mean fixed acidity of 1.365 and citric acid of 1.151. Cluster 1 has higher levels of chlorides and sulphates with means of 5.521 and 3.623, respectively. Cluster 3 has the highest average quality of 6.256; it also has the highest average alcohol content of 1.248. Cluster 3 had lower levels of every variable except for pH, sulphates, and alcohol, which corresponds to a higher quality rating. It is also valuable to note that cluster 5 had the second highest average quality with a value of 5.969. The characteristics of cluster 5 include higher fixed acidity, citric acid, residual sugar, and density. With higher levels of these variables, cluster 5 placed just behind first in quality.
Cluster 3 had the highest average quality; however, it had the second fewest instances with 320. Meanwhile, cluster 4 had the lowest average quality, yet it had the most instances with 557. Cluster 4 is more compact than the other clusters considering that it has significantly more instances and a sum of squares of only 2658.192. This is lower than that of cluster 2, which had a sum of squares of 3041.372 with only 335 instances (222 fewer). Cluster 3 had a sum of squares of 2261.789 with 320 instances.
Figure 15 identifies the top 5 outliers from the cluster plot in Figure 14. The outliers when K=5 are similar to those for K=3 and K=4, except that instance 1475 now appears, which was not in the previous two outlier groups. Instance 1475 has high levels of citric acid, chlorides, and sulfur dioxide. Meanwhile, it has low alcohol content and pH. This corresponds to a lower quality of 5 for instance 1475 compared to 6 for instances 1435, 1436, and 1245. From Figure 13 it is easy to notice that cluster 3 had a significant number of instances with a quality greater than or equal to 5. Cluster 3 had the most wines with a quality of 7 and 8, with 96 and 12 instances, respectively.
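The text does not state how the top 5 outliers in Figures 7, 11, and 15 were ranked. One common approach, assumed here purely for illustration, is to rank instances by their Euclidean distance from the centroid of the cluster they were assigned to. The points and labels below are hypothetical, and indices are 0-based, unlike the instance numbers quoted in the text.

```python
import math

def centroids_of(points, labels):
    """Mean of the points in each cluster, keyed by cluster label."""
    centroids = {}
    for label in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def top_outliers(points, labels, n=5):
    """Return the n (index, distance) pairs farthest from their own centroid."""
    centroids = centroids_of(points, labels)
    distances = [
        (i, math.dist(p, centroids[lab]))
        for i, (p, lab) in enumerate(zip(points, labels))
    ]
    return sorted(distances, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical scaled data: the point at index 2 sits far from the rest of cluster 0.
points = [[0.0, 0.0], [0.1, 0.0], [4.0, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels = [0, 0, 0, 1, 1]
worst_two = top_outliers(points, labels, n=2)
```

In the toy data the distant point also pulls the centroid of cluster 0 toward itself, which is the same effect that elongates the clusters in the plots.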
Conclusion
Throughout the analysis it was evident that higher mean alcohol content corresponded to higher ratings of the wine. This could be attributable to several factors, such as a drier flavor or intoxication causing judges to become less stringent. Unfortunately, these factors are not captured in the dataset, so the cause of this correlation is unknown. The analysis also showed that many of the lower quality wines had higher levels of residual sugar and sulfur dioxide, along with lower pH levels. The extra sugars and acidity mask the true flavor and body of the wine, creating an almost manufactured taste that is unable to capture the essence of the natural additives. However, a higher alcohol content can mask the flavors by getting people intoxicated.
The limitation of this analysis is deciding which K-value will produce the most accurate model. Outliers add another difficulty, since the model should not be heavily influenced by random anomalies in the dataset. In every model created there were elongated clusters that had high sums of squares due to a few outliers. To address this, it would be best to remove outliers such as instances 152, 481, 1245, 1435, 1436, and 1475 before constructing the clusters. K-Means clustering randomly selects the initial centroid for each cluster, so different initial centroids may form different clusters. To get around this issue there is the option of k-means++, which chooses the first centroid uniformly at random from the instances. Each following centroid is then chosen from the remaining instances with probability proportional to its squared distance from the closest existing centroid.
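The k-means++ seeding rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `kmeans_pp_init` and the sample points are hypothetical, and the input is assumed to contain at least k distinct points.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, seed=0):
    """Choose k starting centroids using the k-means++ seeding rule."""
    rng = random.Random(seed)
    # First centroid: drawn uniformly at random from the instances.
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Weight each instance by its squared distance to the closest centroid so far.
        # Instances already chosen get weight 0 and cannot be drawn again.
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        chosen = rng.choices(points, weights=weights, k=1)[0]
        centroids.append(list(chosen))
    return centroids

# Hypothetical scaled data with three visible groups.
points = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9], [0.0, 5.0], [0.2, 5.1]]
start = kmeans_pp_init(points, k=3)
```

Seeding this way tends to spread the starting centroids across the data, so repeated runs are less likely to end in noticeably different clusterings. It does not address the outlier problem, because distant outliers are, if anything, more likely to be picked as starting centroids.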
Figure 12: Cluster 3 had the highest quality with an average rating of 6.256. Cluster 5 was in a close second with a rating of 5.969.
Figure 13: Cluster 3 had the highest number of high-quality instances with 284 instances of a wine quality greater than or equal to 6.
Figure 14: Outliers include 152, 1245, 1435, and 1436 from clusters 1 and 2.
Figure 15: The top 5 outliers have a quality of 5 and 6.
References
Wine Enthusiast. (2021, March 16). Red Wine Information & Basics. Retrieved August 3, 2021, from https://www.winemag.com/2015/10/27/red-wine-basics/
Yildirim, S. (2020, April 06). Two Challenges of K-Means Clustering. Retrieved August 3, 2021, from https://towardsdatascience.com/two-challenges-of-k-means-clustering-72e90bdeb0da