Forest_Fires

School: Northeastern University
Course: 7275
Subject: Electrical Engineering
Date: Apr 3, 2024
Uploaded by MasterChinchillaPerson721

In [7]:
    #Q.2
    from google.colab import files
    import pandas as pd
    import matplotlib.pyplot as plt

    uploaded = files.upload()
    df = pd.read_csv('forestfires.csv')

    Saving forestfires.csv to forestfires (1).csv

In [9]:
    df = pd.DataFrame(df)

In [16]:
    df.sort_values('month', ascending=True)

Out[16]:
           X   Y  month  day  FFMC    DMC     DC   ISI  temp  RH  wind  rain   area
    442    6   5    apr  mon  87.9   24.9   41.6   3.7  10.9  64   3.1   0.0   3.35
    241    4   4    apr  fri  83.0   23.3   85.3   2.3  16.7  20   3.1   0.0   0.00
    176    6   5    apr  thu  81.5    9.1   55.2   2.7   5.8  54   5.8   0.0   4.61
    240    6   3    apr  wed  88.0   17.2   43.5   3.8  15.2  51   2.7   0.0   0.00
    196    6   5    apr  thu  81.5    9.1   55.2   2.7   5.8  54   5.8   0.0  10.93
    ...  ... ...    ...  ...   ...    ...    ...   ...   ...  ..   ...   ...    ...
    366    4   5    sep  tue  91.1  132.3  812.1  12.5  15.9  38   5.4   0.0   1.75
    367    4   5    sep  tue  91.1  132.3  812.1  12.5  16.4  27   3.6   0.0   0.00
    368    6   5    sep  sat  91.2   94.3  744.4   8.4  16.8  47   4.9   0.0  12.64
    357    6   3    sep  fri  92.5  122.0  789.7  10.2  15.9  55   3.6   0.0   0.00
    173    4   4    sep  mon  90.9  126.5  686.5   7.0  17.7  39   2.2   0.0   3.07

    517 rows × 13 columns

In [17]:
    df

Out[17]:
           X   Y  month  day  FFMC    DMC     DC   ISI  temp  RH  wind  rain   area
    0      7   5    mar  fri  86.2   26.2   94.3   5.1   8.2  51   6.7   0.0   0.00
    1      7   4    oct  tue  90.6   35.4  669.1   6.7  18.0  33   0.9   0.0   0.00
    2      7   4    oct  sat  90.6   43.7  686.9   6.7  14.6  33   1.3   0.0   0.00
    3      8   6    mar  fri  91.7   33.3   77.5   9.0   8.3  97   4.0   0.2   0.00
    4      8   6    mar  sun  89.3   51.3  102.2   9.6  11.4  99   1.8   0.0   0.00
    ...  ... ...    ...  ...   ...    ...    ...   ...   ...  ..   ...   ...    ...
    512    4   3    aug  sun  81.6   56.7  665.6   1.9  27.8  32   2.7   0.0   6.44
    513    2   4    aug  sun  81.6   56.7  665.6   1.9  21.9  71   5.8   0.0  54.29
    514    7   4    aug  sun  81.6   56.7  665.6   1.9  21.2  70   6.7   0.0  11.16
    515    1   4    aug  sat  94.4  146.0  614.7  11.3  25.6  42   4.0   0.0   0.00
    516    6   3    nov  tue  79.5    3.0  106.7   1.1  11.8  31   4.5   0.0   0.00

    517 rows × 13 columns

In [18]:
    #1
    month_order = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
                   'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
    grouped = df.groupby(['month', 'day']).size().unstack().reindex(month_order)
    grouped.plot(kind='bar', stacked=True)
    plt.xlabel('Month')
    plt.ylabel('Number of Forest Fires')
    plt.title('Forest Fires by Month and Day')
    plt.show()

1. Yes. The problem with the stacked bar chart is that the per-day counts within each month are very hard to read. A different visualization, such as a heatmap, would present the same information more clearly.
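
The heatmap suggested above could be sketched roughly as follows. This is an illustration only, not part of the original notebook: the helper names `month_day_counts` and `plot_month_day_heatmap` are hypothetical, the `Reds` colormap is an arbitrary choice, and the four toy rows are made up so the block runs without forestfires.csv.

```python
import pandas as pd
import matplotlib.pyplot as plt

MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
DAYS = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']


def month_day_counts(df):
    """Count fires per (month, day), with rows and columns in calendar order."""
    counts = pd.crosstab(df['month'], df['day'])
    # Months or days with no fires are added back as zero rows/columns.
    return counts.reindex(index=MONTHS, columns=DAYS, fill_value=0)


def plot_month_day_heatmap(counts):
    """Draw the count table as a colour-coded grid and return the Axes."""
    fig, ax = plt.subplots(figsize=(6, 5))
    im = ax.imshow(counts.to_numpy(), cmap='Reds', aspect='auto')
    ax.set_xticks(range(len(counts.columns)))
    ax.set_xticklabels(counts.columns)
    ax.set_yticks(range(len(counts.index)))
    ax.set_yticklabels(counts.index)
    ax.set_xlabel('Day')
    ax.set_ylabel('Month')
    ax.set_title('Forest Fires by Month and Day')
    fig.colorbar(im, ax=ax, label='Number of fires')
    return ax


# Toy rows, made up for illustration (NOT the real forestfires.csv data).
toy = pd.DataFrame({'month': ['aug', 'aug', 'sep', 'mar'],
                    'day': ['sun', 'sun', 'fri', 'fri']})
counts = month_day_counts(toy)
ax = plot_month_day_heatmap(counts)
```

In the notebook, passing the real `df` to `month_day_counts` and then calling `plt.show()` should display the grid, with one cell per month and day instead of stacked bar segments.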
In [19]:
    pivot_table = df.pivot_table(index='month', columns='day', aggfunc='size', fill_value=0)
    pivot_table

Out[19]:
    day    fri  mon  sat  sun  thu  tue  wed
    month
    apr      1    1    1    3    2    0    1
    aug     21   15   29   40   26   28   25
    dec      1    4    0    1    1    1    1
    feb      5    3    4    4    1    2    1
    jan      0    0    1    1    0    0    0
    jul      3    4    8    5    3    6    3
    jun      3    3    2    4    2    0    3
    mar     11   12   10    7    5    5    4
    may      1    0    1    0    0    0    0
    nov      0    0    0    0    0    1    0
    oct      1    4    3    3    0    2    2
    sep     38   28   25   27   21   19   14

In [20]:
    #2
    scatter_fire = df.plot.scatter('X', 'Y', s='area', c='red')

In [21]:
    #3
    import seaborn as sns

    scatter_matrix = df[['temp', 'RH', 'DC', 'DMC']]
    sns.set_theme(style='ticks')
    sns.pairplot(scatter_matrix)

Out[21]:
    <seaborn.axisgrid.PairGrid at 0x7f3462785f60>

Looking at this scatter matrix, DC and DMC appear to be correlated with each other, and RH and temp appear to be negatively correlated.
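
A correlation table is one way to put numbers on what the scatter matrix shows. The sketch below is an assumption-laden illustration rather than the notebook's own code: `correlation_table` is a hypothetical helper, and the toy values are made up so the block runs without forestfires.csv.

```python
import pandas as pd


def correlation_table(df, cols=('temp', 'RH', 'DC', 'DMC')):
    """Pearson correlation between the selected numeric columns."""
    return df[list(cols)].corr(method='pearson')


# Toy values, made up for illustration (NOT the real forestfires.csv data).
toy = pd.DataFrame({
    'temp': [10.0, 15.0, 20.0, 25.0],
    'RH': [80, 70, 60, 50],
    'DC': [100.0, 200.0, 300.0, 400.0],
    'DMC': [10.0, 25.0, 35.0, 50.0],
})
corr = correlation_table(toy)
print(corr.round(2))
```

Calling `correlation_table(df)` on the real DataFrame would give the coefficients for the four plotted columns. Values near +1 or -1 indicate a strong linear relationship, and values near 0 a weak one.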
In [22]:
    #4
    plt.hist(df["area"], bins=20, edgecolor="black")
    plt.xlabel("Burned Area (hectares)")
    plt.ylabel("Frequency")
    plt.title("Distribution of Burned Area")
    plt.show()

The histogram shows that most burned areas are concentrated at the low end, with only a few instances of much larger burned areas. Most forest fires in the dataset are therefore relatively small in terms of the area affected, while a few are large-scale fires.
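
The right skew described above can also be summarized numerically. The following is a sketch under assumptions: `area_summary` is a hypothetical helper, the `log1p` transform is a common remedy for skewed, zero-heavy data that the original notebook does not use, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def area_summary(area):
    """Summarize how skewed a burned-area column is."""
    area = pd.Series(area, dtype=float)
    return {
        'share_zero': float((area == 0).mean()),     # fraction of fires with area 0
        'median': float(area.median()),
        'mean': float(area.mean()),
        'skew': float(area.skew()),                  # > 0 means a long right tail
        'skew_log1p': float(np.log1p(area).skew()),  # skew after log(1 + x)
    }


# Toy values, made up for illustration (NOT the real forestfires.csv data).
toy_area = [0.0, 0.0, 0.0, 1.0, 2.0, 5.0, 100.0]
summary = area_summary(toy_area)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A mean well above the median, together with a positive skew, matches the picture of many small fires and a few large ones. Plotting `np.log1p(df["area"])` instead of the raw column is one option for making the low end of the histogram easier to read.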
Q.3.

1. Course A: Median = 84, Q1 = 81, Q3 = 88
   Course B: Median = 78, Q1 = 70, Q3 = 84

2. Course A: IQR = Q3 - Q1 = 88 - 81 = 7
   Course B: IQR = Q3 - Q1 = 84 - 70 = 14
   Therefore, Course B has more variability in grades.

3. Yes, Course A has outliers. There are several reasons why outliers may appear in a box plot:
   Data entry errors: Outliers can result from mistakes made during data collection or data entry. For example, a misplaced decimal point can produce an unusually large or small value.
   Measurement errors: Outliers can arise from errors in measurement or recording.
   Natural variation: In some cases, outliers are simply a result of natural variation in the data. An extremely high or low score on a test will show up as an outlier.
   Skewed distribution: Outliers can also occur when the data distribution is highly skewed, that is, asymmetric. In such cases, the box plot may show a long tail on one side, and data points that fall outside the whiskers are considered outliers.
   Data anomalies: Outliers can be caused by unusual or rare events that are not representative of the overall dataset. These anomalies may arise from a variety of factors, such as errors in the data-generating process, extreme weather events, or other exceptional circumstances.
   Sample size: In small samples, outliers are more likely to occur by chance. With a limited number of data points, even a single extreme value can have a significant impact on the box plot.
   The presence of outliers does not necessarily imply an error or a problem with the data, and outliers can provide valuable insight into the data distribution. However, it is important to investigate the reasons behind them and assess their impact on the analysis or interpretation of the data.

4. Course B appears to be more challenging: it has a larger IQR, a much wider gap between its minimum and maximum values, and a lower median than Course A.

5. Course B has a mix of students scoring high and low grades, while scores in Course A fall within a narrower range. Comparing the two courses, the highest mark is in Course B, and Course A has only a few students with very high or very low scores.
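
The IQR arithmetic, and the cutoffs beyond which a box plot marks points as outliers, can be sketched as below. The 1.5 × IQR rule is an assumption here: it is the usual box-plot convention, but the source does not say which rule its plot used. `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (iqr, lower_fence, upper_fence) for the k * IQR outlier rule."""
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr


# Quartiles (Q1, Q3) as read from the box plots in Q.3.
courses = {"Course A": (81, 88), "Course B": (70, 84)}
fences = {name: tukey_fences(q1, q3) for name, (q1, q3) in courses.items()}

for name, (iqr, low, high) in fences.items():
    print(f"{name}: IQR = {iqr}, outliers lie below {low} or above {high}")
# Course A: IQR = 7, outliers lie below 70.5 or above 98.5
# Course B: IQR = 14, outliers lie below 49.0 or above 105.0
```

Under this rule, Course A's narrow box gives tight fences, so a score below 70.5 already counts as an outlier. Course B's wider box tolerates scores down to 49.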
Extra Credit

Q.1.
Let total students = 100
Men = 60, Women = 40
Men: Passed = 42, Failed = 18
Women: Passed = 32, Failed = 8
Total students passed = 74
Total students failed = 26

1. The probability that a randomly chosen student passed the exam is 74/100 = 0.74.
2. Given that a student passed the exam, the probability that the student is a woman is P(Women | Passed) = P(Women ∩ Passed) / P(Passed) = 0.32 / 0.74 ≈ 0.432.
3. Given that a student failed the exam, the probability that the student is a man is P(Men | Failed) = P(Men ∩ Failed) / P(Failed) = 0.18 / 0.26 ≈ 0.692.

Q.2. 1. To determine the percentage of plants that are between 135cm and 155cm tall in a Normal distribution with a mean of 145cm and a standard deviation of 22cm, we first standardize 135cm and 155cm by converting them into z-scores. The z-score is the number of standard deviations an observation lies from the mean:

z = (x - μ) / σ

where z = z-score, x = observed value, μ = mean, σ = standard deviation.

For 135cm: z1 = (135 - 145) / 22 = -0.4545
For 155cm: z2 = (155 - 145) / 22 = 0.4545

Next, we find the area under the standard Normal curve between these two z-scores. This area is the proportion of plants between 135cm and 155cm tall. Using a standard Normal distribution table or statistical software, the area between z = -0.4545 and z = 0.4545 is approximately 0.3506. Therefore, approximately 35.06% of plants are between 135cm and 155cm tall.

Q.2. 2. First, we calculate the standard error of the mean (SE) using the formula SE = σ / √n, where σ is the standard deviation of the population (22cm) and n is the sample size (16).
SE = 22 / √16 = 22 / 4 = 5.5
Next, we calculate the z-scores for the lower and upper limits using the formula z1 = (x1 - μ) / SE, where x1 is the lower limit (135cm), μ is the population mean (145cm), and SE is the standard error of the mean.
z1 = (135 - 145) / 5.5 = -1.818
z2 = (x2 - μ) / SE, where x2 is the upper limit (155cm).
z2 = (155 - 145) / 5.5 = 1.818
Using a standard Normal distribution table or statistical software, the area between z = -1.818 and z = 1.818 is approximately 0.931. Therefore, the probability that the mean height of a random sample of 16 plants is between 135cm and 155cm is approximately 0.931, or 93.1%.

Q.2. 3. Calculate the standard error of the mean (SE) using the formula SE = σ / √n, where σ is the standard deviation of the population (22cm) and n is the sample size (32).
SE = 22 / √32 ≈ 3.889
Next, calculate the z-scores for the lower and upper limits using the formula z1 = (x1 - μ) / SE, where x1 is the lower limit (135cm), μ is the population mean (145cm), and SE is the standard error of the mean.
z1 = (135 - 145) / 3.889 ≈ -2.571
z2 = (x2 - μ) / SE, where x2 is the upper limit (155cm).
z2 = (155 - 145) / 3.889 ≈ 2.571
Using a standard Normal distribution table or statistical software, the area between z = -2.571 and z = 2.571 is approximately 0.9899. Therefore, the probability that the mean height of a random sample of 32 plants is between 135cm and 155cm is approximately 0.9899, or 98.99%.

Q.3. We use a chi-square test of independence. First, we set up the null hypothesis (H0) and the alternative hypothesis (H1):
H0: Hand preference and foot preference are independent.
H1: Hand preference and foot preference are not independent.

The observed frequencies are:

                Right Hand   Left Hand   Total
    Right Foot        2012         121    2133
    Left Foot          142         116     258
    Total             2154         237    2391

The expected frequency of each cell = (row total * column total) / grand total. For example, for Right Foot and Right Hand: 2133 * 2154 / 2391 ≈ 1921.57. The expected frequencies table is as follows:

                Right Hand   Left Hand   Total
    Right Foot     1921.57      211.43    2133
    Left Foot       232.43       25.57     258
    Total             2154         237    2391

χ^2 = Σ [(Observed frequency - Expected frequency)^2 / Expected frequency]

Calculating the chi-square contribution of each cell and summing them up, we get:

χ^2 = (2012 - 1921.57)^2 / 1921.57 + (121 - 211.43)^2 / 211.43 + (142 - 232.43)^2 / 232.43 + (116 - 25.57)^2 / 25.57 ≈ 397.9

To determine whether this chi-square test statistic is statistically significant, we compare it to the critical chi-square value at a chosen significance level, with degrees of freedom equal to (number of rows - 1) * (number of columns - 1). In this case, we have (2 - 1) * (2 - 1) = 1 degree of freedom. At a significance level of 0.05, the critical chi-square value is approximately 3.841. Since the calculated chi-square value (397.9) > critical chi-square value (3.841), we reject the null hypothesis. Therefore, based on the chi-square independence test, hand preference and foot preference are not independent in the given study.
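
The Extra Credit calculations (conditional probabilities, Normal-distribution areas, and the chi-square statistic) can be checked numerically with the Python standard library alone. This is a sketch, not the method used in the original submission: the variable names and the `prob_between` helper are hypothetical, and the p-value line relies on the identity that holds for 1 degree of freedom only.

```python
from math import erfc, sqrt
from statistics import NormalDist

# Q.1: conditional probabilities from the pass/fail counts (out of 100 students).
passed = {"men": 42, "women": 32}
failed = {"men": 18, "women": 8}
p_pass = sum(passed.values()) / 100                        # 0.74
p_woman_given_pass = passed["women"] / sum(passed.values())
p_man_given_fail = failed["men"] / sum(failed.values())


# Q.2: probability that a Normal(mu, sd) variable falls between lo and hi.
def prob_between(lo, hi, mu, sd):
    dist = NormalDist(mu, sd)
    return dist.cdf(hi) - dist.cdf(lo)


mu, sigma = 145, 22
p_single = prob_between(135, 155, mu, sigma)                  # one plant
p_mean_16 = prob_between(135, 155, mu, sigma / sqrt(16))      # mean of 16 plants
p_mean_32 = prob_between(135, 155, mu, sigma / sqrt(32))      # mean of 32 plants

# Q.3: chi-square test of independence on the 2x2 table of observed counts.
observed = [[2012, 121],   # right foot: right hand, left hand
            [142, 116]]    # left foot:  right hand, left hand
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
# For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi2 / 2))

print(f"P(pass) = {p_pass:.2f}")
print(f"P(woman | pass) = {p_woman_given_pass:.3f}")   # about 0.432
print(f"P(man | fail) = {p_man_given_fail:.3f}")       # about 0.692
print(f"P(135 < X < 155) = {p_single:.3f}")            # about 0.351
print(f"n = 16: {p_mean_16:.3f}")                      # about 0.931
print(f"n = 32: {p_mean_32:.3f}")                      # about 0.990
print(f"chi2 = {chi2:.1f}, p-value = {p_value:.3g}")   # chi2 is about 397.9
```

The standard error shrinks as the sample size grows, so the same 135cm to 155cm interval captures a larger share of sample means for n = 32 than for n = 16. Both shares are larger than the share for a single plant.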