Faculty of ICT
Department of Artificial Intelligence
David Vella (12498M)
Gabriel Camilleri (21299M)
Table of Contents
Introduction
The Relationship between Gas Mileage and Transmission
    Summary
    Exploratory Analysis
    Regression Analysis
    Residual Analysis
    Diagnostics
    Conclusion
Normality Testing
Price vs Horsepower: Simple Linear Regression Investigation
    Simple Linear Regression
    Graphical Analysis
    Pearson’s Correlation Test
    Using Simple Linear Regression to approximate the dependent variable
    Conclusion
Introduction
Car datasets are now an important tool for improving the automotive industry and making cars more efficient. The dataset used here was collected through a number of trials and questionnaires, and several tests were performed on it. The testing focused on the following relationships:
Price and its relationship with Horsepower
City in contrast with Highway mpg
Gas mileage in relation with transmission
These are the variables the testing mainly focused on, although some other variables are involved alongside them. To perform the testing, several factors had to be considered, and for each factor pie charts and other graphs were used. For example, the difference between an automatic car and a manual car had to be clearly outlined, otherwise the results of the tests would not have any real meaning. Below are the charts which were used most:
As one can see, there are many factors to consider, and the testing and modelling were chosen with the relevant factors in mind.
Below is the link to the dataset which was used, followed by a picture of what it looks like:
The Relationship between Gas Mileage and Transmission
Summary
In this report we attempt to find out which transmission type, manual or automatic, gives better mpg, and to quantify the difference if one exists. A t-test is performed to check whether there is a difference between the two, and a regression analysis is then performed to find which factors are involved. Without the use of other variables, manual cars were found to be more fuel-efficient, with an average of about 3 more miles per gallon. On the other hand, after considering other features such as the power of the car, the relationship between miles per gallon and transmission type was found to weaken considerably, so the fuel efficiency of manual cars does not depend solely on their transmission type but on other variables too.
Exploratory Analysis
The ‘Automobile_data’ dataset has 205 observations of 28 variables. The variables this dataset contains include the following:
am: Transmission (0 = automatic, 1 = manual)
Upon importing the data set, a statistical analysis was carried out to identify the differences, if any, between the transmission types with regard to miles per gallon. First, assuming that the mpg data is normally distributed, a t-test was performed as shown below. The null hypothesis is that there is no difference between the mean mpg of manual vehicles and that of automatic vehicles.
> t.test(mpg ~ factor(am), data = Automobile_data)
Welch Two Sample t-test
data: mpg by factor(am)
t = -3.1335, df = 127.02, p-value = 0.002146
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
mean in group 0 mean in group 1
With the results obtained, the null hypothesis was rejected (p < .005): manual cars give better miles to the gallon than automatic cars, roughly 3 more miles to the gallon (manual = 29.85 mpg; automatic = 26.71 mpg). The difference between the two groups was illustrated with a plot generated using boxplot(mpg ~ factor(am), data = Automobile_data), which can be seen below. The mean mpg was used for each transmission type, with the standard deviation as an error bar. It is clear that vehicles with a manual transmission have better mpg on average than vehicles with an automatic transmission.
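The t-test step can be sketched on synthetic data. This is only an illustration: the data frame `toy` and its values are made up and are not Automobile_data, although the column names `am` and `mpg` follow the report.

```r
# Synthetic stand-in for the real data (assumption: values are invented).
set.seed(1)
toy <- data.frame(
  am  = rep(c(0, 1), times = c(60, 40)),   # 0 = automatic, 1 = manual
  mpg = c(rnorm(60, mean = 27, sd = 6),    # automatic cars
          rnorm(40, mean = 30, sd = 6))    # manual cars
)

# Welch two-sample t-test; t.test() does not assume equal variances by default.
welch <- t.test(mpg ~ factor(am), data = toy)

# Group means and their difference (manual minus automatic).
group_means <- tapply(toy$mpg, toy$am, mean)
mean_diff   <- unname(group_means["1"] - group_means["0"])

print(welch$p.value)
print(mean_diff)
```

The p-value is compared with the chosen significance level, and `mean_diff` gives the size of the difference in the same units as mpg.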
Regression Analysis
While we can be confident that there is a difference between manual and automatic transmission vehicles, we cannot be sure whether it depends solely on the transmission type without any other influencing variables. To control for the influence of other variables, we applied multiple regression to the data set. After obtaining this model, which controls for another variable, we used anova to compare it with a model involving only one predictor, which considers only the relationship between gas mileage and transmission.
lm(formula = mpg ~ factor(am), data = Automobile_data)
Min 1Q Median 3Q Max
-14.8554 -4.8554 -0.3554 4.2869 21.6446
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.7131 0.5885 45.392 < 2e-16 ***
factor(am)1 3.1423 0.9249 3.398 0.000818 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
Residual standard error: 6.5 on 203 degrees of freedom
Multiple R-squared: 0.0538,Adjusted R-squared: 0.04914
F-statistic: 11.54 on 1 and 203 DF, p-value: 0.0008182
According to this model, there is a relationship between transmission type and mpg, leading us to believe that manual transmission vehicles get roughly 3 mpg more than automatic transmission vehicles. However, when this is compared to the multivariate model, which also controls for weight, different results are obtained, as can be seen below.
lm(formula = mpg ~ factor(am) + weight, data = Automobile_data)
(Intercept) factor(am)1 weight
52.392261 1.621038 -0.009807
Although the first model led us to believe that manual transmission vehicles give about 3 more miles to the gallon, this more accurate model shows that, once weight is controlled for, manual transmission vehicles get only around 1.621 more miles per gallon. Comparing the two models using anova shows that the difference between them is significant; hence the multivariate model is more accurate and the univariate model is not used further.
> multi <- lm(mpg ~ factor(am) + weight, data = Automobile_data)
> uni <- lm(mpg ~ factor(am), data = Automobile_data)
> anova(uni, multi)
Analysis of Variance Table
Model 1: mpg ~ factor(am)
Model 2: mpg ~ factor(am) + weight
Res.Df RSS Df Sum of Sq F Pr(>F)
1 203 8577.2
2 202 3372.0 1 5205.2 311.82 < 2.2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
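The comparison of a one-predictor model with a two-predictor model can be sketched as follows. The data are synthetic and the assumed link between transmission and weight is only for illustration; the model formulas mirror the ones in the report.

```r
set.seed(2)
n   <- 120
toy <- data.frame(am = rbinom(n, 1, 0.4))   # 0 = automatic, 1 = manual

# Assumption for illustration: manual cars are lighter on average.
toy$weight <- 2600 - 300 * toy$am + rnorm(n, sd = 350)
toy$mpg    <- 52 - 0.01 * toy$weight + 1.5 * toy$am + rnorm(n, sd = 3)

uni   <- lm(mpg ~ factor(am), data = toy)            # transmission only
multi <- lm(mpg ~ factor(am) + weight, data = toy)   # transmission + weight

# F-test for nested models: does adding weight reduce the residual sum of squares?
cmp <- anova(uni, multi)
print(cmp)

# Transmission coefficient before and after controlling for weight.
print(coef(uni)["factor(am)1"])
print(coef(multi)["factor(am)1"])
```

Because `uni` is nested inside `multi`, the residual sum of squares of `multi` can never be larger; the F-test indicates whether the reduction is more than would be expected by chance.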
Residual Analysis
To validate this model, the residuals must be examined; these are the vertical distances between the data points and the fitted regression line. By making use of residual plots, we were able to assess whether the residuals are consistent with random error. In short, the model is valid if the residuals are randomly scattered. Here there is a pattern in the residuals, so the model can still be improved.
Diagnostics
By looking at the leverage, we select the observations with the largest hat values:
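A residuals-versus-fitted plot is one common way to carry out this check. The sketch below uses synthetic data generated from a truly linear model, so its residuals should look random; the variable names are assumptions chosen to match the report.

```r
set.seed(3)
n      <- 100
weight <- runif(n, 1500, 4000)
am     <- rbinom(n, 1, 0.4)
mpg    <- 52 - 0.01 * weight + 1.5 * am + rnorm(n, sd = 3)

fit <- lm(mpg ~ factor(am) + weight)

res         <- resid(fit)    # observed minus fitted values
fitted_vals <- fitted(fit)

# A random horizontal band around zero supports the model;
# a curve or funnel shape suggests the model can be improved.
plot(fitted_vals, res,
     xlab = "Fitted mpg", ylab = "Residual", main = "Residuals vs fitted")
abline(h = 0, lty = 2)
```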
199 201 143
0.05951523 0.04692052 0.04264305
Next, we examined the influential points by selecting the observations with the highest dfbetas.
> influential <- dfbetas(multi)
> head(sort(influential[, "factor(am)1"], decreasing = TRUE), 3)
1 3 5
0.2535038 0.2532113 0.2174957
Below is a visual comparison of the diagnostics above.
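Both diagnostics can be reproduced with base R functions. The following is a sketch on synthetic data; the model name `multi` follows the report, but the data frame `toy` is invented.

```r
set.seed(4)
n   <- 80
toy <- data.frame(am = rbinom(n, 1, 0.4), weight = runif(n, 1500, 4000))
toy$mpg <- 52 - 0.01 * toy$weight + 1.5 * toy$am + rnorm(n, sd = 3)

multi <- lm(mpg ~ factor(am) + weight, data = toy)

# Leverage: diagonal of the hat matrix, one value per observation.
lev          <- hatvalues(multi)
top_leverage <- head(sort(lev, decreasing = TRUE), 3)

# Influence: standardised change in each coefficient when one observation is dropped.
influential <- dfbetas(multi)
top_dfbetas <- head(sort(influential[, "factor(am)1"], decreasing = TRUE), 3)

print(top_leverage)
print(top_dfbetas)
```

The hat values always sum to the number of estimated coefficients (three here), which gives a quick sanity check on the output.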
Conclusion
Vehicles with a manual transmission do get better miles to the gallon, yet this is not solely because of their transmission type. Based on what was found above, there are other variables which influence the outcome. As a result, the relationship between gas mileage and transmission type is not what it first appears to be, and further analysis is required to determine more precisely which factors have the biggest influence on gas mileage.
Normality Testing
Statistical tests such as correlation, regression, the t-test and analysis of variance (ANOVA) assume that certain characteristics of the data are met. One such characteristic is that the data follows a normal (Gaussian) distribution. Such tests are known as parametric tests, since their validity depends on the distribution of the data. [1]
Normality tests are the first step in inferential statistics. The idea is that if the data is shown to be normally distributed, parametric tests can be conducted later on. If not, non-parametric tests such as the Mann-Whitney U test and the Friedman test are conducted instead.
First and foremost, a visual inspection must be performed to see whether the data is normally distributed. The data provided in this scenario was Automobile_data. The columns of this data frame were the following: symbolling, normalized.losses, make (alfa romeo, audi and some other types), fuel.type, aspiration, num.of.doors, body.style, drive.wheels, engine.location, wheel.base, length, width, height, curb.weight, engine.type, num.of.cylinders, engine.size, fuel.system, bore, stroke, compression.ratio, horsepower, peak.rpm, city.mpg, highway.mpg, price, Transmission. A step-by-step account of how the testing was performed now follows.
First, load the data frame into RStudio. The horsepower and price columns must then be converted to a numeric type so that the data can be analysed; this coercion replaces the question marks with NAs. In this test, the horsepower and price values for manual and automatic cars are compared in terms of their normality. At the end we also examine whether there is some similarity between horsepower and price, that is, whether they are proportional to each other. To do this, the horsepower and price columns were put into two separate variables called “man” (short for manual) and “automatic”.
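The coercion and the split by transmission can be sketched as below. The data frame `Automobile_toy` is a tiny invented example, and the labels "manual" and "automatic" in the Transmission column are an assumption, since the report does not show how that column is coded.

```r
# Tiny hypothetical data frame that mimics the "?" placeholders; not the real data.
Automobile_toy <- data.frame(
  horsepower   = c("111", "154", "?", "102", "115"),
  price        = c("13495", "16500", "13950", "?", "17450"),
  Transmission = c("manual", "automatic", "manual", "automatic", "manual"),
  stringsAsFactors = FALSE
)

# as.numeric() turns "?" into NA and emits a coercion warning, silenced here.
# If a column was read in as a factor, convert with as.character() first.
Automobile_toy$horsepower <- suppressWarnings(as.numeric(Automobile_toy$horsepower))
Automobile_toy$price      <- suppressWarnings(as.numeric(Automobile_toy$price))

# One data frame per transmission type.
man       <- subset(Automobile_toy, Transmission == "manual")
automatic <- subset(Automobile_toy, Transmission == "automatic")

print(man)
print(automatic)
```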
Cars Normality Testing for Price
To perform the visual analysis, certain packages must be installed: in this case dplyr, devtools and ggpubr. We start by analysing the prices of automatic cars. A sample of 10 rows is displayed with dplyr::sample_n(automatic, 10). Note that if the sample size n is larger than 30, the distribution of the data can often be ignored and parametric tests used regardless. This follows from the central limit theorem, which states that whatever distribution the data has, the sampling distribution of the mean tends to be approximately normal when n is large (n > 30 is a common rule of thumb).
The ggpubr package is then loaded with library("ggpubr") and the density plot of automatic cars’ prices is drawn using ggdensity(automatic$price, main = "Density plot of price", xlab = "price"). The analysis of this graph is simply to check whether it has a bell shape. In this case it did not, which was the first indication that the data is not normally distributed. A Q-Q plot is then drawn for the same variable with ggqqplot(automatic$price). This graph also showed large deviations from the reference line, which is further evidence against normality.
However, the graphs alone should not be relied on too heavily; they should be followed by either the Shapiro-Wilk test or the Kolmogorov-Smirnov test. The Shapiro-Wilk test is preferred since it is more powerful than the Kolmogorov-Smirnov test: shapiro.test(automatic$price). This test returned a p-value of 6.78e-12, which is less than 0.05, and therefore we reject the hypothesis that this data is normally distributed.
Next, the normality test for the price of manual cars was conducted. The same procedure as before was performed, this time on the “man” variable. First the density plot was displayed using the command ggdensity(man$price, main = "Density plot of price of manual cars", xlab = "Price"); this was the first indication against normality, since it had no bell shape. Then the Q-Q plot was displayed with ggqqplot(man$price). This was the second indication against normality, due to a number of extreme deviations. To base our analysis on more solid testing, the Shapiro-Wilk test was then performed; it gave a p-value of 5.921e-9, which was the final, conclusive indication against normality. In this case the normality tests on both manual and automatic cars rejected normality. Could the same be true of horsepower?
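The same three checks (density plot, Q-Q plot, Shapiro-Wilk test) can be sketched with base R only, as a stand-in for the ggpubr functions. The `price` vector here is synthetic and deliberately right-skewed; it is not the real price column.

```r
set.seed(6)
# Synthetic right-skewed "prices" (log-normal), an assumption for illustration.
price <- rlnorm(120, meanlog = 9.3, sdlog = 0.5)

# Visual checks: a bell-shaped density and points close to the Q-Q line
# would both be consistent with normality.
plot(density(price), main = "Density plot of price", xlab = "price")
qqnorm(price)
qqline(price)

# Shapiro-Wilk test: the null hypothesis is that the data is normally distributed.
sw <- shapiro.test(price)
print(sw$p.value)

# Decision at the 0.05 significance level.
reject_normality <- sw$p.value < 0.05
print(reject_normality)
```

A p-value below 0.05 leads to rejecting normality, in which case a non-parametric test is the safer choice.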
Cars Normality Testing for Horsepower
Normality tests for horsepower were then conducted. The first test was performed on automatic cars. The density plot was displayed with the command ggdensity(automatic$horsepower, main = "Density plot of horsepower automatic", xlab = "horsepower"), followed by the Q-Q plot with the command ggqqplot(automatic$horsepower). Both plots suggested a departure from normality, so the Shapiro-Wilk test was used to confirm this: shapiro.test(automatic$horsepower). This gave a p-value of 9.32e-10, which indicates that the horsepower data for automatic cars is not normally distributed.
The same procedure was performed on the manual cars. Both the density plot and the Q-Q plot indicated a departure from normality, using the commands ggdensity(man$horsepower, main = "Density plot of horsepower manual", xlab = "horsepower") and ggqqplot(man$horsepower) respectively. The Shapiro-Wilk test gave further evidence against normality, returning a p-value of 0.000548.
If we compare the horsepower and price data we can see some similarities. For example, the Q-Q plots were similar for the automatic cars as well as for the manual cars, although the density graphs naturally differ from each other. It is therefore possible that horsepower and price are directly proportional to each other.
Proportionality between Horsepower and Price Testing
To check whether two variables (vectors) are related to each other, one can use a scatter plot. It can be used for two continuous variables, whether one depends on the other or both are independent. To check whether the two variables are correlated, the shape of the point cloud is examined, and depending on the correlation value specific conclusions can be drawn. Below is the code used to draw the scatter plots.
plot(automatic$horsepower, automatic$price, main="Horsepower vs Price", xlab="horsepower", ylab="price")
plot(man$horsepower, man$price, main="Horsepower vs Price", xlab="horsepower", ylab="price")
plot(Automobile_data$horsepower, Automobile_data$price, main="Horsepower vs Price", xlab="horsepower", ylab="price")
City vs Highway mpg testing
In the “City vs Highway MPG” scatter plot, a roughly proportional relationship between the two appears. To check this, a scatter plot with a fitted line (abline) was drawn. Although the two seem to be related, one factor may affect the relationship: where the mpg is measured, in the city or on the highway. Since the same car is driven on different roads and its mpg varies, the two measurements are paired, so a Wilcoxon signed-rank test was chosen to test this. First, a Shapiro-Wilk test was applied to both variables to check whether they are normally distributed. Both had a p-value of less than 0.05, so neither is normally distributed, and we can proceed with the Wilcoxon signed-rank test. The hypotheses to be tested are:
H0: There is no difference in median mpg between driving in the city and on the highway.
H1: The median mpg when driving on the highway is greater than when driving in the city.
The following are the commands entered together with the results RStudio gave us.
The p-value of the wilcox.test was far below 0.05; therefore we reject H0 in favour of H1, which means that the median mpg on the highway is greater than that in the city.
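A paired, one-sided Wilcoxon signed-rank test of this kind can be sketched as follows. The two vectors are synthetic and are built so that highway mpg exceeds city mpg; the names follow the report's column names, but the exact call used by the authors is not shown, so the arguments here are assumptions.

```r
set.seed(7)
n        <- 50
city.mpg <- round(runif(n, 15, 40))
# Assumption for illustration: highway mpg is a few mpg above city mpg.
highway.mpg <- city.mpg + sample(2:8, n, replace = TRUE)

# paired = TRUE because both values come from the same car;
# alternative = "greater" tests whether highway mpg tends to exceed city mpg;
# exact = FALSE uses the normal approximation, since the data contains ties.
wt <- wilcox.test(highway.mpg, city.mpg,
                  paired = TRUE, alternative = "greater", exact = FALSE)

print(wt$p.value)
```

The order of the first two arguments matters for a one-sided test: with `alternative = "greater"`, the first vector is the one hypothesised to be larger.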
Price vs Horsepower: Simple Linear Regression Investigation
Simple Linear Regression
Simple Linear Regression is a tool used to study the relationship between two variables. Both variables must be continuous. The variable on the y-axis is the dependent variable, also known as the target, and the variable on the x-axis is the independent variable, which is used to predict y. This is a very common type of regression used to predict values, usually numeric. It is called ‘Simple’ since the outcome variable is related to just a single predictor. The ordinary least squares method is used to fit the line: the fit of the line is measured by the sum of squared residuals, and the goal is to make this sum as small as possible.
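For a single predictor, the least squares slope and intercept have closed forms, which can be checked against lm(). The sketch below uses synthetic x and y values that only loosely imitate horsepower and price.

```r
set.seed(8)
x <- runif(60, 50, 250)                        # synthetic horsepower
y <- -4500 + 170 * x + rnorm(60, sd = 4000)    # synthetic price

# Closed-form ordinary least squares estimates.
b1 <- cov(x, y) / var(x)         # slope
b0 <- mean(y) - b1 * mean(x)     # intercept

# Sum of squared residuals, the quantity that least squares minimises.
sse <- sum((y - (b0 + b1 * x))^2)

# lm() should give the same coefficients and the same sum of squares.
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```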
There are different types of relationship between two variables. A deterministic relationship is one where the variables are perfectly related. For example, the relationship between miles and kilometres is deterministic: there is an exact linear relationship between them, so if one is known the other can be found. A statistical (non-deterministic) relationship is one where the variables are not perfectly related, for example height and weight. As height increases, weight is expected to increase, but not perfectly. The same applies to the relationship between driving speed and gas mileage: as driving speed increases, gas mileage is expected to decrease, although not perfectly. Our aim is therefore to fit a linear relationship between the predictor and the response variable for relationships of this statistical type.
Graphical Analysis
A scatter plot is used to visualise the relationship between the predictor and the response variable. In our test, the target (dependent variable y) is the price, and the independent variable x is the horsepower. With Simple Linear Regression we therefore hope to find an equation which predicts the price from the horsepower as accurately as possible. Simple Linear Regression is then conducted only if reasonably tight clustering around a line is seen in the scatter plot; otherwise the results will not be accurate, because the relationship between the two variables is too far from linear.
The command abline(m1, col="red", lty=2, lwd=2) is used to draw the fitted regression line, with the colour of the line set to red. This helps us to visualise the average linear relationship between the two variables.
Pearson’s Correlation Test
This test measures how strong the linear relationship between the two variables is. The result lies between -1 and +1, where +1 is a perfect positive linear relationship, -1 is a perfect negative linear relationship, and 0 means there is no linear relationship.
The value obtained from the test was 0.81, which is close to +1, meaning that the two variables have a strong positive linear relationship. Thus the data can be approximated well using a straight line, and predictions should be reasonably accurate.
The equation of the line is written not as y = mx + c but as y hat = b0 + b1x, where x is the independent variable and y is the dependent variable. Here y = price and x = horsepower.
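Pearson's correlation can be computed with cor.test(). This sketch uses synthetic horsepower and price values, so the coefficient it prints is not the report's 0.81; it also shows that a coefficient of -1 corresponds to a perfect decreasing line.

```r
set.seed(9)
horsepower <- runif(100, 50, 250)                             # synthetic
price      <- -4500 + 170 * horsepower + rnorm(100, sd = 6000)

ct <- cor.test(horsepower, price, method = "pearson")
r  <- unname(ct$estimate)    # correlation coefficient, between -1 and +1

print(r)
print(r^2)         # share of variance explained by a straight-line fit
print(ct$p.value)  # tests the null hypothesis of zero correlation

# A perfectly decreasing straight line gives a coefficient of -1.
r_negative <- cor(1:10, -(1:10))
print(r_negative)
```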
Using Simple Linear Regression to approximate the dependent variable
The command m1 <- lm(price ~ horsepower, data = Automobile_data) is used to fit the model.
Call is the call that we made.
Residuals contains the minimum, quartiles and maximum of the residuals.
Coefficients: the intercept is b0, whilst the horsepower coefficient is b1.
The equation to make predictions: y hat = -4562.18 + 172.206x (input the horsepower value for x; the result, y hat, is the predicted price for that horsepower value).
The F-statistic tests the overall significance of the model.
Example: To predict the price when the horsepower is 100:
Price hat = -4562.175 + 172.206(100)
Price hat = 12,658.43
Therefore, for a 100-horsepower car, the predicted price is 12,658.43. Using the interval argument of predict(), the lower and upper bounds of the prediction are also given.
Conclusion
The result obtained was realistic. This is supported by the Pearson’s correlation test result of 0.81, which is close to +1: the fitted equation produces close approximations, meaning that a close-to-linear relationship between price and horsepower exists.
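A prediction with bounds can be sketched with predict(). The model below is fitted to synthetic data, so its coefficients differ from the report's; the last lines simply redo the report's own arithmetic with the coefficients it quotes.

```r
set.seed(10)
toy <- data.frame(horsepower = runif(100, 50, 250))          # synthetic
toy$price <- -4500 + 170 * toy$horsepower + rnorm(100, sd = 4000)

m1 <- lm(price ~ horsepower, data = toy)

# Predicted price for a 100-horsepower car, with a 95% prediction interval.
new_car <- data.frame(horsepower = 100)
pred    <- predict(m1, newdata = new_car, interval = "prediction", level = 0.95)
print(pred)   # columns: fit, lwr, upr

# The same point prediction computed by hand from the coefficients.
by_hand <- unname(coef(m1)[1] + coef(m1)[2] * 100)

# The report's arithmetic, using the coefficients it quotes.
report_price <- -4562.175 + 172.206 * 100
print(report_price)
```

Using interval = "confidence" instead would give bounds for the mean price at that horsepower, which are narrower than the bounds for a single car.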
[1] Hypothesis testing guide: http://biometry.github.io/APES/Stats/stats12-basic_tests.html
[2] Normality testing information: http://www.sthda.com/english/wiki/normality-test-in-r