Predict your percentage using machine learning
Machine learning, a subfield of artificial intelligence, is growing so rapidly that it is finding its way into nearly every aspect of life. Its applications already include image recognition, stock market trading, traffic prediction, product recommendation, and online fraud detection, and the list of everyday problems it can help with keeps growing.
In this machine learning project, we predict a student's percentage score from the number of hours studied. We use simple linear regression to train the prediction model.
First, we import the necessary libraries and load the dataset. The dataset contains only two columns, Hours and Scores. Next, we check the data for missing values and then compute the correlation between the two variables. The correlation analysis gives a coefficient of about 0.97, which indicates a very strong positive linear relationship: students who study more hours tend to score higher. A coefficient of 0.97 measures the strength of the linear association; it is not a probability that a change in study hours will change the grade.
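The missing-value check and the correlation analysis described above can be sketched as follows. This is a minimal illustration, not the original analysis: the small DataFrame is a hypothetical stand-in for the real dataset (its values are invented), and only the column names Hours and Scores come from the text.

```python
import pandas as pd

# Hypothetical stand-in for the Hours/Scores dataset (values invented for illustration)
dataset = pd.DataFrame({
    "Hours": [1.5, 2.5, 3.2, 4.5, 5.5, 6.1, 7.4, 8.3, 9.2],
    "Scores": [20, 24, 30, 41, 56, 62, 70, 81, 88],
})

# Count the missing values in each column
missing_per_column = dataset.isnull().sum()
print(missing_per_column)

# Pearson correlation coefficient between study hours and scores
correlation = dataset["Hours"].corr(dataset["Scores"])
print(round(correlation, 2))
```

A value close to 1 means the points lie close to a rising straight line. With the real dataset, the same two calls would be applied to the DataFrame returned by pd.read_csv.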
Correlation and Causation
Although correlation tells us the degree of relationship between two or more variables, it says nothing about cause and effect. Correlation does not imply causation, although a causal relationship usually does show up as some statistical association between the variables. Let’s look at two examples.
A large number of firefighters at a fire indicates that the fire is big, but the firefighters did not cause the fire.
People who sleep with their shoes on are more likely to wake up with a headache. The shoes are not the cause; both are plausibly explained by a third factor, such as alcohol intoxication.
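The firefighter example can be illustrated with a small simulation. This is only a sketch under invented assumptions: the variable names, the coefficients 3 and 5, and the noise levels are all hypothetical. Fire size drives both the number of firefighters and the damage, so the two end up strongly correlated even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: fire size is the common cause of both variables
fire_size = rng.uniform(1, 10, size=500)
firefighters = 3 * fire_size + rng.normal(0, 1, size=500)
damage = 5 * fire_size + rng.normal(0, 2, size=500)

# Strong correlation between firefighters and damage
corr = np.corrcoef(firefighters, damage)[0, 1]
print(round(corr, 2))

# Remove the effect of fire size from both variables (using the known
# simulation coefficients); what remains is unrelated noise
firefighters_rest = firefighters - 3 * fire_size
damage_rest = damage - 5 * fire_size
corr_rest = np.corrcoef(firefighters_rest, damage_rest)[0, 1]
print(round(corr_rest, 2))
```

Once the common cause is accounted for, the correlation between the two variables almost disappears.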
A simple scatter plot with hours studied on the x-axis and test scores on the y-axis shows that the score increases steadily with the hours studied, which suggests a linear relationship between the two variables. When we fit a straight line through the points, most points do not lie exactly on it; the vertical distance between an actual score and the line is the error (residual).
The error is positive when the point lies above the line and negative when it lies below.
The equation of the line is Y = mX + c, where Y is the predicted value for a given value of X.
m is the slope of the line, that is, the change in Y divided by the change in X. It indicates how much Y increases for every unit increase in X.
c is the intercept, the point where the line crosses the Y-axis. It is a constant baseline: the value the model predicts for Y when X is zero.
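To make m, c, and the error concrete, here is a minimal sketch that computes the least-squares slope and intercept by hand with NumPy. The five data points are hypothetical and chosen only so that the numbers come out cleanly; the formulas are the standard least-squares ones, which is also what a simple linear regression fit computes.

```python
import numpy as np

# Hypothetical data points (invented for illustration)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([12.0, 25.0, 29.0, 44.0, 50.0])

# Least-squares slope: covariance of X and Y divided by variance of X
x_dev = hours - hours.mean()
y_dev = scores - scores.mean()
m = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)

# Intercept: the line passes through the point (mean of X, mean of Y)
c = scores.mean() - m * hours.mean()

# Error (residual) for each point: actual value minus value on the line
residuals = scores - (m * hours + c)

print(m, c)        # 9.5 3.5
print(residuals)   # positive above the line, negative below it
```

For a least-squares line with an intercept, the positive and negative errors cancel out, so the residuals sum to zero.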
Now we train our model with the scikit-learn library and find the values of the intercept (c) and the slope (m). Together, the slope and intercept define the linear relationship between the two variables and can be used to predict the score of a new student from his or her study hours. Say a student plans to study a total of 9.25 hours in preparation for the test. Putting the fitted values into the equation (Y = m * X + c) gives 9.77580339 * 9.25 + 2.48367341 = 92.91, which means the model predicts a score of about 92.91 for a student who studies 9.25 hours. This is the model's estimate of the score, not a probability.
Reading the fitted line at X = 9.25 on the x-axis and across to the y-axis gives the same predicted score of 92.91. We can use the equation to predict the score for any given number of study hours.
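The arithmetic above can be checked in a few lines of Python. The slope and intercept are the fitted values quoted in the text; the function name predict_score is a hypothetical helper introduced here for illustration.

```python
# Fitted values quoted in the text
slope = 9.77580339
intercept = 2.48367341

def predict_score(hours):
    """Apply the line equation Y = m * X + c (hypothetical helper)."""
    return slope * hours + intercept

predicted = predict_score(9.25)
print(round(predicted, 2))  # 92.91
```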
Performance of model
R-Squared for Goodness of Fit
R-squared is one of the most widely used metrics for evaluating how well a model fits the data. It is the proportion of the variance in the dependent variable that is explained by the independent variable. For a least-squares linear regression evaluated on its training data, it lies between 0 and 1; a value closer to 1 indicates a better fit.
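As a rough illustration of what this metric measures, R-squared can be computed by hand as 1 minus the ratio of the unexplained variation to the total variation. The actual and predicted values below are hypothetical; the variable names are chosen for this sketch only.

```python
import numpy as np

# Hypothetical actual scores and the scores predicted by a fitted line
actual = np.array([12.0, 25.0, 29.0, 44.0, 50.0])
predicted = np.array([13.0, 22.5, 32.0, 41.5, 51.0])

# Variation left unexplained by the model (sum of squared errors)
ss_res = np.sum((actual - predicted) ** 2)

# Total variation of the actual values around their mean
ss_tot = np.sum((actual - actual.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 0.975
```

The order of the two arrays matters here: the total variation is always measured on the actual values, not on the predictions.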
Root Mean Squared Error (RMSE)
RMSE is the square root of the mean of the squared errors. It indicates how close the predicted values are to the actual values, so a lower RMSE means better model performance. A key property of RMSE is that it is expressed in the same unit as the target variable.
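The definition translates directly into code. The sketch below uses hypothetical actual and predicted scores and follows the three steps in the name: square the errors, take their mean, then take the square root.

```python
import numpy as np

# Hypothetical actual and predicted scores (invented for illustration)
actual = np.array([12.0, 25.0, 29.0, 44.0, 50.0])
predicted = np.array([13.0, 22.5, 32.0, 41.5, 51.0])

errors = actual - predicted
mse = np.mean(errors ** 2)   # mean of the squared errors
rmse = np.sqrt(mse)          # back to the unit of the target variable

print(round(rmse, 2))  # 2.17
```

Because the result is in score points, an RMSE of about 2.17 here means the predictions are typically off by roughly two points.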
Python code
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the dataset
dataset = pd.read_csv('students_score.csv')
dataset.head(5)  # first 5 rows of the dataset
# Split the data into input and output variables
hours = dataset.iloc[:, [0]]  # input data (study hours)
score = dataset.iloc[:, [1]]  # output data (student score)
# Train the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(hours, score)
# Find the intercept and slope
print('Intercept c: ', model.intercept_)
print('Coefficient m: ', model.coef_)
# The model is trained; now predict the scores for the training inputs
predicted_score = model.predict(hours)
# Check the performance of the model
from sklearn.metrics import r2_score, mean_squared_error
# Both metrics take the actual values first and the predictions second
print('R-squared: ', r2_score(score, predicted_score))
print('Root mean squared error: ', np.sqrt(mean_squared_error(score, predicted_score)))
# Visualize the linear regression model
plt.scatter(hours,score, c='blue')
plt.plot(hours, predicted_score, c='black', linewidth=3)
plt.xlabel('Hours of study')
plt.ylabel('Student score')
plt.show()
# The model fits the data well
# Now predict the score for a student who studies 9.25 hours
test_hour = pd.DataFrame([[9.25]], columns=hours.columns)  # same column name as the training input
test_score = model.predict(test_hour)
print(test_score)