Linear Regression: A Beginner’s First Step in Machine Learning


MERN Stack Developer, Machine learning & Deep Learning

For a machine learning enthusiast, Linear Regression is an ideal starting point: it is simple, and it shows clearly how models work and the fundamental logic behind predictions. In this blog, we'll explore both simple and multiple linear regression from scratch. Let’s dive in!

What is Regression?

Regression is a fundamental concept in machine learning used to predict continuous values. In simple terms, it models the relationship between a dependent variable (the outcome or target) and one or more independent variables (the predictors or features). The goal of regression is to find the best-fitting line or curve that predicts the dependent variable based on the values of the independent variables.

For example, regression can be used to predict:

  • Housing prices based on factors like square footage, number of rooms, and location.

  • Stock prices based on historical data and market indicators.

  • Temperature trends over time.

To understand this better, let’s start with Linear Regression.

Linear Regression

In Linear Regression, we aim to find the line that best fits the data points by determining the values of m (slope) and c (intercept) that minimize the error between the predicted values and the actual values.

In short, Linear Regression finds the optimal line (or hyperplane in higher dimensions) that predicts continuous outcomes based on the given data.

Simple Linear Regression

Simple Linear Regression is a method for predicting a quantitative outcome (response variable Y) using a single predictor variable X. The model assumes a linear relationship between X and Y, meaning the change in Y is proportional to the change in X.

The simple linear regression model is represented as:

𝒀 ≈𝛽0 + 𝛽1 𝑋

𝛽0 is known as Intercept, 𝛽1 is known as slope

Our goal in Simple Linear Regression is to find the best estimates for 𝛽0 and 𝛽𝟏 that make the linear model fit the data well. This is done using training data consisting of n=506 data points.

Data ⇒ (𝑥1, 𝑦1) , (𝑥2, 𝑦2) , (𝑥3, 𝑦3) , ………………… (𝑥506, 𝑦506)

Let’s call the predicted 𝑦 value y’:

𝑦’1 = 𝛽0 + 𝛽𝟏𝑥1

𝑦’2 = 𝛽0 + 𝛽𝟏𝑥2

……….

𝑦’506 = 𝛽0 + 𝛽𝟏𝑥506

Residual Sum of Squares (RSS)

The “residual” is the difference between the actual value of Y for the i-th observation and the predicted value y’ from our model:

ei = yi − y’i

The goal is to minimize the sum of the squared residuals, known as the Residual Sum of Squares (RSS):

RSS = e1² + e2² + … + e506² = Σ (yi − 𝛽0 − 𝛽1𝑥i)²

The least squares approach chooses 𝛽0 and 𝛽1 to minimize the RSS. Using some calculus, the minimizers are:

𝛽1 = Σ (𝑥i − x̄)(yi − ȳ) / Σ (𝑥i − x̄)²

𝛽0 = ȳ − 𝛽1 x̄

where x̄ and ȳ are the sample means of 𝑥 and 𝑦.

This process ensures the model captures the linear relationship between X and Y as accurately as possible given the data.
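As a rough illustration of the least squares step, here is a minimal pure-Python sketch. It uses the standard closed-form solution (slope = covariance of x and y divided by the variance of x, and a line that passes through the point of means) on a tiny made-up dataset, not the 506-point dataset used in this blog. The function names are hypothetical helpers chosen for this example.

```python
# Closed-form least squares for simple linear regression (toy data only).

def fit_simple_linear_regression(xs, ys):
    """Return (beta0, beta1) that minimise the residual sum of squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: the fitted line passes through (x_mean, y_mean).
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


def residual_sum_of_squares(xs, ys, beta0, beta1):
    """Sum of squared differences between actual and predicted y."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # made-up points lying exactly on y = 1 + 2x

beta0, beta1 = fit_simple_linear_regression(xs, ys)
rss = residual_sum_of_squares(xs, ys, beta0, beta1)
print("intercept:", beta0, "slope:", beta1, "RSS:", rss)
```

Because the toy points lie exactly on a line, the fit recovers that line and the RSS is zero; with real data the RSS would be positive.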

Assessing the Accuracy of Simple Linear Regression:

To evaluate the accuracy of the Simple Linear Regression model, we assume the relationship between X and Y follows this form: Y = f(X) + ε

  • f(X) is an unknown true function relating X and Y,

  • ϵ is a random error term, which follows a normal distribution with a mean of zero

If we approximate f(X) with a linear function, the model becomes: Y=β0+β1X+ε

𝛽0 is known as Intercept, 𝛽1 is known as slope and ε is an error term.

Standard Error in Coefficients:

The Standard Error (SE) of the coefficients β0​ and β1 represents the uncertainty in these estimates. Essentially, the SE quantifies how much the coefficient estimates might vary due to random noise in the data. Lower SE means higher confidence in the accuracy of the coefficient estimates.

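SE(β0)² = σ² [ 1/n + x̄² / Σ (xi − x̄)² ]

SE(β1)² = σ² / Σ (xi − x̄)²

where σ² = Var(ε) is the variance of the error term and x̄ is the mean of the x values. In practice σ is unknown, so it is estimated from the data using the Residual Standard Error (RSE).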

For β1 (the slope), the SE can be used to compute a confidence interval. Assuming a normal distribution, there is approximately a 95% chance that the true value of β1​ lies within β1±2×SE(β1)

By calculating RSE and the standard error for coefficients, we can assess how well the model fits the data and how reliable our estimates of β0​ and β1​ are. A lower RSE and a smaller SE for the slope indicate that the model does a good job predicting Y based on X, while also giving us confidence in the estimated relationship between the two variables.
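To make the slope's standard error and the "±2×SE" interval concrete, here is a small sketch on made-up data. It assumes the usual textbook formula SE(β1)² = σ² / Σ (xi − x̄)², with σ² estimated as RSS / (n − 2). The helper name is hypothetical, and with only five points the ±2×SE rule is a loose approximation.

```python
import math


def slope_with_standard_error(xs, ys):
    """Return (beta1, se_beta1) for a simple linear regression fit."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    # Residual sum of squares of the fitted line.
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    # Estimate of Var(eps); n - 2 because two coefficients were estimated.
    sigma2_hat = rss / (n - 2)
    se_beta1 = math.sqrt(sigma2_hat / sxx)
    return beta1, se_beta1


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]  # made-up, noisy points

beta1, se_beta1 = slope_with_standard_error(xs, ys)
ci_low = beta1 - 2 * se_beta1
ci_high = beta1 + 2 * se_beta1
print("slope:", beta1, "SE:", se_beta1, "approx. 95% CI:", (ci_low, ci_high))
```

A narrow interval that stays away from zero suggests the slope is estimated reliably; a wide interval means the data leave a lot of uncertainty about it.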

Quality of Fit in Linear Regression:

When assessing the quality of a linear regression model, two important metrics are typically used: Residual Standard Error (RSE) and the R² statistic. Both provide insights into how well the model fits the data.

Residual Standard Error (RSE):

The RSE is a measure of how well the model captures the true relationship. Roughly, it is the average amount by which the observed values deviate from the values predicted by the model. Specifically, the RSE is an estimate of the standard deviation of the error term ε:

RSE = √(RSS / (n − 2))

Keep in mind:

  • Lower RSE indicates a better fit, meaning the model's predictions are close to the actual data points.

  • Higher RSE means the model has more error, and the fit is not as accurate.
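The RSE can be computed directly from actual and predicted values. The sketch below assumes the common definition RSE = √(RSS / (n − p − 1)), where p is the number of predictors (p = 1 for simple linear regression). The function name and the numbers are made up for illustration.

```python
import math


def residual_standard_error(y_true, y_pred, n_predictors=1):
    """Estimate the standard deviation of the error term from the residuals."""
    rss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Degrees of freedom: observations minus estimated coefficients
    # (one per predictor, plus the intercept).
    dof = len(y_true) - n_predictors - 1
    return math.sqrt(rss / dof)


y_true = [2.0, 4.0, 5.0, 4.0, 5.0]   # made-up observed values
y_pred = [2.8, 3.4, 4.0, 4.6, 5.2]   # made-up predictions from a fitted line

rse = residual_standard_error(y_true, y_pred)
print("RSE:", rse)
```

The RSE is expressed in the same units as Y, so whether a given value is "low" depends on the scale of the target.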

R² Statistic:

The R² statistic measures the proportion of the variance in the dependent variable Y that is explained by the independent variable X in the model. In other words, it tells us how well the model captures the variability in the data.

R² = (TSS − RSS) / TSS = 1 − RSS / TSS

where:

  • RSS is the Residual Sum of Squares (unexplained variation),

  • TSS is the Total Sum of Squares (total variation in Y).

  • Total Sum of Squares (TSS): The TSS measures the total variance in Y around its mean. It represents how much variation exists in the data before accounting for the model.

    TSS = Σ (yi − ȳ)², where ȳ is the mean of the observed y values.

  • Residual Sum of Squares (RSS): RSS measures the variance that is not explained by the model. It is the sum of squared residuals, that is, the squared differences between the actual and predicted values.

    RSS = Σ (yi − y’i)²

Keep in mind:

  • R² takes a value between 0 and 1.

  • 0: The model explains none of the variance in Y (very poor fit).

  • 1: The model explains all the variance in Y (perfect fit).

  • A higher R² value indicates that the model explains a larger portion of the variability in the data, leading to a better fit.
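The definition R² = 1 − RSS / TSS translates almost line for line into code. This is a minimal sketch with made-up numbers; the function name is a hypothetical helper, not a library call.

```python
def r_squared(y_true, y_pred):
    """Proportion of the variance in y_true explained by the predictions."""
    y_mean = sum(y_true) / len(y_true)
    # Total variation of y around its mean.
    tss = sum((y - y_mean) ** 2 for y in y_true)
    # Variation left unexplained by the model.
    rss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - rss / tss


y_true = [2.0, 4.0, 5.0, 4.0, 5.0]   # made-up observed values
y_pred = [2.8, 3.4, 4.0, 4.6, 5.2]   # made-up predictions from a fitted line

r2 = r_squared(y_true, y_pred)
r2_perfect = r_squared(y_true, y_true)          # predictions equal the data
r2_mean_only = r_squared(y_true, [4.0] * 5)     # always predict the mean of y
print(r2, r2_perfect, r2_mean_only)
```

The two extra calls show the endpoints described above: predicting the data exactly gives 1, and always predicting the mean gives 0.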

Multiple Linear Regression

In Multiple Linear Regression, we use more than one predictor (independent) variable to predict the response (dependent) variable. The relationship between the response variable Y and the predictor variables X1, X2, …, Xp can be expressed as:

Y = β0 + β1X1 + β2X2 + … + βpXp + ε

where:

  • Y is the response variable,

  • X1,X2,…,Xp are the predictor variables,

  • β0​ is the intercept (the expected value of Y when all X's are 0),

  • β1,β2,…,βp​ are the coefficients or slopes for each predictor variable,

  • ϵ is the error term, representing the variability in Y that cannot be explained by the model.
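To see the multiple regression model in action, the sketch below generates synthetic data from a known equation and recovers the coefficients with NumPy's least squares solver. The data, the "true" coefficients (3.0, 1.5, −2.0) and the variable names are all assumptions made for this example; they have nothing to do with the house price dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two synthetic predictors and a response built from a known linear model.
X = rng.normal(size=(n, 2))
noise = rng.normal(scale=0.1, size=n)
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + noise

# Prepend a column of ones so the first coefficient acts as the intercept.
design = np.column_stack([np.ones(n), X])

# Least squares solution: minimises the residual sum of squares.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print("estimated [beta0, beta1, beta2]:", beta_hat)
```

The estimates should land close to the values used to generate the data, with small differences caused by the noise term.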

F-Statistic in Multiple Linear Regression:

The F-statistic is a key metric used to assess the overall significance of a multiple linear regression model. It helps determine whether the model provides a better fit to the data than a model that contains no predictor variables (i.e., a model with only the intercept).

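The F-statistic is calculated as:

F = [(TSS − RSS) / p] / [RSS / (n − p − 1)]

where n is the number of observations and p is the number of predictors.

Keep in mind: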

  • High F-Statistic: If the F-statistic is large, it suggests that the predictor variables significantly improve the model's ability to predict the response variable Y. In other words, at least one of the predictors is statistically significant.

  • Low F-Statistic: If the F-statistic is close to 1, it implies that the predictor variables do not provide much additional explanatory power beyond the intercept-only model.
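The contrast between a high and a low F-statistic can be sketched with synthetic data. The code below assumes the standard formula F = [(TSS − RSS) / p] / [RSS / (n − p − 1)] and compares a response that really depends on the predictors with one that is pure noise. All data and names are made up for illustration.

```python
import numpy as np


def f_statistic(X, y):
    """Overall F-statistic of a linear model with intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta_hat) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    return ((tss - rss) / p) / (rss / (n - p - 1))


rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))

# Response that depends on both predictors.
y_signal = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)
# Response that is unrelated to the predictors.
y_noise = rng.normal(size=n)

f_signal = f_statistic(X, y_signal)
f_noise = f_statistic(X, y_noise)
print("F with real relationship:", f_signal)
print("F with no relationship:  ", f_noise)
```

When the predictors matter, the F-statistic is far above 1; when they do not, it tends to stay near 1.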

Note: This was the heavy mathematics behind training and assessing a linear regression model; if you didn’t understand everything, don’t worry—Python libraries handle it for you. Just focus on the key takeaways!

Implementation of Simple Linear Regression

The dataset we'll be working with is the house_price dataset, which we previously used in the last article on data preprocessing. Since we've already covered that topic, we won’t go over it again here. Instead, let’s dive into our first model: simple linear regression.

The Code and Dataset for Linear Regression is uploaded here:

GitHub Link

There are two ways to implement Simple Linear Regression.

  1. OLS regression using the statsmodels library:

OLS (Ordinary Least Squares) is a method used in statistics to estimate the parameters (slope and intercept) of a linear regression model. Given a dataset with dependent variable y and independent variable x, OLS finds the parameters β0 (intercept) and β1 (slope) that minimize the residual sum of squares (RSS).

Here is how to perform OLS regression using the statsmodels library in Python. Let’s break it down step by step:

import statsmodels.api as sm

This imports the statsmodels library, which is widely used for statistical modeling, including regression analysis. The api module provides access to high-level functions like OLS.

X = sm.add_constant(df["room_num"])

Our x variable is the number of rooms (room_num).

sm.add_constant(): This function adds a constant (intercept term) to the model. In regression analysis, it's common to include a constant (bias) so that the line doesn't have to pass through the origin. By adding this constant, the model will calculate an intercept term.

  • After this step, X becomes a DataFrame with two columns:

    1. A column of 1s (the constant term).

    2. The room_num column (the predictor).

lm = sm.OLS(df["price"], X).fit()  # Y and X variables

  • sm.OLS(): This creates an OLS regression model. It takes two inputs: the dependent variable (df["price"]) and the independent variables (X, which holds room_num along with the constant term we added earlier).

  • .fit(): This method fits the OLS regression model to the data. It estimates the parameters (slope and intercept) of the regression line that best fits the data. The result is stored in the lm object.

lm.summary()

.summary(): This function provides a detailed summary of the regression results. It includes important statistical metrics such as:

  • Coefficients: The estimated values of the slope (9.0997) and intercept (-34.6592).

  • R-squared: How well the model explains the variance in the dependent variable.

  • p-values: Help determine if the independent variable is statistically significant.

  • Standard Errors: Measure the accuracy of the coefficient estimates.

An R-squared of 0.485 suggests a moderate relationship between the independent and dependent variables. It means that the model explains some of the variation in the data, but there is still 51.5% of the variance that is unexplained by the model.

To improve the model's predictive power and increase the R-squared, you could add more relevant features (e.g., house location, square footage, proximity to schools), or explore more advanced models (e.g., multiple linear regression, polynomial regression).

  2. Simple linear regression using the sklearn library:

Here is how to perform a simple linear regression using scikit-learn's LinearRegression model. Let's break it down step by step.

from sklearn.linear_model import LinearRegression

This imports the LinearRegression class from the sklearn.linear_model module, which is used to create and fit a linear regression model.

#define variables
y = df["price"]
x = df[["room_num"]]

Notice that x is wrapped in double square brackets ([["room_num"]]) to ensure it is two-dimensional. scikit-learn expects x to be a 2D array (even if there's only one feature). And we don’t need to add constant term as it is handled by LinearRegression class of sklearn itself.

lm2 = LinearRegression()
lm2.fit(x, y)

This creates an instance of the LinearRegression model, stored in the variable lm2.

.fit(x, y): This method trains the linear regression model using the predictor variable x (number of rooms) and the target variable y (house price).

The model learns the slope and intercept of the line that best fits the data by minimizing the sum of squared residuals (just like OLS).

# get intercept and slope
print(lm2.intercept_, lm2.coef_)
lm2.predict(x)

lm2.predict(x) returns the predicted house prices (y) for the given number of rooms (x).

import seaborn as sns
sns.jointplot(x="room_num", y="price", data=df, kind="reg")

sns.jointplot(): This is a function from the seaborn library used to visualize the relationship between two variables, in this case, room_num and price.

Implementation of Multiple Linear Regression

Using the OLS model of statsmodels:

For multiple linear regression, we will use all predictor variables (features) simultaneously to predict the price of a house. So let’s get started!

First, take every attribute in the dataset except price as the x variables, since price is the Y variable.

x_multi = df.drop('price', axis=1)
# axis=1 drops a column, axis=0 drops a row
x_multi.head()

y_multi= df['price']

# add a constant column to x for the intercept (beta0)
import statsmodels.api as sm

x_multi_cons = sm.add_constant(x_multi)
x_multi_cons.head()

Fit the model and get the summary:

lm_multi = sm.OLS(y_multi, x_multi_cons).fit()  #fit model
lm_multi.summary()

  1. Degrees of freedom (Df Residuals = 490) = total rows (506) − total variables (17) + 1.

  2. The lower the p-value (p < 0.05), the more significant the variable is in predicting the dependent variable.

  3. A positive coefficient means the price rises as the variable increases; a negative coefficient means the price falls as the variable increases. For example, if you increase room_num by 1, the predicted price increases by about 4, holding the other variables fixed.

Using LinearRegression of sklearn:

This is similar to simple linear regression, but here we use all features to predict the price.

x_multi = df.drop('price', axis=1) 
y_multi= df['price']

from sklearn.linear_model import LinearRegression
lm3 = LinearRegression()
lm3.fit(x_multi, y_multi)
print(lm3.intercept_, lm3.coef_)

Train-Test Split

Train-test split is a technique in machine learning where the dataset is divided into two parts:

  1. Training set: A subset of the data used to train (fit) the machine learning model.

  2. Test set: A separate subset of the data used to evaluate the performance of the trained model.

Why do we use Train-Test Split?

The main reason for splitting the data is to assess how well the model will generalize to unseen data. If a model is evaluated on the same data it was trained on, the results can be overly optimistic and can hide overfitting, where the model performs well on training data but poorly on new, unseen data.

By splitting the data:

  • We can train the model on one portion of the data.

  • Then we can test how well the model performs on data it has never seen before (test set).

In Python, using scikit-learn, the train_test_split function is commonly used to perform this split.

from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x_multi, y_multi, test_size=0.2 , random_state=0)
  • test_size=0.2: This specifies that 20% of the data will be used for testing and the remaining 80% for training.

  • By setting a specific value for random_state, such as random_state=0, you are ensuring that the results of the data split are reproducible. This means every time you run the code with random_state=0, the same split of data into training and testing sets will be produced.
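To show what a seeded split does conceptually, here is a small NumPy sketch. It is not scikit-learn's implementation of train_test_split, only a hand-written stand-in: the function name simple_train_test_split, its seed parameter, and the toy arrays are assumptions for this example.

```python
import numpy as np


def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # The same indices are used for X and y so rows stay aligned.
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy data: 10 rows, where row i of X is [2i, 2i + 1] and y[i] = i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print("train rows:", len(X_train), "test rows:", len(X_test))

# Calling again with the same seed reproduces exactly the same split.
X_train_again, _, _, _ = simple_train_test_split(X, y)
print("reproducible:", np.array_equal(X_train, X_train_again))
```

The fixed seed plays the same role as random_state: it makes the shuffle, and therefore the split, repeatable from run to run.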

Let’s train a linear regression model, make predictions on both the training and test datasets, and evaluate the model's performance using the R-squared (R²) metric.

lm_a = LinearRegression()
lm_a.fit(x_train, y_train) # train model only with training data

# predicted values of y for x_test
y_test_a = lm_a.predict(x_test)
# predicted values of y for x train
y_train_a = lm_a.predict(x_train)

# to find r2 value of train and test set prediction
# imports the r2_score function from the metrics module in scikit-learn.
from sklearn.metrics import r2_score
r2_score(y_train, y_train_a) #0.756463540591123
r2_score(y_test, y_test_a)   #0.5496468288205683

The R² scores (about 0.756 on the training set versus 0.550 on the test set) suggest that the model performs significantly better on the training data, indicating potential overfitting, where it has learned patterns specific to the training set but struggles to generalize to new data.

Regression models other than OLS

The OLS (Ordinary Least Squares) model estimates the relationship between predictor variables and a target variable by minimizing the sum of squared errors. However, using all available variables can lead to overfitting, high multicollinearity, and reduced model interpretability.

Subset selection helps identify the most significant predictors, while shrinkage methods (like Ridge and Lasso) regularize coefficients to prevent overfitting and improve generalization. Together, these techniques enhance the performance and simplicity of regression models.

Subset Selection and Shrinkage method:

Subset selection refers to the process of choosing a smaller subset of predictor variables from a larger set when training a regression model. Instead of using all available variables, you focus on a select few that contribute the most to the model's performance.

This approach helps improve model interpretability, reduces overfitting by minimizing noise from irrelevant or redundant predictors, and enhances computational efficiency.

Common techniques for subset selection include:

  • Forward Selection: Starting with no predictors, adding variables one at a time based on their statistical significance.

  • Backward Elimination: Starting with all predictors and removing the least significant variables step-by-step.

  • Stepwise Selection: A combination of both forward and backward approaches, allowing for adding and removing predictors based on their significance.
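Forward selection can be sketched in a few lines. Note one substitution: the version below picks, at each step, the predictor that lowers the RSS the most, rather than using a significance test, because that keeps the example short. The function names, the fixed number of features to keep, and the synthetic data are all assumptions for illustration.

```python
import numpy as np


def rss_for(X, y, columns):
    """RSS of a least squares fit using only the given columns (plus intercept)."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, c] for c in columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))


def forward_selection(X, y, n_features):
    """Greedily add the column that reduces the RSS the most."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        best = min(remaining, key=lambda c: rss_for(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Only columns 0 and 2 actually influence y; columns 1 and 3 are irrelevant.
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

selected = forward_selection(X, y, n_features=2)
print("selected columns:", selected)
```

Backward elimination would run the same idea in reverse: start from all columns and repeatedly drop the one whose removal hurts the fit the least.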

Shrinkage methods involve regularization techniques that reduce the magnitude of the coefficients of some predictors towards zero. This helps to control overfitting and improve the model's predictive performance.

By shrinking coefficients, we reduce the influence any single predictor has on the predictions, which lowers the variance of the model.

Common Techniques:

  • Ridge Regression (L2 Regularization): Adds a penalty proportional to the sum of the squared coefficients, shrinking them towards zero but not setting any exactly to zero.

  • Lasso Regression (L1 Regularization): Adds a penalty proportional to the sum of the absolute values of the coefficients, which can result in some coefficients being exactly zero, effectively eliminating those predictors from the model.

Ridge Regression and Lasso Regression:

Both Ridge and Lasso regression are extensions of linear regression that include regularization to prevent overfitting, especially when dealing with multicollinearity or when the number of predictors is greater than the number of observations.

1. Ridge Regression (L2 Regularization)

Ridge regression adds a penalty term to the loss function based on the square of the coefficients' magnitudes. The loss function is modified as follows:

Loss = RSS + λ (β1² + β2² + … + βp²)

where λ ≥ 0 is the regularization parameter and β1, …, βp are the coefficients of the model. By convention, the intercept β0 is not penalized.

This penalty term shrinks the coefficients towards zero but does not set them exactly to zero, allowing all predictors to remain in the model while reducing their impact.
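The shrinking effect can be observed with a small NumPy sketch. It assumes the ridge loss RSS + λ Σ βj² with an unpenalized intercept, which has the closed-form solution β = (XᵀX + λI)⁻¹ Xᵀy on centered data. The function name ridge_fit, the value λ = 50, and the synthetic data are made up for this example.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge regression; the intercept is not penalised."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    # Centering removes the intercept from the penalised problem.
    Xc = X - x_mean
    yc = y - y_mean
    p = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=100)

_, coef_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 is ordinary least squares
_, coef_ridge = ridge_fit(X, y, lam=50.0)  # a fairly strong penalty

print("OLS coefficients:  ", coef_ols)
print("Ridge coefficients:", coef_ridge)
```

The ridge coefficients come out smaller in magnitude than the OLS ones, but none of them is exactly zero, so every predictor stays in the model.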

2. Lasso Regression (L1 Regularization)

Lasso regression also adds a penalty term to the loss function, but it uses the absolute values of the coefficients. The modified loss function is:

Loss = RSS + λ (|β1| + |β2| + … + |βp|)

The L1 penalty can shrink some coefficients to exactly zero, effectively performing variable selection. This means that Lasso can reduce the number of predictors in the model by excluding less important variables.
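To see coefficients become exactly zero, here is a rough sketch of Lasso solved by coordinate descent with soft-thresholding, one common way to minimize RSS + λ Σ |βj|. It is a teaching sketch, not scikit-learn's Lasso (whose alpha parameter is scaled differently). The function names, λ = 40, and the synthetic data are assumptions for this example.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0 if it is smaller."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimise RSS + lam * sum(|beta_j|) on centred data (no intercept penalty)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n, p = Xc.shape
    beta = np.zeros(p)
    col_sq = np.sum(Xc ** 2, axis=0)
    for _ in range(n_iters):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial_residual = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_residual
            # For this loss the threshold is lam / 2.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only the first column influences y; the other two are irrelevant.
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

beta_lasso = lasso_coordinate_descent(X, y, lam=40.0)
print("lasso coefficients:", beta_lasso)
```

The coefficient of the relevant column is shrunk slightly below 3, while the two irrelevant columns are set to exactly zero, which is the variable selection behaviour described above.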

In summary, Ridge regression is used for coefficient shrinkage without variable elimination, while Lasso regression is used for both shrinkage and variable selection, making it useful when a simpler model with fewer predictors is desired.

What is heteroscedasticity?

Heteroscedasticity refers to a situation in regression analysis where the variance of the errors (or the residuals) is not constant across all levels of the independent variable(s). This violates one of the key assumptions of ordinary least squares (OLS) regression, which assumes homoscedasticity—constant variance of errors.

In a heteroscedastic dataset, the spread of the residuals increases or decreases with the value of the independent variable(s). For instance, as the value of a predictor increases, the variability of the response variable may also increase.

Heteroscedasticity often manifests as a funnel shape or fan shape when plotted in a residual vs. fitted values graph, where the residuals display a pattern that correlates with the fitted values.

Consequences:

  • Inefficient Estimates: OLS still provides unbiased estimates of the coefficients, but under heteroscedasticity they are no longer the most efficient (lowest-variance) estimates, and the usual formulas for their standard errors can be incorrect.

  • Invalid Inferences: This can result in misleading statistical tests (like t-tests and F-tests), affecting confidence intervals and hypothesis testing.

Remedies:

  • Transformations: Applying transformations (e.g., log, square root) to the dependent variable to stabilize variance.

  • Weighted Least Squares (WLS): Using WLS regression, where different weights are applied to different observations based on their variance.

  • Robust Standard Errors: Adjusting standard errors to account for heteroscedasticity without changing the coefficient estimates.
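The funnel shape and the Weighted Least Squares remedy can both be sketched on synthetic data. In the example below the noise standard deviation is deliberately made proportional to x, so weights of 1/x² are the right choice by construction; with real data the weights would have to be estimated or justified. All names and numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
# Heteroscedastic noise: its standard deviation grows with x.
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)

design = np.column_stack([np.ones(n), x])

# Ordinary least squares fit and its residuals.
beta_ols, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ beta_ols

# Simple check for a funnel: compare residual spread for small vs large x.
median_x = np.median(x)
spread_low = residuals[x < median_x].std()
spread_high = residuals[x >= median_x].std()
print("residual spread (small x):", spread_low)
print("residual spread (large x):", spread_high)

# Weighted least squares: weight each row by 1 / variance (here 1 / x**2).
# Scaling rows by sqrt(weight) turns WLS into an ordinary least squares problem.
weights = 1.0 / x ** 2
sqrt_w = np.sqrt(weights)
beta_wls, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y * sqrt_w, rcond=None)
print("OLS [intercept, slope]:", beta_ols)
print("WLS [intercept, slope]:", beta_wls)
```

A clearly larger residual spread at large x is the numeric counterpart of the funnel shape in a residuals vs. fitted plot, and WLS down-weights those noisier observations when estimating the line.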

In short, heteroscedasticity is an important consideration in regression analysis: it can affect the reliability of model estimates and inferences, so it should be detected and remedied to ensure valid results.

Conclusion

In this blog, we explored linear regression, a fundamental statistical method for modeling relationships between a dependent variable and one or more independent variables.

We discussed Ordinary Least Squares (OLS) regression, which estimates coefficients by minimizing the sum of squared errors, and the importance of addressing issues like multicollinearity and heteroscedasticity.

Techniques such as subset selection and shrinkage methods (Ridge and Lasso) enhance model performance and interpretability.

We also covered evaluating model performance using metrics like R-squared.

Dataset and code for linear regression:

In the next blog, we will delve into classification models, exploring how they categorize data points into distinct classes.

Happy Learning!


Omkar Kasture Blogs
