A simple introduction to linear regression

Author

Seth Frandsen

Published

September 29, 2025

Introduction

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It is widely used in fields such as economics, biology, and the social sciences to understand how changes in the independent variables relate to changes in the dependent variable. It is useful in practice and easy to implement.

What is Simple Linear Regression?

Simple linear regression models the relationship between a dependent variable y and an independent variable x by fitting a straight line through the data points. The model can be expressed with the formula: \[y = β_0 + β_1 x + ϵ\]

Where β_0 is the intercept (the predicted value of y when x is 0), β_1 is the slope of the line, and ϵ is the error term. The slope is the rate at which y changes as x increases: for each one-unit increase in x, the predicted value of y changes by β_1. The error term ϵ represents random noise, unmeasured factors, or inherent variability in the data that cannot be explained by the linear relationship between x and y.
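To make the role of the intercept and slope concrete, here is a minimal sketch of the line itself. The coefficient values and the `predict` helper are hypothetical, chosen only for illustration, and the error term is left out.

```python
# Hypothetical coefficients, chosen only to illustrate the formula.
beta_0 = 120.0  # intercept: predicted y when x = 0
beta_1 = 5.0    # slope: change in predicted y per one-unit increase in x


def predict(x):
    """Return the point on the line y = beta_0 + beta_1 * x (error term omitted)."""
    return beta_0 + beta_1 * x


# Raising x by one unit changes the prediction by exactly beta_1.
print(predict(3))               # 135.0
print(predict(4))               # 140.0
print(predict(4) - predict(3))  # 5.0
```

The difference between two predictions that are one unit apart in x is always the slope, no matter where on the line you start.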

Using this model, we can make predictions about the value of y based on the value of x. Below is an example of how to implement linear regression in Python using the scikit-learn library on a small dataset containing running distance, heart rate, self-evaluation, and sleep score. The first example uses one variable to predict another, and the second example uses multiple variables to predict another.

Simple Linear Regression

In this example, we will be using the distance of a run to predict the heart rate of a person.

Step 1: Import the necessary packages and load the data set that you would like to use.

For this example, we will be using the heart rate dataset along with the scikit-learn library, Pandas, NumPy, and Matplotlib. These are very powerful packages, and in this tutorial we will only be scratching the surface of what they can do. There are also many other packages that you can use to perform the same tasks. One of these is Seaborn, which is built on top of Matplotlib; for this demo we will be using Matplotlib.

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    'Distance': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Heart-rate': [125, 130, 130, 135, 146, 142, 164, 166, 170, 172],
    'Self-Eval': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55],
    'Sleep-Score': [90, 90, 90, 60, 40, 35, 40, 45, 70, 55]})
print(data.head())

   Distance  Heart-rate  Self-Eval  Sleep-Score
0         1         125         10           90
1         2         130         15           90
2         3         130         20           90
3         4         135         25           60
4         5         146         30           40

Step 2: Split the data into training and test sets

Code

X = data[['Distance']]
y = data['Heart-rate']

# 80% training, 20% testing
test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

Splitting the data into a training set and a test set is an important step in building a linear regression model. The training set is used to fit the model and the test set is used to evaluate it. If you evaluate the model on the same data you used to fit it, the score will look better than it really is, because a model can “memorize” its training data without generalizing to new data that it hasn’t seen before. This is called overfitting, and a held-out test set is how we detect it. Although many people use a 70/30 split, here we use an 80/20 split because the dataset is very small and we want to keep as many rows as possible for training.
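As a quick check on what the split produces, the sketch below rebuilds the two columns used in this example and inspects the result. The `overlap` variable is introduced here only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same Distance and Heart-rate values as the tutorial dataset.
data = pd.DataFrame({
    'Distance': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Heart-rate': [125, 130, 130, 135, 146, 142, 164, 166, 170, 172]})

X = data[['Distance']]
y = data['Heart-rate']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing.
print(len(X_train), len(X_test))

# No row should appear in both sets.
overlap = set(X_train.index) & set(X_test.index)
print("Rows in both sets:", overlap)
```

Fixing `random_state` makes the shuffle repeatable, so the same rows end up in each set every time the code runs.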

Step 3: Fit the Model

Code

model = LinearRegression()
model.fit(X_train, y_train)

print("Slope (Coefficient):", model.coef_[0])
print("Intercept:", model.intercept_)

Slope (Coefficient): 5.913793103448276
Intercept: 114.97413793103448
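For a single predictor, the fitted slope and intercept can also be computed directly from the ordinary least squares formulas. The sketch below is one way to cross-check the values scikit-learn reports. It assumes the same data and split as above, and the names `x`, `t`, `slope`, and `intercept` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    'Distance': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Heart-rate': [125, 130, 130, 135, 146, 142, 164, 166, 170, 172]})

X = data[['Distance']]
y = data['Heart-rate']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Training values as plain float arrays.
x = X_train['Distance'].to_numpy(dtype=float)
t = y_train.to_numpy(dtype=float)

# Least squares estimates for one predictor:
# slope = sum((x - mean_x) * (t - mean_t)) / sum((x - mean_x)^2)
slope = np.sum((x - x.mean()) * (t - t.mean())) / np.sum((x - x.mean()) ** 2)
# intercept = mean_t - slope * mean_x
intercept = t.mean() - slope * x.mean()

model = LinearRegression().fit(X_train, y_train)
print("By hand:     ", slope, intercept)
print("scikit-learn:", model.coef_[0], model.intercept_)
```

The two lines of output should agree up to floating-point rounding.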

Step 4: Visualize the Regression Line

Code

plt.scatter(X_train, y_train, color='blue', label='Train data')
plt.scatter(X_test, y_test, color='green', label='Test data')
plt.plot(X, model.predict(X), color='red', label='Fitted line')
plt.xlabel('Distance')
plt.ylabel('Heart Rate')
plt.title('Simple Linear Regression')
plt.legend()
plt.grid(True)
plt.show()

Step 5: Evaluate the Model

Code

r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f"R^2 score (Train): {r2_train:.2f}")
print(f"R^2 score (Test): {r2_test:.2f}")

R^2 score (Train): 0.92
R^2 score (Test): 0.98

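We evaluate the model to see how well it predicts both the data it was trained on and the held-out test data. The R^2 score measures the proportion of the variance in y that the model explains. A score of 1 means that the model predicts the data perfectly, while a score of 0 means that the model does no better than always predicting the mean of y (the score can even be negative for a model that does worse than that). What counts as a good score depends on the field, though 0.7 or higher is often treated as a reasonable rule of thumb. We got a score of .92 on the training data and .98 on the test data, so we will proceed to make a prediction. Keep in mind that the test set here contains only two runs, so the test score is a rough estimate.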
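The R^2 score can be reproduced by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes the same data and split as above. The `r_squared` helper is a hypothetical name introduced for illustration, not part of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the fit
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1 - ss_res / ss_tot


data = pd.DataFrame({
    'Distance': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Heart-rate': [125, 130, 130, 135, 146, 142, 164, 166, 170, 172]})
X = data[['Distance']]
y = data['Heart-rate']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

manual_r2 = r_squared(y_test, model.predict(X_test))
sklearn_r2 = model.score(X_test, y_test)
print(f"Manual R^2: {manual_r2:.2f}, scikit-learn R^2: {sklearn_r2:.2f}")

# A "model" that always predicts the mean of y scores exactly 0.
baseline_r2 = r_squared(y_test, np.full(len(y_test), y_test.mean()))
print(f"Baseline R^2: {baseline_r2:.2f}")
```

Seeing the baseline score of 0 next to the model's score makes clear what the number is measured against.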

Step 6: Make Predictions

Code

# Pass a DataFrame with the same column name used in training
# so scikit-learn does not warn about missing feature names.
new_distance = pd.DataFrame({'Distance': [7.5]})
predicted_hr = model.predict(new_distance)
print(f"Predicted Heart Rate at {new_distance['Distance'].iloc[0]} units: {predicted_hr[0]:.2f} bpm")

Predicted Heart Rate at 7.5 units: 159.33 bpm
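The prediction is simply the fitted line evaluated at the new distance. As a sanity check, the sketch below plugs the slope and intercept printed in Step 3 into the formula by hand. The variable names are introduced here for illustration.

```python
# Values copied from the Step 3 output above.
slope = 5.913793103448276
intercept = 114.97413793103448

distance = 7.5
predicted_hr = intercept + slope * distance  # y = beta_0 + beta_1 * x
print(f"Predicted Heart Rate at {distance} units: {predicted_hr:.2f} bpm")
```

This gives the same 159.33 bpm that `model.predict` returned.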

Multiple Linear Regression

In this example, we will use the distance of a run, self-evaluation, and sleep score to predict the heart rate of a person. But first, what is multiple linear regression? It is a generalization of simple linear regression in which multiple independent variables are used to predict a dependent variable. The equation for multiple linear regression is: \[y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_n x_n + ϵ\] where n is the number of independent variables, x_i is the value of the i-th independent variable, and β_i is its coefficient.

Step 1: Import the necessary packages and load the dataset

We will be using the same data set as before.

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    'Distance': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Heart-rate': [125, 130, 130, 135, 146, 142, 164, 166, 170, 172],
    'Self-Eval': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55],
    'Sleep-Score': [90, 90, 90, 60, 40, 35, 40, 45, 70, 55]})
print(data.head())

   Distance  Heart-rate  Self-Eval  Sleep-Score
0         1         125         10           90
1         2         130         15           90
2         3         130         20           90
3         4         135         25           60
4         5         146         30           40

Step 2: Prepare the data and split it into training and test sets

Code

X_multi = data[['Distance', 'Self-Eval', 'Sleep-Score']]
y = data['Heart-rate']

X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_multi, y, test_size=test_size, random_state=42)

Step 3: Fit the model

Code

model_multi = LinearRegression()
model_multi.fit(X_train_m, y_train_m)

print("Coefficients:", model_multi.coef_)
print("Intercept:", model_multi.intercept_)

Coefficients: [0.23036712 1.15183559 0.01452435]
Intercept: 107.97225196613596

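We split the data into a training set and a test set, exactly as before. Because test_size and random_state are unchanged, the same runs land in the training and test sets as in the simple example. The training set is used to fit the model, and the test set is used to evaluate it.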
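Before reading too much into the individual coefficients, it can help to check how the predictors relate to each other. The sketch below is an optional check that is not part of the original workflow. It shows that in this particular dataset Self-Eval is an exact linear function of Distance, so the model cannot separate their effects and their two coefficients are not uniquely determined. The `is_exact` name is introduced here for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'Distance': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Heart-rate': [125, 130, 130, 135, 146, 142, 164, 166, 170, 172],
    'Self-Eval': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55],
    'Sleep-Score': [90, 90, 90, 60, 40, 35, 40, 45, 70, 55]})

# Pairwise correlations between the three predictors.
print(data[['Distance', 'Self-Eval', 'Sleep-Score']].corr().round(2))

# In this dataset, Self-Eval equals 5 * Distance + 5 on every row.
is_exact = (data['Self-Eval'] == 5 * data['Distance'] + 5).all()
print("Self-Eval is an exact function of Distance:", is_exact)
```

With real data, predictors that are this strongly related are usually a sign to drop one of them or to interpret the coefficients together rather than one at a time.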

Step 4: Evaluate the model

Code

r2_train_m = model_multi.score(X_train_m, y_train_m)
r2_test_m = model_multi.score(X_test_m, y_test_m)
print(f"R^2 score (Train): {r2_train_m:.2f}")
print(f"R^2 score (Test): {r2_test_m:.2f}")

R^2 score (Train): 0.92
R^2 score (Test): 0.99

We got a score of .92 on the training data and .99 on the test data, so we will proceed to make a prediction.

Step 5: Make Predictions

Code

# Example: Predict for Distance=7.5, Self-Eval=30, Sleep-Score=60
# Use a DataFrame with the training column names to avoid the feature-name warning.
new_data = pd.DataFrame({'Distance': [7.5], 'Self-Eval': [30], 'Sleep-Score': [60]})
predicted_hr_multi = model_multi.predict(new_data)
print(f"Predicted Heart Rate (Multiple Regression): {predicted_hr_multi[0]:.2f} bpm")

Predicted Heart Rate (Multiple Regression): 145.13 bpm
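With several predictors, the prediction is the intercept plus the dot product of the coefficients and the feature values. The sketch below plugs in the coefficients as printed above (rounded to eight decimals) to reproduce the result by hand. The variable names are introduced here for illustration.

```python
import numpy as np

# Values copied from the printed model output above (coefficients are rounded).
coefficients = np.array([0.23036712, 1.15183559, 0.01452435])
intercept = 107.97225196613596

# Distance, Self-Eval, Sleep-Score, in the same order as the training columns.
features = np.array([7.5, 30, 60])

# y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3
predicted_hr = intercept + coefficients @ features
print(f"Predicted Heart Rate (by hand): {predicted_hr:.2f} bpm")
```

The feature order matters here: the values must line up with the order of the columns the model was trained on.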

Conclusion

Linear regression is a very useful tool for data analysis and machine learning. With it, we can model the relationship between variables and predict new values based on past data. It is also easy to implement, so try it out yourself! If you are looking for a more in-depth explanation of linear regression (https://www.kaggle.com/code/sudhirnl7/linear-regression-tutorial) or for data sets to practice with, check out Kaggle (https://www.kaggle.com/datasets) to see what data sets are available and for another great demo of how to use linear regression in Python.
