# How to fit a linear model to two columns of data

## Description

Suppose we have two columns of data, one for a single independent variable $x$ and the other for a single dependent variable $y$. How can we find the best-fit linear model that predicts $y$ from $x$?

In other words, which model coefficients $\beta_0$ and $\beta_1$ give the best linear model $\hat y=\beta_0+\beta_1x$ for our data?
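
For a single predictor, the least-squares coefficients have a closed form: $\beta_1=\frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_i (x_i-\bar x)^2}$ and $\beta_0=\bar y-\beta_1\bar x$. The sketch below assumes that "best" means ordinary least squares, which is the criterion used by every solution on this page, and computes both coefficients in plain Python. The function name `fit_line` is ours for illustration and is not part of any library.

```python
def fit_line(xs, ys):
    """Return (beta0, beta1) for the least-squares line y = beta0 + beta1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# The same fake data used in the solutions on this page
xs = [393, 453, 553, 679, 729, 748, 817]
ys = [24, 25, 27, 36, 55, 68, 84]

beta0, beta1 = fit_line(xs, ys)
print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}")
```

On this data the result should agree with the coefficients reported by the library-based solutions (an intercept near -37.32 and a slope near 0.1327). In practice the library routines are preferable because they also report standard errors and p-values.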

## Solution, in Julia

This solution uses fake example data. When using this code, replace our fake data with your real data.

# Here is the fake data you should replace with your real data.
xs = [ 393, 453, 553, 679, 729, 748, 817 ]
ys = [  24,  25,  27,  36,  55,  68,  84 ]

# Place the data into a DataFrame, because that's what Julia's modeling tools expect:
using DataFrames
data = DataFrame( xs=xs, ys=ys )  # Or you can name the columns whatever you like

# Create the linear model:
using GLM
lm( @formula( ys ~ xs ), data )

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

ys ~ 1 + xs

Coefficients:
───────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)    Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)  -37.3214    18.9954    -1.96    0.1066  -86.1508      11.5079
xs             0.13272    0.029589   4.49    0.0065    0.0566587    0.20878
───────────────────────────────────────────────────────────────────────────


The linear model in this example is approximately $\hat y=0.13272x-37.3214$.

## Using SciPy, in Python

This solution uses fake example data. When using this code, replace our fake data with your real data.

Although the solution below uses plain Python lists of data, it also works if the data are stored in NumPy arrays or pandas Series.

# Here is the fake data you should replace with your real data.
xs = [ 393, 453, 553, 679, 729, 748, 817 ]
ys = [  24,  25,  27,  36,  55,  68,  84 ]

# We will use SciPy to build the model
import scipy.stats as stats

# If you need the model coefficients stored in variables for later use, do:
model = stats.linregress( xs, ys )
beta0 = model.intercept
beta1 = model.slope

# If you just need to see the coefficients (and some other related data),
# do this alone:
stats.linregress( xs, ys )

LinregressResult(slope=0.1327195637885226, intercept=-37.32141898334582, rvalue=0.8949574425541466, pvalue=0.006486043236692156, stderr=0.029588975845594334, intercept_stderr=18.995444317768097)


The linear model in this example is approximately $\hat y=0.133x-37.32$.
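
Once the coefficients are stored, the model can be used to predict $y$ for a new $x$, and the `rvalue` field gives a quick measure of fit. The sketch below assumes the same fake data; the helper `predict` and the value `x_new = 600` are hypothetical, chosen only for illustration.

```python
import scipy.stats as stats

xs = [393, 453, 553, 679, 729, 748, 817]
ys = [24, 25, 27, 36, 55, 68, 84]

model = stats.linregress(xs, ys)

def predict(x):
    """Predicted y for a given x under the fitted line (hypothetical helper)."""
    return model.intercept + model.slope * x

x_new = 600  # hypothetical new x value, inside the range of the data
y_hat = predict(x_new)

# Squaring the correlation coefficient gives R-squared for simple regression
r_squared = model.rvalue ** 2

print(f"Predicted y at x = {x_new}: {y_hat:.2f}")
print(f"R-squared: {r_squared:.3f}")
```

Predictions are most trustworthy for $x$ values within the range of the original data (here, roughly 393 to 817).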

## Using statsmodels, in Python

This solution uses fake example data. When using this code, replace our fake data with your real data.

Although the solution below uses plain Python lists of data, it also works if the data are stored in NumPy arrays or pandas Series.

# Here is the fake data you should replace with your real data.
xs = [ 393, 453, 553, 679, 729, 748, 817 ]
ys = [  24,  25,  27,  36,  55,  68,  84 ]

# We will use statsmodels to build the model
import statsmodels.api as sm

# statsmodels does not add a constant term to the model unless you request it:
xs = sm.add_constant( xs )

# Fit the model and tell us all about it:
model = sm.OLS( ys, xs ).fit()
model.summary()

/opt/conda/lib/python3.10/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given.
warn("omni_normtest is not valid with less than 8 observations; %i "

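Dep. Variable:                  y    R-squared:             0.801
Model:                        OLS    Adj. R-squared:        0.761
Method:             Least Squares    F-statistic:           20.12
Date:            Mon, 24 Jul 2023    Prob (F-statistic):  0.00649
Time:                    20:46:39    Log-Likelihood:      -25.926
No. Observations:               7    AIC:                   55.85
Df Residuals:                   5    BIC:                   55.74
Df Model:                       1
Covariance Type:        nonrobust

           coef   std err         t     P>|t|    [0.025    0.975]
const  -37.3214    18.995    -1.965     0.107   -86.151    11.508
x1       0.1327     0.030     4.485     0.006     0.057     0.209

Omnibus:           nan    Durbin-Watson:       0.806
Prob(Omnibus):     nan    Jarque-Bera (JB):     0.52
Skew:           -0.366    Prob(JB):            0.771
Kurtosis:        1.883    Cond. No.             2780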

Notes:
 Standard Errors assume that the covariance matrix of the errors is correctly specified.
 The condition number is large, 2.78e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The linear model in this example is approximately $\hat y=0.1327x-37.3214$.

## Solution, in R

This solution uses fake example data. When using this code, replace our fake data with your real data.

# Here is the fake data you should replace with your real data.
xs <- c( 393, 453, 553, 679, 729, 748, 817 )
ys <- c(  24,  25,  27,  36,  55,  68,  84 )

# If you need the model coefficients stored in variables for later use, do:
model <- lm( ys ~ xs )
beta0 <- model$coefficients[1]
beta1 <- model$coefficients[2]

# If you just need to see the coefficients, do this alone:
lm( ys ~ xs )

Call:
lm(formula = ys ~ xs)

Coefficients:
(Intercept)           xs
   -37.3214       0.1327


The linear model in this example is approximately $\hat y=0.133x-37.32$.