# How to add a transformed term to a model

## Description

Sometimes, a simple linear model isn’t sufficient for our data, and we need more complex terms or transformed variables in the model to make adequate predictions. How do we include these complex and transformed terms in a regression model?

## Using NumPy and sklearn, in Python

We’re going to create the Pressure dataset as example data. It contains observations of pressure and temperature. You would use your own data instead.

```python
import pandas as pd

pressure = pd.DataFrame({
    'temperature': [0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200,
                    220, 240, 260, 280, 300, 320, 340, 360],
    'pressure':    [0.0002, 0.0012, 0.0060, 0.0300, 0.0900, 0.2700, 0.7500,
                    1.8500, 4.2000, 8.8000, 17.3000, 32.1000, 57.0000, 96.0000,
                    157.0000, 247.0000, 376.0000, 558.0000, 806.0000]
})
pressure
```

```
    temperature  pressure
0             0    0.0002
1            20    0.0012
2            40    0.0060
3            60    0.0300
4            80    0.0900
5           100    0.2700
6           120    0.7500
7           140    1.8500
8           160    4.2000
9           180    8.8000
10          200   17.3000
11          220   32.1000
12          240   57.0000
13          260   96.0000
14          280  157.0000
15          300  247.0000
16          320  376.0000
17          340  558.0000
18          360  806.0000
```

Let’s model temperature as the dependent variable with the logarithm of pressure as the independent variable. To transform the independent variable pressure, we use NumPy’s np.log function, as shown below. It uses the natural logarithm (base $e$).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Compute the logarithm of pressure
X = pressure[['pressure']]
log_X = np.log(X)

# Build the linear model using Scikit-Learn
y = pressure['temperature']
log_model = LinearRegression()
log_model.fit(log_X, y)

# Display regression coefficients and R-squared value of the model
log_model.intercept_, log_model.coef_, log_model.score(log_X, y)
```

```
(153.97045660511063, array([23.78440995]), 0.9464264282083346)
```


The model is $\hat t = 153.97 + 23.784\log p$, where $t$ stands for temperature and $p$ for pressure.
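Once the model is fitted on the transformed variable, any new pressure value has to go through the same transformation before prediction. The following is a minimal sketch of that step; the `new_data` values are hypothetical and chosen only for illustration, and the dataset is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Rebuild the example data so this block is self-contained
pressure = pd.DataFrame({
    'temperature': [0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200,
                    220, 240, 260, 280, 300, 320, 340, 360],
    'pressure':    [0.0002, 0.0012, 0.0060, 0.0300, 0.0900, 0.2700, 0.7500,
                    1.8500, 4.2000, 8.8000, 17.3000, 32.1000, 57.0000, 96.0000,
                    157.0000, 247.0000, 376.0000, 558.0000, 806.0000]
})

# Fit temperature against log(pressure), as in the text
log_X = np.log(pressure[['pressure']])
y = pressure['temperature']
log_model = LinearRegression().fit(log_X, y)

# Hypothetical new pressure readings (illustration only)
new_data = pd.DataFrame({'pressure': [1.0, 10.0, 100.0]})

# Apply the same log transformation before calling predict
predictions = log_model.predict(np.log(new_data))
print(predictions)
```

Because $\log 1 = 0$, the prediction for a pressure of 1 is just the intercept, which is a quick sanity check that the transformation was applied to the new data.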

Another example transformation is the square root transformation. As with np.log, just apply the np.sqrt function to the appropriate term when defining the model.

```python
# Compute the square root of pressure
X = pressure[['pressure']]
sqrt_X = np.sqrt(X)

# Build the linear model using Scikit-Learn
from sklearn.linear_model import LinearRegression
y = pressure['temperature']
sqrt_model = LinearRegression()
sqrt_model.fit(sqrt_X, y)

# Display regression coefficients and R-squared value of the model
# (score on sqrt_X, the same transformed input the model was fitted on)
sqrt_model.intercept_, sqrt_model.coef_, sqrt_model.score(sqrt_X, y)
```

```
(98.56139249917803, array([11.44621468]), 0.8048...)
```


The model is $\hat t = 98.561 + 11.446\sqrt{p}$, with $t$ and $p$ having the same meanings as above.
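One way to avoid scoring or predicting with the wrong transformed input is to bundle the transformation and the regression into a single object. The sketch below uses scikit-learn's `FunctionTransformer` and `make_pipeline` for this; it is an alternative arrangement, not the approach used above, and the name `sqrt_pipeline` is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Rebuild the example data so this block is self-contained
pressure = pd.DataFrame({
    'temperature': [0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200,
                    220, 240, 260, 280, 300, 320, 340, 360],
    'pressure':    [0.0002, 0.0012, 0.0060, 0.0300, 0.0900, 0.2700, 0.7500,
                    1.8500, 4.2000, 8.8000, 17.3000, 32.1000, 57.0000, 96.0000,
                    157.0000, 247.0000, 376.0000, 558.0000, 806.0000]
})
X = pressure[['pressure']]
y = pressure['temperature']

# The pipeline applies np.sqrt to the raw input, then fits the regression
sqrt_pipeline = make_pipeline(FunctionTransformer(np.sqrt), LinearRegression())
sqrt_pipeline.fit(X, y)

# The last pipeline step is the fitted LinearRegression
regression = sqrt_pipeline[-1]
intercept = regression.intercept_
slope = regression.coef_[0]

# score and predict both take the untransformed pressure values
r_squared = sqrt_pipeline.score(X, y)
print(intercept, slope, r_squared)
```

With this arrangement, `fit`, `score`, and `predict` all receive the raw pressure column, so the transformed and untransformed inputs cannot be mixed up.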

## Solution, in R

We’re going to use the pressure dataset built into R as example data. It ships with the datasets package, which is loaded by default, so no extra library is required. It contains observations of pressure and temperature. You would use your own data instead.

```r
# pressure is part of R's built-in datasets package
data("pressure")
```


Let’s model temperature as the dependent variable with the logarithm of pressure as the independent variable. To place the “log of pressure” term in the model, we use R’s log function, as shown below. It uses the natural logarithm (base $e$).

```r
# Build the model
model.log <- lm(temperature ~ log(pressure), data = pressure)
summary(model.log)
```

```
Call:
lm(formula = temperature ~ log(pressure), data = pressure)

Residuals:
   Min     1Q Median     3Q    Max
-28.60 -22.30 -10.13  20.00  48.61

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    153.970      6.330   24.32 1.20e-14 ***
log(pressure)   23.784      1.372   17.33 3.07e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 26.81 on 17 degrees of freedom
Multiple R-squared:  0.9464,	Adjusted R-squared:  0.9433
F-statistic: 300.3 on 1 and 17 DF,  p-value: 3.07e-12
```


The model is $\hat t = 153.97 + 23.784\log p$, where $t$ stands for temperature and $p$ for pressure.
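As a rough guide to reading the coefficient on a log-transformed term, here is a worked calculation from the fitted values reported above (it is not additional output from R). Doubling the pressure changes the predicted temperature by

$$
\hat t(2p) - \hat t(p) = 23.784\,(\log 2p - \log p) = 23.784 \log 2 \approx 23.784 \times 0.6931 \approx 16.5,
$$

so under this model each doubling of pressure corresponds to the same increase of about 16.5 in predicted temperature, whatever the starting pressure.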

Another example transformation is the square root transformation. As with log, just apply the sqrt function to the appropriate term when defining the model.

```r
# Build the model
model.sqrt <- lm(temperature ~ sqrt(pressure), data = pressure)
summary(model.sqrt)
```

```
Call:
lm(formula = temperature ~ sqrt(pressure), data = pressure)

Residuals:
   Min     1Q Median     3Q    Max
-98.72 -34.74  11.53  42.75  56.59

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      98.561     15.244   6.465 5.81e-06 ***
sqrt(pressure)   11.446      1.367   8.372 1.95e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 51.16 on 17 degrees of freedom
Multiple R-squared:  0.8048,	Adjusted R-squared:  0.7933
F-statistic:  70.1 on 1 and 17 DF,  p-value: 1.953e-07
```


The model is $\hat t = 98.561 + 11.446\sqrt{p}$, with $t$ and $p$ having the same meanings as above.