How to add a transformed term to a model (in Python, using NumPy and sklearn)

Task

Sometimes, a simple linear model isn’t sufficient for our data, and we need more complex terms or transformed variables in the model to make adequate predictions. How do we include these complex and transformed terms in a regression model?

Solution

We’re going to create the Pressure dataset as example data. It contains observations of pressure and temperature. You would use your own data instead.

import pandas as pd
pressure = pd.DataFrame( {
    'temperature': [0,20,40,60,80,100,120,140,160,180,200,
                    220,240,260,280,300,320,340,360],
    'pressure':    [0.0002,0.0012,0.0060,0.0300,0.0900,0.2700,0.7500,
                    1.8500,4.2000,8.8000,17.3000,32.1000,57.0000,96.0000,
                    157.0000,247.0000,376.0000,558.0000,806.0000]
} )
pressure

	temperature	pressure
0	0	0.0002
1	20	0.0012
2	40	0.0060
3	60	0.0300
4	80	0.0900
5	100	0.2700
6	120	0.7500
7	140	1.8500
8	160	4.2000
9	180	8.8000
10	200	17.3000
11	220	32.1000
12	240	57.0000
13	260	96.0000
14	280	157.0000
15	300	247.0000
16	320	376.0000
17	340	558.0000
18	360	806.0000

Let’s model temperature as the dependent variable with the logarithm of pressure as the independent variable. To transform the independent variable pressure, we apply NumPy’s np.log function, which computes the natural logarithm (base $e$), as shown below.

import numpy as np

# Compute the logarithm of pressure
X = pressure[['pressure']]
log_X = np.log(X)

# Build the linear model using Scikit-Learn
from sklearn.linear_model import LinearRegression
y = pressure['temperature']
log_model = LinearRegression()
log_model.fit(log_X, y)

# Display regression coefficients and R-squared value of the model
log_model.intercept_, log_model.coef_, log_model.score(log_X, y)

(153.97045660511063, array([23.78440995]), 0.9464264282083346)

The model is $\hat{t} = 153.97 + 23.784 \log p$, where $t$ stands for temperature, $p$ for pressure, and $\log$ is the natural logarithm.
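Because the model was fitted on the logarithm of pressure, any new pressure values must be log-transformed in the same way before they are passed to the model. The following is a minimal sketch of that step; the new pressure readings (1.0 and 50.0) are hypothetical values chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Recreate the example data and the log model from above
pressure = pd.DataFrame({
    'temperature': [0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200,
                    220, 240, 260, 280, 300, 320, 340, 360],
    'pressure':    [0.0002, 0.0012, 0.0060, 0.0300, 0.0900, 0.2700, 0.7500,
                    1.8500, 4.2000, 8.8000, 17.3000, 32.1000, 57.0000, 96.0000,
                    157.0000, 247.0000, 376.0000, 558.0000, 806.0000]
})
log_X = np.log(pressure[['pressure']])
y = pressure['temperature']
log_model = LinearRegression().fit(log_X, y)

# Hypothetical new pressure readings (not part of the original data)
new_data = pd.DataFrame({'pressure': [1.0, 50.0]})

# Apply the same transformation used for fitting, then predict
predicted = log_model.predict(np.log(new_data))
print(predicted)
```

Since $\log 1 = 0$, the first prediction equals the model’s intercept, which is a quick way to check that the transformation was applied consistently.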

Another example is the square root transformation. As with np.log, just apply the np.sqrt function to the appropriate term before fitting the model.

# Compute the square root of pressure
X = pressure[['pressure']]
sqrt_X = np.sqrt(X)

# Build the linear model using Scikit-Learn
from sklearn.linear_model import LinearRegression
y = pressure['temperature']
sqrt_model = LinearRegression()
sqrt_model.fit(sqrt_X, y)

# Display regression coefficients and R-squared value of the model
# (score on sqrt_X, the features the model was fitted on; R-squared rounded to 4 places)
sqrt_model.intercept_, sqrt_model.coef_, round(sqrt_model.score(sqrt_X, y), 4)

(98.56139249917803, array([11.44621468]), 0.8048)

The model is $\hat{t} = 98.561 + 11.446 \sqrt{p}$, with $t$ and $p$ having the same meanings as above.
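An alternative, sketched here as one possible approach rather than the method used above, is to bundle the transformation with the regression using scikit-learn’s FunctionTransformer and make_pipeline. The pipeline then applies the same function to the raw pressure values whenever it fits, scores, or predicts, so the transformed features never have to be tracked by hand. The `scores` dictionary and the loop over both transformations are illustrative choices, not part of the original solution.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Recreate the example data from above
pressure = pd.DataFrame({
    'temperature': [0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200,
                    220, 240, 260, 280, 300, 320, 340, 360],
    'pressure':    [0.0002, 0.0012, 0.0060, 0.0300, 0.0900, 0.2700, 0.7500,
                    1.8500, 4.2000, 8.8000, 17.3000, 32.1000, 57.0000, 96.0000,
                    157.0000, 247.0000, 376.0000, 558.0000, 806.0000]
})
X = pressure[['pressure']]   # raw, untransformed predictor
y = pressure['temperature']

# Fit one pipeline per transformation and record its R-squared value
scores = {}
for name, func in [('log', np.log), ('sqrt', np.sqrt)]:
    model = make_pipeline(FunctionTransformer(func), LinearRegression())
    model.fit(X, y)                    # the transform is applied inside the pipeline
    scores[name] = model.score(X, y)   # scored on the same raw X

print(scores)
```

Because each pipeline is scored on the raw pressure values it was fitted on, the R-squared values can be compared directly across transformations.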

Content last modified on 24 July 2023.

Contributed by Ni Shi (shi_ni@bentley.edu)