Counterintuitive effects in linear regression

Vesna Lukic
Oct 7, 2020

I encountered a situation where the task was a seemingly simple linear regression with one variable, but the line of best fit it produced looked incorrect. In the end, that line turned out to be the right one.

We begin by simulating some data in x and y. x is an array of 2200 points: 200 normally distributed points with variance 0.3 for each of the 11 values from 0 to 1 in increments of 0.1. y is also an array of 2200 points in the same range, but following a uniform distribution, with variance 0.1 to 1.1 in increments of 0.1.
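The description above leaves some details open, so the following is only a minimal sketch of one possible reading, not the original simulation. It assumes that x is scattered normally around 11 centres, and that y is uniform on a narrow band starting at each centre, so that the upper band edges run from 0.1 to 1.1. The seed and the names `centres` and `n_per_centre` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the sketch is repeatable

centres = np.linspace(0.0, 1.0, 11)  # 0 to 1 in increments of 0.1
n_per_centre = 200                   # 11 centres * 200 points = 2200 points

# x: normal scatter around each centre; a variance of 0.3 means std = sqrt(0.3)
x = np.concatenate([rng.normal(c, np.sqrt(0.3), n_per_centre) for c in centres])

# y: ASSUMED reading of the text, uniform on [c, c + 0.1) for each centre,
# so the upper band edges run from 0.1 to 1.1 in increments of 0.1
y = np.concatenate([rng.uniform(c, c + 0.1, n_per_centre) for c in centres])

print(x.shape, y.shape)
```

Under this reading x is a much noisier measurement of the underlying value than y, which is the kind of asymmetry that makes the fitted line look tilted away from the dense band of points.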

The histograms of the x and y distributions, in log space, are as shown.

The scatterplot and kernel density estimate plot of the points are as follows.

Next, let's find a line of best fit using linear regression, which works by minimising the sum of the squared residuals (the differences between the true values of y and the values predicted by the line). One might be inclined to think this line should pass through the middle of the points of highest density, which are the brightest parts in the density plot.

We see that the line does not actually go through the middle of the highest density of points; it appears to be off-centre and a poor fit. We can calculate the sum of the absolute values of the residuals (the differences between the true and predicted y values), which works out to be around 385.8.
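The fit and the residual total can be sketched with `np.polyfit`. This is an illustration on stand-in data, generated under the assumption that x is a noisy version of an underlying value and y a nearly clean one, so the printed numbers will not match the 385.8 quoted above. The post does not say which fitting routine was used, so `np.polyfit` is an assumption too.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption for repeatability

# Stand-in data (assumed, not the original): 2200 points, noisy in x, nearly clean in y
centres = np.repeat(np.linspace(0.0, 1.0, 11), 200)
x = centres + rng.normal(0.0, np.sqrt(0.3), centres.size)
y = centres + rng.uniform(0.0, 0.1, centres.size)

# Ordinary least squares of y on x: minimises the sum of squared y-residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

sum_abs = np.abs(residuals).sum()    # the kind of total quoted in the text
sum_sq = np.square(residuals).sum()  # the quantity the fit actually minimises

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"sum |residual|={sum_abs:.1f}  sum residual^2={sum_sq:.1f}")
```

On data like this the fitted slope comes out well below 1 even though the dense band of points runs roughly along the diagonal, which is the off-centre look described above.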

Next, we try to get a better-looking line by swapping the roles of x and y: we regress x on y, then rearrange the fitted equation to make y the subject.

We can see that this appears to solve the issue. However, if we calculate the sum of the absolute residuals for this red line, it works out to be around 527.9, which is higher than that of the blue line (385.8). The predictions from the red line are therefore worse than those from the blue line (the original line of best fit).
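The comparison between the two lines can be sketched as follows, again on stand-in data (assumed, not the original), so the totals will differ from 385.8 and 527.9. The helper `sum_abs_y_residuals` and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption for repeatability

# Stand-in data (assumed, not the original): 2200 points, noisy in x, nearly clean in y
centres = np.repeat(np.linspace(0.0, 1.0, 11), 200)
x = centres + rng.normal(0.0, np.sqrt(0.3), centres.size)
y = centres + rng.uniform(0.0, 0.1, centres.size)


def sum_abs_y_residuals(slope, intercept):
    """Sum of absolute differences between true y and the line's predicted y."""
    return np.abs(y - (slope * x + intercept)).sum()


# Direct fit: y = m*x + c
m, c = np.polyfit(x, y, 1)

# Swapped fit: x = a*y + b, rearranged to y = (x - b) / a
a, b = np.polyfit(y, x, 1)
m_swapped, c_swapped = 1.0 / a, -b / a

print(f"direct : slope={m:.3f}  sum |residual|={sum_abs_y_residuals(m, c):.1f}")
print(f"swapped: slope={m_swapped:.3f}  "
      f"sum |residual|={sum_abs_y_residuals(m_swapped, c_swapped):.1f}")
```

The swapped line is steeper and follows the dense band more closely, yet its residuals in y are larger. For squared residuals this is guaranteed, since the direct fit is by construction the minimiser; for absolute residuals it holds on this data but is not a general guarantee.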

Looking into how linear regression works, it makes predictions in y by minimising the residuals in y. The original fit looks incorrect because its residuals are measured along the y axis rather than the x axis, which makes it look off-centre with respect to the x axis. However, the residuals have to be measured along the axis of the quantity that we are predicting. The red line looks more intuitively correct because its residuals are measured along the x axis, so they are more symmetric along that axis.
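The gap between the two lines can be made explicit with the standard least-squares slope formulas. The symbols are introduced here for illustration and do not appear in the original analysis: r is the sample correlation between x and y, and s_x, s_y are their sample standard deviations.

$$
m_{y \mid x} = r\,\frac{s_y}{s_x},
\qquad
m_{\text{swapped}} = \frac{1}{r\,s_x / s_y} = \frac{1}{r}\,\frac{s_y}{s_x},
\qquad
\frac{m_{y \mid x}}{m_{\text{swapped}}} = r^2 \le 1 .
$$

Both lines pass through the point of means, so the direct fit is always the shallower of the two, and they coincide only when |r| = 1, that is, when the points lie exactly on a straight line. The weaker the correlation, the further apart the two lines fan out.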

Overall, it is always worth checking the results of fits, especially if they seem to go against what we intuitively expect.
