Linear Regression Calculator

Fit a least-squares regression line to any paired X-Y data. Get the regression equation, slope, intercept, R², r, RMSE, residuals and outliers, with a full step-by-step solution and live scatter plot.

Enter your data

X: 10 numeric values parsed
Y: 10 numeric values parsed

Separate numbers with commas, spaces, or new lines. The X list is the predictor (independent variable); the Y list is what you are trying to model.
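The page does not show how the input is parsed, so the following is only a minimal sketch of one way to accept commas, spaces, or new lines as separators. The function name `parse_numbers` and the sample strings are hypothetical.

```python
import re

def parse_numbers(text):
    """Split on commas, whitespace, or new lines and convert each token to float."""
    tokens = [t for t in re.split(r"[,\s]+", text.strip()) if t]
    return [float(t) for t in tokens]

# Hypothetical input strings mixing all three separators.
xs = parse_numbers("1, 2, 3 4\n5")
ys = parse_numbers("52 60,68\n73, 82")

# Paired data needs one Y for every X.
if len(xs) != len(ys):
    raise ValueError("X and Y must contain the same number of values")

print(len(xs), "numeric values parsed")
```

A token that is not a number makes `float` raise `ValueError`, which is one simple way to reject bad input early.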

Line of best fit
y = 4.4182x + 50.6000
n = 10, r = 0.9949
Slope (m): 4.4182
Intercept (b): 50.6000
R²: 0.9899
RMSE: 1.283
Scatter plot with fitted line
Residual plot - look for randomness, not patterns

Residuals should look like a structureless cloud around y = 0. A clear curve means the relationship isn't linear; a funnel means the variance grows with X (heteroscedasticity).

Step-by-Step Solution

Computed from the 10 pairs you entered using the least-squares method.

Step 1
Tabulate paired values and running products
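| i | xᵢ | yᵢ | xᵢ·yᵢ | xᵢ² |
|---|---|---|---|---|
| 1 | 1 | 52 | 52.00 | 1.00 |
| 2 | 2 | 60 | 120.00 | 4.00 |
| 3 | 3 | 65 | 195.00 | 9.00 |
| 4 | 4 | 70 | 280.00 | 16.00 |
| 5 | 5 | 73 | 365.00 | 25.00 |
| 6 | 6 | 78 | 468.00 | 36.00 |
| 7 | 7 | 82 | 574.00 | 49.00 |
| 8 | 8 | 85 | 680.00 | 64.00 |
| 9 | 9 | 90 | 810.00 | 81.00 |
| 10 | 10 | 94 | 940.00 | 100.00 |
| Σ | 55.00 | 749.00 | 4484.00 | 385.00 |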
Step 2
Compute the running sums
  • n = 10
  • Σx = 55.0000
  • Σy = 749.0000
  • Σxy = 4484.0000
  • Σx² = 385.0000
  • Σy² = 57727.0000
  • x̄ = 5.5000
  • ȳ = 74.9000
Step 3
Compute the slope (m)
$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$
$$m = \frac{10(4484.00) - (55.00)(749.00)}{10(385.00) - (55.00)^2} = 4.418182$$
Step 4
Compute the intercept (b)
$$b = \bar{y} - m\bar{x}$$
$$b = 74.9000 - (4.4182)(5.5000) = 50.600000$$
Step 5
Write the regression equation
$$\hat{y} = 4.4182x + 50.6000$$
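Steps 1 to 5 can be reproduced in a few lines. This is a sketch of the arithmetic only, using the ten pairs from the table in Step 1; the variable names are illustrative and are not taken from the calculator's own code.

```python
# The ten (x, y) pairs tabulated in Step 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [52, 60, 65, 70, 73, 78, 82, 85, 90, 94]

# Step 2: running sums.
n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Step 3: slope from the sums formula.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Step 4: intercept from the two means.
b = sum_y / n - m * (sum_x / n)

# Step 5: the fitted equation.
print(f"y = {m:.4f}x + {b:.4f}")
```

The denominator `n * sum_x2 - sum_x ** 2` is zero when every X value is identical, in which case no slope can be fitted.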
Step 6
Compute R² (goodness of fit)
$$R^2 = r^2$$
$$R^2 = (0.994925)^2 = 0.989875$$

About 99.0% of the variation in Y is explained by X under this linear model.
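Step 6 uses r without showing how it is obtained. The sketch below assumes the standard Pearson formula written in terms of the running sums from Step 2 (including Σy², which that step lists); the page itself does not state which formula the calculator uses.

```python
from math import sqrt

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [52, 60, 65, 70, 73, 78, 82, 85, 90, 94]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Pearson correlation from the sums (assumed formula, see lead-in).
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# For simple linear regression, R² is r squared.
r_squared = r ** 2

print(f"r = {r:.6f}, R^2 = {r_squared:.6f}")
```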

Step 7
Compute RMSE
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum (y_i - \hat{y}_i)^2}$$
$$\text{RMSE} = 1.283461$$

A typical prediction from this line will be off by about 1.28 units of Y.
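The RMSE in Step 7 can be checked directly from the residuals. This sketch follows the formula as written on this page, dividing by n; the helper name `rmse` is illustrative.

```python
from math import sqrt

def rmse(xs, ys):
    """Fit y = m*x + b by least squares, then return the root mean square residual."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - m * x_bar
    squared_errors = [(y - (m * x + b)) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(squared_errors) / n)  # divides by n, as in the formula above

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [52, 60, 65, 70, 73, 78, 82, 85, 90, 94]
print(f"RMSE = {rmse(xs, ys):.6f}")
```

Dividing by n − 2 instead of n gives the residual standard error, a related but slightly larger statistic, so results from other software may differ for that reason.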

Step 8
Use the equation to predict

To predict Y for any new X value, substitute it into the fitted equation. For example, at X = x̄ = 5.50 the line predicts Y = ȳ = 74.90 - a sanity check that always holds for ordinary least squares with an intercept.
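As a quick illustration of Step 8, the sketch below fits the line in its centred form and evaluates it at the mean of X and at one other value inside the data range. The function name `predict` and the choice of X = 6.5 are illustrative.

```python
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [52, 60, 65, 70, 73, 78, 82, 85, 90, 94]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Centred form of the slope: sum of cross-deviations over sum of squared X deviations.
m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
b = y_bar - m * x_bar

def predict(x):
    """Return the fitted value for a new x."""
    return m * x + b

# Sanity check: the line passes through the point of means.
print(f"at x = {x_bar:.2f}: {predict(x_bar):.2f}")

# A prediction at a value inside the observed range of X.
print(f"at x = 6.50: {predict(6.5):.2f}")
```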

The Linear Regression Formulas

Ordinary least squares minimises $\sum (y_i - \hat{y}_i)^2$, the sum of squared vertical distances between each point and the fitted line. Solving that minimisation gives closed-form expressions for the slope and intercept.
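One way to see where the closed-form expressions come from is to set both partial derivatives of the squared-error sum to zero and solve the resulting pair of equations. In the sketch below, S(m, b) is a symbol introduced here for convenience; it is not used elsewhere on the page.

```latex
\begin{aligned}
S(m, b) &= \sum_i (y_i - m x_i - b)^2 \\
\frac{\partial S}{\partial b} = 0 &\;\Rightarrow\; \sum y = m \sum x + n b \\
\frac{\partial S}{\partial m} = 0 &\;\Rightarrow\; \sum xy = m \sum x^2 + b \sum x \\
b &= \bar{y} - m\bar{x} \\
m &= \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}
\end{aligned}
```

The fourth line is the first condition divided by n; substituting it into the second condition and multiplying through by n gives the slope formula.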

Slope

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$

Equivalent to $\operatorname{Cov}(X,Y) / \operatorname{Var}(X)$. Units of m are units of Y per unit of X.

Intercept

$$b = \bar{y} - m\bar{x}$$

The predicted Y when X = 0. Only meaningful if X = 0 is inside or near the range of your data - extrapolating outside that range is risky.

R² (coefficient of determination)

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

Proportion of variance in Y explained by the model. Equal to r² for simple linear regression.
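The claim that the residual-based definition equals r² for a single predictor can be checked numerically. The sketch below computes R² both ways on the ten pairs from the step-by-step solution; the function name `r_squared_two_ways` is illustrative.

```python
def r_squared_two_ways(xs, ys):
    """Return (1 - SSE/SST, r squared) for a simple least-squares fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)  # total sum of squares

    m = s_xy / s_xx
    b = y_bar - m * x_bar
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

    from_residuals = 1 - sse / s_yy
    from_correlation = s_xy ** 2 / (s_xx * s_yy)
    return from_residuals, from_correlation

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [52, 60, 65, 70, 73, 78, 82, 85, 90, 94]
a, c = r_squared_two_ways(xs, ys)
print(f"{a:.6f} {c:.6f}")
```

The two values agree up to floating-point rounding. With more than one predictor the shortcut through a single correlation coefficient no longer applies, and only the residual-based definition is used.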

RMSE

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum (y_i - \hat{y}_i)^2}$$

Typical prediction error, in the original units of Y. Use alongside R² to judge fit quality.

Assumptions Behind Linear Regression

Least-squares regression always returns a slope and intercept - even when the model is the wrong shape for the data. The following assumptions decide whether those numbers can be trusted for inference and prediction.

Linearity

The true relationship between X and Y is a straight line. Check by plotting Y against X (the calculator does this for you) and looking for curvature.

Independence of errors

Each residual is independent of the others. Time-series data often violates this; use specialised models when it does.

Constant variance (homoscedasticity)

The spread of residuals is roughly the same across the range of X. A funnel shape in the residual plot means heteroscedasticity - consider transforming Y (log, square root) or fitting weighted least squares.

Normally distributed errors

Residuals are approximately normal. This matters for confidence intervals and p-values, less so for the point estimates of slope and intercept.

No influential outliers

A single extreme point with high leverage can swing the line dramatically. Examine flagged outliers and check whether they are recording errors or genuinely informative.

Worked Example: Five Students by Hand

Five students recorded hours studied (X) and test score (Y). Compute the regression line by hand.

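| Student | X | Y | X·Y | X² |
|---|---|---|---|---|
| 1 | 1 | 52 | 52 | 1 |
| 2 | 2 | 60 | 120 | 4 |
| 3 | 3 | 68 | 204 | 9 |
| 4 | 4 | 73 | 292 | 16 |
| 5 | 5 | 82 | 410 | 25 |
| Σ | 15 | 335 | 1078 | 55 |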
Slope
$$m = \frac{5(1078) - (15)(335)}{5(55) - (15)^2} = 7.30$$
Intercept
$$b = 67 - (7.30)(3) = 45.10$$
$$\hat{y} = 7.30x + 45.10, \quad r = 0.997$$

Each additional hour of study is associated with roughly 7.3 extra points. Paste 1, 2, 3, 4, 5 and 52, 60, 68, 73, 82 into the calculator above to reproduce the result.
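The same hand calculation can also be checked in code. This is a small sketch with a hypothetical helper `fit_line` that returns the slope, intercept, and correlation for the five students.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (slope, intercept, r) using the least-squares sums formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    sxy = n * sum_xy - sum_x * sum_y
    sxx = n * sum_x2 - sum_x ** 2
    syy = n * sum_y2 - sum_y ** 2

    m = sxy / sxx
    b = sum_y / n - m * (sum_x / n)
    r = sxy / sqrt(sxx * syy)
    return m, b, r

hours = [1, 2, 3, 4, 5]
scores = [52, 60, 68, 73, 82]
m, b, r = fit_line(hours, scores)
print(f"y = {m:.2f}x + {b:.2f}, r = {r:.3f}")
```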

Where Linear Regression is Used

Economics & forecasting

Demand curves, Phillips curves, and short-term price models start as linear regressions before fancier dynamics are added. They remain the baseline against which complex models are judged.

Real estate price modelling

Square footage, bedrooms, neighbourhood - each predictor enters as a coefficient in a multiple regression. Even single-predictor versions (price per sq ft) are useful sanity checks.

Lab calibration curves

Chemistry instruments are calibrated by fitting a line through known standards. The slope becomes the conversion factor; the R² confirms the linear range.

A/B testing - covariate adjustment

Regressing the outcome on pre-experiment covariates (CUPED, MLRATE) reduces variance and shrinks the confidence interval on the treatment effect.

Machine learning baseline

Before deep learning, before random forests, fit a linear regression. If a simple line explains most of the variance, the gain from complexity may not be worth it.

Education research

Predicting test performance from hours of instruction, attendance, or prior scores typically starts as a linear regression.

Common Mistakes to Avoid

  1. Extrapolating beyond the data. A regression fit on house sizes from 800 to 2,500 sq ft tells you nothing reliable about a 6,000 sq ft mansion. The line is only validated inside the range of X you trained on.
  2. Reading the intercept literally when X = 0 lies outside the data. For a "years of experience vs salary" regression fitted only on mid-career employees, the intercept is just an arithmetic anchor - not a meaningful starting salary, because the sample contains nobody near zero years of experience.
  3. Mistaking correlation for causation. Regression gives you a slope, not a causal effect. A significant slope can still come from confounders or reverse causation.
  4. Trusting R² alone to judge fit. R² can be high for a fundamentally wrong model. Always look at the residual plot for curvature, funnels, or clusters.
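  5. Letting one point dominate. A single high-leverage point can change slope, intercept and R² substantially. The calculator flags large standardised residuals, but a high-leverage point can pull the line towards itself and end up with only a small residual, so also check the scatter plot for points with isolated X values.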
  6. Forgetting that slope has units. A slope of 0.4 means nothing without knowing it is "0.4 dollars per additional minute of ad time", not 0.4 of something abstract.
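Mistake 4 is easy to demonstrate. The sketch below fits a straight line to deliberately curved, made-up data (y = x² for x from 1 to 10, chosen purely for illustration). R² comes out at roughly 0.95, yet the residuals are positive at both ends and negative in the middle, which is the curvature pattern a residual plot would expose.

```python
# Illustrative data only: an exactly quadratic relationship.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

# Straight-line fit, even though the data are curved.
m = s_xy / s_xx
b = y_bar - m * x_bar
r_squared = s_xy ** 2 / (s_xx * s_yy)

# Residuals in X order: look at the sign pattern, not just R².
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]

print(f"R^2 = {r_squared:.3f}")
print([round(e, 1) for e in residuals])
```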

Frequently Asked Questions

What is linear regression?
Linear regression fits a straight line through a set of (x, y) data points so that the total squared vertical distance from the points to the line is as small as possible. The result is an equation of the form y = mx + b that lets you predict y from x and quantify how much of the variation in y is explained by x.
How do you calculate the slope and intercept by hand?
Use the least-squares formulas: m = [n·Σxy − Σx·Σy] / [n·Σx² − (Σx)²] for the slope, then b = ȳ − m·x̄ for the intercept. The calculator above computes every intermediate sum (Σx, Σy, Σxy, Σx², Σy²) so you can verify the arithmetic step by step.
What does the R² value mean?
R² (coefficient of determination) is the proportion of variance in y that the regression line explains. R² = 0.85 means 85% of the variation in y is captured by the linear model, and the remaining 15% is unexplained - random error, omitted variables, or non-linearity.
What is the difference between r and R²?
r is the Pearson correlation coefficient. R² is simply r squared. r tells you direction and strength of the linear relationship (between −1 and +1); R² tells you the proportion of variance explained (between 0 and 1). For simple linear regression with one predictor, R² and r are essentially two views of the same number.
What is RMSE and why is it different from R²?
RMSE (root mean square error) is the typical size of a prediction error in the original units of y. R² is a dimensionless proportion. Two models with the same R² can have very different RMSE if the y variable is on different scales. Use both: R² for relative fit, RMSE for the size of the average miss.
How is the line of best fit calculated?
The 'best fit' is defined as the line that minimises the sum of squared vertical distances between each data point and the line. Calculus or linear algebra both lead to the same closed-form slope and intercept formulas the calculator uses.
What is a residual?
A residual is the difference between an actual y value and the value predicted by the regression line: residual = y − ŷ. Looking at the pattern of residuals (a residual plot) is the standard way to check whether a linear model is appropriate - residuals should look randomly scattered with no curve, no funnel, and no obvious clusters.
How are outliers detected in regression?
This calculator standardises each residual by dividing by the residual standard deviation. Points with a standardised residual whose absolute value exceeds 2 are flagged as potential outliers. Outliers may be data entry errors, genuinely unusual observations, or signs that the linear model is the wrong shape.
When should I not use linear regression?
When the scatter plot shows obvious curvature (use polynomial or non-linear regression), when the variance of y changes with x (heteroscedasticity - consider transforms or weighted least squares), when y is binary or count data (use logistic or Poisson regression), or when there are influential outliers that you cannot justify excluding (consider robust regression).
How many data points do I need for linear regression?
Mechanically you need at least 2 points to fit a line, but with exactly 2 the line passes through both points, so R² is trivially 1 and nothing can be said about the error. Estimates remain unstable for small samples, roughly below n = 10. A common rule of thumb for inference is at least 10 observations per predictor, which for a single predictor means around 10. For tight intervals and a stable R², aim for 30 or more.
Does the regression line always go through the mean point?
Yes. For ordinary least squares with an intercept, the fitted line is guaranteed to pass through the point (x̄, ȳ). This is a useful sanity check: compute the means of both columns and plug x̄ into your equation - you should get ȳ.
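The outlier rule described in the FAQ (flag any point whose standardised residual exceeds 2 in absolute value) might look like the sketch below. Two things are assumptions: the residual standard deviation is taken with a 1/n divisor, matching the RMSE formula on this page, and the data are made up, with the value at x = 8 deliberately pushed off an otherwise exact line. The function name `flag_outliers` is hypothetical.

```python
from math import sqrt

def flag_outliers(xs, ys, threshold=2.0):
    """Return the (x, y) points whose standardised residual exceeds the threshold."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - m * x_bar

    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    # Assumption: 1/n divisor, as in the RMSE formula on this page.
    sd = sqrt(sum(e * e for e in residuals) / n)
    if sd == 0:
        return []  # perfect fit, nothing to flag

    return [(x, y) for x, y, e in zip(xs, ys, residuals) if abs(e / sd) > threshold]

# Illustrative data: y = 2x + 1 everywhere except x = 8.
xs = list(range(1, 11))
ys = [3, 5, 7, 9, 11, 13, 15, 30, 19, 21]
print(flag_outliers(xs, ys))
```

Only the altered point is flagged here. Because the outlier also inflates the standard deviation it is measured against, a threshold of 2 is a screening rule rather than a formal test.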

When Ordinary Least Squares Is Not the Right Tool

The calculator above fits simple ordinary least squares (OLS) with a single predictor, which assumes a linear relationship, roughly constant error variance, and an unbounded continuous response. When those assumptions fail, the right move is not to torture the OLS output. It is to pick a different model.

Polynomial regression

Use when the scatter plot shows a clear curve. Fit y on x, x² and sometimes x³. Stop at the lowest degree that removes the curve from the residual plot - higher degrees overfit fast.

Ridge & Lasso regression

Use when you have many correlated predictors and a small n. Both shrink coefficients; Lasso also drives some to zero, doubling as feature selection.

Logistic regression

Use when Y is binary (clicked or not, churned or not). OLS on a 0/1 outcome can produce predicted probabilities outside [0, 1], and its errors are inherently heteroscedastic, so the usual standard errors are unreliable.
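To see the out-of-range problem concretely, the sketch below fits a straight line to a made-up binary outcome (0 for the five smallest X values, 1 for the five largest) and evaluates the fitted line at both ends of the data. The data are illustrative only.

```python
# Illustrative binary outcome: 0 for x = 1..5, 1 for x = 6..10.
xs = list(range(1, 11))
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
b = y_bar - m * x_bar

# Fitted "probabilities" at the smallest and largest observed x.
lowest = m * xs[0] + b
highest = m * xs[-1] + b

print(f"fitted value at x = 1:  {lowest:.3f}")
print(f"fitted value at x = 10: {highest:.3f}")
```

Both fitted values fall outside the unit interval even though they are evaluated inside the observed range of X, which is the behaviour a logistic model avoids by construction.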

Poisson / negative binomial

Use when Y is a non-negative integer count (number of bugs, customer arrivals). The variance grows with the mean, breaking the constant-variance assumption.

Robust regression

Use when one or two influential points dominate the slope. Methods like Huber or RANSAC down-weight outliers instead of letting them swing the line.

Weighted least squares

Use when the residuals form a funnel (heteroscedasticity). Each point is weighted by the inverse of its estimated variance, restoring valid inference.
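For a single predictor, weighted least squares has a closed form that mirrors the OLS formulas with weighted means. The sketch below is a minimal version under stated assumptions: the helper name `weighted_fit` is hypothetical, and the weights (here 1/x², as if the error standard deviation grew in proportion to X) are made up for illustration rather than estimated from data.

```python
def weighted_fit(xs, ys, weights):
    """Return (slope, intercept) minimising the weighted sum of squared residuals."""
    w_total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / w_total  # weighted mean of x
    y_bar = sum(w * y for w, y in zip(weights, ys)) / w_total  # weighted mean of y

    s_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))

    m = s_xy / s_xx
    b = y_bar - m * x_bar
    return m, b

# The five-student data from the worked example.
xs = [1, 2, 3, 4, 5]
ys = [52, 60, 68, 73, 82]

# Equal weights reduce to ordinary least squares.
m_ols, b_ols = weighted_fit(xs, ys, [1.0] * len(xs))

# Hypothetical weights: inverse variance, assuming variance proportional to x².
m_wls, b_wls = weighted_fit(xs, ys, [1.0 / x ** 2 for x in xs])

print(f"equal weights: y = {m_ols:.2f}x + {b_ols:.2f}")
print(f"1/x^2 weights: y = {m_wls:.2f}x + {b_wls:.2f}")
```

In practice the variances are usually unknown and have to be estimated, for example from the spread of residuals in an initial OLS fit, so the choice of weights is itself a modelling decision.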

References and Further Reading

  • Galton, F. (1886) and Gauss / Legendre on the origin of least-squares - Ordinary least squares (Wikipedia).
  • Anscombe's quartet shows why a high R² alone is not enough - Anscombe's quartet.
  • NIST/SEMATECH e-Handbook on linear models and residual analysis - NIST handbook: linear regression.
  • Deng, A. et al. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data (CUPED) - practical use of regression in A/B testing.
  • For an intuition primer on the difference between correlation and a regression slope, see Correlation vs causation on this site.
