Linear Regression in H2O

Description

The goal of this tutorial is to get R users acquainted with the syntax and methods of the H2O package, in the context of fitting a simple linear regression model.

Data

The dataset used is the “Boston Housing Data”, maintained at Carnegie Mellon University. The data can be downloaded from the UCI Machine Learning Repository URL used in the Analysis section below.

Features

  • CRIM: per capita crime rate by town

  • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

  • INDUS: proportion of non-retail business acres per town

  • CHAS: Charles River dummy variable, equals 1 if tract bounds river; 0 otherwise

  • NOX: nitric oxides concentration (parts per 10 million)

  • RM: average number of rooms per dwelling

  • AGE: proportion of owner-occupied units built prior to 1940

  • DIS: weighted distances to five Boston employment centres

  • RAD: index of accessibility to radial highways

  • TAX: full-value property-tax rate per $10,000

  • PTRATIO: pupil-teacher ratio by town

  • B: $1000(B_k - 0.63)^2$ where $B_k$ is the proportion of blacks by town

  • LSTAT: % lower status of the population

  • MEDV: median value of owner-occupied homes in $1000s

Analysis Objective

The main goal of this analysis is to estimate the effect the different features have on the median value of homes.

The dataset will be imported into an R dataframe, which is then converted to an H2O dataframe for model fitting.

The following packages are used:

library(h2o)
library(magrittr)
library(dplyr)
library(lattice)

Initialization

Launch an H2O cluster with the default settings

h2o.init()

When the connection is complete, a “Connection successful!” message will appear, along with information regarding the cluster, which can be reproduced using the command:

h2o.clusterInfo() 

Analysis

Fetch data

tbl <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data", 
    header = FALSE)
names_tbl <- c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", 
    "tax", "ptratio", "b", "lstat", "medv")
names(tbl) <- names_tbl

Before converting to an H2O dataframe, let’s get familiar with the data.

apply(tbl, 2, class)
##      crim        zn     indus      chas       nox        rm       age 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##       dis       rad       tax   ptratio         b     lstat      medv 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
apply(tbl, 2, n_distinct)
##    crim      zn   indus    chas     nox      rm     age     dis     rad 
##     504      26      76       2      81     446     356     412       9 
##     tax ptratio       b   lstat    medv 
##      66      46     357     455     229

Currently all columns are of class numeric, so columns that should be factors must be converted before model fitting. Indeed, we see that the categorical chas column has only 2 distinct values (a small sanity check).

A look at the response variable

medv_dens <- density(x = tbl$medv)
hist(tbl$medv, xlab = "Median Value", freq = FALSE, main = "Median Value Histogram (density)")
lines(medv_dens$y ~ medv_dens$x, col = "blue")

The right skewness of the data suggests a log transformation may be beneficial. However, for the sake of interpretation, we shall leave the data on its original scale.
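
As a quick side check of that remark (purely for comparison; the analysis below stays on the original scale), the same histogram-plus-density code can be applied to the log of the response:

# Illustration only: inspect the distribution of the log-transformed response
log_medv_dens <- density(log(tbl$medv))
hist(log(tbl$medv), xlab = "log(Median Value)", freq = FALSE, main = "Log Median Value Histogram (density)")
lines(log_medv_dens$y ~ log_medv_dens$x, col = "blue")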

Now let’s have a look at the median values as a function of each of the variables:

plot_medv_by <- function(index, tbl) {
    plot(tbl$medv ~ tbl[, index], main = names(tbl)[index], ylab = "Median_Value", 
        xlab = names(tbl)[index])
}

par(mfrow = c(2, 3))
lapply(1:6, FUN = plot_medv_by, tbl = tbl)

par(mfrow = c(3, 3))
lapply(7:13, FUN = plot_medv_by, tbl = tbl)

  • There seems to be a clear linear connection between the number of rooms rm and medv. A positive correlation between the two makes sense, as we would expect prices to rise with the number of rooms. The numeric influence of the number of rooms will become more evident once we fit a model.

  • Prices are a decreasing function of lstat (percentage of lower status of the population). Again, something we could predict on our own, since the “better” the neighborhood, the higher the prices.

  • There seems to be a negative relationship between prices and ptratio, an understandable result, as better schools have lower pupil-teacher ratios (an appealing feature for parents), which in turn raises prices.

  • We can see that tracts with a high percentage of units built before 1940 display lower prices. This seems reasonable, since “new” neighborhoods tend to be pricier, partly because buildings and infrastructure are newer, in contrast to “old” neighborhoods, which are less desirable on the market.

  • Understandably, prices tend to be higher where the crime rates are lower.

Now let’s have a look at the correlations between the variables. First, which variables have the “strongest” correlation (say, above 0.5 in absolute value) with medv?

cor_tbl <- cor(tbl)
dimnames(cor_tbl)[[1]][abs(cor_tbl[ ,14]) > 0.5 ]
## [1] "rm"      "ptratio" "lstat"   "medv"

Indeed, the observations we made based on the scatterplots were right, and there seems to be a linear relationship worth looking into. Now, let’s see the correlations between the explanatory variables:

cor_tbl %>% levelplot()

Now let’s move on to fitting the model. First, convert to an H2O dataframe:

tbl_h <- as.h2o(tbl)
class(tbl_h)
## [1] "H2OFrame"

H2O dataframe columns may be accessed using the ‘$’ operator, as in R dataframes.
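
For example (a quick check only):

# Access a single column with `$`, just like on a regular R dataframe
head(tbl_h$medv)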

Factorize appropriate variables and verify

tbl_h$chas <- as.factor(tbl_h$chas)
is.factor(tbl_h$chas)
## [1] TRUE

When the summary function is applied to a factor column, a count of each level is returned. For instance, the chas column contains a two-level factor, giving us

summary(tbl_h$chas)
##  chas  
##  0:471 
##  1: 35

whereas summarizing the numeric column medv gives us quantiles, the mean and extreme points

summary(tbl_h$medv)
##  medv           
##  Min.   : 5.00  
##  1st Qu.:16.99  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:24.98  
##  Max.   :50.00

In order to see the levels of a categorical variable, use

h2o.levels(tbl_h$chas)
## [1] "0" "1"

When building a model that contains categorical variables, H2O automatically treats each categorical column as a set of sparse binary vectors, so there is no need to modify the object.
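
To see what such an encoding looks like (base R only, purely for intuition; this is not an H2O call and is not required for the model), the equivalent dummy column can be built with model.matrix:

# Illustration: the binary indicator column H2O builds internally for chas
tbl_factor <- tbl
tbl_factor$chas <- factor(tbl_factor$chas)
head(model.matrix(~ chas, data = tbl_factor))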

It is convenient to split the data into training and test sets according to a certain ratio using the h2o.splitFrame() function, giving us

tbl_h_split <- h2o.splitFrame(data = tbl_h, ratios = 0.85)
tbl_h_train <- tbl_h_split[[1]]
tbl_h_test <- tbl_h_split[[2]]

In order to fit a model, we must define the response and independent variable names

y_var <- "medv"
x_vars <- names_tbl[!names_tbl %in% y_var]

Fit a simple linear regression model (Gaussian errors) and display the results

med_val_glm <- h2o.glm(training_frame = tbl_h_train, validation_frame = tbl_h_test, 
    x = x_vars, y = y_var, family = "gaussian", lambda = 0, missing_values_handling = "Skip", 
    standardize = TRUE)
summary(med_val_glm)
## Model Details:
## ==============
## 
## H2ORegressionModel: glm
## Model Key:  GLM_model_R_1465303808376_35 
## GLM Model: summary
##     family     link regularization number_of_predictors_total
## 1 gaussian identity           None                         13
##   number_of_active_predictors number_of_iterations  training_frame
## 1                          13                    0 RTMP_sid_b934_6
## 
## H2ORegressionMetrics: glm
## ** Reported on training data. **
## 
## MSE:  22.10204
## R2 :  0.7337485
## Mean Residual Deviance :  22.10204
## Null Deviance :35363.05
## Null D.o.F. :425
## Residual Deviance :9415.468
## Residual D.o.F. :412
## AIC :2557.691
## 
## 
## H2ORegressionMetrics: glm
## ** Reported on validation data. **
## 
## MSE:  21.80702
## R2 :  0.7620231
## Mean Residual Deviance :  21.80702
## Null Deviance :7357.455
## Null D.o.F. :79
## Residual Deviance :1744.562
## Residual D.o.F. :66
## AIC :503.6087
## 
## 
## 
## 
## Scoring History: 
## data frame with 0 columns and 0 rows
## 
## Variable Importance: (Extract with `h2o.varimp`) 
## =================================================
## 
## Standardized Coefficient Magnitudes: standardized coefficient magnitudes
##      names coefficients sign
## 1    lstat     3.942425  NEG
## 2      dis     3.407899  NEG
## 3      rad     3.053269  POS
## 4   chas.1     2.963556  POS
## 5       rm     2.311217  POS
## 6      tax     2.300871  NEG
## 7  ptratio     2.028037  NEG
## 8      nox     1.976243  NEG
## 9       zn     1.165615  POS
## 10       b     1.076567  POS
## 11    crim     0.946399  NEG
## 12     age     0.105046  NEG
## 13   indus     0.009465  POS
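
The standardized coefficient magnitudes at the bottom of the summary can also be extracted directly, as the output itself suggests:

# Variable importance table (standardized coefficient magnitudes)
h2o.varimp(med_val_glm)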

We can easily apply the resulting model to our test data by passing it as the validation frame, and then access the summary statistics

med_val_glm_w_test <- h2o.glm(training_frame = tbl_h_train, validation_frame = tbl_h_test, 
    x = x_vars, y = y_var, family = "gaussian", lambda = 0, missing_values_handling = "Skip", 
    standardize = TRUE)

Standardized Coefficients

In the above, we obtain the standardized coefficients by setting the standardize parameter to TRUE. Technically, one way of achieving this is the following: prior to forming the regression equation, all variables (response and explanatory) are standardized by subtracting the variable’s mean from each observation and dividing by the variable’s standard deviation. The resulting standardized coefficients represent the change in the response per one-standard-deviation change in a predictor, and thus ignore the independent variable’s units of measurement.
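
As a rough illustration of that procedure (base R with lm, shown only to clarify the idea; this is not how H2O computes its values, so the numbers need not match exactly):

# Illustration only: standardize the numeric variables, then refit with lm
tbl_std <- as.data.frame(scale(tbl[, names(tbl) != "chas"]))
tbl_std$chas <- factor(tbl$chas)
coef(lm(medv ~ ., data = tbl_std))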

Standardization is often thought of as a means of bringing coefficients to the same scale, enabling us to interpret their importance relative to one another. However, we must be careful in our interpretation: standardization does not add information.

The original coefficients are estimates of the relationships between the explanatory variables and the response variable. The standardized coefficients, however, measure this relationship together with the variance of the corresponding independent variable.

We must carefully define our objective, deciding whether we are interested in the relationship between the response and predictors, or in the variance of the predictor variables. We might not want both.

Before comparing two explanatory variables numerically, it is important to make sure the comparison makes sense on a logical level.

Display model coefficients

med_val_glm_w_test@model$coefficients
##     Intercept        chas.1          crim            zn         indus 
##  39.482019217   2.963556089  -0.120149330   0.051519461   0.001392715 
##           nox            rm           age           dis           rad 
## -16.951668010   3.340159314  -0.003744536  -1.660405145   0.345959672 
##           tax       ptratio             b         lstat 
##  -0.013585531  -0.929616665   0.011181044  -0.545957928

Or, more conveniently, in a scatterplot. The intercept is omitted since its value is much larger than the rest of the coefficients’, and we are interested in seeing the differences between the non-intercept coefficients.

coeffs <- med_val_glm_w_test@model$coefficients
plot(coeffs[-1], ylim = c(min(coeffs[-1] - 1), max(coeffs[-1] + 2)), ylab = "Coefficient Value")
text(1:length(coeffs[-1]), coeffs[-1], names(coeffs[-1]), pos = 3, cex = 0.6)

Note that, as with lm objects, the first-level coefficient of a categorical variable is absorbed into the intercept, so for n levels of a given variable there will in fact be n-1 coefficients in the model.
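
A quick check of this for the two-level chas variable (illustrative only):

# chas has two levels, but only chas.1 gets a coefficient;
# level "0" is absorbed into the intercept
h2o.levels(tbl_h$chas)
grep("chas", names(med_val_glm_w_test@model$coefficients), value = TRUE)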

Display training and validation metrics

med_val_glm_w_test@model$training_metrics
## H2ORegressionMetrics: glm
## ** Reported on training data. **
## 
## MSE:  22.10204
## R2 :  0.7337485
## Mean Residual Deviance :  22.10204
## Null Deviance :35363.05
## Null D.o.F. :425
## Residual Deviance :9415.468
## Residual D.o.F. :412
## AIC :2557.691
med_val_glm_w_test@model$validation_metrics
## H2ORegressionMetrics: glm
## ** Reported on validation data. **
## 
## MSE:  21.80702
## R2 :  0.7620231
## Mean Residual Deviance :  21.80702
## Null Deviance :7357.455
## Null D.o.F. :79
## Residual Deviance :1744.562
## Residual D.o.F. :66
## AIC :503.6087

Or, we can extract individual model fit statistics on the training or validation set

h2o.mse(med_val_glm_w_test, valid = TRUE)
## [1] 21.80702
h2o.mse(med_val_glm_w_test, valid = FALSE)
## [1] 22.10204
h2o.r2(med_val_glm, valid = TRUE)
## [1] 0.7620231
h2o.r2(med_val_glm, valid = FALSE)
## [1] 0.7337485

Here $R^2$ (R-squared) is the proportion of the variance in the response variable that is explained by the independent variables, and MSE is the Mean Squared Error, the average of the squared deviations of the predictions from the actual observations, $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$.
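
As a small sanity check (plain arithmetic on the numbers reported in the training metrics above, not an H2O call), both statistics can be recovered from the deviances:

# Recover the reported training MSE and R2 from the deviance values
null_dev <- 35363.05    # Null Deviance (total sum of squares)
resid_dev <- 9415.468   # Residual Deviance (residual sum of squares)
n_train <- 425 + 1      # Null D.o.F. + 1 = number of training rows used
resid_dev / n_train       # approximately 22.10, the reported training MSE
1 - resid_dev / null_dev  # approximately 0.734, the reported training R2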

R-squared and MSE values come in handy when comparing different fitted models. However, R-squared cannot tell us whether the coefficient estimates and predictions are biased, which is why we must also assess the residual plots. A desirable model is one that captures more of the response’s variation (higher R-squared) while minimizing prediction errors on the training set (lower MSE). For instance, when performing stepwise regression, a decision on whether to add or omit a variable may be based on the values of these statistics.
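
A minimal sketch of such a comparison (illustrative only; the choice of dropped variables and the object names are assumptions made for the example) refits the model without two of the weaker predictors and compares the validation metrics:

# Sketch: drop two predictors with small standardized coefficients
# and compare validation metrics against the full model
x_vars_reduced <- setdiff(x_vars, c("indus", "age"))
med_val_glm_reduced <- h2o.glm(training_frame = tbl_h_train, validation_frame = tbl_h_test, 
    x = x_vars_reduced, y = y_var, family = "gaussian", lambda = 0)
h2o.mse(med_val_glm_reduced, valid = TRUE)
h2o.r2(med_val_glm_reduced, valid = TRUE)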

Predictions are calculated in a familiar way

tbl_fit <- h2o.predict(med_val_glm, newdata = tbl_h) %>% as.data.frame()

A quick look at the distribution of our residuals

resids <- tbl$medv - tbl_fit$predict
plot(resids, ylab = "Residuals")

hist(resids, main = "Residuals", xlab = "Residuals")

Our aim is to end up with residuals that follow a normal distribution. Recall that the underlying assumption in this regression is that the median values (the real ones, not sample medians) are, in matrix representation, of the form

$$Y = X\beta + \epsilon$$

where $\epsilon \sim N(0, \sigma^2)$. If we are to accept our model, we must make sure that the residuals (the “sample $\epsilon$’s”) are indeed approximately normally distributed. Another perspective is that if the residuals form a shapeless cloud around zero, then our coefficient estimates are not biased, and there is no remaining trend in the residuals as a function of the fitted values.
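
A common way to eyeball the normality assumption (an extra check, not part of the original workflow) is a normal Q-Q plot of the residuals:

# Normal Q-Q plot: points lying close to the line support the normality assumption
qqnorm(resids)
qqline(resids, col = "red")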

We are able to plot our predictions against the observations to assess the accuracy of our model

plot(tbl_fit$predict ~ tbl$medv, main = "Predictions ~ Observations", ylab = "prediction", 
    xlab = "obs")
abline(a = 0, b = 1, col = "red")

We see that our model underfits the higher-valued observations (where we have fewer observations), while better capturing the data where there is a larger bulk of observations to learn from. Note that no transformations were applied to either the response or the explanatory variables. Transforming variables may result in more linear relationships, helping us build a better-fitting linear model that captures the data more accurately.
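
As a sketch of what such a transformation could look like (illustrative only; the log_medv column and the model object below are introduced here, and their results are not reported in this post):

# Sketch: refit the model with a log-transformed response
tbl_h$log_medv <- log(tbl_h$medv)
tbl_h_split_log <- h2o.splitFrame(data = tbl_h, ratios = 0.85)
med_val_glm_log <- h2o.glm(training_frame = tbl_h_split_log[[1]], validation_frame = tbl_h_split_log[[2]], 
    x = x_vars, y = "log_medv", family = "gaussian", lambda = 0)
h2o.r2(med_val_glm_log, valid = TRUE)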

When our work is done, we must shut down the H2O cluster

h2o.shutdown()
