Our Blog

Description

The goal of this tutorial is to get R users acquainted with the syntax and methods of the H2O package, in the context of fitting a simple linear regression model.

Data

The used dataset is “Boston Housing Data”, maintained at Carnegie Mellon University. The data can be found HERE.

Features

• CRIM: per capita crime rate by town

• ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

• INDUS: proportion of non-retail business acres per town

• CHAS: Charles River dummy variable, equals 1 if tract bounds river; 0 otherwise

• NOX: nitric oxides concentration (parts per 10 million)

• RM: average number of rooms per dwelling

• AGE: proportion of owner-occupied units built prior to 1940

• DIS: weighted distances to five Boston employment centres

• TAX: full-value property-tax rate per 10,000

• PTRATIO: pupil-teacher ratio by town

• B: 1000⋅(Bk−0.63)21000⋅(Bk−0.63)2 where BkBk is the proportion of blacks by town

• LSTAT: % lower status of the population

• MEDV: Median value of owner-occupied homes in 1000$Analysis Objective The main goal of this analysis is to estimate the effect different parameters have on the median value of homes. The dataset will be imported into an R dataframe, which is then converted to an H2O dataframe for model fitting. The following packages are used: library(h2o) library(magrittr) library(dplyr) library(lattice) Initialization Launch an H20 cluster with the default settings h2o.init() When the connection is complete, a “Connection successful!” message will appear along, with information regarding the cluster, which can be reproduced using the command: h2o.clusterInfo() Analysis Fetch data tbl <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data", header = FALSE) names_tbl <- c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "b", "lstat", "medv") names(tbl) <- names_tbl Before converting to an H2O dataframe, let’s get familiar with the data. apply(tbl, 2, class) ## crim zn indus chas nox rm age ## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" ## dis rad tax ptratio b lstat medv ## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" apply(tbl, 2, n_distinct) ## crim zn indus chas nox rm age dis rad ## 504 26 76 2 81 446 356 412 9 ## tax ptratio b lstat medv ## 66 46 357 455 229 Currently all columns are of numeric class, so factor columns must be changed before model fitting. Indeed, we see that the categorical * {CHAS} column has only 2 distinct values (small sanity check). A look at the response variable medv_dens <- density(x = tbl$medv)
hist(tbl$medv, xlab = "Median Value", freq = FALSE, main = "Median Value Histogram (density)") lines(medv_dens$y ~ medv_dens$x, col = "blue") The right skewness of the data suggests a log transformation may be beneficial. However, for the sake of interpretation, we shall leave the data on its original scale. Now let’s have a look at the median values as a function of each of the variables: plot_medv_by <- function(index, tbl) { plot(tbl$medv ~ tbl[, index], main = names(tbl)[index], ylab = "Median_Value",
xlab = names(tbl)[index])
}

par(mfrow = c(2, 3))
lapply(1:6, FUN = plot_medv_by, tbl = tbl)

par(mfrow = c(3, 3))
lapply(7:13, FUN = plot_medv_by, tbl = tbl)

• There seems to be a clear linear connection between number of rooms rm and medv. A positive correlation between the two makes sense, as we would expect prices to rise the more rooms we have. The numeric influence of number of rooms will be more evident once we fit a model.

• Prices are a decreasing function of lstat (percentage of lower status of the population). Again, something we could predict on our own, since the “better” the neighborhood, the higher the prices.

• There seems to be a negative relationship between prices and ptratio, an understandable result as better schools have lower pupil-teacher ratios (an appealing feature for parents) and subsequently raising prices.

• We can see that tracts with a high percentage of units built before 1940 display lower prices. This seems reasonable due to the fact that “new” neighborhoods are always pricier, partially since buildings and infrastructures are newer, in contrast to “old” neighborhoods that are less desirable in the market.

• Understandably, prices tend to be higher where the crime rates are lower.

Now let’s have a look at the correlation variables. First, which are the variables with the “strongest” correlation (say, above 0.5), with * {medv}.

cor_tbl <- cor(tbl)
dimnames(cor_tbl)[[1]][abs(cor_tbl[ ,14]) > 0.5 ]
## [1] "rm"      "ptratio" "lstat"   "medv"

Indeed, the observations we made based on the scatterplots were right, and there seems to be a linear relationship worth looking into. Now, let’s see the correlation between the explanatory variables

cor_tbl %>% levelplot()

Now let’s move on to fitting the model. First, convert to H2O dataframe

tbl_h <- as.h2o(tbl)
class(tbl_h)
## [1] "H2OFrame"

H2O dataframe columns may be accessed using the ‘$’, as in R dataframes. Factorize appropriate variables and verify tbl_h$chas <- as.factor(tbl_h$chas) is.factor(tbl_h$chas)
## [1] TRUE

Applying the summarysummary function to factor columns, a count of each level is returned. For instance, the chaschas column contains a two level factor, giving us

summary(tbl_h$chas) ## chas ## 0:471 ## 1: 35 whereas summarizing the numeric column medvmedv gives us quantiles, means and extreme points summary(tbl_h$medv)
##  medv
##  Min.   : 5.00
##  1st Qu.:16.99
##  Median :21.20
##  Mean   :22.53
##  3rd Qu.:24.98
##  Max.   :50.00

In order to see the levels of a categorical variable, use

h2o.levels(tbl_h$chas) ## [1] "0" "1" When building a model that contains categorical variables, H2O automatically treats each categorical column as a set of sparse binary vectors, so there is no need to modify the object. It is convenient to split the data into training and test sets according to a certain ratio using the h2o.splitFrame()h2o.splitFrame() function, giving us tbl_h_split <- h2o.splitFrame(data = tbl_h, ratios = 0.85) tbl_h_train <- tbl_h_split[[1]] tbl_h_test <- tbl_h_split[[2]] In order to fit a model, we must define the response and independent variable names y_var <- "medv" x_vars <- names_tbl[!names_tbl %in% y_var] Fit a simple linear regression model (Guassian errors) and display results med_val_glm <- h2o.glm(training_frame = tbl_h_train, validation_frame = tbl_h_test, x = x_vars, y = y_var, family = "gaussian", lambda = 0, missing_values_handling = "Skip", standardize = TRUE) summary(med_val_glm) ## Model Details: ## ============== ## ## H2ORegressionModel: glm ## Model Key: GLM_model_R_1465303808376_35 ## GLM Model: summary ## family link regularization number_of_predictors_total ## 1 gaussian identity None 13 ## number_of_active_predictors number_of_iterations training_frame ## 1 13 0 RTMP_sid_b934_6 ## ## H2ORegressionMetrics: glm ## ** Reported on training data. ** ## ## MSE: 22.10204 ## R2 : 0.7337485 ## Mean Residual Deviance : 22.10204 ## Null Deviance :35363.05 ## Null D.o.F. :425 ## Residual Deviance :9415.468 ## Residual D.o.F. :412 ## AIC :2557.691 ## ## ## H2ORegressionMetrics: glm ## ** Reported on validation data. ** ## ## MSE: 21.80702 ## R2 : 0.7620231 ## Mean Residual Deviance : 21.80702 ## Null Deviance :7357.455 ## Null D.o.F. :79 ## Residual Deviance :1744.562 ## Residual D.o.F. :66 ## AIC :503.6087 ## ## ## ## ## Scoring History: ## data frame with 0 columns and 0 rows ## ## Variable Importance: (Extract with h2o.varimp) ## ================================================= ## ## Standardized Coefficient Magnitudes: standardized coefficient magnitudes ## names coefficients sign ## 1 lstat 3.942425 NEG ## 2 dis 3.407899 NEG ## 3 rad 3.053269 POS ## 4 chas.1 2.963556 POS ## 5 rm 2.311217 POS ## 6 tax 2.300871 NEG ## 7 ptratio 2.028037 NEG ## 8 nox 1.976243 NEG ## 9 zn 1.165615 POS ## 10 b 1.076567 POS ## 11 crim 0.946399 NEG ## 12 age 0.105046 NEG ## 13 indus 0.009465 POS We can easily apply the resulting model on our test data, and access the summary statistics med_val_glm_w_test <- h2o.glm(training_frame = tbl_h_train, validation_frame = tbl_h_test, x = x_vars, y = y_var, family = "gaussian", lambda = 0, missing_values_handling = "Skip", standardize = TRUE) Standardized Coefficients In the above, we calculate the standardized coefficients by setting the standardize parameter to TRUE. Technically, one way of achieving this is : prior to forming the regression equation, all variables (response and explanatory) are standardized by subtracting the variable mean from each observation and dividing by the variable’s standard deviation. The resulting standardized coefficients represent the change in response to a change of one standard deviation in a predictor, and ignore the independent variable’s scale of unit. Standardization is thought of as a means of bringing coefficients to the same scale, thus enabling us to interpret their importance in relation to another, however we must be careful in our interpretation - standardization does not add information. The original coefficients are estimates of the relationships between the explanatory variables and the response variable. However, the standardized variables are measures of this relationship, as well as of the variance of the independent variable. We must carefully define our objective, deciding whether we are interested in the relationship between the response and predictors, or in the variance of the predictor variables. We might not want both. It is important to make sure that it makes sense to compare two explanatory variables on a logical level, before applying methods to the outcome in order to do so numerically. Display model coefficients med_val_glm_w_test@model$coefficients
##     Intercept        chas.1          crim            zn         indus
##  39.482019217   2.963556089  -0.120149330   0.051519461   0.001392715
##           nox            rm           age           dis           rad
## -16.951668010   3.340159314  -0.003744536  -1.660405145   0.345959672
##           tax       ptratio             b         lstat
##  -0.013585531  -0.929616665   0.011181044  -0.545957928
med_val_glm_w_test@model$coefficients ## Intercept chas.1 crim zn indus ## 39.482019217 2.963556089 -0.120149330 0.051519461 0.001392715 ## nox rm age dis rad ## -16.951668010 3.340159314 -0.003744536 -1.660405145 0.345959672 ## tax ptratio b lstat ## -0.013585531 -0.929616665 0.011181044 -0.545957928 Or more conveniently in a scatterplot. *Intercept omitted since its value is much larger than the rest of the coefficients’, and we are interested in seeing the differences between the non-intercept coefficients. coeffs <- med_val_glm_w_test@model$coefficients
plot(coeffs[-1], ylim = c(min(coeffs[-1] - 1), max(coeffs[-1] + 2)), ylab = "Coefficient Value")
text(1:length(coeffs[-1]), coeffs[-1], names(coeffs[-1]), pos = 3, cex = 0.6)

med_val_glm_w_test@model$coefficients ## Intercept chas.1 crim zn indus ## 39.482019217 2.963556089 -0.120149330 0.051519461 0.001392715 ## nox rm age dis rad ## -16.951668010 3.340159314 -0.003744536 -1.660405145 0.345959672 ## tax ptratio b lstat ## -0.013585531 -0.929616665 0.011181044 -0.545957928 Note that as in lm objects, the first level coefficient of a categorical value is included in the intercept, so for n levels of a certain variable, there will in fact be n-1 coefficients described in the model. Display training and validation metrics med_val_glm_w_test@model$training_metrics
## H2ORegressionMetrics: glm
## ** Reported on training data. **
##
## MSE:  22.10204
## R2 :  0.7337485
## Mean Residual Deviance :  22.10204
## Null Deviance :35363.05
## Null D.o.F. :425
## Residual Deviance :9415.468
## Residual D.o.F. :412
## AIC :2557.691
med_val_glm_w_test@model$validation_metrics ## H2ORegressionMetrics: glm ## ** Reported on validation data. ** ## ## MSE: 21.80702 ## R2 : 0.7620231 ## Mean Residual Deviance : 21.80702 ## Null Deviance :7357.455 ## Null D.o.F. :79 ## Residual Deviance :1744.562 ## Residual D.o.F. :66 ## AIC :503.6087 Or, we can extract single model fit statistics on the test or validation set h2o.mse(med_val_glm_w_test, valid = TRUE) ## [1] 21.80702 h2o.mse(med_val_glm_w_test, valid = FALSE) ## [1] 22.10204 h2o.r2(med_val_glm, valid = TRUE) ## [1] 0.7620231 h2o.r2(med_val_glm, valid = FALSE) ## [1] 0.7337485 When r2=R−Squaredr2=R−Squared, the proportion of the variance in the response variable that is explained by the independent variables, and MSEMSE is the Mean Squared Error, the average of the diversions of the predictions from the actual observations 1n∑ni=1(yi^−yi)21n∑i=1n(yi^−yi)2. R−SquaredR−Squared and MSEMSE values come in handy when comparing different fitted models, however, R−SquaredR−Squared cannot determine whether the coefficient estimates and predictions are biased, which is why we must assess the residual plots. A desirable model is one that better captures the response’s variation (higher R−SquaredR−Squared), and on the other hand minimizes prediction errors on the training set (lower MSEMSE). For instance, when performing stepwise regression, a decision whether to add/omit a variable may be made based on the values of these statistics. Predictions are calculated in a familiar way tbl_fit <- h2o.predict(med_val_glm, newdata = tbl_h) %>% as.data.frame() A quick look at the distribution of our residuals resids <- tbl$medv - tbl_fit$predict plot(resids, ylab = "Residuals") hist(resids, main = "Residuals", xlab = "Residuals") Our aim is to end up with residuals that follow a normal distribution. Recalling that our underlying assumption in this regression is that the median values (real ones, not sample medians), in matrix representation, are of the form Y=Xβ+ϵY=Xβ+ϵ where ϵ∼N(0,σ2)ϵ∼N(0,σ2). If we are to accept our model, we must make sure that the residuals (which are the “sample ϵϵ’s”) are indeed normally distributed. Another perspective is that if the residuals form a shapeless cloud around the zero, then our coefficients aren’t biased, and there is no trend in our residuals as a function of the coefficients. We are able to plot our predictions against the observations to assess the accuracy of our model plot(tbl_fit$predict ~ tbl\$medv, main = "Predictions ~ Observations", ylab = "prediction",
xlab = "obs")
abline(a = 0, b = 1, col = "red")

We see that our model suffers from underfitting in the higher valued observations (where we have less observations), whilst better capturing the data where there is a larger bulk of observations to learn from. Note that no transformations were applied to neither the response or explanatory variables. Transforming variables may result in more linear relations, helping us build a better fitting linear model that captures the data more accurately.

When our work is done, we must shut down H2O cluster

h2o.shutdown()

LIKE THIS POST? SHARE THE ARTICLE