Skip to content

Latest commit

 

History

History
118 lines (98 loc) · 7.4 KB

README.md

File metadata and controls

118 lines (98 loc) · 7.4 KB

Sparkling Water for R

This is a proof of concept extension package for sparkapi / sparklyr that demonstrates creating an R front-end for a Spark package (in this case Sparkling Water from H2O).

This package implements only the most basic functionality (creating an H2OContext, showing the H2O Flow interface, and converting a Spark DataFrame to an H2O Frame). Note that the package won't be developed further since it's just a demonstration.

Connecting to Spark

First we connect to Spark. The call to library(sparklingwater) will make the H2O functions available on the R search path and will also ensure that the dependencies required by the Sparkling Water package are included when we connect to Spark.

library(sparklyr)
library(sparklingwater)
sc <- spark_connect(master = "local")

H2O Context and Flow

The call to library(sparklingwater) automatically registered the Sparkling Water extension, which in turn specified that the Sparkling Water Spark package should be made available for Spark connections. Let's inspect the H2OContext for our Spark connection:

h2o_context(sc)
## <jobj[6]>
##   class org.apache.spark.h2o.H2OContext
##   
## Sparkling Water Context:
##  * H2O name: sparkling-water-jjallaire_-1482215501
##  * number of executors: 1
##  * list of used executors:
##   (executorId, host, port)
##   ------------------------
##   (driver,localhost,54323)
##   ------------------------
## 
##   Open H2O Flow in browser: http://127.0.0.1:54323 (CMD + click in Mac OSX)
## 

We can also view the H2O Flow web UI:

h2o_flow(sc)

H2O with Spark DataFrames

Let's copy the mtcars dataset to to Spark so we can access it from Sparkling Water:

library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl
## Source:   query [?? x 11]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
## 2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
## 3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
## 4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
## 5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
## 6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
## 7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
## 8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
## 9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
## 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
## # ... with more rows

The use case we'd like to enable is calling the H2O algorithms and feature transformers directly on Spark DataFrames that we've manipulated with dplyr. This is indeed supported by the Sparkling Water package. Here though we'll just convert the Spark DataFrame into an H2O Frame to prove that it's possible:

mtcars_hf <- h2o_frame(mtcars_tbl)
mtcars_hf
## <jobj[103]>
##   class water.fvec.H2OFrame
##   Frame frame_rdd_39 (32 rows and 11 cols):
##                        mpg  cyl                disp   hp                drat                  wt                qsec  vs  am  gear  carb
##     min               10.4    4                71.1   52                2.76               1.513                14.5   0   0     3     1
##    mean          20.090625    6          230.721875  146           3.5965625             3.21725  17.848750000000003   0   0     3     2
##  stddev  6.026948052089104    1  123.93869383138194   68  0.5346787360709715  0.9784574429896966  1.7869432360968436   0   0     0     1
##     max               33.9    8               472.0  335                4.93               5.424                22.9   1   1     5     8
## missing                0.0    0                 0.0    0                 0.0                 0.0                 0.0   0   0     0     0
##       0               21.0    6               160.0  110                 3.9                2.62               16.46   0   1     4     4
##       1               21.0    6               160.0  110                 3.9               2.875               17.02   0   1     4     4
##       2               22.8    4               108.0   93                3.85                2.32               18.61   1   1     4     1
##       3               21.4    6               258.0  110                3.08               3.215               19.44   1   0     3     1
##       4               18.7    8               360.0  175                3.15                3.44               17.02   0   0     3     2
##       5               18.1    6               225.0  105                2.76                3.46               20.22   1   0     3     1
##       6               14.3    8               360.0  245                3.21                3.57               15.84   0   0     3     4
##       7               24.4    4               146.7   62                3.69                3.19                20.0   1   0     4     2
##       8               22.8    4               140.8   95                3.92                3.15                22.9   1   0     4     2
##       9               19.2    6               167.6  123                3.92                3.44                18.3   1   0     4     4
##      10               17.8    6               167.6  123                3.92                3.44                18.9   1   0     4     4
##      11               16.4    8               275.8  180                3.07                4.07                17.4   0   0     3     3
##      12               17.3    8               275.8  180                3.07                3.73                17.6   0   0     3     3
##      13               15.2    8               275.8  180                3.07                3.78                18.0   0   0     3     3
##      14               10.4    8               472.0  205                2.93                5.25               17.98   0   0     3     4
##      15               10.4    8               460.0  215                 3.0               5.424               17.82   0   0     3     4
##      16               14.7    8               440.0  230                3.23               5.345               17.42   0   0     3     4
##      17               32.4    4                78.7   66                4.08                 2.2               19.47   1   1     4     1
##      18               30.4    4                75.7   52                4.93               1.615               18.52   1   1     4     2
##      19               33.9    4                71.1   65                4.22               1.835                19.9   1   1     4     1

Now we disconnect from Spark, this will result in the H2OContext being stopped as well since it's owned by the spark shell process used by our Spark connection:

spark_disconnect(sc)