Train LightGBM with CRITEO on Databricks #556

miguelgfierro · 2019-02-21T15:39:32Z

Description

Train LightGBM with CRITEO on Databricks

Other Comments

Mockup code:

import pyspark.sql.functions as F
import pyspark.sql.types as T
 
from pyspark.ml.feature import (
  Imputer,
  OneHotEncoderEstimator,
  StringIndexer,
  VectorAssembler,
)
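from pyspark.ml.pipeline import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# LightGBMClassifier comes from MMLSpark; its import path depends on the installed version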
 
...
 
pipeline = Pipeline(stages=[
  Imputer(strategy='median',
          inputCols=features,
          outputCols=[f + '_imp' for f in features]),
  # LightGBM can handle categoricals directly: StringIndexer marks the indexed columns as categorical through column metadata
  *[StringIndexer(inputCol=f+'_na', outputCol=f+'_vec') for f in sparse_features + features],
  VectorAssembler(inputCols=[f + '_imp' for f in features] +
                            [f + '_vec' for f in sparse_features + features] +
                            app_features,
                  outputCol='features')
])
...
pipeline.fit(all).transform(all).write.saveAsTable('train')
 
model = LightGBMClassifier(featuresCol='features', 
                           labelCol='label', 
                           numIterations=classifier_lightgbm_iterations, 
                           numLeaves=64,
                           isUnbalance=True)
 
grid = (ParamGridBuilder()
  # .addGrid(model.numLeaves, [31]) # for now it's good enough
  .build())
 
evaluator = BinaryClassificationEvaluator(labelCol='label')
 
cv = CrossValidator(estimator=model, estimatorParamMaps=grid, evaluator=evaluator, numFolds=n_folds)
model = cv.fit(train)


jreynolds01 · 2019-02-26T03:55:03Z

A first pass at this, based on the code above, is available here. The fit() calls for both the model and cv seem unstable: sometimes they succeed and sometimes they don't, likely because the cluster is running out of memory. Hopefully this helps. I need to work on a couple of other things tomorrow and will come back to this later this week.

imatiach-msft · 2019-03-05T04:24:37Z

@jreynolds01 What is your cluster configuration? Did you make sure to disable dynamic allocation?
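For reference, a minimal sketch of what disabling dynamic allocation could look like. The property names `spark.dynamicAllocation.enabled` and `spark.executor.instances` are standard Spark settings. The helper names and the executor count are hypothetical and not taken from this thread. How the settings get applied (cluster Spark config, `spark-submit` flags, or session builder) depends on the environment.

```python
def static_allocation_conf(num_executors):
    """Return Spark settings that pin the executor count (hypothetical helper)."""
    return {
        # Stop Spark from adding or removing executors while a job runs.
        "spark.dynamicAllocation.enabled": "false",
        # Request a fixed number of executors instead.
        "spark.executor.instances": str(num_executors),
    }


def as_submit_args(conf):
    """Render the settings as spark-submit --conf flags, sorted by key."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


# Example: 4 executors is an arbitrary illustrative value.
submit_args = as_submit_args(static_allocation_conf(4))
print(" ".join(submit_args))
```

The same key/value pairs could instead be passed to `SparkSession.builder.config(key, value)` before the session is created.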

miguelgfierro · 2019-04-02T12:12:13Z

done in #680

emnajaoua · 2019-07-08T13:29:27Z

Hi everyone, I am trying to run LightGBMClassifier with my own dataset, but I always get this error: java.net.ConnectException: Connection refused (Connection refused). I have described my issue in detail here: https://github.com/Azure/mmlspark/issues/609
I would be very grateful if someone could take a look at it!
Thank you in advance.

gramhagen · 2019-07-08T13:37:02Z

Hi @emnajaoua, let's track this issue in a single place. I think mmlspark is the right place for the discussion. I will add comments there.

miguelgfierro self-assigned this Feb 21, 2019

miguelgfierro mentioned this issue Feb 21, 2019

LightGBM Scenario: Databricks + MMLSpark + LightGBM distributed pattern, o16n on AKS + CRITEO #524

Closed

yueguoguo added data algorithm labels Feb 27, 2019

miguelgfierro closed this as completed Apr 2, 2019
