Train LightGBM with CRITEO on Databricks #556

Closed
miguelgfierro opened this issue Feb 21, 2019 · 5 comments

@miguelgfierro
Collaborator

Description

Train LightGBM with CRITEO on Databricks

Other Comments

Mockup code:

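import pyspark.sql.functions as F
import pyspark.sql.types as T

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import (
  Imputer,
  OneHotEncoderEstimator,
  StringIndexer,
  VectorAssembler,
)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Import path depends on the mmlspark version; newer releases expose it under mmlspark.lightgbm
from mmlspark import LightGBMClassifier
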
...
 
pipeline = Pipeline(stages=[
  Imputer(strategy='median',
          inputCols=features,
          outputCols=[f + '_imp' for f in features]),
  # LightGBM can handle categoricals directly: StringIndexer marks its output columns as categorical through column metadata
  *[StringIndexer(inputCol=f+'_na', outputCol=f+'_vec') for f in sparse_features + features],
  VectorAssembler(inputCols=[f + '_imp' for f in features] +
                            [f + '_vec' for f in sparse_features + features] +
                            app_features,
                  outputCol='features')
])
...
# saveAsTable lives on the DataFrameWriter, so go through .write
pipeline.fit(all).transform(all).write.saveAsTable('train')
 
model = LightGBMClassifier(featuresCol='features', 
                           labelCol='label', 
                           numIterations=classifier_lightgbm_iterations, 
                           numLeaves=64,
                           isUnbalance=True)
 
grid = (ParamGridBuilder()
  # .addGrid(model.numLeaves, [31]) # for now it's good enough
  .build())
 
evaluator = BinaryClassificationEvaluator(labelCol='label')
 
cv = CrossValidator(estimator=model, estimatorParamMaps=grid, evaluator=evaluator, numFolds=n_folds)
model = cv.fit(train)
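
As a rough illustration of what the empty grid above implies for training cost, the sketch below mimics grid expansion in plain Python and counts the resulting model fits. It assumes the usual cross-validation scheme of one fit per fold and parameter combination plus a final refit on the full data. The helper names `expand_grid` and `count_fits` and the fold count of 3 are hypothetical and not part of Spark or mmlspark.

```python
import itertools


def expand_grid(param_grid):
    """Return one dict per combination of parameter values (Cartesian product)."""
    keys = list(param_grid.keys())
    return [dict(zip(keys, combo))
            for combo in itertools.product(*param_grid.values())]


def count_fits(param_grid, n_folds):
    """One fit per (fold, combination), plus one final refit on all the data."""
    return len(expand_grid(param_grid)) * n_folds + 1


# An empty grid still expands to a single, empty parameter map.
empty_grid = expand_grid({})

# Hypothetical fold count of 3, for illustration only.
fits_empty = count_fits({}, n_folds=3)
fits_leaves = count_fits({"numLeaves": [31, 64]}, n_folds=3)

print(empty_grid, fits_empty, fits_leaves)
```

Even with nothing to tune, cross-validation under these assumptions trains the model several times, which is worth keeping in mind when sizing the cluster.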
@jreynolds01
Collaborator

jreynolds01 commented Feb 26, 2019

A first pass based on the code above is available here. The fit() calls for both the model and cv seem unstable: sometimes they succeed and sometimes they fail, likely because the cluster is running out of memory. Hopefully this helps. I need to work on a couple of other things tomorrow and will come back to this later this week.

@imatiach-msft
Collaborator

@jreynolds01 What is your cluster configuration? Did you make sure to disable dynamic allocation?
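
For reference, a minimal sketch of what checking that setting could look like. `spark.dynamicAllocation.enabled` is the standard Spark property; the helper names `render_spark_conf` and `dynamic_allocation_disabled` and the `cluster_conf` dict are hypothetical, and the sketch works on a plain dict rather than a live cluster.

```python
def render_spark_conf(conf):
    """Render settings as 'key value' lines, the layout used by spark-defaults.conf."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


def dynamic_allocation_disabled(conf):
    """True when the property is absent (Apache Spark's default is false) or set to 'false'."""
    value = conf.get("spark.dynamicAllocation.enabled", "false")
    return str(value).lower() == "false"


# Hypothetical cluster settings, for illustration only.
cluster_conf = {"spark.dynamicAllocation.enabled": "false"}
conf_text = render_spark_conf(cluster_conf)

print(conf_text)
print(dynamic_allocation_disabled(cluster_conf))
```

On a running cluster the same property can be read from the session with `spark.conf.get("spark.dynamicAllocation.enabled", "false")`, assuming `spark` is the active SparkSession.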

@miguelgfierro
Collaborator Author

done in #680

@emnajaoua

Hi everyone, I am trying to run LightGBMClassifier with my own dataset, but I always hit this error: java.net.ConnectException: Connection refused (Connection refused). I have described the issue in detail here: https://github.com/Azure/mmlspark/issues/609
I would be very grateful if someone could take a look at it!
Thank you in advance.

@gramhagen
Collaborator

Hi @emnajaoua, let's track this issue in a single place. I think mmlspark is the right place for the discussion. I will add comments there.
