[python-package] How do I use lgb.Dataset() with lgb.Predict() without using pandas df or np array? #6285

Closed
wil70 opened this issue Jan 22, 2024 · 5 comments

Comments

@wil70

wil70 commented Jan 22, 2024

Description

I'm trying Optuna and FLAML. I can train models (lgb.train) with Optuna using CSV and binary (.bin) files as the training and validation datasets. This is great, as the speed is good.
The problem is prediction (model.predict): I can't get good speed because I have to go through a pandas DataFrame or a NumPy array.
Is there a way to bypass those and use lgb.Dataset() directly?

Reproducible example

I have big datasets (CSV and binary). I would like to use them via lgb.Dataset('train.csv.bin') instead of a pandas DataFrame from pd.read_csv('train.csv'), 1) for speed and 2) for consistency with how LightGBM (CLI version) handles "na" and "+-inf", which pandas handles differently.

   
params = {
    "objective": "multiclass",
    #"metric": "multi_logloss,multi_error,auc_mu",
    "metric": "multi_error",
    "verbosity": -1,
    "boosting_type": "gbdt",
    "num_threads" : "10",
    "num_class" : "2",
    "ignore_column" : "1",
    "label_column" : "10",
    "categorical_feature":"8,9",
    "data" : 'train.csv.bin',
    "valid_data" : 'validate.csv.bin',
}

#model = lgb.train(
#    params,
#    dtrain,
#    valid_sets=[dval],
#    callbacks=[early_stopping(1), log_evaluation(100)],
#)

model.save_model("model.txt")

#dval = lgb.Dataset('test.csv')
dval = lgb.Dataset('validate.csv.bin', label=-1) #, params=params)
#val_data = pd.read_csv('validate.csv',header=None)

# Load the model from file
model = lgb.Booster(model_file='model.txt')

# Get the true labels
y_true = dval.get_label()

# Get the predicted probabilities
y_pred = model.predict(dval.get_data())
# **Error: Exception: Cannot get data before construct Dataset**
#y_pred = model.predict(dval.data)
#**Error: lightgbm.basic.LightGBMError: Unknown format of training data. Only CSV, TSV, and LibSVM (zero-based) formatted text files are supported.**

How can I achieve this? How do I specify that all columns are features except column 10 (the label), and that column 1 should be ignored?
I tried passing the params to lgb.Dataset, but that didn't do it.

Environment info

Win10 Pro + Python 3.12.0 + latest Optuna

LightGBM version or commit hash: Latest as of today

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

@wil70 wil70 changed the title [Python, Question] How to use lgbDataset with lgb.Predict without using pandas df or np array? [Python, Question] How to use lgb.Dataset() with lgb.Predict() without using pandas df or np array? Jan 22, 2024
@wil70 wil70 changed the title [Python, Question] How to use lgb.Dataset() with lgb.Predict() without using pandas df or np array? [Python, Question] How do I use lgb.Dataset() with lgb.Predict() without using pandas df or np array? Jan 22, 2024
@jameslamb jameslamb changed the title [Python, Question] How do I use lgb.Dataset() with lgb.Predict() without using pandas df or np array? [python-package] How do I use lgb.Dataset() with lgb.Predict() without using pandas df or np array? Jan 23, 2024
@wil70
Author

wil70 commented Jan 29, 2024

If there is no reply to the "question", then maybe this is a feature enhancement request?

This would be a great feature enhancement for large datasets. LightGBM is good at handling big datasets for training and validation with its C++ engine; keeping the same performance for the testing phase as well would be a big plus.

In my code, all is good until after the line "model = lgb.Booster(model_file='model.txt')"...
If we could directly use a LightGBM Dataset to predict from the model (model.predict(...)), that would solve the issue, as all the data would stay within the C++ engine and not be manipulated in Python.

@jameslamb
Collaborator

Thanks as always for your interest in LightGBM and for pushing the limits of what it can do with larger datasets and larger models.

As you've discovered, directly calling predict() on a LightGBM dataset isn't supported today. We already have these feature requests tracking it (in #2302):

The best way to get that functionality into LightGBM is to contribute it yourself. If that interests you, consider putting up a draft pull request and @-ing us for help on specific questions.

@jameslamb
Collaborator

Panda df pd.read_csv('train.csv')

If you have large enough data that it's a significant runtime + memory problem to load it, and you're using Python, consider storing it in a different format than a CSV file. CSV is a text format and pandas is going to be doing a ton of type-guessing and type-conversion while reading that.

For example, consider storing it as a dense NumPy array in the .npy file format (numpy docs) and then loading it back into a NumPy array.

Or store it in Parquet format and read that into pandas (to at least eliminate most of the type-conversion overhead of CSV).
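
The .npy suggestion above can be sketched roughly as follows. This is only an illustration, not code from the thread: the small `features` array and the `features.npy` file name are made up, standing in for a matrix that was parsed from CSV once. The point is that the binary round trip needs no text parsing or type guessing, and NaN and infinity values come back unchanged.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for a feature matrix that was parsed from CSV once.
features = np.arange(12, dtype=np.float64).reshape(4, 3)
features[1, 2] = np.nan   # missing value
features[2, 0] = np.inf   # infinite value

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, "features.npy")

    # One-time conversion: write the array in NumPy's binary .npy format.
    np.save(npy_path, features)

    # Later runs: load the array directly, with dtype and shape preserved.
    loaded = np.load(npy_path)

print(loaded.shape, loaded.dtype)
```

The loaded array can then be passed to `model.predict(...)` like any other NumPy array. For files too large for memory, `np.load` also accepts a `mmap_mode` argument, which is not shown here.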

@jameslamb
Collaborator

stay within the C++ engine and not be manipulated in Python

LightGBM also supports predicting directly on a CSV file

void Predict(int start_iteration, int num_iteration, int predict_type, const char* data_filename,

Have you tried that?

You could do that with the lightgbm CLI or using Booster.predict() in the Python package. Booster.predict() accepts a path to a CSV/TSV/LibSVM formatted file.

https://github.com/microsoft/LightGBM/blob/252828fd86627d7405021c3377534d6a8239dd69/python-package/lightgbm/basic.py#L1073-1075

@StrikerRUS
Collaborator

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.
