Distributed LightGBM on Ray

LightGBM-Ray is a distributed backend for LightGBM, built on top of distributed computing framework Ray.

LightGBM-Ray

enables multi-node and multi-GPU training
integrates seamlessly with distributed hyperparameter optimization library Ray Tune
comes with fault tolerance handling mechanisms, and
supports distributed dataframes and distributed data loading

All releases are tested on large clusters and workloads.

This package is based on XGBoost-Ray. As of now, XGBoost-Ray is a dependency for LightGBM-Ray.

Installation

You can install the latest LightGBM-Ray release from PIP:

pip install "lightgbm_ray"

If you'd like to install the latest master, use this command instead:

pip install "git+https://github.com/ray-project/lightgbm_ray.git#egg=lightgbm_ray"

Usage

LightGBM-Ray provides a drop-in replacement for LightGBM's train function. To pass data, a RayDMatrix object is required, common with XGBoost-Ray. You can also use a scikit-learn interface - see next section.

Just as in original lgbm.train() function, the training parameters are passed as the params dictionary.

Ray-specific distributed training parameters are configured with a lightgbm_ray.RayParams object. For instance, you can set the num_actors property to specify how many distributed actors you would like to use.

Here is a simplified example (which requires sklearn):

Training:

from lightgbm_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)

evals_result = {}
bst = train(
    {
        "objective": "binary",
        "metric": ["binary_logloss", "binary_error"],
    },
    train_set,
    evals_result=evals_result,
    valid_sets=[train_set],
    valid_names=["train"],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2))

bst.booster_.save_model("model.lgbm")
print("Final training error: {:.4f}".format(
    evals_result["train"]["binary_error"][-1]))

Prediction:

from lightgbm_ray import RayDMatrix, RayParams, predict
from sklearn.datasets import load_breast_cancer
import lightgbm as lgbm

data, labels = load_breast_cancer(return_X_y=True)

dpred = RayDMatrix(data, labels)

bst = lgbm.Booster(model_file="model.lgbm")
pred_ray = predict(bst, dpred, ray_params=RayParams(num_actors=2))

print(pred_ray)

scikit-learn API

LightGBM-Ray also features a scikit-learn API fully mirroring pure LightGBM scikit-learn API, providing a completely drop-in replacement. The following estimators are available:

RayLGBMClassifier
RayLGBMRegressor

Example usage of RayLGBMClassifier:

from lightgbm_ray import RayLGBMClassifier, RayParams
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

seed = 42

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.25, random_state=42)

clf = RayLGBMClassifier(
    n_jobs=2,  # In LightGBM-Ray, n_jobs sets the number of actors
    random_state=seed)

# scikit-learn API will automatically convert the data
# to RayDMatrix format as needed.
# You can also pass X as a RayDMatrix, in which case
# y will be ignored.

clf.fit(X_train, y_train)

pred_ray = clf.predict(X_test)
print(pred_ray)

pred_proba_ray = clf.predict_proba(X_test)
print(pred_proba_ray)

# It is also possible to pass a RayParams object
# to fit/predict/predict_proba methods - will override
# n_jobs set during initialization

clf.fit(X_train, y_train, ray_params=RayParams(num_actors=2))

pred_ray = clf.predict(X_test, ray_params=RayParams(num_actors=2))
print(pred_ray)

For more details on the paramter, Refer : Ray Documentation