Distributed LightGBM on Ray
LightGBM-Ray is a distributed backend for LightGBM, built on top of distributed computing framework Ray.
LightGBM-Ray
- enables multi-node and multi-GPU training
- integrates seamlessly with distributed hyperparameter optimization library Ray Tune
- comes with fault tolerance handling mechanisms, and
- supports distributed dataframes and distributed data loading
All releases are tested on large clusters and workloads.
This package is based on XGBoost-Ray. As of now, XGBoost-Ray is a dependency for LightGBM-Ray.
Installation
You can install the latest LightGBM-Ray release from PIP:
pip install "lightgbm_ray"
If you'd like to install the latest master, use this command instead:
pip install "git+https://github.com/ray-project/lightgbm_ray.git#egg=lightgbm_ray"
Usage
LightGBM-Ray provides a drop-in replacement for LightGBM's train
function. To pass data, a RayDMatrix object is required, common
with XGBoost-Ray. You can also use a scikit-learn
interface - see next section.
Just as in original lgbm.train() function, the
training parameters
are passed as the params dictionary.
Ray-specific distributed training parameters are configured with a
lightgbm_ray.RayParams object. For instance, you can set
the num_actors property to specify how many distributed actors
you would like to use.
Here is a simplified example (which requires sklearn):
Training:
from lightgbm_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer
train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)
evals_result = {}
bst = train(
{
"objective": "binary",
"metric": ["binary_logloss", "binary_error"],
},
train_set,
evals_result=evals_result,
valid_sets=[train_set],
valid_names=["train"],
verbose_eval=False,
ray_params=RayParams(num_actors=2, cpus_per_actor=2))
bst.booster_.save_model("model.lgbm")
print("Final training error: {:.4f}".format(
evals_result["train"]["binary_error"][-1]))
Prediction:
from lightgbm_ray import RayDMatrix, RayParams, predict
from sklearn.datasets import load_breast_cancer
import lightgbm as lgbm
data, labels = load_breast_cancer(return_X_y=True)
dpred = RayDMatrix(data, labels)
bst = lgbm.Booster(model_file="model.lgbm")
pred_ray = predict(bst, dpred, ray_params=RayParams(num_actors=2))
print(pred_ray)
scikit-learn API
LightGBM-Ray also features a scikit-learn API fully mirroring pure LightGBM scikit-learn API, providing a completely drop-in replacement. The following estimators are available:
RayLGBMClassifierRayLGBMRegressor
Example usage of RayLGBMClassifier:
from lightgbm_ray import RayLGBMClassifier, RayParams
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
seed = 42
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=0.25, random_state=42)
clf = RayLGBMClassifier(
n_jobs=2, # In LightGBM-Ray, n_jobs sets the number of actors
random_state=seed)
# scikit-learn API will automatically convert the data
# to RayDMatrix format as needed.
# You can also pass X as a RayDMatrix, in which case
# y will be ignored.
clf.fit(X_train, y_train)
pred_ray = clf.predict(X_test)
print(pred_ray)
pred_proba_ray = clf.predict_proba(X_test)
print(pred_proba_ray)
# It is also possible to pass a RayParams object
# to fit/predict/predict_proba methods - will override
# n_jobs set during initialization
clf.fit(X_train, y_train, ray_params=RayParams(num_actors=2))
pred_ray = clf.predict(X_test, ray_params=RayParams(num_actors=2))
print(pred_ray)
- For more details on the paramter, Refer : Ray Documentation