Array API support

The Array API specification defines a standard API for array manipulation libraries with a NumPy-like API. Extension for Scikit-learn does not require array-api-compat to be installed for functional support of the Array API standard. In the current implementation, functional support for the Array API follows the same rules as functional support for other array or DataFrame inputs: the precision of input and output data is not modified unless necessary. Any Array API input is converted to a host numpy.ndarray, and all internal data manipulation is done on that representation. DPNP’s ndarray and Data Parallel Control’s usm_ndarray have special handling requirements, which are described in the relevant section of this document. In all relevant cases, output values match the input data format.
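The conversion flow described above can be pictured with a minimal sketch. This is not the sklearnex implementation: get_namespace_or_none and column_means are hypothetical helpers, NumPy stands in for any Array API input, and the only assumption about the input is the standard __array_namespace__ method (available on NumPy arrays starting with NumPy 2.0).

```python
import numpy as np


def get_namespace_or_none(x):
    # Hypothetical helper: Array API-compliant arrays expose __array_namespace__().
    # NumPy arrays gained it in NumPy 2.0; on older versions this returns None.
    ns = getattr(x, "__array_namespace__", None)
    return ns() if callable(ns) else None


def column_means(x):
    # Hypothetical stand-in for an estimator's internal computation.
    xp = get_namespace_or_none(x)
    host = np.asarray(x)        # 1. convert the input to a host numpy.ndarray
    result = host.mean(axis=0)  # 2. compute on the host representation
    # 3. return the result in the caller's namespace when one is known
    return xp.asarray(result) if xp is not None else result


X = np.asarray([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
means = column_means(X)

# The float32 input precision is carried through to the output.
print(type(means).__name__, means.dtype, means)
```

The sketch only shows the round trip (input namespace, host computation, output in the input namespace); the special handling of dpnp and dpctl inputs is not modelled here.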

Note

Currently, only array-api-strict, dpctl, dpnp and numpy are known to work with sklearnex estimators.

Note

Stock Scikit-learn’s array API support requires array-api-compat to be installed.

Support for DPNP and DPCTL

The functional support of input data for sklearnex estimators is also extended to SYCL USM array types: DPNP’s ndarray and Data Parallel Control’s usm_ndarray. Both contain SYCL contexts, which sklearnex can use for device offloading.

Note

With the current support, DPNP ndarray and DPCTL usm_ndarray data may be copied and moved to and from the device inside sklearnex, which can affect memory utilization.

DPCTL or DPNP inputs do not require config_context(target_offload=device). sklearnex uses the SYCL context of the input usm_ndarray for device offloading.

Note

Using config_context(target_offload=device) with DPCTL or DPNP inputs overrides the SYCL context they contain and forces movement of the data to the targeted device.
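The precedence between the device carried by the input and an explicit target_offload can be summarised in a small sketch. resolve_device is a hypothetical helper written only for illustration and is not part of sklearnex. It assumes that "auto" stands for "no explicit target", that plain host arrays are treated as CPU data, and that devices can be compared as simple strings.

```python
def resolve_device(input_device=None, target_offload="auto"):
    """Hypothetical illustration of the offloading precedence described above.

    input_device: device carried by a dpnp/dpctl input, or None for host arrays.
    target_offload: value passed to config_context; "auto" is assumed to mean
    that no explicit target was requested.
    Returns (device used for computation, whether the data has to be moved).
    """
    source = input_device or "cpu"  # host arrays are treated as CPU data
    if target_offload != "auto":
        # An explicit target overrides the SYCL context of the input and
        # forces data movement when the two devices differ.
        return target_offload, source != target_offload
    # No explicit target: follow the device attached to the input, if any.
    return source, False


for args in [("gpu", "auto"), ("gpu", "cpu"), (None, "auto"), (None, "gpu")]:
    print(args, "->", resolve_device(*args))
```

Real SYCL devices are identified by queues and filter strings rather than bare names, so the string comparison here is a deliberate simplification.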

Support for Array API-compatible inputs

All patched estimators, metrics, tools and non-scikit-learn estimators functionally support the Array API standard. Extension for Scikit-learn preserves the input data format for all outputs. For all array inputs except the SYCL USM arrays (DPNP’s ndarray and Data Parallel Control’s usm_ndarray), computation runs on the CPU only, unless a config_context specifies an available GPU device.

Stock scikit-learn uses config_context(array_api_dispatch=True) to enable Array API support. If array_api_dispatch is enabled and the installed scikit-learn version supports the Array API, the original inputs are used when falling back to scikit-learn functionality.

Note

Data Parallel Control usm_ndarray or DPNP ndarray inputs use host numpy data copies when falling back to scikit-learn, since they are not Array API compliant.

Note

Functional support doesn’t guarantee that after the model is trained, fitted attributes that are arrays will also be from the same namespace as the training data.

Example usage

DPNP ndarrays

The following example demonstrates how to use dpnp arrays to run RandomForestRegressor on a GPU without config_context(array_api_dispatch=True):

# ==============================================================================
# Copyright 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# sklearnex RF example for GPU offloading with DPNP ndarray:
#    python ./random_forest_regressor_dpnp.py

import dpctl
import dpnp
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Import estimator via sklearnex's patch mechanism from sklearn
from sklearnex import patch_sklearn, sklearn_is_patched

patch_sklearn()

# Function that can validate current state of patching
sklearn_is_patched()

# Import estimator from the patched sklearn namespace.
from sklearn.ensemble import RandomForestRegressor

# Or just directly import estimator from sklearnex namespace.
from sklearnex.ensemble import RandomForestRegressor

# Create a GPU SyclQueue and place the data in dpnp arrays using that
# queue, so that the computation runs on the GPU.
queue = dpctl.SyclQueue("gpu")

X, y = make_regression(
    n_samples=1000, n_features=4, n_informative=2, random_state=0, shuffle=False
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

dpnp_X_train = dpnp.asarray(X_train, usm_type="device", sycl_queue=queue)
dpnp_y_train = dpnp.asarray(y_train, usm_type="device", sycl_queue=queue)
dpnp_X_test = dpnp.asarray(X_test, usm_type="device", sycl_queue=queue)

rf = RandomForestRegressor(max_depth=2, random_state=0).fit(dpnp_X_train, dpnp_y_train)

pred = rf.predict(dpnp_X_test)

print("Random Forest regression results:")
print("Ground truth (first 5 observations):\n{}".format(y_test[:5]))
print("Regression results (first 5 observations):\n{}".format(pred[:5]))
print("Are predicted results on GPU: {}".format(pred.sycl_device.is_gpu))

Note

Functional support doesn’t guarantee that after the model is trained, fitted attributes that are arrays will also be from the same namespace as the training data.

For example, if dpnp’s namespace was used for training, the fitted attributes will be on the CPU and in numpy.ndarray format.

DPCTL usm_ndarrays

The following example demonstrates how to use dpctl arrays to run RandomForestClassifier on a GPU without config_context(array_api_dispatch=True):

# ==============================================================================
# Copyright 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# sklearnex RF example for GPU offloading with DPCtl tensor:
#    python ./random_forest_classifier_dpctl_batch.py

import dpctl
import dpctl.tensor as dpt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from sklearnex.ensemble import RandomForestClassifier

# Make sure that all DPCtl tensors use the same device.
q = dpctl.SyclQueue("gpu")  # GPU

X, y = make_classification(
    n_samples=1000,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    random_state=0,
    shuffle=False,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

dpt_X_train = dpt.asarray(X_train, usm_type="device", sycl_queue=q)
dpt_y_train = dpt.asarray(y_train, usm_type="device", sycl_queue=q)
dpt_X_test = dpt.asarray(X_test, usm_type="device", sycl_queue=q)

rf = RandomForestClassifier(max_depth=2, random_state=0).fit(dpt_X_train, dpt_y_train)

pred = rf.predict(dpt_X_test)

print("Random Forest classification results:")
print("Ground truth (first 5 observations):\n{}".format(y_test[:5]))
print("Classification results (first 5 observations):\n{}".format(pred[:5]))
print("Are predicted results on GPU: {}".format(pred.sycl_device.is_gpu))

As in the previous example, if the dpctl Array API namespace was used for training, the fitted attributes will be on the CPU and in numpy.ndarray format.

Use of array-api-strict

The following example demonstrates how to use array-api-strict arrays to run DBSCAN:

# ==============================================================================
# Copyright 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import array_api_strict

from sklearnex import config_context, patch_sklearn

patch_sklearn()

from sklearn.cluster import DBSCAN

X = array_api_strict.asarray(
    [[1.0, 2.0], [2.0, 2.0], [2.0, 3.0], [8.0, 7.0], [8.0, 8.0], [25.0, 80.0]],
    dtype=array_api_strict.float32,
)

# This could be launched without `config_context(array_api_dispatch=True)`.
# For sklearnex, that context manager only guarantees that, in case of a
# fallback to stock scikit-learn, fitted attributes are from the same
# Array API namespace as the training data.
clustering = DBSCAN(eps=3, min_samples=2).fit(X)

print("Fitted labels:\n", clustering.labels_)
