Supported input types

Just like scikit-learn estimators, estimators from the Extension for Scikit-learn* are able to accept and work with different classes of input data, including:

In addition, Extension for Scikit-learn* also supports dpnp.ndarray arrays, which are particularly useful for GPU computations.

Stock scikit-learn estimators, depending on the version, might offer support for additional input types beyond this list, such as DataFrame and Series classes from other libraries like Polars.

Extension for Scikit-learn* currently does not offer accelerated routines for input types not listed here. When receiving an unsupported class, estimators will do one of the following: convert it to a supported class under some circumstances (e.g. PyArrow tables might get converted to NumPy arrays when passed to data validators from stock scikit-learn), throw an error (e.g. when passed a data format that is not recognized by scikit-learn), or fall back to stock scikit-learn to handle it (when array API is enabled and the input is unsupported).

Warning

In some cases data passed to estimators might be copied/duplicated during calls to methods such as fit/predict. The affected cases are listed below.

  • Non-contiguous NumPy array, i.e. an array whose strides are wider than one element across both rows and columns

  • For SciPy CSR matrix / array, index arrays are always copied. Note that sparse matrices in formats other than CSR will be converted to CSR, which implies more than just data copying.

  • Heterogeneous NumPy array

  • If a SYCL queue is provided for a device without float64 support but the data are float64, the data are copied with reduced precision.

  • If Array API is not enabled, data from GPU devices are always copied to the host device, and the result table (for applicable methods) is then copied back to the source device.
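
As a rough illustration of the non-contiguous and float64 cases above, the following sketch uses plain NumPy to check whether a 2-D array is strided on both axes and to prepare contiguous or reduced-precision copies ahead of time. The helper name strided_on_both_axes is hypothetical and not part of Extension for Scikit-learn*; it only mirrors the description in the list and does not reproduce the library's internal checks.

```python
import numpy as np


def strided_on_both_axes(X):
    """Hypothetical helper: True when a 2-D array has no unit-stride axis,
    i.e. the stride is wider than one element along rows and along columns."""
    return X.ndim == 2 and all(abs(s) > X.itemsize for s in X.strides)


# A C-contiguous float64 array: the stride along columns is one element.
base = np.arange(24, dtype=np.float64).reshape(4, 6)

# Taking every second column widens the stride on both axes.
col_slice = base[:, ::2]

# Taking every second row keeps a unit stride along columns.
row_slice = base[::2, :]

# An explicit contiguous copy, made once and under the caller's control.
packed = np.ascontiguousarray(col_slice)

# An explicit float32 copy, for devices assumed to lack float64 support.
X32 = base.astype(np.float32)

for name, arr in [("base", base), ("col_slice", col_slice),
                  ("row_slice", row_slice), ("packed", packed)]:
    print(name, arr.strides, strided_on_both_axes(arr))
```

Making such copies explicitly does not remove their cost, but it lets the caller decide when the copy and any loss of precision happen, rather than leaving it to each fit/predict call.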