Contributor Reference

Adding an estimator

Estimator classes in the Extension for Scikit-learn* are wrappers over algorithms from the oneAPI Data Analytics Library. To help with adding a new estimator, an example class DummyEstimator is available in the library, along with code comments and tests which explain how it should work. Estimators span multiple files: C++ wrappers built with PyBind11, direct wrappers in the onedal/ module, scikit-learn-conformant wrappers over those in the sklearnex/ module, direct tests, configurations for general tests, and others.

Example estimator

The following files and folders might be of help when looking at how the example DummyEstimator works and what is needed of an estimator:

The following files might also require changes after adding a new estimator - look out for the “dummy” keyword:

Note

The library contains lots of classes with legacy code from previous designs that do not work in the same way as the DummyEstimator class, such as classes based off daal4py. New estimators should not try to mimic those, and should instead follow the design from DummyEstimator.

Tip

Another good reference example for how estimators should be implemented is sklearn.linear_model.LinearRegression from the sklearnex module.

For estimators that depend on functionality that is only exposed through daal4py, an internal wrapper akin to the files under onedal/ must first be created under daal4py/sklearn, and then imported in a corresponding class under onedal/. Note that new functionalities in the oneAPI Data Analytics Library are meant to be introduced through the oneAPI interface, so only legacy functionalities should ever need to go through this route.

Version compatibilities

OneDAL

The Extension for Scikit-learn* is intended to be backwards-compatible with different versions of the oneAPI Data Analytics Library, but not forwards-compatible except within a major release series. That is, it is meant to run with a version of the oneAPI Data Analytics Library that is lower than or equal to the version of the Extension for Scikit-learn*. For example, onedal==2025.0 + sklearnex==2025.0 and onedal==2025.0 + sklearnex==2025.2 should both work correctly, even though sklearnex==2025.2 might not expose the same functionalities with onedal==2025.0 as with onedal==2025.2.

This is achieved with conditional runtime checks of the library version, which determine whether a given class, function, or similar should be defined. The checks go through the provided function daal_check_version, which accepts a single tuple containing the major version number, the "P" string (other possibilities for this parameter are not used anymore), and the minor version multiplied by 100. So for example, if a given piece of code requires onedal>=2025.2, the function should be called as follows:

from daal4py.sklearn._utils import daal_check_version

if daal_check_version((2025, "P", 200)):
    ...  # code branch for onedal>=2025.2
else:
    ...  # code branch for onedal<2025.2

Hint

This helper is meant for usage in both source code and tests.
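To make the tuple encoding concrete, here is a minimal sketch that builds the argument for a given release. The name onedal_version_tuple is a hypothetical helper written for this illustration and is not part of the library; it only restates the rule above (major version, the "P" string, minor version multiplied by 100).

```python
def onedal_version_tuple(major: int, minor: int) -> tuple:
    """Build the tuple form described above: (major, "P", minor * 100).

    Hypothetical helper for illustration only; not part of the library.
    """
    return (major, "P", minor * 100)


# The requirement onedal>=2025.2 from the example above.
requirement = onedal_version_tuple(2025, 2)
print(requirement)
```

The resulting tuple is what would be passed to daal_check_version in the call shown earlier.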

In C++ code, the macro ONEDAL_VERSION should be checked at compile time for conditional code inclusions or exclusions. This macro holds a single integer formed by the major version, followed by the minor version using two digits, and the patch version using another two digits. For example, if a given piece of code requires onedal>=2025.2, the check would be as follows:

#if defined(ONEDAL_VERSION) && ONEDAL_VERSION >= 20250200
// code for newer version
#else
// code for older version
#endif
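The integer layout of ONEDAL_VERSION can be checked by hand with a few lines of arithmetic. The sketch below is written in Python only to illustrate the digit layout described above; encode_onedal_version is a hypothetical name and nothing here is provided by the library.

```python
def encode_onedal_version(major: int, minor: int, patch: int = 0) -> int:
    """Pack a version as <major><2-digit minor><2-digit patch>.

    Hypothetical helper for illustration only; it mirrors the layout
    described for the ONEDAL_VERSION macro.
    """
    if not (0 <= minor < 100 and 0 <= patch < 100):
        raise ValueError("minor and patch must each fit in two digits")
    return major * 10000 + minor * 100 + patch


# Threshold used in the preprocessor check for onedal>=2025.2.
threshold = encode_onedal_version(2025, 2)
print(threshold)
```

Because each component occupies a fixed number of digits, comparing two such integers orders versions the same way as comparing (major, minor, patch) component by component.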

Scikit-learn

The Extension for Scikit-learn* is intended to be compatible with multiple versions of scikit-learn. To achieve this, conditional runtime checks on the version of scikit-learn select different code paths for different versions. These checks go through the function sklearn_check_version, which accepts a string with the major and minor version as recognized by pip. For example, in order to have different code branches depending on sklearn>=1.7 (which would also trigger for sklearn==1.7.2, for example), the following can be used:

from daal4py.sklearn._utils import sklearn_check_version

if sklearn_check_version("1.7"):
    ...  # code branch for sklearn>=1.7
else:
    ...  # code branch for sklearn<1.7
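The "greater than or equal" semantics described above can be illustrated with a small standalone comparison. This is a sketch under simplifying assumptions, not the implementation of sklearn_check_version: version_at_least is a hypothetical name, and it only handles purely numeric dotted versions (no pre-release or development suffixes).

```python
def version_at_least(installed: str, required: str) -> bool:
    """Return True if `installed` is the same as or newer than `required`.

    Hypothetical illustration only: components are compared as integers,
    so "1.10" is correctly treated as newer than "1.7".
    """

    def parse(version: str) -> tuple:
        return tuple(int(part) for part in version.split("."))

    return parse(installed) >= parse(required)


print(version_at_least("1.7.2", "1.7"))  # a patch release of 1.7 satisfies >=1.7
print(version_at_least("1.6.1", "1.7"))  # an older minor release does not
```

Comparing integer components rather than raw strings matters because a plain string comparison would order "1.10" before "1.7".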

Test helpers

Note that not all estimators offer the same functionalities, and thus tests should be designed accordingly. The tests provide some custom marks, fixtures, and helpers that one might want to use for some cases:

  • @pytest.mark.allow_sklearn_fallback: will avoid having tests fail when they end up calling procedures from scikit-learn instead of from the oneAPI Data Analytics Library. This can be helpful for example when testing that some corner case falls back correctly when it should.

  • onedal.tests.utils._dataframes_support._as_numpy: this function can be used to convert an input array or data frame to NumPy, regardless of whether it lives on host or on device, and regardless of array API support.

  • pass_if_not_implemented_for_gpu: skips tests not implemented for GPU when GPU support is enabled. Requires a skip reason argument that matches the backend’s error message.

Tests with optional dependencies

Tests that require optional dependencies in order to execute should have conditional skip logic through usage of @pytest.mark.skipif. The test files are meant to be executable without the optional dependencies being installed, so those dependencies should be imported conditionally or in a try + except ImportError block.
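A minimal sketch of this pattern follows. It assumes pandas as a stand-in for an optional dependency, and the test name and body are hypothetical; only the try + except ImportError guard and the @pytest.mark.skipif mark are the point.

```python
import pytest

# Guarded import: the file must still be importable (and collectable by
# pytest) when the optional dependency is missing.
try:
    import pandas as pd

    HAS_PANDAS = True
except ImportError:
    pd = None
    HAS_PANDAS = False


@pytest.mark.skipif(not HAS_PANDAS, reason="pandas is not installed")
def test_dataframe_column_sum():
    # Hypothetical test body; it only runs when pandas could be imported.
    df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
    assert df["x"].sum() == 6.0
```

With this layout, the test is reported as skipped rather than failing at import time when the dependency is absent.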

SPMD tests

Tests that involve distributed mode functionalities should rely on pytest-mpi and need to be marked with @pytest.mark.mpi.

Running benchmarks

As this library aims to offer accelerated versions of algorithms, when adding or modifying estimators and related helper functions it is usually helpful - and in many cases required - to conduct benchmarks that assess the performance implications of the changes, whether against scikit-learn or against the current version of the Extension for Scikit-learn*.

Benchmarks are usually conducted through the scikit-learn_bench tool, which lives in a different repository. See the instructions in that repository for how to run the appropriate benchmarks.

Results from benchmarks are usually shared as a relative improvement over the baseline being compared against, which will be available in the sheets of the generated .xlsx comparison reports from that repository. Usually, the geometric mean is used as a final number, but changes for individual datasets and estimator methods are typically still of interest within a given pull request.
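As an illustration of how such a summary number can be computed, the sketch below takes the geometric mean of per-case speedups. The timings are made-up values chosen for the example, the case names are hypothetical, and this is not how the comparison reports themselves are generated.

```python
from statistics import geometric_mean

# Made-up timings in seconds, purely for illustration.
baseline = {"case_a": 2.0, "case_b": 8.0, "case_c": 1.0}
candidate = {"case_a": 1.0, "case_b": 2.0, "case_c": 1.0}

# Relative improvement per case: values above 1 mean the candidate is faster.
speedups = {name: baseline[name] / candidate[name] for name in baseline}

# Single summary number across all cases.
overall = geometric_mean(list(speedups.values()))

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
print(f"geometric mean: {overall:.2f}x")
```

The geometric mean is the natural average for ratios: a 2x speedup on one case and a 2x slowdown on another cancel out to 1x, which an arithmetic mean would not reflect.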

Building the documentation

The source code for the documentation being rendered here is available from the same repository as the library’s source code, and is hosted on GitHub Pages through automated deployments. The documentation is built with Sphinx, which pulls in some docstrings from the classes and functions in the library to render them.

Thus, building the documentation from source requires being able to import the library in the same Python environment that is building the documentation, in addition to having all of the Python packages used by the Sphinx build script, such as Sphinx itself and the Sphinx extensions used throughout these docs.

Building documentation locally

For development purposes, it’s helpful to build the docs locally to inspect them offline without deploying, based off the current version of the source code instead of a public release version. This can be done using the provided scripts in this repository.

Requirements

Being based off Sphinx, the scripts for building documentation require a Python environment with documentation-related packages installed. The locked requirements are available in the file requirements-doc.txt; note that in many cases specific versions of the dependencies might be needed. They can be installed from the root of the repository as follows:

pip install -r requirements-doc.txt

Tip

It’s advised to create a separate Python environment for building the docs due to the locked requirements and version conflicts with what’s used for the tests.

Instructions

With the necessary dependencies installed, the docs can then be built locally on Linux* by executing the following script from the root of the repository:

./doc/build-doc.sh

Note

The script accepts additional arguments and environment variables which are used for the versioned doc pages hosted on GitHub Pages. Those are not meant to be used for local development.

The script will copy over necessary files to the docs folder and make calls to Sphinx to build the docs as HTML. After that script is executed for the first time, if no new embedded notebooks / examples from .py files have been added, the docs can be built without the script using the provided Makefile:

cd doc
make clean
make html

Note

The docs can be built on Windows* using the file make.bat, but be aware that it will not render everything correctly if the commands from build-doc.sh that copy files haven’t been executed.