Additional parameters in estimators and functions

For the most part, estimators and functions in the Extension for Scikit-learn* that have an analog in scikit-learn offer the same signatures for class constructors and functions. There are a few exceptions where the classes and functions from the Extension for Scikit-learn* accept additional parameters. These parameters are available both under patching and when importing the classes and functions from the sklearnex module.

The cases with additional parameters are listed below:

Parameter n_jobs

All estimators from the Extension for Scikit-learn* accept an n_jobs parameter to control parallelism, even if their analog in scikit-learn doesn’t. See Parallelism Specifics for more details.
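As a minimal sketch of the point above, the snippet below passes n_jobs to KMeans imported from sklearnex, an estimator whose analog in recent scikit-learn releases has no n_jobs parameter. The choice of KMeans, the synthetic data, and the helper name fit_kmeans_with_n_jobs are assumptions made for illustration. The helper returns None when sklearnex cannot be imported.

```python
def fit_kmeans_with_n_jobs(n_jobs=2):
    """Fit KMeans from sklearnex with an explicit n_jobs.

    Returns the fitted model, or None when sklearnex cannot be imported.
    """
    try:
        from sklearnex.cluster import KMeans
    except ImportError:
        return None  # extension not installed; nothing to demonstrate

    import numpy as np

    # Small synthetic data set (hypothetical, for illustration only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))

    # n_jobs is accepted here even though scikit-learn's KMeans lacks it.
    model = KMeans(n_clusters=3, n_jobs=n_jobs, random_state=0)
    return model.fit(X)


kmeans = fit_kmeans_with_n_jobs(n_jobs=2)
if kmeans is not None:
    print(kmeans.cluster_centers_.shape)
```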

Random Forests

Random Forest models (including their “Extremely Randomized” variants) accelerated with Extension for Scikit-learn* use histogram-based algorithms for splitting subsamples of data, which differs a bit from the sorting-based splitting logic used in the same classes from scikit-learn. The following keyword arguments can be used to control how histograms are created:

| Keyword argument | Possible values | Default value | Description |
| --- | --- | --- | --- |
| max_bins | \([2, \infty)\) | \(256\) | Number of bins in the histogram with the discretized training data. |
| min_bin_size | \([1, \infty)\) | \(5\) | Minimum number of training data points in each bin after discretization. |

Note that using discretized training data can greatly reduce model training times, especially for larger data sets. However, due to the reduced fidelity of the data, the resulting model can score worse on performance metrics than a model trained on the original data. In such cases, the number of bins can be increased with the max_bins parameter.

This parameter is available in the following classes:

Train-test splitting

The function sklearn.model_selection.train_test_split accepts an additional keyword-only argument, rng, which selects the algorithm used for random number generation by its name, passed as a string. The default value is "OPTIMIZED_MT19937".

This parameter is only used when passing shuffle=True and stratify=None.

Under those conditions, if the mkl_random package is installed and rng is something other than "OPTIMIZED_MT19937" or "default", mkl_random is used to generate random numbers for the splits, and the value of rng is forwarded to mkl_random as the brng argument. See the mkl_random documentation for details about which values are allowed.

Otherwise, with rng="OPTIMIZED_MT19937" (the default), random numbers are generated by the oneAPI Data Analytics Library using an optimized version of the MT19937 algorithm (the same algorithm that NumPy offers, for example). With rng="default", splitters from scikit-learn are used to generate random indices.
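A minimal sketch of the two built-in rng choices described above. It assumes that train_test_split can be imported from sklearnex.model_selection, mirroring the scikit-learn module layout. The helper name split_with_rng and the toy arrays are hypothetical. The helper returns None when sklearnex cannot be imported.

```python
def split_with_rng(rng_name):
    """Split toy data using the given rng name.

    Returns (n_train, n_test), or None when sklearnex is unavailable.
    """
    try:
        # Assumed import path, mirroring sklearn.model_selection.
        from sklearnex.model_selection import train_test_split
    except ImportError:
        return None  # extension not installed; nothing to demonstrate

    import numpy as np

    X = np.arange(200, dtype=np.float64).reshape(100, 2)
    y = np.arange(100)

    # rng only takes effect with shuffle=True and stratify=None.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, shuffle=True, stratify=None,
        random_state=0, rng=rng_name,
    )
    return len(X_train), len(X_test)


# "OPTIMIZED_MT19937" is the default; "default" uses scikit-learn splitters.
sizes_optimized = split_with_rng("OPTIMIZED_MT19937")
sizes_sklearn = split_with_rng("default")
print(sizes_optimized, sizes_sklearn)
```

The two calls use different generators, so the same random_state is not expected to select the same rows in both splits. Only the split sizes should match.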