Intel® Extension for Scikit-learn Ridge Regression for New York City Bike Share dataset
[1]:
import pandas as pd
from timeit import default_timer as timer
from sklearn import metrics
from sklearn.model_selection import train_test_split
import warnings
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import LabelEncoder
from IPython.display import HTML
warnings.filterwarnings("ignore")
Download the data
[2]:
dataset = fetch_openml(data_id=43526, as_frame=True)
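fetch_openml with as_frame=True returns a Bunch whose frame attribute holds the data as a pandas DataFrame. Before preprocessing, a quick sanity check of what was downloaded can be useful (a minimal sketch; the exact output depends on the OpenML copy of the dataset):
# Inspect the downloaded Bunch: shape and column dtypes of the DataFrame
print(dataset.frame.shape)
print(dataset.frame.dtypes)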
Preprocessing
Let’s extract date and time features from the timestamps and encode the categorical features with LabelEncoder
[3]:
# Access the data as a DataFrame
data = dataset.frame
# Convert date columns to datetime
data['Start_Time'] = pd.to_datetime(data['Start_Time'])
data['Stop_Time'] = pd.to_datetime(data['Stop_Time'])
# Extract useful features from datetime columns
data['Start_Year'] = data['Start_Time'].dt.year
data['Start_Month'] = data['Start_Time'].dt.month
data['Start_Day'] = data['Start_Time'].dt.day
data['Start_Hour'] = data['Start_Time'].dt.hour
data['Stop_Year'] = data['Stop_Time'].dt.year
data['Stop_Month'] = data['Stop_Time'].dt.month
data['Stop_Day'] = data['Stop_Time'].dt.day
data['Stop_Hour'] = data['Stop_Time'].dt.hour
# Drop the original datetime columns
data = data.drop(columns=['Start_Time', 'Stop_Time'])
# Encode categorical variables
for col in ['Start_Station_Name', 'End_Station_Name', 'Gender', 'User_Type']:
    le = LabelEncoder().fit(data[col])
    data[col] = le.transform(data[col])
# Set the target variable
data['target'] = data['Trip_Duration']
# Separate features and target
x = data.drop(columns=['target', 'Trip_Duration'])
y = data['target']
[4]:
# Ensure x and y are defined and not None
if x is not None and y is not None:
    # Gender and User_Type were already label-encoded above; re-encoding
    # already-integer columns is a harmless no-op kept as a defensive check.
    for col in ['User_Type', 'Gender']:
        if col in x.columns:
            le = LabelEncoder().fit(x[col])
            x[col] = le.transform(x[col])
        else:
            print(f"Column {col} does not exist in the DataFrame.")
    # Split the data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=0)
    print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
else:
    print("x or y is None. Please check your data.")
(661951, 22) (73551, 22) (661951,) (73551,)
[5]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler_x = MinMaxScaler()
scaler_y = StandardScaler()
[6]:
y_train = y_train.to_numpy().reshape(-1, 1)
y_test = y_test.to_numpy().reshape(-1, 1)
scaler_x.fit(x_train)
x_train = scaler_x.transform(x_train)
x_test = scaler_x.transform(x_test)
scaler_y.fit(y_train)
y_train = scaler_y.transform(y_train).ravel()
y_test = scaler_y.transform(y_test).ravel()
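Note that the target is standardized, so the MSE values reported below are in scaled units rather than in the original trip-duration units. To report errors or predictions in the original units, the fitted scaler can be inverted; a minimal sketch using the scaler_y and y_test defined above (the same call applies to model predictions such as model.predict(x_test)):
# Undo the StandardScaler to recover the target in its original units
y_test_original = scaler_y.inverse_transform(y_test.reshape(-1, 1)).ravel()
print(y_test_original[:5])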
Patch original Scikit-learn with Intel® Extension for Scikit-learn
Intel® Extension for Scikit-learn (previously known as daal4py) contains drop-in replacement functionality for the stock Scikit-learn package. You can take advantage of the performance optimizations of Intel® Extension for Scikit-learn by adding just two lines of code before the usual Scikit-learn imports:
[7]:
from sklearnex import patch_sklearn
patch_sklearn()
Intel(R) Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)
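patch_sklearn() replaces all supported stock estimators globally. If a narrower scope is preferred, the sklearnex documentation also describes patching only selected estimators by name, or running an unmodified script with the patch applied from the command line; a short sketch (my_script.py is a placeholder):
from sklearnex import patch_sklearn

# Patch only the estimators you need, by class name:
patch_sklearn(["Ridge"])

# Alternatively, apply the patch to an unmodified script from the shell:
#   python -m sklearnex my_script.py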
[8]:
from sklearn.linear_model import Ridge
params = {
    "alpha": 0.3,
    "fit_intercept": False,
    "random_state": 0,
    "copy_X": False,
}
start = timer()
model = Ridge(**params).fit(x_train, y_train)
train_patched = timer() - start
f"Intel® extension for Scikit-learn time: {train_patched:.2f} s"
[8]:
'Intel® Extension for Scikit-learn time: 0.04 s'
[9]:
y_predict = model.predict(x_test)
mse_metric_opt = metrics.mean_squared_error(y_test, y_predict)
f"Patched Scikit-learn MSE: {mse_metric_opt}"
[9]:
'Patched Scikit-learn MSE: 0.29078674972552815'
Train the same algorithm with original Scikit-learn
To cancel the optimizations, we use unpatch_sklearn and re-import the Ridge class
[10]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()
[11]:
from sklearn.linear_model import Ridge
start = timer()
model = Ridge(**params).fit(x_train, y_train)
train_unpatched = timer() - start
f"Original Scikit-learn time: {train_unpatched:.2f} s"
[11]:
'Original Scikit-learn time: 0.19 s'
[12]:
y_predict = model.predict(x_test)
mse_metric_original = metrics.mean_squared_error(y_test, y_predict)
f"Original Scikit-learn MSE: {mse_metric_original}"
[12]:
'Original Scikit-learn MSE: 0.29078674972650354'
[13]:
HTML(
    f"<h3>Compare MSE metric of patched Scikit-learn and original</h3>"
    f"MSE metric of patched Scikit-learn: {mse_metric_opt} <br>"
    f"MSE metric of unpatched Scikit-learn: {mse_metric_original} <br>"
    f"Metrics ratio: {mse_metric_opt/mse_metric_original} <br>"
    f"<h3>With Scikit-learn-intelex patching you can:</h3>"
    f"<ul>"
    f"<li>Use your Scikit-learn code for training and prediction with minimal changes (a couple of lines of code);</li>"
    f"<li>Get comparable model quality;</li>"
    f"<li>Get a <strong>{(train_unpatched/train_patched):.1f}x</strong> speedup.</li>"
    f"</ul>"
)
[13]:
Compare MSE metric of patched Scikit-learn and original
MSE metric of patched Scikit-learn: 0.29078674972552815
MSE metric of unpatched Scikit-learn: 0.29078674972650354
Metrics ratio: 0.9999999999966457
With Scikit-learn-intelex patching you can:
- Use your Scikit-learn code for training and prediction with minimal changes (a couple of lines of code);
- Get comparable model quality;
- Get a 4.7x speedup.