Intel® Extension for Scikit-learn RandomForestClassifier for the Rain in Australia dataset

The goal is to predict whether it will rain the next day.

[1]:
import pandas as pd
from timeit import default_timer as timer
from IPython.display import HTML
import warnings

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

warnings.filterwarnings('ignore')

Download the data

[2]:
data = fetch_openml(data_id=46315, as_frame=True)
df = data.frame
df.head()
[2]:
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W ... 71.0 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No No
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW ... 44.0 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No No
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W ... 38.0 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No No
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE ... 45.0 16.0 1017.6 1012.8 NaN NaN 18.1 26.5 No No
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE ... 82.0 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7 No No

5 rows × 23 columns

Explore the data

[3]:
# Show the dimensions of the dataset
df.shape
[3]:
(145460, 23)
[4]:
# Show the summary of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Date           145460 non-null  object
 1   Location       145460 non-null  object
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object
 10  WindDir3pm     141232 non-null  object
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null   float64
 18  Cloud3pm       86102 non-null   float64
 19  Temp9am        143693 non-null  float64
 20  Temp3pm        141851 non-null  float64
 21  RainToday      142199 non-null  object
 22  RainTomorrow   142193 non-null  object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB
[5]:
# Check the missing values and the percentage of missing values in each column
missing_values = df.isnull().sum()
missing_values_percentage = missing_values / df.shape[0] * 100
missing_values_df = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage (%)': missing_values_percentage
})
missing_values_df
[5]:
Missing Values Percentage (%)
Date 0 0.000000
Location 0 0.000000
MinTemp 1485 1.020899
MaxTemp 1261 0.866905
Rainfall 3261 2.241853
Evaporation 62790 43.166506
Sunshine 69835 48.009762
WindGustDir 10326 7.098859
WindGustSpeed 10263 7.055548
WindDir9am 10566 7.263853
WindDir3pm 4228 2.906641
WindSpeed9am 1767 1.214767
WindSpeed3pm 3062 2.105046
Humidity9am 2654 1.824557
Humidity3pm 4507 3.098446
Pressure9am 15065 10.356799
Pressure3pm 15028 10.331363
Cloud9am 55888 38.421559
Cloud3pm 59358 40.807095
Temp9am 1767 1.214767
Temp3pm 3609 2.481094
RainToday 3261 2.241853
RainTomorrow 3267 2.245978

Preprocessing

[6]:
# Drop columns with more than 30% missing values
df = df.dropna(thresh=df.shape[0]*0.7, axis=1)
df.shape
[6]:
(145460, 19)
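For reference, the thresh argument keeps only columns with at least 70% non-null entries, which is the same as dropping columns with more than 30% missing values. An equivalent, more explicit version reusing the missing_values_percentage Series computed above could look like this (a sketch, not executed in this notebook):

# Drop every column whose share of missing values exceeds 30%
cols_to_drop = missing_values_percentage[missing_values_percentage > 30].index
df = df.drop(columns=cols_to_drop)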
[7]:
# Drop rows with missing target value
df = df.dropna(subset=['RainTomorrow'])
df.shape
[7]:
(142193, 19)
[8]:
# Encode the target variable
df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1})
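Since accuracy is used as the evaluation metric later, it is worth knowing how imbalanced the target is; in this dataset the 'No' class is the clear majority. A quick check, shown here as a sketch and not executed above:

# Share of 'No' (0) vs. 'Yes' (1) days in the target
df['RainTomorrow'].value_counts(normalize=True)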
[9]:
# Split the Date column into Year, Month, and Day
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year.astype('int64')
df['Month'] = df['Date'].dt.month.astype('int64')
df['Day'] = df['Date'].dt.day.astype('int64')

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 142193 entries, 0 to 145458
Data columns (total 22 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Date           142193 non-null  datetime64[ns]
 1   Location       142193 non-null  object
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   WindGustDir    132863 non-null  object
 6   WindGustSpeed  132923 non-null  float64
 7   WindDir9am     132180 non-null  object
 8   WindDir3pm     138415 non-null  object
 9   WindSpeed9am   140845 non-null  float64
 10  WindSpeed3pm   139563 non-null  float64
 11  Humidity9am    140419 non-null  float64
 12  Humidity3pm    138583 non-null  float64
 13  Pressure9am    128179 non-null  float64
 14  Pressure3pm    128212 non-null  float64
 15  Temp9am        141289 non-null  float64
 16  Temp3pm        139467 non-null  float64
 17  RainToday      140787 non-null  object
 18  RainTomorrow   142193 non-null  int64
 19  Year           142193 non-null  int64
 20  Month          142193 non-null  int64
 21  Day            142193 non-null  int64
dtypes: datetime64[ns](1), float64(12), int64(4), object(5)
memory usage: 25.0+ MB
[10]:
# Define the features and the target
X = df.drop(columns=['RainTomorrow', 'Date'])
y = df['RainTomorrow']
[11]:
# Identify the numerical and categorical columns
num_columns = X.select_dtypes(include=['int64', 'float64']).columns
cat_columns = X.select_dtypes(include=['object']).columns

print(f'Numerical Columns: {list(num_columns)}')
print(f'Categorical Columns: {list(cat_columns)}')

Numerical Columns: ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm', 'Year', 'Month', 'Day']
Categorical Columns: ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
[12]:
# Preprocess the numerical features

# Impute missing values with the mean
imputer_num = SimpleImputer(strategy='mean')
X[num_columns] = imputer_num.fit_transform(X[num_columns])

# Scale the numerical columns
scaler = StandardScaler()
X[num_columns] = scaler.fit_transform(X[num_columns])
[13]:
# Preprocess the categorical features

# Impute missing values with the most frequent value
imputer_cat = SimpleImputer(strategy='most_frequent')
X[cat_columns] = imputer_cat.fit_transform(X[cat_columns])

# Label encode the categorical columns
encoder = LabelEncoder()
for col in cat_columns:
    X[col] = encoder.fit_transform(X[col])
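Note that a single LabelEncoder instance is re-fitted on every column, so after the loop it only retains the mapping of the last column. If the per-column mappings are needed later (for example to encode new data), one option is to keep one encoder per column; a minimal sketch:

# Keep a fitted encoder for each categorical column
encoders = {}
for col in cat_columns:
    encoders[col] = LabelEncoder()
    X[col] = encoders[col].fit_transform(X[col])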
[14]:
# Re-inspect df for reference; note that the imputation, scaling, and encoding above
# were applied to X, so df itself still contains missing values and object columns
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 142193 entries, 0 to 145458
Data columns (total 22 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Date           142193 non-null  datetime64[ns]
 1   Location       142193 non-null  object
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   WindGustDir    132863 non-null  object
 6   WindGustSpeed  132923 non-null  float64
 7   WindDir9am     132180 non-null  object
 8   WindDir3pm     138415 non-null  object
 9   WindSpeed9am   140845 non-null  float64
 10  WindSpeed3pm   139563 non-null  float64
 11  Humidity9am    140419 non-null  float64
 12  Humidity3pm    138583 non-null  float64
 13  Pressure9am    128179 non-null  float64
 14  Pressure3pm    128212 non-null  float64
 15  Temp9am        141289 non-null  float64
 16  Temp3pm        139467 non-null  float64
 17  RainToday      140787 non-null  object
 18  RainTomorrow   142193 non-null  int64
 19  Year           142193 non-null  int64
 20  Month          142193 non-null  int64
 21  Day            142193 non-null  int64
dtypes: datetime64[ns](1), float64(12), int64(4), object(5)
memory usage: 25.0+ MB
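To verify the feature matrix itself, a direct check on X could look like this (a sketch, not executed above):

# Confirm the feature matrix has no missing values and only numeric dtypes
print(X.isnull().sum().sum())   # expected: 0
print(X.dtypes.value_counts())  # expected: only float64 / int64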
[15]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
[15]:
((113754, 20), (28439, 20), (113754,), (28439,))
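One caveat: the imputers and scaler above were fitted on the full dataset before splitting, so statistics from the test rows leak into the preprocessing. For a stricter evaluation, the transformers can be fitted on the training split only, for example with a ColumnTransformer. A sketch under the assumption that the raw, unprocessed features are split first (variable names are illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Fit all preprocessing on the training split only, then apply it to the test split
num_pipe = Pipeline([('impute', SimpleImputer(strategy='mean')),
                     ('scale', StandardScaler())])
cat_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                     ('encode', OrdinalEncoder())])
preprocessor = ColumnTransformer([('num', num_pipe, list(num_columns)),
                                  ('cat', cat_pipe, list(cat_columns))])
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)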

Patch original Scikit-learn with Intel® Extension for Scikit-learn

Intel® Extension for Scikit-learn (previously known as daal4py) contains drop-in replacement functionality for the stock Scikit-learn package. You can take advantage of the performance optimizations of Intel® Extension for Scikit-learn by adding just two lines of code before the usual Scikit-learn imports:

[16]:
from sklearnex import patch_sklearn
patch_sklearn()
Intel(R) Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)
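patch_sklearn can also be restricted to specific estimators, and the scikit-learn-intelex documentation describes enabling patching without editing a script by running it through the sklearnex module. Both options, shown as a sketch:

from sklearnex import patch_sklearn
patch_sklearn(["RandomForestClassifier"])  # patch only the listed estimators

# Alternatively, from the command line (no code changes needed):
#   python -m sklearnex my_script.py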

Training of the RandomForestClassifier with Intel® Extension for Scikit-learn for the Rain in Australia dataset

[17]:
from sklearn.ensemble import RandomForestClassifier

params = {
    'n_estimators': 1000,
    'criterion': 'gini',
    'max_features': 'sqrt',
    'n_jobs': -1
}
start = timer()
patched_model = RandomForestClassifier(**params).fit(X_train, y_train)
patched_train_time = timer() - start

print(f"Intel® extension for Scikit-learn Training Time: {patched_train_time:.3f} seconds")
Intel® extension for Scikit-learn Training Time: 9.439 seconds
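To confirm which calls are actually dispatched to the accelerated implementation, scikit-learn-intelex exposes a verbose mode through the standard logging module (per its documentation; shown as a sketch):

import logging
logging.getLogger('sklearnex').setLevel(logging.INFO)
# Subsequent fit/predict calls then log whether the accelerated or the
# stock scikit-learn implementation was used.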

Predict and get a result of the RandomForestClassifier algorithm with Intel® Extension for Scikit-learn

[18]:
patched_y_pred = patched_model.predict(X_test)
patched_accuracy = accuracy_score(y_test, patched_y_pred)

print(f"Intel® extension for Scikit-learn Accuracy: {patched_accuracy:.4f}")
Intel® extension for Scikit-learn Accuracy: 0.8551
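Accuracy alone can be optimistic here because 'No rain' days are the majority class; per-class precision and recall give a fuller picture. A sketch, not executed above:

from sklearn.metrics import classification_report
print(classification_report(y_test, patched_y_pred, target_names=['No rain', 'Rain']))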

Train the same algorithm with original Scikit-learn

To undo the optimizations, we call unpatch_sklearn and re-import the RainForestClassifier-hosting class RandomForestClassifier from stock Scikit-learn.

[19]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

Training of the RandomForestClassifier with original Scikit-learn for the Rain in Australia dataset

[20]:
from sklearn.ensemble import RandomForestClassifier

start = timer()
ori_model = RandomForestClassifier(**params).fit(X_train, y_train)
ori_train_time = timer() - start

print(f"Original Scikit-learn Training Time: {ori_train_time:.3f} seconds")
Original Scikit-learn Training Time: 47.955 seconds

Predict and get a result of the RandomForestClassifier algorithm with original Scikit-learn

[21]:
ori_y_pred = ori_model.predict(X_test)
ori_accuracy = accuracy_score(y_test, ori_y_pred)

print(f"Original Scikit-learn Accuracy: {ori_accuracy:.4f}")
Original Scikit-learn Accuracy: 0.8549

Comparison

[22]:
compare_df = pd.DataFrame({
    'Original': [ori_accuracy, ori_train_time],
    'Patched': [patched_accuracy, patched_train_time]
}, index=['Accuracy', 'Training Time (s)'])

for col in compare_df.columns:
    compare_df[col] = compare_df[col].round(4)

# Calculate the relative change in percent (for training time, a negative value means the patched run is faster)
compare_df['Improvement (%)'] = (compare_df['Patched'] - compare_df['Original']) / compare_df['Original'] * 100
compare_df['Improvement (%)'] = compare_df['Improvement (%)'].round(2)

compare_df
[22]:
Original Patched Improvement (%)
Accuracy 0.8549 0.8551 0.02
Training Time (s) 47.9546 9.4395 -80.32
[23]:
HTML(
    f"<h3>Compare Accuracy of patched Scikit-learn and original</h3>"
    f"Accuracy of patched Scikit-learn: {patched_accuracy} <br>"
    f"Accuracy of unpatched Scikit-learn: {ori_accuracy} <br>"
    f"Metrics ratio: {patched_accuracy/ori_accuracy} <br>"
    f"<h3>With Scikit-learn-intelex patching you can:</h3>"
    f"<ul>"
    f"<li>Use your Scikit-learn code for training and prediction with minimal changes (a couple of lines of code);</li>"
    f"<li>Get comparable model quality</li>"
    f"<li>Get a <strong>{(ori_train_time/patched_train_time):.1f}x</strong> speedup.</li>"
    f"</ul>"
)
[23]:

Compare Accuracy of patched Scikit-learn and original

Accuracy of patched Scikit-learn: 0.8550933577129998
Accuracy of unpatched Scikit-learn: 0.8548823798305144
Metrics ratio: 1.0002467917077986

With Scikit-learn-intelex patching you can:

  • Use your Scikit-learn code for training and prediction with minimal changes (a couple of lines of code);
  • Get comparable model quality;
  • Get a 5.1x speedup.