Intel® Extension for Scikit-learn RandomForestClassifier for the Rain in Australia dataset

The goal is to predict whether it will rain the next day.

[1]:

import pandas as pd
from timeit import default_timer as timer
from IPython.display import HTML
import warnings

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

warnings.filterwarnings('ignore')

Download the data

[2]:

data = fetch_openml(data_id=46315, as_frame=True)
df = data.frame
df.head()

[2]:

	Date	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday	RainTomorrow
0	2008-12-01	Albury	13.4	22.9	0.6	NaN	NaN	W	44.0	W	...	71.0	22.0	1007.7	1007.1	8.0	NaN	16.9	21.8	No	No
1	2008-12-02	Albury	7.4	25.1	0.0	NaN	NaN	WNW	44.0	NNW	...	44.0	25.0	1010.6	1007.8	NaN	NaN	17.2	24.3	No	No
2	2008-12-03	Albury	12.9	25.7	0.0	NaN	NaN	WSW	46.0	W	...	38.0	30.0	1007.6	1008.7	NaN	2.0	21.0	23.2	No	No
3	2008-12-04	Albury	9.2	28.0	0.0	NaN	NaN	NE	24.0	SE	...	45.0	16.0	1017.6	1012.8	NaN	NaN	18.1	26.5	No	No
4	2008-12-05	Albury	17.5	32.3	1.0	NaN	NaN	W	41.0	ENE	...	82.0	33.0	1010.8	1006.0	7.0	8.0	17.8	29.7	No	No

5 rows × 23 columns

Explore the data

[3]:

# Show the dimensions of the dataset
df.shape

[3]:

(145460, 23)

[4]:

# Show the summary of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Date           145460 non-null  object
 1   Location       145460 non-null  object
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object
 10  WindDir3pm     141232 non-null  object
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null   float64
 18  Cloud3pm       86102 non-null   float64
 19  Temp9am        143693 non-null  float64
 20  Temp3pm        141851 non-null  float64
 21  RainToday      142199 non-null  object
 22  RainTomorrow   142193 non-null  object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB

[5]:

# Check the missing values and the percentage of missing values in each column
missing_values = df.isnull().sum()
missing_values_percentage = missing_values / df.shape[0] * 100
missing_values_df = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage (%)': missing_values_percentage
})
missing_values_df

[5]:

	Missing Values	Percentage (%)
Date	0	0.000000
Location	0	0.000000
MinTemp	1485	1.020899
MaxTemp	1261	0.866905
Rainfall	3261	2.241853
Evaporation	62790	43.166506
Sunshine	69835	48.009762
WindGustDir	10326	7.098859
WindGustSpeed	10263	7.055548
WindDir9am	10566	7.263853
WindDir3pm	4228	2.906641
WindSpeed9am	1767	1.214767
WindSpeed3pm	3062	2.105046
Humidity9am	2654	1.824557
Humidity3pm	4507	3.098446
Pressure9am	15065	10.356799
Pressure3pm	15028	10.331363
Cloud9am	55888	38.421559
Cloud3pm	59358	40.807095
Temp9am	1767	1.214767
Temp3pm	3609	2.481094
RainToday	3261	2.241853
RainTomorrow	3267	2.245978

Preprocessing

[6]:

# Drop columns with more than 30% missing values
df = df.dropna(thresh=df.shape[0]*0.7, axis=1)
df.shape

[6]:

(145460, 19)
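The `thresh` argument counts non-null values, so `thresh=df.shape[0]*0.7` keeps a column only if at least 70% of its rows are present. Here that removes Evaporation, Sunshine, Cloud9am, and Cloud3pm, taking the frame from 23 to 19 columns. A minimal sketch of the same rule on a hypothetical toy frame (the `toy`, `min_non_null`, and `kept` names are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame: column "b" is 75% missing, column "c" is 25% missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# Keep only columns with at least 70% non-null values (3 of 4 rows here)
min_non_null = int(np.ceil(len(toy) * 0.7))
kept = toy.dropna(thresh=min_non_null, axis=1)

print(list(kept.columns))  # ['a', 'c']
```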

[7]:

# Drop rows with missing target value
df = df.dropna(subset=['RainTomorrow'])
df.shape

[7]:

(142193, 19)

[8]:

# Encode the target variable
df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1})

[9]:

# Split the Date column into Year, Month, and Day
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year.astype('int64')
df['Month'] = df['Date'].dt.month.astype('int64')
df['Day'] = df['Date'].dt.day.astype('int64')

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 142193 entries, 0 to 145458
Data columns (total 22 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Date           142193 non-null  datetime64[ns]
 1   Location       142193 non-null  object
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   WindGustDir    132863 non-null  object
 6   WindGustSpeed  132923 non-null  float64
 7   WindDir9am     132180 non-null  object
 8   WindDir3pm     138415 non-null  object
 9   WindSpeed9am   140845 non-null  float64
 10  WindSpeed3pm   139563 non-null  float64
 11  Humidity9am    140419 non-null  float64
 12  Humidity3pm    138583 non-null  float64
 13  Pressure9am    128179 non-null  float64
 14  Pressure3pm    128212 non-null  float64
 15  Temp9am        141289 non-null  float64
 16  Temp3pm        139467 non-null  float64
 17  RainToday      140787 non-null  object
 18  RainTomorrow   142193 non-null  int64
 19  Year           142193 non-null  int64
 20  Month          142193 non-null  int64
 21  Day            142193 non-null  int64
dtypes: datetime64[ns](1), float64(12), int64(4), object(5)
memory usage: 25.0+ MB

[10]:

# Define the features and the target
X = df.drop(columns=['RainTomorrow', 'Date'])
y = df['RainTomorrow']

[11]:

# Identify the numerical and categorical columns
num_columns = X.select_dtypes(include=['int64', 'float64']).columns
cat_columns = X.select_dtypes(include=['object']).columns

print(f'Numerical Columns: {list(num_columns)}')
print(f'Categorical Columns: {list(cat_columns)}')

Numerical Columns: ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm', 'Year', 'Month', 'Day']
Categorical Columns: ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']

[12]:

# Preprocess the numerical features

# Impute missing values with the mean
imputer_num = SimpleImputer(strategy='mean')
X[num_columns] = imputer_num.fit_transform(X[num_columns])

# Scale the numerical columns
scaler = StandardScaler()
X[num_columns] = scaler.fit_transform(X[num_columns])

[13]:

# Preprocess the categorical features

# Impute missing values with the most frequent value
imputer_cat = SimpleImputer(strategy='most_frequent')
X[cat_columns] = imputer_cat.fit_transform(X[cat_columns])

# Label encode the categorical columns
encoder = LabelEncoder()
for col in cat_columns:
    X[col] = encoder.fit_transform(X[col])
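The two preprocessing cells above fit the imputers and the scaler on all rows before the train/test split, so statistics from the eventual test rows influence the training features. One possible alternative, sketched here on hypothetical toy data rather than the notebook's frame, is to wrap the same steps in a `ColumnTransformer` and fit it on the training split only. This is not the notebook's method: `OrdinalEncoder` is swapped in for `LabelEncoder` because it is the scikit-learn encoder meant for feature columns, and the `toy` values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data standing in for the weather features
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, np.nan, 9.2, 17.5, 14.6, 14.3, 7.7],
    "WindGustDir": ["W", "WNW", "WSW", np.nan, "W", "WNW", "W", "W"],
})
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

# Numerical columns: mean imputation followed by standard scaling
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation followed by integer encoding;
# categories unseen during fit are mapped to -1 instead of raising an error
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])
preprocess = ColumnTransformer([
    ("num", num_pipe, ["MinTemp"]),
    ("cat", cat_pipe, ["WindGustDir"]),
])

# Fit on the training rows only, then reuse the fitted statistics on the test rows
train_arr = preprocess.fit_transform(toy_train)
test_arr = preprocess.transform(toy_test)
print(train_arr.shape, test_arr.shape)
```

Fitting on the training split keeps the test rows unseen until evaluation, which tends to give a more honest accuracy estimate.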

[14]:

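# Inspect the original frame. Note: df still contains object columns and missing
# values; the imputed and encoded features are stored in X (inspect with X.info())
df.info()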

<class 'pandas.core.frame.DataFrame'>
Index: 142193 entries, 0 to 145458
Data columns (total 22 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Date           142193 non-null  datetime64[ns]
 1   Location       142193 non-null  object
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   WindGustDir    132863 non-null  object
 6   WindGustSpeed  132923 non-null  float64
 7   WindDir9am     132180 non-null  object
 8   WindDir3pm     138415 non-null  object
 9   WindSpeed9am   140845 non-null  float64
 10  WindSpeed3pm   139563 non-null  float64
 11  Humidity9am    140419 non-null  float64
 12  Humidity3pm    138583 non-null  float64
 13  Pressure9am    128179 non-null  float64
 14  Pressure3pm    128212 non-null  float64
 15  Temp9am        141289 non-null  float64
 16  Temp3pm        139467 non-null  float64
 17  RainToday      140787 non-null  object
 18  RainTomorrow   142193 non-null  int64
 19  Year           142193 non-null  int64
 20  Month          142193 non-null  int64
 21  Day            142193 non-null  int64
dtypes: datetime64[ns](1), float64(12), int64(4), object(5)
memory usage: 25.0+ MB
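The summary above describes `df`, which is why it still lists object columns and missing values; the transformed features are held in `X`. A minimal sketch of an explicit check that could be applied to `X`, assuming a hypothetical helper named `is_model_ready` and shown here on toy frames:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def is_model_ready(frame: pd.DataFrame) -> bool:
    """Return True when every column is numeric and no value is missing."""
    all_numeric = all(is_numeric_dtype(frame[col]) for col in frame.columns)
    no_missing = not frame.isnull().any().any()
    return all_numeric and no_missing

# Toy frames: one raw (string column, missing value), one fully preprocessed
raw = pd.DataFrame({"Rainfall": [0.6, np.nan], "RainToday": ["No", "Yes"]})
clean = pd.DataFrame({"Rainfall": [0.6, 0.0], "RainToday": [0, 1]})

print(is_model_ready(raw), is_model_ready(clean))  # False True
```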

[15]:

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

[15]:

((113754, 20), (28439, 20), (113754,), (28439,))

Patch original Scikit-learn with Intel® Extension for Scikit-learn

Intel® Extension for Scikit-learn (previously known as daal4py) contains drop-in replacement functionality for the stock Scikit-learn package. You can take advantage of the performance optimizations of Intel® Extension for Scikit-learn by adding just two lines of code before the usual Scikit-learn imports:

[16]:

from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)
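If the same code has to run on machines where the extension may not be installed, the patch call can be guarded. The following is a minimal sketch under that assumption; the `using_extension` flag is a hypothetical name introduced for illustration.

```python
# Fall back to stock scikit-learn when the extension is not installed
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
    using_extension = True
except ImportError:
    using_extension = False

# Import estimators only after the patch attempt so patched classes are picked up
from sklearn.ensemble import RandomForestClassifier

print(f"Extension active: {using_extension}")
```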

Train the RandomForestClassifier with Intel® Extension for Scikit-learn on the Rain in Australia dataset

[17]:

from sklearn.ensemble import RandomForestClassifier

params = {
    'n_estimators': 1000,
    'criterion': 'gini',
    'max_features': 'sqrt',
    'n_jobs': -1
}
start = timer()
patched_model = RandomForestClassifier(**params).fit(X_train, y_train)
patched_train_time = timer() - start

print(f"Intel® extension for Scikit-learn Training Time: {patched_train_time:.3f} seconds")

Intel® extension for Scikit-learn Training Time: 9.439 seconds

Predict and evaluate the RandomForestClassifier with Intel® Extension for Scikit-learn

[18]:

patched_y_pred = patched_model.predict(X_test)
patched_accuracy = accuracy_score(y_test, patched_y_pred)

print(f"Intel® extension for Scikit-learn Accuracy: {patched_accuracy:.4f}")

Intel® extension for Scikit-learn Accuracy: 0.8551

Train the same algorithm with original Scikit-learn

To disable the optimizations, call unpatch_sklearn and re-import the RandomForestClassifier class.

[19]:

from sklearnex import unpatch_sklearn
unpatch_sklearn()

Train the RandomForestClassifier with original Scikit-learn on the Rain in Australia dataset

[20]:

from sklearn.ensemble import RandomForestClassifier

start = timer()
ori_model = RandomForestClassifier(**params).fit(X_train, y_train)
ori_train_time = timer() - start

print(f"Original Scikit-learn Training Time: {ori_train_time:.3f} seconds")

Original Scikit-learn Training Time: 47.955 seconds
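Both training times in this notebook come from a single run, and a single wall-clock measurement can vary with system load. A minimal sketch of a repeat-and-take-the-best timer, assuming a hypothetical helper named `time_call`; the summation below is only a stand-in workload, where the notebook would pass a lambda wrapping the `.fit(...)` call.

```python
from timeit import default_timer as timer

def time_call(func, repeats=3):
    """Run func() `repeats` times and return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = timer()
        result = func()
        best = min(best, timer() - start)
    return best, result

# Stand-in workload used only to exercise the helper
elapsed, total = time_call(lambda: sum(range(100_000)))
print(f"Best of 3: {elapsed:.6f} seconds")
```

Taking the minimum over a few repeats tends to reduce the influence of background noise, at the cost of a longer benchmark.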

Predict and evaluate the RandomForestClassifier with original Scikit-learn

[21]:

ori_y_pred = ori_model.predict(X_test)
ori_accuracy = accuracy_score(y_test, ori_y_pred)

print(f"Original Scikit-learn Accuracy: {ori_accuracy:.4f}")

Original Scikit-learn Accuracy: 0.8549

Comparison

[22]:

compare_df = pd.DataFrame({
    'Original': [ori_accuracy, ori_train_time],
    'Patched': [patched_accuracy, patched_train_time]
}, index=['Accuracy', 'Training Time (s)'])

for col in compare_df.columns:
    compare_df[col] = compare_df[col].round(4)

# Calculate the relative change of Patched vs. Original in percent;
# for training time, a negative value means the patched run was faster
compare_df['Improvement (%)'] = (compare_df['Patched'] - compare_df['Original']) / compare_df['Original'] * 100
compare_df['Improvement (%)'] = compare_df['Improvement (%)'].round(2)

compare_df

[22]:

	Original	Patched	Improvement (%)
Accuracy	0.8549	0.8551	0.02
Training Time (s)	47.9546	9.4395	-80.32
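The -80.32% in the training-time row is a relative change, so it means the patched run took about 80% less time, which corresponds to a speedup of roughly 5.1x. The arithmetic can be checked with the rounded values from the table:

```python
# Values copied from the comparison table above
ori_time, patched_time = 47.9546, 9.4395

# Relative change: negative means the patched run took less time
relative_change = (patched_time - ori_time) / ori_time * 100
# Speedup: how many times faster the patched run was
speedup = ori_time / patched_time

print(f"Relative change: {relative_change:.2f}%")  # -80.32%
print(f"Speedup: {speedup:.1f}x")                  # 5.1x
```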

[23]:

HTML(
    f"<h3>Compare Accuracy of patched Scikit-learn and original</h3>"
    f"Accuracy of patched Scikit-learn: {patched_accuracy} <br>"
    f"Accuracy of unpatched Scikit-learn: {ori_accuracy} <br>"
    f"Metrics ratio: {patched_accuracy/ori_accuracy} <br>"
    f"<h3>With Scikit-learn-intelex patching you can:</h3>"
    f"<ul>"
    f"<li>Use your Scikit-learn code for training and prediction with minimal changes (a couple of lines of code);</li>"
    f"<li>Get comparable model quality</li>"
    f"<li>Get a <strong>{(ori_train_time/patched_train_time):.1f}x</strong> speedup.</li>"
    f"</ul>"
)

[23]:

Compare Accuracy of patched Scikit-learn and original

Accuracy of patched Scikit-learn: 0.8550933577129998
Accuracy of unpatched Scikit-learn: 0.8548823798305144
Metrics ratio: 1.0002467917077986

With Scikit-learn-intelex patching you can:

Use your Scikit-learn code for training and prediction with minimal changes (a couple of lines of code);
Get comparable model quality
Get a 5.1x speedup.