Intel® Extension for Scikit-learn RandomForestClassifier for the Rain in Australia dataset
The goal is to predict whether it will rain the next day.
[1]:
import pandas as pd
from timeit import default_timer as timer
from IPython.display import HTML
import warnings
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
warnings.filterwarnings('ignore')
Download the data
[2]:
data = fetch_openml(data_id=46315, as_frame=True)
df = data.frame
df.head()
[2]:
 | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |
5 rows × 23 columns
Explore the data
[3]:
# Show the dimensions of the dataset
df.shape
[3]:
(145460, 23)
[4]:
# Show the summary of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 145460 non-null object
1 Location 145460 non-null object
2 MinTemp 143975 non-null float64
3 MaxTemp 144199 non-null float64
4 Rainfall 142199 non-null float64
5 Evaporation 82670 non-null float64
6 Sunshine 75625 non-null float64
7 WindGustDir 135134 non-null object
8 WindGustSpeed 135197 non-null float64
9 WindDir9am 134894 non-null object
10 WindDir3pm 141232 non-null object
11 WindSpeed9am 143693 non-null float64
12 WindSpeed3pm 142398 non-null float64
13 Humidity9am 142806 non-null float64
14 Humidity3pm 140953 non-null float64
15 Pressure9am 130395 non-null float64
16 Pressure3pm 130432 non-null float64
17 Cloud9am 89572 non-null float64
18 Cloud3pm 86102 non-null float64
19 Temp9am 143693 non-null float64
20 Temp3pm 141851 non-null float64
21 RainToday 142199 non-null object
22 RainTomorrow 142193 non-null object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB
[5]:
# Check the missing values and the percentage of missing values in each column
missing_values = df.isnull().sum()
missing_values_percentage = missing_values / df.shape[0] * 100
missing_values_df = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage (%)': missing_values_percentage
})
missing_values_df
[5]:
 | Missing Values | Percentage (%) |
---|---|---|
Date | 0 | 0.000000 |
Location | 0 | 0.000000 |
MinTemp | 1485 | 1.020899 |
MaxTemp | 1261 | 0.866905 |
Rainfall | 3261 | 2.241853 |
Evaporation | 62790 | 43.166506 |
Sunshine | 69835 | 48.009762 |
WindGustDir | 10326 | 7.098859 |
WindGustSpeed | 10263 | 7.055548 |
WindDir9am | 10566 | 7.263853 |
WindDir3pm | 4228 | 2.906641 |
WindSpeed9am | 1767 | 1.214767 |
WindSpeed3pm | 3062 | 2.105046 |
Humidity9am | 2654 | 1.824557 |
Humidity3pm | 4507 | 3.098446 |
Pressure9am | 15065 | 10.356799 |
Pressure3pm | 15028 | 10.331363 |
Cloud9am | 55888 | 38.421559 |
Cloud3pm | 59358 | 40.807095 |
Temp9am | 1767 | 1.214767 |
Temp3pm | 3609 | 2.481094 |
RainToday | 3261 | 2.241853 |
RainTomorrow | 3267 | 2.245978 |
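The same percentages can be obtained in one step, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a hypothetical toy frame (the column names `a` and `b` are placeholders, not columns of the weather dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "a" has 2 of 4 values missing, "b" has none.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
})

# isna() yields a boolean frame; its column-wise mean is the missing fraction.
pct_missing = toy.isna().mean() * 100

# Sort so the columns with the most missing data come first.
print(pct_missing.sort_values(ascending=False))
```

Sorting the result makes it easier to spot candidates for removal, such as the columns with roughly 40% or more missing values in the table above.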
Preprocessing
[6]:
# Drop columns with more than 30% missing values
df = df.dropna(thresh=df.shape[0]*0.7, axis=1)
df.shape
[6]:
(145460, 19)
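The `thresh` argument of `dropna` is the minimum number of non-null values a column must have to be kept, so "at most 30% missing" is expressed as "at least 70% present". A small sketch on hypothetical toy data (the column names are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 10 rows and different amounts of missing data.
toy = pd.DataFrame({
    "full": np.arange(10, dtype=float),               # 10 non-null (0% missing)
    "some_missing": [1.0, 2.0, np.nan] + [3.0] * 7,   # 9 non-null (10% missing)
    "mostly_missing": [np.nan] * 5 + [1.0] * 5,       # 5 non-null (50% missing)
})

# Keep a column only if at least 7 of its 10 values (70%) are present.
min_non_null = 7
kept = toy.dropna(thresh=min_non_null, axis=1)

print(list(kept.columns))  # ['full', 'some_missing']
```

In the notebook, the same rule removes Evaporation, Sunshine, Cloud9am, and Cloud3pm, which reduces the frame from 23 to 19 columns.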
[7]:
# Drop rows with missing target value
df = df.dropna(subset=['RainTomorrow'])
df.shape
[7]:
(142193, 19)
[8]:
# Encode the target variable
df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1})
[9]:
# Split the Date column into Year, Month, and Day
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year.astype('int64')
df['Month'] = df['Date'].dt.month.astype('int64')
df['Day'] = df['Date'].dt.day.astype('int64')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 142193 entries, 0 to 145458
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 142193 non-null datetime64[ns]
1 Location 142193 non-null object
2 MinTemp 141556 non-null float64
3 MaxTemp 141871 non-null float64
4 Rainfall 140787 non-null float64
5 WindGustDir 132863 non-null object
6 WindGustSpeed 132923 non-null float64
7 WindDir9am 132180 non-null object
8 WindDir3pm 138415 non-null object
9 WindSpeed9am 140845 non-null float64
10 WindSpeed3pm 139563 non-null float64
11 Humidity9am 140419 non-null float64
12 Humidity3pm 138583 non-null float64
13 Pressure9am 128179 non-null float64
14 Pressure3pm 128212 non-null float64
15 Temp9am 141289 non-null float64
16 Temp3pm 139467 non-null float64
17 RainToday 140787 non-null object
18 RainTomorrow 142193 non-null int64
19 Year 142193 non-null int64
20 Month 142193 non-null int64
21 Day 142193 non-null int64
dtypes: datetime64[ns](1), float64(12), int64(4), object(5)
memory usage: 25.0+ MB
[10]:
# Define the features and the target
X = df.drop(columns=['RainTomorrow', 'Date'])
y = df['RainTomorrow']
[11]:
# Identify the numerical and categorical columns
num_columns = X.select_dtypes(include=['int64', 'float64']).columns
cat_columns = X.select_dtypes(include=['object']).columns
print(f'Numerical Columns: {list(num_columns)}')
print(f'Categorical Columns: {list(cat_columns)}')
Numerical Columns: ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm', 'Year', 'Month', 'Day']
Categorical Columns: ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
[12]:
# Preprocess the numerical features
# Impute missing values with the mean
imputer_num = SimpleImputer(strategy='mean')
X[num_columns] = imputer_num.fit_transform(X[num_columns])
# Scale the numerical columns
scaler = StandardScaler()
X[num_columns] = scaler.fit_transform(X[num_columns])
[13]:
# Preprocess the categorical features
# Impute missing values with the most frequent value
imputer_cat = SimpleImputer(strategy='most_frequent')
X[cat_columns] = imputer_cat.fit_transform(X[cat_columns])
# Label encode the categorical columns
encoder = LabelEncoder()
for col in cat_columns:
    X[col] = encoder.fit_transform(X[col])
[14]:
# Ensure all feature columns are numerical and have no missing values.
# Inspect X (the preprocessed features); df still holds the raw columns.
X.info()
[15]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
[15]:
((113754, 20), (28439, 20), (113754,), (28439,))
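Note that cells [12] and [13] fit the imputers, the scaler, and the encoder on all rows before this split, so statistics from the test rows influence the preprocessing. One possible alternative, sketched here on hypothetical toy data rather than on the notebook's `X`, is to split first and wrap the same steps in a `ColumnTransformer` that is fitted on the training rows only. The sketch also swaps `LabelEncoder` for `OrdinalEncoder`, which scikit-learn provides for encoding feature columns; the `unknown_value=-1` setting is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data standing in for the weather features.
toy_X = pd.DataFrame({
    "Humidity3pm": [22.0, 25.0, np.nan, 16.0, 33.0, 80.0, 75.0, 90.0],
    "WindGustDir": ["W", "WNW", "WSW", np.nan, "W", "N", "N", "NE"],
})
toy_y = pd.Series([0, 0, 0, 0, 0, 1, 1, 1])

num_cols = ["Humidity3pm"]
cat_cols = ["WindGustDir"]

# Same steps as in the notebook: mean imputation + scaling for numbers,
# most-frequent imputation + integer encoding for categories.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Categories unseen during fit are mapped to -1 instead of raising.
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                                  unknown_value=-1)),
    ]), cat_cols),
])

# Split first, then fit the preprocessing on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42
)
X_tr_prepared = preprocess.fit_transform(X_tr)
X_te_prepared = preprocess.transform(X_te)

print(X_tr_prepared.shape, X_te_prepared.shape)
```

Whether this changes the accuracy reported below is not shown here; the sketch only illustrates the ordering of the split and the fit.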
Patch original Scikit-learn with Intel® Extension for Scikit-learn
Intel® Extension for Scikit-learn (previously known as daal4py) contains drop-in replacement functionality for the stock Scikit-learn package. You can take advantage of the performance optimizations of Intel® Extension for Scikit-learn by adding just two lines of code before the usual Scikit-learn imports:
[16]:
from sklearnex import patch_sklearn
patch_sklearn()
Intel(R) Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)
Train the RandomForestClassifier with Intel® Extension for Scikit-learn on the Rain in Australia dataset
[17]:
from sklearn.ensemble import RandomForestClassifier
params = {
    'n_estimators': 1000,
    'criterion': 'gini',
    'max_features': 'sqrt',
    'n_jobs': -1
}
start = timer()
patched_model = RandomForestClassifier(**params).fit(X_train, y_train)
patched_train_time = timer() - start
print(f"Intel® extension for Scikit-learn Training Time: {patched_train_time:.3f} seconds")
Intel® extension for Scikit-learn Training Time: 9.439 seconds
Predict and evaluate the RandomForestClassifier with Intel® Extension for Scikit-learn
[18]:
patched_y_pred = patched_model.predict(X_test)
patched_accuracy = accuracy_score(y_test, patched_y_pred)
print(f"Intel® extension for Scikit-learn Accuracy: {patched_accuracy:.4f}")
Intel® extension for Scikit-learn Accuracy: 0.8551
Train the same algorithm with original Scikit-learn
To disable the optimizations, call unpatch_sklearn and re-import the RandomForestClassifier class.
[19]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()
Train the RandomForestClassifier with original Scikit-learn on the Rain in Australia dataset
[20]:
from sklearn.ensemble import RandomForestClassifier
start = timer()
ori_model = RandomForestClassifier(**params).fit(X_train, y_train)
ori_train_time = timer() - start
print(f"Original Scikit-learn Training Time: {ori_train_time:.3f} seconds")
Original Scikit-learn Training Time: 47.955 seconds
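Both training times in this notebook come from a single run each, and wall-clock timings can vary between runs. A minimal sketch of a more robust measurement, assuming a hypothetical helper named `time_call` (not part of scikit-learn or sklearnex) that repeats a call and reports the median:

```python
from statistics import median
from timeit import default_timer as timer


def time_call(fn, repeats=3):
    """Hypothetical helper: run fn() `repeats` times, return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = timer()
        fn()
        durations.append(timer() - start)
    return median(durations)


# Stand-in workload; any zero-argument callable can be timed this way.
elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"Median time: {elapsed:.4f} seconds")
```

In this notebook, the helper could be called as `time_call(lambda: RandomForestClassifier(**params).fit(X_train, y_train))`, once while patched and once after unpatching.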
Predict and evaluate the RandomForestClassifier with original Scikit-learn
[21]:
ori_y_pred = ori_model.predict(X_test)
ori_accuracy = accuracy_score(y_test, ori_y_pred)
print(f"Original Scikit-learn Accuracy: {ori_accuracy:.4f}")
Original Scikit-learn Accuracy: 0.8549
Comparison
[22]:
compare_df = pd.DataFrame({
    'Original': [ori_accuracy, ori_train_time],
    'Patched': [patched_accuracy, patched_train_time]
}, index=['Accuracy', 'Training Time (s)'])
for col in compare_df.columns:
    compare_df[col] = compare_df[col].round(4)
# Relative change of Patched vs. Original, in percent.
# For accuracy, a positive value is better; for training time, a negative value means faster training.
compare_df['Improvement (%)'] = (compare_df['Patched'] - compare_df['Original']) / compare_df['Original'] * 100
compare_df['Improvement (%)'] = compare_df['Improvement (%)'].round(2)
compare_df
[22]:
 | Original | Patched | Improvement (%) |
---|---|---|---|
Accuracy | 0.8549 | 0.8551 | 0.02 |
Training Time (s) | 47.9546 | 9.4395 | -80.32 |
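The training-time row can be read in two equivalent ways: as a relative change in percent, or as a speedup factor. The arithmetic, using the two rounded times from the table above, works out as follows:

```python
# Rounded training times in seconds, taken from the comparison table.
ori_time = 47.9546
patched_time = 9.4395

# Relative change: negative because the patched run takes less time.
relative_change = (patched_time - ori_time) / ori_time * 100

# Speedup factor: how many times faster the patched run is.
speedup = ori_time / patched_time

print(f"Relative change: {relative_change:.2f}%")  # -80.32%
print(f"Speedup: {speedup:.1f}x")                  # 5.1x
```

An 80.32% reduction in training time and a 5.1x speedup therefore describe the same measurement.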
[23]:
HTML(
    f"<h3>Compare Accuracy of patched Scikit-learn and original</h3>"
    f"Accuracy of patched Scikit-learn: {patched_accuracy} <br>"
    f"Accuracy of unpatched Scikit-learn: {ori_accuracy} <br>"
    f"Metrics ratio: {patched_accuracy/ori_accuracy} <br>"
    f"<h3>With Scikit-learn-intelex patching you can:</h3>"
    f"<ul>"
    f"<li>Use your Scikit-learn code for training and prediction with minimal changes (a couple of lines of code);</li>"
    f"<li>Get comparable model quality</li>"
    f"<li>Get a <strong>{(ori_train_time/patched_train_time):.1f}x</strong> speedup.</li>"
    f"</ul>"
)
[23]:
Compare Accuracy of patched Scikit-learn and original
Accuracy of patched Scikit-learn: 0.8550933577129998
Accuracy of unpatched Scikit-learn: 0.8548823798305144
Metrics ratio: 1.0002467917077986
With Scikit-learn-intelex patching you can:
- Use your Scikit-learn code for training and prediction with minimal changes (a couple of lines of code);
- Get comparable model quality
- Get a 5.1x speedup.