What is One Hot Encoding?
A one-hot encoding is a representation of categorical variables as binary vectors.
This first requires that the categorical values be mapped to integer values.
Each integer is then represented as a binary vector whose length equals the number of categories: every element is 0 except the one at the integer's index, which is 1.
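The two steps described above can be sketched in plain Python. This is only an illustration of the idea, not how scikit-learn implements it; the names categories, category_to_index and one_hot are made up for this example, and sorting the categories is just one possible way to assign the integers.

```python
# Category labels taken from the 'team' column used later in this article
teams = ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C']

# Step 1: map each distinct category to an integer (sorted for a stable order)
categories = sorted(set(teams))
category_to_index = {category: index for index, category in enumerate(categories)}

# Step 2: build one binary vector per value, with a single 1 at the mapped index
def one_hot(value):
    vector = [0] * len(categories)
    vector[category_to_index[value]] = 1
    return vector

encoded = [one_hot(team) for team in teams]

print(category_to_index)  # {'A': 0, 'B': 1, 'C': 2}
print(encoded[0])         # [1, 0, 0]  (team A)
print(encoded[2])         # [0, 1, 0]  (team B)
print(encoded[6])         # [0, 0, 1]  (team C)
```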
Why use One Hot Encoding?
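Many machine learning algorithms cannot work with categorical data directly, so the categories must be converted into numbers. This applies to both input and output variables that are categorical. Integer codes alone can suggest an ordering between categories (for example, C > B > A) that does not exist. One-hot encoding avoids this by giving each category its own binary column.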
One Hot Encode with scikit-learn
Step 1: Create the dataset
import pandas as pd
# create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})
# view DataFrame
print(df)
  team  points
0    A      25
1    A      12
2    B      15
3    B      14
4    B      19
5    B      23
6    C      25
7    C      29
Step 2: Perform one hot encoding
from sklearn.preprocessing import OneHotEncoder
# create an instance of OneHotEncoder; categories not seen during fitting are ignored
encoder = OneHotEncoder(handle_unknown='ignore')
# one-hot encode the 'team' column; toarray() converts the sparse result to a dense array
encoder_df = pd.DataFrame(encoder.fit_transform(df[['team']]).toarray())
# merge the one-hot encoded columns back with the original DataFrame (joined on the index)
final_df = df.join(encoder_df)
# view the final DataFrame
print(final_df)
  team  points    0    1    2
0    A      25  1.0  0.0  0.0
1    A      12  1.0  0.0  0.0
2    B      15  0.0  1.0  0.0
3    B      14  0.0  1.0  0.0
4    B      19  0.0  1.0  0.0
5    B      23  0.0  1.0  0.0
6    C      25  0.0  0.0  1.0
7    C      29  0.0  0.0  1.0
Step 3: Drop the column and get the results
# drop the original 'team' column, which the encoded columns now replace
final_df.drop('team', axis=1, inplace=True)
# view the final DataFrame
print(final_df)
   points    0    1    2
0      25  1.0  0.0  0.0
1      12  1.0  0.0  0.0
2      15  0.0  1.0  0.0
3      14  0.0  1.0  0.0
4      19  0.0  1.0  0.0
5      23  0.0  1.0  0.0
6      25  0.0  0.0  1.0
7      29  0.0  0.0  1.0
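Step 2 creates the encoder with handle_unknown='ignore'. The short sketch below shows what that setting does when the fitted encoder later meets a category it has not seen. The team 'D' and the new_data DataFrame are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# fit the encoder on the three teams used in this article
train = pd.DataFrame({'team': ['A', 'B', 'C']})
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['team']])

# 'D' is a hypothetical team that was not present when the encoder was fitted
new_data = pd.DataFrame({'team': ['B', 'D']})
result = encoder.transform(new_data[['team']]).toarray()

print(result)
# [[0. 1. 0.]
#  [0. 0. 0.]]
```

The known team B gets its usual vector, while the unseen team D becomes a row of all zeros. With the default handle_unknown='error', the same transform call would raise an error instead.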