Data Preprocessing in Python
Data Preprocessing Tools
Importing the libraries
We will need the following three libraries for most data preprocessing tasks:
- NumPy - lets us work with arrays
- Matplotlib - lets us plot charts and graphs
- Pandas - lets us import the dataset and create the matrix of features and the dependent variable vector
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Importing the dataset
Step 1: Import the dataset
Step 2: Create matrix of feature and dependent variable vectors
In any machine learning dataset we have two sets of data: the features and the dependent variable.
The features are the columns with which you predict the dependent variable.
The dependent variable is the column you want the ML model to predict.
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(X)
print(y)
Taking care of missing data
One way to take care of missing data is to remove the rows that contain it. This is effective when you have a large dataset and only a few rows have missing values, so removing them will not impact the model's learning very much.
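The row-removal approach can be sketched with pandas; the column names and values below are made up for illustration, not taken from Data.csv:

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values; column names here are hypothetical
df = pd.DataFrame({
    'Age': [44.0, 27.0, np.nan, 38.0],
    'Salary': [72000.0, 48000.0, 54000.0, np.nan],
})

# dropna() removes every row that contains at least one missing value
df_clean = df.dropna()
print(df_clean)  # only the first two rows survive
```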
A second way is to replace each missing value with the mean of its column.
For this we will use one of the best data science libraries, scikit-learn.
It contains many tools, including a lot of preprocessing tools.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit on the numeric columns only, then replace their missing values
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)
Encoding categorical data
Categorical columns usually contain string data, so it is necessary to encode them into a numerical format for the machine learning model to interpret them accurately.
One of the most popular methods is one-hot encoding, which turns a categorical column with n categories into n binary columns.
If a column takes only two values, we can simply represent them in binary format as 1/0. This can be achieved with label encoding.
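The two-value case can be sketched as follows; the 'Yes'/'No' values are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column, e.g. a yes/no purchase flag
binary_col = np.array(['No', 'Yes', 'No', 'Yes'])

# LabelEncoder assigns integers to the classes in sorted order: 'No' -> 0, 'Yes' -> 1
le = LabelEncoder()
encoded = le.fit_transform(binary_col)
print(encoded)  # [0 1 0 1]
```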
Encoding the independent variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0; remainder='passthrough' keeps the other columns as-is
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)
Encoding the dependent variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)
Splitting the dataset into Training set and Test set
from sklearn.model_selection import train_test_split
# An 80/20 split; random_state fixes the shuffle so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
print(X_train)
print(X_test)
print(y_train)
print(y_test)
Feature Scaling
Feature scaling puts all our features on the same scale so that no single feature dominates the others. Standardization, used below, transforms each value as x' = (x - mean) / standard deviation of its column.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Scale only the numeric columns; leave the one-hot encoded dummy columns untouched
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# Reuse the scaler fitted on the training set to avoid information leakage
X_test[:, 3:] = sc.transform(X_test[:, 3:])
print(X_train)
print(X_test)
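As a sanity check, standardization rescales each column to mean 0 and standard deviation 1; a minimal sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column feature matrix
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])

sc = StandardScaler()
X_scaled = sc.fit_transform(X_toy)  # x' = (x - mean) / std, computed per column

print(X_scaled.mean(), X_scaled.std())  # ~0.0 and ~1.0
```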