In this tutorial, we will focus on using the Scikit-Learn library in Python. Scikit-Learn is a simple, fast and useful library for machine learning in Python. It provides simple and efficient tools for data analysis and data mining. It is free, easy to use, accessible to everyone, and reusable in several contexts. It is built on NumPy, SciPy and matplotlib. It is an open-source library released under the BSD license, so it can also be used commercially. It began as a Google Summer of Code project and is now developed by a community of contributors. Scikit-Learn, also known as sklearn, can be used for statistical modeling tasks such as classification, regression and clustering. Its models expose many tuning parameters.
How to use Scikit-Learn in Python [Complete Tutorial]
The target audience of the sklearn library is anyone with a keen interest in machine learning with Python. To understand sklearn in depth, one should have basic knowledge of Python and of machine learning.
If you have Anaconda on your system, you don’t need to install sklearn separately, because Anaconda ships with sklearn already installed. If you don’t have Anaconda, you can install sklearn with pip. In a conda environment that lacks it, use the conda command instead.
pip install scikit-learn
conda install scikit-learn
To use sklearn in your program, import it with the statement import sklearn.
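A minimal check that the installation worked is to import the package and print its version. The version string you see depends on your environment.

```python
import sklearn

# Print the installed scikit-learn version to confirm the import works.
print(sklearn.__version__)
```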
Modeling with Scikit-learn
Before we use a model from scikit-learn, we are supposed to follow some basic steps, starting with dataset loading. A dataset consists of two components:-
- Feature names: The column names of the dataset are its feature names.
- Feature matrix: The columns of data taken together, with one row per sample and one column per feature, form the feature matrix.
Let’s load a simple dataset named Iris using scikit-learn. The datasets module can be imported with a simple command:-
from sklearn import datasets
After importing, we will load our specific dataset to use in our Jupyter Notebook.
Data = datasets.load_iris()
To confirm that the data is loaded, print the shape of the data:-
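A short sketch of that check, reusing the Data variable from the step above. The feature matrix is available as Data.data and the class labels as Data.target.

```python
from sklearn import datasets

Data = datasets.load_iris()

# The feature matrix has one row per flower and one column per feature.
print(Data.data.shape)    # (150, 4)
# The target vector holds one class label per row.
print(Data.target.shape)  # (150,)
```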
If we want to print the details of the iris dataset, we can do so by using:-
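One way to inspect the dataset, as a sketch. The object returned by load_iris() exposes the feature names, the class names, the data itself, and a text description.

```python
from sklearn import datasets

Data = datasets.load_iris()

print(Data.feature_names)  # column names of the feature matrix
print(Data.target_names)   # names of the three iris species
print(Data.data[:5])       # first five rows of the feature matrix
print(Data.DESCR)          # full text description of the dataset
```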
Before using or fitting any model, we will split the data in a 70:30 ratio, i.e. 70% of the data will be used as training data and 30% as testing data.
The train_test_split() function of scikit-learn performs the split. It takes the following arguments:-
- X, y: Here, X is the feature matrix and y is the response vector, which need to be split.
- test_size: This is the ratio of test data to the total given data. For example, setting test_size = 0.3 for the 150 rows of X produces test data of 150*0.3 = 45 rows.
- random_state: This seeds the random shuffling so that the split is the same on every run. It is useful in situations where you want reproducible results.
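The split described above can be sketched as follows. The seed random_state=1 is an arbitrary choice for illustration, and the names X_train, X_test, y_train and y_test are a common convention rather than a requirement.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target  # feature matrix and response vector

# Hold out 30% of the rows for testing; the seed is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print(X_train.shape)  # (105, 4)
print(X_test.shape)   # (45, 4)
```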
Now we can work through several examples to derive different insights or results from data by using sklearn. In this article, we will cover three topics:-
- K-Nearest neighbor classifier
- Preprocessing the Data
- K-Means Clustering
K-Nearest neighbor classifier
Let’s first understand what exactly k-nearest neighbor is. K-nearest neighbor, also known as KNN, is a non-parametric classification technique and one of the most well-known classification methods. The idea is that the chosen features define a space in which the known, labelled data points are placed. When the algorithm receives a new data point, it decides its class by comparing the classes of the k nearest known points.
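To make the idea concrete, here is a minimal from-scratch sketch in NumPy. It is not how sklearn implements KNN internally. The function name knn_predict, the use of Euclidean distance, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in 2-D.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))  # 1
```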
Working of KNN
K-Nearest Neighbors (KNN) is a standard machine-learning approach that has been adopted in large-scale data mining projects. The concept is to use a large amount of training data, where each data point is described by a set of variables. Conceptually, each point is placed in a high-dimensional space in which each axis represents one variable.
KNN does not build clusters. It stores the training data, and to classify a new observation it finds the k training points closest to it, usually by Euclidean distance, and assigns the class that is most common among them. The implementation is explained below:-
a) First of all, we import the KNN classifier in our notebook.
b) Then we fit a model named knn on the training data.
c) Then we predict on the test data and print the result.
d) To compute accuracy, we import the metrics module in our notebook.
e) Lastly, we compute the accuracy by comparing the predictions with the true labels.
f) The accuracy that we achieve here is 97%.
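Steps a) to e) can be sketched as follows. The choices n_neighbors=3 and random_state=1 are assumptions for this sketch, so the exact accuracy you get may differ from the figure quoted above.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1
)

# a) and b): create the classifier and fit it on the training data.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# c): predict the classes of the held-out test rows.
y_pred = knn.predict(X_test)
print(y_pred)

# d) and e): compare the predictions with the true labels.
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```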
Preprocessing the Data
Before feeding the large volume of raw data we are working with into machine learning algorithms, we need to transform it into a suitable form. This procedure is called preprocessing the data. Scikit-Learn provides the sklearn.preprocessing module for this purpose. We will look at preprocessing techniques using sklearn.
1. Mean Removal
It is used to remove the mean from each feature so that every feature is centered on zero.
a) Starting with importing preprocessing module in our notebook :-
b) Let’s take array as input data:-
c) Displaying the mean and the standard deviation of the input data:-
d) Removing the mean and scaling the input data to unit standard deviation:-
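Steps a) to d) might look like this. The values in input_data are hypothetical, and preprocessing.scale() is used here as one way to center each feature and scale it to unit variance.

```python
import numpy as np
from sklearn import preprocessing

# b) Hypothetical input data: four samples with three features each.
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# c) Per-feature mean and standard deviation before scaling.
print("Mean =", input_data.mean(axis=0))
print("Std deviation =", input_data.std(axis=0))

# d) Center each feature on zero and scale it to unit variance.
data_scaled = preprocessing.scale(input_data)
print("Mean after scaling =", data_scaled.mean(axis=0))
print("Std deviation after scaling =", data_scaled.std(axis=0))
```

After scaling, the printed means are zero up to floating-point rounding and the standard deviations are one.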
Scaling is another preprocessing technique, used to rescale the feature vectors. It is important because no feature should be artificially large or small compared with the others.
K-Means Clustering with sklearn
Let’s learn the basic mathematics of the well-known k-means clustering technique and how scikit-learn can be used to implement it. Clustering simply means grouping things so that the members of one group are more similar to each other than to the members of other groups. Clustering is also known as cluster analysis.
The K-Means approach is extremely popular because it is simple to use and computationally efficient compared to other clustering algorithms. The k-means algorithm belongs to prototype-based clustering. In prototype-based clustering, a cluster is a collection of objects in which each object is closer to that cluster’s prototype than to the prototype of any other cluster. In k-means, the prototype is the centroid of the cluster.
Applications of K-Means
The major applications of k-means clustering are:-
- Market segmentation: In the field of marketing, K-means clustering can be used to identify and separate out different customer groups based on existing customer data.
- Document clustering: K-means can group similar documents together for a variety of tasks.
- Image segmentation: K-means can also be used to partition an image into multiple segments so that it can be easily analyzed to extract useful information.
- Image compression: K-means can reduce the number of distinct colors in a digital image, which shrinks its size with limited loss of visual quality. Unlike lossless compression, this is a lossy technique.
K-Means Clustering Steps
The first and basic step of the k-means algorithm is to decide the number of clusters, k, and to assume initial centers (centroids) for them. Any k randomly chosen points can serve as the initial centroids.
K-means clustering then repeats three steps until convergence:-
- Find the centroid coordinate
- Find the distance of each point to the centroids
- Group the points based on minimum distance from centroid
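The loop above can be sketched from scratch in NumPy. This is an illustrative sketch, not sklearn's implementation. The function name kmeans, the random choice of initial centroids from the data points, and the toy data are assumptions.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Cluster the rows of X into k groups; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initial step: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_samples, k).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Group each point with its nearest centroid.
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: two well-separated groups in 2-D.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```

Which group receives index 0 and which receives index 1 depends on the random initialization, so only the grouping itself is meaningful.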
K-Means Clustering Implementation
Let’s have a look at a k-means implementation using sklearn.
a) Importing Libraries required for k-means:-
b) Let’s load the dataset named “Mall_Customers.csv”, which we use here for demo purposes.
The output is:-
c) The shape attribute shows that the dataset has 200 rows and 5 columns.
The shape of our dataset is:-
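Steps a) to c) can be sketched with pandas. Because the CSV file is not included here, the sketch reads a small in-memory stand-in. Its column names and values are hypothetical and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the CSV file; the column names are assumptions.
sample_csv = io.StringIO(
    "CustomerID,Gender,Age,Income,Score\n"
    "1,Male,19,15,39\n"
    "2,Female,21,15,81\n"
    "3,Female,20,16,6\n"
)

# With the real file, pass its path instead: pd.read_csv("Mall_Customers.csv")
df = pd.read_csv(sample_csv)

print(df.head())  # first rows of the table
print(df.shape)   # (number of rows, number of columns)
```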
d) Feature scaling is a necessary part of data preprocessing for clustering algorithms like K-means to deliver accurate results, because these algorithms calculate distances between data points. Features measured in different units should therefore be brought onto a single scale. We will use the min-max scaler for scaling in k-means.
The output after applying the min-max scaler:-
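A minimal sketch of min-max scaling with sklearn's MinMaxScaler. The two columns stand in for features measured on different scales, and their values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on different scales, e.g. an income and a score.
X = np.array([[15.0, 39.0],
              [16.0, 81.0],
              [54.0, 47.0],
              [137.0, 83.0]])

# Rescale every column to the range [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```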
e) Let us see how to apply K-Means in sklearn to group the dataset into 2 clusters (0 and 1). The output shows the cluster (0 or 1) assigned to each data point in the dataset.
The number of clusters is set with n_clusters = 2.
The two clusters are labelled 0 and 1.
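Step e) might be sketched as follows. Synthetic data from make_blobs stands in for the scaled customer data, and the settings n_init=10 and random_state=0 are assumptions made so that runs are repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the customer data: 200 points in two groups.
X, _ = make_blobs(n_samples=200, centers=2, n_features=2, random_state=42)
X_scaled = MinMaxScaler().fit_transform(X)

# Fit K-Means with two clusters and get the cluster index of every point.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(labels[:10])  # cluster index (0 or 1) of the first ten points
```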
f) The cluster_centers_ attribute of the fitted model holds the centroids of the two clusters. They can be obtained by:-
The centroids of the two clusters are shown below:-
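A short sketch of reading the centroids from a fitted model. As before, make_blobs data is a synthetic stand-in for the customer data, and n_init and random_state are assumed settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, scaled to [0, 1].
X, _ = make_blobs(n_samples=200, centers=2, n_features=2, random_state=42)
X_scaled = MinMaxScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# One row per cluster, one column per feature.
print(kmeans.cluster_centers_)
```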
g) Let's visualize the outcome. The graph clearly shows that the data can be divided into more than two groups. Now we need to figure out how many clusters there are.
The visualization is:-
In the following image, we can see dots in two colors. The points belonging to cluster 0 (zero) are marked in red and the points belonging to cluster 1 (one) in yellow.
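A plot of this kind can be sketched with matplotlib. The data here is synthetic, generated with four groups so that two clusters are visibly too few. The axis labels and the black centroid markers are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with four visible groups.
X, _ = make_blobs(n_samples=200, centers=4, n_features=2, random_state=42)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

fig, ax = plt.subplots()
# Cluster 0 in red, cluster 1 in yellow, centroids as black crosses.
ax.scatter(X[labels == 0, 0], X[labels == 0, 1], c="red", label="cluster 0")
ax.scatter(X[labels == 1, 0], X[labels == 1, 1], c="yellow", label="cluster 1")
ax.scatter(centers[:, 0], centers[:, 1], c="black", marker="x", label="centroids")
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.legend()
# In a script, call plt.show() here to open the figure window.
```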
Optimum number of Clusters in K-Means
The trickiest part of the k-means algorithm is deciding how many clusters the dataset should be divided into. Several methods can be used to find the optimal number of clusters. It can be done by the naïve method of trial and error. Other, more efficient methods are:-
- Elbow Method with Within-Cluster-Sum of Squared Error (WCSS)
- The Silhouette Method
Let’s use the WCSS method. The Elbow Method is a popular way of finding the ideal number of clusters.
Here, we compute the Within-Cluster Sum of Squared Errors (WCSS) for several values of k and select the k at which the WCSS stops dropping sharply and begins to level off. This point appears as an elbow in the plot of WCSS versus k.
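The WCSS values can be collected as in the sketch below. It relies on the inertia_ attribute of a fitted KMeans model, which is the sum of squared distances of the samples to their closest centroid. The synthetic four-group data and the range of k from 1 to 10 are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with four groups.
X, _ = make_blobs(n_samples=200, centers=4, n_features=2, random_state=42)

k_values = list(range(1, 11))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances.
    wcss.append(model.inertia_)

# Print the WCSS for each k; plotting wcss against k_values shows the elbow.
for k, value in zip(k_values, wcss):
    print(k, round(value, 1))
```

Reading the printed values, the elbow is the k after which adding another cluster reduces the WCSS only slightly.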
In this article, we have seen that Scikit-Learn makes it easy to work with several machine learning algorithms. We have seen examples of KNN, preprocessing techniques and k-means clustering using sklearn. Scikit-Learn is under active development, is maintained largely by volunteers, and is very popular in the community.