Cyberithub

How to use Scikit-Learn in Python [Complete Tutorial]

In this tutorial, we will focus on using the Scikit-Learn library in Python. Scikit-Learn is a simple, fast and useful library for machine learning in Python. It provides simple and efficient tools for data analysis and data mining. It is free, easy to use and accessible to everyone, and it is reusable in several contexts. It is built on NumPy, SciPy and matplotlib. It is an open source library that began as a Google Summer of Code project, and it is commercially usable under the BSD license. Scikit-Learn, also known as sklearn, can be used for statistical modeling tasks such as classification, regression and clustering, and its models expose many tuning parameters.

The target audience of the sklearn library is anyone with a keen interest in machine learning with Python. To understand sklearn in depth, one should have basic knowledge of Python and of machine learning concepts.

Installation

If you have Anaconda on your system, then you don’t need to install sklearn separately, because the Anaconda distribution ships with sklearn preinstalled. If you don’t have Anaconda on your system, you can install sklearn by using this command.

pip install scikit-learn

Or,

conda install scikit-learn

 

Importing

To use sklearn in your program, you need to import it using import sklearn as shown below.

import sklearn

 

Modeling with Scikit-learn

Before we fit a model from scikit-learn, we are supposed to follow some basic steps, starting with dataset loading. A dataset consists of two components:-

  • Feature names: The column names of the dataset.
  • Feature matrix: The collection of feature columns, with one row per sample and one column per feature.

Let’s load a simple dataset named Iris using scikit-learn. The datasets module can be imported by using a simple command:-

from sklearn import datasets

After importing, we will load our specific dataset to use in our Jupyter Notebook.

iris = datasets.load_iris()

To confirm that the data is loaded, print the shape of the data:-

print(iris.data.shape)

The details of the iris dataset, such as its feature names and target names, are available as attributes of the loaded object.
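A minimal sketch of inspecting the dataset follows. It assumes the loaded object is stored in a variable named iris, as in the loading step; which attributes the original screenshot printed is not known, so the selection here is only illustrative.

```python
from sklearn import datasets

# Load the Iris dataset that is bundled with scikit-learn.
iris = datasets.load_iris()

print(iris.data.shape)     # (rows, columns) of the feature matrix
print(iris.feature_names)  # names of the feature columns
print(iris.target_names)   # names of the classes (species)
print(iris.data[:5])       # first five rows of the feature matrix
print(iris.target[:5])     # class labels of those five rows
```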

Before fitting any model, we will split the data in a 70:30 ratio, i.e. 70% of the data will be used as training data and 30% will be used as testing data.
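The split can be sketched as follows. The variable names and the value random_state=1 are assumptions made for illustration, since the original screenshot is not available.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data    # feature matrix, 150 rows
y = iris.target  # response vector, 150 labels

# Hold out 30% of the rows for testing.
# random_state=1 is an arbitrary seed chosen for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print(X_train.shape, X_test.shape)
```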

The split is done with the train_test_split() function of scikit-learn. This function has the following arguments:-

  • X, y − Here, X is the feature matrix and y is the response vector, which need to be split.
  • test_size − This represents the ratio of test data to the total given data. For example, setting test_size = 0.3 for 150 rows of X will produce test data of 150*0.3 = 45 rows.
  • random_state − It seeds the random shuffling so that the split will always be the same. This is useful in situations where you want reproducible results.

Now we can use sklearn to derive different insights or results from data. In this article, we will cover three topics:-

  • K-Nearest neighbor classifier
  • Preprocessing the Data
  • K-Means Clustering

 

K-Nearest neighbor classifier

Let’s first understand what exactly k-nearest neighbor is. K-nearest neighbor, also known as KNN, is a non-parametric classification technique and one of the most well-known classification methods. The idea is that the chosen features define a space in which the known (labeled) data points are placed. When the algorithm receives a new data point, it decides the class of that point by looking at the classes of the k nearest known points, typically by majority vote.
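To make the idea concrete, here is a small from-scratch sketch of the majority vote. The function knn_predict and the toy data are hypothetical and exist only for illustration; this is not how scikit-learn implements KNN internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common class label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up): two well-separated groups with labels 0 and 1.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # near the first group
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))  # near the second group
```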

 

Working of KNN

K-Nearest Neighbors (KNN) is a standard machine learning approach that has been adopted in large-scale data mining projects. The concept is to employ a lot of training data, with each data point described by a set of variables. Each point is conceptually placed in a high-dimensional space, where each axis represents a distinct variable.

KNN does not build clusters or an explicit model during training; it simply stores the training data. To classify a new observation, it computes the distance from that observation to every training point, selects the k closest ones, and assigns the class that is most common among them. The implementation is explained below:-

a) First of all, we will import the KNN classifier (KNeighborsClassifier) into our notebook.

b) Then we will fit a model named knn.

c) Then we will predict and print the result.

d) To compute accuracy, we will import the metrics in our notebook.

e) And then lastly we need to perform accuracy computation as shown below.

f) The accuracy that we achieve here is 97%. The exact figure depends on the train/test split and on the chosen value of k.
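Steps a) to e) can be sketched end to end as follows. The values n_neighbors=3 and random_state=1 are assumptions made for this sketch, since the original screenshots are not available, so the printed accuracy may differ from the figure quoted above.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1
)

# a) and b): create the classifier and fit it on the training data.
knn = KNeighborsClassifier(n_neighbors=3)  # k=3 is an assumed value
knn.fit(X_train, y_train)

# c): predict the class of every test row and print the result.
y_pred = knn.predict(X_test)
print(y_pred)

# d) and e): compare the predictions with the true labels.
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```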

 

Preprocessing the Data

Before feeding raw data into machine learning algorithms, we need to transform it into a form the algorithms can use. This procedure is called preprocessing the data. Scikit-Learn provides the preprocessing module for this purpose. We will discuss multiple preprocessing techniques using sklearn.

1. Mean Removal

It is used to remove the mean from each feature so that every feature is centered on zero.

a) We start by importing the preprocessing module into our notebook.

b) Let’s take an array as input data.

c) Displaying the mean and the standard deviation of the input data:-

d) Removing the mean and scaling the input data to unit standard deviation.
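Steps a) to d) can be sketched as follows. The input array is made up for illustration, and preprocessing.scale is one way to do this step; the original screenshots may have used different values.

```python
import numpy as np
from sklearn import preprocessing

# b) A small made-up input array: 4 samples, 3 features.
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# c) Per-feature mean and standard deviation before scaling.
print("Mean =", input_data.mean(axis=0))
print("Std deviation =", input_data.std(axis=0))

# d) Subtract each column's mean and divide by its standard deviation.
data_scaled = preprocessing.scale(input_data)
print("Mean after scaling =", data_scaled.mean(axis=0))
print("Std deviation after scaling =", data_scaled.std(axis=0))
```

After scaling, each column's mean is zero and its standard deviation is one, up to floating point rounding.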

2. Scaling

This preprocessing technique rescales the feature vectors. It is important because no feature should be artificially large or small relative to the others.

Min-max scaler
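A minimal sketch of min-max scaling with MinMaxScaler follows. The input array and the target range (0, 1) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up input array: 4 samples, 3 features.
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Rescale each column so its minimum maps to 0 and its maximum to 1.
scaler = MinMaxScaler(feature_range=(0, 1))
data_minmax = scaler.fit_transform(input_data)
print(data_minmax)
```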

K-Means Clustering with sklearn

Let’s learn the basic mathematics of the well-known k-means clustering technique and how scikit-learn can be used to implement it. Clustering simply means grouping items so that the items in one group are more similar to each other than to the items in other groups. Clustering is also known as cluster analysis.

The K-Means approach is extremely popular because it is simple to use and computationally efficient when compared to other clustering algorithms. The k-means algorithm belongs to prototype-based clustering. In prototype-based clustering, a cluster is a collection of objects in which each object is closer to the prototype of its own cluster than to the prototype of any other cluster. In k-means, the prototype is the centroid (mean) of the cluster's points.

 

Applications of K-Means

The major applications of kmeans clustering are:-

  • Market segmentation: In the field of marketing, K-means clustering can be used to identify and separate out different customer groups based on existing customer data.
  • Document clustering: It is a technique to organize similar documents into groups for a variety of tasks.
  • Image segmentation: K-means can also be used to partition an image into multiple segments so that it can be easily analyzed for extracting useful information.
  • Image compression: K-means can compress a digital image by reducing the number of distinct colors it uses. Unlike lossless compression, this is a lossy technique, although the visible loss in image quality is often small.

 

K-Means Clustering Steps

The first and basic step of the k-means algorithm is to decide the number of clusters, k, and to assume initial centers (centroids) for them. We can take any k random points as the initial centroids.

Then k-means clustering repeats three steps until convergence:-

  • Find the coordinates of each centroid
  • Find the distance of each point to the centroids
  • Group the points based on minimum distance from centroid
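The three steps above can be sketched in plain NumPy as follows. The function kmeans and the toy data are hypothetical and written only to illustrate the loop; scikit-learn's own KMeans is more elaborate (for example, in how it chooses the initial centroids).

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means loop: assign points, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct points picked at random from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Group each point with its nearest centroid.
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged once the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy data (made up): two well-separated groups of three points each.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```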

 

K-Means Clustering Implementation

Let’s have a look at the k-means implementation using sklearn.

a) Importing the libraries required for k-means:-

b) Let’s load the dataset named “Mall_Customers.csv” that we have here for the demo purpose.

c) With the shape attribute, it can be seen that the dataset has 200 rows and 5 columns.

data.shape

The shape of our dataset is:-

(200, 5)

 

d) Feature scaling is a necessary part of data preprocessing for clustering algorithms like K-means to deliver accurate results. This is because clustering approaches calculate the distance between data points, so features measured in different units should be brought onto a single scale. We will use the min-max scaler for scaling in k-means.

e) Let us see how to apply K-Means in sklearn to group the dataset into 2 clusters (0 and 1). The output shows the cluster (0 or 1) assigned to each data point in the dataset.

The number of clusters is set with n_clusters = 2, and the two resulting clusters are labeled 0 and 1.

f) The cluster_centers_ attribute of the fitted model can be used to find the centroids of the two clusters:-

km.cluster_centers_
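Steps d) to f) can be sketched end to end as follows. Because Mall_Customers.csv is not reproduced here, the sketch uses a synthetic NumPy array as a stand-in for two numeric columns of that file; the data values, the meaning of the columns, and the settings n_init=10 and random_state=0 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for two numeric customer columns
# (for example income and spending score); the values are made up.
rng = np.random.default_rng(42)
low = rng.normal(loc=[30, 20], scale=5, size=(100, 2))
high = rng.normal(loc=[90, 80], scale=5, size=(100, 2))
X = np.vstack([low, high])

# d) Rescale every column to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# e) Fit K-Means with two clusters and get a label (0 or 1) per row.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)
print(labels[:10])

# f) Centroids of the two clusters, in the scaled feature space.
print(km.cluster_centers_)
```

Note that the centroids are expressed in the scaled space; scaler.inverse_transform can map them back to the original units if the fitted scaler is kept in a variable.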

g) Let's visualize the outcome. In the graph, we can clearly see that the data can be divided into more than two groups. Now we need to figure out how many clusters there are.

In the resulting plot, we can see dots in two colors: the points belonging to cluster 0 (zero) are marked in red, and the points belonging to cluster 1 (one) are marked in yellow.

Optimum number of Clusters in K-Means

Now the trickiest part of the k-means algorithm is deciding how many clusters the dataset should be divided into. Multiple methods can be used to find the optimal number of clusters. It can also be done by the naïve method of trial and error. The other, more efficient methods are:-

  • Elbow Method with Within-Cluster Sum of Squares (WCSS)
  • The Silhouette Method

Let’s use the WCSS method. The Elbow Method is a popular method for figuring out how many clusters are ideal.

Here, we compute the Within-Cluster Sum of Squares (WCSS) for several values of k and select the k at which the decrease in WCSS begins to level off. This appears as an elbow in the WCSS-versus-k graph.
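The WCSS computation can be sketched as follows. The data is synthetic (three made-up groups) and the range of k values, n_init=10 and random_state=0 are assumptions for illustration. In scikit-learn, the WCSS of a fitted KMeans model is available as its inertia_ attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three made-up, well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
])

# Fit K-Means for k = 1..7 and record the WCSS of each fit.
k_values = list(range(1, 8))
wcss = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is scikit-learn's name for WCSS

for k, value in zip(k_values, wcss):
    print(k, round(value, 2))
```

Plotting wcss against k_values (for example with matplotlib) shows the elbow; for this synthetic data the curve should flatten after k = 3, because the data was generated with three groups.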

Conclusion

In this article, we have seen that Scikit-Learn makes it easy to work with several machine learning algorithms. We have seen examples of KNN, preprocessing techniques and k-means clustering using sklearn. Scikit-Learn is under active development, maintained largely by volunteers, and is very popular in the community.
