Electron microscopy
 
K-Nearest Neighbours (KNN Algorithm)
- Integrated Circuits -
- An Online Book -
Integrated Circuits                                                                                   http://www.globalsino.com/ICs/        


=================================================================================

 

KNN (K-Nearest Neighbours) uses similarity for classification: the trained model takes an input and predicts its output from the most similar examples stored during training. KNN is one of the simplest supervised machine learning algorithms and is mostly used to classify a data point based on how its neighbours are classified. KNN stores all available cases and then classifies new cases based on a similarity (distance) measure.

Choosing the right value of k is a process called parameter tuning, and it is important for good prediction accuracy. If k equals m, then the algorithm looks at the m nearest neighbours. If k is too small, the prediction is noisy (highly sensitive to individual points); if k is too large, prediction takes very long, and/or a category with only a few samples in it will always be outvoted by larger categories, even though large values of k do smooth things over. Assuming n is the total number of data points, common rules for choosing k are (a quick sketch follows this list):
         i) Use the square root of n.
         ii) If n is an even number, k is chosen to be an odd number (an odd k avoids tied votes in a two-class problem).
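
As a rough, illustrative sketch of the square-root rule, assuming scikit-learn and a synthetic data set (both assumptions made here for illustration):

        # Rough sketch: choose k with the square-root rule, then fit a KNN classifier.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier

        X, y = make_classification(n_samples=200, n_features=4, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

        n = len(X_train)                    # total number of training points
        k = int(round(np.sqrt(n)))          # rule i): square root of n
        if k % 2 == 0:                      # rule ii): prefer an odd k to avoid tied votes
            k += 1

        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print(f"k = {k}, test accuracy = {knn.score(X_test, y_test):.3f}")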

Advantages of the KNN algorithm are:
         i) Training is very easy because there is no explicit learning phase; the model simply stores the training data.
         ii) KNN is easy to implement since it involves only two input parameters, namely the number of nearest neighbours and the test parameter (test size).
         iii) Adding additional training data to a KNN model is very easy because the model "trains" at prediction time and does not require retraining.
         iv) It can be used for both classification and regression problems. In classification, an input is assigned to the class most common among its k nearest neighbours; in regression, the average of the values of the k nearest neighbours is taken (see the sketch after this list).
         v) It needs no additional assumptions about the data because it is a non-parametric algorithm.
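
A minimal sketch of point iv), assuming scikit-learn and a tiny made-up data set (both are assumptions for illustration):

        # Sketch: the same neighbourhood idea used for classification and for regression.
        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

        X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
        y_class = np.array([0, 0, 0, 1, 1, 1])                 # class labels
        y_reg = np.array([1.1, 1.9, 3.2, 9.8, 11.1, 12.3])     # continuous targets

        clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
        reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)

        print(clf.predict([[2.5]]))   # majority vote among the 3 nearest neighbours
        print(reg.predict([[2.5]]))   # average of the 3 nearest neighbours' targets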

Disadvantages of the KNN algorithm are:
         i) It is slow while making predictions because all the distance computations happen at prediction time.
         ii) It is very sensitive to outliers because it chooses the neighbours based on distance criteria.
         iii) It is sensitive to dimensionality: as the number of input variables grows, it becomes difficult to compute meaningful distances in every dimension efficiently, so prediction quality degrades.
         iv) It is sensitive to the scale of the features, so feature scaling is required, because the algorithm is distance based.
         v) It is difficult to find the optimal k; a plot of error rate against k is usually needed to choose the value with the minimum error rate (see the sketch after this list).
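
A sketch of point v), assuming scikit-learn's cross-validation utilities and synthetic data (both are assumptions for illustration):

        # Sketch: choose k from a cross-validated error-rate curve.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.model_selection import cross_val_score
        from sklearn.neighbors import KNeighborsClassifier

        X, y = make_classification(n_samples=300, n_features=6, random_state=0)

        error_rates = {}
        for k in range(1, 26, 2):                               # odd k values only
            scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
            error_rates[k] = 1.0 - scores.mean()                # error rate = 1 - accuracy

        best_k = min(error_rates, key=error_rates.get)
        print(f"best k = {best_k}, error rate = {error_rates[best_k]:.3f}")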

As an example, the application of the procedure (for two classes) is as follows; steps ii) to v) use the same centroid updates on which the k-means cost function [4707c] below is based:
         i) Assume we have a set of data with two classes (blue and orange) below (code):

                  The data set is {x(1), x(2), ... x(m)}

           

         ii) Pick two points as the initial centroids, one for each class, by a best guess (code).

                  Note, for k classes, we then need to choose k initial cluster centroids (where k is the number of clusters we want to identify):

                               μ1, μ2, ..., μk ∈ ℝⁿ   (n is the dimensionality of the data)

                   For 2D data, it is easy to pick the values of the k initial cluster centroids. However, for high-dimensional data, it is more complicated. Therefore, we normally pick k of the m examples as the initial cluster centroids.

         iii) Assign each data point to the nearest cluster centroid, i.e., colour the dots according to the closer cluster centroid. In the case above, we have two clusters (green and red), so each point is assigned to the green or red cluster based on proximity (code).

         This step assigns each data point x(i) to the cluster whose centroid is the closest, where c(i) is the cluster index assigned to x(i).

         Mathematically, the step is defined as:

                      c(i) := argmin_j ||x(i) − μj||²  --------------------------- [4707a]

where:

  • c(i) is the index of the cluster to which the data point x(i) is assigned.
  • μj is the centroid of cluster j.
  • ||...|| denotes the Euclidean distance.

The centroid μj for cluster j is calculated as the mean of all the data points assigned to that cluster:

                      μj = (1/|Cj|) Σ_{x(i) ∈ Cj} x(i)  --------------------------- [4707b]

where:

  • μj is the centroid of cluster j.
  • |Cj| is the number of data points assigned to cluster j.
  • Σ_{x(i) ∈ Cj} x(i) is the sum of the coordinates of all data points assigned to cluster j.

         iv) Recalculate the centroids of the clusters based on the assigned data points: look at all the green and red dots, compute the averages of the green dots and of the red dots separately, and then re-assign the centroids to those averages (code).

         v) Repeat the two steps above until convergence. Convergence occurs when the centroids no longer change significantly or a predefined number of iterations is reached.

                v.a) to v.c) In each iteration, recalculate the centroids of the clusters and then re-colour the dots based on the closer cluster centroid, as sketched below.
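
A minimal NumPy sketch of steps ii) to v), assuming a small synthetic two-class data set (the data, the library choice, and the stopping test are assumptions made for illustration):

        # Sketch of the iterative centroid procedure (steps ii to v), using NumPy only.
        import numpy as np

        rng = np.random.default_rng(0)
        # Synthetic two-class data: one blob around (0, 0) and one around (5, 5).
        X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

        k = 2
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # step ii: pick k examples

        for _ in range(100):
            # step iii: assign each point to its nearest centroid ([4707a])
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # step iv: recompute each centroid as the mean of its assigned points ([4707b])
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # step v: stop when the centroids no longer change significantly
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids

        print("final centroids:\n", centroids)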

The cost function for the k-means clustering algorithm is typically the sum of squared distances between each data point and its assigned cluster centroid:

                J = (1/m) Σ_{i=1}^{m} ||x(i) − μ_c(i)||²  --------------------------- [4707c]

where:

  • m is the number of data points.
  • x(i) is the i-th data point.
  • μ_c(i) is the centroid of the cluster to which x(i) is assigned.

The cost function measures the overall distortion or spread of the clusters. The goal of the k-means algorithm is to minimize this cost function by adjusting the cluster assignments and the centroids iteratively. The steps of the algorithm update the cluster assignments and the centroids so as to minimize this sum of squared distances. When the cost function reaches its lowest value (in practice it does not reach zero), the algorithm has converged.
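
Continuing the NumPy sketch above (X, labels and centroids are the variables from that sketch, which is itself an assumption for illustration), the cost in [4707c] can be evaluated as:

        # Cost of [4707c]: mean squared distance of each point to its assigned centroid.
        J = np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))
        print(f"cost J = {J:.4f}")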

The question of determining the appropriate number of clusters, often denoted as k, is one of the most frequently asked and challenging aspects of applying clustering algorithms, including k-means. The choice of k is crucial because it directly affects the structure and interpretation of the clusters.

Several methods are commonly used to determine the optimal k, and this is an active area of research in data clustering:

  1. Elbow Method: Plot the cost function (sum of squared distances, or inertia) against different values of k. The "elbow" in the plot, where the rate of decrease in the cost slows down, is often considered the optimal k. The elbow method is the most common method for choosing the optimal k (see the sketch after this list). Figure 4707g shows the elbow method for the optimal k: looking at the plot, the optimal k seems to be 4, because up to k = 4 there is a significant reduction in inertia, and after k = 4 the reduction is not as substantial.

    Figure 4707g. Elbow Method for optimal k for K-nearest neighbours.

  2. Silhouette Method: Evaluate the silhouette score for different values of k. The silhouette score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). A higher silhouette score indicates better-defined clusters.

  3. Gap Statistics: Compare the cost function of the clustering algorithm on the actual data with the cost function on a reference random dataset. The optimal k is where the gap between the two is the largest.

  4. Cross-Validation: Use cross-validation techniques to assess the performance of the clustering algorithm for different values of k and choose the k that yields the best performance.

  5. Domain Knowledge: In some cases, domain knowledge about the data may provide insights into a reasonable choice of k.
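
A sketch of the elbow and silhouette methods, assuming scikit-learn's KMeans and silhouette_score, matplotlib, and a synthetic blob data set (all assumptions made here for illustration):

        # Sketch: cost (inertia) and silhouette score against k.
        import matplotlib.pyplot as plt
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs
        from sklearn.metrics import silhouette_score

        X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # synthetic data with 4 blobs

        ks = range(2, 11)
        inertias, silhouettes = [], []
        for k in ks:
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
            inertias.append(km.inertia_)                       # cost used by the elbow method
            silhouettes.append(silhouette_score(X, km.labels_))

        plt.plot(ks, inertias, marker="o")
        plt.xlabel("k")
        plt.ylabel("cost (inertia)")
        plt.title("Elbow method")
        plt.show()

        best_k = ks[silhouettes.index(max(silhouettes))]
        print(f"k with the highest silhouette score: {best_k}")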

Choosing k is often a subjective decision and involves a trade-off between computational efficiency and the desire for meaningful, interpretable clusters. It is important to consider the characteristics of the data and the goals of the analysis when making this decision.

Table 4707. Applications of K-Nearest Neighbours (KNN algorithm).

        Applications                                                                         Details
        Expectation-Maximization (EM) algorithm working in Gaussian Mixture Models (GMMs)

===================================================

Machine learning: KNN algorithm: code:
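
A minimal, illustrative sketch of a KNN classifier (scikit-learn and the Iris data set are assumed here; the linked code may differ):

        # Illustrative KNN classification sketch.
        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier

        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

        knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
        print("predicted classes:", knn.predict(X_test[:5]))
        print("test accuracy:", round(knn.score(X_test, y_test), 3))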

===================================================

Machine learning: KNN algorithm (version 3 -- more functions are added): code:
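
A sketch with a few more functions added, e.g. feature scaling and a confusion matrix (again, scikit-learn and the Iris data set are assumptions; the linked version-3 code may differ):

        # Illustrative KNN sketch with feature scaling and a confusion matrix.
        from sklearn.datasets import load_iris
        from sklearn.metrics import classification_report, confusion_matrix
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

        # Scale the features before computing distances (KNN is scale sensitive).
        model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)
        print(confusion_matrix(y_test, y_pred))
        print(classification_report(y_test, y_pred))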

===================================================

Find the nearest neighbour (number) by using a k-dimensional tree: code:
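
A short sketch of a nearest-neighbour lookup with a k-dimensional (k-d) tree (SciPy's KDTree and the sample points are assumptions; the linked code may differ):

        # Sketch: nearest-neighbour query with a k-d tree.
        import numpy as np
        from scipy.spatial import KDTree

        points = np.array([[2.0, 3.0], [5.0, 4.0], [9.0, 6.0],
                           [4.0, 7.0], [8.0, 1.0], [7.0, 2.0]])
        tree = KDTree(points)                 # build the k-d tree once

        query = np.array([6.0, 3.0])
        distance, index = tree.query(query)   # nearest neighbour of the query point
        print(f"nearest point: {points[index]}, distance: {distance:.3f}")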


                


        

 

=================================================================================