ML/AI: k-nearest neighbor models from scratch

R, Python, and Julia all have nice kNN packages for this very simple type of classification model, but the kNN method is so simple that you are better off just hand-coding it.

The idea is that, in a dataset of points each with p features and a class label, points that are very similar in feature space should have the same class. So we create a metric on feature space to compare points. The most naive choice is simply Euclidean distance.

For each data point we find the distance to every other data point in feature space, we sort them, and we select the k closest points to create a neighborhood of our selected point. The class of the selected point is the majority class of its k nearest neighbors. We can do this for various values of k, and select the k for the model that gives the best agreement between neighborhood majority class and the actual class of each point in the training set.

The Julia code below is very simple and not rich in features, so it is easy to augment and enhance.


using FreqTables, DataFrames, DelimitedFiles

"""Euclidean metric """
function eucdist(a, b)
    # squared Euclidean distance -- the square root is unnecessary
    # since we only use this to rank neighbors
    return sum((a .- b) .^ 2)
end
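
As a quick sanity check with a couple of hand-picked toy vectors (not part of the dataset used below), note that the function returns the squared distance, which is all the neighbor ranking needs:

julia> eucdist([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
27.0

julia> eucdist([0.0, 0.0], [3.0, 4.0])
25.0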

""" The k row index of the nearest neighbors of point ptidx in data array x"""
function point_knn(x, ptidx, k)
    pts, features = size(x)
    # our point
    point = x[ptidx, :]
    dists = [eucdist(point, x[j, :]) for j in 1:pts]
    nns = sortperm(dists)        # indices ordered by distance from our point
    knns = nns[2:(k + 1)]        # skip the first entry: the point itself, at distance 0
    return knns
end

"""The predicted (majority vote) class of point ptidx for a given k"""
function classify(x, c, ptidx, k)
    ptknns = point_knn(x, ptidx, k)
    knnclasses = [c[j] for j in ptknns]
    freqs = sort(freqtable(knnclasses))   # neighbor class counts, sorted ascending
    class = last(names(freqs, 1))         # label with the highest count wins the vote
    return class
end
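
To see the two functions working together before moving to real data, here is a tiny made-up example: six two-feature points, three near the origin labeled class 1 and three near (10, 10) labeled class 2 (the numbers are invented purely for illustration):

toyx = [0.0 0.1; 0.2 0.0; 0.1 0.2; 10.0 10.1; 10.2 10.0; 9.9 10.2]
toyc = [1, 1, 1, 2, 2, 2]

point_knn(toyx, 1, 3)        # returns [3, 2, 4]; point 1's own row is excluded
classify(toyx, toyc, 1, 3)   # returns 1, the majority class among those neighbors

With only three points per cluster, the third neighbor already comes from the other class, but the majority vote still gets it right.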

I usually use the RDatasets package to get a dataset to test code on; in this case I googled for a dataset and found one on GitHub for the book “Machine Learning in Action” by Peter Harrington.

It is a 1000-line dataset with three features. The features are floats, and a fourth column is the classification, an integer 1, 2, or 3. I load this with DelimitedFiles into an array, create a feature matrix xt and a class vector ct, and run it through the code. I will just use the whole dataset as a training set.

data = open(readdlm, "datingTestSet2.txt")
xt = data[:, 1:3]
ct = data[:, 4]

results = DataFrame(k=[], accuracy=[])
for k in 1:20
    prediction = [classify(xt, ct, j, k) for j in 1:size(xt, 1)]
    acc = sum(prediction .== ct) / length(ct)
    #println(k, " ", acc)
    push!(results, [k, acc])
end
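
If we just want the single best k rather than scanning the printed table, it can be read straight off the results DataFrame; a minimal one-liner (argmax simply picks the first maximum if several k tie):

best = results[argmax(results.accuracy), :]   # DataFrameRow with the best k and its accuracy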

So which k values give the best agreement between predicted and actual classes?

julia> results
20×2 DataFrame
 Row │ k     accuracy 
     │ Any   Any      
─────┼────────────────
   1 │ 1.0   0.767
   2 │ 2.0   0.761
   3 │ 3.0   0.795
   4 │ 4.0   0.803
   5 │ 5.0   0.791
   6 │ 6.0   0.808
   7 │ 7.0   0.798
  ⋮  │  ⋮       ⋮
  15 │ 15.0  0.801
  16 │ 16.0  0.808
  17 │ 17.0  0.809
  18 │ 18.0  0.812
  19 │ 19.0  0.809
  20 │ 20.0  0.815
        7 rows omitted

Several do quite well, in fact. It might be interesting to use a different metric, or to apply PCA (principal component analysis) to the data to see whether some features are better class indicators than others. That’s a hint…
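
As a nudge in that direction: swapping the metric only means replacing eucdist. A Manhattan (city-block) distance, for example, is a one-line change (just a sketch of one possible alternative, with no claim it does better on this dataset):

"""Manhattan (city-block) distance -- a drop-in alternative to eucdist."""
function manhattandist(a, b)
    return sum(abs.(a .- b))
end

Point it at the data by swapping the eucdist call inside point_knn (or by passing the metric in as an extra argument), and rerun the k sweep.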
