AIML KNN

Learn how a simple k-nearest neighbour classification works.

Setup Steps: 

In [3]:
#@title Run this cell to complete the setup for this Notebook

from IPython import get_ipython
ipython = get_ipython()
  
notebook="M0W1_EXP_3_KNN_weight_sphericity" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/kNN_weight_sphericity_train.csv")
    ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/kNN_weight_sphericity_test.csv")
  
    print ("Setup completed successfully")
    return

def submit_notebook():
    
    ipython.magic("notebook -e "+ notebook + ".ipynb")
    
    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:        
        print(r["err"])
        return None        
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional, 
              "concepts" : Concepts, "record_id" : submission_id, 
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      print("Your submission is successful.")
      print("Ref Id:", submission_id)
      print("Date of submission: ", r["date"])
      print("Time of submission: ", r["time"])
      print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
      return submission_id
    else: return submission_id
    

def getAdditional():
  try:
    if Additional: return Additional      
    else: raise NameError('')
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None
  
def getConcepts():
  try:
    return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getAnswer():
  try:
    return Answer
  except NameError:
    print ("Please answer Question")
    return None

def getId():
  try: 
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup 
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
  
else:
  print ("Please complete Id and Password cells before running setup")
Setup completed successfully

In this experiment, we will use the fruits dataset that we explored earlier and learn how a simple k-nearest neighbour classification works.

Let us consider a simple situation. Given some data about a fruit, we want to label it automatically.

Fruits are characterized by

  • weight in grams as a float
  • sphericity as a float (a dimensionless measure of roundness; in this data it ranges from about 0.3 to 1.0)
  • label as a string
    • "A"  Apple
    • "O"  Orange

We are given some sample data, such as (200, 0.4, "A"), meaning that a fruit with a weight of 200 grams and a sphericity of 0.4 is an apple. A set of such training samples is given in "kNN_weight_sphericity_train.csv".

We are given a set of test data where only the weight and sphericity are given, e.g. (373, 1). We need to design a simple nearest neighbour classifier that predicts the fruit label, i.e. "A" or "O", meaning Apple or Orange.

We have several such samples, split into train and test sets. We will check how well our predictions hold up on the test data.

In [0]:
## Let us set up the file names
FRUITS_TRAIN = "kNN_weight_sphericity_train.csv"
FRUITS_TEST1 = "kNN_weight_sphericity_test.csv"
In [0]:
# Let us first read the data from the file and do a quick visualization
import pandas as pd
train = pd.read_csv(FRUITS_TRAIN)
train
Out[0]:
    Weight  Sphericity Label
0      200       0.400     A
1      320       1.000     O
2      310       0.950     O
3      220       0.450     A
4      230       0.500     A
5      245       0.400     A
6      240       0.350     A
7      330       0.980     O
8      315       0.900     O
9      243       0.530     A
10     333       0.920     O
11     255       0.600     A
12     259       0.420     A
13     253       0.510     A
14     272       0.700     O
15     260       0.750     O
16     267       0.480     A
17     265       0.600     A
18     266       0.680     O
19     266       0.700     A
20     260       0.300     A
21     220       0.330     A
22     326       0.980     O
23     313       0.910     O
24     210       0.390     A
25     232       0.440     A
26     236       0.600     A
27     270       0.333     A
28     330       0.890     O
29     233       0.410     A
30     269       0.670     O
31     255       0.690     O
32     261       0.390     A
33     271       0.550     A
34     269       0.710     O
35     250       0.590     A
36     270       0.370     A
37     266       0.690     A
38     210       0.420     A

We see that similar fruits lie close together in the (weight, sphericity) feature space. Now let us plot the training data, along with one unknown sample shown in black.

In [0]:
import matplotlib.pyplot as plt
In [0]:
apples=train[train['Label']=='A']
oranges=train[train['Label']=='O']
In [0]:
plt.plot(apples.Weight, apples.Sphericity, "ro")
plt.plot(oranges.Weight, oranges.Sphericity, "bo")
plt.xlabel("Weight -- in grams")
plt.ylabel("Sphericity -- r-o-y-g-b-p")
plt.legend(["Apples", "Oranges"])
plt.plot([373], [1], "ko")
plt.show()

From the visualization alone, we can infer that the unknown fruit is likely to be an Orange.

Instead of eyeballing each sample as above, the job now is to use a kNN classifier (with, say, k=3 and the Euclidean distance) to determine the correct label for every sample in the file "kNN_weight_sphericity_test.csv".

Let us first write a distance function to calculate the Euclidean distance between two fruits.

$distance = \sqrt{\sum_{i} (a_i - b_i)^2}$
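
For example, for the first two training samples (200, 0.40, "A") and (320, 1.00, "O"), the distance is $\sqrt{(200 - 320)^2 + (0.40 - 1.00)^2} = \sqrt{14400.36} \approx 120.0015$. Note how the weight term completely dominates because the two features live on very different scales; keep this in mind when we discuss poor choices (such as measuring weight in kg) in the summary.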

In [0]:
import math
def dist(a, b):
    ''' a is the n-dimensional co-ordinate of point 1
        b is the n-dimensional co-ordinate of point 2'''
    sqSum = 0
    for i in range(len(a)):
        sqSum += (a[i] - b[i]) ** 2
    return math.sqrt(sqSum)

Now let us write code to find the k nearest neighbours of a given fruit

In [0]:
def kNN(k, train, given):
    distances = []
    for t in train.values:
        # compute and store the distance of each training sample from the given sample
        distances.append((dist(t[:2], given), t[2]))
    distances.sort()
    # the first k entries are the k nearest neighbours of the given sample
    return distances[:k]
In [0]:
print(kNN(3, train, (373, 1)))
print(kNN(5, train, (373, 1)))
[(40.00007999992, 'O'), (43.00000465116254, 'O'), (43.00014069744424, 'O')]
[(40.00007999992, 'O'), (43.00000465116254, 'O'), (43.00014069744424, 'O'), (47.000004255318956, 'O'), (53.0, 'O')]
In [0]:
print(kNN(7,train, (250,1)))
print(kNN(5,train, (250,1)))
[(0.41000000000000003, 'A'), (3.0397532794620026, 'A'), (5.009600782497543, 'O'), (5.015974481593781, 'A'), (5.035871324805669, 'A'), (7.015760828306506, 'A'), (9.018669524935483, 'A')]
[(0.41000000000000003, 'A'), (3.0397532794620026, 'A'), (5.009600782497543, 'O'), (5.015974481593781, 'A'), (5.035871324805669, 'A')]

Try entering other samples, for example print(kNN(3, train, (250, 1))).

As you can see above, the 3 (and 5) nearest neighbours of the fruit with the characteristics (373, 1) are all Oranges, label "O", which matches what we saw visually when we plotted the point as a black spot in the chart. Of course, rather than reading the labels off by hand, we need a function that returns the majority label among the k nearest neighbours, so we write one below using collections.Counter, which we explored in an earlier experiment.

In [0]:
import collections
def kNNmax(k, train, given):
    # tally the labels of the k nearest neighbours and return the most common one
    tally = collections.Counter()
    for nn in kNN(k, train, given):
        tally.update(nn[-1])     # nn[-1] is the label of this neighbour
    return tally.most_common(1)[0]
print(kNNmax(5, train, (340, 1)))
print(kNNmax(7, train, (340, 1)))
('O', 5)
('O', 7)

Now let us load the test data, predict labels for all of the samples, and count how many predictions are correct.

In [0]:
testData = pd.read_csv(FRUITS_TEST1).values[:,:-1]
testResults = pd.read_csv(FRUITS_TEST1).values[:,-1]
results = []
for i, t in enumerate(testData):
    results.append(kNNmax(3, train, t)[0] == testResults[i])
print(results.count(True), "are correct")
18 are correct
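
Counting correct predictions is a start; turning that count into an accuracy makes it easier to compare different settings. A minimal sketch, assuming the testData, testResults and kNNmax defined above are available:

# Sketch: express the result as an accuracy instead of a raw count
predictions = [kNNmax(3, train, t)[0] for t in testData]
correct = sum(p == actual for p, actual in zip(predictions, testResults))
print("Accuracy:", correct, "/", len(testResults), "=", correct / len(testResults))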
In [0]:
#@title Linear classifiers learn a separating hyperplane from the training samples. Run the following cell to create a Linear Classifier and plot the output.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import MinMaxScaler

scaler=MinMaxScaler()

X2 = train.iloc[:, :-1].values
X=scaler.fit_transform(X2)


Y = le.fit_transform(train.iloc[:,-1].values)
h = .02  # step size in the mesh

#clf2 = Perceptron(max_iter=1000).fit(X, Y)
clf2 = linear_model.SGDClassifier(max_iter=1000, tol=1e-3)

clf2.fit(X,Y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
fig, ax = plt.subplots()
Z = clf2.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, cmap=plt.cm.Paired)
ax.axis('off')

# Plot also the training points
ax.scatter(X[:, 0], X[:, 1], c=Y)

ax.set_title('Linear Classifier')
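
The separating hyperplane that the SGD classifier has learned can also be inspected directly. A quick sketch, assuming the cell above has been run so that clf2, X and Y exist:

# Inspect the learned hyperplane w.x + b = 0 and the accuracy on the training data
print("Coefficients (w):", clf2.coef_)
print("Intercept   (b):", clf2.intercept_)
print("Training accuracy:", clf2.score(X, Y))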

Exercise 1 :: Repeat the above experiment with k=5 and k=7. Explain which k is better, and why.

In [0]:
## Your code here
print(kNN(5, train, (373, 1)))
print(kNN(7, train, (373, 1)))
[(40.00007999992, 'O'), (43.00000465116254, 'O'), (43.00014069744424, 'O'), (47.000004255318956, 'O'), (53.0, 'O')]
[(40.00007999992, 'O'), (43.00000465116254, 'O'), (43.00014069744424, 'O'), (47.000004255318956, 'O'), (53.0, 'O'), (58.000086206832485, 'O'), (60.00006749996203, 'O')]
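
The raw neighbour lists above are all oranges for this one query point, so they do not by themselves tell us which k is better. One reasonable way to decide is to compare test accuracy for each k, reusing kNNmax, testData and testResults from above (a sketch, not the only possible answer):

# Sketch: compare test accuracy for k = 3, 5 and 7
for k in (3, 5, 7):
    correct = sum(kNNmax(k, train, t)[0] == actual
                  for t, actual in zip(testData, testResults))
    print("k =", k, ":", correct, "of", len(testResults), "correct")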

Exercise 2 :: Repeat the above experiment with k=17. What do you think is happening?

In [0]:
## Your code here
print(kNN(17, train, (373, 1)))
[(40.00007999992, 'O'), (43.00000465116254, 'O'), (43.00014069744424, 'O'), (47.000004255318956, 'O'), (53.0, 'O'), (58.000086206832485, 'O'), (60.00006749996203, 'O'), (63.00001984126672, 'O'), (101.00044554357174, 'O'), (102.00099264222872, 'A'), (103.00192668100922, 'A'), (103.00215963269896, 'A'), (104.00040432613712, 'O'), (104.00052355637446, 'O'), (106.00127546402449, 'A'), (107.00042055992117, 'A'), (107.00044906447823, 'A')]
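
Notice that 6 of the 17 neighbours are now apples, and the distances jump from about 63 to over 100 partway through the list: with such a large k, far-away samples from the other class start creeping into the vote. The majority label is still "O", as a quick check with kNNmax (defined above) shows:

# Sketch: majority vote with k = 17; 11 oranges vs 6 apples among the neighbours
print(kNNmax(17, train, (373, 1)))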

Exercise 3 :: Modify the distance function to ignore the sphericity feature. Calculate the accuracy on the test dataset.

In [0]:
import math
def dist(a, b):
    ''' a is the n-dimensional co-ordinate of point 1
        b is the n-dimensional co-ordinate of point 2'''
    sqSum = 0
    # range(len(a) - 1) skips the last feature (sphericity), so only weight is used
    for i in range(len(a)-1):
        sqSum += (a[i] - b[i]) ** 2
    return math.sqrt(sqSum)
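
With the modified distance in place, the earlier accuracy loop can simply be re-run. A minimal sketch, reusing testData, testResults and kNNmax from above:

# Sketch: recompute test accuracy using the weight-only distance defined above
results = [kNNmax(3, train, t)[0] == actual for t, actual in zip(testData, testResults)]
print(results.count(True), "of", len(results), "are correct")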

Summary

In the above experiment, we find that

  • a simple nearest neighbour method can successfully predict labels from a small number of labelled examples. We also demonstrated a linear classifier that separates the two classes on the same data.
  • But we also see that the results can go badly wrong
    • if we make some poor choices (such as measuring weight in kg, or using a very large k).

This should remind you that practical expertise and experimental skill are just as important as theory as we move forward.
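
One common remedy for the scale problem mentioned above is to rescale the features before computing distances, for example with the MinMaxScaler already used for the linear classifier. A minimal sketch of the idea, assuming the train DataFrame and kNNmax from above are available:

# Sketch: scale both features to [0, 1] before running kNN, so that weight no longer dominates
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train = train.copy()
scaled_train[["Weight", "Sphericity"]] = scaler.fit_transform(train[["Weight", "Sphericity"]].values)

# A query point must be transformed with the same scaler before it is passed to kNN / kNNmax
scaled_query = scaler.transform([[373, 1.0]])[0]
print(kNNmax(3, scaled_train, scaled_query))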
