Understand AIML Decision Tree

 

The objective of this experiment is to understand the Decision Tree algorithm.

Decision Tree

As the name suggests, a decision tree is a tree-shaped model that assists in decision-making. It is a basic and important predictive learning algorithm, used for both classification and regression.

1. It differs from many other algorithms in that it works intuitively, i.e., it takes decisions one at a time.
2. It is non-parametric, and it is fast and efficient.

It consists of nodes that have parent-child relationships.
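
To make the parent-child structure and the one-decision-at-a-time behaviour concrete, here is a minimal sketch. The Node class, the predict function and the toy tree are hypothetical and hand-built for illustration; they are not learned from the data and are not how scikit-learn stores its trees.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    """Internal node: tests one binary feature and routes to a child."""
    feature: str
    if_zero: Union["Node", str]   # child followed when the feature is 0
    if_one: Union["Node", str]    # child followed when the feature is 1


def predict(node, sample):
    """Walk from the root to a leaf, making one decision per node."""
    while isinstance(node, Node):
        node = node.if_one if sample[node.feature] == 1 else node.if_zero
    return node  # a leaf is simply a class label in this sketch


# Hand-built, hypothetical tree (an assumption for illustration only).
toy_tree = Node(
    feature="milk",
    if_one="Mammal",
    if_zero=Node(feature="feathers", if_one="Bird", if_zero="Other"),
)

print(predict(toy_tree, {"milk": 1, "feathers": 0}))  # Mammal
print(predict(toy_tree, {"milk": 0, "feathers": 1}))  # Bird
```

Each internal node is the parent of two children, and a prediction is simply the path of decisions from the root to a leaf.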

The core algorithm for building decision trees is ID3, by J. R. Quinlan, which employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree. In the ZeroR model there is no predictor; in the OneR model we try to find the single best predictor; naive Bayes includes all predictors, using Bayes' rule and the assumption of independence between predictors; a decision tree includes all predictors and allows for dependence between them.
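
The two quantities ID3 relies on can be sketched in a few lines. This is an illustrative sketch only: the function names and the toy data are made up, and it covers just the scoring of a single candidate split, not the full tree-building procedure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on a feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# Made-up toy data: 1 = mammal, 2 = bird
labels   = [1, 1, 1, 2, 2, 2]
milk     = [1, 1, 1, 0, 0, 0]   # separates the two classes perfectly
predator = [1, 0, 1, 1, 0, 1]   # same mix of classes on both sides of the split

print(round(information_gain(milk, labels), 3))      # 1.0
print(round(information_gain(predator, labels), 3))  # 0.0
```

A greedy, top-down builder would pick the feature with the highest gain (here "milk"), split on it, and repeat on each subset without revisiting earlier choices.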

In this experiment we will be using the Zoo dataset. The "type" attribute is the class attribute; the code in this experiment loads it as the "class" column. Here is a breakdown of which animals are in which type:

1 -- Mammals (41) aardvark, antelope, bear, boar, buffalo, calf, cavy, cheetah, deer, dolphin, elephant, fruitbat, giraffe, girl, goat, gorilla, hamster, hare, leopard, lion, lynx, mink, mole, mongoose, opossum, oryx, platypus, polecat, pony, porpoise, puma, pussycat, raccoon, reindeer, seal, sealion, squirrel, vampire, vole, wallaby, wolf

2 -- Birds (20) chicken, crow, dove, duck, flamingo, gull, hawk, kiwi, lark, ostrich, parakeet, penguin, pheasant, rhea, skimmer, skua, sparrow, swan, vulture, wren

3 -- Reptiles (5) pitviper, seasnake, slowworm, tortoise, tuatara

4 -- Aquatic (13) bass, carp, catfish, chub, dogfish, haddock, herring, pike, piranha, seahorse, sole, stingray, tuna

5 -- Amphibians (4) frog, frog, newt, toad

6 -- Insects (8) flea, gnat, honeybee, housefly, ladybird, moth, termite, wasp

7 -- Arthropods (10) clam, crab, crayfish, lobster, octopus, scorpion, seawasp, slug, starfish, worm
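
The class sizes above are quite uneven, which matters when judging an accuracy figure. As a rough reference point, the sketch below computes what a ZeroR-style classifier (always predict the most frequent class) would score on the whole dataset. The counts are copied from the breakdown above; the variable names are only illustrative.

```python
# Class counts copied from the breakdown above.
class_counts = {1: 41, 2: 20, 3: 5, 4: 13, 5: 4, 6: 8, 7: 10}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

print(total)                        # 101
print(majority_class)               # 1
print(round(baseline_accuracy, 3))  # 0.406
```

Any trained classifier should beat this baseline comfortably to be considered useful.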

Keywords

  • Numpy
  • Pandas
  • ID3 Algorithm
  • Train, Test Split

In [6]:
#@title Run this cell to complete the setup for this Notebook

from IPython import get_ipython
ipython = get_ipython()
  
notebook="M0W2_EXP_1_Decision_Tree_Zoo" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/Zoo_New.csv")
    ipython.magic("sx apt-get install graphviz")
    ipython.magic("sx pip install graphviz")
    print ("Setup completed successfully")
    return

def submit_notebook():
    
    ipython.magic("notebook -e "+ notebook + ".ipynb")
    
    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:        
        print(r["err"])
        return None        
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional, 
              "concepts" : Concepts, "record_id" : submission_id, 
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      print("Your submission is successful.")
      print("Ref Id:", submission_id)
      print("Date of submission: ", r["date"])
      print("Time of submission: ", r["time"])
      print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
      return submission_id
    else: return submission_id
    

def getAdditional():
  try:
    if Additional: return Additional      
    else: raise NameError('')
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None
  
def getConcepts():
  try:
    return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getAnswer():
  try:
    return Answer
  except NameError:
    print ("Please answer Question")
    return None

def getId():
  try: 
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup 
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
  
else:
  print ("Please complete Id and Password cells before running setup")
Setup completed successfully

Importing Required Packages

In [0]:
import pandas as pd
import numpy as np
import graphviz
from sklearn.tree import export_graphviz

Loading Dataset

In [0]:
# Load all columns; the first column holds the names of the animals
dataset = pd.read_csv('Zoo_New.csv',
                      names=['animal_name', 'hair', 'feathers', 'eggs', 'milk',
                             'airbone', 'aquatic', 'predator', 'toothed', 'backbone',
                             'breathes', 'venomous', 'fins', 'legs', 'tail', 'domestic',
                             'catsize', 'class'])
# The animal name is not used for classification: it is only a string label
# and provides no extra information for classification in this context
dataset = dataset.drop('animal_name', axis=1)
In [29]:
dataset.head()
Out[29]:
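   hair  feathers  eggs  milk  airbone  aquatic  predator  toothed  backbone  breathes  venomous  fins  legs  tail  domestic  catsize  class
0     1         0     0     1        0        0         1        1         1         1         0     0     4     0         0        1      1
1     1         0     0     1        0        0         0        1         1         1         0     0     4     1         0        1      1
2     0         0     1     0        0        1         1        1         1         0         0     1     0     1         0        0      4
3     1         0     0     1        0        0         1        1         1         1         0     0     4     0         0        1      1
4     1         0     0     1        0        0         1        1         1         1         0     0     4     1         0        1      1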
In [9]:
np.unique(dataset['class'].values)
Out[9]:
array([1, 2, 3, 4, 5, 6, 7])
In [10]:
dataset['class']
Out[10]:
0      1
1      1
2      4
3      1
4      1
5      1
6      1
7      4
8      4
9      1
10     1
11     2
12     4
13     7
14     7
15     7
16     2
17     1
18     4
19     1
20     2
21     2
22     1
23     2
24     6
25     5
26     5
27     1
28     1
29     1
      ..
71     2
72     7
73     4
74     1
75     1
76     3
77     7
78     2
79     2
80     3
81     7
82     4
83     2
84     1
85     7
86     4
87     2
88     6
89     5
90     3
91     3
92     4
93     1
94     1
95     2
96     1
97     6
98     1
99     7
100    2
Name: class, Length: 101, dtype: int64
In [11]:
dataset.shape
Out[11]:
(101, 17)

Splitting the dataset into train and test sets

In [0]:
def train_test_split(dataset):
    # Reset the index of each part so that the row labels start from 0;
    # this avoids errors related to row labels / indexes later on
    training_data = dataset.iloc[:80].reset_index(drop=True)
    testing_data = dataset.iloc[80:].reset_index(drop=True)
    return training_data, testing_data

# First 80 rows for training, remaining 21 rows for testing
training_data, testing_data = train_test_split(dataset)
In [0]:
training_data = training_data.values
In [0]:
testing_data = testing_data.values
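
The split above takes the first 80 rows in file order. If the file happens to be ordered in some way, the test rows may not be representative, so shuffling before splitting is a common alternative. The following is a minimal sketch of that idea in plain Python; shuffled_split is a hypothetical helper, and the list of integers merely stands in for the 101 rows of the dataset.

```python
import random


def shuffled_split(rows, train_fraction=0.8, seed=0):
    """Hypothetical helper: shuffle the row indices, then cut at train_fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    cut = int(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


rows = list(range(101))          # stand-in for the 101 rows of the dataset
train, test = shuffled_split(rows)
print(len(train), len(test))     # 80 21
```

The sizes match the 80/21 split used in this experiment; only the choice of which rows land in each part changes.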

Predict the class of test data

In [0]:
from sklearn import tree
In [0]:
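# scikit-learn builds CART-style trees and uses Gini impurity by default; pass criterion="entropy" for entropy-based splits
clf = tree.DecisionTreeClassifier()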
In [0]:
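# [:, :15] selects columns 0-14 (hair ... domestic) as features, leaving out catsize (column 15); column 16 is the class
clf = clf.fit(training_data[:,:15],training_data[:,16])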
In [0]:
pred = clf.predict(testing_data[:,:15])
In [0]:
from sklearn.metrics import accuracy_score
In [20]:
accuracy_score(testing_data[:,16], pred)
Out[20]:
0.7619047619047619
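
Accuracy is the fraction of test rows whose predicted class equals the true class. With 21 test rows, the score above is consistent with 16 correct predictions (16/21 is roughly 0.7619). The sketch below shows the calculation on made-up labels chosen only to reproduce that count; they are not the actual predictions of the model.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 21 test rows, the first 5 predicted wrongly.
y_true = [1] * 21
y_pred = [2] * 5 + [1] * 16

print(round(accuracy(y_true, y_pred), 4))  # 0.7619
```

Because the test set is so small, a single extra mistake changes the score by almost five percentage points.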
In [21]:
import os
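save_dot = "output.dot"
save_png = "output.png"
# Write the fitted tree to a .dot file; the feature names are the 15 columns used for training
export_graphviz(clf, out_file=save_dot, filled=True, feature_names=list(dataset.columns)[:-2])
# Render the .dot file to a PNG image with the Graphviz command-line tool
os.system("dot -T png -o " + save_png + " " + save_dot)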
Out[21]:
0
In [22]:
import matplotlib.pyplot as plt
plt.figure(figsize=(20,20))
plt.grid(False)
plt.imshow(plt.imread(save_png))
plt.show()

Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability is the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
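
As a rough illustration of the calculation just described, here is a sketch on a small tree described by hand. The tree, its sample counts and its Gini impurities are hypothetical (10 samples: 4 mammals, 3 birds, 3 others) and are not taken from the tree fitted in this experiment.

```python
# Hypothetical two-split tree, described by hand (not the fitted tree).
# Each entry: (feature, samples at node, impurity at node,
#              (samples, impurity) of left child, (samples, impurity) of right child)
splits = [
    ("milk",     10, 0.66, (4, 0.0), (6, 0.5)),
    ("feathers",  6, 0.50, (3, 0.0), (3, 0.0)),
]
total_samples = 10

importance = {}
for feature, n, imp, (n_l, imp_l), (n_r, imp_r) in splits:
    # Impurity at the node weighted by the share of samples reaching it,
    # minus the same quantity for each of its two children.
    decrease = (n * imp - n_l * imp_l - n_r * imp_r) / total_samples
    importance[feature] = importance.get(feature, 0.0) + decrease

for feature, value in importance.items():
    print(feature, round(value, 3))
# milk 0.36
# feathers 0.3
```

These are unnormalized values, in the same spirit as the normalize=False setting used in the chart code of this experiment. Here they add up to the root impurity (0.66) because every leaf of the toy tree is pure.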

In [23]:
def feature_importance_chart(clf, classifier_name, feature_names):
    sorted_feature_importances, sorted_feature_names = (
        zip(*sorted(zip(clf.tree_.compute_feature_importances(normalize=False), feature_names)))
    )
    plt.figure(figsize=(16, 9))
    plt.barh(range(len(sorted_feature_importances)), sorted_feature_importances)
    plt.yticks(
        range(len(sorted_feature_importances)),
        ["{}: {:.3}".format(a, b) for a, b in zip(sorted_feature_names, sorted_feature_importances)]
    )
    plt.title("Feature importance for the tree")
    plt.show()

# Pass only the 15 feature columns the tree was trained on (drops catsize and class)
feature_importance_chart(clf, "simple tree", list(dataset.columns)[:-2])

Exercise 1

Change the train and test split ratio and observe the change in accuracy

In [0]:
#### Your code here
def train_test_split(dataset):
    # Reset the index of each part so that the row labels start from 0;
    # this avoids errors related to row labels / indexes later on
    training_data = dataset.iloc[:50].reset_index(drop=True)
    testing_data = dataset.iloc[50:].reset_index(drop=True)
    return training_data, testing_data

# First 50 rows for training, remaining 51 rows for testing
training_data, testing_data = train_test_split(dataset)
In [0]:
training_data = training_data.values
testing_data = testing_data.values
In [26]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
# As before, [:, :15] uses columns 0-14 as features (catsize is left out); column 16 is the class
clf = clf.fit(training_data[:,:15],training_data[:,16])
pred = clf.predict(testing_data[:,:15])
from sklearn.metrics import accuracy_score
accuracy_score(testing_data[:,16], pred)
Out[26]:
0.803921568627451
In [27]:
import os
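save_dot = "output.dot"
save_png = "output.png"
# Write the fitted tree to a .dot file; the feature names are the 15 columns used for training
export_graphviz(clf, out_file=save_dot, filled=True, feature_names=list(dataset.columns)[:-2])
# Render the .dot file to a PNG image with the Graphviz command-line tool
os.system("dot -T png -o " + save_png + " " + save_dot)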
Out[27]:
0
In [28]:
import matplotlib.pyplot as plt
plt.figure(figsize=(20,20))
plt.grid(False)
plt.imshow(plt.imread(save_png))
plt.show()


