The objective of this experiment is to understand the Decision Tree algorithm.

Decision Tree

As the name suggests, a decision tree is a tree-shaped model that assists in decision-making. It is used for both classification and regression, and it is a basic but important predictive learning algorithm.

1. It differs from many other models in that it works intuitively, i.e., it takes decisions one by one.
2. It is non-parametric: it makes no assumption about the distribution of the data, and it is fast and efficient.

A decision tree consists of nodes linked by parent-child relationships: each internal node tests an attribute, and each leaf holds a prediction.
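The parent-child structure can be pictured with a tiny hand-written tree. This is only a sketch: the `Node` class and the `decide` function are hypothetical helpers, and the splits on `milk` and `feathers` are chosen by hand for illustration, not learned from data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical helper for illustration)."""
    feature: Optional[str] = None   # attribute tested at an internal node
    no: Optional["Node"] = None     # child followed when the attribute is 0
    yes: Optional["Node"] = None    # child followed when the attribute is 1
    label: Optional[str] = None     # prediction stored at a leaf


def decide(node: Node, sample: dict) -> str:
    """Walk from the root to a leaf, taking one decision per level."""
    while node.label is None:
        node = node.yes if sample[node.feature] == 1 else node.no
    return node.label


# Hand-built tree: the root is the parent of two children, and so on.
root = Node(
    feature="milk",
    yes=Node(label="mammal"),
    no=Node(feature="feathers", yes=Node(label="bird"), no=Node(label="other")),
)

print(decide(root, {"milk": 0, "feathers": 1}))  # bird
```

Each call to `decide` follows exactly one path from the root to a leaf, which is the "one decision at a time" behaviour described above.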

The core algorithm for building decision trees is ID3, by J. R. Quinlan, which employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree.

For comparison: in the ZeroR model there is no predictor; in the OneR model we try to find the single best predictor; the naive Bayesian model includes all predictors, using Bayes' rule and the assumption of independence between predictors; a decision tree includes all predictors, with dependence assumptions between predictors.
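The two quantities ID3 relies on can be sketched in a few lines of NumPy. This is a minimal illustration, not Quinlan's implementation: the function names `entropy` and `information_gain` and the toy arrays are assumptions made up for this example. Note also that scikit-learn's `DecisionTreeClassifier`, used later in this experiment, defaults to the Gini criterion rather than entropy.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(feature, labels):
    """Entropy of `labels` minus the weighted entropy after splitting on `feature`."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    remainder = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # Weight each subset's entropy by the fraction of samples it contains.
        remainder += mask.mean() * entropy(labels[mask])
    return entropy(labels) - remainder


# Toy data (made up for illustration): `milk` separates the classes perfectly,
# `tail` carries no information about them.
labels = np.array([1, 1, 2, 2])
milk = np.array([1, 1, 0, 0])
tail = np.array([1, 0, 1, 0])

print(entropy(labels))                 # 1.0
print(information_gain(milk, labels))  # 1.0
print(information_gain(tail, labels))  # 0.0
```

A greedy, top-down builder would pick `milk` for the root here because it has the highest information gain, then repeat the same choice on each child subset without ever revisiting an earlier split.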

In this experiment we will be using the Zoo dataset. The "type" attribute is the class attribute (it is loaded as the "class" column in the code). Here is a breakdown of which animals are in which type:

1 -- Mammals (41) aardvark, antelope, bear, boar, buffalo, calf, cavy, cheetah, deer, dolphin, elephant, fruitbat, giraffe, girl, goat, gorilla, hamster, hare, leopard, lion, lynx, mink, mole, mongoose, opossum, oryx, platypus, polecat, pony, porpoise, puma, pussycat, raccoon, reindeer, seal, sealion, squirrel, vampire, vole, wallaby, wolf

2 -- Birds (20) chicken, crow, dove, duck, flamingo, gull, hawk, kiwi, lark, ostrich, parakeet, penguin, pheasant, rhea, skimmer, skua, sparrow, swan, vulture, wren

3 -- Reptiles (5) pitviper, seasnake, slowworm, tortoise, tuatara

4 -- Fish (13) bass, carp, catfish, chub, dogfish, haddock, herring, pike, piranha, seahorse, sole, stingray, tuna

5 -- Amphibians (4) frog, frog, newt, toad

6 -- Insects (8) flea, gnat, honeybee, housefly, ladybird, moth, termite, wasp

7 -- Invertebrates (10) clam, crab, crayfish, lobster, octopus, scorpion, seawasp, slug, starfish, worm
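The counts in the breakdown show that the types are far from balanced. As a rough point of reference, the sketch below computes what a ZeroR-style baseline (always predicting the most frequent type) would score over the whole dataset. The variable names are made up for this example, and the figure is for all 101 animals, not for the train/test split used later.

```python
# Number of animals per type, copied from the breakdown above.
type_counts = {1: 41, 2: 20, 3: 5, 4: 13, 5: 4, 6: 8, 7: 10}

total = sum(type_counts.values())
majority_type = max(type_counts, key=type_counts.get)
zero_r_accuracy = type_counts[majority_type] / total

print(total)                      # 101
print(majority_type)              # 1
print(round(zero_r_accuracy, 3))  # 0.406
```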

Keywords

Numpy
Pandas
ID3 Algorithm
Train, Test Split

In [6]:

#@title Run this cell to complete the setup for this Notebook

from IPython import get_ipython
ipython = get_ipython()
  
notebook="M0W2_EXP_1_Decision_Tree_Zoo" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/Zoo_New.csv")
    ipython.magic("sx apt-get install graphviz")
    ipython.magic("sx pip install graphviz")
    print ("Setup completed successfully")
    return

def submit_notebook():
    
    ipython.magic("notebook -e "+ notebook + ".ipynb")
    
    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:        
        print(r["err"])
        return None        
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional, 
              "concepts" : Concepts, "record_id" : submission_id, 
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      print("Your submission is successful.")
      print("Ref Id:", submission_id)
      print("Date of submission: ", r["date"])
      print("Time of submission: ", r["time"])
      print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
      return submission_id
    else: return submission_id
    

def getAdditional():
  try:
    if Additional: return Additional      
    else: raise NameError('')
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None
  
def getConcepts():
  try:
    return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getAnswer():
  try:
    return Answer
  except NameError:
    print ("Please answer Question")
    return None

def getId():
  try: 
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup 
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
  
else:
  print ("Please complete Id and Password cells before running setup")

Setup completed successfully

Importing Required Packages

In [0]:

import pandas as pd
import numpy as np
import graphviz
from sklearn.tree import export_graphviz

Loading Dataset

In [0]:

# Read all columns; the first one holds the names of the animals
dataset = pd.read_csv('Zoo_New.csv',
                      names=['animal_name', 'hair', 'feathers', 'eggs', 'milk',
                             'airbone', 'aquatic', 'predator', 'toothed', 'backbone',
                             'breathes', 'venomous', 'fins', 'legs', 'tail',
                             'domestic', 'catsize', 'class'])
# We don't use the animal name for classification because it is just a string
# identifying the animal and provides no extra information in this context
dataset = dataset.drop('animal_name', axis=1)

In [29]:

dataset.head()

Out[29]:

	hair	eggs	milk	aquatic	predator	toothed	backbone	breathes	fins	legs	tail	catsize	class
0	1	0	1	0	1	1	1	1	0	4	0	1	1
1	1	0	1	0	0	1	1	1	0	4	1	1	1
2	0	1	0	1	1	1	1	0	1	0	1	0	4
3	1	0	1	0	1	1	1	1	0	4	0	1	1
4	1	0	1	0	1	1	1	1	0	4	1	1	1

In [9]:

np.unique(dataset['class'].values)

Out[9]:

array([1, 2, 3, 4, 5, 6, 7])

In [10]:

dataset['class']

Out[10]:

0      1
1      1
2      4
3      1
4      1
5      1
6      1
7      4
8      4
9      1
10     1
11     2
12     4
13     7
14     7
15     7
16     2
17     1
18     4
19     1
20     2
21     2
22     1
23     2
24     6
25     5
26     5
27     1
28     1
29     1
      ..
71     2
72     7
73     4
74     1
75     1
76     3
77     7
78     2
79     2
80     3
81     7
82     4
83     2
84     1
85     7
86     4
87     2
88     6
89     5
90     3
91     3
92     4
93     1
94     1
95     2
96     1
97     6
98     1
99     7
100    2
Name: class, Length: 101, dtype: int64

In [11]:

dataset.shape

Out[11]:

(101, 17)

Splitting the datasets into train and test

In [0]:

def train_test_split(dataset):
    # Sequential split: the first 80 rows are used for training, the rest for testing.
    # reset_index(drop=True) relabels the rows starting from 0, because we do not
    # want to run into errors regarding the row labels / indexes
    training_data = dataset.iloc[:80].reset_index(drop=True)
    testing_data = dataset.iloc[80:].reset_index(drop=True)
    return training_data, testing_data

training_data, testing_data = train_test_split(dataset)
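The split used in this experiment simply cuts the rows in file order, so its result depends on how the file happens to be ordered. A common alternative is a shuffled, stratified split. The sketch below uses scikit-learn's own `train_test_split`, imported under the alias `sk_train_test_split` so that it does not clash with the function defined in this notebook. The `toy` frame is made-up stand-in data, not the Zoo dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split as sk_train_test_split

# Stand-in frame (made up): 20 rows, two binary features and a 'class' column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "hair": rng.integers(0, 2, size=20),
    "eggs": rng.integers(0, 2, size=20),
    "class": [1] * 10 + [2] * 10,  # sorted by class: the worst case for a sequential split
})

train_df, test_df = sk_train_test_split(
    toy,
    test_size=0.2,          # roughly the 80/21 proportion used in this experiment
    random_state=0,         # fixed seed so the split is reproducible
    stratify=toy["class"],  # keep class proportions similar in both parts
)

print(len(train_df), len(test_df))  # 16 4
print(sorted(test_df["class"].unique()))
```

With a sequential cut of the same toy frame, the last four rows would all belong to class 2, so the test set would say nothing about class 1.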

In [0]:

training_data = training_data.values

In [0]:

testing_data = testing_data.values

Predict the class of test data

In [0]:

from sklearn import tree

In [0]:

clf = tree.DecisionTreeClassifier()

In [0]:

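# Columns 0-14 (hair ... domestic) are used as features and column 16 is the class label.
# Note that the slice [:, :15] leaves out column 15 (catsize).
clf = clf.fit(training_data[:,:15],training_data[:,16])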

In [0]:

pred = clf.predict(testing_data[:,:15])

In [0]:

from sklearn.metrics import accuracy_score

In [20]:

accuracy_score(testing_data[:,16], pred)

Out[20]:

0.7619047619047619
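The score can be read as a plain fraction. With 101 rows and the first 80 used for training, 21 rows are left for testing, and 16 correct predictions out of 21 is the only count that gives this value. The sketch below reproduces the arithmetic; the `y_true` and `y_pred` arrays are made up to show how the score is formed and are not the actual predictions.

```python
import numpy as np

n_test = 101 - 80          # rows 80..100 form the test set
n_correct = 16             # inferred from the score, not read from the predictions
print(n_test)              # 21
print(n_correct / n_test)  # 0.7619047619047619

# Accuracy is the mean of element-wise matches; a made-up 21-label example:
y_true = np.array([1] * 21)
y_pred = np.array([1] * 16 + [2] * 5)
print(np.mean(y_true == y_pred))
```

With only 21 test animals, each single prediction moves the accuracy by almost 5 percentage points, so small differences between runs should be read with care.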

In [21]:

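import os
save_dot = "output.dot"
save_png = "output.png"
# When out_file is given, export_graphviz writes the DOT file itself and returns None,
# so the call does not need to be wrapped in graphviz.Source
export_graphviz(clf, out_file=save_dot, filled=True, feature_names=list(dataset.columns)[:-2])
# Convert the DOT file to a PNG image with the Graphviz command-line tool
os.system("dot -T png -o " + save_png + " " + save_dot)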

Out[21]:

In [22]:

import matplotlib.pyplot as plt
plt.figure(figsize=(20,20))
plt.grid(False)
plt.imshow(plt.imread(save_png))
plt.show()
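If the Graphviz `dot` binary is not available, the learned rules can also be inspected as plain text with scikit-learn's `export_text`. The sketch below trains on a tiny made-up dataset; `toy_clf` and the feature names are assumptions for this example, not the classifier trained above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up training set: two binary features, two classes.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 2, 2]
toy_clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text returns the learned rules as an indented string.
rules = export_text(toy_clf, feature_names=["milk", "eggs"])
print(rules)
```

For the Zoo classifier, the same call would take `clf` and the 15 feature names passed to `export_graphviz`.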

Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability is the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
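The definition above can be checked by hand against a fitted tree. The sketch below walks the nodes of a scikit-learn tree, accumulates the probability-weighted impurity decrease per feature, and compares the normalized result with `feature_importances_`. It uses the Iris dataset as stand-in data, and `toy_clf` is a name made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data, not the Zoo dataset
toy_clf = DecisionTreeClassifier(random_state=0).fit(X, y)
t = toy_clf.tree_

n_total = t.weighted_n_node_samples[0]
importances = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: no split, so no impurity decrease
        continue
    # Impurity decrease of this split, each term weighted by node probability.
    decrease = (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    ) / n_total
    importances[t.feature[node]] += decrease

normalized = importances / importances.sum()
print(np.allclose(normalized, toy_clf.feature_importances_))
```

The `importances` array before normalization corresponds to the unnormalized values plotted in the chart that follows.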

In [23]:

def feature_importance_chart(clf, classifier_name, feature_names):
    # Unnormalized importance of each feature the tree was trained on
    importances = clf.tree_.compute_feature_importances(normalize=False)
    # Sort (importance, name) pairs so that the most important feature is plotted on top
    sorted_feature_importances, sorted_feature_names = (
        zip(*sorted(zip(importances, feature_names)))
    )
    plt.figure(figsize=(16, 9))
    plt.barh(range(len(sorted_feature_importances)), sorted_feature_importances)
    plt.yticks(
        range(len(sorted_feature_importances)),
        ["{}: {:.3}".format(a, b) for a, b in zip(sorted_feature_names, sorted_feature_importances)]
    )
    plt.title("Feature importance for the {}".format(classifier_name))
    plt.show()

# Pass only the 15 feature names the classifier was trained on (drop catsize and class)
feature_importance_chart(clf, "simple tree", list(dataset.columns)[:-2])

Exercise 1

Change the train and test split ratio and observe the change in accuracy

In [0]:

#### Your code here
def train_test_split(dataset):
    # Use the first 50 rows for training and the remaining 51 rows for testing.
    # reset_index(drop=True) relabels the rows starting from 0, because we do not
    # want to run into errors regarding the row labels / indexes
    training_data = dataset.iloc[:50].reset_index(drop=True)
    testing_data = dataset.iloc[50:].reset_index(drop=True)
    return training_data, testing_data

training_data, testing_data = train_test_split(dataset)

In [0]:

training_data = training_data.values
testing_data = testing_data.values

In [26]:

from sklearn import tree
from sklearn.metrics import accuracy_score

clf = tree.DecisionTreeClassifier()
clf = clf.fit(training_data[:,:15],training_data[:,16])
pred = clf.predict(testing_data[:,:15])
accuracy_score(testing_data[:,16], pred)

Out[26]:

0.803921568627451
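To observe the effect of the split ratio more systematically, the same steps can be wrapped in a function and run for several split points. This is a sketch under stated assumptions: `accuracy_at_split` is a hypothetical helper that treats every column except the last as a feature, and the `data` array is made-up stand-in data, so the printed numbers say nothing about the Zoo dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def accuracy_at_split(data, n_train, random_state=0):
    """Train on the first n_train rows, test on the rest; last column is the label."""
    train, test = data[:n_train], data[n_train:]
    clf = DecisionTreeClassifier(random_state=random_state)
    clf.fit(train[:, :-1], train[:, -1])
    return accuracy_score(test[:, -1], clf.predict(test[:, :-1]))


# Made-up stand-in for the Zoo array: 101 rows, 4 binary features, and a label
# that simply copies the first feature.
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(101, 4))
data = np.column_stack([features, features[:, 0]])

for n_train in (50, 60, 70, 80, 90):
    print(n_train, accuracy_at_split(data, n_train))
```

Fixing `random_state` keeps the tree construction reproducible, so that differences between runs come from the split point alone.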

In [27]:

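import os
save_dot = "output.dot"
save_png = "output.png"
# When out_file is given, export_graphviz writes the DOT file itself and returns None,
# so the call does not need to be wrapped in graphviz.Source
export_graphviz(clf, out_file=save_dot, filled=True, feature_names=list(dataset.columns)[:-2])
# Convert the DOT file to a PNG image with the Graphviz command-line tool
os.system("dot -T png -o " + save_png + " " + save_dot)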

Out[27]:

In [28]:

import matplotlib.pyplot as plt
plt.figure(figsize=(20,20))
plt.grid(False)
plt.imshow(plt.imread(save_png))
plt.show()
