ML/AI: Feature selection in decision trees (information entropy)

How are features selected or prioritized in decision tree learning models?
In two previous posts we examined decision tree and random forest models of
the survivability of Titanic passengers, a fairly canonical ML exercise. The
data set consists of one line per passenger category: Class (first, second,
third, crew), Sex (M, F), Age (Adult, Child), and whether or not they
survived (Y, N). A fifth column gives the frequency of occurrence of each
category.

Which of these features is the most significant decider of the final
classification, survivability?

We recast each of the features as a Bernoulli random variable with
outcomes 1,0 to facilitate the calculations, for example Class
\rightarrow 3^{rd} class =1 for yes (in third class), 0 for no;
Sex \rightarrow \mbox{Female}=1 for yes, 0 for Male; and Age
\rightarrow \mbox{Child}=1 for yes, 0 for Adult.
The original data file had 32 categories (4\times 2\times 2\times 2) and a
fifth column giving the number of passengers in each, which we unfold
into one line per passenger. We can then shuffle the lines to facilitate a
random drawing of a passenger, and to divide the dataset into training and
testing batches.

using RDatasets,DataFrames

# load the 32-category Titanic table
titanic=dataset("datasets","Titanic")

# map each feature to a Bernoulli 0/1 variable
classmap2=function(str)
    if str=="3rd"
        1
    else
        0
    end
end

sexmap=function(str)
    if str=="Female"
        1
    else
        0
    end
end

agemap=function(str)
    if str=="Child"
        1
    else
        0
    end
end

survivemap=function(str)
    if str=="Yes"
        1
    else
        0
    end
end

titanic2=titanic
# drop the categories with zero passengers
titanic2=filter(row -> !(row.Freq == 0), titanic2)
titanic2=transform(titanic2, :Class => ByRow(x -> classmap2(x)) => :Class)
titanic2=transform(titanic2, :Sex => ByRow(x -> sexmap(x)) => :Sex)
titanic2=transform(titanic2, :Age => ByRow(x -> agemap(x)) => :Age)
titanic2=transform(titanic2, :Survived => ByRow(x -> survivemap(x)) => :Survived)

# unfold: repeat each category Freq times to get one row per passenger
titanic3=DataFrame()
for j in 1:nrow(titanic2)
    for i in 1:titanic2.Freq[j]
        push!(titanic3,titanic2[j,:])
    end
end
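
As a quick sanity check, the unfolded table should contain one row per passenger, and the shuffle mentioned above can be done by permuting the row indices (a minimal sketch, assuming the code above has been run):

using Random
# one row per passenger: the row count must equal the total frequency (2201)
@assert nrow(titanic3) == sum(titanic2.Freq)
# shuffle the rows to allow a random draw or a train/test split
titanic3 = titanic3[shuffle(1:nrow(titanic3)), :]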

Pick a Bernoulli random variable with success probability p.
The entropy and Gini impurity (each summed over the two outcomes) attain
their maximal values at p=1/2:

    \[H(p)=-p\log_2(p)-(1-p)\log_2(1-p)=1\]

    \[G(p)=p(1-p)+(1-p)p=1/2\]

which indicates that this feature gives us no real ability to predict
survivability. On the other hand, if p=0.9

    \[H=0.4689955, \qquad G=0.18\]

both of which are well below the maximal values of H and G. Maximal values
of H, G mean that the feature provides no information gain for prediction;
lower values of H, G mean that the feature makes the prediction closer to a
sure thing. Low entropy or low Gini impurity translates into less randomness
of the variable.
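
A two-line check of these numbers in Julia (a small sketch; the helper names h and g are just for illustration and will be reused below):

h(p) = -p*log2(p) - (1-p)*log2(1-p)   # binary entropy
g(p) = 2*p*(1-p)                      # Gini impurity, p(1-p)+(1-p)p
h(0.5), g(0.5)    # (1.0, 0.5)
h(0.9), g(0.9)    # ≈ (0.469, 0.18)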

Since entropy and Gini impurity are measures of the randomness of a variable,
we can compute their values for the whole dataset's classification variable,
survivability. Out of 2201 passengers, 711 survived.

sum(titanic2.Freq .* titanic2.Survived)
# 711

    \[p(S)={711\over 2201}=0.323035\]

    \[H=-p(S)\log_2(p(S))-(1-p(S))\log_2(1-p(S))=0.9076514\]

    \[G=p(S)(1-p(S))+(1-p(S))p(S)=0.437366766\]
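
These values can be reproduced directly from the data, reusing the h and g helpers sketched above:

pS = sum(titanic2.Freq .* titanic2.Survived) / sum(titanic2.Freq)   # 711/2201
h(pS)   # ≈ 0.90765
g(pS)   # ≈ 0.43737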

This entropy is close to its maximal value of 1: without further information,
predicting the fate of a randomly selected passenger is almost as uncertain as
a coin flip. If you wish to predict the survival probability of a randomly
selected passenger, taking no features into account, you would say 0.323, but
in fact some categories of passengers had very different survival rates. How
can you do better in your prediction? By splitting the data on the features.
Which features are the best deciders of survival rate?

How many Female passengers were there?

# number of women
sum(titanic2.Freq .* titanic2.Sex)
# 470
# number of men
sum(titanic2.Freq .* (1 .- titanic2.Sex))
# 1731

# how many survived?
# number of women surviving
sum(titanic2.Freq .* titanic2.Sex .* titanic2.Survived)
# 344
# number of men surviving
sum(titanic2.Freq .* (1 .- titanic2.Sex) .* titanic2.Survived)
# 367

so the probability that a passenger is Female is p(F)=470/2201=0.2135393, and
the conditional probability that a passenger survives given that she is Female
is p(S|F)=344/470=0.7319148936.

    \[H(p(S|F))=-p(S|F)\log_2(p(S|F))-(1-p(S|F))\log_2(1-p(S|F))=0.838703444\]

Compute as well the probability that a passenger is Male,
p(F^c)=1731/2201=0.7864607, and the conditional probability of surviving
given that the passenger is Male, p(S|F^c)=367/1731=0.2120161756

    \[H(p(S|F^c))=-p(S|F^c)\log_2(p(S|F^c))-(1-p(S|F^c))\log_2(1-p(S|F^c))=0.7453189\]

(lower, hence more predictable) and finally the conditional entropy for
splitting the data by sex

    \[H(S|F)=p(F) H(p(S|F))+p(F^c) H(p(S|F^c))=0.76526\]

which is a substantial improvement over the entropy with no features considered.

Define the information gain for splitting on this feature

    \[IG(S|F)=H(S)-H(S|F)=0.9076514-0.76526=0.14239\]

Incorporating this feature in the decision process results in an improved
prediction of the survival probability.
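
The hand computation above can be verified in a few lines, reusing the h helper sketched earlier and the counts already obtained (the variable names here are just for illustration):

nF  = sum(titanic2.Freq .* titanic2.Sex)                        # 470 women
nM  = sum(titanic2.Freq .* (1 .- titanic2.Sex))                 # 1731 men
nFS = sum(titanic2.Freq .* titanic2.Sex .* titanic2.Survived)   # 344 survivors
nMS = 711 - nFS                                                 # 367 survivors
pF, pSF, pSM = nF/2201, nFS/nF, nMS/nM
HSgivenSex = pF*h(pSF) + (1 - pF)*h(pSM)   # ≈ 0.76526
IG = h(711/2201) - HSgivenSex              # ≈ 0.14239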

Let’s write code to compute the entropy and information gain of a data split.
R has a library, “FSelector”, for feature selection, but I think that it is
always a good idea to write your own so that you understand the underlying
principles.

""" Compute entropy for a vector of attribute numbers, ie include attributes
    1,2 (3rdclass yes, Sex=female yes) entropy([1,2])"""
entropy2=function(attrs:: Array,dataf:: DataFrame)
# how many have all attributes =1
Nattr1=sum(reduce(.* ,[dataf[:,j] for j in attrs]))
# prob that random item has all attribs
Pattr1=Nattr1/nrow(dataf)
# how many have all attributes and are in class=1
Nattr1C=sum(reduce(.* ,[dataf[:,j] for j in attrs]).*dataf[:,4])
# entropy for those with all attribs being in class 1
SCandAttr=-xlogx(Nattr1C/Nattr1)-xlogx(1.0-Nattr1C/Nattr1)
# Number that do not have all attributes
Nattr0=nrow(dataf)-Nattr1
# prob that random does not have all attribs
Pattr0=Nattr0/nrow(dataf)
# how many have none of the attributes and are in class=1
Nattr0C=sum(reduce(.* ,[1 .- dataf[:,j] for j in attrs]).*dataf[:,4])
# entropy for those without all atribs being in class 1
SCandNoAttr=-xlogx(Nattr0C/Nattr0)-xlogx(1.0-Nattr0C/Nattr0)
return Pattr1*SCandAttr+Pattr0*SCandNoAttr
end

""" julia does not give x*log2(x)=0 if x=0"""
xlogx=function(x)
if x==0.0
0.0
else
x*log2(x)
end
end

"""Probability of survival if have atributes attr"""
prob=function(attrs:: Array,dataf:: DataFrame)
Nattr=sum(reduce(.* ,[dataf[:,j] for j in attrs]))
NattrC=sum(reduce(.* ,[dataf[:,j] for j in attrs]) .* dataf[:,4])
return NattrC/Nattr
end


""" Information gain for splitting the data on the attributes in attrs """
infogain=function(attrs:: Array,dataf:: DataFrame)
    # entropy of the class (Survived) before any split
    Nclass=sum(dataf[:,4])
    Pclass=Nclass/nrow(dataf)
    entrpy=-xlogx(Pclass)-xlogx(1.0-Pclass)
    # minus the conditional entropy after the split
    entrpy-entropy2(attrs,dataf)
end

Let’s use it to see how we should build a decision tree, ordering the branching
according to information gain: the higher the gain of a split, the closer it
should sit to the root node of the tree.

# third class is the first feature
infogain([1],titanic3)
# 0.008030400990276632
# sex is the second feature
infogain([2],titanic3)
# 0.14239119454923332
# age is the third feature
infogain([3],titanic3)
# 0.006410718332584997

so we get the best/most useful decision tree by splitting first on sex, then
on third class, and then on age. This is in fact how we built the tree in the
earlier post; we lucked out.
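
To grow the tree one level further, the same functions can be applied inside each branch of the first split; a hedged sketch (the subset names are just illustrative, and the resulting gains were not computed in this post, so no values are shown):

# split on sex first, then rank the remaining features inside each branch
females = filter(row -> row.Sex == 1, titanic3)
males   = filter(row -> row.Sex == 0, titanic3)
infogain([1], females), infogain([3], females)   # gains of class and age among women
infogain([1], males),   infogain([3], males)     # gains of class and age among men
# survival probability at a node, e.g. for female passengers (344/470)
prob([2], titanic3)   # ≈ 0.7319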
