**Attribute Selection Methods for Decision Tree Construction**

Attribute selection methods are an important topic in Data Mining & Business Intelligence (DMBI).

An attribute selection method is a heuristic for selecting the splitting criterion that “best” separates a given data partition, *D*, of class-labeled training tuples into individual classes.

## Information Gain: Attribute Selection Methods

ID3 uses information gain as its **attribute selection method**.

This measure is based on pioneering work by Claude Shannon on information theory, which studied the value or “information content” of messages.

Let node *N* represent or hold the tuples of partition *D*. The attribute with the highest information gain is chosen as the splitting attribute for node *N*.

This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in these partitions.

Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.

The expected information needed to classify a tuple in *D* is given by

*Info*(*D*) = −∑_{i=1}^{m} *p*_{i} log_{2}(*p*_{i})

where *p*_{i} is the probability that an arbitrary tuple in *D* belongs to class *C*_{i}, estimated by |*C*_{i,D}| / |*D*|. A log function to the base 2 is used because the information is encoded in bits.

*Info*(*D*) is just the average amount of information needed to identify the class label of a tuple in *D*.

Note that, at this point, the information we have is based solely on the proportions of tuples of each class. *Info*(*D*) is also known as the entropy of *D*. Now, suppose we want to partition the tuples in *D* on some attribute *A* having *v* distinct values, {*a*_{1}, *a*_{2}, …, *a*_{v}}, as observed from the training data.
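As a minimal sketch (not from the source; the function name and the 9-versus-5 class split are illustrative), the entropy *Info*(*D*) can be computed directly from the class labels:

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy Info(D): average bits needed to identify the class of a tuple in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Hypothetical data: a partition with 9 tuples of one class and 5 of another.
labels = ["yes"] * 9 + ["no"] * 5
print(round(info(labels), 3))  # ≈ 0.94
```

A pure partition (all one class) yields an entropy of 0, and a perfectly even two-class split yields 1 bit.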

If *A* is discrete-valued, these values correspond directly to the *v* outcomes of a test on *A*. Attribute *A* can be used to split *D* into *v* partitions or subsets, {*D*_{1}, *D*_{2}, …, *D*_{v}}, where *D*_{j} contains those tuples in *D* that have outcome *a*_{j} of *A*. These partitions would correspond to the branches grown from node *N*.

Ideally, we would like this partitioning to produce an exact classification of the tuples. That is, we would like for each partition to be pure. However, it is quite likely that the partitions will be impure (e.g., where a partition may contain a collection of tuples from different classes rather than from a single class).

**How much more information would we still need in order to arrive at an exact classification?**

This amount is measured by

*Info*_{A}(*D*) = ∑_{j=1}^{v} (|*D*_{j}| / |*D*|) × *Info*(*D*_{j})

The term |*D*_{j}| / |*D*| acts as the weight of the *j*th partition. *Info*_{A}(*D*) is the expected information required to classify a tuple from *D* based on the partitioning by *A*.

The smaller the expected information (still) required, the greater the purity of the partitions.

Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on *A*). That is,

*Gain*(*A*) = *Info*(*D*) − *Info*_{A}(*D*)

- In other words, *Gain*(*A*) tells us how much would be gained by branching on *A*. It is the expected reduction in the information requirement caused by knowing the value of *A*.
- The attribute *A* with the highest information gain, *Gain*(*A*), is chosen as the splitting attribute at node *N*.
- This is equivalent to saying that we want to partition on the attribute *A* that would do the “best classification,” so that the amount of information still required to finish classifying the tuples is minimal (i.e., minimum *Info*_{A}(*D*)).
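The computation above can be sketched in Python (a minimal illustration, not from the source; the function names, the tuple layout, and the toy data are assumptions):

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy Info(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index, labels):
    """Gain(A) = Info(D) - Info_A(D) for the discrete attribute at attr_index."""
    n = len(labels)
    # Group class labels by attribute value a_j: these groups are the partitions D_j.
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(p) / n * info(p) for p in parts.values())
    return info(labels) - info_a

# A perfectly predictive attribute recovers all of Info(D):
rows = [("a",), ("a",), ("b",), ("b",)]
print(gain(rows, 0, ["y", "y", "n", "n"]))  # 1.0
```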

## Gain Ratio: Attribute Selection Methods

The information gain measure is biased toward tests with many outcomes.

That is, it prefers to select attributes having a large number of values. For example, consider an attribute that acts as a unique identifier, such as *product ID*.

A split on *product ID* would result in a large number of partitions (as many as there are values), each one containing just one tuple.

Because each partition is pure, the information required to classify data set *D* based on this partitioning would be *Info*_{product_ID}(*D*) = 0.

Therefore, the information gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless for classification.

C4.5, a successor of ID3, uses an extension to information gain known as *gain ratio*, which attempts to overcome this bias.

It applies a kind of normalization to information gain using a **“split information”** value defined analogously with *Info*(*D*) as

*SplitInfo*_{A}(*D*) = −∑_{j=1}^{v} (|*D*_{j}| / |*D*|) × log_{2}(|*D*_{j}| / |*D*|)

This value represents the potential information generated by splitting the training data set, *D*, into *v *partitions, corresponding to the *v *outcomes of a test on attribute *A*.

Note that, for each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in *D*.

It differs from information gain, which measures the information with respect to the classification that is acquired based on the same partitioning.

The **gain ratio** is defined as

*GainRatio*(*A*) = *Gain*(*A*) / *SplitInfo*_{A}(*D*)

The attribute with the maximum gain ratio is selected as the splitting attribute. Note, however, that as the split information approaches 0, the ratio becomes unstable.

A constraint is added to avoid this, whereby the information gain of the test selected must be at least as great as the average gain over all tests examined.
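A sketch of the gain ratio computation (illustrative only; function names and the toy unique-identifier data are assumptions, not C4.5’s actual implementation):

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy Info(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr_index, labels):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D): gain normalized by split information."""
    n = len(labels)
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(p) / n * info(p) for p in parts.values())
    gain = info(labels) - info_a
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
    return gain / split_info if split_info > 0 else 0.0

# A unique-ID-like attribute gets maximal gain (1.0) but a large split info (2.0),
# so its gain ratio is only 0.5 -- the bias toward many-valued tests is dampened.
print(gain_ratio([(i,) for i in range(4)], 0, ["y", "y", "n", "n"]))  # 0.5
```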

## Gini Index: Attribute Selection Methods

The Gini index is used in CART. Using the notation described above, the Gini index measures the impurity of *D*, a data partition or set of training tuples, as

*Gini*(*D*) = 1 − ∑_{i=1}^{m} *p*_{i}^{2}

where *p*_{i} is the probability that a tuple in *D* belongs to class *C*_{i} and is estimated by |*C*_{i,D}| / |*D*|. The sum is computed over *m* classes.
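A minimal sketch of the Gini impurity computation (not from the source; the function name and the 9-versus-5 toy split are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of p_i^2 over the m classes present in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Hypothetical partition: 9 tuples of one class, 5 of another.
labels = ["yes"] * 9 + ["no"] * 5
print(round(gini(labels), 3))  # 0.459
```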

The Gini index considers a binary split for each attribute. Let’s first consider the case where *A* is a discrete-valued attribute having *v* distinct values, {*a*_{1}, *a*_{2}, …, *a*_{v}}, occurring in *D*.

To determine the best binary split on *A*, we examine all of the possible subsets that can be formed using the known values of *A*.

Each subset, *S*_{A}, can be considered as a binary test for attribute *A* of the form “*A* ∈ *S*_{A}?”. Given a tuple, this test is satisfied if the value of *A* for the tuple is among the values listed in *S*_{A}.

If *A* has *v* possible values, then there are 2^{v} possible subsets.

For example, if *income* has three possible values, namely {*low*, *medium*, *high*}, then the possible subsets are {*low, medium, high*}, {*low, medium*}, {*low, high*}, {*medium, high*}, {*low*}, {*medium*}, {*high*}, and {}.

We exclude the power set, {*low, medium, high*}, and the empty set from consideration since, conceptually, they do not represent a split. Therefore, there are 2^{v} − 2 possible ways to form two partitions of the data, *D*, based on a binary split on *A*.
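The candidate subsets can be enumerated as follows (a sketch; the function name is an assumption):

```python
from itertools import combinations

def candidate_subsets(values):
    """All subsets usable as a binary test "A in S_A?": every non-empty proper
    subset of the attribute's values, i.e. 2^v - 2 candidates in total."""
    values = list(values)
    subsets = []
    for r in range(1, len(values)):  # r = 1 .. v-1 excludes {} and the full set
        subsets.extend(combinations(values, r))
    return subsets

print(len(candidate_subsets(["low", "medium", "high"])))  # 2^3 - 2 = 6
```

Note that each subset and its complement describe the same two-way partition, so only half of these candidates are truly distinct splits.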

When considering a binary split, we compute a weighted sum of the impurity of each resulting partition. For example, if a binary split on *A* partitions *D* into *D*_{1} and *D*_{2}, the Gini index of *D* given that partitioning is

*Gini*_{A}(*D*) = (|*D*_{1}| / |*D*|) × *Gini*(*D*_{1}) + (|*D*_{2}| / |*D*|) × *Gini*(*D*_{2})
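The weighted sum can be sketched like this (illustrative only; function names are assumptions):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of p_i^2 over the classes in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1_labels, d2_labels):
    """Gini_A(D): impurities of the two partitions, weighted by partition size."""
    n = len(d1_labels) + len(d2_labels)
    return (len(d1_labels) / n) * gini(d1_labels) + (len(d2_labels) / n) * gini(d2_labels)

print(gini_split(["y", "y"], ["n", "n"]))  # 0.0 -- a pure split has zero impurity
```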

For each attribute, each of the possible binary splits is considered.

For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as its splitting subset.

For continuous-valued attributes, each possible split-point must be considered. The strategy is similar to that described above for information gain, where the midpoint between each pair of (sorted) adjacent values is taken as a possible split-point.

The point giving the minimum Gini index for a given (continuous-valued) attribute is taken as the split-point of that attribute.

Recall that for a possible split-point of *A*, *D*_{1} is the set of tuples in *D* satisfying *A* ≤ *split_point*, and *D*_{2} is the set of tuples in *D* satisfying *A* > *split_point*.
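Generating the candidate split-points for a continuous attribute can be sketched as (function name and sample values are illustrative assumptions):

```python
def candidate_split_points(values):
    """Midpoints between adjacent sorted distinct values of a continuous attribute."""
    vals = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vals, vals[1:])]

# Hypothetical attribute values; duplicates collapse before midpoints are taken.
print(candidate_split_points([70, 60, 75, 60]))  # [65.0, 72.5]
```

Each midpoint then induces a partition (*D*_{1}: values ≤ split-point, *D*_{2}: values > split-point) whose Gini index can be evaluated.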

The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute *A* is

Δ*Gini*(*A*) = *Gini*(*D*) − *Gini*_{A}(*D*)

The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini index) is selected as the splitting attribute.

This attribute and either its splitting subset (for a discrete-valued splitting attribute) or split-point (for a continuous-valued splitting attribute) together form the splitting criterion.

**Related Terms**

Data Mining and Business Analysis, Naïve Bayesian Classification, CART Classification Method, Logistic Regression
