
**Data Warehousing & Data Mining [10CS755] unit-5&6**

# UNIT V & VI CLASSIFICATION

5.1 Basics

The input data for a classification task is a collection of records. Each record, also known as an instance or example, is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute, designated as the class label. Consider a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian. The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly, and ability to live in water. The attribute set can also contain continuous features. The class label, on the other hand, must be a discrete attribute. This is a key characteristic that distinguishes classification from regression, a predictive modelling task in which y is a continuous attribute.

Definition 3.1 (Classification). Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y. The target function is also known informally as a classification model. A classification model is useful for the following purposes.

Descriptive Modeling

A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful for both biologists and others to have a descriptive model that summarizes the vertebrate data and explains which features define a mammal.

Predictive Modeling

A classification model can also be used to predict the class label of unknown records. A classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record. Suppose we are given the characteristics of a creature known as a gila monster; a classification model built from the vertebrate data set can be used to determine the class to which the creature belongs. Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories (e.g., classifying a person as a member of a high-, medium-, or low-income group) because they do not consider the implicit order among the categories. Other forms of relationships, such as the subclass/superclass relationships among categories, are also ignored.

5.2 General Approach to Solving a Classification Problem

A classification technique (or classifier) is a systematic approach to building classification

models from an input data set. Examples include decision tree classifiers, rule-based classifiers,

neural networks, support vector machines, and naïve Bayes classifiers. Each technique employs a

learning algorithm to identify a model that best fits the relationship between the attribute set and

class label of the input data. The model generated by a learning algorithm should both fit the

input data well and correctly predict the class labels of records it has never seen before.

Therefore, a key objective of the learning algorithm is to build models with good generalization

capability; i.e., models that accurately predict the class labels of previously unknown records.

Figure 3.3. General approach for building a classification model.

Figure 3.3 shows a general approach for solving classification problems. First, a training set

consisting of records whose class labels are known must be provided. The training set is used to

build a classification model, which is subsequently applied to the test set, which consists of records

with unknown class labels.

Table 3.2. Confusion matrix for a 2-class problem

Evaluation of the performance of a classification model is based on the counts of test

records correctly and incorrectly predicted by the model. These counts are tabulated in a table

known as a confusion matrix. Table 3.2 depicts the confusion matrix for a binary classification problem. Each entry fij in this table denotes the number of records from class i predicted to be of class j. For instance, f01 is the number of records from class 0 incorrectly predicted as class 1.

Based on the entries in the confusion matrix, the total number of correct predictions made by the

model is (f11 + f00) and the total number of incorrect predictions is (f10+ f01). Although a confusion

matrix provides the information needed to determine how well a classification model performs,

summarizing this information with a single number would make it more convenient to compare

the performance of different models. This can be done using a performance metric such as

accuracy, which is defined as follows:

Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)    (3.1)

Equivalently, the performance of a model can be expressed in terms of its error rate, which is given by the following equation:

Error rate = (f10 + f01) / (f11 + f10 + f01 + f00)    (3.2)

Most classification algorithms seek models that attain the highest accuracy, or equivalently, the

lowest error rate when applied to the test set.
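The accuracy and error rate computations above can be sketched as follows; the function and variable names are illustrative, and the counts follow the f_ij notation of Table 3.2:

```python
# Sketch: evaluating a binary classifier with a confusion matrix.
# Entry f[i][j] counts test records from class i predicted as class j.

def accuracy(f):
    """Fraction of test records predicted correctly: (f11 + f00) / total."""
    total = f[0][0] + f[0][1] + f[1][0] + f[1][1]
    return (f[1][1] + f[0][0]) / total

def error_rate(f):
    """Fraction of test records predicted incorrectly: (f10 + f01) / total."""
    total = f[0][0] + f[0][1] + f[1][0] + f[1][1]
    return (f[1][0] + f[0][1]) / total

# Example: 40 true negatives, 5 false positives, 10 false negatives, 45 true positives.
f = [[40, 5],
     [10, 45]]
print(accuracy(f))    # 0.85
print(error_rate(f))  # 0.15
```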

5.3 Decision Tree Induction

This section introduces a decision tree classifier, which is a simple yet widely used classification

technique.

5.3.1 How a Decision Tree Works

To illustrate how classification with a decision tree works, consider a simpler version of the

vertebrate classification problem described in the previous section. Instead of classifying the

vertebrates into five distinct groups of species, we assign them to two categories: mammals and

non-mammals. Suppose a new species is discovered by scientists. How can we tell whether it is a

mammal or a non-mammal? One approach is to pose a series of questions about the characteristics

of the species. The first question we may ask is whether the species is cold- or warm-blooded.

If it is cold-blooded, then it is definitely not a mammal. Otherwise, it is either a bird or a mammal.

In the latter case, we need to ask a follow-up question: Do the females of the species give birth to

their young? Those that do give birth are definitely mammals, while those that do not are likely to

be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny

anteater) The previous example illustrates how we can solve a classification problem by asking a

series of carefully crafted questions about the attributes of the test record. Each time we

receive an answer, a follow-up question is asked until we reach a conclusion about the class

label of the record. The series of questions and their possible answers can be organized in the form

of a decision tree, which is a hierarchical structure consisting of nodes and directed edges. Figure

4.4 shows the decision tree for the mammal classification problem. The tree has three types of

nodes:

A root node that has no incoming edges and zero or more outgoing edges.

Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.

Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.

In a decision tree, each leaf node is assigned a class label. The nonterminal nodes,

which include the root and other internal nodes, contain attribute test conditions to separate

records that have different characteristics. For example, the root node shown in Figure 4.4 uses the attribute Body Temperature to separate warm-blooded from cold-blooded vertebrates.

Figure 5.4. A decision tree for the mammal classification problem.

Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mammals is created as the right

child of the root node. If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is

used to distinguish mammals from other warm-blooded creatures, which are mostly birds.

Classifying a test record is straightforward once a decision tree has been constructed. Starting from

the root node, we apply the test condition to the record and follow the appropriate branch based on

the outcome of the test.

This will lead us either to another internal node, for which a new test condition is applied, or

to a leaf node. The class label associated with the leaf node is then assigned to the record. As an

illustration, Figure 4.5 traces the path in the decision tree that is used to predict the class label of a

flamingo. The path terminates at a leaf node labeled Non-mammals.
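The traversal procedure described above can be sketched in code. This is an illustrative sketch, not a library API; the tree is the mammal classification tree of Figure 4.4 encoded as nested dictionaries, and the attribute names are assumptions:

```python
# Sketch of classifying a record by walking a decision tree from the root.

def classify(node, record):
    """Follow test outcomes from the root until a leaf label is reached."""
    while "label" not in node:            # internal node: apply its test condition
        outcome = record[node["attribute"]]
        node = node["children"][outcome]  # follow the branch for this outcome
    return node["label"]                  # leaf node: assign its class label

# Decision tree of Figure 4.4: Body Temperature at the root, Gives Birth below.
tree = {
    "attribute": "body_temperature",
    "children": {
        "cold-blooded": {"label": "Non-mammals"},
        "warm-blooded": {
            "attribute": "gives_birth",
            "children": {
                "yes": {"label": "Mammals"},
                "no":  {"label": "Non-mammals"},
            },
        },
    },
}

# A flamingo is warm-blooded but does not give birth (Figure 4.5).
flamingo = {"body_temperature": "warm-blooded", "gives_birth": "no"}
print(classify(tree, flamingo))  # Non-mammals
```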

5.3.2 How to Build a Decision Tree

There are exponentially many decision trees that can be constructed from a given set of

attributes. While some of the trees are more accurate than others, finding the optimal tree is

computationally infeasible because of the exponential size of the search space. Nevertheless,

efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal,

decision tree in a reasonable amount of time.

These algorithms usually employ a greedy strategy that grows a decision tree by making a

series of locally optimum decisions about which attribute to use for partitioning the data. One such

algorithm is Hunt's algorithm, which is the basis of many existing decision tree induction

algorithms, including ID3, C4.5, and CART.

This section presents a high-level discussion of Hunt's algorithm and illustrates some of its

design issues.

Figure 5.5. Classifying an unlabeled vertebrate.

The dashed lines represent the outcomes of applying various attribute test conditions on the

unlabeled vertebrate. The vertebrate is eventually assigned to the Non-mammal class.

Hunt's Algorithm

In Hunt's algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets. Let Dt be the set of training records that are associated with node t and y = {y1, y2, . . . , yc} be the class labels. The following is a recursive definition of Hunt's algorithm.

Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.

Step 2: If Dt contains records that belong to more than one class, an attribute test

condition is selected to partition the records into smaller subsets. A child node is created for

each outcome of the test condition and the records in Dt are distributed to the children based on

the outcomes. The algorithm is then recursively applied to each child node.
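The two recursive steps above can be sketched as follows. This is a minimal sketch under simplifying assumptions: categorical attributes, multiway splits, and a trivial attribute-selection rule that just takes the first unused attribute (real implementations choose the split using an impurity measure, as discussed later in this unit):

```python
# Minimal sketch of Hunt's algorithm; names and structure are illustrative.

from collections import Counter

def hunt(records, attributes):
    """records: list of (attribute_dict, class_label) pairs."""
    labels = [y for _, y in records]
    if len(set(labels)) == 1:                  # Step 1: pure node becomes a leaf
        return {"label": labels[0]}
    if not attributes:                         # identical attribute values left:
        return {"label": Counter(labels).most_common(1)[0][0]}  # majority class
    attr = attributes[0]                       # Step 2: select a test attribute
    children = {}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        children[value] = hunt(subset, attributes[1:])  # recurse on each child
    return {"attribute": attr, "children": children}

# Toy version of the loan-default example (records are illustrative).
data = [
    ({"home_owner": "yes"}, "No"),
    ({"home_owner": "yes"}, "No"),
    ({"home_owner": "no"},  "Yes"),
]
tree = hunt(data, ["home_owner"])
print(tree["children"]["yes"]["label"])  # No
```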

Figure 3.6. Training set for predicting borrowers who will default on loan payments.

To illustrate how the algorithm works, consider the problem of predicting whether a loan

applicant will repay her loan obligations or become delinquent, subsequently defaulting on her loan.

A training set for this problem can be constructed by examining the records of previous borrowers.

In the example shown in Figure 4.6, each record contains the personal information of a borrower

along with a class label indicating whether the borrower has defaulted on loan payments.

The initial tree for the classification problem contains a single node with class label Defaulted = No (see Figure 3.7(a)), which means that most of the borrowers successfully repaid their loans. The tree, however, needs to be refined since the root node contains records from both classes. The records are subsequently divided into smaller subsets based on the outcomes of the Home Owner test condition, as shown in Figure 3.7(b). The justification for choosing this attribute test condition will be discussed later. For now, we will assume that this is the best criterion for splitting the data at this point. Hunt's algorithm is then applied recursively to each child of the root node. From the training set given in Figure 3.6, notice that all borrowers who are home owners successfully repaid their loans. The left child of the root is therefore a leaf node labeled Defaulted = No (see Figure 3.7(b)). For the right child, we need to continue applying the recursive step of Hunt's algorithm until all the records belong to the same class. The trees resulting from each recursive step are shown in Figures 3.7(c) and (d).

Figure 5.7. Hunt's algorithm for inducing decision trees.

Hunt's algorithm will work if every combination of attribute values is present in the

training data and each combination has a unique class label. These assumptions are too

stringent for use in most practical situations. Additional conditions are needed to handle the

following cases:

1. It is possible for some of the child nodes created in Step 2 to be empty; i.e., there are no

records associated with these nodes. This can happen if none of the training records have the

combination of attribute values associated with such nodes. In this case the node is declared a

leaf node with the same class label as the majority class of training records associated with its

parent node.

2. In Step 2, if all the records associated with Dt have identical attribute values (except for

the class label), then it is not possible to split these records any further. In this case, the node is

declared a leaf node with the same class label as the majority class of training records associated

with this node.

Design Issues of Decision Tree Induction

A learning algorithm for inducing decision trees must address the following two issues.

a) How should the training records be split? Each recursive step of the tree-growing

process must select an attribute test condition to divide the records into smaller

subsets. To implement this step, the algorithm must provide a method for specifying

the test condition for different attribute types as well as an objective measure for

evaluating the goodness of each test condition.

b) How should the splitting procedure stop? A stopping condition is needed to terminate the

tree-growing process. A possible strategy is to continue expanding a node until either all the

records belong to the same class or all the records have identical attribute values. Although

both conditions are sufficient to stop any decision tree induction algorithm, other

criteria can be imposed to allow the tree-growing procedure to terminate earlier.

5.3.3 Methods for Expressing Attribute Test Conditions

Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types.

Binary Attributes:

The test condition for a binary attribute generates two potential outcomes, as shown in Figure 3.8.

Figure 3.8 Test condition for binary attributes.

Figure 3.9 Test conditions for nominal attributes.

Nominal Attributes:

Since a nominal attribute can have many values, its test condition can be expressed in two ways, as shown in Figure 3.9. For a multiway split (Figure 3.9(a)), the number of outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split. On the other hand, some decision tree algorithms, such as CART, produce only binary splits by considering all 2^(k-1) - 1 ways of creating a binary partition of k attribute values. Figure 3.9(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.
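The 2^(k-1) - 1 binary partitions can be enumerated as sketched below. This is an illustrative sketch of the counting argument, not CART's actual search procedure:

```python
# Sketch: enumerating the binary partitions of a nominal attribute's values.

from itertools import combinations

def binary_partitions(values):
    """Yield each way of splitting `values` into two non-empty subsets."""
    values = list(values)
    rest = values[1:]
    # Fixing the first value in the left subset avoids counting each
    # partition twice, which is why there are 2^(k-1) - 1 distinct splits.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = [values[0], *combo]
            right = [v for v in values if v not in left]
            if right:                      # both sides must be non-empty
                yield left, right

splits = list(binary_partitions(["single", "married", "divorced"]))
print(len(splits))  # 3, i.e. 2^(3-1) - 1
for left, right in splits:
    print(left, right)
```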

Ordinal Attributes:

Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can

be grouped as long as the grouping does not violate the order property of the attribute values.

Figure 3.10 illustrates various ways of splitting training records based on the Shirt Size attribute.

The groupings shown in Figures 3.10(a) and (b) preserve the order among the attribute

values, whereas the grouping shown in Figure 3.10(c) violates this property because it combines

the attribute values Small and Large into the same partition while Medium and Extra Large are

combined into another partition.

Figure 3.10 Different ways of grouping ordinal attribute values.

Continuous Attributes:

For continuous attributes, the test condition can be expressed as a comparison test (A < v) or (A >= v) with binary outcomes, or a range query with outcomes of the form vi <= A < vi+1, for i = 1, . . . , k. The difference between these approaches is shown in Figure 3.11. For the binary case, the decision tree algorithm must consider all possible split positions v, and it selects the one that produces the best partition. For the multiway split, the algorithm must consider all possible ranges of continuous values. One approach is to apply the discretization strategies described earlier. After discretization, a new ordinal value will be assigned to each discretized interval. Adjacent intervals

can also be aggregated into wider ranges as long as the order property is preserved.


Figure 3.11 Test condition for continuous attributes.

Figure 3.12 Multiway versus binary splits.

5.3.4 Measures for Selecting the Best Split

There are many measures that can be used to determine the best way to split the records.

These measures are defined in terms of the class distribution of the records before and after splitting.

Let p(i|t) denote the fraction of records belonging to class i at a given node t. We sometimes

omit the reference to node t and express the fraction as pi. In a two-class problem, the class

distribution at any node can be written as (p0, p1), where p1 = 1 - p0. The class distribution

before splitting is (0.5, 0.5) because there are an equal number of records from each class.

If we split the data using the Gender attribute, then the class distributions of the child nodes are

(0.6, 0.4) and (0.4, 0.6), respectively. Although the classes are no longer evenly distributed, the

child nodes still contain records from both classes. Splitting on the second attribute, Car Type, will

result in purer partitions.

The measures developed for selecting the best split are often based on the degree of impurity

of the child nodes. The smaller the degree of impurity, the more skewed the class distribution. For

example, a node with class distribution (0, 1) has zero impurity, whereas a node with uniform class

distribution (0.5, 0.5) has the highest impurity. Examples of impurity measures include

Entropy(t) = - Σ_{i=0}^{c-1} p(i|t) log2 p(i|t),    (3.3)

Gini(t) = 1 - Σ_{i=0}^{c-1} [p(i|t)]^2,    (3.4)

Classification error(t) = 1 - max_i [p(i|t)],    (3.5)

where c is the number of classes and 0 log2 0 = 0 in entropy calculations.

Figure 3.13 Comparison among the impurity measures for binary classification problems.

Figure 3.13 compares the values of the impurity measures for binary classification problems.

p refers to the fraction of records that belong to one of the two classes. Observe that all three

measures attain their maximum value when the class distribution is uniform (i.e., when p = 0.5).

The minimum values for the measures are attained when all the records belong to the same class

(i.e., when p equals 0 or 1). We next provide several examples of computing the different impurity

measures.
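The three impurity measures of Equations (3.3)-(3.5) can be computed directly from a node's class distribution; a minimal sketch, with illustrative function names:

```python
# Impurity measures for a node with class distribution p = [p(0|t), ..., p(c-1|t)].

import math

def entropy(p):
    # By convention, 0 * log2(0) = 0, so zero-probability classes are skipped.
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1.0 - sum(pi * pi for pi in p)

def classification_error(p):
    return 1.0 - max(p)

uniform = [0.5, 0.5]   # maximum impurity for a two-class problem (p = 0.5)
pure = [0.0, 1.0]      # zero impurity: all records belong to one class

print(entropy(uniform), gini(uniform), classification_error(uniform))  # 1.0 0.5 0.5
print(entropy(pure), gini(pure), classification_error(pure))           # 0.0 0.0 0.0
```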

Gain Ratio

Impurity measures such as entropy and Gini index tend to favor attributes that have a large

number of distinct values. Comparing the first test condition, Gender, with the second, Car

Type, it is easy to see that Car Type seems to provide a better way of splitting the data since

it produces purer descendent nodes. However, if we compare both conditions with Customer ID,

the latter appears to produce purer partitions.

Yet Customer ID is not a predictive attribute because its value is unique for each record.

Even in a less extreme situation, a test condition that results in a

large number of outcomes may not be desirable because the number of records associated

with each partition is too small to enable us to make any reliable predictions.

There are two strategies for overcoming this problem. The first strategy is to restrict the test

conditions to binary splits only. This strategy is employed by decision tree algorithms such as

CART. Another strategy is to modify the splitting criterion to take into account the number of

outcomes produced by the attribute test condition. For example, in the C4.5 decision tree

algorithm, a splitting criterion known as gain ratio is used to determine the goodness of a split.

This criterion is defined as follows:

Gain ratio = Δinfo / Split Info

Here, Split Info = - Σ_{i=1}^{k} P(vi) log2 P(vi) and k is the total number of splits. For example, if each

attribute value has the same number of records, then ∀i: P (vi) = 1/k and the split

information would be equal to log2k. This example suggests that if an attribute produces a

large number of splits, its split information will also be large, which in turn reduces its gain ratio.
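The split information and gain ratio computations above can be sketched as follows. The information gain value passed in is a placeholder argument (computing it would use the parent and child entropies); the function names are illustrative:

```python
# Sketch: split information and gain ratio, as in C4.5's splitting criterion.

import math

def split_info(partition_sizes):
    """Split Info = -sum P(vi) log2 P(vi) over the k outcomes of a split."""
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(information_gain, partition_sizes):
    return information_gain / split_info(partition_sizes)

# Four equally sized partitions: Split Info = log2(4) = 2.0, so the same
# information gain yields a smaller gain ratio than a two-way split does.
print(split_info([25, 25, 25, 25]))       # 2.0
print(gain_ratio(0.5, [25, 25, 25, 25]))  # 0.25
print(gain_ratio(0.5, [50, 50]))          # 0.5
```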

5.3.5 Algorithm for Decision Tree Induction

A skeleton decision tree induction algorithm called TreeGrowth is shown in Algorithm 4.1. The input to this algorithm consists of the training records E and the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the leaf nodes of the tree (Steps 11 and 12) until the stopping criterion is met (Step 1). The details of this algorithm are explained below:

1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as node.label.

2. The find_best_split() function determines which attribute should be selected as the test condition for splitting the training records. As previously noted, the choice of test condition depends on which impurity measure is used to determine the goodness of a split. Some widely used measures include entropy, the Gini index, and the χ² statistic.

3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf

node t, let p(i|t) denote the fraction of training records from class i associated with the node t. In

most cases, the leaf node is assigned to the class that has the majority number of training records:

leaf.label = argmax_i p(i|t),

where the argmax operator returns the argument i that maximizes the expression p(i|t). Besides

providing the information needed to determine the class label of a leaf node, the fraction p(i|t) can also be used to

estimate the probability that a record assigned to the leaf node t belongs to class i.

4. The stopping_cond() function is used to terminate the tree-growing process by testing whether all the records have either the same class label or the same attribute values. Another way to terminate the recursive function is to test whether the number of records has fallen below some minimum threshold.

After building the decision tree, a tree-pruning step can be performed to reduce the size of the

decision tree. Decision trees that are too large are susceptible to a phenomenon known as overfitting. Pruning helps

by trimming the branches of the initial tree in a way that improves the generalization capability of the decision tree.

5.3.6 Characteristics of Decision Tree Induction

The following is a summary of the important characteristics of decision tree

induction algorithms.

1. Decision tree induction is a nonparametric approach for building classification models. In other

words, it does not require any prior assumptions regarding the type of probability distributions

satisfied by the class and other attributes.

2. Finding an optimal decision tree is an NP-complete problem. Many decision tree algorithms

employ a heuristic-based approach to guide their search in the vast hypothesis space. For example,

the algorithm presented in Section 3.3.5 uses a greedy, top-down, recursive partitioning strategy

for growing a decision tree.

3. Techniques developed for constructing decision trees are computationally inexpensive, making

it possible to quickly construct models even when the training set size is very large. Furthermore,

once a decision tree has been built, classifying a test record is extremely fast, with a worst-case

complexity of O(w), where w is the maximum depth of the tree.

4. Decision trees, especially smaller-sized trees, are relatively easy to interpret. The accuracies of

the trees are also comparable to other classification techniques for many simple data sets.

5. Decision trees provide an expressive representation for learning discrete valued functions.

However, they do not generalize well to certain types of Boolean problems. One notable example

is the parity function, whose value is 0 (1) when there is an odd (even) number of Boolean

attributes with the value True. Accurate modeling of such a function requires a full decision tree

with 2^d nodes, where d is the number of Boolean attributes.

6. Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting are employed.

7. The presence of redundant attributes does not adversely affect the accuracy of decision trees. An

attribute is redundant if it is strongly correlated with another attribute in the data. One of the two

redundant attributes will not be used for splitting once the other attribute has been chosen.

However, if the data set contains many irrelevant attributes, i.e., attributes that are not useful for

the classification task, then some of the irrelevant attributes may be accidentally chosen during the

tree-growing process, which results in a decision tree that is larger than necessary.

8. Since most decision tree algorithms employ a top-down, recursive partitioning approach, the

number of records becomes smaller as we traverse down the tree. At the leaf nodes, the number of

records may be too small to make a statistically significant decision about the class representation

of the nodes. This is known as the data fragmentation problem. One possible solution is to disallow

further splitting when the number of records falls below a certain threshold.

9. A subtree can be replicated multiple times in a decision tree. This makes the decision tree more

complex than necessary and perhaps more difficult to interpret. Such a situation can arise from

decision tree implementations that rely on a single attribute test condition at each internal node.

Since most of the decision tree algorithms use a divide-and-conquer partitioning strategy, the same

test condition can be applied to different parts of the attribute space, thus leading to the subtree

replication problem.

10. The test conditions described so far in this chapter involve using only a single attribute at a

time. As a consequence, the tree-growing procedure can be viewed as the process of partitioning

the attribute space into disjoint regions until each region contains records of the same class. The

border between two neighboring regions of different classes is known as a decision boundary.

Constructive induction provides another way to partition the data into homogeneous,

nonrectangular regions.

5.4 Rule-Based Classification

In this section, we look at rule-based classifiers, where the learned model is represented as a set of IF-THEN rules. We first examine how such rules are used for classification. We then study ways in which they can be generated, either from a decision tree or directly from the training data using a sequential covering algorithm.

5.4.1 Using IF-THEN Rules for Classification

Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form

IF condition THEN conclusion.

An example is rule R1,

R1: IF age = youth AND student = yes THEN buys computer = yes.

The IF part (or left-hand side) of a rule is known as the rule antecedent or precondition. The THEN part (or right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (such as age = youth and student = yes) that are logically ANDed. The rule's consequent contains a class prediction (in this case, we are predicting whether a customer will buy a computer). R1 can also be written as

R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes).

If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple, we say that the rule

antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple.

A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data set, D, let ncovers be the number of tuples covered by R; ncorrect be the number of tuples correctly classified by R; and |D| be the number of tuples in D. We can define the coverage and accuracy of R as

coverage(R) = ncovers / |D|

accuracy(R) = ncorrect / ncovers

That is, a rule's coverage is the percentage of tuples that are covered by the rule (i.e., whose attribute values hold true for the rule's antecedent). For a rule's accuracy, we look at the tuples that it covers and see what percentage of them the rule can correctly classify.
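These two measures can be sketched directly in code. The rule and tuples below are illustrative, modeled on rule R1 above:

```python
# Sketch: coverage and accuracy of a single IF-THEN rule over a labeled data set.

def rule_stats(data, antecedent, consequent):
    """data: list of (attribute_dict, class_label) tuples."""
    # Tuples whose attribute values satisfy every test in the antecedent.
    covered = [(x, y) for x, y in data
               if all(x.get(a) == v for a, v in antecedent.items())]
    correct = sum(1 for _, y in covered if y == consequent)
    coverage = len(covered) / len(data)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# R1: IF age = youth AND student = yes THEN buys_computer = yes
data = [
    ({"age": "youth",  "student": "yes"}, "yes"),
    ({"age": "youth",  "student": "yes"}, "no"),
    ({"age": "youth",  "student": "no"},  "no"),
    ({"age": "senior", "student": "yes"}, "yes"),
]
cov, acc = rule_stats(data, {"age": "youth", "student": "yes"}, "yes")
print(cov, acc)  # 0.5 0.5  (R1 covers 2 of 4 tuples and classifies 1 of those 2 correctly)
```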

5.4.2 Rule Extraction from a Decision Tree

Decision tree classifiers are a popular method of classification: it is easy to understand how decision trees work, and they are known for their accuracy. Decision trees can, however, become large and difficult to interpret. In this subsection, we

look at how to build a rule based classifier by extracting IF-THEN rules from a decision tree. In comparison with a

decision tree, the IF-THEN rules may be easier for humans to understand, particularly if the decision tree is very

large.

To extract rules from a decision tree, one rule is created for each path from the root to a leaf node.

Each splitting criterion along a given path is logically ANDed to form the rule antecedent (IF part). The leaf node

holds the class prediction, forming the rule consequent (THEN part).

Example 3.4 Extracting classification rules from a decision tree. The decision tree of Figure 6.2 can

be converted to classification IF-THEN rules by tracing the path from the root node to

each leaf node in the tree.

A disjunction (logical OR) is implied between each of the extracted rules. Because the rules are extracted directly from the tree, they are mutually exclusive and exhaustive. Mutually exclusive means that we cannot have rule conflicts here because no two rules will be triggered for the same tuple. (We have one rule per leaf, and any tuple can map to only one leaf.) Exhaustive means there is one rule for each possible attribute-value combination, so that this set of rules does not require a default rule. Therefore, the order of the rules does not matter: they are unordered.

Since we end up with one rule per leaf, the set of extracted rules is not much simpler than the corresponding decision

tree! The extracted rules may be even more difficult to interpret than the original trees in some cases. As an

example, Figure 6.7 showed decision trees that suffer from subtree repetition and replication. The resulting set of

rules extracted can be large and difficult to follow, because some of the attribute tests may be irrelevant or

redundant. So, the plot thickens. Although it is easy to extract rules from a decision tree, we may need to do some

more work by pruning the resulting rule set.

5.4.3 Rule Induction Using a Sequential Covering Algorithm

IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first)

using a sequential covering algorithm. The name comes from the notion that the rules are learned sequentially (one

at a time), where each rule for a given class will ideally cover many of the tuples of that class (and hopefully none of

the tuples of other classes). Sequential covering algorithms are the most widely used approach to mining disjunctive

sets of classification rules, and form the topic of this subsection. Note that in a newer alternative approach,

classification rules can be generated using associative classification algorithms, which search for attribute-value

pairs that occur frequently in the data. These pairs may form association rules, which can be analyzed and used in

classification. Since this latter approach is based on association rule mining (Chapter 5), we prefer to defer its

treatment until later, in Section 6.8. There are many sequential covering algorithms. Popular variations include AQ,

CN2, and the more recent RIPPER. The general strategy is as follows. Rules are learned one at a time. Each time a

rule is learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples. This

sequential learning of rules is in contrast to decision tree induction. Because the path to each leaf in a decision tree

corresponds to a rule, we can consider decision tree induction as learning a set of rules simultaneously.

A basic sequential covering algorithm is shown in Figure 6.12. Here, rules are learned for one class at a time.

Ideally, when learning a rule for a class, C, we would like the rule to cover all (or many) of the training tuples of

class C and none (or few) of the tuples from other classes. In this way, the rules learned should be of high accuracy.

The rules need not necessarily be of high coverage.

Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.

Input: D, a data set of class-labeled tuples;

Att_vals, the set of all attributes and their possible values.

Output: A set of IF-THEN rules.

Method:

(1) Rule_set = { }; // initial set of rules learned is empty
(2) for each class c do
(3)     repeat
(4)         Rule = Learn_One_Rule(D, Att_vals, c);
(5)         remove tuples covered by Rule from D;
(6)     until terminating condition;
(7)     Rule_set = Rule_set + Rule; // add new rule to rule set
(8) endfor
(9) return Rule_set;
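The covering loop can be sketched in Python before continuing. This is a simplification under stated assumptions: Learn_One_Rule here picks only a single attribute test with the best accuracy (the full procedure grows multi-conjunct rules, as described later), and the dataset layout as a list of dicts is hypothetical:

```python
# A sketch of the basic sequential covering loop: rules are learned one
# at a time, and tuples covered by each new rule are removed before the
# next rule is learned.

def learn_one_rule(D, att_vals, c):
    """Return the (attribute, value) test with the best accuracy for class c."""
    best, best_acc = None, -1.0
    for att, vals in att_vals.items():
        for val in vals:
            covered = [t for t in D if t[att] == val]
            if not covered:
                continue
            acc = sum(t["class"] == c for t in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (att, val), acc
    return best

def sequential_covering(D, att_vals, classes):
    rule_set = []                                   # initially empty
    for c in classes:                               # one class at a time
        data = list(D)
        while any(t["class"] == c for t in data):   # terminating condition
            rule = learn_one_rule(data, att_vals, c)
            if rule is None:
                break
            att, val = rule
            data = [t for t in data if t[att] != val]   # remove covered tuples
            rule_set.append((att, val, c))          # add new rule to rule set
    return rule_set
```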

This is because we can have more than one rule for a class, so that different rules may cover different tuples within

the same class. The process continues until the terminating condition is met, such as when there are no more training

tuples or the quality of a rule returned is below a user-specified threshold. The Learn One Rule procedure finds the

best rule for the current class, given the current set of training tuples. How are rules learned? Typically, rules are

grown in a general-to-specific manner. We can think of this as a beam search, where we start off with an empty rule

and then gradually keep appending attribute tests to it. We append by adding the attribute test as a logical conjunct

to the existing condition of the rule antecedent. Suppose our training set, D, consists of loan application data.

Attributes regarding each applicant include their age, income, education level, residence, credit rating, and the term

of the loan. The classifying attribute is loan decision, which indicates whether a

loan is accepted (considered safe) or rejected (considered risky). To learn a rule for the class accept, we start off

with the most general rule possible, that is, the condition of the rule antecedent is empty. The rule is:

IF THEN loan decision = accept.

We then consider each possible attribute test that may be added to the rule. These can be derived from the parameter

Att_vals, which contains a list of attributes with their associated values. For example, for an attribute-value pair (att,

val), we can consider attribute tests such as att = val, att ≤ val, att > val, and so on. Typically, the training data will

contain many attributes, each of which may have several possible values. Finding an optimal rule set becomes

computationally explosive. Instead, Learn One Rule adopts a greedy depth-first strategy. Each time it is faced with

adding a new attribute test (conjunct) to the current rule, it picks the one that most improves the rule quality,

based on the training samples. We will say more about rule quality measures in a minute. For the moment, let's say

we use rule accuracy as our quality measure. Getting back to our example with Figure 6.13, suppose Learn One Rule

finds that the attribute test income = high best improves the accuracy of our current (empty) rule. We append it to

the condition, so that the current rule becomes

IF income = high THEN loan decision = accept. Each time we add an attribute test to a rule, the resulting rule should

cover more of the accept tuples. During the next iteration, we again consider the possible attribute tests and end up

selecting credit rating = excellent. Our current rule grows to become

IF income = high AND credit rating = excellent THEN loan decision = accept.

The process repeats, where at each step, we continue to greedily grow rules until the resulting rule meets an

acceptable quality level.
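The greedy general-to-specific growth just described can be sketched as follows, using rule accuracy as the quality measure. The loan tuples and attribute values below are hypothetical stand-ins; the book's Figure 6.13 data is not reproduced here:

```python
# A sketch of Learn_One_Rule's greedy growth: start from the empty
# antecedent and repeatedly append the conjunct that most improves
# accuracy, until the rule reaches an acceptable quality level.

def accuracy(rule, D, target):
    covered = [t for t in D if all(t[a] == v for a, v in rule)]
    if not covered:
        return 0.0
    return sum(t["loan_decision"] == target for t in covered) / len(covered)

def grow_rule(D, att_vals, target, min_acc=1.0):
    rule = []                       # IF <empty> THEN loan_decision = target
    while accuracy(rule, D, target) < min_acc and len(rule) < len(att_vals):
        best, best_acc = None, accuracy(rule, D, target)
        for att, vals in att_vals.items():
            if any(a == att for a, _ in rule):
                continue            # attribute already tested in this rule
            for val in vals:
                acc = accuracy(rule + [(att, val)], D, target)
                if acc > best_acc:
                    best, best_acc = (att, val), acc
        if best is None:            # no conjunct improves rule quality
            break
        rule.append(best)           # greedily keep the best conjunct
    return rule
```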

5.4.4 Rule Pruning

Learn One Rule does not employ a test set when evaluating rules. Assessments of rule quality as described above are

made with tuples from the original training data. Such assessment is optimistic because the rules will likely overfit

the data. That is, the rules may perform well on the training data, but less well on subsequent data. To compensate

for this, we can prune the rules. A rule is pruned by removing a conjunct (attribute test). We choose to prune a rule,

R, if the pruned version of R has greater quality, as assessed on an independent set of tuples. As in decision tree

pruning, we

refer to this set as a pruning set. Various pruning strategies can be used, such as the pessimistic pruning approach

described in the previous section. FOIL uses a simple yet effective method. Given a rule, R,

FOIL_Prune(R) = (pos - neg) / (pos + neg),

where pos and neg are the number of positive and negative tuples covered by R, respectively. This value will

increase with the accuracy of R on a pruning set. Therefore, if the FOIL Prune value is higher for the pruned version

of R, then we prune R. By convention, RIPPER starts with the most recently added conjunct when considering

pruning. Conjuncts are pruned one at a time as long as this results in an improvement.
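A sketch of FOIL's pruning measure, FOIL_Prune(R) = (pos - neg) / (pos + neg), and of RIPPER-style pruning of the most recently added conjunct. The evaluate callback, which returns (pos, neg) counts for a rule on the pruning set, is a hypothetical helper introduced for illustration:

```python
# FOIL_Prune increases with the accuracy of R on the pruning set, so a
# conjunct is dropped when the pruned rule scores higher.

def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

def prune_rule(rule, evaluate):
    """rule: list of conjuncts; evaluate(rule) -> (pos, neg) on a pruning set."""
    while len(rule) > 1:
        pos, neg = evaluate(rule)
        p_pos, p_neg = evaluate(rule[:-1])   # drop most recently added conjunct
        if foil_prune(p_pos, p_neg) > foil_prune(pos, neg):
            rule = rule[:-1]                 # pruning improved the score
        else:
            break                            # no further improvement
    return rule
```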

Rule-Based Classifier

Classify records by using a collection of if-then rules.

Rule: (Condition) → y

- where Condition is a conjunction of attribute tests and y is the class label
- LHS: rule antecedent or condition
- RHS: rule consequent

Examples of classification rules:

(Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds

(Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds

R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes

R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals

R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles

R5: (Live in Water = sometimes) → Amphibians

Application of Rule-Based Classifier

A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds

R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes

R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals

R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles

The rule R1 covers a hawk => Bird

The rule R3 covers the grizzly bear => Mammal
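Rule coverage and firing can be sketched directly. The lower-cased attribute keys below are an illustrative encoding of the rules R1-R4:

```python
# A rule covers an instance when every attribute test in its antecedent
# is satisfied; the first covering rule supplies the class label.

RULES = [
    ({"give_birth": "no",  "can_fly": "yes"},       "Birds"),     # R1
    ({"give_birth": "no",  "live_in_water": "yes"}, "Fishes"),    # R2
    ({"give_birth": "yes", "blood_type": "warm"},   "Mammals"),   # R3
    ({"give_birth": "no",  "can_fly": "no"},        "Reptiles"),  # R4
]

def covers(condition, instance):
    return all(instance.get(a) == v for a, v in condition.items())

def classify(instance):
    for condition, label in RULES:
        if covers(condition, instance):
            return label
    return None

hawk = {"give_birth": "no", "can_fly": "yes", "blood_type": "warm"}
bear = {"give_birth": "yes", "blood_type": "warm", "can_fly": "no"}
print(classify(hawk))   # Birds   (R1 covers the hawk)
print(classify(bear))   # Mammals (R3 covers the grizzly bear)
```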

Rule Coverage and Accuracy

Coverage of a rule: Fraction of records that satisfy the antecedent of a rule

Accuracy of a rule: Fraction of records that satisfy both the antecedent and consequent of a rule
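Both measures can be computed as follows; the table of labeled records is hypothetical:

```python
# coverage(R) = |records satisfying the antecedent| / |D|
# accuracy(R) = |records satisfying antecedent and consequent| / |covered|

def coverage_and_accuracy(condition, consequent, records):
    covered = [r for r in records
               if all(r.get(a) == v for a, v in condition.items())]
    coverage = len(covered) / len(records)
    accuracy = (sum(r["class"] == consequent for r in covered) / len(covered)
                if covered else 0.0)
    return coverage, accuracy

records = [
    {"refund": "yes", "class": "no"},
    {"refund": "yes", "class": "no"},
    {"refund": "no",  "class": "yes"},
    {"refund": "no",  "class": "no"},
]
cov, acc = coverage_and_accuracy({"refund": "yes"}, "no", records)
print(cov, acc)   # 0.5 1.0
```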

Characteristics of Rule-Based Classifier

Mutually exclusive rules: A classifier contains mutually exclusive rules if the rules are independent of each other;

every record is covered by at most one rule.

Exhaustive rules: A classifier has exhaustive coverage if it accounts for every possible combination of attribute values;

each record is covered by at least one rule.

5.5 Nearest-Neighbor Classifiers

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuplewith training

tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an

n-dimensional space. In this way all of the training tuples are stored in an n-dimensional pattern space. When given an

unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to

the unknown tuple. These k training tuples are the k nearest neighbors of the unknown tuple. Closeness is

defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or

tuples, say, X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

dist(X1, X2) = sqrt( (x11 - x21)^2 + (x12 - x22)^2 + ... + (x1n - x2n)^2 )

In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute

in tuple X1 and in tuple X2, square this difference, and accumulate it. The square root is taken of the total

accumulated distance count. Typically, we normalize the values of each attribute before using the Euclidean distance equation.

This helps prevent attributes with initially large ranges (such as income) from outweighing attributes with initially

smaller ranges (such as binary attributes). Min-max normalization, for example, can be used to transform a value v

of a numeric attribute A to v' in the range [0, 1] by computing

v' = (v - minA) / (maxA - minA)

where minA and maxA are the minimum and maximum values of attribute A. Chapter 2 describes other methods for

data normalization as a form of data transformation. For k-nearest-neighbor classification, the unknown tuple is

assigned the most common class among its k nearest neighbors. When k = 1, the unknown tuple is assigned the class

of the training tuple that is closest to it in pattern space. Nearest neighbor classifiers can also be used for prediction,

that is, to return a real-valued prediction for a given unknown tuple. In this case, the classifier returns the average

value of the real-valued labels associated with the k nearest neighbors of the unknown tuple. But how can distance

be computed for attributes that are not numeric, but categorical, such as color? The above discussion assumes that the

attributes used to describe the tuples are all numeric. For categorical attributes, a simple method is to compare the

corresponding value of the attribute in tuple X1 with that in tuple X2. If the two are identical (e.g., tuples X1 and X2

both have the color blue), then the difference between the two is taken as 0. If the two are different (e.g., tuple X1 is

blue but tuple X2 is red), then the difference is considered to be 1. Other methods may incorporate more

sophisticated schemes for differential grading (e.g., where a larger difference score is assigned, say, for blue and

white than for blue and black).
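A k-nearest-neighbor sketch combining the pieces above: min-max normalization, Euclidean distance for numeric attributes, and the simple 0/1 difference for categorical ones. All data values below are hypothetical:

```python
import math

def normalize(column):
    """Min-max normalization: map each value into [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def distance(x1, x2):
    """Euclidean distance; categorical attributes contribute 0 or 1."""
    total = 0.0
    for a, b in zip(x1, x2):
        if isinstance(a, str) or isinstance(b, str):
            total += 0.0 if a == b else 1.0
        else:
            total += (a - b) ** 2
    return math.sqrt(total)

def knn_classify(train, labels, query, k=3):
    """Assign the most common class among the k nearest neighbors."""
    ranked = sorted(range(len(train)), key=lambda i: distance(train[i], query))
    votes = [labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)

# Tuples mix a normalized numeric attribute with a categorical one.
train = [(0.1, "blue"), (0.2, "blue"), (0.9, "red")]
labels = ["A", "A", "B"]
print(knn_classify(train, labels, (0.15, "blue"), k=3))   # A
```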

What about missing values? In general, if the value of a given attribute A is missing in tuple X1 and/or in tuple

X2, we assume the maximum possible difference. Suppose that each of the attributes has been mapped to the range

[0, 1]. For categorical attributes, we take the difference value to be 1 if either one or both of the corresponding

values of A are missing. If A is numeric and missing from both tuples X1 and X2, then the difference is also taken to

be 1.

How can I determine a good value for k, the number of neighbors? This can be determined experimentally.

Starting with k = 1, we use a test set to estimate the error rate of the classifier. This process can be repeated each

time by incrementing k to allow for one more neighbor. The k value that gives the minimum error rate may be

selected. In general, the larger the number of training tuples is, the larger the value of k will be (so that classification

and prediction decisions can be based on a larger portion of the stored tuples). As the number of training tuples

approaches infinity and k = 1, the error rate can be no worse than twice the Bayes error rate (the latter being the

theoretical minimum).
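The experimental selection of k can be sketched as a simple search over candidate values; the error estimates below are hypothetical stand-ins for test-set error rates:

```python
# Starting with k = 1, estimate the error rate for each candidate k on a
# test set, incrementing k to allow one more neighbor each time, and keep
# the k that gives the minimum error.

def choose_k(error_rate, max_k):
    """error_rate(k) -> estimated test-set error rate for that k."""
    best_k, best_err = 1, error_rate(1)
    for k in range(2, max_k + 1):
        err = error_rate(k)
        if err < best_err:
            best_k, best_err = k, err
    return best_k

# Hypothetical error estimates for k = 1..5:
errors = {1: 0.30, 2: 0.25, 3: 0.20, 4: 0.22, 5: 0.28}
print(choose_k(lambda k: errors[k], 5))   # 3
```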

5.6. Introduction to Bayesian Classification

The Bayesian Classification represents a supervised learning method as well as a statistical method for classification.

Assumes an underlying probabilistic model and it allows us to capture uncertainty about the model in a principled

way by determining probabilities of the outcomes. It can solve diagnostic and predictive problems. This

classification is named after Thomas Bayes (1702-1761), who proposed Bayes' Theorem. Bayesian classification

provides practical learning algorithms and prior knowledge and observed data can be combined. Bayesian

Classification provides a useful perspective for understanding and evaluating many learning algorithms. It calculates

explicit probabilities for hypotheses and it is robust to noise in input data.

Uses of Naive Bayes classification:

1. Naive Bayes text classification: The Bayesian classification is used as a probabilistic learning method. Naive Bayes

classifiers are among the most successful known algorithms for learning to classify text documents.

2. Spam filtering (http://en.wikipedia.org/wiki/Bayesian_spam_filtering): Spam filtering is the best known use of

naive Bayesian text classification. It makes use of a naive Bayes classifier to identify spam e-mail. Bayesian spam

filtering has become a popular mechanism to distinguish illegitimate spam email from legitimate email (sometimes

called "ham"). Many modern mail clients implement Bayesian spam filtering. Users can also install

separate email filtering programs.
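A minimal naive Bayes spam-filter sketch with Laplace smoothing illustrates the idea; the toy training messages are hypothetical, and real filters train on large corpora:

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, label). Returns word counts and class priors."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Pick the class with the highest log posterior probability."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best, best_lp = None, float("-inf")
    for label in ("spam", "ham"):
        lp = math.log(totals[label] / sum(totals.values()))   # class prior
        denom = sum(counts[label].values()) + len(vocab)      # Laplace smoothing
        for word in text.lower().split():
            lp += math.log((counts[label][word] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

data = [("win money now", "spam"), ("cheap money offer", "spam"),
        ("meeting agenda today", "ham"), ("lunch today", "ham")]
counts, totals = train(data)
print(classify("win cheap money", counts, totals))   # spam
```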