
Data Warehousing & Data Mining [10CS755] Unit 5 & 6

UNIT V & VI CLASSIFICATION

5.1 Basics
The input data for a classification task is a collection of records. Each record, also known as an instance or example, is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute, designated as the class label. For example, consider a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian. The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly, and ability to live in water. The attribute set can also contain continuous features. The class label, on the other hand, must be a discrete attribute. This is a key characteristic that distinguishes classification from regression, a predictive modelling task in which y is a continuous attribute.
Definition 3.1 (Classification). Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y. The target function is also known informally as a classification model. A classification model is useful for the following purposes.
Descriptive Modeling
A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful for both biologists and others to have a descriptive model that summarizes the vertebrate data and explains which features define a vertebrate as a mammal, bird, fish, reptile, or amphibian.
Predictive Modeling
A classification model can also be used to predict the class label of unknown records. A classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record; for example, given the characteristics of a creature known as a gila monster, it can predict the class to which the creature belongs. Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories (e.g., to classify a person as a member of a high-, medium-, or low-income group) because they do not consider the implicit order among the categories. Other forms of relationships, such as the subclass/superclass relationships among categories, are also ignored.
5.2 General Approach to Solving a Classification Problem
A classification technique (or classifier) is a systematic approach to building classification
models from an input data set. Examples include decision tree classifiers, rule-based classifiers, neural networks, support vector machines, and naïve Bayes classifiers. Each technique employs a
learning algorithm to identify a model that best fits the relationship between the attribute set and
class label of the input data. The model generated by a learning algorithm should both fit the
input data well and correctly predict the class labels of records it has never seen before.
Therefore, a key objective of the learning algorithm is to build models with good generalization
capability; i.e., models that accurately predict the class labels of previously unknown records.
Figure 3.3. General approach for building a classification model.
Figure 3.3 shows a general approach for solving classification problems. First, a training set
consisting of records whose class labels are known must be provided. The training set is used to
build a classification model, which is subsequently applied to the test set, which consists of records
with unknown class labels.
Table 3.2. Confusion matrix for a 2-class problem
Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a table known as a confusion matrix. Table 3.2 depicts the confusion matrix for a binary classification problem. Each entry fij in this table denotes the number of records from class i predicted to be of class j. For instance, f01 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the total number of incorrect predictions is (f10 + f01). Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number makes it more convenient to compare the performance of different models. This can be done using a performance metric such as accuracy, which is defined as follows:
Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00).   (3.1)
Equivalently, the performance of a model can be expressed in terms of its error rate, which is given by the following equation:
Error rate = (f10 + f01) / (f11 + f10 + f01 + f00).   (3.2)
Most classification algorithms seek models that attain the highest accuracy, or equivalently, the lowest error rate, when applied to the test set.
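As a quick illustration of Equations (3.1) and (3.2), the following Python sketch computes accuracy and error rate from the four counts of a two-class confusion matrix. The counts used in the example are made up for illustration and are not taken from the text.

# Minimal sketch: accuracy and error rate from a 2-class confusion matrix.
def accuracy(f11, f10, f01, f00):
    # Fraction of test records predicted correctly (Equation 3.1).
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    # Fraction of test records predicted incorrectly (Equation 3.2).
    return (f10 + f01) / (f11 + f10 + f01 + f00)

# f_ij = number of records from class i predicted as class j (illustrative counts).
f11, f10, f01, f00 = 40, 10, 5, 45
print(accuracy(f11, f10, f01, f00))    # 0.85
print(error_rate(f11, f10, f01, f00))  # 0.15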
5.3 Decision Tree Induction
This section introduces a decision tree classifier, which is a simple yet widely used classification
technique.
5.3.1 How a Decision Tree Works
To illustrate how classification with a decision tree works, consider a simpler version of the
vertebrate classification problem described in the previous section. Instead of classifying the
vertebrates into five distinct groups of species, we assign them to two categories: mammals and
non-mammals. Suppose a new species is discovered by scientists. How can we tell whether it is a
mammal or a non-mammal? One approach is to pose a series of questions about the characteristics
of the species. The first question we may ask is whether the species is cold- or warm-blooded.
If it is cold-blooded, then it is definitely not a mammal. Otherwise, it is either a bird or a mammal.
In the latter case, we need to ask a follow-up question: Do the females of the species give birth to
their young? Those that do give birth are definitely mammals, while those that do not are likely to be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny anteater). The previous example illustrates how we can solve a classification problem by asking a
series of carefully crafted questions about the attributes of the test record. Each time we
receive an answer, a follow-up question is asked until we reach a conclusion about the class
label of the record. The series of questions and their possible answers can be organized in the form
of a decision tree, which is a hierarchical structure consisting of nodes and directed edges. Figure 5.4 shows the decision tree for the mammal classification problem. The tree has three types of
nodes:
A root node that has no incoming edges and zero or more outgoing edges.
Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
In a decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions to separate records that have different characteristics. For example, the root node shown in Figure 5.4 uses the attribute Body Temperature to separate warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mammals is created as the right child of the root node. If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-blooded creatures, which are mostly birds.
Figure 5.4. A decision tree for the mammal classification problem.
Classifying a test record is straightforward once a decision tree has been constructed. Starting from
the root node, we apply the test condition to the record and follow the appropriate branch based on
the outcome of the test.
This will lead us either to another internal node, for which a new test condition is applied, or
to a leaf node. The class label associated with the leaf node is then assigned to the record. As an
illustration, Figure 5.5 traces the path in the decision tree that is used to predict the class label of a flamingo. The path terminates at a leaf node labeled Non-mammals.
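The classification step just described can be sketched in a few lines of Python. The Node structure and the two-level mammal tree below are hypothetical stand-ins for the tree of Figure 5.4, intended only to show how a record is routed from the root to a leaf.

# Minimal sketch: classifying a record by traversing a decision tree.
class Node:
    def __init__(self, label=None, attribute=None, branches=None):
        self.label = label               # set only for leaf nodes
        self.attribute = attribute       # set only for internal nodes
        self.branches = branches or {}   # outcome -> child Node

def classify(record, node):
    # Follow the test outcomes until a leaf node is reached.
    while node.label is None:
        node = node.branches[record[node.attribute]]
    return node.label

# A hypothetical tree for the mammal classification problem.
tree = Node(attribute="Body Temperature", branches={
    "cold-blooded": Node(label="Non-mammals"),
    "warm-blooded": Node(attribute="Gives Birth", branches={
        "yes": Node(label="Mammals"),
        "no":  Node(label="Non-mammals")})})

flamingo = {"Body Temperature": "warm-blooded", "Gives Birth": "no"}
print(classify(flamingo, tree))  # Non-mammals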
5.3.2 How to Build a Decision Tree
There are exponentially many decision trees that can be constructed from a given set of
attributes. While some of the trees are more accurate than others, finding the optimal tree is
computationally infeasible because of the exponential size of the search space. Nevertheless,
efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal,
decision tree in a reasonable amount of time.
These algorithms usually employ a greedy strategy that grows a decision tree by making a series of locally optimal decisions about which attribute to use for partitioning the data. One such algorithm is Hunt's algorithm, which is the basis of many existing decision tree induction algorithms, including ID3, C4.5, and CART.
This section presents a high-level discussion of Hunt's algorithm and illustrates some of its design issues.
Figure 5.5. Classifying an unlabeled vertebrate.
The dashed lines represent the outcomes of applying various attribute test conditions on the
unlabeled vertebrate. The vertebrate is eventually assigned to the Non-mammal class.
Hunt's Algorithm
In Hunt's algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets. Let Dt be the set of training records that are associated with node t and y = {y1, y2, . . . , yc} be the class labels. The following is a recursive definition of Hunt's algorithm.
Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in Dt are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node.
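A compact Python rendering of the two recursive steps is sketched below. It is not the textbook's exact formulation: records are assumed to be dictionaries with a "class" key, find_best_test is a hypothetical helper that returns a function mapping a record to a branch outcome, and empty or unsplittable subsets are turned into leaves labeled with a majority class.

# Minimal sketch of Hunt's algorithm (recursive decision tree growing).
from collections import Counter

def majority_class(records):
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def hunt(records, find_best_test, parent_majority=None):
    if not records:                        # empty subset: inherit the parent's majority class
        return {"label": parent_majority}
    classes = {r["class"] for r in records}
    if len(classes) == 1:                  # Step 1: a pure node becomes a leaf
        return {"label": classes.pop()}
    test = find_best_test(records)
    if test is None:                       # records cannot be split any further
        return {"label": majority_class(records)}
    # Step 2: partition the records by test outcome and recurse on each child.
    partitions = {}
    for r in records:
        partitions.setdefault(test(r), []).append(r)
    children = {outcome: hunt(subset, find_best_test, majority_class(records))
                for outcome, subset in partitions.items()}
    return {"test": test, "children": children}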
Figure 3.6. Training set for predicting borrowers who will default on loan payments.
To illustrate how the algorithm works, consider the problem of predicting whether a loan applicant will repay her loan obligations or become delinquent, subsequently defaulting on her loan. A training set for this problem can be constructed by examining the records of previous borrowers. In the example shown in Figure 3.6, each record contains the personal information of a borrower along with a class label indicating whether the borrower has defaulted on loan payments.
The initial tree for the classification problem contains a single node with class label Defaulted = No (see Figure 3.7(a)), which means that most of the borrowers successfully repaid their loans. The tree, however, needs to be refined since the root node contains records from both classes. The records are subsequently divided into smaller subsets based on the outcomes of the Home Owner test condition, as shown in Figure 3.7(b). The justification for choosing this attribute test condition will be discussed later. For now, we will assume that this is the best criterion for splitting the data at this point. Hunt's algorithm is then applied recursively to each child of the root node. From the training set given in Figure 3.6, notice that all borrowers who are home owners successfully repaid their loans. The left child of the root is therefore a leaf node labeled Defaulted = No (see Figure 3.7(b)). For the right child, we need to continue applying the recursive step of Hunt's algorithm until all the records belong to the same class. The trees resulting from each recursive step are shown in Figures 3.7(c) and (d).
Figure 3.7. Hunt's algorithm for inducing decision trees.
Hunt's algorithm will work if every combination of attribute values is present in the training data and each combination has a unique class label. These assumptions are too stringent for use in most practical situations. Additional conditions are needed to handle the following cases:
1. It is possible for some of the child nodes created in Step 2 to be empty; i.e., there are no
records associated with these nodes. This can happen if none of the training records have the
combination of attribute values associated with such nodes. In this case the node is declared a
leaf node with the same class label as the majority class of training records associated with its
parent node.
2. In Step 2, if all the records associated with Dt have identical attribute values (except for
the class label), then it is not possible to split these records any further. In this case, the node is
declared a leaf node with the same class label as the majority class of training records associated
with this node.
Design Issues of Decision Tree Induction
A learning algorithm for inducing decision trees must address the following two issues.
a) How should the training records be split? Each recursive step of the tree-growing
process must select an attribute test condition to divide the records into smaller
subsets. To implement this step, the algorithm must provide a method for specifying
the test condition for different attribute types as well as an objective measure for
evaluating the goodness of each test condition.
b) How should the splitting procedure stop? A stopping condition is needed to terminate the
tree-growing process. A possible strategy is to continue expanding a node until either all the
records belong to the same class or all the records have identical attribute values. Although
both conditions are sufficient to stop any decision tree induction algorithm, other
criteria can be imposed to allow the tree-growing procedure to terminate earlier.
5.3.3 Methods for Expressing Attribute Test Conditions
Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types.
Binary Attributes:
The test condition for a binary attribute generates two potential outcomes, as shown in Figure 3.8 (for example, Body Temperature = warm-blooded or cold-blooded).
Figure 3.8 Test condition for binary attributes.
Figure 3.9 Test conditions for nominal attributes.
Nominal Attributes:
Since a nominal attribute can have many values, its test condition can be expressed in
two ways, as shown in Figure 3.9. For a multiway split (Figure 3.9(a)), the number of
outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split. On the other hand, some decision tree algorithms, such as CART, produce only binary splits by considering all 2^(k-1) - 1 ways of creating a binary partition of k attribute values. Figure 3.9(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.
Ordinal Attributes:
Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can
be grouped as long as the grouping does not violate the order property of the attribute values.
Figure 3.10 illustrates various ways of splitting training records based on the Shirt Size attribute.
The groupings shown in Figures 3.10(a) and (b) preserve the order among the attribute
values, whereas the grouping shown in Figure 3.10(c) violates this property because it combines
the attribute values Small and Large into the same partition while Medium and Extra Large are
combined into another partition.
Figure 3.10 Different ways of grouping ordinal attribute values.
Continuous Attributes:
For continuous attributes, the test condition can be expressed as a comparison test (A < v) or (A >= v) with binary outcomes, or a range query with outcomes of the form vi <= A < vi+1, for i = 1, . . . , k. The difference between these approaches is shown in Figure 3.11. For the binary case, the decision tree algorithm must consider all possible split positions v and select the one that produces the best partition. For the multiway split, the algorithm must consider all possible ranges of continuous values. One approach is to apply the discretization strategies described earlier. After discretization, a new ordinal value is assigned to each discretized interval. Adjacent intervals can also be aggregated into wider ranges as long as the order property is preserved.
Figure 3.11 Test condition for continuous attributes.
Figure 3.12 Multiway versus binary splits.
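To make the different split types concrete, the sketch below enumerates the 2^(k-1) - 1 binary partitions of a nominal attribute (the kind of candidate splits CART considers) and the candidate split positions of a continuous attribute. The attribute values are illustrative only.

# Minimal sketch: generating candidate attribute test conditions.
from itertools import combinations

def binary_partitions(values):
    # All 2^(k-1) - 1 ways of splitting a set of nominal values into two groups.
    values = sorted(values)
    parts = []
    for size in range(1, len(values)):
        # Keep values[0] on the right-hand side to avoid counting mirror partitions twice.
        for left in combinations(values[1:], size):
            right = tuple(v for v in values if v not in left)
            parts.append((left, right))
    return parts

def candidate_split_positions(numeric_values):
    # Midpoints between adjacent distinct values of a continuous attribute.
    vals = sorted(set(numeric_values))
    return [(a + b) / 2 for a, b in zip(vals, vals[1:])]

print(binary_partitions(["single", "married", "divorced"]))   # 3 candidate binary splits
print(candidate_split_positions([60, 70, 75, 85, 90]))        # [65.0, 72.5, 80.0, 87.5]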
5.3.4 Measures for Selecting the Best Split
There are many measures that can be used to determine the best way to split the records.
These measures are defined in terms of the class distribution of the records before and after splitting.
Let p(i|t) denote the fraction of records belonging to class i at a given node t. We sometimes omit the reference to node t and express the fraction as pi. In a two-class problem, the class distribution at any node can be written as (p0, p1), where p1 = 1 - p0. The class distribution before splitting is (0.5, 0.5) because there are an equal number of records from each class.
If we split the data using the Gender attribute, then the class distributions of the child nodes are
(0.6, 0.4) and (0.4, 0.6), respectively. Although the classes are no longer evenly distributed, the
child nodes still contain records from both classes. Splitting on the second attribute, Car Type will
result in purer partitions.
The measures developed for selecting the best split are often based on the degree of impurity
of the child nodes. The smaller the degree of impurity, the more skewed the class distribution. For
example, a node with class distribution (0, 1) has zero impurity, whereas a node with uniform class
distribution (0.5, 0.5) has the highest impurity. Examples of impurity measures include
Entropy(t) = - sum_{i=0}^{c-1} p(i|t) log2 p(i|t),   (3.3)
Gini(t) = 1 - sum_{i=0}^{c-1} [p(i|t)]^2,   (3.4)
Classification error(t) = 1 - max_i [p(i|t)],   (3.5)
where c is the number of classes and 0 log2 0 = 0 is assumed in entropy calculations.
Figure 3.13 Comparison among the impurity measures for binary classification problems.
Figure 3.13 compares the values of the impurity measures for binary classification problems.
p refers to the fraction of records that belong to one of the two classes. Observe that all three
measures attain their maximum value when the class distribution is uniform (i.e., when p = 0.5).
The minimum values for the measures are attained when all the records belong to the same class
(i.e., when p equals 0 or 1). We next provide several examples of computing the different impurity
measures.
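The three impurity measures of Equations (3.3)-(3.5), and the size-weighted impurity used to compare a candidate split against its parent node, follow directly from the class distribution at a node. The sketch below is a minimal illustration; the class fractions passed in are assumed to sum to one.

# Minimal sketch of the impurity measures and of the gain of a candidate split.
from math import log2

def entropy(p):
    # p is the list of class fractions p(i|t) at a node; 0*log2(0) is taken as 0.
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1 - sum(pi * pi for pi in p)

def classification_error(p):
    return 1 - max(p)

def split_gain(parent_dist, child_dists, child_sizes, impurity=entropy):
    # Impurity of the parent minus the size-weighted impurity of the children.
    n = sum(child_sizes)
    weighted = sum(size / n * impurity(dist)
                   for dist, size in zip(child_dists, child_sizes))
    return impurity(parent_dist) - weighted

print(entropy([0.5, 0.5]))    # 1.0, the maximum for a two-class problem
print(gini([0.0, 1.0]))       # 0.0, a pure node
print(split_gain([0.5, 0.5], [[0.6, 0.4], [0.4, 0.6]], [10, 10]))  # small positive gain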
Gain Ratio
Impurity measures such as entropy and Gini index tend to favor attributes that have a large
number of distinct values. Comparing the first test condition, Gender, with the second, Car
Type, it is easy to see that Car Type seems to provide a better way of splitting the data since
it produces purer descendent nodes. However, if we compare both conditions with Customer ID,
the latter appears to produce purer partitions.
Yet Customer ID is not a predictive attribute because its value is unique for each record.
Even in a less extreme situation, a test condition that results in a large number of outcomes may not be desirable because the number of records associated with each partition is too small to enable us to make any reliable predictions.
There are two strategies for overcoming this problem. The first strategy is to restrict the test
conditions to binary splits only. This strategy is employed by decision tree algorithms such as
CART. Another strategy is to modify the splitting criterion to take into account the number of
outcomes produced by the attribute test condition. For example, in the C4.5 decision tree algorithm, a splitting criterion known as gain ratio is used to determine the goodness of a split. This criterion is defined as follows:
Gain ratio = Δinfo / Split Info,
where Split Info = - sum_{i=1}^{k} P(vi) log2 P(vi) and k is the total number of splits. For example, if each attribute value has the same number of records, then P(vi) = 1/k for every i and the split information would be equal to log2 k. This example suggests that if an attribute produces a large number of splits, its split information will also be large, which in turn reduces its gain ratio.
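Following the formula above, the gain ratio can be sketched as below: split_info is computed from the fraction of records sent to each child, and delta_info stands for the information gain of the split (for instance, the split_gain value from the earlier impurity sketch). The numbers in the example are illustrative.

# Minimal sketch of the C4.5 gain ratio criterion.
from math import log2

def split_info(child_sizes):
    # Split Info = -sum P(v_i) log2 P(v_i), where P(v_i) is the fraction of records in child i.
    n = sum(child_sizes)
    return -sum((s / n) * log2(s / n) for s in child_sizes if s > 0)

def gain_ratio(delta_info, child_sizes):
    si = split_info(child_sizes)
    return delta_info / si if si > 0 else 0.0

# A split into k equal-sized children has Split Info = log2(k),
# so a large number of outcomes drives the gain ratio down.
print(split_info([25, 25, 25, 25]))       # 2.0 == log2(4)
print(gain_ratio(0.5, [25, 25, 25, 25]))  # 0.25
print(gain_ratio(0.5, [50, 50]))          # 0.5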
5.3.5 Algorithm for Decision Tree Induction
A skeleton decision tree induction algorithm called TreeGrowth is shown in Algorithm 4.1. The input to this algorithm consists of the training records E and the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the leaf nodes of the tree (Steps 11 and 12) until the stopping criterion is met (Step 1). The details of this algorithm are explained below:
1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as node.label.
2. The find_best_split() function determines which attribute should be selected as the test condition for splitting the training records. As previously noted, the choice of test condition depends on which impurity measure is used to determine the goodness of a split. Some widely used measures include entropy, the Gini index, and the χ2 statistic.
3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training records from class i associated with the node t. In most cases, the leaf node is assigned to the class that has the majority number of training records:
leaf.label = argmax_i p(i|t),
where the argmax operator returns the argument i that maximizes the expression p(i|t). Besides providing the information needed to determine the class label of a leaf node, the fraction p(i|t) can also be used to estimate the probability that a record assigned to the leaf node t belongs to class i.
4. The stopping_cond() function is used to terminate the tree-growing process by testing whether all the records have either the same class label or the same attribute values. Another way to terminate the recursive function is to test whether the number of records has fallen below some minimum threshold. A sketch of this skeleton is given below.
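The following Python sketch mirrors the structure of the skeleton just described. It is not Algorithm 4.1 line for line: stopping_cond and find_best_split are assumed helpers playing the roles discussed above, find_best_split is assumed to return a function that maps a record to a branch outcome, and records are assumed to be dictionaries with a "class" key.

# Minimal sketch of the TreeGrowth skeleton described above.
from collections import Counter

class TreeNode:
    def __init__(self):
        self.test_cond = None   # record -> outcome, set for internal nodes
        self.label = None       # class label, set for leaf nodes
        self.children = {}      # outcome -> child TreeNode

def classify_leaf(records):
    # leaf.label = argmax_i p(i|t): the majority class of the records at the node.
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def tree_growth(records, attributes, stopping_cond, find_best_split):
    node = TreeNode()                        # createNode()
    if stopping_cond(records, attributes):   # pure node, identical attributes, or too few records
        node.label = classify_leaf(records)
        return node
    node.test_cond = find_best_split(records, attributes)
    partitions = {}
    for r in records:
        partitions.setdefault(node.test_cond(r), []).append(r)
    for outcome, subset in partitions.items():
        node.children[outcome] = tree_growth(subset, attributes,
                                             stopping_cond, find_best_split)
    return node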
After building the decision tree, a tree-pruning step can be performed to reduce the size of the
decision tree. Decision trees that are too large are susceptible to a phenomenon known as overfitting. Pruning helps
by trimming the branches of the initial tree in a way that improves the generalization capability of the decision tree.
5.3.6 Characteristics of Decision Tree Induction
The following is a summary of the important characteristics of decision tree
induction algorithms.
1. Decision tree induction is a nonparametric approach for building classification models. In other
words, it does not require any prior assumptions regarding the type of probability distributions
satisfied by the class and other attributes.
2. Finding an optimal decision tree is an NP-complete problem. Many decision tree algorithms
employ a heuristic-based approach to guide their search in the vast hypothesis space. For example,
the algorithm presented in Section 3.3.5 uses a greedy, top-down, recursive partitioning strategy
for growing a decision tree.
3. Techniques developed for constructing decision trees are computationally inexpensive, making
it possible to quickly construct models even when the training set size is very large. Furthermore,
once a decision tree has been built, classifying a test record is extremely fast, with a worst-case
complexity of O(w), where w is the maximum depth of the tree.
4. Decision trees, especially smaller-sized trees, are relatively easy to interpret. The accuracies of
the trees are also comparable to other classification techniques for many simple data sets.
5. Decision trees provide an expressive representation for learning discrete valued functions.
However, they do not generalize well to certain types of Boolean problems. One notable example is the parity function, whose value is 0 (1) when there is an odd (even) number of Boolean attributes with the value True. Accurate modeling of such a function requires a full decision tree with 2^d nodes, where d is the number of Boolean attributes.
6. Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting are employed.
7. The presence of redundant attributes does not adversely affect the accuracy of decision trees. An
attribute is redundant if it is strongly correlated with another attribute in the data. One of the two
redundant attributes will not be used for splitting once the other attribute has been chosen.
However, if the data set contains many irrelevant attributes, i.e., attributes that are not useful for the classification task, then some of the irrelevant attributes may be accidentally chosen during the tree-growing process, which results in a decision tree that is larger than necessary.
8. Since most decision tree algorithms employ a top-down, recursive partitioning approach, the
number of records becomes smaller as we traverse down the tree. At the leaf nodes, the number of
records may be too small to make a statistically significant decision about the class representation
of the nodes. This is known as the data fragmentation problem. One possible solution is to disallow
further splitting when the number of records falls below a certain threshold.
9. A subtree can be replicated multiple times in a decision tree. This makes the decision tree more complex than necessary and perhaps more difficult to interpret. Such a situation can arise from
decision tree implementations that rely on a single attribute test condition at each internal node.
Since most of the decision tree algorithms use a divide-and-conquer partitioning strategy, the same
test condition can be applied to different parts of the attribute space, thus leading to the subtree
replication problem.
10. The test conditions described so far in this chapter involve using only a single attribute at a
time. As a consequence, the tree-growing procedure can be viewed as the process of partitioning
the attribute space into disjoint regions until each region contains records of the same class. The
border between two neighboring regions of different classes is known as a decision boundary.
Constructive induction provides another way to partition the data into homogeneous, nonrectangular regions.
5.4 Rule-Based Classification
In this section, we look at rule-based classifiers, where the learned model is represented as a set of IF-THEN rules. We first examine how such rules are used for classification. We then study ways in which they can be generated, either from a decision tree or directly from the training data using a sequential covering algorithm.
5.4.1 Using IF-THEN Rules for Classification
Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form
IF condition THEN conclusion.
An example is rule R1,
R1: IF age = youth AND student = yes THEN buys computer = yes.
The IF-part (or left-hand side) of a rule is known as the rule antecedent or precondition. The THEN-part (or right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (such as age = youth and student = yes) that are logically ANDed. The rule's consequent contains a class prediction (in this case, we are predicting whether a customer will buy a computer). R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes).
If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple, we say that the rule
antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple.
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data set, D, let ncovers be the number of tuples covered by R, ncorrect be the number of tuples correctly classified by R, and |D| be the number of tuples in D. We can define the coverage and accuracy of R as
coverage(R) = ncovers / |D|,
accuracy(R) = ncorrect / ncovers.
That is, a rule's coverage is the percentage of tuples that are covered by the rule (i.e., whose attribute values hold true for the rule's antecedent). For a rule's accuracy, we look at the tuples that it covers and see what percentage of them the rule can correctly classify.
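Both measures can be computed with a single scan of the data set. In the sketch below a rule is represented as a dictionary of attribute tests plus a predicted class, and tuples are dictionaries with a "class" key; these are illustrative choices rather than the text's notation.

# Minimal sketch: coverage and accuracy of an IF-THEN rule over a data set.
def covers(rule, tup):
    # True if every attribute test in the rule antecedent holds for the tuple.
    return all(tup.get(attr) == val for attr, val in rule["antecedent"].items())

def coverage_and_accuracy(rule, data):
    ncovers = sum(1 for t in data if covers(rule, t))
    ncorrect = sum(1 for t in data if covers(rule, t) and t["class"] == rule["consequent"])
    coverage = ncovers / len(data) if data else 0.0
    accuracy = ncorrect / ncovers if ncovers else 0.0
    return coverage, accuracy

r1 = {"antecedent": {"age": "youth", "student": "yes"}, "consequent": "buys_computer=yes"}
data = [
    {"age": "youth",  "student": "yes", "class": "buys_computer=yes"},
    {"age": "youth",  "student": "yes", "class": "buys_computer=no"},
    {"age": "senior", "student": "no",  "class": "buys_computer=no"},
]
print(coverage_and_accuracy(r1, data))  # (0.666..., 0.5)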
5.4.2 Rule Extraction from a Decision Tree
Decision tree classifiers are a popular method of classification: it is easy to understand how decision trees work, and they are known for their accuracy. Decision trees can, however, become large and difficult to interpret. In this subsection, we look at how to build a rule-based classifier by extracting IF-THEN rules from a decision tree. In comparison with a decision tree, the IF-THEN rules may be easier for humans to understand, particularly if the decision tree is very large.
To extract rules from a decision tree, one rule is created for each path from the root to a leaf node.
Each splitting criterion along a given path is logically ANDed to form the rule antecedent (IF part). The leaf node
holds the class prediction, forming the rule consequent (THEN part).
Example 3.4 Extracting classification rules from a decision tree. The decision tree of Figure 6.2 can
be converted to classification IF-THEN rules by tracing the path from the root node to
each leaf node in the tree.
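Extracting one rule per root-to-leaf path amounts to a depth-first walk that accumulates the splitting criteria along the way. The node layout below (dictionaries with either a label or an attribute and branches) is a hypothetical stand-in for the tree in the figure.

# Minimal sketch: extract one IF-THEN rule per root-to-leaf path of a decision tree.
def extract_rules(node, conjuncts=()):
    if "label" in node:   # leaf node: emit one rule for this path
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conjuncts) or "TRUE"
        return [f"IF {antecedent} THEN class = {node['label']}"]
    rules = []
    for outcome, child in node["branches"].items():
        rules += extract_rules(child, conjuncts + ((node["attribute"], outcome),))
    return rules

tree = {"attribute": "Body Temperature", "branches": {
    "cold-blooded": {"label": "Non-mammals"},
    "warm-blooded": {"attribute": "Gives Birth", "branches": {
        "yes": {"label": "Mammals"},
        "no":  {"label": "Non-mammals"}}}}}

for rule in extract_rules(tree):
    print(rule)
# IF Body Temperature = cold-blooded THEN class = Non-mammals
# IF Body Temperature = warm-blooded AND Gives Birth = yes THEN class = Mammals
# IF Body Temperature = warm-blooded AND Gives Birth = no THEN class = Non-mammals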
A disjunction (logical OR) is implied between each of the extracted rules. Because the rules are extracted directly
from the tree, they are mutually exclusive and exhaustive. By mutually exclusive, this means that we cannot have
rule conflicts here because no two rules will be triggered for the same tuple. (We have one rule per leaf, and any
tuple can map to only one leaf.) By exhaustive, there is one rule for each possible attribute-value combination, so
that this set of rules does not require a default rule. Therefore, the order of the rules does not matter: they are unordered.
Since we end up with one rule per leaf, the set of extracted rules is not much simpler than the corresponding decision tree! The extracted rules may be even more difficult to interpret than the original tree in some cases. As an example, Figure 6.7 showed decision trees that suffer from subtree repetition and replication. The resulting set of rules extracted can be large and difficult to follow, because some of the attribute tests may be irrelevant or redundant. So, the plot thickens. Although it is easy to extract rules from a decision tree, we may need to do some more work by pruning the resulting rule set.
5.4.3 Rule Induction Using a Sequential Covering Algorithm
IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first)
using a sequential covering algorithm. The name comes from the notion that the rules are learned sequentially (one
at a time), where each rule for a given class will ideally cover many of the tuples of that class (and hopefully none of
the tuples of other classes). Sequential covering algorithms are the most widely used approach to mining disjunctive
sets of classification rules, and form the topic of this subsection. Note that in a newer alternative approach,
classification rules can be generated using associative classification algorithms, which search for attribute-value
pairs that occur frequently in the data. These pairs may form association rules, which can be analyzed and used in
classification. Since this latter approach is based on association rule mining (Chapter 5), we prefer to defer its
treatment until later, in Section 6.8. There are many sequential covering algorithms. Popular variations include AQ, CN2, and the more recent RIPPER. The general strategy is as follows. Rules are learned one at a time. Each time a
rule is learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples. This
sequential learning of rules is in contrast to decision tree induction. Because the path to each leaf in a decision tree
corresponds to a rule, we can consider decision tree induction as learning a set of rules simultaneously.
A basic sequential covering algorithm is shown in Figure 6.12. Here, rules are learned for one class at a time.
Ideally, when learning a rule for a class, Ci, we would like the rule to cover all (or many) of the training tuples of class Ci and none (or few) of the tuples from other classes. In this way, the rules learned should be of high accuracy.
The rules need not necessarily be of high coverage.
Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.
Input: D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
(1) Rule_set = { }; // initial set of rules learned is empty
(2) for each class c do
(3) repeat
(4) Rule = Learn_One_Rule(D, Att_vals, c);
(5) remove tuples covered by Rule from D;
(6) until terminating condition;
(7) Rule_set = Rule_set + Rule; // add new rule to rule set
(8) endfor
(9) return Rule_set;
This is because we can have more than one rule for a class, so that different rules may cover different tuples within
the same class. The process continues until the terminating condition is met, such as when there are no more training
tuples or the quality of a rule returned is below a user-specified threshold. The Learn_One_Rule procedure finds the best rule for the current class, given the current set of training tuples. How are rules learned? Typically, rules are grown in a general-to-specific manner. We can think of this as a beam search, where we start off with an empty rule and then gradually keep appending attribute tests to it. We append by adding the attribute test as a logical conjunct
to the existing condition of the rule antecedent. Suppose our training set, D, consists of loan application data.
Attributes regarding each applicant include their age, income, education level, residence, credit rating, and the term
of the loan. The classifying attribute is loan decision, which indicates whether a
loan is accepted (considered safe) or rejected (considered risky). To learn a rule for the class accept, we start off
with the most general rule possible, that is, the condition of the rule antecedent is empty. The rule is:
IF THEN loan decision = accept.
We then consider each possible attribute test that may be added to the rule. These can be derived from the parameter Att_vals, which contains a list of attributes with their associated values. For example, for an attribute-value pair (att, val), we can consider attribute tests such as att = val, att <= val, att > val, and so on. Typically, the training data will contain many attributes, each of which may have several possible values. Finding an optimal rule set becomes computationally explosive. Instead, Learn_One_Rule adopts a greedy depth-first strategy. Each time it is faced with adding a new attribute test (conjunct) to the current rule, it picks the one that most improves the rule quality, based on the training samples. We will say more about rule quality measures in a minute. For the moment, let's say we use rule accuracy as our quality measure. Getting back to our example with Figure 6.13, suppose Learn_One_Rule finds that the attribute test income = high best improves the accuracy of our current (empty) rule. We append it to
the condition, so that the current rule becomes
IF income = high THEN loan decision = accept. Each time we add an attribute test to a rule, the resulting rule should
cover more of the accept tuples. During the next iteration, we again consider the possible attribute tests and end up
selecting credit rating = excellent. Our current rule grows to become
IF income = high AND credit rating = excellent THEN loan decision = accept.
The process repeats, where at each step, we continue to greedily grow rules until the resulting rule meets an
acceptable quality level.
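A greedy, general-to-specific Learn_One_Rule might look like the sketch below: it starts from an empty antecedent and repeatedly appends the equality test that most improves rule accuracy on the training tuples. Using plain rule accuracy as the quality measure and considering only att = val tests are simplifying assumptions, not the behaviour of any particular published algorithm.

# Minimal sketch of a greedy, general-to-specific Learn_One_Rule.
def rule_accuracy(antecedent, target_class, data):
    covered = [t for t in data if all(t.get(a) == v for a, v in antecedent.items())]
    if not covered:
        return 0.0
    return sum(1 for t in covered if t["class"] == target_class) / len(covered)

def learn_one_rule(data, att_vals, target_class, max_conjuncts=5):
    antecedent = {}    # most general rule: IF <empty condition> THEN target_class
    best_quality = rule_accuracy(antecedent, target_class, data)
    for _ in range(max_conjuncts):
        best_test = None
        for att, vals in att_vals.items():
            if att in antecedent:
                continue
            for val in vals:
                quality = rule_accuracy({**antecedent, att: val}, target_class, data)
                if quality > best_quality:
                    best_quality, best_test = quality, (att, val)
        if best_test is None:   # no conjunct improves the rule any further
            break
        antecedent[best_test[0]] = best_test[1]
    return antecedent, target_class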
5.4.4 Rule Pruning
Learn_One_Rule does not employ a test set when evaluating rules. Assessments of rule quality as described above are
made with tuples from the original training data. Such assessment is optimistic because the rules will likely overfit
the data. That is, the rules may perform well on the training data, but less well on subsequent data. To compensate
for this, we can prune the rules. A rule is pruned by removing a conjunct (attribute test). We choose to prune a rule,
R, if the pruned version of R has greater quality, as assessed on an independent set of tuples. As in decision tree
pruning, we
refer to this set as a pruning set. Various pruning strategies can be used, such as the pessimistic pruning approach
described in the previous section. FOIL uses a simple yet effective method. Given a rule, R,
FOIL_Prune(R) = (pos - neg) / (pos + neg),
where pos and neg are the number of positive and negative tuples covered by R, respectively. This value will increase with the accuracy of R on a pruning set. Therefore, if the FOIL_Prune value is higher for the pruned version of R, then we prune R. By convention, RIPPER starts with the most recently added conjunct when considering pruning. Conjuncts are pruned one at a time as long as this results in an improvement.
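The pruning test described above reduces to comparing the FOIL_Prune value before and after dropping a conjunct. A sketch, assuming count_pos_neg is a helper that counts the positive and negative tuples a rule covers on a separate pruning set:

# Minimal sketch of FOIL-style rule pruning.
def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg) if (pos + neg) else 0.0

def prune_rule(conjuncts, count_pos_neg):
    # Starting from the most recently added conjunct, drop conjuncts one at a
    # time as long as the FOIL_Prune value improves on the pruning set.
    while conjuncts:
        full = foil_prune(*count_pos_neg(conjuncts))
        pruned = foil_prune(*count_pos_neg(conjuncts[:-1]))
        if pruned > full:
            conjuncts = conjuncts[:-1]
        else:
            break
    return conjuncts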
Rule-Based Classifier
A rule-based classifier classifies records by using a collection of if-then rules.
Rule: (Condition) → y
- where Condition is a conjunction of attribute tests and y is the class label
- LHS: rule antecedent or condition
- RHS: rule consequent
- Examples of classification rules:
(Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
(Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
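To see these rules in action, the short sketch below checks which rules cover a given instance; the hawk and grizzly bear records are illustrative.

# Minimal sketch: applying rules R1-R5 to an instance.
rules = [
    ("R1", {"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),
    ("R2", {"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),
    ("R3", {"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),
    ("R4", {"Give Birth": "no",  "Can Fly": "no"},        "Reptiles"),
    ("R5", {"Live in Water": "sometimes"},                "Amphibians"),
]

def covering_rules(instance):
    return [(name, label) for name, cond, label in rules
            if all(instance.get(a) == v for a, v in cond.items())]

hawk = {"Give Birth": "no", "Can Fly": "yes", "Live in Water": "no", "Blood Type": "warm"}
grizzly_bear = {"Give Birth": "yes", "Can Fly": "no", "Live in Water": "no", "Blood Type": "warm"}
print(covering_rules(hawk))          # [('R1', 'Birds')]
print(covering_rules(grizzly_bear))  # [('R3', 'Mammals')]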
Rule Coverage and Accuracy
Coverage of a rule: the fraction of records that satisfy the antecedent of the rule.
Accuracy of a rule: of the records that satisfy the antecedent, the fraction that also satisfy the consequent.
Characteristics of Rule-Based Classifier
Mutually exclusive rules: the classifier contains mutually exclusive rules if the rules are independent of each other; every record is covered by at most one rule.
Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values; every record is covered by at least one rule.
5.5 Nearest-Neighbor Classifiers
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k nearest neighbors of the unknown tuple. Closeness is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, . . . , x1n) and X2 = (x21, x22, . . . , x2n), is
dist(X1, X2) = sqrt( sum_{i=1}^{n} (x1i - x2i)^2 ).
In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it. The square root is taken of the total accumulated distance count. Typically, we normalize the values of each attribute before using this equation. This helps prevent attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes). Min-max normalization, for example, can be used to transform a value v of a numeric attribute A to v' in the range [0, 1] by computing
v' = (v - minA) / (maxA - minA),
where minA and maxA are the minimum and maximum values of attribute A. Chapter 2 describes other methods for
data normalization as a form of data transformation. For k-nearest-neighbor classification, the unknown tuple is
assigned the most common class among its k nearest neighbors. When k = 1, the unknown tuple is assigned the class
of the training tuple that is closest to it in pattern space. Nearest neighbor classifiers can also be used for prediction,
that is, to return a real-valued prediction for a given unknown tuple. In this case, the classifier returns the average
value of the real-valued labels associated with the k nearest neighbors of the unknown tuple. But how can distance be computed for attributes that are not numeric, but categorical, such as color? The above discussion assumes that the
attributes used to describe the tuples are all numeric. For categorical attributes, a simple method is to compare the
corresponding value of the attribute in tuple X1 with that in tuple X2. If the two are identical (e.g., tuples X1 and X2
both have the color blue), then the difference between the two is taken as 0. If the two are different (e.g., tuple X1 is
blue but tuple X2 is red), then the difference is considered to be 1. Other methods may incorporate more
sophisticated schemes for differential grading (e.g., where a larger difference score is assigned, say, for blue and
white than for blue and black).
What about missing values? In general, if the value of a given attribute A is missing in tuple X1 and/or in tuple X2, we assume the maximum possible difference. Suppose that each of the attributes has been mapped to the range [0, 1]. For categorical attributes, we take the difference value to be 1 if either one or both of the corresponding values of A are missing. If A is numeric and missing from both tuples X1 and X2, then the difference is also taken to
be 1.
How can I determine a good value for k, the number of neighbors? This can be determined experimentally.
Starting with k = 1, we use a test set to estimate the error rate of the classifier. This process can be repeated each
time by incrementing k to allow for one more neighbor. The k value that gives the minimum error rate may be
selected. In general, the larger the number of training tuples, the larger the value of k will be (so that classification and prediction decisions can be based on a larger portion of the stored tuples). As the number of training tuples approaches infinity and k = 1, the error rate can be no worse than twice the Bayes error rate (the latter being the theoretical minimum).
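Putting the pieces together (min-max normalization, Euclidean distance, and a majority vote among the k closest training tuples) gives the minimal k-nearest-neighbor sketch below; the tiny numeric data set is purely illustrative.

# Minimal sketch of a k-nearest-neighbor classifier.
from collections import Counter
from math import sqrt

def min_max_normalize(rows):
    # Map every numeric attribute to [0, 1] using v' = (v - minA) / (maxA - minA).
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    scaled = [[(v - l) / (h - l) if h > l else 0.0
               for v, l, h in zip(row, lo, hi)] for row in rows]
    return scaled, lo, hi

def euclidean(x1, x2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train_x, train_y, query, k=3):
    neighbors = sorted(zip(train_x, train_y), key=lambda p: euclidean(p[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# attributes: (income in thousands, age); class labels are illustrative.
raw_x  = [[30, 25], [80, 45], [25, 30], [90, 50], [40, 28]]
labels = ["no", "yes", "no", "yes", "no"]
train_x, lo, hi = min_max_normalize(raw_x)
query = [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip([85, 48], lo, hi)]
print(knn_classify(train_x, labels, query, k=3))  # yes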
5.6. Introduction to Bayesian Classification
Bayesian classification represents a supervised learning method as well as a statistical method for classification. It assumes an underlying probabilistic model, which allows us to capture uncertainty about the model in a principled way by determining probabilities of the outcomes, and it can solve diagnostic and predictive problems. This classification is named after Thomas Bayes (1702-1761), who proposed Bayes' theorem. Bayesian classification provides practical learning algorithms in which prior knowledge and observed data can be combined. It also provides a useful perspective for understanding and evaluating many learning algorithms. It calculates explicit probabilities for hypotheses and is robust to noise in input data.
Uses of Naive Bayes classification:
1. Naive Bayes text classification: Bayesian classification is used as a probabilistic learning method for text classification. Naive Bayes classifiers are among the most successful known algorithms for learning to classify text documents.
2. Spam filtering (http://en.wikipedia.org/wiki/Bayesian_spam_filtering): Spam filtering is the best known use of naive Bayesian text classification. It makes use of a naive Bayes classifier to identify spam e-mail. Bayesian spam filtering has become a popular mechanism to distinguish illegitimate spam email from legitimate email (sometimes called "ham" or "bacn"). Many modern mail clients implement Bayesian spam filtering. Users can also install separate email filtering programs.
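As a closing illustration of these ideas (Bayes' theorem applied under a naive independence assumption over words), a toy spam-filter sketch is given below. The training messages and the Laplace smoothing constant are illustrative choices, not a description of any particular mail filter.

# Minimal sketch of a naive Bayes text classifier in the spirit of Bayesian spam filtering.
from collections import Counter
from math import log

train = [("win money now", "spam"), ("cheap money offer", "spam"),
         ("meeting schedule today", "ham"), ("project status meeting", "ham")]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text, alpha=1.0):
    scores = {}
    for label in class_counts:
        # log P(class) + sum of log P(word | class), with Laplace smoothing.
        score = log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("cheap money today"))  # spam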

Learn how to upload a video over here