Data Warehousing DataMinig 10CS755 unit 4
The Eduladder is a community of students, teachers, and programmers just interested to make you pass any exams. So we solve previous year question papers for you.
See Our team
Wondering how we keep quality?
Got unsolved questions?

Ask Questions

Use Me  ?

New searches

You are here:Open notes-->VTU-->Data-Warehousing--DataMinig-10CS755-unit-4

Data Warehousing & DataMinig [10CS755] unit-4

UNIT IV ASSOCIATION ANALYSIS

This chapter presents a methodology known as association analysis, which is useful for
discovering interesting relationships hidden in large data sets. The uncovered relationships can
be represented in the form of association rules or sets of frequent items. For example, the
following rule can be extracted from the data set shown in Table 4.1:
{Diapers} → {Beer}.
Table 4.1. An example of market basket transactions.
TI
D
ITEMS
1 {Bread, Milk}
2 { Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}
The rule suggests that a strong relationship exists between the sale of diapers and beer
because many customers who buy diapers also buy beer. Retailers can use this type of rules to
help them identify new opportunities for cross- selling their products to the customers.
.
4.1 Basic Concepts and Algorithms
This section reviews the basic terminology used in association analysis and presents a formal
description of the task.
Binary Representation Market basket data can be represented in a binary format as shown in
Table 4.2, where each row corresponds to a transaction and each column corresponds to an item.
An item can be treated as a binary variable whose value is one if the item is present in a
transaction and zero otherwise. Because the presence of an item in a transaction is often
considered more important than its absence, an item is an asymmetric binary variable.
Table 4.2 A binary 0/1 representation of market basket data.
This representation is perhaps a very simplistic view of real market basket data because it
ignores certain important aspects of the data such as the quantity of items sold or the price paid
to purchase them. Itemset and Support Count Let I = {i1,i2,. . .,id} be the set of all items in a
market basket data and T = {t1, t2, . . . , tN } be the set of all transactions. Each transaction ti
contains a subset of items chosen from I. In association analysis, a collection of zero or more
items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance,
{Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that
does not contain any items.
The transaction width is defined as the number of items present in a transaction. A
transaction tj is said to contain an itemset X if X is a subset of tj. For example, the second
transaction shown in Table 6.2 contains the item-set {Bread, Diapers} but not {Bread, Milk}. An
important property of an itemset is its support count, which refers to the number of transactions
that contain a particular itemset. Mathematically, the support count, σ(X), for an itemset X can
be stated as follows:
Where the symbol | ・ | denote the number of elements in a set. In the data set shown in Table
4.2, the support count for {Beer, Diapers, Milk} is equal to two because there are only two
transactions that contain all three items.
Association Rule An association rule is an implication expression of the form X → Y , where
X and Y are disjoint itemsets, i.e., X ∩ Y = 0. The strength of an association rule can be
measured in terms of its support and confidence. Support determines how often a rule is
applicable to a given data set, while confidence determines how frequently items in Y
appear in transactions that contain X. The formal definitions of these metrics are
Support s(X------->Y) = ∂(XUY) 4.1
Confidence C(X------>Y) = ∂(XUY) 4.2
Formulation of Association Rule Mining Problem The association rule mining problem
can be formally stated as follows:
Definition 4.1 (Association Rule Discovery). Given a set of transactions T , find all the rules
having support ? minsup and confidence ? minconf, where minsup and minconf are the
corresponding support and confidence thresholds.
From Equation 4.2, notice that the support of a rule X -→ Y depends only on the support of
its corresponding itemset, X ∩ Y . For example, the following rules have identical support
because they involve items from the same itemset,
{Beer, Diapers, Milk}:
{Beer, Diapers} -→{Milk}, {Beer, Milk} -→{Diapers},
{Diapers, Milk} -→{Beer}, {Beer} -→{Diapers, Milk},
{Milk} -→{Beer,Diapers}, {Diapers} -→{Beer,Milk}.
If the itemset is infrequent, then all six candidate rules can be pruned immediately without
our having to compute their confidence values. Therefore, a common strategy adopted by many
association rule mining algorithms is to decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the item-sets that satisfy the
minsup threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the
frequent itemsets found in the previous step. These rules are called strong rules.
The computational requirements for frequent itemset generation are generally more
expensive than those of rule generation.
Figure 4.1. An itemset lattice.
4.2 Frequent Itemset Generation
A lattice structure can be used to enumerate the list of all possible itemsets. Figure 4.1 shows
an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially
generate up to 2k - 1 frequent itemsets, excluding the null set. Because k can be very large in
many practical applications, the search space of itemsets that need to be explored is
exponentially large.
A brute-force approach for finding frequent itemsets is to determine the support count for
every candidate itemset in the lattice structure. To do this, we need to compare each candidate
against every transaction, an operation that is shown in Figure 4.2. If the candidate is contained
in a transaction, its support count will be incremented. For example, the support for
{Bread,Milk} is incremented three times because the itemset is contained in transactions 1, 4,
and 5. Such an approach can be very expensive because it requires O(NMw) comparisons, where
N is the number of transactions, M =2k - 1 is the number of candidate itemsets, and w is the
maximum transaction width.
Candidates
Transactions
Items
Bread, Milk
Bread, Diapers, Beer, Eggs
Milk, Diapers, Beer, Coke
Bread, Milk, Diapers, Beer
Bread, Milk, Diapers, Coke
Figure 6.2. Counting the support of candidate itemsets.
There are several ways to reduce the computational complexity of frequent itemset
generation.
1. Reduce the number of candidate itemsets (M). The Apriori principle, described in the next
section, is an effective way to eliminate some of the candidate itemsets without counting their
support values.
2. Reduce the number of comparisons. Instead of matching each candidate itemset against
N
TID
1
2
3
4
5
M
every transaction, we can reduce the number of comparisons by using more advanced data
structures, either to store the candidate itemsets or to compress the data set.
4.2.1 The Apriori Principle
This section describes how the support measure helps to reduce the number of candidate
itemsets explored during frequent itemset generation. The use of support for pruning candidate
itemsets is guided by the following principle.
Theorem 4.1 (Apriori Principle). If an itemset is frequent, then all of its subsets must also be
frequent. To illustrate the idea behind the Apriori principle, consider the itemset lattice shown in
Figure 4.3. Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e}
must also contain its subsets, {c, d},{c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is
frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be
frequent.
Figure 4.3. An illustration of the Apriori principle.
If {c, d, e} is frequent, then all subsets of this itemset are frequent.
Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be
infrequent too. As illustrated in Figure 6.4, the entire subgraph containing the supersets of {a, b}
can be pruned immediately once {a, b} is found to be infrequent. This strategy of trimming the
exponential search space based on the support measure is known as support-based pruning. Such
a pruning strategy is made possible by a key property of the support measure, namely, that the
support for an itemset never exceeds the support for its subsets. This property is also known as
the anti-monotone property of the support measure.
Definition 4.2 (Monotonicity Property). Let I be a set of items, and J =2I be the power set
of I. A measure f is monotone (or upward closed) if
Figure 4.4. An illustration of support-based pruning.
If {a, b} is infrequent, then all supersets of {a, b} are infrequent, which means that if X is a
subset of Y , then f(X) must not exceed f(Y ). On the other hand, f is anti-monotone (or
downward closed) if which means that if X is a subset of Y , then f(Y ) must not exceed f(X).
Any measure that possesses an anti-monotone property can be incorporated directly into the
mining algorithm to effectively prune the exponential search space of candidate itemsets, as will
be shown in the next section.
4.2.2 Frequent Itemset Generation in the Apriori Algorithm
Apriori is the first association rule mining algorithm that pioneered the use of support-based
pruning to systematically control the exponential growth of candidate itemsets. Figure 4.5
provides a high-level illustration of the frequent itemset generation part of the Apriori algorithm
for the transactions shown in
Figure 4.5. Illustration of frequent itemset generation using the Apriori algorithm.
Table 4.1. We assume that the support threshold is 60%, which is equivalent to a minimum
support count equal to 3.
Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent.
Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by
the algorithm is = 6. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are
subsequently found to be infrequent after computing their support values. The remaining four
candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without supportbased
pruning, there are = 20 candidate 3-itemsets that can be formed using the six items given
in this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose
subsets are frequent. The only candidate that has this property is
{Bread, Diapers,Milk}.
The effectiveness of the Apriori pruning strategy can be shown by counting the number of
candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as
candidates will produce
candidates. With the Apriori principle, this number decreases to
candidates, which represents a 68% reduction in the number of candidate itemsets even in
this simple example.
The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in
Algorithm 4.1. Let Ck denote the set of candidate k-itemsets and Fk denote the set of frequent kitemsets:
? The algorithm initially makes a single pass over the data set to determine the support of
each item. Upon completion of this step, the set of all frequent 1-itemsets, F1, will be known
(steps 1 and 2).
? Next, the algorithm will iteratively generate new candidate k-itemsets using the frequent (k
- 1)-itemsets found in the previous iteration (step 5). Candidate generation is implemented using
a function called apriori-gen, which is described in Section 4.2.3.
? To count the support of the candidates, the algorithm needs to make an additional pass over
the data set (steps 6?10). The subset function is used to determine all the candidate itemsets in
Ck that are contained in each transaction t.
? After counting their supports, the algorithm eliminates all candidate itemsets whose support
counts are less than minsup (step 12).
? The algorithm terminates when there are no new frequent itemsets generated.
4.3 Rule Generation
This section describes how to extract association rules efficiently from a given frequent itemset.
Each frequent k-itemset, Y , can produce up to 2k-2 association rules, ignoring rules that have
empty antecedents or consequents( 0→Yor Y → 0). An association rule can be extracted by
partitioning the itemset Y into two non-empty subsets, X and Y -X, such that X → Y - X satisfies
the confidence threshold. Note that all such rules must have already met the support threshold
because they are generated from a frequent itemset.
Example 4 .2. Let X = {1, 2, 3} be a frequent itemset. There are six candidate association rules
that can be generated from X: {1, 2} →{3}, {1, 3} →{2}, {2, 3}→{1}, {1}→{2, 3}, {2}→{1, 3},
and {3}→{1, 2}. As each of their support is identical to the support for X, the rules must satisfy the
support threshold.
Computing the confidence of an association rule does not require additional scans of the
transaction data set. Consider the rule {1, 2} →{3}, which is generated from the frequent itemset X
= {1, 2, 3}. The confidence for this rule is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the
anti-monotone property of support ensures that {1, 2} must be frequent, too. Since the support
counts for both itemsets were already found during frequent itemset generation, there is no need to
read the entire data set again.
4.3.1 Confidence-Based Pruning
4.3.2 Rule Generation in Apriori Algorithm
The Apriori algorithm uses a level-wise approach for generating association rules, where each
level corresponds to the number of items that belong to the rule consequent. Initially, all the highconfidence
rules that have only one item in the rule consequent are extracted. These rules are then
used to generate new candidate rules. For example, if {acd}→{b} and {abd}→{c} are highconfidence
rules, then the candidate rule {ad} →{bc} is generated by merging the consequents of
both rules. Figure 4.15 shows a lattice structure for the association rules generated from the frequent
itemset {a, b, c, d}.
Figure 4.15. Pruning of association rules using the confidence measure.
Suppose the confidence for {bcd} →{a} is low. All the rules containing item a in its
consequent, including {cd} →{ab}, {bd}→{ac}, {bc} →{ad}, and {d} →{abc} can be discarded.
The only difference is that, in rule generation, we do not have to make additional passes over
the data set to compute the confidence of the candidate rules. Instead, we determine the confidence
of each rule by using the support counts computed during frequent itemset generation.
Algorithm 4.2 Rule generation of the Apriori algorithm.
1: for each frequent k-itemset fk, k ? 2 do
2: H1 = {i | i ∈ fk} {1-item consequents of the
rule.}
3: call ap-genrules(fk, H1.)
4: end for
Algorithm 4.3 Procedure ap-genrules(fk , Hm).
1: k = |fk| {size of frequent itemset.}
2: m = |Hm| {size of rule consequent.}
3: if k > m + 1 then
4: Hm+1 = apriori-gen(Hm).
5: for each hm+1 ∈ Hm+1 do
6: conf = σ(fk)/σ(fk - hm+1).
7: if conf ? minconf then
8: output the rule (fk - hm+1) -→ hm+1.
9: else
10: delete hm+1 from Hm+1.
11: end if
12: end for
13: call ap-genrules(fk, Hm+1.)
14: end if
4.4 Compact Representation of frequent Itemsets
In practice, the number of frequent itemsets produced from a transaction data set can be very
large. It is useful to identify a small representative set of itemsets from which all other frequent
itemsets can be derived. Two such representations are presented in this section in the form of
maximal and closed frequent itemsets.
Figure 4.16. Maximal frequent itemset.
Definition 4.3 (Maximal Frequent Itemset). A maximal frequent itemset is defined as a
frequent itemset for which none of its immediate supersets are frequent.
To illustrate this concept, consider the itemset lattice shown in Figure 4.16. The itemsets in the
lattice are divided into two groups: those that are frequent and those that are infrequent. A frequent
itemset border, which is represented by a dashed line, is also illustrated in the diagram. Every
itemset located above the border is frequent, while those located below the border (the shaded
nodes) are infrequent. Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e}
are considered to be maximal frequent itemsets because their immediate supersets are infrequent.
An itemset such as {a, d} is maximal frequent because all of its immediate supersets, {a, b, d}, {a,
c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of its immediate
supersets, {a, c, e}, is frequent.
For example, the frequent itemsets shown in Figure 4.16 can be divided into two groups:
? Frequent itemsets that begin with item a and that may contain items c, d, or e. This group
includes itemsets such as {a}, {a, c}, {a, d}, {a, e}, and {a, c, e}.
? Frequent itemsets that begin with items b, c, d, or e. This group includes itemsets such as
{b}, {b, c}, {c, d},{b, c, d, e}, etc.
4.4.2 Closed Frequent Itemsets
Closed itemsets provide a minimal representation of itemsets without losing their support
information. A formal definition of a closed itemset is presented below.
Definition 4.4 (Closed Itemset). An itemset X is closed if none of its immediate supersets has
exactly the same support count as X. Put another way, X is not closed if at least one of its
immediate supersets has the same support count as X. Examples of closed itemsets are shown in
Figure 4.17 illustrate the support count of each itemset, we have associated each node (itemset)
in the lattice with a list of its corresponding transaction IDs.
Figure 4.17. An example of the closed frequent itemsets
Definition 4.5 (Closed Frequent Itemset). An itemset is a closed frequent itemset if it is closed
and its support is greater than or equal to minsup. Algorithms are available to explicitly extract
closed frequent itemsets from a given data set. Interested readers may refer to the bibliographic
notes at the end of this chapter for further discussions of these algorithms. We can use the closed
frequent itemsets to determine the support counts for the non-closed Representation of Frequent
Itemsets
Algorithm 4 .4 Support counting using closed frequent itemsets.
1: Let C denote the set of closed frequent itemsets
2: Let kmax denote the maximum size of closed frequent itemsets
max
4: for k = kmax - 1 downto 1 do
6: for each f ∈ Fk do
7: if f ∈ C then
8: f.support = max{f .support|f ∈ Fk+1, f ⊂ f }
9: end if
10: end for
11: end for
The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest
frequent itemsets. This is because, in order to find the support for a non-closed frequent itemset, the
support for all of its supersets must be known.
Figure 4.18. Relationships among frequent, maximal frequent, and closed
frequent itemsets.
3: Fk = {f|f C, |f| = kmax} {Find all frequent itemsets of size kmax.}
5: Fk = {f|f Fk+1, |f| = k} {Find all frequent itemsets of size k.}
/
Figure 4.17 relationship among frequent, maximum frequent, and closed frequent itemset.
Closed frequent itemsets are useful for removing some of the redundant association rules. An
association rule X → Y is redundant if there exists another rule X → Y , where X is a subset of X
and Y is a subset of Y , such that the support and confidence for both rules are identical. In the
example shown in Figure 4.17, {b} is not a closed frequent itemset while {b, c} is closed.
The association rule {b} →{d, e} is therefore redundant because it has the same support and
confidence as {b, c} →{d, e}. Such redundant rules are not generated if closed frequent itemsets are
used for rule generation.
4.5 Alternative Methods for Generating Frequent Itemsets
Apriori is one of the earliest algorithms to have successfully addressed the combinatorial
explosion of frequent itemset generation. It achieves this by applying the Apriori principle to prune
the exponential search space. Despite its significant performance improvement, the algorithm still
incurs considerable I/O overhead since it requires making several passes over the transaction data
set.
? General-to-Specific versus Specific-to-General: The Apriori algorithm uses a general-tospecific
search strategy, where pairs of frequent (k-1)-itemsets are merged to obtain candidate kitemsets.
This general-to-specific search strategy is effective, provided the maximum length of a
frequent itemset is not too long. The configuration of frequent item-sets that works best with this
strategy is shown in Figure 4.19(a), where the darker nodes represent infrequent itemsets.
Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets first,
before finding the more general frequent itemsets. This strategy is use-ful to discover maximal
frequent itemsets in dense transactions, where the frequent itemset border is located near the bottom
of the lattice, as shown in Figure 4.19(b). The Apriori principle can be applied to prune all subsets
of maximal frequent itemsets. Specifically, if a candidate k-itemset is maximal frequent, we do not
have to examine any of its subsets of size k - 1. However, if the candidate k-itemset is infrequent,
we need to check all of its k - 1 subsets in the next iteration. Another approach is to combine both
general-to-specific and specific-to-general search strategies. This bidirectional approach requires
more space to
Figure 6.19. General-to-specific, specific-to-general, and bidirectional search.
? Equivalence Classes: Another way to envision the traversal is to first partition the lattice into
disjoint groups of nodes (or equivalence classes). A frequent itemset generation algorithm searches
for frequent itemsets within a particular equivalence class first before moving to another
equivalence class. As an example, the level-wise strategy used in the Apriori algorithm can be
considered to be partitioning the lattice on the basis of itemset sizes;
? Breadth-First versus Depth-First: The Apriori algorithm traverses the lattice in a breadthfirst
manner, as shown in Figure 6.21(a). It first discovers all the frequent 1-itemsets, followed by
the frequent 2-itemsets, and so on, until no new frequent itemsets are generated.
Figure 6.20. Equivalence classes based on prefix and suffix labels of item sets
Figure 6.21. Breadth first and depth first traversal
Representation of Transaction Data Set There are many ways to represent a transaction
data set. The choice of representation can affect the I/O costs incurred when computing the support
of candidate itemsets. Figure 6.23 shows two different ways of representing market basket
transactions. The representation on the left is called a horizontal data layout, which is adopted by
many association rule mining algorithms, including Apriori. Another possibility is to store the list of
transaction identifiers (TID-list) associated with each item. Such a representation is known as the
vertical data layout. The support for each candidate itemset is obtained by intersecting the TID-lists
of its subset items. The length of the TID-lists shrinks as we progress to larger sized itemsets.
Vertical Data Layout
a b c d e
1 1 2 2 1
4 2 3 4 3
5 5 4 5 6
6 7 8 9
7 8 9
8 10
9
Figure 6.23. Horizontal and vertical data format.
However, one problem with this approach is that the initial set of TID-lists may be too large to
fit into main memory, thus requiring more sophisticated techniques to compress the TID-lists. We
describe another effective approach to represent the data in the next section.

Editors




Join eduladder!