See Our team

Wondering how we keep quality?

Got unsolved questions?

Ask Questions

Engineering
GATE
CBSE
NCERT
Psychology
English
Computer
Constitution
Astrology
Yoga
Economics
Physics
Biology
Electronics
Microprocessor
Career
Interview
Anatomy
Botany

You are here:Open notes-->VTU-->Data-Warehousing--DataMinig-10CS755-unit-4

**Data Warehousing & DataMinig [10CS755] unit-4**

# UNIT IV ASSOCIATION ANALYSIS

This chapter presents a methodology known as association analysis, which is useful for

discovering interesting relationships hidden in large data sets. The uncovered relationships can

be represented in the form of association rules or sets of frequent items. For example, the

following rule can be extracted from the data set shown in Table 4.1:

{Diapers} → {Beer}.

Table 4.1. An example of market basket transactions.

TI

D

ITEMS

1 {Bread, Milk}

2 { Bread, Diapers, Beer, Eggs}

3 {Milk, Diapers, Beer, Cola}

4 {Bread, Milk, Diapers, Beer}

5 {Bread, Milk, Diapers, Cola}

The rule suggests that a strong relationship exists between the sale of diapers and beer

because many customers who buy diapers also buy beer. Retailers can use this type of rules to

help them identify new opportunities for cross- selling their products to the customers.

.

4.1 Basic Concepts and Algorithms

This section reviews the basic terminology used in association analysis and presents a formal

description of the task.

Binary Representation Market basket data can be represented in a binary format as shown in

Table 4.2, where each row corresponds to a transaction and each column corresponds to an item.

An item can be treated as a binary variable whose value is one if the item is present in a

transaction and zero otherwise. Because the presence of an item in a transaction is often

considered more important than its absence, an item is an asymmetric binary variable.

Table 4.2 A binary 0/1 representation of market basket data.

This representation is perhaps a very simplistic view of real market basket data because it

ignores certain important aspects of the data such as the quantity of items sold or the price paid

to purchase them. Itemset and Support Count Let I = {i1,i2,. . .,id} be the set of all items in a

market basket data and T = {t1, t2, . . . , tN } be the set of all transactions. Each transaction ti

contains a subset of items chosen from I. In association analysis, a collection of zero or more

items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance,

{Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that

does not contain any items.

The transaction width is defined as the number of items present in a transaction. A

transaction tj is said to contain an itemset X if X is a subset of tj. For example, the second

transaction shown in Table 6.2 contains the item-set {Bread, Diapers} but not {Bread, Milk}. An

important property of an itemset is its support count, which refers to the number of transactions

that contain a particular itemset. Mathematically, the support count, σ(X), for an itemset X can

be stated as follows:

Where the symbol | ・ | denote the number of elements in a set. In the data set shown in Table

4.2, the support count for {Beer, Diapers, Milk} is equal to two because there are only two

transactions that contain all three items.

Association Rule An association rule is an implication expression of the form X → Y , where

X and Y are disjoint itemsets, i.e., X ∩ Y = 0. The strength of an association rule can be

measured in terms of its support and confidence. Support determines how often a rule is

applicable to a given data set, while confidence determines how frequently items in Y

appear in transactions that contain X. The formal definitions of these metrics are

Support s(X------->Y) = ∂(XUY) 4.1

Confidence C(X------>Y) = ∂(XUY) 4.2

Formulation of Association Rule Mining Problem The association rule mining problem

can be formally stated as follows:

Definition 4.1 (Association Rule Discovery). Given a set of transactions T , find all the rules

having support ? minsup and confidence ? minconf, where minsup and minconf are the

corresponding support and confidence thresholds.

From Equation 4.2, notice that the support of a rule X -→ Y depends only on the support of

its corresponding itemset, X ∩ Y . For example, the following rules have identical support

because they involve items from the same itemset,

{Beer, Diapers, Milk}:

{Beer, Diapers} -→{Milk}, {Beer, Milk} -→{Diapers},

{Diapers, Milk} -→{Beer}, {Beer} -→{Diapers, Milk},

{Milk} -→{Beer,Diapers}, {Diapers} -→{Beer,Milk}.

If the itemset is infrequent, then all six candidate rules can be pruned immediately without

our having to compute their confidence values. Therefore, a common strategy adopted by many

association rule mining algorithms is to decompose the problem into two major subtasks:

1. Frequent Itemset Generation, whose objective is to find all the item-sets that satisfy the

minsup threshold. These itemsets are called frequent itemsets.

2. Rule Generation, whose objective is to extract all the high-confidence rules from the

frequent itemsets found in the previous step. These rules are called strong rules.

The computational requirements for frequent itemset generation are generally more

expensive than those of rule generation.

Figure 4.1. An itemset lattice.

4.2 Frequent Itemset Generation

A lattice structure can be used to enumerate the list of all possible itemsets. Figure 4.1 shows

an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially

generate up to 2k - 1 frequent itemsets, excluding the null set. Because k can be very large in

many practical applications, the search space of itemsets that need to be explored is

exponentially large.

A brute-force approach for finding frequent itemsets is to determine the support count for

every candidate itemset in the lattice structure. To do this, we need to compare each candidate

against every transaction, an operation that is shown in Figure 4.2. If the candidate is contained

in a transaction, its support count will be incremented. For example, the support for

{Bread,Milk} is incremented three times because the itemset is contained in transactions 1, 4,

and 5. Such an approach can be very expensive because it requires O(NMw) comparisons, where

N is the number of transactions, M =2k - 1 is the number of candidate itemsets, and w is the

maximum transaction width.

Candidates

Transactions

Items

Bread, Milk

Bread, Diapers, Beer, Eggs

Milk, Diapers, Beer, Coke

Bread, Milk, Diapers, Beer

Bread, Milk, Diapers, Coke

Figure 6.2. Counting the support of candidate itemsets.

There are several ways to reduce the computational complexity of frequent itemset

generation.

1. Reduce the number of candidate itemsets (M). The Apriori principle, described in the next

section, is an effective way to eliminate some of the candidate itemsets without counting their

support values.

2. Reduce the number of comparisons. Instead of matching each candidate itemset against

N

TID

1

2

3

4

5

M

every transaction, we can reduce the number of comparisons by using more advanced data

structures, either to store the candidate itemsets or to compress the data set.

4.2.1 The Apriori Principle

This section describes how the support measure helps to reduce the number of candidate

itemsets explored during frequent itemset generation. The use of support for pruning candidate

itemsets is guided by the following principle.

Theorem 4.1 (Apriori Principle). If an itemset is frequent, then all of its subsets must also be

frequent. To illustrate the idea behind the Apriori principle, consider the itemset lattice shown in

Figure 4.3. Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e}

must also contain its subsets, {c, d},{c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is

frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be

frequent.

Figure 4.3. An illustration of the Apriori principle.

If {c, d, e} is frequent, then all subsets of this itemset are frequent.

Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be

infrequent too. As illustrated in Figure 6.4, the entire subgraph containing the supersets of {a, b}

can be pruned immediately once {a, b} is found to be infrequent. This strategy of trimming the

exponential search space based on the support measure is known as support-based pruning. Such

a pruning strategy is made possible by a key property of the support measure, namely, that the

support for an itemset never exceeds the support for its subsets. This property is also known as

the anti-monotone property of the support measure.

Definition 4.2 (Monotonicity Property). Let I be a set of items, and J =2I be the power set

of I. A measure f is monotone (or upward closed) if

Figure 4.4. An illustration of support-based pruning.

If {a, b} is infrequent, then all supersets of {a, b} are infrequent, which means that if X is a

subset of Y , then f(X) must not exceed f(Y ). On the other hand, f is anti-monotone (or

downward closed) if which means that if X is a subset of Y , then f(Y ) must not exceed f(X).

Any measure that possesses an anti-monotone property can be incorporated directly into the

mining algorithm to effectively prune the exponential search space of candidate itemsets, as will

be shown in the next section.

4.2.2 Frequent Itemset Generation in the Apriori Algorithm

Apriori is the first association rule mining algorithm that pioneered the use of support-based

pruning to systematically control the exponential growth of candidate itemsets. Figure 4.5

provides a high-level illustration of the frequent itemset generation part of the Apriori algorithm

for the transactions shown in

Figure 4.5. Illustration of frequent itemset generation using the Apriori algorithm.

Table 4.1. We assume that the support threshold is 60%, which is equivalent to a minimum

support count equal to 3.

Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent.

Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by

the algorithm is = 6. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are

subsequently found to be infrequent after computing their support values. The remaining four

candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without supportbased

pruning, there are = 20 candidate 3-itemsets that can be formed using the six items given

in this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose

subsets are frequent. The only candidate that has this property is

{Bread, Diapers,Milk}.

The effectiveness of the Apriori pruning strategy can be shown by counting the number of

candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as

candidates will produce

candidates. With the Apriori principle, this number decreases to

candidates, which represents a 68% reduction in the number of candidate itemsets even in

this simple example.

The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in

Algorithm 4.1. Let Ck denote the set of candidate k-itemsets and Fk denote the set of frequent kitemsets:

? The algorithm initially makes a single pass over the data set to determine the support of

each item. Upon completion of this step, the set of all frequent 1-itemsets, F1, will be known

(steps 1 and 2).

? Next, the algorithm will iteratively generate new candidate k-itemsets using the frequent (k

- 1)-itemsets found in the previous iteration (step 5). Candidate generation is implemented using

a function called apriori-gen, which is described in Section 4.2.3.

? To count the support of the candidates, the algorithm needs to make an additional pass over

the data set (steps 6?10). The subset function is used to determine all the candidate itemsets in

Ck that are contained in each transaction t.

? After counting their supports, the algorithm eliminates all candidate itemsets whose support

counts are less than minsup (step 12).

? The algorithm terminates when there are no new frequent itemsets generated.

4.3 Rule Generation

This section describes how to extract association rules efficiently from a given frequent itemset.

Each frequent k-itemset, Y , can produce up to 2k-2 association rules, ignoring rules that have

empty antecedents or consequents( 0→Yor Y → 0). An association rule can be extracted by

partitioning the itemset Y into two non-empty subsets, X and Y -X, such that X → Y - X satisfies

the confidence threshold. Note that all such rules must have already met the support threshold

because they are generated from a frequent itemset.

Example 4 .2. Let X = {1, 2, 3} be a frequent itemset. There are six candidate association rules

that can be generated from X: {1, 2} →{3}, {1, 3} →{2}, {2, 3}→{1}, {1}→{2, 3}, {2}→{1, 3},

and {3}→{1, 2}. As each of their support is identical to the support for X, the rules must satisfy the

support threshold.

Computing the confidence of an association rule does not require additional scans of the

transaction data set. Consider the rule {1, 2} →{3}, which is generated from the frequent itemset X

= {1, 2, 3}. The confidence for this rule is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the

anti-monotone property of support ensures that {1, 2} must be frequent, too. Since the support

counts for both itemsets were already found during frequent itemset generation, there is no need to

read the entire data set again.

4.3.1 Confidence-Based Pruning

4.3.2 Rule Generation in Apriori Algorithm

The Apriori algorithm uses a level-wise approach for generating association rules, where each

level corresponds to the number of items that belong to the rule consequent. Initially, all the highconfidence

rules that have only one item in the rule consequent are extracted. These rules are then

used to generate new candidate rules. For example, if {acd}→{b} and {abd}→{c} are highconfidence

rules, then the candidate rule {ad} →{bc} is generated by merging the consequents of

both rules. Figure 4.15 shows a lattice structure for the association rules generated from the frequent

itemset {a, b, c, d}.

Figure 4.15. Pruning of association rules using the confidence measure.

Suppose the confidence for {bcd} →{a} is low. All the rules containing item a in its

consequent, including {cd} →{ab}, {bd}→{ac}, {bc} →{ad}, and {d} →{abc} can be discarded.

The only difference is that, in rule generation, we do not have to make additional passes over

the data set to compute the confidence of the candidate rules. Instead, we determine the confidence

of each rule by using the support counts computed during frequent itemset generation.

Algorithm 4.2 Rule generation of the Apriori algorithm.

1: for each frequent k-itemset fk, k ? 2 do

2: H1 = {i | i ∈ fk} {1-item consequents of the

rule.}

3: call ap-genrules(fk, H1.)

4: end for

Algorithm 4.3 Procedure ap-genrules(fk , Hm).

1: k = |fk| {size of frequent itemset.}

2: m = |Hm| {size of rule consequent.}

3: if k > m + 1 then

4: Hm+1 = apriori-gen(Hm).

5: for each hm+1 ∈ Hm+1 do

6: conf = σ(fk)/σ(fk - hm+1).

7: if conf ? minconf then

8: output the rule (fk - hm+1) -→ hm+1.

9: else

10: delete hm+1 from Hm+1.

11: end if

12: end for

13: call ap-genrules(fk, Hm+1.)

14: end if

4.4 Compact Representation of frequent Itemsets

In practice, the number of frequent itemsets produced from a transaction data set can be very

large. It is useful to identify a small representative set of itemsets from which all other frequent

itemsets can be derived. Two such representations are presented in this section in the form of

maximal and closed frequent itemsets.

Figure 4.16. Maximal frequent itemset.

Definition 4.3 (Maximal Frequent Itemset). A maximal frequent itemset is defined as a

frequent itemset for which none of its immediate supersets are frequent.

To illustrate this concept, consider the itemset lattice shown in Figure 4.16. The itemsets in the

lattice are divided into two groups: those that are frequent and those that are infrequent. A frequent

itemset border, which is represented by a dashed line, is also illustrated in the diagram. Every

itemset located above the border is frequent, while those located below the border (the shaded

nodes) are infrequent. Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e}

are considered to be maximal frequent itemsets because their immediate supersets are infrequent.

An itemset such as {a, d} is maximal frequent because all of its immediate supersets, {a, b, d}, {a,

c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of its immediate

supersets, {a, c, e}, is frequent.

For example, the frequent itemsets shown in Figure 4.16 can be divided into two groups:

? Frequent itemsets that begin with item a and that may contain items c, d, or e. This group

includes itemsets such as {a}, {a, c}, {a, d}, {a, e}, and {a, c, e}.

? Frequent itemsets that begin with items b, c, d, or e. This group includes itemsets such as

{b}, {b, c}, {c, d},{b, c, d, e}, etc.

4.4.2 Closed Frequent Itemsets

Closed itemsets provide a minimal representation of itemsets without losing their support

information. A formal definition of a closed itemset is presented below.

Definition 4.4 (Closed Itemset). An itemset X is closed if none of its immediate supersets has

exactly the same support count as X. Put another way, X is not closed if at least one of its

immediate supersets has the same support count as X. Examples of closed itemsets are shown in

Figure 4.17 illustrate the support count of each itemset, we have associated each node (itemset)

in the lattice with a list of its corresponding transaction IDs.

Figure 4.17. An example of the closed frequent itemsets

Definition 4.5 (Closed Frequent Itemset). An itemset is a closed frequent itemset if it is closed

and its support is greater than or equal to minsup. Algorithms are available to explicitly extract

closed frequent itemsets from a given data set. Interested readers may refer to the bibliographic

notes at the end of this chapter for further discussions of these algorithms. We can use the closed

frequent itemsets to determine the support counts for the non-closed Representation of Frequent

Itemsets

Algorithm 4 .4 Support counting using closed frequent itemsets.

1: Let C denote the set of closed frequent itemsets

2: Let kmax denote the maximum size of closed frequent itemsets

max

4: for k = kmax - 1 downto 1 do

6: for each f ∈ Fk do

7: if f ∈ C then

8: f.support = max{f .support|f ∈ Fk+1, f ⊂ f }

9: end if

10: end for

11: end for

The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest

frequent itemsets. This is because, in order to find the support for a non-closed frequent itemset, the

support for all of its supersets must be known.

Figure 4.18. Relationships among frequent, maximal frequent, and closed

frequent itemsets.

3: Fk = {f|f C, |f| = kmax} {Find all frequent itemsets of size kmax.}

5: Fk = {f|f Fk+1, |f| = k} {Find all frequent itemsets of size k.}

/

Figure 4.17 relationship among frequent, maximum frequent, and closed frequent itemset.

Closed frequent itemsets are useful for removing some of the redundant association rules. An

association rule X → Y is redundant if there exists another rule X → Y , where X is a subset of X

and Y is a subset of Y , such that the support and confidence for both rules are identical. In the

example shown in Figure 4.17, {b} is not a closed frequent itemset while {b, c} is closed.

The association rule {b} →{d, e} is therefore redundant because it has the same support and

confidence as {b, c} →{d, e}. Such redundant rules are not generated if closed frequent itemsets are

used for rule generation.

4.5 Alternative Methods for Generating Frequent Itemsets

Apriori is one of the earliest algorithms to have successfully addressed the combinatorial

explosion of frequent itemset generation. It achieves this by applying the Apriori principle to prune

the exponential search space. Despite its significant performance improvement, the algorithm still

incurs considerable I/O overhead since it requires making several passes over the transaction data

set.

? General-to-Specific versus Specific-to-General: The Apriori algorithm uses a general-tospecific

search strategy, where pairs of frequent (k-1)-itemsets are merged to obtain candidate kitemsets.

This general-to-specific search strategy is effective, provided the maximum length of a

frequent itemset is not too long. The configuration of frequent item-sets that works best with this

strategy is shown in Figure 4.19(a), where the darker nodes represent infrequent itemsets.

Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets first,

before finding the more general frequent itemsets. This strategy is use-ful to discover maximal

frequent itemsets in dense transactions, where the frequent itemset border is located near the bottom

of the lattice, as shown in Figure 4.19(b). The Apriori principle can be applied to prune all subsets

of maximal frequent itemsets. Specifically, if a candidate k-itemset is maximal frequent, we do not

have to examine any of its subsets of size k - 1. However, if the candidate k-itemset is infrequent,

we need to check all of its k - 1 subsets in the next iteration. Another approach is to combine both

general-to-specific and specific-to-general search strategies. This bidirectional approach requires

more space to

Figure 6.19. General-to-specific, specific-to-general, and bidirectional search.

? Equivalence Classes: Another way to envision the traversal is to first partition the lattice into

disjoint groups of nodes (or equivalence classes). A frequent itemset generation algorithm searches

for frequent itemsets within a particular equivalence class first before moving to another

equivalence class. As an example, the level-wise strategy used in the Apriori algorithm can be

considered to be partitioning the lattice on the basis of itemset sizes;

? Breadth-First versus Depth-First: The Apriori algorithm traverses the lattice in a breadthfirst

manner, as shown in Figure 6.21(a). It first discovers all the frequent 1-itemsets, followed by

the frequent 2-itemsets, and so on, until no new frequent itemsets are generated.

Figure 6.20. Equivalence classes based on prefix and suffix labels of item sets

Figure 6.21. Breadth first and depth first traversal

Representation of Transaction Data Set There are many ways to represent a transaction

data set. The choice of representation can affect the I/O costs incurred when computing the support

of candidate itemsets. Figure 6.23 shows two different ways of representing market basket

transactions. The representation on the left is called a horizontal data layout, which is adopted by

many association rule mining algorithms, including Apriori. Another possibility is to store the list of

transaction identifiers (TID-list) associated with each item. Such a representation is known as the

vertical data layout. The support for each candidate itemset is obtained by intersecting the TID-lists

of its subset items. The length of the TID-lists shrinks as we progress to larger sized itemsets.

Vertical Data Layout

a b c d e

1 1 2 2 1

4 2 3 4 3

5 5 4 5 6

6 7 8 9

7 8 9

8 10

9

Figure 6.23. Horizontal and vertical data format.

However, one problem with this approach is that the initial set of TID-lists may be too large to

fit into main memory, thus requiring more sophisticated techniques to compress the TID-lists. We

describe another effective approach to represent the data in the next section.

## Use Me ?