
**Data Warehousing & Data Mining [10CS755] unit-7**

# UNIT-7 Clustering Techniques

7.1 Overview

Clustering and classification are both fundamental tasks in data mining. Classification is used mostly as a supervised learning method, clustering mostly for unsupervised learning (some clustering models serve both). The goal of clustering is descriptive, that of classification is predictive (Veyssieres and Plant, 1998). Since the goal of clustering is to discover a new set of categories, the new groups are of interest in themselves, and their assessment is intrinsic. In classification tasks, however, an important part of the assessment is extrinsic, since the groups must reflect some reference set of classes. "Understanding our world requires conceptualizing the similarities and differences between the entities that compose it" (Tyron and Bailey, 1970).

Clustering groups data instances into subsets in such a manner that similar instances are grouped together, while different instances belong to different groups. The instances are thereby organized into an efficient representation that characterizes the population being sampled. Formally, the clustering structure is represented as a set of subsets C = C1, ..., Ck of S, such that S = ∪ᵢ₌₁ᵏ Ci and Ci ∩ Cj = ∅ for i ≠ j. Consequently, any instance in S belongs to exactly one and only one subset.
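The formal requirement above (subsets that cover S and are pairwise disjoint) can be checked mechanically. The following is an illustrative sketch, not part of the original notes; the function name `is_partition` and the toy set are our own choices.

```python
# Illustrative sketch: verify that clusters C1, ..., Ck form a partition of S,
# i.e. their union is S and they are pairwise disjoint.

def is_partition(S, clusters):
    """Return True if `clusters` is a partition of the set S."""
    union = set()
    for c in clusters:
        if union & c:      # overlap: instances appear in more than one subset
            return False
        union |= c
    return union == S      # every instance must be covered exactly once

S = {1, 2, 3, 4, 5}
print(is_partition(S, [{1, 2}, {3}, {4, 5}]))    # True: a valid clustering
print(is_partition(S, [{1, 2}, {2, 3}, {4, 5}]))  # False: instance 2 is in two clusters
```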

Clustering of objects is as ancient as the human need for describing the salient characteristics of men and objects and identifying them with a type. Therefore, it embraces various scientific disciplines: from mathematics and statistics to biology and genetics, each of which uses different terms to describe the topologies formed using this analysis. From biological "taxonomies" to medical "syndromes" and genetic "genotypes" to manufacturing "group technology", the problem is identical: forming categories of entities and assigning individuals to the proper groups within them.

Distance Measures

Since clustering is the grouping of similar instances/objects, some sort of measure that can determine whether two objects are similar or dissimilar is required. There are two main types of measures used to estimate this relation: distance measures and similarity measures.

Many clustering methods use distance measures to determine the similarity or dissimilarity between any pair of objects. It is useful to denote the distance between two instances xi and xj as d(xi, xj). A valid distance measure should be symmetric and obtain its minimum value (usually zero) for identical vectors. The distance measure is called a metric distance measure if it also satisfies the following properties:

1. Triangle inequality: d(xi, xk) ≤ d(xi, xj) + d(xj, xk) for all xi, xj, xk ∈ S.

2. Identity of indiscernibles: d(xi, xj) = 0 ⇒ xi = xj for all xi, xj ∈ S.
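As a concrete check, the Euclidean distance satisfies all of the metric properties. The sketch below (our own illustration, using an assumed small set of 2-D points) verifies symmetry, the identity property, and the triangle inequality exhaustively over the sample.

```python
# Illustrative check that Euclidean distance is a metric on a sample of points.
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]

# Symmetry: d(x, y) == d(y, x); identity: d(x, y) == 0 iff x == y
for x, y in product(points, repeat=2):
    assert math.isclose(euclidean(x, y), euclidean(y, x))
    assert (euclidean(x, y) == 0) == (x == y)

# Triangle inequality: d(xi, xk) <= d(xi, xj) + d(xj, xk)
for x, y, z in product(points, repeat=3):
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12

print("metric properties hold on this sample")
```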

7.2 Features of cluster analysis

In this section we describe the most well-known clustering algorithms. The main reason for having many clustering methods is the fact that the notion of "cluster" is not precisely defined (Estivill-Castro, 2000). Consequently, many clustering methods have been developed, each of which uses a different induction principle. Fraley and Raftery (1998) suggest dividing the clustering methods into two main groups: hierarchical and partitioning methods. Han and Kamber (2001) suggest three additional main categories: density-based methods, model-based clustering and grid-based methods. An alternative categorization based on the induction principle of the various clustering methods is presented in (Estivill-Castro, 2000).

7.4 Types of Cluster Analysis Methods: Partitional Methods, Hierarchical Methods, Density-Based Methods

7.4.1 Hierarchical Methods

These methods construct the clusters by recursively partitioning the instances in either a top-down or bottom-up fashion. These methods can be subdivided as follows:

Agglomerative hierarchical clustering: each object initially represents a cluster of its own. Clusters are then successively merged until the desired cluster structure is obtained.

Divisive hierarchical clustering: all objects initially belong to one cluster. The cluster is then divided into sub-clusters, which are successively divided into their own sub-clusters. This process continues until the desired cluster structure is obtained.

The result of the hierarchical methods is a dendrogram, representing the nested grouping of objects and the similarity levels at which groupings change. A clustering of the data objects is obtained by cutting the dendrogram at the desired similarity level.
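The bottom-up process can be sketched in a few lines: start with singleton clusters and repeatedly merge the closest pair until the desired number of clusters remains. Stopping after fewer or more merges corresponds to cutting the dendrogram at different levels. This is our own minimal illustration (single-link criterion, assumed toy points), not an implementation from the notes.

```python
# Minimal agglomerative (bottom-up) clustering sketch with single-link distance.
import math

def single_link(ci, cj, points):
    """Shortest distance between any member of ci and any member of cj."""
    return min(math.dist(points[a], points[b]) for a in ci for b in cj)

def agglomerate(points, k):
    """Merge singleton clusters pairwise until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda p: single_link(clusters[p[0]], clusters[p[1]], points),
        )
        clusters[i] += clusters.pop(j)   # merge cluster j into cluster i
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerate(pts, 2))  # two well-separated groups: [[0, 1], [2, 3]]
```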

The merging or division of clusters is performed according to some similarity measure, chosen so as to optimize some criterion (such as a sum of squares). The hierarchical clustering methods can be further divided according to the manner in which the similarity measure is calculated (Jain et al., 1999):


Single-link clustering (also called the connectedness, minimum or nearest neighbor method): methods that consider the distance between two clusters to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, the similarity between a pair of clusters is considered to be equal to the greatest similarity from any member of one cluster to any member of the other cluster (Sneath and Sokal, 1973).

Complete-link clustering (also called the diameter, maximum or furthest neighbor method): methods that consider the distance between two clusters to be equal to the longest distance from any member of one cluster to any member of the other cluster (King, 1967).

Average-link clustering (also called the minimum variance method): methods that consider the distance between two clusters to be equal to the average distance from any member of one cluster to any member of the other cluster. Such clustering algorithms may be found in (Ward, 1963) and (Murtagh, 1984).

The disadvantages of single-link clustering and average-link clustering can be summarized as follows (Guha et al., 1998): single-link clustering has a drawback known as the "chaining effect": a few points that form a bridge between two clusters cause the single-link clustering to unify these two clusters into one. Average-link clustering may cause elongated clusters to split and portions of neighboring elongated clusters to merge.
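The three linkage criteria differ only in how they aggregate the pairwise point distances between two clusters. The sketch below (our own illustration on two assumed toy clusters) computes all three inter-cluster distances side by side.

```python
# Single-, complete- and average-link distances between two toy clusters.
import math

A = [(0, 0), (0, 1)]
B = [(3, 0), (4, 0)]

pairs = [math.dist(a, b) for a in A for b in B]

single   = min(pairs)               # nearest-neighbor (shortest) distance
complete = max(pairs)               # furthest-neighbor (longest) distance
average  = sum(pairs) / len(pairs)  # mean distance over all pairs

print(single, complete, average)    # single <= average <= complete always
```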

The complete-link clustering methods usually produce more compact clusters and more useful hierarchies than the single-link clustering methods, yet the single-link methods are more versatile. Generally, hierarchical methods are characterized by the following strengths:

Versatility: the single-link methods, for example, maintain good performance on data sets containing non-isotropic clusters, including well-separated, chain-like and concentric clusters.

Multiple partitions: hierarchical methods produce not one partition, but multiple nested partitions, which allow different users to choose different partitions according to the desired similarity level. The hierarchical partition is presented using the dendrogram.

The main disadvantages of the hierarchical methods are:

Inability to scale well: the time complexity of hierarchical algorithms is at least O(m²) (where m is the total number of instances), which is non-linear in the number of objects. Clustering a large number of objects using a hierarchical algorithm is also characterized by huge I/O costs. Moreover, hierarchical methods can never undo what was done previously; there is no back-tracking capability.

7.4.2 Partitioning Methods

Partitioning methods relocate instances by moving them from one cluster to another, starting from an initial partitioning. Such methods typically require that the number of clusters be pre-set by the user. To achieve global optimality in partition-based clustering, an exhaustive enumeration of all possible partitions would be required. Because this is not feasible, certain greedy heuristics are used in the form of iterative optimization; namely, a relocation method iteratively relocates points between the k clusters. The following subsections present various types of partitioning methods.
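The iterative-relocation idea can be sketched in the spirit of k-means: given k initial centroids, repeatedly assign each point to its nearest centroid and move each centroid to the mean of its members, stopping when nothing changes. This is our own minimal illustration (assumed toy points and starting centroids), not the specific algorithm of any one subsection.

```python
# Iterative relocation sketch in the spirit of k-means.
import math

def kmeans(points, centroids, max_iter=100):
    labels = []
    for _ in range(max_iter):
        # assignment step: relocate each point to its closest centroid
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members)
                                           for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:              # converged: no relocation occurred
            break
        centroids = new_centroids
    return labels, centroids

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels, cents = kmeans(pts, [(0, 0), (5, 5)])
print(labels)   # [0, 0, 1, 1]
```

Note that, as the text says, the greedy relocation only finds a local optimum; the result can depend on the initial centroids.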

7.5 Quality and Validity of Cluster Analysis.
