
Data Warehousing & Data Mining [10CS755] unit-7

UNIT-7 Clustering Techniques

Clustering and classification are both fundamental tasks in data mining. Classification is used mostly as a
supervised learning method, clustering mostly for unsupervised learning (some clustering models serve both). The
goal of clustering is descriptive, that of classification is predictive (Veyssieres and Plant, 1998). Since the
goal of clustering is to discover a new set of categories, the new groups are of interest in themselves, and their
assessment is intrinsic. In classification tasks, however, an important part of the assessment is extrinsic, since
the groups must reflect some reference set of classes. "Understanding our world requires conceptualizing the
similarities and differences between the entities that compose it" (Tryon and Bailey, 1970).
Clustering groups data instances into subsets in such a manner that similar instances are grouped together,
while different instances belong to different groups. The instances are thereby organized into an efficient
representation that characterizes the population being sampled. Formally, the clustering structure is represented
as a set of subsets C = C1, ..., Ck of S, such that S = C1 ∪ ... ∪ Ck and Ci ∩ Cj = ∅ for i ≠ j.
Consequently, any instance in S belongs to exactly one and only one subset.
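The partition property above can be checked mechanically. The following is a minimal sketch on made-up data; the helper `is_partition` is hypothetical, not part of any standard library.

```python
# Sketch (hypothetical data): verify that a candidate clustering
# C = {C1, ..., Ck} forms a partition of the instance set S, i.e. the
# subsets cover S and are pairwise disjoint, so each instance lies in
# exactly one subset.

def is_partition(S, clusters):
    """Return True if `clusters` is a partition of the set S."""
    union = set().union(*clusters) if clusters else set()
    covers = union == set(S)
    # if the subsets were not disjoint, the sizes would sum to more
    # than the size of their union
    disjoint = sum(len(c) for c in clusters) == len(union)
    return covers and disjoint

S = {1, 2, 3, 4, 5}
print(is_partition(S, [{1, 2}, {3}, {4, 5}]))     # True: a valid clustering
print(is_partition(S, [{1, 2}, {2, 3}, {4, 5}]))  # False: instance 2 is in two subsets
```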
Clustering of objects is as ancient as the human need for describing the salient characteristics of people and
objects and identifying them with a type. It therefore embraces various scientific disciplines: from
mathematics and statistics to biology and genetics, each of which uses different terms to describe the topologies
formed using this analysis. From biological "taxonomies" to medical "syndromes" and genetic "genotypes"
to manufacturing "group technology", the problem is identical: forming categories of entities and assigning
individuals to the proper groups within them.
Distance Measures
Since clustering is the grouping of similar instances/objects, some sort of measure that can determine whether
two objects are similar or dissimilar is required. There are two main types of measures used to estimate this
relation: distance measures and similarity measures.
Many clustering methods use distance measures to determine the similarity or dissimilarity between any pair of
objects. It is useful to denote the distance between two instances xi and xj as d(xi, xj). A valid distance measure
should be symmetric and obtain its minimum value (usually zero) for identical vectors. The distance
measure is called a metric distance measure if it also satisfies the following properties:
1. Triangle inequality: d(xi, xk) ≤ d(xi, xj) + d(xj, xk) for all xi, xj, xk ∈ S.
2. d(xi, xj) = 0 ⇒ xi = xj for all xi, xj ∈ S.
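For example, the ordinary Euclidean distance satisfies all of these properties, which a short sketch can verify on hypothetical points:

```python
# Sketch: the Euclidean distance satisfies the metric properties listed
# above (symmetry, zero for identical vectors, triangle inequality).
import math

def d(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

xi, xj, xk = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)

print(d(xi, xj) == d(xj, xi))              # True: symmetry
print(d(xi, xi) == 0.0)                    # True: identical vectors -> distance 0
print(d(xi, xk) <= d(xi, xj) + d(xj, xk))  # True: triangle inequality
```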
7.2 Features of cluster analysis
In this section we describe the most well-known clustering algorithms. The main reason for having many
clustering methods is the fact that the notion of "cluster" is not precisely defined (Estivill-Castro, 2000).
Consequently many clustering methods have been developed, each of which uses a different induction
principle. Fraley and Raftery (1998) suggest dividing the clustering methods into two main groups: hierarchical
and partitioning methods. Han and Kamber (2001) suggest three additional main categories: density-based
methods, model-based clustering and grid-based methods. An alternative categorization based on the induction
principle of the various clustering methods is presented in (Estivill-Castro, 2000).
7.4 Types of Cluster Analysis Methods, Partitional Methods, Hierarchical Methods, Density
Based Methods
7.4.1 Hierarchical Methods
These methods construct the clusters by recursively partitioning the instances in either a top-down or bottom-up
fashion. These methods can be subdivided as follows:
Agglomerative hierarchical clustering - Each object initially represents a cluster of its own. Clusters are
then successively merged until the desired cluster structure is obtained.
Divisive hierarchical clustering - All objects initially belong to one cluster. The cluster is then divided into
sub-clusters, which are successively divided into their own sub-clusters. This process continues until
the desired cluster structure is obtained.
The result of the hierarchical methods is a dendrogram, representing the nested grouping of objects and the
similarity levels at which groupings change. A clustering of the data objects is obtained by cutting the
dendrogram at the desired similarity level.
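The bottom-up process can be sketched in a few lines. The following toy implementation (hypothetical 1-D data, single-link merging) records the level of each merge; cutting the dendrogram at a level amounts to keeping only the merges below it:

```python
# Agglomerative (bottom-up) clustering sketch on toy 1-D data: start with
# singleton clusters, repeatedly merge the two closest clusters, and
# record the level of each merge. The list of (level, cluster) pairs is a
# flat rendering of the dendrogram.

def single_link(a, b):
    # shortest distance from any member of one cluster to any of the other
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points):
    clusters = [[p] for p in points]
    merges = []  # (merge level, resulting cluster)
    while len(clusters) > 1:
        # find the closest pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        level = single_link(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append((level, sorted(merged)))
    return merges

for level, cluster in agglomerate([1.0, 1.5, 5.0, 5.4, 9.0]):
    print(round(level, 6), cluster)
```

Cutting this dendrogram at level 2, for instance, keeps only the first two merges and yields the three clusters {1.0, 1.5}, {5.0, 5.4} and {9.0}.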
The merging or division of clusters is performed according to some similarity measure, chosen so as to optimize
some criterion (such as a sum of squares). The hierarchical clustering methods can be further divided according
to the manner in which the similarity measure is calculated (Jain et al., 1999):
Single-link clustering (also called the connectedness, the minimum method or the nearest neighbor method)
- methods that consider the distance between two clusters to be equal to the shortest distance from
any member of one cluster to any member of the other cluster. If the data consist of similarities, the
similarity between a pair of clusters is considered to be equal to the greatest similarity from any member of one
cluster to any member of the other cluster (Sneath and Sokal, 1973).
Complete-link clustering (also called the diameter, the maximum method or the furthest neighbor
method) - methods that consider the distance between two clusters to be equal to the longest distance from
any member of one cluster to any member of the other cluster (King, 1967).
Average-link clustering (also called the minimum variance method) - methods that consider the distance between
two clusters to be equal to the average distance from any member of one cluster to any member of the other
cluster. Such clustering algorithms may be found in (Ward, 1963) and (Murtagh, 1984).
The disadvantages of single-link clustering and average-link clustering can be summarized as follows (Guha
et al., 1998): single-link clustering has a drawback known as the "chaining effect": a few points that form a
bridge between two clusters cause single-link clustering to unify those two clusters into one. Average-link
clustering may cause elongated clusters to split and portions of neighboring elongated clusters to merge.
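The three inter-cluster distances described above can be sketched directly. The helper names below are illustrative, and the two 1-D clusters are made-up toy data:

```python
# Sketch of the three inter-cluster distance definitions on two toy
# 1-D clusters (statistics.mean is from the standard library).
from statistics import mean

def single_link(a, b):    # nearest neighbor: shortest pairwise distance
    return min(abs(x - y) for x in a for y in b)

def complete_link(a, b):  # furthest neighbor: longest pairwise distance
    return max(abs(x - y) for x in a for y in b)

def average_link(a, b):   # mean of all pairwise distances
    return mean(abs(x - y) for x in a for y in b)

c1, c2 = [1.0, 2.0], [4.0, 8.0]
print(single_link(c1, c2))    # 2.0 (between members 2.0 and 4.0)
print(complete_link(c1, c2))  # 7.0 (between members 1.0 and 8.0)
print(average_link(c1, c2))   # 4.5 (mean of the pairwise distances 3, 7, 2, 6)
```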
The complete-link clustering methods usually produce more compact clusters and more useful hierarchies than
the single-link clustering methods, yet the single-link methods are more versatile. Generally, hierarchical
methods are characterized by the following strengths:
Versatility - The single-link methods, for example, maintain good performance on data sets containing
non-isotropic clusters, including well-separated, chain-like and concentric clusters.
Multiple partitions - Hierarchical methods produce not one partition, but multiple nested partitions, which
allow different users to choose different partitions according to the desired similarity level. The hierarchical
partition is presented using the dendrogram.
The main disadvantages of the hierarchical methods are:
Inability to scale well - The time complexity of hierarchical algorithms is at least O(m^2) (where m is the total
number of instances), which is non-linear in the number of objects. Clustering a large number of objects
using a hierarchical algorithm is also characterized by huge I/O costs.
No back-tracking - Hierarchical methods can never undo what was done previously; there is no back-tracking
capability.
7.4.2 Partitioning Methods
Partitioning methods relocate instances by moving them from one cluster to another, starting from an initial
partitioning. Such methods typically require that the number of clusters be pre-set by the user. To achieve
global optimality in partition-based clustering, an exhaustive enumeration of all possible partitions is
required. Because this is not feasible, certain greedy heuristics are used in the form of iterative optimization;
namely, a relocation method iteratively relocates points between the k clusters. The following subsections
present various types of partitioning methods.
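As an illustration of iterative relocation, here is a minimal k-means-style sketch; the 1-D data and initial centroids are hypothetical. Points are repeatedly reassigned to the nearest centroid and the centroids recomputed until nothing moves:

```python
# Minimal k-means sketch, the classic iterative relocation heuristic:
# each pass reassigns every point to its nearest centroid, then moves
# each centroid to the mean of its cluster, stopping at convergence.
from statistics import mean

def kmeans_1d(points, centroids):
    while True:
        # relocation step: assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged: no centroid moved
            return clusters, centroids
        centroids = new_centroids

clusters, centroids = kmeans_1d([1.0, 1.5, 2.0, 8.0, 8.5, 9.0], [0.0, 10.0])
print(clusters)   # [[1.0, 1.5, 2.0], [8.0, 8.5, 9.0]]
print(centroids)  # [1.5, 8.5]
```

Note that, as the section says, this greedy relocation only finds a local optimum: a poor choice of initial centroids can leave the final partition far from the globally best one.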
7.5 Quality and Validity of Cluster Analysis.