Data Warehousing DataMinig 10CS755 unit 2
The Eduladder is a community of students, teachers, and programmers just interested to make you pass any exams. So we solve previous year question papers for you.
See Our team
Wondering how we keep quality?
Got unsolved questions?

Ask Questions

Use Me  ?

New searches

You are here:Open notes-->VTU-->Data-Warehousing--DataMinig-10CS755-unit-2

Data Warehousing & DataMinig [10CS755] unit-2

UNIT 2 ONLINE ANALYTICAL PROCESSING (OLAP)

2.1 INTRODUCTION
A dimension is an attribute or an ordinate within a multidimensional structure
consisting of a list of values (members). For example, the degree, the country, the
scholarship and the year were the four dimensions used in the student database.
Dimensions are used for selecting and aggregating data at the desired level. A dimension
does not include ordering of values, for example there is no ordering associated with
values of each of the four dimensions, but a dimension may have one or more hierarchies
that show parent /child relationship between the members of a dimension.
For example, the dimension country may have a hierarchy that divides the world into
continents and continents into regions followed by regions into countries if such a
hierarchy is useful for the applications. Multiple hierarchies may be defined on a
dimension. For example, counties may be defined to have a geographical hierarchy and
may have another hierarchy defined on the basis of their wealth or per capita income (e.g.
high, medium, low).
The non-null values of facts are the numerical values stored in each data cube cell. They
are called measures. A measure is a non-key attribute in a fact table and the value of the
measure is dependent on the values of the dimensions. Each unique combination of
members in a Cartesian product dimensions of the cube identifies precisely one data cell
within the cube and that cell stores the values of the measures.
The SQL command GROUP BY is unusual aggregation operator in that a table is divided
into sub-tables based on the attribute values in the GROUP BY clause so that each subtable
has the same values for the attribute and then aggregations over each sub-table are
carried out. SQL has a variety of aggregation functions including max, min, average,
count which are used by employing the GROUP BY facility.
A data cube computes aggregates overall subsets of dimensions specified in the cube. A
cube may be found at the union of (large) number of SQL GROUP-BY operations.
Generally, all or some of the aggregates are pre-computed to improve query response
time. A decision has to be made as to what and how much should be pre-computed since
pre-computed queries require storage and time to compute them.
A data cube is often implemented as a database in which there are dimension tables each
of which provides details of a dimension. The database may be the enterprise data
warehouse.
2.2 OLAP
OLAP systems are data warehouse front-end software tools to make aggregate
data available efficiently, for advanced analysis, to managers of an enterprise. The
analysis often requires resource intensive aggregations processing and therefore it
becomes necessary to implement a special database (e.g. data warehouse) to improve
OLAP response time. It is essential that an OLAP system provides facilities for a
manager to pose ad hoc complex queries to obtain the information that he/she requires.
Another term that is being used increasingly is business intelligence. It is used to
mean both data warehousing and OLAP. It has been defined as a user-centered process of
exploring data, data relationships and trends, thereby helping to improve overall decision
making. Normally this involves a process of accessing data (usually stored within the
data warehouse) and analyzing it to draw conclusions and derive insights with the
purpose of effecting positive change within an enterprise. Business intelligence is closely
related to OLAP.
A data warehouse and OLAP are based on a multidimensional conceptual view of
the enterprise data. Any enterprise data is multidimensional consisting of dimensions
degree, country, scholarship, and year. Data that is arranged by the dimensions is like a
spreadsheet, although a spreadsheet presents only two-dimensional data with each cell
containing an aggregation. As an example, table 8.1 shows one such two-dimensional
spreadsheet with dimensions Degree and Country, where the measure is the number of
students joining a university in a particular year or semester.
Table 8.1 A multidimensional view of data for two dimensions
Degree
B.Sc LLB MBBS BCom BIT ALL
Country
Australia 5 20 15 50 11 101
India 10 0 15 25 17 67
Malaysia 5 1 10 12 23 51
Singapore 2 2 10 10 31 55
Sweden 5 0 5 25 7 42
UK 5 15 20 20 13 73
USA 0 2 20 15 19 56
ALL 32 40 95 157 121 445
Table 8.1 be the information for the year 2001. Similar spreadsheet views would be
available for other years. Three-dimensional data can also be organized in a spreadsheet
using a number of sheets or by using a number of two-dimensional tables in the same
sheet.
Although it is useful to think of OLAP systems as a generalization of
spreadsheets, spreadsheets are not really suitable for OLAP in spite of the nice userfriendly
interface that they provide. Spreadsheets tie data storage too tightly to the
presentation. It is therefore difficult to obtain other desirable views of the information.
Furthermore it is not possible to query spreadsheets. Also, spreadsheets become unwieldy
when more than three dimensions are to be represented. It is difficult to imagine a
spreadsheet with millions of rows or with thousands of formulas. Even with small
spreadsheets, formulas often have errors. An error-free spreadsheet with thousands of
formulas would therefore be very difficult to build. Data cubes essentially generalize
spreadsheets to any number of dimensions.
OLAP is the dynamic enterprise analysis required to create, manipulate, animate
and synthesize information from exegetical, contemplative and formulaic data analysis
models.
Essentially what this definition means is that the information is manipulated from the
point if view of a manager (exegetical), from the point of view of someone who has
thought about it(contemplative) and according to some formula(formulaic).
Another definition of OLAP, which is software technology that enables analysts,
managers and executives to gain insight into data through fast, consistent, interactive
access to a wide variety of possible views of information that, has been transformed from
raw data to reflect that real dimensional of the enterprise as understood by the user.
An even simpler definition is that OLAP is a fast analysis of shared
multidimensional information for advanced analysis. This definition (sometimes called
FASMI) implies that most OLAP queries should be answered within seconds.
Furthermore, it is expected that most OLAP queries can be answered without any
programming.
In summary, a manager would want even the most complex query to be answered
quickly; OLAP is usually a multi-user system that may be run on a separate server using
specialized OLAP software. The major OLAP applications are trend analysis over a
number of time periods, slicing, dicing, drill-down and roll-up to look at different levels
of detail and pivoting or rotating to obtain a new multidimensional view.
2.3 CHARACTERISTICS OF OLAP SYSTEMS
The following are the differences between OLAP and OLTP systems.
1. Users: OLTP systems are designed for office workers while the OLAP systems are
designed for decision makers. Therefore while an OLTP system may be accessed by
hundreds or even thousands of users in a large enterprise, an OLAP system is likely to be
accessed only by a select group of managers and may be used only by dozens of users.
2. Functions: OLTP systems are mission-critical. They support day-to-day operations of
an enterprise and are mostly performance and availability driven. These systems carry out
simple repetitive operations. OLAP systems are management-critical to support decision
of an enterprise support functions using analytical investigations. They are more
functionality driven. These are ad hoc and often much more complex operations.
3. Nature: Although SQL queries often return a set of records, OLTP systems are
designed to process one record at a time, for example a record related to the customer
who might be on the phone or in the store. OLAP systems are not designed to deal with
individual customer records. Instead they involve queries that deal with many records at a
time and provide summary or aggregate data to a manager. OLAP applications involve
data stored in a data warehouse that has been extracted from many tables and perhaps
from more than one enterprise database.
4. Design: OLTP database systems are designed to be application-oriented while OLAP
systems are designed to be subject-oriented. OLTP systems view the enterprise data as a
collection of tables (perhaps based on an entity-relationship model). OLAP systems view
enterprise information as multidimensional).
5. Data: OLTP systems normally deal only with the current status of information. For
example, information about an employee who left three years ago may not be available

on the Human Resources System. The old information may have been achieved on some
type of stable storage media and may not be accessible online. On the other hand, OLAP
systems require historical data over several years since trends are often important in
decision making.
6. Kind of use: OLTP systems are used for reading and writing operations while OLAP
systems normally do not update the data.
The differences between OLTP and OLAP systems are:
Property OLTP OLAP
Nature of users Operations workers Decision makers
Functions Mission-critical Management-critical
Nature of queries Mostly simple Mostly complex
Nature of usage Mostly repetitive Mostly ad hoc
Nature of design Application oriented Subject oriented
Number of users Thousands Dozens
Nature of data Current, detailed, relational Historical, summarized,
multidimensional
Updates All the time Usually not allowed
Table 8.1 Comparison of OLTP and OLAP system
FASMI Characteristics
In the FASMI characteristics of OLAP systems, the name derived from the first letters of
the characteristics are:
Fast: As noted earlier, most OLAP queries should be answered very quickly,
perhaps within seconds. The performance of an OLAP system has to be like that of a
search engine. If the response takes more than say 20 seconds, the user is likely to move
away to something else assuming there is a problem with the query. Achieving such
performance is difficult. The data structures must be efficient. The hardware must be
powerful enough for the amount of data and the number of users. Full pre-computation of
aggregates helps but is often not practical due to the large number of aggregates. One
approach is to pre-compute the most commonly queried aggregates and compute the
remaining on-the-fly.
Analytic: An OLAP system must provide rich analytic functionality and it is
expected that most OLAP queries can be answered without any programming. The
system should be able to cope with any relevant queries for the application and the user.
Often the analysis will be using the vendor’s own tools although OLAP software
capabilities differ widely between products in the market.
Shared: An OLAP system is shared resource although it is unlikely to be shared
by hundreds of users. An OLAP system is likely to be accessed only by a select group of
managers and may be used merely by dozens of users. Being a shared system, an OLAP
system should be provide adequate security for confidentiality as well as integrity.
Multidimensional: This is the basic requirement. Whatever OLAP software is
being used, it must provide a multidimensional conceptual view of the data. It is because
of the multidimensional view of data that we often refer to the data as a cube. A
dimension often has hierarchies that show parent / child relationships between the
members of a dimension. The multidimensional structure should allow such hierarchies.
Information: OLAP systems usually obtain information from a data warehouse.
The system should be able to handle a large amount of input data. The capacity of an
OLAP system to handle information and its integration with the data warehouse may be
critical.
Codd’s OLAP Characteristics
Codd et al’s 1993 paper listed 12 characteristics (or rules) OLAP systems. Another six in
1995 followed these. Codd restructured the 18 rules into four groups. These rules provide
another point of view on what constitutes an OLAP system.
All the 18 rules are available at http://www.olapreport.com/fasmi.htm. Here we
discuss 10 characteristics, that are most important.
1. Multidimensional conceptual view: As noted above, this is central characteristic of an
OLAP system. By requiring a multidimensional view, it is possible to carry out
operations like slice and dice.
2. Accessibility (OLAP as a mediator): The OLAP software should be sitting between
data sources (e.g data warehouse) and an OLAP front-end.
3. Batch extraction vs interpretive: An OLAP system should provide multidimensional
data staging plus precalculation of aggregates in large multidimensional databases.
4. Multi-user support: Since the OLAP system is shared, the OLAP software should
provide many normal database operations including retrieval, update, concurrency
control, integrity and security.
5. Storing OLAP results: OLAP results data should be kept separate from source data.
Read-write OLAP applications should not be implemented directly on live transaction
data if OLAP source systems are supplying information to the OLAP system directly.
6. Extraction of missing values: The OLAP system should distinguish missing values
from zero values. A large data cube may have a large number of zeros as well as some
missing values. If a distinction is not made between zero values and missing values, the
aggregates are likely to be computed incorrectly.
7. Treatment of missing values: An OLAP system should ignore all missing values
regardless of their source. Correct aggregate values will be computed once the missing
values are ignored.
8. Uniform reporting performance: Increasing the number of dimensions or database
size should not significantly degrade the reporting performance of the OLAP system.
This is a good objective although it may be difficult to achieve in practice.
9. Generic dimensionality: An OLAP system should treat each dimension as equivalent
in both is structure and operational capabilities. Additional operational capabilities may
be granted to selected dimensions but such additional functions should be grantable to
any dimension.
10. Unlimited dimensions and aggregation levels: An OLAP system should allow
unlimited dimensions and aggregation levels. In practice, the number of dimensions is
rarely more than 10 and the number of hierarchies rarely more than six.
MOTIVATIONS FOR USING OLAP
1. Understanding and improving sales: For an enterprise that has many products
and uses a number of channels for selling the products, OLAP can assist in
finding the most popular products and the most popular channels. In some cases it
may be possible to find the most profitable customers. For example, considering
the telecommunications industry and only considering one product,
communication minutes, there is a large amount of data if a company wanted to
analyze the sales of product for every hour of the day (24 hours), differentiate
between weekdays and weekends (2 values) and divide regions to which calls are
made into 50 regions.
2. Understanding and reducing costs of doing business: Improving sales is one
aspect of improving a business, the other aspect is to analyze costs and to control
them as much as possible without affecting sales. OLAP can assist in analyzing
the costs associated with sales. In some cases, it may also be possible to identify
expenditures that produce a high return on investment (ROI). For example,
recruiting a top salesperson may involve significant costs, but the revenue
generated by the salesperson may justify the investment.
2.3 MULTIDIMENSIONAL VIEW AND DATA CUBE
Senior
Executive
V-C, Deans
Department & Faculty
Management, Heads
Daily operations Registrar,
HR, Finance
Figure 2.1 A typical University management hierarchy
The multidimensional view of data is in some ways natural view of any enterprise of
managers. The triangle diagram in Figure 8.1 shows that as we go higher in the triangle
hierarchy the managers need for detailed information declines.
The multidimensional view of data by using an example of a simple OLTP
database consists of the three tables. Much of the literature on OLAP uses examples of a
shoe store selling different colour shoes of different styles.
It should be noted that the relation enrolment would normally not be required
since the degree a student is enrolled in could be included in the relation student but some
students are enrolled in double degrees and so the relation between the student and the
degree is multifold and hence the need for the relation enrolment.
student(Student_id, Student_name, Country, DOB, Address)
enrolment(Student_id, Degree_id, SSemester)
degree(Degree_id, Degree_name, Degree_length, Fee, Department)
An example of the first relation, i.e. student, is given in Table 2.2
Student_id Student_name Country DOB Address
8656789 Peta Williams Australia 1/1/1980 Davis Hall
8700020 John Smith Canada 2/2/1981 9 Davis Hall
8900020 Arun Krishna USA 3/3/1983 90 Second Hall
8801234 Peter Chew UK 4/4/1983 88 Long Hall
8654321 Reena Rani Australia 5/5/1984 88 Long Hall
8712374 Kathy Garcia Malaysia 6/6/1980 88 Long Hall
8612345 Chris Watanabe Singapore 7/7/1981 11 Main street
8744223 Lars Anderssen Sweden 8/8/1982 Null
8977665 Sachin Singh UAE 9/9/1983 Null
9234567 Rahul Kumar India 10/10/1984 Null
9176543 Saurav Gupta UK 11/11/1985 1, Captain Drive
Table 8.3 presents an example of the relation enrolment. In this table, the attribute
SSemester in the semester in which the student started the current degree. We code it by
using the year followed by 01 for the first semester and 02 for the second. We assume
that new students are admitted in each semester. Table 8.4 is an example of the relation
degree. In this table, the degree length is given in terms of the number of semester it
normally takes to finish it. The fee is assumed to be in thousands of dollars per year.
Table 2.3 The relation enrolment
Student_id Degree_id SSemester
8900020 1256 2002-01
8700074 3271 2002-01
8700074 3321 2002-02
8900020 4444 2000-01
8801234 1256 2000-01
8801234 3321 1999-02
8801234 3333 1999-02
8977665 3333 2000-02
Table 2.4 The relation degree
Degree_id Degree_name Degree_length Fee Department
1256 BIT 6 18 Computer Sci.
2345 BSc 6 20 Computer Sci
4325 BSc 6 20 Chemistry
3271 BSc 6 20 Physics
3321 BCom 6 16 Business
4444 MBBS 12 30 Medicine
3333 LLB 8 22 Law
It is clear that the information given in Tables 8.2, 8.3 and 8.4, although suitable for a
student enrolment OLTP system, is not suitable for efficient management decision
making. The managers do not need information about the individual students, the degree
they are enrolled in, and the semester they joined the university. What the managers need
is the trends in student numbers in different degree programs and from different
countries.
We first consider only two dimensions. Let us say we are primarily interested in finding
out how many students from each country came to do a particular degree. Therefore we
may visualize the data as two-dimensional, i.e.,
Country x Degree
A table that summarizes this type of information may be represented by a twodimensional
spreadsheet given in Table 8.5 (the numbers in Table 8.5 do not need relate
to the numbers in Table 8.3). We may call that this table gives the number of students
admitted (in say, 2000-01) a two-dimensional “cube”.
Table 2.5 A two-dimensional table of aggregates for semester 2000-01
Country \ Degree BSc LLB MBBS BCom BIT ALL
Australia 5 20 15 50 11 101
India 10 0 15 25 17 67
Malaysia 5 1 10 12 23 51
Singapore 2 2 10 10 31 55
Sweden 5 0 5 25 7 42
UK 5 15 20 20 13 73
USA 0 2 20 15 19 56
ALL 32 40 95 157 121 445
Using this two-dimensional view we are able to find the number of students joining any
degree from any country (only for semester 2000-01). Other queries that we are quickly
able to answer are:
・ How many students started BIT in 2000-01?
・ How many students joined from Singapore in 2000-01?
The data given in Table 8.6 is for a particular semester, 2000-01. A similar table would
be available for other semesters. Let us assume that the data for 2001-01 is given in Table
8.7.
Table 2.6 A two-dimensional table of aggregates for semester 2001-01
Country \ Degree BSc LLB MBBS BCom BIT ALL
Australia 7 10 16 53 10 96
India 9 0 17 22 13 61
Malaysia 5 1 19 19 20 64
Singapore 2 2 10 12 23 49
Sweden 8 0 5 16 7 36
UK 4 13 20 26 11 74
USA 4 2 10 10 12 38
ALL 39 28 158 158 96 418
Let us now imagine that Table 8.6 is put on top of Table 8.5. We now have a threedimensional
cube with SSemester as the vertical dimension. We now put on top of these
two tables another table that gives the vertical sums, as shown in Table 8.7.
Table 2.7 Two-dimensional table of aggregates for both semesters
Country \ Degree BSc LLB MBBS BCom BIT ALL
Australia 12 30 31 103 21 197
India 19 0 32 47 30 128
Malaysia 10 2 29 31 43 115
Singapore 4 4 20 22 54 104
Sweden 13 0 10 41 14 78
UK 9 28 40 46 24 147
USA 4 4 30 25 31 94
ALL 71 68 192 315 217 863
Tables 8.5, 8.6 and 8.7 together now form a three-dimensional cube. The table 8.7
provides totals for the two semesters and we are able to “drill-down” to find numbers in
individual semesters. Note that a cube does not need to have an equal number of
members in each dimension. Putting the three tables together gives a cube of 8 x 6 x 3 ( =
144) cells including the totals along every dimension.
A cube could be represented by:
Country x Degree x Semester
Figure 2.2 The cube formed by Tables 8.6, 8.7 and 8.8
In the three-dimensional cube, the following eight types of aggregations or queries are
possible:
1. null (e.g. how many students are there? Only 1 possible query)
2. degrees (e.g. how many students are doing BSc? 5 possible queries if we assume
5 different degrees)
3. semester (e.g. how many students entered in semester 2000-01? 2 possible queries
if we only have data about 2 semesters)
4. country (e.g. how many students are from the USA? 7 possible queries if there are
7 countries)
5. degrees, semester (e.g. how many students entered in 2000-01 to enroll in BCom?
With 5 degrees and 2 different semesters 10 queries)
6. (ALL, b, c) semester, country (e.g. how many students from the UK entered in
2000-01? 14 queries)
7. (a, b, ALL) degrees, country (e.g. how many students from Singapore are enrolled
in BCom? 35 queries)
8. (a, b, c) all (e.g. how many students from Malaysia entered in 2000-01 to enroll in
BCom? 70 queries)
2.4 DATA CUBE IMPLEMENTATIONS
1. Pre-compute and store all: This means that millions of aggregates will need to be
computed and stored. Although this is the best solution as far as query response
time is concerned, the solution is impractical since resources required to compute
the aggregates and to store them will be prohibitively large for a large data cube.
Indexing large amounts of data is also expensive.
2. Pre-compute (and store) none: This means that the aggregates are computed onthe-
fly using the raw data whenever a query is posed. This approach does not
require additional space for storing the cube but the query response time is likely
to be very poor for large data cubes.
3. Pre-compute and store some: This means that we pre-compute and store the
most frequently queried aggregates and compute others as the need arises. We
may also be able to derive some of the remaining aggregates using the aggregates
that have already been computed. It may therefore be worthwhile also to preData
Warehousing & DataMinig 
compute some aggregates that are not most frequently queried but help in deriving
many other aggregates. It will of course not be possible to derive all the
aggregates from the pre-computed aggregates and it will be necessary to access
the database (e.g the data warehouse) to compute the remaining aggregates. The
more aggregates we are able to pre-compute the better the query performance.
It can be shown that large numbers of cells do have an “ALL” value and may therefore be
derived from other aggregates. Let us reproduce the list of queries we had and define
them as (a, b, c) where a stands for a value of the degree dimension, b for country and c
for starting semester:
1. (ALL, ALL, ALL) null (e.g. how many students are there? Only 1 query)
2. (a, ALL, ALL) degrees (e.g. how many students are doing BSc? 5 queries)
3. (ALL, ALL, c) semester (e.g. how many students entered in semester 2000-01? 2
queries)
4. (ALL, b, ALL) country (e.g. how many students are from the USA? 7 queries)
5. (a, ALL, c) degrees, semester (e.g. how many students entered in 2000-01 to
enroll in BCom? 10 queries)
6. (ALL, b, c) semester, country (e.g. how many students from the UK entered in
2000-01? 14 queries)
7. (a, b, ALL) degrees, country (e.g. how many students from Singapore are enrolled
in BCom? 35 queries)
8. (a, b, c) all (e.g. how many students from Malaysia entered in 2000-01 to enroll in
BCom? 70 queries)
It is therefore possible to derive the other 74 of the 144 queries from the last 70
queries of type (a, b, c). Of course in a very large data cube, it may not be practical
even to pre-compute all the (a, b, c) queries and decision will need to be made which
ones should be pre-computed given that storage availability may be limited and it
may be required to minimize the average query cost.
In Figure 8.3 we show how the aggregated above are related and how an aggregate at the
higher level may be computed from the aggregates below. For example, aggregates
(ALL, ALL, c) may be derived from either (a, ALL, c) by summing over all a values
from (ALL, b, c) by summing over all b values.
ALL, ALL, ALL
ALL, b, ALL ALL, ALL, c
ALL, b, c a, b, ALL
a, b, c
Figure 2.3 Relationships between aggregations of a three-dimensional cube
Another related issue is where the data used by OLAP will reside. We assume that the
data is stored in a data warehouse or in one or more data marts.
Data cube products use different techniques for pre-computing aggregates and
storing them. They are generally based on one of two implementation models. The first
model, supported by vendors of traditional relational model databases, is called the
ROLAP model or the Relational OLAP model. The second model is called the MOLAP
model for multidimensional OLAP. The MLOAP model provides a direct
a, ALL, ALL
a, ALL, c
multidimensional view of the data whereas the RLOAP model provides a relational view
of the multidimensional data in the form of a fact table.
ROLAP
ROLAP uses a relational DBMS to implement an OLAP environment. It may be
considered a bottom-up approach which is typically based on using a data warehouse that
has been designed using a star schema. The data therefore is likely to be in a
denormalized structure. A normalized database avoids redundancy but is usually not
appropriate for high performance. The summary data will be held in aggregate tables.
The data warehouse provides the multidimensional capabilities by representing data in
fact table(s) and dimension tables. The fact table contains one column for each dimension
and one column for each measure and every row of the table [rovides one fact. A fact
then is represented as (BSc, India, 2001-01) with the last column as 30. An OLAP tool is
then provided to manipulate the data in these data warehouse tables. This tool essentially
groups the fact table to find aggregates and uses some of the aggregates already
computed to find new aggregates.
The advantage of using ROLAP is that it is more easily used with existing relational
DBMS and the data can be stored efficiently using tables since no zero facts need to be
stored. The disadvantage of the ROLAP model is its poor query performance. Proponents
of the MLOAP model have called the ROLAP model SLOWLAP. Some products in this
category are Oracle OLAP mode, OLAP Discoverer, MicroStrategy and Microsoft
Analysis Services.
MOLAP
MOLAP is based on using a multidimensional DBMS rather than a data warehouse to
store and access data. It may be considered as a top-down approach to OLAP. The
multidimensional database systems do not have a standard approach to storing and
maintaining their data. They often use special-purpose file systems or indexes that store
pre-computation of all aggregations in the cube. For example, in ROLAP a cell was
represented as (BSc, India, 2001-01) with a value 30 stored in the last column. In
MOLAP, the same information is stored as 30 and the storage location implicitly gives
the values of the dimensions. The dimension values do not need to be stored since all the
values of the cube could be stored in an array in a predefined way. For example, the cube
in Figure 8.2 may be represented as an array like the following:
12 30 31 10 21 19 19 0 32 47 30 12 10 2 29 31 43 …
3 7 8
If the values of the dimensions are known, we can infer the cell location in the array. If
the cell location is known, the values of the dimension may be inferred. This is obviously
a very compact notation for storing a multidimensional data cube although a couple of
problems remain. Firstly the array is likely to be too large to be stored in the main
memory. Secondly, this representation does not solve the problem of efficiently
representing sparse cubes. To overcome the problem of the array being too large for main
memory, the array may be split into pieces called ‘chunks’, each of which is small
enough to fit in the main memory. To overcome the problem of sparseness, the “chunks”
may be compressed.
MOLAP systems have to deal with sparsity since a very percentage of the cells can be
empty in some applications. The MOLAP implementation is usually exceptionally
efficient. The disadvantage of using MOLAP is that it is likely to be more expensive than
OLAP, the data is not always current, and it may be more difficult to scale a MOLAP
system for very large OLAP problems. Some MOLAP products are Hyperion Essbase
and Applix iTM1. Oracle and Microsoft are also competing in this segment of the OLAP
market.
The differences between ROLAP and MOLAP are summarized in Table 8.8
Table 2.8 Comparison of MOLAP and ROLAP
Property MOLAP ROLAP
Data structure Multidimensional database Relational tables (each cell is a row)
using sparse arrays
Disk space Separate database for data May not require any space other than
cube; large for large data that available in the data warehouse
cubes
Retrieval Fast(pre-computed) Slow(computes on-the-fly)
Scalability Limited (cubes can be very Excellent
large)
Best suited for Inexperienced users, limited Experienced users, queries change
set of queries frequently
DBMS Usually weak Usually very strong
facilities
2.5 DATA CUBE OPERATIONS
A number of operations may be applied to data cubes. The common ones are:
・ Roll-p
・ Drill-down
・ Slice and dice
・ Pivot
Roll-up
Roll-up is like zooming out on the data cube. It is required when the user needs further
abstraction or less detail. This operation performs further aggregations on the data, for
example, from single degree programs to all programs offered by a School or department,
from single countries to a collection of countries, and from individual semesters to
academic years. Often a hierarchy defined on a dimension is useful in the roll-up
operation as suggested by the example of countries and regions.
We provide an example of roll-up based on Table =s 8.6, 8.7 and 8.8. We first
define hierarchies on two dimensions. Amongst countries, let us define:
1. Asia (India, Malaysia, Singapore)
2. Europe (Sweden, UK)
3. Rest (Australia, USA)
Another hierarchy is defined on the dimension degree:
1. Science (BSc, BIT)
2. Medicine (MBBS)
3. Business and Law (BCom, LLB)
The result of a roll-up for both semesters together from Table 8.8 then is given in Table
8.9.
Table 2.9 Result of a roll-up operation using Table 8.7
Country \ Degree Science Medicine Business and Law
Asia 160 81 106
Europe 60 50 115
Rest 68 61 162
Drill-down
Drill-down is like zooming in on the data and is therefore the reverse of roll-up. It is an
appropriate operation when the user needs further details or when the user wants to
partition more finely or wants to focus on some particular values of certain dimensions.
Drill-down adds more details to the data. Hierarchy defined on a dimension may be
involved in drill-down. For example, a higher level views of student data, for example in
Table 8.9, gives student numbers for the two semesters for groups of countries and
groups of degrees. If one is interested in more detail then it is possible to drill-down to
tables 8.6 and 8.7 for student numbers in each of the semesters for each country and for
each degree.
Slice and dice
Slice and dice are operations for browsing the data in the cube. The terms refer to the
ability to look at information from different viewpoints.
A slice is a subset of the cube corresponding to a single value for one or more
members of the dimensions. For example, a slice operation is performed when the user
wants a selection on one dimension of a three-dimensional cube resulting in a twodimensional
site. Let the degree dimension be fixed as degree = BIT. The slice will not
include any information about other degrees. The information retrieved therefore is more
like a two-dimensional cube for degree = BIT as shown in Table 8.10.
Table 2.10 Result of a slice when degree value is ‘BIT’
Country \ Semester 2000-01 2000-02 2001-01 2001-02
Australia 11 5 10 2
India 17 0 13 5
Malaysia 23 2 20 1
Singapore 31 4 23 2
Sweden 7 0 7 4
UK 13 8 11 6
USA 19 4 12 5
It should be noted that Table 8.7 also is a slice (with SSemester = ‘2000-01’) from the
cube built by piling several tables like Tables 8.7 and 8.8 about different semesters on top
of each other. It is shown in Figure 8.4.
The dice operation is similar to slice but dicing does not involve reducing the number of
dimensions. A dice is obtained by performing a selection on two or more dimensions. For
example, one may only be interested in degrees BIT and BCom and countries Australia,
India, and Malaysia for semesters 2000-01 and 2000-01. The result is a three-dimensional
cube and we show it by Table 8.11 and 8.12 placed on top of each other. For example one
may only be interested in the degrees BIT and BCom and the countries, Australia, India
and Malaysia for semesters 2000-01 and 2001-01. The result is a three-dimensional cube
as shown in Figure 8.4 and we show it by Tables 8.11 and 8.12 for the two semesters
placed on top of each other. We have left out the table that shows the totals from these
tables.
Figure 2.4 A slice from the cube in Figure 8.2
Table 2.11 A three-dimensional dice from a three-dimensional cube (SSemester 2000-
01)
Country \ Degree BCom BIT
Australia 50 11
India 25 17
Malaysia 12 23
Table 2.12 A three-dimensional dice from a three-dimensional cube (SSemester 2001-
01)
Country \ Degree BCom BIT
Australia 53 10
India 22 13
Malaysia 19 20
A dice may be shown as in Figure 8.5
Figure 2.5 A dice from the cube in Figure 8.2
Slice and dice from often figure in interactive use of an OLAP system in which the user
can navigate through the cube by specifying either one or two dimensions or values of all
three dimensions that are of interest.
Pivot or Rotate
The pivot operation is used when the user wishes to re-orient the view of the data cube. It
may involve swapping the rows and columns, or moving one of the row dimensions into
the column dimension. For example, the cube consisting of Tables 8.6, 8.7 and 8.8 gives
the dimension degree along the x-axis, country along the y-axis and starting semester
along the z-axis (or the vertical axis). One may want to swap the dimensions country and
starting semester. The cube will then consist of several tables like those given in Tables
8.13 and 8.14 on top of each other.
Table 2.13 One table on which other similar tables are piled up in a rotated cube
(Country dimension value = Australia)
Semester \ Degree BSc LLB MBBS BCom BIT ALL
2000-01 5 20 15 50 11 101
2001-01 7 10 16 53 10 96
ALL 12 30 31 103 21 197
Table 2.14 Another table that is part of a rotated cube (Country dimension value = India)
Semester \ Degree BSc LLB MBBS BCom BIT ALL
2000-01 10 0 15 25 17 67
2001-01 9 0 17 22 13 61
ALL 19 0 32 47 30 128
Clearly this rotated cube gives a different view of the same data. Such views can be
particularly useful if the number of dimensions is greater than three.
2.6 GUIDELINES FOR OLAP IMPLEMENTATION
Following are a number of guidelines for successful implementation of OLAP. The
guidelines are, somewhat similar to those presented for data warehouse implementation.
1. Vision: The OLAP team must, in consultation with the users, develop a clear vision for
the OLAP system. This vision including the business objectives should be clearly
defined, understood, and shared by the stakeholders.
2. Senior management support: The OLAP project should be fully supported by the
senior managers. Since a data warehouse may have been developed already, this should
not be difficult.
3. Selecting an OLAP tool: The OLAP team should familiarize themselves with the
ROLAP and MOLAP tools available in the market. Since tools are quite different, careful
planning may be required in selecting a tool that is appropriate for the enterprise. In some
situations, a combination of ROLAP and MOLAP may be most effective.
4. Corporate strategy: The OLAP strategy should fit in with the enterprise strategy and
business objectives. A good fit will result in the OLAP tools being used more widely.
5. Focus on the users: The OLAP project should be focused on the users. Users should,
in consultation with the technical professional, decide what tasks will be done first and
what will be done later. Attempts should be made to provide each user with a tool
suitable for that person’s skill level and information needs. A good GUI user interface
should be provided to non-technical users. The project can only be successful with the
full support of the users.
6. Joint management: The OLAP project must be managed by both the IT and business
professionals. Many other people should be involved in supplying ideas. An appropriate
committee structure may be necessary to channel these ideas.
7. Review and adapt: As noted in last chapter, organizations evolve and so must the
OLAP systems. Regular reviews of the project may be required to ensure that the project
is meeting the current needs of the enterprise.
2.7 OLAP SOFTWARE
There is much OLAP software available in the market.
A list is available at http://www.kdnuggets.com/software/dbolap.html.
Another is available at http://www.olapreport.com/market.htm.
The list below provides some major OLAP software.
・ BI2M (Business Intelligence to Marketing and Management) from B&M Services
has three modules one of which is for OLAP. The OLAP module allows database
exploring including slice and dice, roll-up, drill-down and displays results as 2D
charts, 3D charts and tables.
・ Business Objects OLAP Intelligence from BusinessObjects allows access to
OLAP servers from Microsoft, Hyperion, IBM and SAP. Usual operations like
slice and dice, and drill directly on multidimensional sources are possible.
BusineeOjects also has widely used Crystal Analysis and Reports.
・ ContourCube from Contour Components is an OLAP product that enables users
to slice and dice, roll-up, drill-down and pivot efficiently.
・ DB2 Cube Views from IBM includes features and functions for managing and
deploying multidimensional data. It is claimed that OLAP solutions can be
deployed quickly.
・ Essbase Integration Services from Hyperion Solutions is a widely used suite of
tools. The company’s Web sites make it difficult to understand what the software
does. It 2005 market ranking was 2.
・ Everest from OutlookSoft is a Microsoft-based single application and database
that provides operational reporting and analysis including OLAP and
multidimensional slice and dice and other analysis operations.
・ Executive Suite from CIP-Global is an integrated corporate planning, forecasting,
consolidation and reporting solution based on Microsoft’s SQL server 2000 and
analysis Services Platform.
・ Executive Viewer from Temtec provides users real-time Web access to OLAP
databases such as Microsoft Analysis Services and Hyperion Essbase for
advanced and ad hoc analysis as well as reporting.
・ Express and the Oracle OLAP Option ? Express is a multidimensional database
and application development environment for building OLAP applications. It is
MOLAP. OLAP Analytic workspaces is a porting of the Oracle Express analytic
engine to the Oracle RDBMS kernel which now runs as an OLAP virtual
machine.
・ MicroStrategy 8 from MicroStrategy provides facilities for query, reporting and
advanced analytical needs. It 2005 market ranking was 5.
・ NovaView from Panorama extends the Microsoft platform that integrates
analysis, reporting and performance measurement information into a single
solution.
・ PowerPlay from Cognos is widely used OLAP software that allows users to
analyze large volumes of data with fast response times. Its 2005 market ranking
was 3.
・ SQL Server 2000 Analysis Services from Microsoft. SQL Server 2000 Analysis
Services is the OLAP Services component in SQL Server 7.0

Editors




Join eduladder!