
Data Warehousing & Data Mining [10CS755] Unit-3

Unit 3: DATA MINING

3.1 INTRODUCTION

The complexity of modern society coupled with growing competition due to trade
globalization has fuelled the demand for data mining. Most enterprises have collected
information over at least the last 30 years and they are keen to discover business
intelligence that might be buried in it. Business intelligence may be in the form of
customer profiles which may result in better targeted marketing and other business
actions.
During the 1970s and the 1980s, most Western societies developed a similar set of
privacy principles and most of them enacted legislation to ensure that governments and
the private sector were following good privacy principles. Data mining is a relatively new
technology and privacy principles developed some 20-30 years ago are not particularly
effective in dealing with privacy concerns that are being raised about data mining. These
concerns have been heightened by the dramatically increased use of data mining by
governments as a result of the 9/11 terrorist attacks. A number of groups all over the
world are trying to wrestle with the issues raised by widespread use of data mining
techniques.
3.2 Challenges
The daily use of the word privacy about information sharing and analysis is often vague
and may be misleading. We will therefore provide a definition (or two). Discussions
about the concept of information privacy started in the 1960s when a number of
researchers recognized the dangers of privacy violations by large collections of personal
information in computer systems. Over the years a number of definitions of information
privacy have emerged. One of them defines information privacy as the individual’s
ability to control the circulation of information relating to him/her. Another widely used
definition is the claim of individuals, groups, or institutions to determine for themselves
when, how, and to what extent information about them is communicated to others.
Sometimes privacy is confused with confidentiality and at other times with
security. Privacy does involve confidentiality and security, but it involves more than
these two.
BASIC PRINCIPLES TO PROTECT INFORMATION PRIVACY
During the 1970s and 1980s many countries and organizations (e.g. OECD, 1980)
developed similar basic information privacy principles which were then enshrined in
legislation by many nations. These principles are interrelated and partly overlapping and
should therefore be treated together. The OECD principles are:
1. Collection limitation: There should be limits to the collection of personal data
and any such data should be obtained by lawful and fair means and, where
appropriate, with the knowledge or consent of the data subject.
2. Data quality: Personal data should be relevant to the purposes for which they are
to be used, and, to the extent necessary for those purposes, should be accurate,
complete and kept up to date.
3. Purpose specification: The purpose for which personal data are collected should
be specified not later than at the time of data collection and the subsequent use
limited to the fulfilment of those purposes or such others as are not incompatible
with those purposes and as are specified on each occasion of change of purpose.
4. Use limitation: Personal data should not be disclosed, made available or
otherwise used for purposes other than those specified in accordance with
Principle 3 except with the consent of the data subject or by the authority of law.
5. Security safeguards: Personal data should be protected by reasonable security
safeguards against such risks as loss or unauthorized access, destruction, use,
modification or disclosure of data.
6. Openness: There should be a general policy of openness about developments,
practices and policies with respect to personal data. Means should be readily
available for establishing the existence and nature of personal data, and the main
purposes of their use, as well as the identity and usual residence of the data
controller.
7. Individual participation: An individual should have the right:
(a) to obtain from a data controller, or otherwise, confirmation of whether or
not the data controller has data relating to him;
(b) to have communicated to him, data relating to him
within a reasonable time;
at a charge, if any, that is not excessive;
in a reasonable manner; and
in a form that is readily intelligible to him;
(c) to be given reasons if a request made under subparagraphs (a) and (b) is
denied, and to be able to challenge such denial; and
(d) to challenge data related to him and, if the challenge is successful, to have
the data erased, rectified, completed or amended.
8. Accountability: A data controller should be accountable for complying with
measures which give effect to the principles stated above.
These privacy protection principles were developed for online transaction processing
(OLTP) systems before technologies like data mining became available. In OLTP
systems, the purpose of the system is quite clearly defined since the system is used for a
particular operational purpose of an enterprise (e.g. student enrolment). Given a clear
purpose of the system, it is then possible to adhere to the above principles.
USES AND MISUSES OF DATA MINING
Data mining involves the extraction of implicit, previously unknown and potentially
useful knowledge from large databases. Data mining is a very challenging task since it
involves building and using software that will manage, explore, summarize, model,
analyse and interpret large datasets in order to identify patterns and abnormalities.
Data mining techniques are being used increasingly in a wide variety of
applications. The applications include fraud prevention, detecting tax avoidance, catching
drug smugglers, reducing customer churn and learning more about customers’ behaviour.
There are also some (mis)uses of data mining that have little to do with any of these
applications. For example, a number of newspapers in early 2005 reported results of
analyzing associations between the political party that a person votes for and the car the
person drives. A number of car models have been listed in the USA for each of the two
major political parties.
In the wake of the 9/11 terrorism attacks, considerable use of personal
information, provided by individuals for other purposes as well as information collected
by governments including intercepted emails and telephone conversations, is being made
in the belief that such information processing (including data mining) can assist in
identifying persons who are likely to be involved in terrorist networks or individuals who
might be in contact with such persons or other individuals involved in illegal activities
(e.g. drug smuggling). Under legislation enacted since 9/11, many governments are able
to demand access to most private sector data. This data can include records on travel,
shopping, utilities, credit, telecommunications and so on. Such data can then be mined in
the belief that patterns can be found that will help in identifying terrorists or drug
smugglers.
Consider a very simple artificial example of the data in Table 3.1 being analysed
using a data mining technique like the decision tree:
Table 3.1 A simple data mining example

Birth Country   Age     Religion   Visited X   Studied in West   Risk Class
A               <30     P          Yes         Yes               B
B               >60     Q          Yes         Yes               A
A               <30     R          Yes         No                C
X               30-45   R          No          No                B
Y               46-60   S          Yes         No                C
X               >60     P          Yes         Yes               A
Z               <25     P          No          Yes               B
A               <25     Q          Yes         No                A
B               <25     Q          Yes         No                C
B               30-45   S          Yes         No                C
Using the decision tree to analyse this data may result in rules like the following:

If Age = 30-45 and Birth Country = A and Visited X = Yes and Studied in West = Yes and Religion = R, then Risk Class = A.
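For readers who want to experiment, the following is a minimal sketch (not part of the original notes) of fitting a decision tree to the Table 3.1 data with pandas and scikit-learn; the column names are illustrative assumptions.

    # A minimal sketch: fitting a decision tree to the Table 3.1 data.
    # Assumes pandas and scikit-learn are available; column names are illustrative.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    rows = [
        ("A", "<30", "P", "Yes", "Yes", "B"), ("B", ">60", "Q", "Yes", "Yes", "A"),
        ("A", "<30", "R", "Yes", "No", "C"), ("X", "30-45", "R", "No", "No", "B"),
        ("Y", "46-60", "S", "Yes", "No", "C"), ("X", ">60", "P", "Yes", "Yes", "A"),
        ("Z", "<25", "P", "No", "Yes", "B"), ("A", "<25", "Q", "Yes", "No", "A"),
        ("B", "<25", "Q", "Yes", "No", "C"), ("B", "30-45", "S", "Yes", "No", "C"),
    ]
    cols = ["BirthCountry", "Age", "Religion", "VisitedX", "StudiedInWest", "RiskClass"]
    df = pd.DataFrame(rows, columns=cols)

    X = pd.get_dummies(df.drop(columns="RiskClass"))  # one-hot encode the categoricals
    y = df["RiskClass"]
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))  # readable IF-THEN rules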
User profiles are built based on relevant user characteristics. The number of
characteristics may be large and may include all kinds of information including telephone
zones phoned, travelling on the same flight as a person on a watch list and much more.
User profiling is used in a variety of other areas, for example authorship analysis or
plagiarism detection.
Once a user profile is formed, the basic action of the detection system is to
compare incoming personal data to the profile and make a decision as to whether the data
fit any of the profiles. The comparison can in fact be quite complex because not all of the
large numbers of characteristics in the profile are likely to match but a majority might.
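As a hedged sketch of the idea (the notes do not specify a matching algorithm), majority-style profile matching can be expressed as counting how many profile characteristics an incoming record satisfies; the field names below are invented for illustration.

    # Illustrative majority-match scoring; field names are hypothetical.
    profile = {"visited_x": "Yes", "studied_in_west": "Yes", "religion": "P"}

    def match_fraction(record: dict, profile: dict) -> float:
        hits = sum(record.get(k) == v for k, v in profile.items())
        return hits / len(profile)

    record = {"visited_x": "Yes", "studied_in_west": "No", "religion": "P"}
    print(match_fraction(record, profile))  # ~0.67: most, but not all, fields match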
Such profile matching can lead to faulty inferences. As an example, it was
reported that a person was wrongly arrested just because the person had an Arab name
and obtained a driver license at the same motor vehicle office soon after one of the 9/11
hijackers did. Although this incident was not a result of data mining, it does show that an
innocent person can be mistaken for a terrorist or a drug smuggler as a result of some
matching characteristics.
PRIMARY AIMS OF DATA MINING
Essentially most data mining techniques that we are concerned about are designed to
discover and match profiles. The aims of the majority of such data mining activities are
laudable but the techniques are not always perfect. What happens if a person matches the
profile but does not belong to the category?
Perhaps it is not a matter of great concern if a telecommunications company
labels a person as one that is likely to switch and then decides to target that person with a
special campaign designed to encourage the person to stay. On the other hand, if the
Customs department identifies a person as fitting the profile of a drug smuggler then that
person is likely to undergo a special search whenever he/she returns home from overseas
and perhaps at other airports if the customs department of one country shares information
with other countries. This would be a matter of much more concern to governments.
Knowledge about the classification or profile of an individual who has been so
classified or profiled may lead to disclosure of personal information with some given
probability. The characteristics that someone may be able to deduce about a person with
some probability may include sensitive information, for example, race, religion, travel
history, and level of credit card expenditure.
Data mining is used for many purposes that are beneficial to society, as the list of
some of the common aims of data mining below shows.
• The primary aim of many data mining applications is to understand the customer
better and improve customer services.
• Some applications aim to discover anomalous patterns in order to help identify,
for example, fraud, abuse, waste, terrorist suspects, or drug smugglers.
• In many applications in private enterprises, the primary aim is to improve the
profitability of the enterprise.
• In other applications, the primary purpose is to improve judgement, for example, in
making diagnoses, in resolving crime, in sorting out manufacturing problems, and in
predicting share prices, currency movements or commodity prices.
• In some government applications, one of the aims of data mining is to identify
criminal and fraudulent activities.
• In some situations, data mining is used to find patterns that simply could not be
found without its help, given the huge amount of data that must be processed.
3.3 Data Mining Tasks
Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories:
• Descriptive
• Predictive
Predictive tasks: the objective of these tasks is to predict the value of a particular
attribute (the dependent or target variable) based on the values of other attributes
(the independent or explanatory variables).
Descriptive tasks: the objective is to derive patterns that summarize the underlying
relationships in the data, i.e. to find human-interpretable patterns that describe the data.
There are four core tasks in data mining:
i. Predictive modeling
ii. Association analysis
iii. Cluster analysis
iv. Anomaly detection
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Describe data mining functionalities, and the kinds of patterns they can discover (or) define each
of the following data mining functionalities: characterization, discrimination, association and
correlation analysis, classification, prediction, clustering, and evolution analysis. Give examples
of each data mining functionality, using a real-life database that you are familiar with.
1) Prediction
Finding some missing or unavailable data values rather than class labels is referred to as prediction.
Although prediction may refer to both data value prediction and class label prediction, it is usually
confined to data value prediction and is thus distinct from classification. Prediction also encompasses
the identification of distribution trends based on the available data.
Example:
Predicting flooding is a difficult problem. One approach uses monitors placed at various points along the river.
These monitors collect data relevant to flood prediction: water level, rainfall, time, humidity, etc. The water
level at a potential flooding point in the river can then be predicted based on the data collected by the sensors
upriver from this point. The prediction must be made with respect to the time the data were collected.
Classification:
• It predicts categorical class labels.
• It classifies data (constructs a model) based on a training set and the values (class labels) in a
classifying attribute, and uses the model to classify new data.
• Typical applications:
o credit approval
o target marketing
o medical diagnosis
o treatment effectiveness analysis
Classification can be defined as the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class
of objects whose class label is unknown. The derived model is based on the analysis of a set of training
data (i.e., data objects whose class label is known).
Example:
An airport security screening station is used to determine if passengers are potential terrorists or criminals. To do
this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape of mouth,
shape of head, etc.) is identified. This pattern is compared to entries in a database to see if it matches any patterns
that are associated with known offenders.
A classification model can be represented in various forms, such as
1) IF-THEN rules:
   student(class, "undergraduate") AND concentration(level, "high") ==> class A
   student(class, "undergraduate") AND concentration(level, "low") ==> class B
   student(class, "post graduate") ==> class C
2) Decision tree
3) Neural network
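For concreteness, the three IF-THEN rules above translate directly into ordinary code; this is only an illustrative sketch of how such a rule set is applied.

    # Direct translation of the three IF-THEN rules above (illustrative).
    def classify_student(student_class: str, concentration: str) -> str:
        if student_class == "undergraduate" and concentration == "high":
            return "class A"
        if student_class == "undergraduate" and concentration == "low":
            return "class B"
        if student_class == "post graduate":
            return "class C"
        return "unknown"  # no rule fires

    print(classify_student("undergraduate", "high"))  # -> class A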
Classification vs. Prediction
Classification differs from prediction in that the former is to construct a set of models (or functions) that describe
and distinguish data class or concepts, whereas the latter is to predict some missing or unavailable, and often
numerical, data values. Their similarity is that they are both tools for prediction: Classification is used for predicting
the class label of data objects and prediction is typically used for predicting missing numerical data values.
2). Association Analysis
It is the discovery of association rules showing attribute-value conditions that occur frequently together in a given
set of data. For example, a data mining system may find association rules like
major(X, "computing science") ==> owns(X, "personal computer")
[support = 12%, confidence = 98%]
where X is a variable representing a student. The rule indicates that of the students under study, 12% (support) major
in computing science and own a personal computer. There is a 98% probability (confidence, or certainty) that a
student in this group owns a personal computer.
Example:
A grocery store retailer wants to decide whether to put bread on sale. To help determine the impact of this decision,
the retailer generates association rules that show what other products are frequently purchased with bread. He finds
that 60% of the time that bread is sold, pretzels are also sold, and that 70% of the time jelly is also sold. Based on
these facts, he tries to capitalize on the association between bread, pretzels and jelly by placing some pretzels and
jelly at the end of the aisle where the bread is placed. In addition, he decides not to place either of these items on
sale at the same time.
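Support and confidence can be computed directly from transaction data. A minimal sketch with made-up transactions follows (the numbers here do not reproduce the 60%/70% figures above):

    # Support and confidence for the rule {bread} -> {pretzels},
    # computed over a tiny, made-up transaction set.
    transactions = [
        {"bread", "pretzels", "jelly"},
        {"bread", "pretzels"},
        {"bread", "jelly"},
        {"milk", "eggs"},
        {"bread", "pretzels", "milk"},
    ]

    def support(itemset: set, transactions: list) -> float:
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs: set, rhs: set, transactions: list) -> float:
        return support(lhs | rhs, transactions) / support(lhs, transactions)

    print(support({"bread", "pretzels"}, transactions))      # 0.6
    print(confidence({"bread"}, {"pretzels"}, transactions)) # 0.75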
3). Clustering analysis
Clustering analyzes data objects without consulting a known class label. The objects are clustered or
grouped based on the principle of maximizing the intra-class similarity and minimizing the
interclass similarity. Each cluster that is formed can be viewed as a class of objects.
Example: A certain national department store chain creates special catalogs targeted to various
demographic groups based on attributes such as income, location and physical characteristics of potential
customers (age, height, weight, etc.). To determine the target mailings of the various catalogs and to assist
in the creation of new, more specific catalogs, the company performs a clustering of potential customers
based on the determined attribute values. The results of the clustering exercise are then used by
management to create special catalogs and distribute them to the correct target population based on the
cluster for that catalog.
Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes
that group similar events together.
Classification vs. Clustering
• In general, in classification you have a set of predefined classes and want to know which class a
new object belongs to.
• Clustering tries to group a set of objects and to find whether there is some relationship between the
objects.
• In the context of machine learning, classification is supervised learning and clustering is
unsupervised learning.
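The notes do not prescribe a particular clustering algorithm; as an illustrative sketch, k-means from scikit-learn can group customers on numeric attributes such as income and age (the values below are made up):

    # Clustering (income, age) pairs with k-means; data are invented.
    import numpy as np
    from sklearn.cluster import KMeans

    customers = np.array([
        [25_000, 23], [27_000, 25], [95_000, 47],
        [99_000, 51], [52_000, 34], [55_000, 36],
    ])
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
    print(km.labels_)  # cluster index assigned to each customer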
4). Anomaly Detection
It is the task of identifying observations whose characteristics are significantly different from the
rest of the data. Such observations are called anomalies or outliers. This is useful in fraud
detection and network intrusion detection.
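One simple way to flag such observations (an illustrative sketch, not the only technique) is the z-score rule: mark values that lie far from the mean in units of standard deviation.

    # Flag values more than 2 standard deviations from the mean (illustrative).
    import numpy as np

    values = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 42.0])
    z = (values - values.mean()) / values.std()
    print(values[np.abs(z) > 2])  # [42.] is flagged as an outlier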
3.4 Types of Data
A data set is a collection of data objects and their attributes. A data object is also known as a
record, point, case, sample, entity, or instance. An attribute is a property or characteristic of an
object; it is also known as a variable, field, characteristic, or feature.
3.4.1 Attributes and Measurements
An attribute is a property or characteristic of an object. Attribute is also known as variable,
field, characteristic, or feature. Examples: eye color of a person, temperature, etc. A collection of
attributes describe an object.
Attribute values: attribute values are numbers or symbols assigned to an attribute. There is a
distinction between attributes and attribute values:
• The same attribute can be mapped to different attribute values. Example: height can be
measured in feet or in meters. The way you measure an attribute may not match the attribute's properties.
• Different attributes can be mapped to the same set of values. Example: attribute values for ID
and age are both integers, but the properties of the attributes differ: ID has no limit, whereas age
has a maximum and a minimum value.
Types of an attribute
A simple way to specify the type of an attribute is to identify the properties of numbers that
correspond to underlying properties of the attribute.
Properties of attribute values
The type of an attribute depends on which of the following properties it possesses:
• Distinctness: = ≠
• Order: < >
• Addition: + −
• Multiplication: × /
There are different types of attributes:
• Nominal. Examples: ID numbers, eye color, zip codes.
• Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in
{tall, medium, short}.
• Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio. Examples: temperature in Kelvin, length, time, counts.
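The interval/ratio distinction matters in practice: ratios are meaningful on a Kelvin scale (which has a true zero) but not on a Celsius scale. A small worked check (illustrative):

    # Ratios are meaningful only on scales with a true zero (illustrative).
    c1, c2 = 10.0, 20.0                # Celsius readings (interval scale)
    k1, k2 = c1 + 273.15, c2 + 273.15  # same temperatures in Kelvin (ratio scale)

    print(c2 / c1)  # 2.0 -- but 20 C is NOT "twice as hot" as 10 C
    print(k2 / k1)  # ~1.035 -- the physically meaningful ratio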
3.4.2 Describing attributes by the number of values
Discrete attribute: has only a finite or countably infinite set of values. Examples: zip codes,
counts, or the set of words in a collection of documents. Discrete attributes are often represented
as integer variables; binary attributes are a special case of discrete attributes.
Continuous attribute: has real numbers as attribute values. Examples: temperature, height, or
weight. In practice, real values can only be measured and represented using a finite number of
digits; continuous attributes are typically represented as floating-point variables.
Asymmetric attribute: an attribute for which only non-zero values are regarded as important;
presence (a non-zero value) carries information, absence does not.
A preliminary investigation of the data, undertaken to better understand its specific characteristics,
can help to answer some of the data mining questions:
• It helps in selecting pre-processing tools.
• It helps in selecting appropriate data mining algorithms.
Things to look at include class balance, dispersion of attribute values, skewness, outliers,
missing values, and attributes that vary together. Visualization tools such as histograms, box
plots and scatter plots are important here. Many data sets have a discrete (binary) class attribute,
and data mining algorithms may give poor results due to the class-imbalance problem, so identify
this problem in an initial phase.
General characteristics of data sets:
• Dimensionality: the number of attributes that the objects in the data set possess. The "curse of
dimensionality" refers to the difficulties that arise when analysing high-dimensional data.
• Sparsity: in data sets with asymmetric features, most attribute values of an object are 0, and
only a few are non-zero.
• Resolution: it is possible to obtain the data at different levels of resolution.
There are many varieties of data sets; some common ones are:
1. Record
   • Data Matrix
   • Document Data
   • Transaction Data
2. Graph
   • World Wide Web
   • Molecular Structures
3. Ordered
   • Spatial Data
   • Temporal Data
   • Sequential Data
   • Genetic Sequence Data
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes.
Transaction or Market-Basket Data
A special type of record data, where each transaction (record) involves a set of items. For
example, consider a grocery store. The set of products purchased by a customer during one
shopping trip constitutes a transaction, while the individual products that were purchased are the
items.
Transaction data is a collection of sets of items, but it can be viewed as a set of records whose
fields are asymmetric attributes.
Transaction data can be represented as a sparse data matrix (the market-basket representation,
sketched below):
• Each record (line) represents a transaction.
• Attributes are binary and asymmetric.
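The sketch below (illustrative, with invented transactions) builds such a sparse binary matrix with pandas; only the 1s, the items actually purchased, carry information.

    # Market-basket view: rows are transactions, columns are items (illustrative).
    import pandas as pd

    transactions = [
        {"bread", "milk"},
        {"bread", "diapers", "beer", "eggs"},
        {"milk", "diapers", "beer", "cola"},
    ]
    items = sorted(set().union(*transactions))
    matrix = pd.DataFrame(
        [[int(i in t) for i in items] for t in transactions], columns=items
    )
    print(matrix)  # binary, asymmetric attributes: only the 1s matter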
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought
of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
Such a data set can be represented by an m-by-n matrix, where there are m rows, one for each
object, and n columns, one for each attribute. This matrix is called a data matrix; it holds only
numeric values in its cells.
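A minimal sketch of a data matrix with NumPy (values are invented): each row is an object, each column an attribute, and rows can be treated as points in attribute space.

    # An m-by-n data matrix: m objects (rows), n numeric attributes (columns).
    import numpy as np

    data = np.array([
        [1.5, 68.0, 23],   # object 1: height, weight, age (invented values)
        [1.7, 72.5, 31],   # object 2
        [1.6, 70.1, 27],   # object 3
    ])
    print(data.shape)                          # (3, 3): m = 3 objects, n = 3 attributes
    print(np.linalg.norm(data[0] - data[1]))   # distance between two points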
The Sparse Data Matrix
It is a special case of a data matrix in which the attributes are of the same type and are
asymmetric, i.e. only non-zero values are important.
Document Data
Each document becomes a 'term' vector: each term is a component (attribute) of the vector, and
the value of each component is the number of times the corresponding term occurs in the
document.
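A hedged sketch of building term vectors from two toy documents, using only the standard library:

    # Each document becomes a vector of term counts (illustrative).
    from collections import Counter

    docs = ["data mining finds patterns in data", "graph mining mines graphs"]
    vocab = sorted({w for d in docs for w in d.split()})
    vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
    print(vocab)
    print(vectors)  # one term vector per document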
Graph-based data
In general, the data can take many forms from a single, time-varying real number to a complex
interconnection of entities and relationships. While graphs can represent this entire spectrum of
data, they are typically used when relationships are crucial to the domain. Graph-based data
mining is the extraction of novel and useful knowledge from a graph representation of data.
Graph mining uses the natural structure of the application domain and mines directly over that
structure. The most natural form of knowledge that can be extracted from graphs is also a graph.
Therefore, the knowledge, sometimes referred to as patterns, mined from the data are typically
expressed as graphs, which may be sub-graphs of the graphical data, or more abstract
expressions of the trends reflected in the data. The need for mining structural data to uncover
objects or concepts that relate objects (i.e., sub-graphs that represent associations of features)
has increased in the past ten years; it involves the automatic extraction of novel and useful
knowledge from a graph representation of data. An early graph-based knowledge discovery system
finds structural, relational patterns in data representing entities and relationships; this algorithm
was the first proposal on the topic and has been extended considerably over the years. It is able to
perform graph shrinking as well as frequent substructure extraction and hierarchical conceptual
clustering.
A graph is a pair G = (V, E) where V is a set of vertices and E is a set of edges. Edges connect
one vertex to another and can be represented as pairs of vertices. Typically each edge in a
graph is given a label, and edges can also be associated with a weight.
We denote the vertex set of a graph g by V(g) and the edge set by E(g). A label function, L,
maps a vertex or an edge to a label. A graph g is a sub-graph of another graph g' if there exists a
sub-graph isomorphism from g to g'. Frequent graph: given a labeled graph dataset D = {G1,
G2, . . . , Gn}, support(g) (or frequency(g)) is the percentage (or number) of graphs in D of which g
is a sub-graph. A frequent (sub-)graph is a graph whose support is no less than a minimum
support threshold, min_support.
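To make support(g) concrete, the sketch below uses networkx's sub-graph isomorphism matcher over a tiny, invented dataset of labelled graphs; the node labels (C, O, H) are arbitrary placeholders.

    # support(g) over a tiny labelled-graph dataset D (illustrative sketch).
    import networkx as nx
    from networkx.algorithms import isomorphism

    def make_graph(labeled_edges):
        g = nx.Graph()
        for (u, lu), (v, lv) in labeled_edges:
            g.add_node(u, label=lu)
            g.add_node(v, label=lv)
            g.add_edge(u, v)
        return g

    D = [
        make_graph([((1, "C"), (2, "O")), ((1, "C"), (3, "H"))]),
        make_graph([((1, "C"), (2, "H"))]),
        make_graph([((1, "C"), (2, "O"))]),
    ]
    g = make_graph([((1, "C"), (2, "O"))])  # candidate sub-graph

    node_match = isomorphism.categorical_node_match("label", None)

    def support(g, D):
        hits = sum(
            isomorphism.GraphMatcher(G, g, node_match=node_match).subgraph_is_isomorphic()
            for G in D
        )
        return hits / len(D)

    print(support(g, D))  # 2/3: g occurs as a sub-graph in two of the three graphs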
Spatial data
Also known as geospatial data or geographic information, it is the data or information that
identifies the geographic location of features and boundaries on Earth, such as natural or
constructed features, oceans, and more. Spatial data is usually stored as coordinates and
topology, and is data that can be mapped. Spatial data is often accessed, manipulated or analyzed
through Geographic Information Systems (GIS).
Measurements in spatial data types: In the planar, or flat-earth, system, measurements of
distances and areas are given in the same unit of measurement as coordinates. Using the
geometry data type, the distance between (2, 2) and (5, 6) is 5 units, regardless of the units used.
In the ellipsoidal or round-earth system, coordinates are given in degrees of latitude and
longitude. However, lengths and areas are usually measured in meters and square meters, though
the measurement may depend on the spatial reference identifier (SRID) of the geography
instance. The most common unit of measurement for the geography data type is meters.
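The contrast between the planar and round-earth systems can be seen in code; the sketch below compares Euclidean distance with the haversine great-circle formula (which assumes a spherical earth of radius about 6371 km), using invented coordinates.

    # Planar (Euclidean) vs. round-earth (haversine) distance (illustrative).
    import math

    def euclidean(p, q):
        return math.hypot(q[0] - p[0], q[1] - p[1])

    def haversine(lat1, lon1, lat2, lon2, r_km=6371.0):
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
        return 2 * r_km * math.asin(math.sqrt(a))

    print(euclidean((2, 2), (5, 6)))              # 5.0, the planar example above
    print(haversine(12.97, 77.59, 13.00, 77.70))  # ~12 km along the earth's surface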
Orientation of spatial data: In the planar system, the ring orientation of a polygon is not an
important factor. For example, a polygon described by ((0, 0), (10, 0), (0, 20), (0, 0)) is the same
as a polygon described by ((0, 0), (0, 20), (10, 0), (0, 0)). The OGC Simple Features for SQL
Specification does not dictate a ring ordering, and SQL Server does not enforce ring ordering.
Time Series Data
A time series is a sequence of observations which are ordered in time (or space). If observations
are made on some phenomenon throughout time, it is most sensible to display the data in the
order in which they arose, particularly since successive observations will probably be dependent.
Time series are best displayed in a scatter plot. The series value X is plotted on the vertical axis
and time t on the horizontal axis. Time is called the independent variable (in this case however,
something over which you have little control). There are two kinds of time series data:
1. Continuous, where we have an observation at every instant of time, e.g. lie detectors,
electrocardiograms. We denote this using observation X at time t, X(t).
2. Discrete, where we have an observation at (usually regularly) spaced intervals. We
denote this as Xt.
Examples
Economics - weekly share prices, monthly profits
Meteorology - daily rainfall, wind speed, temperature
Sociology - crime figures (number of arrests, etc), employment figures
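A discrete time series Xt is naturally represented as a value sequence indexed by time; here is a minimal pandas sketch with invented monthly profits:

    # A discrete time series X_t as a pandas Series (values are invented).
    import pandas as pd

    profits = pd.Series(
        [12.1, 13.4, 11.8, 14.2],
        index=pd.to_datetime(["2024-01-31", "2024-02-29", "2024-03-31", "2024-04-30"]),
    )
    print(profits)
    print(profits.diff())  # month-over-month change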
Sequence Data
Sequences are fundamental to modeling the three primary medium of human communication:
speech, handwriting and language. They are the primary data types in several sensor and
monitoring applications. Mining models for network intrusion detection view data as sequences
of TCP/IP packets. Text information extraction systems model the input text as a sequence of
words and delimiters. Customer data mining applications profile buying habits of customers as a
sequence of items purchased. In computational biology, DNA, RNA and protein data are all best
modeled as sequences.
A sequence is an ordered set of pairs (t1, x1) . . . (tn, xn) where ti denotes an ordered attribute like
time (ti-1 ≤ ti) and xi is an element value. The length n of sequences in a database is typically
variable. Often the first attribute is not explicitly specified and the order of the elements is
implicit in the position of the element; thus, a sequence x can be written as x1 . . . xn. The
elements of a sequence are allowed to be of many different types. When xi is a real number, we
get a time series. Examples of such sequences abound: stock prices over time, temperature
measurements obtained from a monitoring instrument in a plant, or day-to-day carbon monoxide
levels in the atmosphere. When xi is of a discrete or symbolic type we have a categorical sequence.
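The two cases can be made concrete with a short sketch (values invented):

    # A sequence as ordered (t, x) pairs: real-valued x gives a time series,
    # symbolic x gives a categorical sequence (illustrative).
    time_series = [(1, 20.5), (2, 21.0), (3, 19.8)]          # x real-valued
    categorical = [(1, "login"), (2, "search"), (3, "buy")]  # x symbolic
    print([x for _, x in categorical])  # order is implicit in position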
3.7 Data Quality
Data mining focuses on (1) the detection and correction of data quality problems, and (2) the use of
algorithms that can tolerate poor data quality. Data are of high quality "if they are fit for their
intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, data
are deemed of high quality if they correctly represent the real-world construct to which they
refer. Furthermore, apart from these definitions, as data volume increases, the question of
internal consistency within the data becomes paramount, regardless of fitness for use for any
external purpose; e.g. a person's age and birth date may conflict within different parts of a
database. These views can often be in disagreement, even about the same set of data used for
the same purpose.
Definitions are:
• Data quality: the processes and technologies involved in ensuring the conformance of
data values to business requirements and acceptance criteria.
• The quality exhibited by the data in relation to the portrayal of the actual scenario.
• The state of completeness, validity, consistency, timeliness and accuracy that makes data
appropriate for a specific use.
Data quality aspects include data size, complexity, sources, types and formats, as well as data
processing issues, techniques and measures. "We are drowning in data, but starving for
knowledge" (Jiawei Han).
Dirty data
What does dirty data mean?
• Incomplete data (missing attributes, missing attribute values, only aggregated data, etc.)
• Inconsistent data (different coding schemes and formats, impossible or out-of-range values)
• Noisy data (containing errors and typographical variations, outliers, inaccurate values)
Data quality is a perception or an assessment of data's fitness to serve its purpose in a given
context.
Aspects of data quality include:
• Accuracy
• Completeness
• Update status
• Relevance
• Consistency across data sources
• Reliability
• Appropriate presentation
• Accessibility
3.7.1 Measurement and data collection issues
Consider the statement "a person has a height of 2 meters, but weighs only 2 kg". This data is
inconsistent, so it is unrealistic to expect that data will be perfect.
Measurement error refers to any problem resulting from the measurement process; the
numerical difference between the measured value and the actual value is called the error. Data
collection error refers to problems such as omitting data objects or attribute values. Both kinds
of error can be random or systematic.
Noise and artifacts
Noise is the random component of a measurement error. It may involve the distortion of a value
or the addition of spurious objects. Data mining uses robust algorithms that produce acceptable
results even when noise is present.
Data errors may also be the result of a more deterministic phenomenon; such deterministic
distortions are called artifacts.
Precision, Bias, and Accuracy
The quality of the measurement process and the resulting data are measured by precision and
bias. Precision is the closeness of repeated measurements of the same quantity to one another,
while bias is a systematic variation of the measurements from the quantity being measured.
Accuracy, the closeness of measurements to the true value, reflects the overall degree of
measurement error in the data.
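A small worked example (values invented): repeatedly weighing a standard 1 g mass lets us estimate both quantities.

    # Precision and bias of repeated measurements of a known 1.000 g mass
    # (illustrative values).
    import statistics

    readings = [1.015, 1.012, 1.014, 1.013]
    true_value = 1.000
    bias = statistics.mean(readings) - true_value  # systematic error: ~0.0135
    precision = statistics.stdev(readings)         # spread of repeated readings
    print(bias, precision)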
Outliers
Outliers are data objects that have characteristics different from most other data objects, or
values of an attribute that are unusual with respect to the typical values for that attribute. Unlike
noise, outliers can be legitimate data objects or values that are themselves of interest.
Missing Values
It is not unusual for an object to be missing one or more attribute values. In some cases the
information was not collected properly. Examples: application forms, web page forms.
Strategies for dealing with missing data are as follows (each is sketched below):
• Eliminate data objects or attributes with missing values.
• Estimate the missing values.
• Ignore the missing values during analysis.
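Each of the three strategies above can be sketched with pandas (the data frame below is invented):

    # Three ways of handling missing values (illustrative data).
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [23, np.nan, 31], "income": [50_000, 62_000, np.nan]})

    dropped = df.dropna()             # 1. eliminate objects with missing values
    estimated = df.fillna(df.mean())  # 2. estimate (impute) the missing values
    mean_age = df["age"].mean()       # 3. ignore them: mean() skips NaN by default
    print(dropped, estimated, mean_age, sep="\n")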
Inconsistent values
Consider a locality like Kengeri, which has the PIN code 560060; if a user enters some other
value for this locality, the data contains an inconsistent value.
Duplicate data
Sometimes a data set contains the same object more than once; this is called duplicate data. To
detect and eliminate such duplicates, two main issues must be addressed: first, whether two
objects actually represent a single object; second, the values of their corresponding attributes
may differ and must be resolved.
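Detecting exact duplicates is straightforward in practice; here is a hedged pandas sketch with invented records:

    # Detecting and removing duplicate records (illustrative).
    import pandas as pd

    df = pd.DataFrame(
        {"name": ["Asha", "Asha", "Ravi"], "city": ["Kengeri", "Kengeri", "Mysuru"]}
    )
    print(df.duplicated())       # True for the repeated row
    print(df.drop_duplicates())  # keep one copy of each object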
Issues related to applications include the timeliness of the data, knowledge about the data, and
the relevance of the data.
