Dataset Generation Overview
This document describes the constraints available in DataGen 3.0 and
explains how the data sets are generated.
- Introduction
- Domain Representation
- Dataset Generation
- Reports
- Examples
- References
1) Introduction
DataGen creates data that conforms to all the constraints you define.
For tabular data you can, for example, state that the first column contains
integers from -10 to 10, the second column contains numbers from 0.0
to 1.0, and the third column contains one of five symbols {A,B,C,D,E}. This
example shows constraints of datatype, domain, and range. The three datatypes
in the example are ordinal, continuous, and nominal. The domain sizes
are 21, infinite, and 5, respectively. Finally, the domains are bounded by
minimum and maximum values for the numeric datatypes and by the listed
letters or words for the nominal column. Several other constraints can be
specified for individual columns, and constraints can also be defined
between columns.
For the example above, DataGen produces one row at a time by selecting a
value from the domain of each column. The selection can be constrained
(biased) in several ways. For non-continuous datatypes the selection can be
sequential: A,B,C,A,B,C,A, etc. For all datatypes the selection can be
random. Finally, the randomness can be skewed by a Gaussian function. On
numerical datatypes this has the effect of creating a normal distribution.
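The three selection modes can be illustrated with a minimal Python sketch. It is not DataGen's implementation: the function names are hypothetical, and the Gaussian variant assumes a mean at the centre of the domain and a standard deviation of one sixth of its width, with out-of-range draws simply redrawn.

```python
import random
from itertools import cycle, islice

def sequential(domain, n):
    """Cycle through the domain in order: A, B, C, A, B, C, ..."""
    return list(islice(cycle(domain), n))

def uniform_random(domain, n, rng):
    """Pick each value independently with equal probability."""
    return [rng.choice(domain) for _ in range(n)]

def gaussian_biased(domain, n, rng):
    """Pick values with a bell-shaped bias toward the middle of the domain.

    Assumption: mean at the centre of the domain, sigma = width / 6;
    draws that fall outside the domain are redrawn.
    """
    mid = (len(domain) - 1) / 2
    sigma = max(len(domain) / 6, 1e-9)
    out = []
    while len(out) < n:
        i = round(rng.gauss(mid, sigma))
        if 0 <= i < len(domain):
            out.append(domain[i])
    return out

rng = random.Random(0)
integers = list(range(-10, 11))      # ordinal column, 21 values
symbols = ["A", "B", "C", "D", "E"]  # nominal column, 5 values

print(sequential(symbols[:3], 7))    # ['A', 'B', 'C', 'A', 'B', 'C', 'A']
print(uniform_random(integers, 5, rng))
print(gaussian_biased(integers, 5, rng))
```

Redrawing out-of-range values keeps every result inside the declared domain, at the cost of slightly truncating the tails of the bell curve.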
2) Domain Representation
- Constraints on individual columns
- Constraints with the use of Classification Rules.
DataGen first generates a synthetic set of rules and then generates
a relation which abides by this rule base. Several characteristics
of the process can be customized. Rule bases may be in Conjunctive Normal
Form, and the quantity and complexity of the rules are easy to customize.
Data sets can also be customized, for example, to include some interesting
real-world characteristics such as irrelevant attributes, missing attributes,
noisy data, and missing values.
Real-world data sets are likely to have some implicit structure. This
structure is unknown to the agent, which tries to discover the portion of
this implicit knowledge that applies to the particular object it would
like to have classified.
Early machine learning techniques like ID3 [Qu86] and derivatives like
CN2 [CN89] and INFERULE [UR91] discover decision structures. Decision trees
can, for example, be used to classify fully defined objects.
Fig 1. Decision List
- IF pulse_rate=high THEN
  - IF pressure=high THEN
  - ELSE IF pressure=low THEN
- ELSE IF pulse_rate=low THEN
  - IF pressure=high THEN
  - ELSE IF pressure=low THEN
Another popular knowledge base representation is classification rules.
These rules correlate two terms, where the left-hand side is a sufficient
condition for the right-hand side concept. Below is a sample set of
classification rules.
Fig 2. Strictly Conjunctive Classification Rules
- IF pulse=(high OR mid_high) AND pressure=high THEN risk=high
- IF pulse=high AND pressure=low THEN risk=mid_high
- IF pulse=medium AND pressure=medium THEN risk=low
- IF pulse=(low OR very_low) AND pressure=low THEN risk=medium
DatGen uses the latter representation form to generate its data sets and
restricts the right-hand side to a single attribute-value. The representation
language could be extended to allow class descriptions of several
attribute-values.
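One way to picture this representation is the Python sketch below, which encodes the rules of Fig 2 and applies them to an event. The data layout (a dictionary of allowed value sets per attribute, plus a single attribute-value conclusion) and the first-match strategy are assumptions made for illustration, not DatGen's internal format.

```python
# Each condition maps an attribute to the set of values it may take
# (a disjunction); every listed attribute must match (a conjunction).
# The conclusion is a single attribute-value pair.
RULES = [
    ({"pulse": {"high", "mid_high"}, "pressure": {"high"}}, ("risk", "high")),
    ({"pulse": {"high"}, "pressure": {"low"}}, ("risk", "mid_high")),
    ({"pulse": {"medium"}, "pressure": {"medium"}}, ("risk", "low")),
    ({"pulse": {"low", "very_low"}, "pressure": {"low"}}, ("risk", "medium")),
]

def classify(event, rules=RULES):
    """Return the conclusion of the first rule whose condition holds, else None."""
    for condition, conclusion in rules:
        if all(event.get(attr) in allowed for attr, allowed in condition.items()):
            return conclusion
    return None

print(classify({"pulse": "mid_high", "pressure": "high"}))  # ('risk', 'high')
print(classify({"pulse": "medium", "pressure": "low"}))     # None
```

An event that no rule covers yields None, which reflects that the left-hand side is only a sufficient condition for the concept.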
Real-world domains have many different characteristics. Some domains
can be represented with a simple domain theory but may require hundreds
of rules. Other domains, on the other hand, may be represented by a few
complex rules.
The DatGen computer program integrates several key parameters to generate
strictly conjunctive classification rules which are then used to generate
synthetic data sets.
Several key parameters have been identified to model many different
types of domains. A more detailed document
discusses each parameter. Parameters exist to modify features of the
rule base, the data set, and disturbances to the data set. Some obvious
parameters include the number and complexity of the rules. Others include
the number of tuples to generate and the percentage of values that are
erroneously entered.
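The last of these parameters, the percentage of erroneously entered values, can be sketched as follows. This is an illustrative Python sketch only: the function name `add_noise`, the per-cell error model, and the 1..14 domain are assumptions, not a description of how DatGen applies noise.

```python
import random

def add_noise(rows, domain, error_rate, rng):
    """Replace roughly `error_rate` of the cells with a different domain value."""
    noisy = []
    for row in rows:
        new_row = []
        for value in row:
            if rng.random() < error_rate:
                # An erroneous entry: any domain value except the true one.
                value = rng.choice([v for v in domain if v != value])
            new_row.append(value)
        noisy.append(tuple(new_row))
    return noisy

rng = random.Random(42)
clean = [(10, 7, 3), (2, 13, 9), (10, 7, 5)]
print(add_noise(clean, range(1, 15), 0.10, rng))
```

With an error rate of 0 the rows come back unchanged, and with a rate of 1 every cell is replaced by a different value.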
3) Reports
DataGen can present its results in two formats: plain and verbose.
When a plain report is requested, DataGen returns the requested data
in tab-separated format. The verbose format, on the other hand, also
summarizes the settings it operated under to help in debugging the request.
Plain Report: Figure 3.a shows a synthetic
relation composed of 10 tuples. The first three columns are predicting
attributes, while the last column is the predicted attribute. The distribution
of the predicting attribute-values appears to be random, while the predicted
attribute takes two values (1, 2), which are almost evenly split.
Fig 3.a DatGen Dataset Report
A   B   C   Class
10  7   3   2
2   13  9   1
10  7   5   2
10  7   8   2
10  7   5   2
14  13  9   1
9   13  9   1
10  7   12  2
10  7   9   2
6   13  9   1
Notice that tuples which belong to class 2 (i.e. tuples with the
last column = 2) have the first column = 10 and the second column = 7.
If the predicting attributes are labeled with the alphabetical characters
A..Z, we may infer the rule: if A=10 and B=7 then 2. Similarly, for
class 1 we can infer the rule: if B=13 and C=9 then 1. If
a new event arrived with values (5, 13, 9), we would be confident
in classifying the event as 1. Figure 3.b shows the corresponding
rule set.
Fig 3.b DatGen Rule Report
A=(10) & B=(7) -> 2 (60%)
B=(13) & C=(9) -> 1 (40%)
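The percentages in the rule report can be checked against the data set with a short Python sketch. The data structures and function names are illustrative assumptions. The rows and rules are copied from Fig 3.a and Fig 3.b.

```python
ROWS = [
    (10, 7, 3, 2), (2, 13, 9, 1), (10, 7, 5, 2), (10, 7, 8, 2), (10, 7, 5, 2),
    (14, 13, 9, 1), (9, 13, 9, 1), (10, 7, 12, 2), (10, 7, 9, 2), (6, 13, 9, 1),
]

# The two rules of Fig 3.b as (condition, class) pairs.
RULES = [
    ({"A": {10}, "B": {7}}, 2),
    ({"B": {13}, "C": {9}}, 1),
]

def matches(condition, event):
    """True when every attribute named in the condition has an allowed value."""
    return all(event[attr] in allowed for attr, allowed in condition.items())

def coverage(rules, rows):
    """Percentage of rows that satisfy each rule and carry its class."""
    result = []
    for condition, cls in rules:
        hits = sum(
            1 for a, b, c, label in rows
            if label == cls and matches(condition, {"A": a, "B": b, "C": c})
        )
        result.append(100 * hits / len(rows))
    return result

def classify(event, rules):
    """Return the class of the first matching rule, or None."""
    for condition, cls in rules:
        if matches(condition, event):
            return cls
    return None

print(coverage(RULES, ROWS))                       # [60.0, 40.0]
print(classify({"A": 5, "B": 13, "C": 9}, RULES))  # 1
```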
Verbose Report: A verbose report adds information
that helps to visually understand how the data was generated. Figure 4
is a verbose report. The first section deals with the settings used to
generate the rule base and data.
Fig 4.1 VARIABLES
full: Randomness
rand: Rule distribution
10: Events
3: Rules
3: Relevant Attributes
0,1 : Avg. Conjunctions per rule
0,1 : Avg. Disjunctions per rule term
Randomness states whether the function should be performed in full
or pseudo mode.
Rule distribution defines whether rules from the generated rule base
should be invoked uniformly, randomly, or randomly with a standard
normal distribution bias. Figure 4.1 states that this example has a
random distribution.
Events determines the number of generated events. Ten events
will again be reported. The number of rules to be created in the knowledge
base is specified by the Rules variable. In this example, three synthetic
rules will be created and then used to generate synthetic tuples.
Relevant Attributes refers to the number of attributes which
should be generated (not including the Class attribute). In this example
there are 3 relevant attributes (A, B, C). Rule conditions will be composed
from these three attributes.
The final two variables in the figure determine the shape of the condition
part of the Conjunctive Normal Form (CNF) rules created.
The 0,1 conjunctions variable means that a random selection will be made
between rules with 0 conjunctions (i.e. a single term) and rules with
1 conjunction.
Finally, for each term a selection of either 0 or 1 disjunction will
be made.
The verbose TUPLES report has several columns. The first column
is always an incrementing counter, while the final column is always the
class membership.
Fig 4.2 TUPLES
Tuple  A   B  C  Class
1:     6   1  5  3
2:     3   1  5  3
3:     7   4  9  2
4:     7   9  1  2
5:     5   7  6  1
6:     7   4  3  2
7:     1   6  1  1
8:     5   1  5  3
9:     4   1  5  3
10:    10  1  5  3
The three generated rules are each composed of one or two terms. The
percentage by each rule indicates the share of tuples represented by that
rule. Rule 2, for example, relates to 30% of the tuples in the synthetic
relation. Unlike the plain report, this report was generated from three
CNF rules.
Fig 4.3 RULES
B=(1) & C=(5) -> 3 (50.0%)
A=(7) & B=(4,9) -> 2 (30.0%)
B=(6,7) -> 1 (20.0%)
Each of the ten events was created by randomly selecting a rule from the
rule base. Attributes which are not mentioned in a rule are filled with
a random value.
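The generation step described here can be sketched in Python as follows. The sketch assumes an attribute domain of 1..10 and a uniformly random choice of rule. The names `generate`, `ATTRIBUTES`, and `DOMAIN` are hypothetical, and the rules are those of Fig 4.3.

```python
import random

ATTRIBUTES = ["A", "B", "C"]
DOMAIN = range(1, 11)  # assumed attribute domain, 1..10

# Fig 4.3 rules: condition (attribute -> allowed values) and class.
RULES = [
    ({"B": [1], "C": [5]}, 3),
    ({"A": [7], "B": [4, 9]}, 2),
    ({"B": [6, 7]}, 1),
]

def generate(rules, n_events, rng):
    """Create n_events tuples, each produced by one randomly chosen rule."""
    tuples = []
    for _ in range(n_events):
        condition, cls = rng.choice(rules)
        row = []
        for attr in ATTRIBUTES:
            if attr in condition:
                # Attribute is constrained: pick one of the allowed values.
                row.append(rng.choice(condition[attr]))
            else:
                # Attribute is not mentioned in the rule: fill it at random.
                row.append(rng.choice(DOMAIN))
        tuples.append((*row, cls))
    return tuples

for i, row in enumerate(generate(RULES, 10, random.Random(7)), start=1):
    print(f"{i}:", *row)
```

The sketch does not check whether a randomly filled attribute happens to satisfy a different rule as well, so a real generator would need some policy for such overlaps.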
Program Detail
As described in the examples, the DatGen program first creates a number
of classification rules and then creates some tuples which abide by these
rules. The program is implemented as a POSIX ANSI C program. An interactive
WWW interface to the program is also available.
4) Examples
- Create 101 rows of random months (1 to 12) and years (1966 to 1999)
  DatGen -O101 -X1,12;1966,1999
  - -O101 requests 101 rows
  - -X1,12;1966,1999 requests two columns, the first ranging from 1 to 12
    and the second ranging from 1966 to 1999.
- missing values
- Create 1,001 rows of random months (1 to 12) and years (1966 to 1999)
  DatGen -O1001 -X1,12;1966,1999
  - -O1001 requests 1,001 rows
  - -X1,12;1966,1999 requests two columns, the first ranging from 1 to 12
    and the second ranging from 1966 to 1999.
- Create 99,999 rows with a binary column ('a','b') and four ordinal attributes.
  Create two rules that predict each of 'a' and 'b' based on the values of
  three columns. Hide one of the predictive columns.
  DatGen -O99,999 -R2 -A3 -I1 -D10 -M1
  - -O99,999 requests 99,999 rows
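The meaning of the -O and -X options in the examples above can be mimicked with a small Python sketch. It is not DatGen's parser: the function names are hypothetical, and the sketch assumes integer columns with inclusive bounds and uniformly random selection.

```python
import random

def parse_ranges(spec):
    """Parse a range list such as '1,12;1966,1999' into [(1, 12), (1966, 1999)]."""
    ranges = []
    for part in spec.split(";"):
        low, high = (int(x) for x in part.split(","))
        ranges.append((low, high))
    return ranges

def generate_rows(n_rows, spec, rng):
    """Return n_rows tuples with one random integer per range, bounds inclusive."""
    ranges = parse_ranges(spec)
    return [tuple(rng.randint(lo, hi) for lo, hi in ranges) for _ in range(n_rows)]

# Counterpart of: DatGen -O101 -X1,12;1966,1999
rows = generate_rows(101, "1,12;1966,1999", random.Random(0))
for month, year in rows[:3]:
    print(month, year, sep="\t")
```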
5) References
- [CN89] Clark, P. and Niblett, T. 1989.
The CN2 Induction Algorithm.
Machine Learning 3: 261-283.
- [CF88] Cheng, J., Fayyad, U. M., Irani, K. B., and Qian, Z. 1988.
Improved Decision Trees: A Generalized Version of ID3.
In Proceedings of the Fifth International Conference on Machine Learning.
Morgan Kaufmann, San Mateo, California. pages 100-107.
- [JG93] Gray, Jim.
The Benchmark Handbook For Database and Transaction Processing
Systems, Second Edition
- [HS94] Holsheimer, M. and Siebes, A. 1994.
Data Mining: The Search for Knowledge in Databases.
CWI Report CS-R9406. Amsterdam, The Netherlands.
- [KZ94] Kloesgen, W. and Zytkow, J. 1994.
Machine Discovery Terminology.
http://orgwis.gmd.de/explora/terms.html
- [Mi83] Michalski, R. S. 1983.
A Theory and Methodology of Inductive Learning.
In Michalski et al. [MC83], pages 83-134.
- [MC83] Michalski, R. S., Carbonell, J. G., and Mitchell, T. M. 1983. editors,
Machine Learning, an Artificial Intelligence approach, volume 1.
Morgan Kaufmann, San Mateo, California.
- [PF91] Piatetsky-Shapiro, G and Frawley, W. J. 1991. editors.
Knowledge Discovery in Databases.
AAAI Press, Menlo Park, California.
- [PB90] Porter, B. W., Bareiss, R. and Holte, R. C. 1990.
Concept Learning and Heuristic Classification in Weak-Theory Domains.
In Jude W. Shavlik and Thomas G. Dietterich, editors,
Readings in Machine Learning.
Morgan Kaufmann, San Mateo, California. pages 710-746.
- [Qu86] Quinlan, J. R. 1986.
Induction of Decision Trees.
Machine Learning 1: 81-106.
- [SG91] Smyth, P. and Goodman, R. M. 1991.
Rule Induction Using Information Theory.
In Piatetsky-Shapiro and Frawley [PF91], pages 159 - 176.
- [SS95] Murthy, Sreerama and Salzberg, Steven. 1995.
Decision Tree Induction: How Effective is the Greedy Heuristic?.
First International Conference on Knowledge Discovery and Data Mining, pages ### - ###. Montreal.
- [UR91] Uthurusamy, R., Fayyad, U. M., and Spangler, S. 1991.
Learning Useful Rules from Inconclusive Data.
In Piatetsky-Shapiro and Frawley [PF91], pages 141 - 158.
Home Page: http://www.datasetgenerator.com
Comments: Gabor Melli
last updated 05.11.04