MonetDB has played a pivotal role in the development of data mining
applications at Data Distilleries (now SPSS).
During its development we used a small mock-up database
to study performance issues.
This database is included in the MonetDB source distribution directory scripts/gold
and provides an easy way to explore some of MonetDB's functionality
when it comes to processing MIL queries.
The steps below should help you understand some of the basics of
using MonetDB.
Please note that you need a source distribution to follow this
tutorial. Binary installations are not yet supported.
We assume your MonetDB sources are in $MONETDB_SRC and
you have configured and successfully installed MonetDB
in a directory $MONETDB_PREFIX.
Make sure $MONETDB_PREFIX/bin is in your PATH and type:
$ mkdir $MONETDB_PREFIX/var/MonetDB/dbfarm/gold
So, creating a database is nothing more than creating a directory in the
'dbfarm' directory (the default dbfarm is $MONETDB_PREFIX/var/MonetDB/dbfarm).
This manual step is only meant as an illustration: when the
environment variables are set, typing the following has
the same effect and prints a message that the database has been initialized:
$ Mserver --dbname=gold
# Monet Database Server V4.6.0
# Copyright (c) 1993-2005, CWI. All rights reserved.
# Compiled for i686-pc-linux-gnu/32bit; dynamically linked.
# Visit http://monetdb.cwi.nl/ for further information.
MonetDB>
Note that the actual output may differ slightly, depending on
the version downloaded and the system it runs on.
A different dbfarm location can be chosen and passed to
Mserver with the --dbfarm=<dir> command line switch.
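As an illustration of the point that a database is just a directory inside
a dbfarm, here is a minimal sketch of a shell helper that creates one in a
dbfarm of your choice. The function name make_db is hypothetical and not
part of MonetDB, and the sketch assumes that an empty directory is all the
server needs to find, as described above.

```shell
# Hypothetical helper, not part of MonetDB: create an empty database
# directory inside a given dbfarm and print its path.
# Usage: make_db <dbfarm-dir> <dbname>
make_db() {
    farm=$1
    name=$2
    # mkdir -p also creates the dbfarm itself if it does not exist yet
    mkdir -p "$farm/$name" && printf '%s\n' "$farm/$name"
}
```

With an assumed dbfarm location of $HOME/myfarm, this could be used as:
$ make_db "$HOME/myfarm" gold
$ Mserver --dbfarm="$HOME/myfarm" --dbname=gold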
A small MIL script is provided to quickly populate the database. It uses a compressed file with tuples to be inserted and a format file for the database loader.
$ cd $MONETDB_SRC/scripts/gold
$ export TSTTRGDIR=`pwd`
$ Mserver --dbname=gold < load.mil
The database is now loaded. A few more tables are constructed for subsequent processing. Don't disturb the output layout, because this script is also used every night as part of the intensive testing machinery. It suffices to note that we now have a skewed insurance database with 100K entries, half of which have claimed damage.
$ Mserver --dbname=gold < init.mil
With all pieces in place we can run an experiment. The queries illustrate what would be produced by an end-user mining for gold at a single button click. The file tst100.mil gives an impression of MIL, the proprietary query language of MonetDB. The timing for each data mining step is then summarized. This query sequence was originally generated by the Data Distilleries application front-end.
$ Mserver --dbname=gold tst100.mil > output
$ sh summary
A key to high speed in this application domain is the ability to deal with virtual OIDs. They require no storage space, yet can speed up processing significantly. To use the virtual OID features, first create a vectorized database with the void.mil script. Make sure that you have the 'str' and 'enum' modules installed!
$ Mserver --dbname=gold < void.mil
$ Mserver --dbname=gold tstvoid.mil > output
$ sh summary
It will run somewhat faster...
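If one of the commands in this tutorial fails because a script cannot be
found, it may help to check that the files named above are present in the
source tree. The following is a minimal sketch; missing_files is a
hypothetical helper, not part of MonetDB, and the list of files passed to it
is simply the set of names mentioned in this tutorial.

```shell
# Hypothetical helper, not part of MonetDB: print the name of every
# listed file that is absent from a directory. The exit status is
# non-zero if at least one file is missing.
# Usage: missing_files <dir> <file>...
missing_files() {
    dir=$1
    shift
    status=0
    for f in "$@"; do
        if [ ! -e "$dir/$f" ]; then
            printf '%s\n' "$f"
            status=1
        fi
    done
    return $status
}
```

For example:
$ missing_files "$MONETDB_SRC/scripts/gold" load.mil init.mil tst100.mil void.mil tstvoid.mil summary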