Abstract
Data mining (DM) is the process of finding patterns and relationships in databases.
The breakthrough in computer technologies triggered a massive growth in data
collected and maintained by organisations. In many applications, these data arrive
continuously in large volumes as a sequence of instances known as a data stream.
Mining these data is known as stream data mining. Due to the large amount of data
arriving in a data stream, each record is normally expected to be processed only
once. Moreover, this process can be carried out on different sites in the organisation
simultaneously making the problem distributed in nature. Distributed stream data
mining poses many challenges to the data mining community including scalability
and coping with changes in the underlying concept over time.
In this thesis, the author hypothesizes that learning classifier systems (LCSs) - a
class of classification algorithms - have the potential to work efficiently in distributed
stream data mining. LCSs are an incremental learner, and being evolutionary
based they are inherently adaptive. However, they suffer from two main drawbacks
that hinder their use as fast data mining algorithms. First, they require a large
population size, which slows down the processing of arriving instances. Second,
they require a large number of parameter settings, some of them are very sensitive
to the nature of the learning problem. As a result, it becomes difficult to choose a
right setup for totally unknown problems.
The aim of this thesis is to attack these two problems in LCS, with a specific focus
on UCS - a supervised evolutionary learning classifier system. UCS is chosen as it
has been tested extensively on classification tasks and it is the supervised version
of XCS, a state of the art LCS.
In this thesis, the architectural design for a distributed stream data mining system
will be first introduced. The problems that UCS should face in a distributed data
stream task are confirmed through a large number of experiments with UCS and
the proposed architectural design.
To overcome the problem of large population sizes, the idea of using a Neural
Network to represent the action in UCS is proposed. This new system - called NLCS
{ was validated experimentally using a small fixed population size and has shown
a large reduction in the population size needed to learn the underlying concept in
the data.
An adaptive version of NLCS called ANCS is then introduced. The adaptive version
dynamically controls the population size of NLCS. A comprehensive analysis of the
behaviour of ANCS revealed interesting patterns in the behaviour of the parameters,
which motivated an ensemble version of the algorithm with 9 nodes, each using a
different parameter setting. In total they cover all patterns of behaviour noticed in
the system. A voting gate is used for the ensemble. The resultant ensemble does
not require any parameter setting, and showed better performance on all datasets
tested.
The thesis concludes with testing the ANCS system in the architectural design for
distributed environments proposed earlier.
The contributions of the thesis are: (1) reducing the UCS population size by an
order of magnitude using a neural representation; (2) introducing a mechanism
for adapting the population size; (3) proposing an ensemble method that does not
require parameter setting; and primarily (4) showing that the proposed LCS can
work efficiently for distributed stream data mining tasks.