The importance of classification in science has already been remarked upon in
Chapter 6, where techniques were described for examining multivariate data for
the presence of relatively distinct groups or clusters of observations. In this
chapter a further aspect of the classification problem will be discussed, in
which the groups are known a priori, and the aim is to devise rules which can
allocate previously unknown objects or individuals into these groups in an
optimal fashion. In this situation the investigator has one set of multivariate
observations, the training sample, for which group membership is known with
certainty a priori, and a second set, the test sample, consisting of observations
for which group membership is unknown and which have to be assigned to
one of the known groups as accurately as possible. The information used in
deriving a suitable allocation rule is the variable values of the training
sample. Areas where this type of classification problem is of importance are
numerous and include the following.
• Medical diagnosis. Here the variables describing each patient might be the
results of various clinical tests, and the groups could be collections of patients
known to have different diseases.
• Archaeology. Here the aim might be to decide from which ancient civilisation
a pottery fragment originates, with the variables being particular measurements
on the artefacts.
• Speech recognition. Here the objects to be classified are usually waveforms,
and the variables a set of acoustical parameters extracted from the utterance
of a specific word by an individual.
An initial question that might be asked is, 'since the members of the training
sample can be classified with certainty, why not apply the procedure used for
their classification to the test sample?' Reasons are not difficult to identify. In
medicine, for example, it might be possible to diagnose a particular condition
with certainty only as a result of a post-mortem examination. Clearly, for
patients still alive and in need of treatment, a different diagnostic rule is
required.
In statistics the type of classification question described in this preamble is
usually referred to as a discrimination or assignment problem.
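The setting described in this preamble can be made concrete with a small sketch. The Python below is illustrative only: the data are made up, the names group_means and allocate are hypothetical, and the nearest-group-mean rule is simply one easy allocation rule chosen for the example, not necessarily a rule developed in this chapter. It derives the rule from the variable values of a training sample with known group membership and then applies it to a test sample whose membership is unknown.

```python
from statistics import mean


def group_means(training):
    """Compute the mean vector of each group from the training sample.

    training: list of (variable_values, group_label) pairs, where the
    group label is known a priori.
    """
    by_group = {}
    for x, label in training:
        by_group.setdefault(label, []).append(x)
    # Average each variable separately within each group.
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in by_group.items()
    }


def allocate(x, means):
    """Assign x to the group whose mean vector is nearest
    (squared Euclidean distance). Assumed rule, for illustration only."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(means, key=lambda label: sq_dist(x, means[label]))


# Made-up training sample: two variables per individual, group known.
training = [
    ([1.0, 2.0], "A"), ([1.5, 1.8], "A"), ([1.2, 2.2], "A"),
    ([4.0, 5.0], "B"), ([4.5, 4.8], "B"), ([3.8, 5.2], "B"),
]

# Made-up test sample: group membership unknown.
test = [[1.1, 2.1], [4.2, 5.1]]

means = group_means(training)
allocations = [allocate(x, means) for x in test]
print(allocations)  # ['A', 'B']
```

Only the training sample is used to build the rule; the test observations enter only when the rule is applied, which mirrors the division of roles described above.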