Case-based reasoning depends heavily on assessing the similarity between cases. Similarity assessment is the task of determining which cases are similar to each other and which are dissimilar. In general, defining distance functions is a difficult and tedious task.
There are several uses of distance functions in case-based reasoning all of which can benefit from distance function learning. In case-base classification, cases belonging to the same class should have a lower distance than cases in different classes. A good distance function is one that leads to a highly accurate classifier. In case base maintenance, the case base is reorganized into clusters so that search time is reduced by consulting the cluster's representative first. Cases far from the representative of the cluster can be removed to improve search time. A good distance function is one that leads to tight and cohesive clusters, effectively organizing the case base. In case-based retrieval, several cases relevant to a user's query are returned and users send feedback to improve retrieval. A good distance function is one that retrieves a variety of relevant cases for a user's query. Each of these applications places different requirements on the distance function, which makes manually tuning the distance function difficult.
Due to the fact that distance functions are difficult to design for different applications, it is worthwhile to look for approaches that learn distance functions from the case base. In particular this chapter will discuss methods in the context of classification, where the goal of supervised similarity assessment is to obtain distance functions that separate cases belonging to different classes well. Consequently, the scope of the methods discussed in this chapter is limited to case bases that have class labels. Class labels either represent natural classes or capture information about the relevancy of a particular case for a particular use. Different methods are needed for unsupervised learning tasks, such as clustering. The class labels associated with different cases can be viewed as feedback that, as we will see, is instrumental for tailoring distance functions for a particular classification task.
Figure 1.1 illustrates what distance function learning for supervised
similarity assessment is trying to accomplish; it depicts the distances of 13 cases,
5 of which belong to a class that is identified by a square and 8 belong to a different
class that is identified by a circle. When using the initial distance function
we do not observe much clustering with respect to the two classes. Starting from
this distance function, we would like to obtain a better distance function
so that the cases belonging to the same class are clustered together. In Figure 1.1
we can identify 3 clusters with respect to
, 2 containing only circles
and one containing only squares. Why is it beneficial to find such a distance function
? Most importantly, using the learned distance function in conjunction
with a
-nearest neighbor classifier [10] allows us to obtain a classifier
with high predictive accuracy. For example, if we use a 3-nearest neighbor classifier
with
it will have 100% accuracy with respect to leave-one-out cross-validation,
whereas several cases are misclassified if
is used. The second advantage
is that looking at
itself will tell us which features are important for
the particular classification problem.
It is also important to understand that there are no universal distance functions for a given case base. The usefulness of a distance function is determined in the context of the task it is used for. Consequently, a distance function that does a good job in separating cancer patients from healthy patients is unlikely to be useful in separating diabetes patients from healthy patients.
This chapter surveys how distance function learning is performed in three separate areas of case-based reasoning: classification, maintenance, and retrieval. Section 1.2 reviews basic properties of distance functions and introduces a supervised distance function learning framework. Section 1.3 discusses the Relief algorithm used in case-based classification [30]. Section 1.4 discusses the inside-outside weight adjustment algorithm used as a supervised method for case-base maintenance [17]. Section 1.5 discusses a relevance feedback algorithm used in case-based retrieval [37]. In Section 1.6 we show how, on two example case bases, to visualize a distance function before and after learning. Section 1.7 surveys recent work in distance function learning. Section 1.8 gives a brief summary.
A case
consists of a vector of features and a class label (target value). We
refer to
as one case in a space of possible cases
, as
and
as the class to which
belongs.
In general, the case need not be a vector. We define a distance function over a set
of cases
. A distance function
should
satisfy the following two conditions for
:
Condition 1.2 implies that two cases with distance
are equivalent.
Duplicate cases occur in practice and, either because of rounding or preprocessing,
they may have a distance of
even if the cases are not the same. A simple application
of the Condition 1.3 is the saying that the shortest distance between
two points is a straight line. This is because the Euclidean distance is a metric
in
.
A metric depends only on the definition of the distance function and not the composition
of the set. In practice, however, the set
is the vector space
where each
is a vector of length
:
The first condition is known as the triangle inequality. Given a norm
over
, the following distance function is a metric:
We define a family of norms and distance functions.
For each
norm, there is a corresponding
distance function metric.
This family of distance functions is also known as the Minkowski distance. There
are three common
values, which correspond to the following three distance functions:

In distance function learning, the objective is to find parameter values for a parameterized distance function. The most common approach to obtain a parameterized distance function is feature weighting. Feature weights are the output of the distance function learning algorithms discussed here. Weighted distance functions are a subclass of the more general linear distance functions.
Feature weights, one for each attribute, emphasize one subset of features over the
others. For the
distance, the weighted version is:

Linear distance functions generalize the
distance by projecting the original
distance vector onto a linear basis
:

For small to medium sized case bases, it is often more efficient to compute and store
the distances between all pairs of cases. This forms a distance matrix for the case
base
and distance function
.
For distance function learning, we often decompose the distance matrix into the set
of matrices
for
, storing the distance with respect to the
th feature. A weighted distance matrix for the
distance can be re-created
as:
for feature weights
. For a metric distance
function,
is a symmetric matrix with zeros along the diagonal.
In case-based classification, each case has a class label. The objective of a classification algorithm is to determine the class for an incoming case. In classification, an ideal distance function should make cases close to other cases in the same class, while making them far from cases belonging to a different class. Most distance function learning algorithms for classification find parameters for a parameterized distance function and then apply an existing distance-based classifier.
A classifier that uses distance functions is the
-nearest neighbor (
-NN)
algorithm [10]. In pattern classification, we consider a case base
on which learning is performed. The subset
is further divided into a training
set
and a testing set
such that:
and
. Given a distance function
and a training set of cases
, the 1-NN algorithm assigns the class label to a new case in the testing set
as follows based on its nearest neighbor in the training set:

Two feature evaluation methods are common in the literature. Feature selection algorithms finds a subset of features that more compactly represent the case base. Feature evaluation algorithms assigns a score to each feature that indicates the degree to which it is useful for classification. The higher the score, the more useful a feature is. Thus, feature selection algorithms are a specialization of feature evaluation methods, in which the score is either 0 or 1. In high-dimensional case bases, such as images or signals, feature evaluation methods are used to remove features that have a low score. This sometimes improves classification accuracy and reduces computation time.
We use feature evaluation algorithms to compute weights in a distance function. We
make the weights proportional to the score. Let
be the scores
for each of the
features. We compute the weight vector
as:
The Relief algorithm was originally designed as a feature evaluation algorithm, but it is commonly used to compute weights for distance functions [28]. The weight of a feature corresponds to how well it separates cases from different classes while not separating cases from the same class. Weights are updated for each case in the case base.
Algorithm 1 shows the steps in a single
iteration of the Relief-F algorithm, which is an extension of the original Relief
algorithm designed to handle multiple classes [30]. From
the training set
, a random sample (
) of size
is selected. The weight of each feature,
, is updated for each case
in
. The update considers the nearest neighbor of
from each
class using the last weight vector,
. The neighbor in the same class as
,
, is called the hit case. Conversely, the nearest neighbor in each class
,
, is called a miss case. The update decreases the
feature weight to bring
closer to the feature value of the hit and increases
the weight to move
away from the feature value of the misses. This weight
update procedure is repeated for the new weight vector
until the weights
converge.
Relief does not have an explicit objective function for the weights. Weights are
updated based on their prediction error, which makes this an iterative hill-climbing
approach. By updating weights for individual cases, Relief utilizes local information.
Many other feature evaluation methods consider only global information such as the
information gain (entropy with respect to class) and the
statistic (variance
in the feature values) [16]. Local information is particularly
useful in distance function learning.
In case-based reasoning, the case base often contains irrelevant or near-duplicate cases, which degrades performance. Case base maintenance methods organize the case base by removing such cases. This is typically accomplished using clustering, where the case base is partitioned into a set of clusters. Next, a prototype of each cluster is selected and becomes the representative of the cluster. Cases that have high distance from the prototype can be removed or at least assigned a low weight. Distance function learning is often used for case base maintenance. This creates a customized case base for the distance function.
If the case base has class labels, we seek a distance function such that the resulting clusters are more pure. Purity is defined as the average fraction of cases in a cluster that belong to a majority class. When the purity approaches 1.0, all cases in all clusters belong to a single class. The algorithm we discuss in this section, inside-outside weight updating, relies on two ideas, as depicted in Figure 1.2. First, it uses clustering to evaluate the weighted distance function shown in Equation 1.5. Second, it uses the local information from clusters to update the weights. We refer to this as a co-evolving approach because the clustering and the distance function are learned together.
After the cases have been clustered, the purity of the obtained clusters is computed and is used to assess the quality of the current distance function. Any clustering algorithm can be used for this purpose; however, supervised clustering algorithms [18] that maximize cluster purity while keeping the number of clusters low are particularly useful for supervised distance function evaluation. More details on how clustering is exactly used for distance function evaluation are given in [17].
Inside-outside weight updating uses the average distance between the majority class
members
of a cluster and the average distance between all members belonging to a cluster
for the purpose of weight adjustment. More formally, let:
After a clustering
has been obtained with respect to a distance function the weights of the distance
function are adjusted using Formula 1.8 iterating over the obtained clusters
and the given set of attributes. It should also be noted that no weight adjustment
is performed for clusters that are pure or for clusters that only contain a single
case.
The weight adjustment formula 1.8 gives each cluster the same degree of importance
when modifying the weights. Suppose we had two clusters, one with 10 majority cases
and 5 minority cases, and the other with 20 majority and 10 minority cases. With
both clusters having identical average distances and average majority class distances
with respect to a feature, the weight would have identical increases (decreases)
for the two clusters. This somehow violates common sense; more efforts should be
allocated to remove 10 minority cases from a cluster of size 30, than to removing
5 members of a cluster that only contains 15 cases. Therefore, we add a factor
to the weight adjustment heuristic that makes weight adjustment somewhat proportional
to the number of minority cases in a cluster. Our weight adjustment formula therefore
becomes:
This update rule is similar to that of the Relief algorithm as discussed in Section
1.3. Like this method, Relief updates the weight to
bring same-class cases closer together and different-class cases further apart. Unlike
this method, Relief uses simple nearest-neighbor queries instead of whole-cluster
information. It does not take advantage of information from both the class labels
or the result of the clustering. For example, Equation 1.7 puts more
emphasis on the cases in the same class compared to different classes. In the context
of clusters, this expression is similar to the
term from Equation 1.9.
The case distance matrix
is next computed using:

![\begin{eqnarray*}
D & = & \left[\begin{array}{cccccc}
0 & 1.57 & 1.89 & 1.92 & 3...
....58 & 2.21\\
& & & & 0 & 2.58\\
& & & & & 0\end{array}\right]\end{eqnarray*}](img116.png)
The average inter-case distances have changed to:
,
. As
we can see, the cases belonging to the majority class have moved closer to each other
(the average majority class case distance dropped by 0.14 from 1.78), whereas the
average distances of all cases belonging to the cluster increased very slightly,
which implies that the distances involving majority cases (involving cases 1, 2,
and 3 in this case) have decreased, as intended.
In case-based retrieval, cases are retrieved and ranked according to their distance to a query. Unlike classification and maintenance methods, the cases are assumed not to have a class label. Offline training of a distance function is therefore not possible. A user's query, at runtime, partitions the set of cases into two classes: relevant and non-relevant. Distance function learning uses this feedback to improve the recall and precision of subsequent retrievals. Since a new distance function is learned for each query, the learning process must be very fast, even if not very accurate.
Learning a distance function in response to a user's feedback is known as relevance
feedback. One of the earliest relevance feedback algorithms was designed for text
retrieval [37]. As in other fields, a case
is represented
as a vector of word features:

Although cases are represented by feature vectors much like the other fields we have discussed, the values for the features have a special meaning. This permits several assumptions in these case bases that are otherwise not justified for the other domains we have considered. The most common form is the term-frequency inverse case frequency (TFIDF) where each term has the following form:

The most common distance function learning algorithm for case-based retrieval using relevance feedback relies on some assumptions about relevant cases and the words they contain. An analysis of the distribution of words in text cases has shown that [39]:
Since we assume that relevant cases are characterized by the consistent presence of relevant words, distance function learning methods weight words by their consistency within the set of relevant cases. A common measure for inconsistency is the standard deviation of the word frequencies in the relevant cases. The weights are inversely proportional to the standard deviation:
Despite the apparent simplicity of the weight update strategy, it has been shown to be effective for text retrieval [8,37,39]. The main reason for its good performance is that it has access to the relevant cases, rather than having to predict class labels. In pattern classification and clustering, the relevance of a new case (class label) is not known at runtime, the classification process is not interactive. Information retrieval is an interactive process in which the task is slightly different than classification. Instead of asking ``What is the best class of this case?'' the question is ``Given the relevant cases, what other cases are similar to these?'' By providing cases of relevant cases interactively, the user provides significantly more information to the retriever than is available to the classifier. Since distance functions are individualized and different from each other, each distance function determines what is relevant for a particular user but not what is relevant to all users.
Evaluating the performance of these algorithms is difficult in practice. The retrieval process is evaluated in terms of precision (percent of the returned cases that were relevant) and recall (percent of relevant cases that were returned). Ideally, both precision and recall should be high. For relevance feedback algorithms, users are typically not available during experiments. A common practice is to simulate a user's feedback by assigning a class label to each case. Cases in the same class as the query are then considered relevant.
![\begin{eqnarray*}
t & = & \left[\begin{array}{ccccccc}
\mbox{bush} & \mbox{clint...
...0 & 0.8 & 0\\
0.6 & 0 & 0.01 & 0 & 0 & 0.9 & 0\end{array}\right]\end{eqnarray*}](img132.png)
This assigns the highest weights to ``President'' and ``Bush''. The weight for ``Bush'' is less than ``President'' because ``Bush'' appears inconsistently across the relevant cases, in case 5 but not 4. The word ``President'', however, appears in both cases. The same query with the learned weights returns the cases: 5, 4, 2, 1, 3 in that order. After relevance feedback, the top two cases were those marked as relevant.
Because distance functions are embedded in different algorithms (classification, clustering, and retrieval), it can be difficult to evaluate the performance of a distance function learning algorithm. Both to provide an intuitive understanding of distance functions and as a qualitative evaluation, we discuss several visualization methods for distance functions. They show how learning a distance function changes the spatial relationship among the cases, and try to visualize how good distance functions can be distinguished from bad ones.
A Voronoi diagram partitions cases in a case base into mutually exclusive regions
called Voronoi cells (See [33] for more details). Each cell
contains exactly one case
from the case base
. The cell is defined
such that the case
is the nearest case in the case base to all cases in the
cell. Cases along the border are equally distance to the cases in the cells.
The weights change the shape of the cells in the Voronoi diagram. Figure 1.3
shows two Voronoi diagrams of a case base with two classes. On the left the cells
appear to be wider but compressed vertically. This is because our original case base
is skewed along one dimension. We change the weights to place more ``emphasis''
on the
-axis such that its weight is larger than that of the
-axis. As a
result, the cells on the right diagram are stretched along the vertical dimension
and compressed along the horizontal axis. As the weight for
increases towards
1, the cells become vertical line segments.
In the 2-D case, we saw that a slight change of the weights significantly changed the neighborhood of the cases. In the Voronoi diagram, this is easily seen as changing the shape of the Voronoi cells. In higher dimensions, however, computing the Voronoi diagram is computationally expensive.
Multi-dimensional scaling (MDS) (See [6] for more details) finds
a
-dimension representation for
-dimension cases, where
. For
visualization, we assume that
. A key benefit of MDS for visualization is that
it preserves the distance between cases such that
Most MDS algorithm operate only on the distance matrix of the cases. In classical multi-dimensional scaling, the projection is a decomposition of the distance matrix. Since the purpose of distance function learning is to change the distance function and thus the distance matrix, cases change their spatial relationship as the weights change. For many MDS algorithms, the change in position is so drastic that it is hard to compare the two figures before and after changing the weights.
Although MDS can give us a nice visualization of the cases in 2-D, much visual information is lost when moving from one set of weights to another. This is because the cases tend to move in seemingly unpredictable ways. The distance matrix image does not suffer from these problems. The main disadvantage, however, is that it can often be hard to see small intensity changes.
To form the distance matrix image, each entry becomes a block of pixels in an image.
The color of each block is proportional to the distance. A common color map is a
linear gradient from black to white. In this color map, the brighter the color, the
larger the distance. Each row
in the image corresponds to
for all
columns
. As the weights change in the distance function, the color of the pixels
will change.
To facilitate the qualitative meaning of the image, adjacent rows should correspond to cases in the same class. Grouping by class allows us to observe changes in the structural properties of the distance function. Blocks along the diagonal represent the same classes. Off-diagonal blocks represent distance between pairs of classes, and classes that are well separated will have bright colors for these blocks. Cohesive classes in which the cases are all close to each other will have dark colors in the blocks along the diagonal. We typically assume that classes are cohesive, but this may not be true in general.
We show two example case bases before and after changing the weights. For the 2-D case base, we demonstrate all three visualization methods. For the 9-D case base, we demonstrate only the last two methods. In both cases, the objective of learning the objective function is to improve classification performance. The objective function should make cases in the same class closer and cases in different classes further apart.
Figure 1.3 shows about 20 cases each from two 2-D Gaussian distributions
with means
,
and covariance
. In Figures 1.3, 1.4, and 1.5, the weights
are

Figure 1.3 shows Voronoi diagrams for the cases with equal and unequal weights. The diagram with unequal weights has been stretched vertically but compressed along the horizontal axis. It may be unclear why the cells are taller when the vertical weight is smaller. This is because a smaller weight decreases distance along that dimension. As a result, more cases are closer to the cases in the cells along that dimension. In contrast, horizontal distance has increased leaving fewer cases close in the cells.
The Voronoi diagram indicates that the distance function meets our objectives. The
separation between
and
is greatest along the horizontal component
because
. The weights increase the horizontal distance
causing the means to be further apart. The result is that the distance function can
potentially improve classification performance.
![]() ![]()
|
Figure 1.4 shows the result of multi-dimensional scaling with equal weights and unequal weights. Although the original case base was 2-D, with unequal weights, the position of the cases and thus their distance has changed significantly compared to the distance with equal weights.
The figures show that with unequal weights, the grouping of the cases improves. As a whole, the classes appear to be further separated from each other in Figure 1.4 (right). Within each class, the cases appear to be closer together.
![]()
![]()
|
Figure 1.5 shows the distance matrix image for the cases with equal and unequal weights. Since the cases are grouped by class, the lower-left and upper-right quadrants of the figure denote cases in the same class. With unequal weights, these regions are generally darker. This means that the distance between cases in the same class has decreased overall. The lower-right and upper-left quadrants of the figure denote the distance between cases in different classes. With unequal weights, these quadrants are generally brighter. This means that the distance between cases in different classes has increased overall.
![]() ![]()
|
With a 2-D case base, the Voronoi diagrams were illustrative. Unfortunately, for more than 3 dimensions the Voronoi diagram is expensive to compute and difficult to display. For higher-dimensional case bases we use only the multi-dimensional scaling and the distance matrix images.
The case base has 9 features with 214 cases from 7 classes [4]. We demonstrate the effect of weights with two weight vectors:

Figure 1.6 shows the multi-dimensional scaling of the case base with equal weights and with Relief weights. We see that the Relief algorithm works well on this case base. Unlike the 2-D case, the cases have shifted significantly with the different weights. In general, the cases are further apart with the Relief weights because the scale of the figures is different. Cases in the same class are grouped together, which is desirable in classification problems. The clusters are better separated from clusters of cases belonging to another class. In general, the figure shows that cases are more tightly grouped within the same class and these groups are better separated from each other.
![]()
![]()
|
Figure 1.7 shows images of the distance matrix with equal weights and Relief weights. These figures show that the Relief weights increase the distance between cases of different classes but decrease the distance between cases of the same class. The enlarged portions of the figure highlight the difference between the distance functions. With equal weights, the distance between cases in the same class decreases, causing the image to appear darker. With Relief weights, the distance between these cases and those of different classes increases, causing the image to appear brighter. Among the rest of the cases, the blocks along the diagonal appear darker with Relief weights. This means that cases in the same class are closer to each other with Relief weights. The off-diagonal blocks, particularly on the top and right, are brighter with Relief weights.
The distance matrix images also reveal some distance structure across the classes. If the classes were separated well, most of the distance matrix would be brightly colored. Only the regions along the diagonal would be dark. Since the majority of the distance values are small, this means that the three classes in the upper-right are, as a whole, very far from the rest of the cases. The Relief weights clarify this fact.
![]()
![]()
|
In this section, we survey other distance function learning work in case-based reasoning and machine learning. The work is grouped by the method and objective. Our survey is broad in scope and touches several different fields of research, each placing different requirements on distance functions. Although we do not claim that our survey is complete, we hope it serves as a starting point for newcomers to the field.
Statistical methods make use of either simple statistical models or models of the probability distribution. The model is used to derive weights and usually depends on distributional assumptions about the case base.
The correlation
between two random variables
and
is defined
as:
Correlation is used for both selecting features and assigning weights. Yu and Liu
select features that are highly correlated with the class labels but not correlated
with each other [51]. This reduces redundant features because features
that are highly correlated may be redundant. Doulamis and Doulamis set feature weights
proportional to the correlation between the features values and a user-defined relevance
measures, as in information retrieval [15]. Pan et al.
find a highly correlated subset of features of a high-dimensional projection of the
case base [34]. Projecting features into a high-dimensional space could
introduce many irrelevant or redundant features. To eliminate redundant features,
we compute a kernel matrix and perform an eigenvalue decomposition as in kernel PCA.
The
largest eigenvalues correspond to the most important features. Weights for
this subset of features are proportional to the correlation coefficient with respect
to the target.
As we discussed in Section 1.5, many relevance feedback methods use the variance of features in the case base because it is easy to compute. It is especially useful in interactive case-based systems, because users can specify which cases are relevant. Kohlmaier et al. assign weights to a feature based on the degree to which it increases variance in the computed similarity values [29]. The variance in the similarity function is a good indicator of different cases. In addition to providing the single best response, a good case result should return a variety of different, but related, cases. We consider adding a single feature to the set of features. If this new feature increases the variance of the similarity function, it is believed to be a good starting point for obtaining good feedback from the users. If the variance is low with this feature, most of the cases are similar based on this feature, so it may not help the user find a good solution. The variance of similarity values can be used either alone or combined with weights. Results show that the method performs better than weights alone on several case-based retrieval tasks [40].
Expectation maximization (EM) approaches iterative solve an optimization problem where the objective is to maximize the likelihood of the case base given the new model. In distance function learning, the model is the set of weights. The likelihood is the probability of finding the existing case base given the current weights. The EM algorithm has two basic steps. In the maximization step, we find those weights that best explain the case base. In the expectation step, we recompute the likelihood function, which changes when the weights change. The EM algorithm maximizes weights, computes the expected case base given the weights and then repeats until the weights converge.
These methods are common in classification and clustering. In the maximum likelihood approach, we seek a set of weights that best explains the case base. If we have a set of pair-wise constraints, such as knowing that that two cases should not be considered similar, we can find weights that maximize the likelihood of separating cases [50]. Given the weights, we determine the degree to which the pair-wise constraints are violated and update our likelihood function. We can again find weights that maximize the likelihood. Huang et al. apply another EM-type algorithm to find weights that induce a good clustering, a maximum-likelihood clustering. Each cluster is modeled by a multivariate Gaussian distribution. Each case has a probability of being a member of each cluster. The weights are used to compute the membership probabilities for each case. The EM approach is to obtain a new clustering given the weights and then find the weights that improve the membership probabilities [24].
Often the original set of features is not adequate for the distance function. We would like to evaluate a feature, indicating its usefulness with respect to the class labels. When there are many features, some are irrelevant or redundant. Finding a suitable subset or subspace of the features these features can improve the computational efficiency. When the features do not adequately express the target value, we can construct new features that are more expressive.
In Section 1.3, we discussed the popular Relief method. Many other feature evaluation algorithms are commonly used to find weights for distance functions. Most of these were designed, like Relief, for classification problems and compare the class distribution considering the different values of nominal features. Information gain measures the difference in entropy with respect to the class labels [16]. The case base is split into partitions with respect to a particular feature such that all cases in the partition have the same value for the feature. The entropy with respect to the class labels is calculated in each partition. The information gain is the difference between the initial entropy, without the partitions, and the average entropy after the partitioning.
As with the Relief algorithm, the output of a feature evaluation method can be used for weights in a distance function. In classification, the information gain increases weights for features that are similar to the class label. Features that are distributed like the class label have high weights as do those features with many values [14]. Text features, particularly in information retrieval, lend themselves to this method. A relevant word would only be in the set of relevant cases, so we would expect it to have high information gain. If the word is not relevant the class labels would tend to be distributed similarly in each split, so the information gain would be small [49].
Extending the concept of feature weights to individual feature values is also common in case-based reasoning literature. Such a local weight could, for example, place more emphasis on the difference between the values ``red'' and ``blue'' than on the difference between ``red'' and ``pink''. Nunez et al. compared several entropy-based weighting methods for local weighting [32]. They showed that the local weight approach can be useful. However, having more weights to control makes the learning process more difficult. As a result, more complex search methods such as genetic algorithms have to be used [25].
As we have seen, feature selection seeks an optimal subset of the initial set of features. In dimension reduction, we also seek a reduced set of features, but we consider combinations of features. As shown in our visualization cases in Section 1.6, multi-dimensional scaling can reduce a high-dimensional case base to a 2-D case base that preserves clusters. There are three common methods for dimension reduction: principal component analysis, linear discriminant analysis, and multi-dimensional scaling.
The principal components of a case base are the directions (linear combination of features) that indicate the variance in the cases. They are the eigenvectors of the covariance matrix that correspond to the largest eigenvalues. The principal components become the new features. The covariance matrix allows to calculate the Mahalanobis distance, which projects the cases onto the inverse of the covariance matrix. The projected cases is ``corrected'' for the covariance such that an un-weighted distance function can be used. The objective of these distance function learning algorithms is to find the best projection. For example, the projection should map the cases into well-separated classes [5,19].
Rather than requiring a distance function to consist of good features, one can require that it lead to good classification results. Since predicting accuracy is computationally expensive, an alternative is to maximize the margin of separation between cases. Since the separation depends on the learned distance function, the problem is to find a distance that leads to the greatest separation between cases in different classes. An early algorithm for this purpose is Linear Discriminant Analysis (LDA). A local approach to LDA finds the discriminant among a small neighborhood of cases. The linear projection matrix that leads to the best discriminant is chosen for each case [22].
From a geometric perspective, these covariance approaches reshape the cases to improve separation. The dimensionality of the projected cases is the same as the original cases. As we have seen in Figures 1.4 and 1.6, projecting cases onto fewer dimensions can yield insight into the distances. Much work has shown that they are quantitatively effective as well. A desirable property of multi-dimensional scaling is that it should preserves distances in the projected space. Rather than preserving all the original distances, the projection can increase the distance between cases in different classes, leading to a better separation of classes in the lower-dimensional space [52].
Other projection methods utilize different aspects of relatedness such as knowledge of unwanted variance between cases of different classes [23] and the structure of local neighborhoods [20].
Often the original set of features for a case base are unable to represent the concept.
For example, if the true target value is a nonlinear function of the features such
as
, then it would be difficult for any weight vector to approach the
function. A common solution is to construct a new, larger set of features. The similarity
function and weights are then computed in this high-dimensional feature space. Examples
include the set of polynomials of degree 2, yielding features such as
.
This feature construction method can introduce significant redundancy among the features.
As a result, dimension reduction methods, like those discussed earlier can be used.
A specialized version of PCA, called kernel PCA, is ideally suited for this problem
[34]. As the original number of features increases, feature construction
like this tends to be very expensive, as
features have to be constructed.
Another technique is to add new, derived features. An example is to find a set of
weights for a subset of features and then use the weighted combination of feature
values as a new feature [36]. The new feature can either replace or
augment the existing set of features. It is particularly useful with image cases,
when the individual low-level features are are, by themselves, not good predictors
of the class.
In kernel-based machines, cases are represented as a vector of similarity values.
Each case consists of its similarity to all other cases in the case base. A kernel
function
defines the similarity between two cases
and
. The kernel matrix constructed from the original case base as follows. For
a case base of size
, the kernel matrix has dimension
. Each entry
is the value of the kernel for two cases
. Using
the so-called kernel trick, any algorithm that operates on case bases with a dot
product can be modified to use a kernel function. This kernelization allows many
existing algorithms to be extended with kernels. Techniques that can be kernelized
are the support vector machines, principal component analysis, and distance based
algorithms [3,41]. In kernel PCA, we compute a
reduced-dimension version of the kernel matrix. The weights for this new feature
vector indicate which cases contribute most to the overall covariance of the kernel
matrix. As with traditional PCA, the kernel version improves the performance of algorithms
when using the reduced set of features [34].
Recent work has established a relationship between kernels and distance functions [3,41]. In particular, the work shows that good distance functions make for good kernels [3]. As a result, many researchers have applied distance function learning methods, like those discussed in this chapter, to learn kernel functions. As in distance function learning, feature evaluation methods like Relief can be used to find weights for kernel-based machines [9]. Weights can be learned for a parameterized kernel function in much the same way they are learned for distance functions [26]. Recent work shows that good distance function make for good kernels [3].
A common approach to assess the importance of features is to apply an algorithm that generates weights as part of the model. The weights are then extracted from the model and used as weights for the distance function or to select features.
A popular classifier the computes weights is the Logistic regression algorithm. This
algorithm assumes that the probability of a case
belonging to a class
is:
Arshadi and Jurisica have applied logistic regression to case-based classification of micro-array cases [1]. Their objective was to select relevant features from a very high dimensional case base. The combine several classifiers in an ensemble. The classification approach is to retrieve several cases from the case base with the learned weights and then compute the majority class label. Their results show a significant improvement in accuracy. Wilson and Leake use this method to maintain the case base through clustering [48]. By clustering and then learning a set of weights, cases that are very distant from the prototype of the cluster are removed as irrelevant. Features with low weights are also removed as irrelevant. Their results showed that this lead to an improvement of classification accuracy.
Support vector machines also learn weights, classifying cases based on a separating hyperplane as follows:
The relevance feedback methods like those we have previously discussed use relevant cases as the basis for weights. Their objective is to find weights that best separate relevant from non-relevant cases. Applied both in text retrieval [8,37,39] and image retrieval [38], these methods are quite popular. We can also cluster the cases based on relevance and then find a distance function that separates the clusters [27]
Local search methods are popular for distance function learning. The most common of these are iterative methods such as hill-climbing. Here, an initial weight vector is updated until the objective function converges, eg. to the peak of a hill in the objective function. Updates are computed either by the gradient of the objective function or with heuristics. Relief, as we discussed in Section 1.3, is a local search method.
In the Relief algorithm, a weight is the degree to which a single feature can be used to predict the class label [28]. In practice this technique works well and has inspired several extensions. The most common extension, Relief-F, extends Relief from two classes to several classes, which allows for greater applicability and widespread use [30]. Rather than learning a single weight, we can find a pair of weights for each class: one for the nearest hit and miss cases [7]. Although widely used as a batch learning algorithm, a recent iterative version of Relief achieves comparable accuracy as an online learner and supports removing outliers [45].
Gradient-based hill climbing methods, known as gradient descent methods, are the most common form of hill climbing algorithm. For distance function learning, these methods define an objective function in terms of weights and then compute its gradient. A common approach used in case-based retrieval is to find weights that compute an optimal ranking of the cases. For example, given a user's feedback of known ranks for a set of cases, weights can be learned that match the ranks. The error function is simply the squared difference between the known ranks and predicted ranks. This is a continuous, differentiable function, which lends itself to gradient-based methods [31,43]. Coyle and Cunningham minimize the difference between a user's ranking of a set of retrieved cases versus those computed with uniform weights. The idea is to make the similarity with weights as different from uniform weights as possible. By saving these weights for individual users, a customized case base is created [11]. With a similarly objective function, Shiu et al. incorporate learned weights as a first step in their case base maintenance process [42]. Tsang et al. find weights that induce a good clustering to edit the case base. The objective is to improve cluster metrics such as intra-cluster distance (tightness of the cluster) [46]. The objective function for optimization minimizes the difference in the objective with learned weights versus the set of uniform weights.
In Section 1.4, we examined a non-gradient approach that optimizes class purity [17]. Here the weights are changed along the single attribute that improves the purity the most. A subspace projection method, like hill climbing, updates weights along a pre-defined direction as long as the objective function improves [21]. This direction typically has components of several features.
Local search methods tend to converge to a point, known as a local optima. This local solution may not be the global, best solution. Global search methods are intended to find this global solution. Optimization methods are used when we know that there is only one optimal solution, typically because the objective function is convex. Most other search methods expand their search area with randomization.
We can pose the distance function learning problem as one of optimization. In general, we would like to find feature weights that minimize a global error function. Peleg and Meir find a subset of features that minimize the expected generalization error [35]. The objective is to minimize the margin cost for a feature. Features that can adequately separate the cases have low margin cost. Features that are poor separators are not useful for classification. Weinberger et al. extend the optimization problem in the Relief algorithm. For Relief we considered the distance between the nearest hit and miss case for each case in the training set. Keeping the hit and misses fixed during optimization, the objective is to minimize the distance between cases in the same class and maximize the distance between cases in different classes [47].
Genetic algorithms can directly search for weights and can evolve expressions for the distance function. Stahl and Gabel find weights for specific feature values. Each individual is a similarity table containing weights for pairs of feature values. The fitness function is the accuracy of the ranks generated with the weights [44]. Jarmulak et al. select features by evolving a feature bit mask. The fitness function is cross-validation accuracy with classification cases [25]. Using genetic programming, feature indexes and arithmetic operators are added to a syntax tree forming an arithmetic expression. The fitness of the expression is its estimated prediction accuracy [12].
Because objective functions over a clustering can be expensive to compute, we seek search methods that do not evaluate many (very similar) solutions. We can view the search for weights as a transition from one clustering, induced by the original weights, to another, induced by the changed weights. If weights lead to a good cluster, these weights, and those that led to them, form a path of weights that lead to a good clustering. By remembering this path, consisting of which choices were good and bad, the search can focus on good paths while avoiding bad ones. If the path leads to a clustering that has already been seen, the search can quickly switch to a different path rather than repeating work already done. Conceptually, this is an application of reinforcement learning to search, in which the best choices are remembered and reused [2].
The objective of distance function learning for supervised similarity assessment is to find a distance function that groups together cases belonging to the same class, while separating cases of different classes. We introduced a framework for parameterized distance functions, which depends on a vector of weights. We then provided detailed descriptions of algorithms used in three different fields within case-based reasoning: case-based classification, case base maintenance, and case-based retrieval. We showed how to visualize the difference between good and bad distance functions for high-dimensional case bases. Finally, we surveyed recent work in the literature.
Distance function learning is particularly suited to applications with a large number of dimensions, when it is difficult for us to determine which features are the most important. In classification, the Relief algorithm finds features that separate cases from different classes. In case base maintenance, the inside-outside weight update concentrates cases of the same class into cohesive clusters. In case-based retrieval, relevance feedback adapts the distance function to a user's preference at run time.
Distance function learning is a very active research field, and it can benefit from a cross-fertilization of ideas from different fields. The algorithms discussed in this chapter originated in the fields of machine learning, data mining, and information retrieval. In the field of machine learning, recent work has established a relationship between kernels and distance functions. Distance function learning can be applied to obtain better kernels, and kernel methods can be used to derive good distance functions. This has been used in pattern classification. In data mining, distance function learning has found widespread use when users can specify an objective function to organize information. This suggests an interactive approach to unsupervised clustering in which users can explore a clustering by changing objective functions. In information retrieval, distance functions are tailored to individual users and their queries. As different modalities of information become available in addition to text (image, video, signals), distance function learning can be used to emphasize the relevant features with respect to a user's query. In all of these fields, distance function learning is the common thread helping us better assess the similarity between cases.