Several neural network researchers have investigated the problem of improving prediction by removing undesirable features or vectors. One approach uses a genetic algorithm to search the space of subsets of the universe of inputs, with the actual prediction error as the fitness criterion. This is an expensive operation since, to evaluate each subset of indicators, training has to be done from scratch.
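The genetic-algorithm search can be sketched as follows. This is a minimal illustration, not the cited authors' method: the synthetic data, the nearest-centroid stand-in for "training from scratch", and all population and mutation parameters are assumptions chosen to keep the example self-contained.

```python
import random

random.seed(0)

# Synthetic data: 30 samples, 6 features; only features 0 and 1 carry
# class signal. Purely illustrative.
def make_data(n=30, d=6):
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [random.gauss(0, 1) for _ in range(d)]
        row[0] += 3 * label   # informative feature
        row[1] -= 3 * label   # informative feature
        X.append(row)
        y.append(label)
    return X, y

def error_rate(X, y, mask):
    # "Training from scratch" for each subset: fit class centroids on the
    # selected features, then count misclassifications. This per-subset
    # retraining is exactly the expensive step noted in the text.
    cols = [j for j in range(len(mask)) if mask[j]]
    if not cols:
        return 1.0
    cents = {}
    for c in (0, 1):
        rows = [[X[i][j] for j in cols] for i in range(len(y)) if y[i] == c]
        cents[c] = [sum(v) / len(rows) for v in zip(*rows)]
    errs = 0
    for i in range(len(y)):
        p = [X[i][j] for j in cols]
        pred = min((0, 1),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, cents[c])))
        errs += pred != y[i]
    return errs / len(y)

def ga_select(X, y, d, gens=20, pop_size=12):
    # Each individual is a bit mask over the d features.
    pop = [[random.randint(0, 1) for _ in range(d)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda m: error_rate(X, y, m))   # lower error = fitter
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, d)              # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:                 # mutation
                k = random.randrange(d)
                child[k] ^= 1
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda m: error_rate(X, y, m))

X, y = make_data()
best = ga_select(X, y, d=6)
print("selected mask:", best, "error:", error_rate(X, y, best))
```

The cost noted above is visible here: every fitness evaluation calls `error_rate`, which refits the classifier for that subset.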
A different approach to this problem is to remove some training samples (as opposed to features) from the training set. The candidate samples for removal belong to the ‘malicious’ category of data that harms out-of-sample performance.
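One simple way to operationalize this is a leave-one-out screen: remove each training sample in turn, refit, and keep the sample only if its removal does not improve error on a held-out validation set. This sketch is an assumption about how such screening might be done, using a toy one-dimensional midpoint-of-means classifier; the data and the two deliberately harmful points are made up.

```python
import random

random.seed(1)

# Toy 1-D data: class means at 0 and 2; two extreme mislabeled
# ("malicious") points are appended to the training set.
train = [(random.gauss(0, 0.5), 0) for _ in range(10)] + \
        [(random.gauss(2, 0.5), 1) for _ in range(10)]
train += [(6.0, 0), (7.0, 0)]          # deliberately harmful samples
val = [(random.gauss(0, 0.5), 0) for _ in range(20)] + \
      [(random.gauss(2, 0.5), 1) for _ in range(20)]

def fit_threshold(data):
    # Midpoint-of-means classifier: predict class 1 if x > threshold.
    m0 = [x for x, c in data if c == 0]
    m1 = [x for x, c in data if c == 1]
    return (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2

def val_error(thr):
    return sum((x > thr) != bool(c) for x, c in val) / len(val)

base = val_error(fit_threshold(train))
kept = []
for i, s in enumerate(train):
    rest = train[:i] + train[i + 1:]
    # Keep the sample only if removing it does not improve validation error.
    if val_error(fit_threshold(rest)) >= base:
        kept.append(s)

print(len(train), "->", len(kept), "samples; val error",
      base, "->", val_error(fit_threshold(kept)))
```

Like subset search over features, this screen retrains once per candidate removal, so it shares the same computational burden.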
Another approach is to extract features by forming linear combinations of the features in the full feature set. We propose a decision boundary method for doing this. Such a method favors features that discriminate between the classes of interest, rather than fidelity of representation as principal component analysis does. This technique works well provided that none of the inputs in our universe confuses the predictor.
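The contrast between representation-oriented and discrimination-oriented linear combinations can be illustrated with a Fisher-style discriminant direction standing in for a decision-boundary method (an assumption; it is not the method proposed here). The data layout, where the highest-variance direction carries no class information, is contrived to make the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes separated along x, but with large shared variance along y,
# so PCA's top component (max variance) is nearly orthogonal to the
# discriminative direction. Sizes and scales are illustrative.
n = 200
X0 = rng.normal([0.0, 0.0], [0.5, 4.0], size=(n, 2))
X1 = rng.normal([2.0, 0.0], [0.5, 4.0], size=(n, 2))
X = np.vstack([X0, X1])

# PCA direction: top eigenvector of the pooled covariance (fidelity of
# representation).
cov = np.cov(X.T)
w_pca = np.linalg.eigh(cov)[1][:, -1]

# Fisher direction: Sw^{-1} (m1 - m0) (discrimination between classes).
Sw = np.cov(X0.T) + np.cov(X1.T)
w_fld = np.linalg.solve(Sw, X1.mean(0) - X0.mean(0))
w_fld /= np.linalg.norm(w_fld)

def mismatch_rate(w):
    # Project to 1-D and count mismatches against the midpoint threshold.
    p0, p1 = X0 @ w, X1 @ w
    thr = (p0.mean() + p1.mean()) / 2
    s = np.sign(p1.mean() - p0.mean())
    return (np.sum(s * p0 > s * thr) + np.sum(s * p1 < s * thr)) / (2 * n)

print("PCA mismatch rate:   ", mismatch_rate(w_pca))
print("Fisher mismatch rate:", mismatch_rate(w_fld))
```

On this data the PCA projection mixes the classes almost completely, while the discriminant projection separates them cleanly.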
Several related techniques exist in the field of pattern recognition, all dealing with feature subset selection. The existing methods, however, assume that all features in the full feature set are useful to the classification task. The motivation there is to select features until any further increase in prediction accuracy is not justified by the added complexity of an extra feature. In our application the problem is quite different: we are typically given a large full feature set in which not all of the features help, and some may actively confuse the predictor.
A further related technique is the well-known analysis of variance (ANOVA). This technique computes the sum of squares between classes, SS(between), and the sum of squares within each class, SS(within), and is described for situations where each class is characterized by only a single quantity or feature. The ratio of SS(between) to SS(within) can be used as an alternative criterion for both of the proposed feature selection algorithms. The advantage of using the number of mismatches as the criterion is the ability to handle cases where each class is characterized by more than one feature, as is typical in many prediction tasks.
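The ANOVA ratio for a single feature can be sketched in a few lines; the group values below are made-up numbers chosen so that one feature separates the two classes and the other does not.

```python
# ANOVA-style criterion for a single feature: SS(between) / SS(within).
# Each inner list holds that feature's values for one class (illustrative).
groups_good = [[1.0, 1.2, 0.9, 1.1], [3.0, 2.8, 3.1, 3.2]]   # separates classes
groups_bad  = [[1.0, 3.0, 2.0, 0.5], [1.1, 2.9, 2.1, 0.6]]   # does not

def ss_ratio(groups):
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    # SS(between): size-weighted squared deviation of class means from
    # the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # SS(within): squared deviation of each value from its class mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return ss_between / ss_within

print(ss_ratio(groups_good), ss_ratio(groups_bad))
```

A discriminative feature yields a large ratio, a useless one a ratio near zero; but as noted above, this criterion applies per feature, whereas the mismatch count extends directly to classes described by several features at once.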