Virtual Computational Chemistry Laboratory



Unsupervised Forward Selection (UFS) is a data reduction algorithm that selects from a data matrix a maximal linearly independent set of columns with a minimal amount of multiple correlation.

UFS was designed for use in the development of Quantitative Structure-Activity Relationship (QSAR) models, where the m by n data matrix contains the values of n variables (typically molecular properties) for m objects (typically compounds). QSAR data sets often contain redundancy (exact linear dependencies between subsets of the variables) and multicollinearity (high multiple correlations between subsets of the variables). Both of these features hinder the development of QSAR models that generalise successfully to new objects. UFS produces a reduced data set that contains no redundancy and a minimal amount of multicollinearity.

UFS is a forward stepping algorithm that proceeds as follows. First, the two columns with the smallest squared pairwise correlation coefficient are chosen. Then, for each i > 2, the i-th column is chosen from the remaining columns as the one with the smallest squared multiple correlation coefficient with the i-1 columns already selected. The process halts when the number of columns selected reaches the rank of the data matrix; that is, when the squared multiple correlation coefficient of each remaining column with those already selected equals one. Thus the algorithm builds a basis for the column space of the data matrix, minimising the multicollinearity in the reduced data set at each stage.
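
The stepping procedure described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the ufs program itself: the function names squared_multiple_correlation and ufs_sketch and the tolerance tol are introduced here for illustration, the columns are always mean-centred, and no column is assumed to be constant.

```python
import numpy as np

def squared_multiple_correlation(basis, column):
    """R^2 of `column` regressed on the columns of `basis`.

    Both inputs are assumed to be mean-centred, and `column` non-constant.
    """
    coeffs, *_ = np.linalg.lstsq(basis, column, rcond=None)
    residual = column - basis @ coeffs
    return 1.0 - (residual @ residual) / (column @ column)

def ufs_sketch(X, tol=1e-10):
    """Return the indices of the selected columns, in selection order.

    `tol` (hypothetical) decides when R^2 is treated as equal to one.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)          # mean-centre each column
    n = Xc.shape[1]

    # Step 1: the pair of columns with the smallest squared correlation.
    corr_sq = np.corrcoef(Xc, rowvar=False) ** 2
    np.fill_diagonal(corr_sq, np.inf)  # ignore self-correlations
    i, j = np.unravel_index(np.argmin(corr_sq), corr_sq.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]

    # Step 2: repeatedly add the column least explained by those selected.
    while remaining:
        rsq = [squared_multiple_correlation(Xc[:, selected], Xc[:, k])
               for k in remaining]
        best = int(np.argmin(rsq))
        if rsq[best] >= 1.0 - tol:   # every remaining column is dependent
            break
        selected.append(remaining.pop(best))
    return selected

# Example: the third column is an exact linear combination of the first two.
rng = np.random.default_rng(0)
a, b, d = rng.normal(size=(3, 50))
X = np.column_stack([a, b, a + b, d])
selected = ufs_sketch(X)
```

In this example the data matrix has four columns but only three are linearly independent after centring, so the sketch stops once three columns have been selected.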

In practice, the ufs program excludes variables with standard deviation less than sdevmin and halts when the squared multiple correlation coefficient of each remaining column with those already selected exceeds rsqmax. Both sdevmin and rsqmax are user-adjustable parameters. The columns of the data matrix are usually mean-centred before any calculations are performed, but this behaviour may be overridden by the user.
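
The two practical controls can be illustrated with a pair of small helpers. This is a hedged sketch only: the function names prefilter_columns and should_halt and the centre flag are hypothetical, the use of the sample standard deviation (ddof=1) is an assumption, and no default values for sdevmin or rsqmax are implied.

```python
import numpy as np

def prefilter_columns(X, sdevmin, centre=True):
    """Drop columns whose standard deviation is below sdevmin.

    Returns the indices of the kept columns and the filtered matrix,
    mean-centred unless `centre` is False.
    """
    X = np.asarray(X, dtype=float)
    sdev = X.std(axis=0, ddof=1)     # assumption: sample standard deviation
    kept = np.flatnonzero(sdev >= sdevmin)
    Xk = X[:, kept]
    if centre:
        Xk = Xk - Xk.mean(axis=0)
    return kept, Xk

def should_halt(rsq_remaining, rsqmax):
    """True when every remaining column's R^2 with the selected set exceeds rsqmax."""
    return bool(np.all(np.asarray(rsq_remaining, dtype=float) > rsqmax))

# Example: the middle column is constant and is excluded before selection.
X = np.array([[1.0, 5.0, 2.0],
              [2.0, 5.0, 4.0],
              [3.0, 5.0, 9.0]])
kept, Xk = prefilter_columns(X, sdevmin=1e-3)
```

Setting rsqmax below one makes selection stop earlier than the exact-rank criterion, so columns that are nearly, but not exactly, linear combinations of those already selected are also left out.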

For further details, including applications of UFS to QSAR model building, see D. C. Whitley, M. G. Ford and D. J. Livingstone, Unsupervised Forward Selection: A Method for Eliminating Redundant Variables, Journal of Chemical Information and Computer Sciences, 2000, 40, 1160-1168.


Copyright 2001 -- 2016 All rights reserved.