Partial Least Squares Regression (PLSR)

The Partial Least Squares (PLS) regression technique is especially useful in the quite common case where the number of descriptors (independent variables) is comparable to or greater than the number of compounds (data points), and/or other factors lead to correlations between the variables. In this case the solution of the classical least squares problem does not exist or is unstable and unreliable. The PLS approach, on the other hand, leads to stable, correct and highly predictive models even for correlated descriptors [1-3]. The same code base is successfully employed in software implementing the Molecular Field Topology Analysis (MFTA) technique proposed by us [4] for QSAR studies of organic compounds.

PLS regression is based on a linear transition from a large number of original descriptors to a new variable space based on a small number of orthogonal factors (latent variables). In other words, the factors are mutually independent (orthogonal) linear combinations of the original descriptors. Unlike some similar approaches (e.g. principal component regression, PCR), the latent variables are chosen in such a way as to provide the maximum correlation with the dependent variable; thus, a PLS model contains the smallest necessary number of factors [2]. With an increasing number of factors, the PLS model converges to an ordinary multiple linear regression model (if one exists). In addition, the PLS approach allows one to detect the relationship between activity and descriptors even if the key descriptors contribute little to the first few principal components.

This concept is illustrated by Fig. 1, which represents a hypothetical data set with two independent variables x1 and x2 and one dependent variable y. It is easy to see that the original variables x1 and x2 are strongly correlated. From them, we pass to two orthogonal factors (latent variables) t1 and t2 that are linear combinations of the original descriptors. As a result, a single-factor model can be obtained that relates the activity y to the first latent variable t1.
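As a rough numerical illustration of this idea, the following minimal numpy sketch (our own, not part of the original method description; all data are synthetic) builds a first latent variable t1 from two strongly correlated descriptors and fits a single-factor model, as in Fig. 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # x2 strongly correlated with x1
y = 2.0 * x1 + 1.8 * x2 + 0.1 * rng.normal(size=n)

# Center descriptors and activity
X = np.column_stack([x1, x2])
X -= X.mean(axis=0)
yc = y - y.mean()

# Weight vector: the direction in descriptor space most correlated with activity
w = X.T @ yc
w /= np.linalg.norm(w)
t1 = X @ w                   # first latent variable (a combination of x1 and x2)
q = (t1 @ yc) / (t1 @ t1)    # single-factor model: y ~ q * t1

print("corr(x1, x2) =", np.corrcoef(x1, x2)[0, 1])
print("R2 of the one-factor model =",
      1 - np.sum((yc - q * t1) ** 2) / np.sum(yc ** 2))
```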
Fig. 1. Transformation of original descriptors to latent variables (a) and construction of an activity model containing one PLS factor (b).

The basic algorithm of the PLS method [1] for the step of building the k-th factor can be represented as follows (a code sketch of this step is given below):

$$\mathbf{w} = \frac{\mathbf{X}^{T}\mathbf{y}}{\lVert \mathbf{X}^{T}\mathbf{y} \rVert}, \quad \mathbf{t} = \mathbf{X}\mathbf{w}, \quad \mathbf{p} = \frac{\mathbf{X}^{T}\mathbf{t}}{\mathbf{t}^{T}\mathbf{t}}, \quad q = \frac{\mathbf{t}^{T}\mathbf{y}}{\mathbf{t}^{T}\mathbf{t}}, \quad \mathbf{X}_{(k+1)} = \mathbf{X} - \mathbf{t}\mathbf{p}^{T}, \quad \mathbf{y}_{(k+1)} = \mathbf{y} - q\,\mathbf{t} \tag{1}$$

where N is the number of compounds, M is the number of descriptors, X[N,M] is the descriptor matrix, y[N] is the activity vector, w[M] is an auxiliary weight vector, t[N] is the factor coefficient vector, p[M] is the loading vector, and q is the scalar coefficient of the relationship between the factor and the activity; all vectors are columns, and entities without the index "(k+1)" refer to the current (k-th) factor.

Because the latent variables are linear combinations of the original descriptors (with coefficients represented by the loading vector p), the factor model indirectly describes the effect of each descriptor on the activity. This approach to factor construction describes the available data using the minimum number of adjustable parameters and, consequently, provides the maximum precision and stability of the regression model. However, the inclusion of excessive factors in the model increases the accuracy of description but may decrease the predictivity, as the model starts to represent not only the true pattern of the relationship between descriptors and activity but also random noise and individual features of the training set.
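The following is a minimal Python sketch of the factor-building step (1). It is our own illustration under the notation defined above (the function name pls1_nipals and the assumption of pre-centered X and y are ours), not the tool's actual implementation:

```python
import numpy as np

def pls1_nipals(X, y, n_factors):
    """Extract PLS factors following scheme (1); X [N, M] and y [N] are
    assumed to be centered. Returns stacked w, t, p vectors and q values."""
    X = X.astype(float).copy()
    y = y.astype(float).copy()
    W, T, P, Q = [], [], [], []
    for _ in range(n_factors):
        w = X.T @ y                  # auxiliary weight vector (length M)
        w /= np.linalg.norm(w)
        t = X @ w                    # factor coefficient vector (length N)
        tt = t @ t
        p = X.T @ t / tt             # loading vector (length M)
        q = (t @ y) / tt             # factor-activity coefficient (scalar)
        X -= np.outer(t, p)          # deflation: X_(k+1) = X - t p^T
        y -= q * t                   # y_(k+1) = y - q t
        W.append(w); T.append(t); P.append(p); Q.append(q)
    return np.array(W), np.array(T), np.array(P), np.array(Q)
```

As more factors are extracted, the regression model assembled from these t, p, and q values approaches the ordinary multiple linear regression solution, in line with the convergence property noted above.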
To guard against such overfitting, the predictivity of the model is monitored during its construction, after the inclusion of each successive factor, by means of a cross-validation procedure. In the cross-validation approach, the computation is run several times in such a way that a certain subset of the training set is not used in the model construction, and the activity of the excluded compounds is then predicted using this partial model. Each compound is excluded exactly once, and the normalized total error of prediction for the excluded compounds serves as a measure of predictivity of the full model: the cross-validation parameter Q2, which is used in PLS regression to select the optimal number of PLS factors. This procedure is illustrated by Fig. 2, which represents the partitioning of a 9-compound training set (i = 1-9) into 3 cross-validation groups (g = 1-3).

Fig. 2. Scheme of the cross-validation computation.

By summing up the squared errors of prediction for the excluded compounds (shaded in Fig. 2), we obtain the Mean Squared Error of Cross-Validation:

$$MSECV = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \tag{2}$$

where $\hat{y}_i$ is the activity predicted for compound i by the partial model from which it was excluded. The cross-validation parameter is then defined as

$$Q^2 = 1 - \frac{MSECV}{s_y^2} \tag{3}$$

where $s_y$ is the root mean square deviation of y from its average value for the training set. In other words, the Q2 parameter shows to what extent the constructed factor model is better than a random guess. It should be noted that the commonly used leave-one-out cross-validation (where compounds are excluded one by one) may strongly overestimate the predictivity.
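A minimal sketch of this cross-validation loop, reusing the pls1_nipals function sketched above (the function name and the interleaved group split are our illustrative assumptions):

```python
import numpy as np

def q2_cross_validation(X, y, n_factors, n_groups=3):
    """Leave-group-out cross-validation as in Fig. 2: each compound is
    excluded exactly once; returns Q2 = 1 - MSECV / s_y^2 (Eqs. 2-3)."""
    n = len(y)
    y_pred = np.empty(n)
    groups = np.arange(n) % n_groups       # e.g. 9 compounds -> 3 groups
    for g in range(n_groups):
        test = groups == g
        X_train, y_train = X[~test], y[~test]
        x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
        W, T, P, Q = pls1_nipals(X_train - x_mean, y_train - y_mean, n_factors)
        # Back-rotate the factor model to the original descriptor space:
        # b = W (P^T W)^(-1) q  (W and P are stacked row-wise here)
        b = W.T @ np.linalg.inv(P @ W.T) @ Q
        y_pred[test] = y_mean + (X[test] - x_mean) @ b
    msecv = np.mean((y - y_pred) ** 2)     # Eq. (2)
    s_y2 = np.mean((y - y.mean()) ** 2)
    return 1.0 - msecv / s_y2              # Eq. (3)
```

Setting n_groups = len(y) gives the leave-one-out variant which, as noted above, may strongly overestimate the predictivity.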
PLS regression is known to be quite sensitive to the noise created by excessive irrelevant descriptors. To achieve the best model quality, a two-step descriptor selection procedure [5] is applied. The first step is the elimination of low-variability (almost constant) descriptors that differ from a constant only for a few (2-3) compounds in the training set. At best, such descriptors cannot provide useful statistical information; in most cases they simply help to artificially fit these particular compounds, thus decreasing the model predictivity. At the second step, we find the optimal descriptor subset providing the maximum value of the Q2 parameter (Q2-guided descriptor selection) by means of a genetic algorithm. This approach models Darwinian evolution in a large population of so-called 'chromosomes', i.e. compact uniform representations of possible solutions of the optimization problem. In recent years it has been widely employed for descriptor selection in multiple linear regression [6] and PLS regression [7-8], providing a substantial advantage over enumerative and semi-enumerative techniques (e.g. [9-10]) in terms of speed and quality of optimization.

The general scheme of the optimization procedure is shown in Fig. 3. Here, bit mask vectors serve as chromosomes: a vector element is 1 if the corresponding descriptor is included in the model, and 0 otherwise. An initial population containing a preset number of chromosomes is constructed randomly. For each chromosome, the fitness measure is computed as the Q2 value of the PLS model with the specified descriptor set and the optimal number of factors. Then the modification type (crossover or mutation) is selected randomly according to preset probabilities, child chromosomes are constructed, and their fitness is computed. The resulting chromosomes are introduced into the population (replacing its worst members) if they have better fitness and the population contains no chromosomes with higher fitness and a lower descriptor count (the latter condition prevents the selection of excessive descriptors). The process is repeated until convergence is achieved.

Fig. 3. General algorithm of genetic optimization.

Let us consider in more detail the selection and modification of parent chromosomes. As noted above, the algorithm uses two types of modification operators: mutation, which operates on one parent chromosome, and crossover, which operates on two. Parent chromosomes are selected using the fitness ranking approach [11], which considers the sequence of population chromosomes ordered by fitness value; the probability of selection grows linearly with the position of a chromosome in this sequence. This approach ensures the preferred selection of the most fit chromosomes but prevents degenerate selection when the chromosomes in the population are similar or very different in fitness. During mutation, a random position in the parent chromosome is chosen and the value at that position is flipped (see Fig. 4). Crossover of two chromosomes also involves a random choice of position, but here the child chromosomes are formed by swapping the parent chromosome fragments after that position (Fig. 5); this facilitates combining the useful features of different solutions. A code sketch of these operators is given at the end of this section.

Fig. 4. Mutation operator in genetic algorithm.

Fig. 5. Crossover operator in genetic algorithm.

This procedure leads to a substantial (5-10 fold) reduction in the number of descriptors included in the model, a slight change in correlation quality, and a substantial increase in predictivity. Moreover, despite the stochastic nature of the technique, computational experiments demonstrate reasonable stability of the results in terms of both the statistical parameters of the model and the selected descriptor subset [5].

The data available in a PLS model allow one to compute the contribution of each original descriptor to the predicted activity values:

$$c_j = \frac{b_j \, \Delta x_j}{\Delta y} \tag{4}$$

where $b_j$ is the coefficient of descriptor $x_j$ in the PLS model back-rotated to the original descriptor space, $\Delta x_j$ is the variation range of $x_j$, and $\Delta y$ is the activity range in the training set. This parameter assists in identifying the most important descriptors (structural features) and in analyzing the effect of the descriptors on the activity.
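As promised above, here is a minimal sketch of the fitness-ranking selection [11] and of the mutation and crossover operators of Figs. 4 and 5 (the function names and the exact rank-probability scheme are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def rank_select(population, fitness):
    """Fitness-ranking selection [11]: chromosomes are ordered by fitness and
    the selection probability grows linearly with the position in the order."""
    order = np.argsort(fitness)               # worst ... best
    ranks = np.arange(1, len(order) + 1)
    idx = rng.choice(order, p=ranks / ranks.sum())
    return population[idx]

def mutate(parent):
    """Mutation (Fig. 4): flip the inclusion bit at one random position."""
    child = parent.copy()
    i = rng.integers(len(child))
    child[i] ^= 1
    return child

def crossover(parent1, parent2):
    """One-point crossover (Fig. 5): the children swap the fragments
    of the parents after a randomly chosen position."""
    i = rng.integers(1, len(parent1))
    child1 = np.concatenate([parent1[:i], parent2[i:]])
    child2 = np.concatenate([parent2[:i], parent1[i:]])
    return child1, child2

# Chromosomes are bit masks over M descriptors: 1 = descriptor included
M = 8
population = rng.integers(0, 2, size=(10, M))
```

In the full procedure of Fig. 3, the fitness of a chromosome would be the Q2 value returned by q2_cross_validation (sketched earlier) for the descriptor subset encoded by its bit mask.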
References
Copyright 2001-2023 https://vcclab.org. All rights reserved.