Input data format

Virtual Computational Chemistry Laboratory

Input data files must be plain text (txt) files (not Excel, HTML, or anything else!).

The program processes data in tabular form. Each data entry (one case) corresponds to one row, and each variable corresponds to one column. For example, Kubinyi's set contains 5 independent variables (inputs, X1-X5), one dependent variable (output, Y), and 8 data entries.

You can copy and paste this set into the program in either of two layouts:

(A):

8
0 0 1 1 0 1
1 0 0 1 1 1
0 1 1 2 1 2
2 -2 2 3 1 2
0 0 1 1 1 1
1 0 1 2 1 2
1 0 0 1 1 1
0 3 -1 3 .99 2.1

(B):

8
1 0 0 1 1 0
1 1 0 0 1 1
2 0 1 1 2 1
2 2 -2 2 3 1
1 0 0 1 1 1
2 1 0 1 2 1
1 1 0 0 1 1
2.1 0 3 -1 3 .99

The position of Y-values in both tables is:

(A): X1 X2 X3 X4 X5 Y (Y is the last column)
(B): Y X1 X2 X3 X4 X5 (Y is the first column)

The default layout is (A), with the dependent variable (Y) following the independent variables (X1-X5). For layout (B), use REVERSED=1. The first line of the data must give the number of rows (data entries) in the training data set.
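
For illustration only, the two layouts can be read with a few lines of Python. This is a minimal sketch, not the program's own parser: the function name parse_table and its reversed_y argument (standing in for REVERSED=1) are hypothetical, and it assumes a single training set whose row count is on the first line.

```python
def parse_table(text, reversed_y=False):
    """Split a whitespace-separated table into X rows and Y values.

    Hypothetical helper: reversed_y=True mimics REVERSED=1 (Y in the first column).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_rows = int(lines[0].split()[0])          # first line: number of data entries
    rows = [[float(v) for v in ln.split()] for ln in lines[1:1 + n_rows]]
    if len(rows) != n_rows:
        raise ValueError(f"expected {n_rows} rows, found {len(rows)}")
    if reversed_y:
        xs = [r[1:] for r in rows]             # layout (B): Y X1 ... X5
        ys = [r[0] for r in rows]
    else:
        xs = [r[:-1] for r in rows]            # layout (A): X1 ... X5 Y
        ys = [r[-1] for r in rows]
    return xs, ys


layout_a = """8
0 0 1 1 0 1
1 0 0 1 1 1
0 1 1 2 1 2
2 -2 2 3 1 2
0 0 1 1 1 1
1 0 1 2 1 2
1 0 0 1 1 1
0 3 -1 3 .99 2.1
"""
xs, ys = parse_table(layout_a)
```

Reading layout (B) with reversed_y=True should give the same xs and ys, since both layouts hold the same values in a different column order.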

Suppose you want to use the last 2 rows as a test set. This can be done as follows:

6 2
0 0 1 1 0   1
1 0 0 1 1   1
0 1 1 2 1   2
2 -2 2 3 1   2
0 0 1 1 1   1
1 0 1 2 1   2
1 0 0 1 1   1
0 3 -1 3 .99 2.1

The program will then know that there are two data sets. The first set is always used for training, and the second is used to test the algorithm's performance. Up to 10 sets can be added in the same way; only the first set is used to train the program.
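
The way the counts on the first line partition the rows can be sketched in Python as follows. This is an assumption-laden illustration rather than the program's actual code: split_sets is a hypothetical name, and the sketch only handles positive counts (sets with known target values).

```python
def split_sets(text):
    """Split rows into a training set and test sets using the counts on line 1."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = [int(tok) for tok in lines[0].split()]
    rows = [[float(v) for v in ln.split()] for ln in lines[1:]]
    if sum(counts) != len(rows):
        raise ValueError("row counts on the first line do not match the data")
    sets, start = [], 0
    for count in counts:
        sets.append(rows[start:start + count])
        start += count
    return sets[0], sets[1:]   # the first set trains, the rest are test sets


data = """6 2
0 0 1 1 0   1
1 0 0 1 1   1
0 1 1 2 1   2
2 -2 2 3 1   2
0 0 1 1 1   1
1 0 1 2 1   2
1 0 0 1 1   1
0 3 -1 3 .99 2.1
"""
train, tests = split_sets(data)
```

Here train holds the first 6 rows and tests holds a single test set with the last 2 rows.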

If you do not know the target values of the test set, the first line should be changed to:

6 -2
0 0 1 1 0   1
1 0 0 1 1   1
0 1 1 2 1   2
2 -2 2 3 1   2
0 0 1 1 1   1
1 0 1 2 1   2
1 0 0 1 1
0 3 -1 3 .99
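
One way to interpret the signed count is sketched below in Python. It rests on assumptions not stated in the text: that a negative count means "this many rows, without a Y column", that layout (A) is in use, and that the number of input columns is known in advance. The name read_sets and its n_inputs parameter are hypothetical.

```python
def read_sets(text, n_inputs):
    """Return a list of (xs, ys) pairs; ys is None when targets are unknown."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    body = [[float(v) for v in ln.split()] for ln in lines[1:]]
    sets, start = [], 0
    for token in lines[0].split():
        count = int(token)
        has_targets = count > 0                # assumed meaning of the sign
        size = abs(count)
        chunk = body[start:start + size]
        start += size
        width = n_inputs + (1 if has_targets else 0)
        if len(chunk) != size or any(len(r) != width for r in chunk):
            raise ValueError(f"set declared as {token!r} has malformed rows")
        xs = [r[:n_inputs] for r in chunk]
        ys = [r[n_inputs] for r in chunk] if has_targets else None
        sets.append((xs, ys))
    return sets


data = """6 -2
0 0 1 1 0   1
1 0 0 1 1   1
0 1 1 2 1   2
2 -2 2 3 1   2
0 0 1 1 1   1
1 0 1 2 1   2
1 0 0 1 1
0 3 -1 3 .99
"""
(train_x, train_y), (test_x, test_y) = read_sets(data, n_inputs=5)
```

With this input, test_y is None because the last two rows carry no target values.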

If the data set contains names of data entries, indicate this with NAMES=1. An example of the same data set with names is:

8
1 0 0 1 1 0   1
999999 1 0 0 1 1   1
This_is_a_long_name 0 1 1 2 1   2
The_name_can_be 2 -2 2 3 1   2
any_character 0 0 1 1 1   1
@3$$091 1 0 1 2 1   2
but 1 0 0 1 1   1
no_space_and_tabs! 0 3 -1 3 .99 2.1

You can also see that there is no requirement for alignment of data in columns. The data can be separated with any number of tabs and spaces.
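
The name column and the free-form spacing can be handled as in the following Python sketch. It is illustrative only: parse_named_rows is a hypothetical helper, and the three sample rows are an excerpt of the named example with the spacing deliberately varied.

```python
def parse_named_rows(text):
    """Read rows whose first token is a name (as with NAMES=1)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_rows = int(lines[0].split()[0])
    names, rows = [], []
    for ln in lines[1:1 + n_rows]:
        # split() with no argument treats any run of spaces or tabs as one separator
        name, *values = ln.split()
        names.append(name)
        rows.append([float(v) for v in values])
    return names, rows


sample = "3\n1 0 0 1 1 0   1\nThis_is_a_long_name\t0 1 1 2 1 \t 2\nno_space_and_tabs!   0 3 -1 3 .99 2.1\n"
names, rows = parse_named_rows(sample)
```

Because names may not contain spaces or tabs, the first whitespace-separated token on each row can always be taken as the name, even when it looks like a number.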
