Input data format

Virtual Computational Chemistry Laboratory

The program processes data in tabular form. Each data entry (one case) corresponds to one row, and each variable corresponds to one column. For example, Kubinyi's set contains five independent variables (inputs X1-X5, INPUTS=5), one dependent variable (Y, OUTPUTS=1), and 8 data entries. The dependent (OUTPUTS) variables should always be the last columns. These variables will not be processed by the UFS.

You can copy and paste this set into the program as:

8
0 0 1 1 0   1
1 0 0 1 1   1
0 1 1 2 1   2
2 -2 2 3 1   2
0 0 1 1 1   1
1 0 1 2 1   2
1 0 0 1 1   1
0 3 -1 3 .99 2.1

The dependent variable (Y) should follow the independent variables (X1-X5). The first line should give the number of rows (data entries) available in the training data set.
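To make the layout concrete, here is a minimal Python sketch that reads this format outside the applet. It is not part of the program: the function name `parse_table` and its `n_outputs` parameter are hypothetical, and the sketch assumes only what is described above (a row count on the first line, whitespace-separated numbers, output columns last).

```python
def parse_table(text, n_outputs=1):
    """Split a whitespace-separated table into input rows (X) and output rows (Y).

    Hypothetical helper, not part of the applet. `n_outputs` plays the role
    of OUTPUTS: the number of trailing columns holding dependent variables.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_rows = int(lines[0])  # first line: number of data entries
    rows = [[float(tok) for tok in ln.split()] for ln in lines[1:1 + n_rows]]
    if len(rows) != n_rows:
        raise ValueError(f"expected {n_rows} rows, found {len(rows)}")
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("rows have different numbers of columns")
    n_inputs = width - n_outputs  # dependent variables are the last columns
    X = [r[:n_inputs] for r in rows]
    Y = [r[n_inputs:] for r in rows]
    return X, Y


KUBINYI = """8
0 0 1 1 0   1
1 0 0 1 1   1
0 1 1 2 1   2
2 -2 2 3 1   2
0 0 1 1 1   1
1 0 1 2 1   2
1 0 0 1 1   1
0 3 -1 3 .99 2.1
"""

X, Y = parse_table(KUBINYI)
print(len(X), "entries,", len(X[0]), "inputs,", len(Y[0]), "output")
```

Splitting each line with `str.split()` treats any run of spaces or tabs as a single separator, so the extra spaces before the last column need no special handling.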

You can also provide the names of the columns (VARIABLE_NAMES=1) on the line directly after the row count, before the first data entry:

8
column_1 column_2 column_3 column_4 column_5   output_1
0 0 1 1 0   1
1 0 0 1 1   1
0 1 1 2 1   2
2 -2 2 3 1   2
0 0 1 1 1   1
1 0 1 2 1   2
1 0 0 1 1   1
0 3 -1 3 .99 2.1
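A header row can be handled along the following lines. This is only a sketch under the assumptions visible in the example: the header comes directly after the row count and is not included in that count. The function name `parse_with_header` is hypothetical, and the sample below is a three-row excerpt of the set above (so its count line is 3).

```python
def parse_with_header(text, n_outputs=1):
    """Read a table whose second line holds the column names.

    Hypothetical helper, not part of the applet. Returns a dict mapping
    each column name to its values, plus the input and output name lists.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_rows = int(lines[0])   # the count covers data rows only, not the header
    names = lines[1].split()
    columns = {name: [] for name in names}
    for ln in lines[2:2 + n_rows]:
        tokens = ln.split()
        if len(tokens) != len(names):
            raise ValueError(f"row has {len(tokens)} fields, expected {len(names)}")
        for name, tok in zip(names, tokens):
            columns[name].append(float(tok))
    n_inputs = len(names) - n_outputs  # outputs are the last columns
    return columns, names[:n_inputs], names[n_inputs:]


SAMPLE = """3
column_1 column_2 column_3 column_4 column_5   output_1
0 0 1 1 0   1
1 0 0 1 1   1
0 1 1 2 1   2
"""

columns, inputs, outputs = parse_with_header(SAMPLE)
print(inputs, outputs)
```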

Spaces and tabs are not allowed in the names of variables (columns). If the data set contains names of data entries, indicate this with NAMES=1; the name is then the first item in each row. An example of the same data set with names is:

8
1 0 0 1 1 0   1
999999 1 0 0 1 1   1
This_is_a_long_name 0 1 1 2 1   2
The_name_can_be 2 -2 2 3 1   2
any_character 0 0 1 1 1   1
@3$$091 1 0 1 2 1   2
but 1 0 0 1 1   1
no_space_and_tabs! 0 3 -1 3 .99 2.1

You can also see that the data need not be aligned in columns. Values can be separated by any number of tabs and spaces, but the names of data entries and variables must not contain them.
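When preparing a file programmatically, it can help to check the names before pasting the data into the program. The sketch below writes rows in the layout described on this page and rejects names that contain whitespace. The function `format_dataset` and its parameters are hypothetical. The page does not say how the header should look when VARIABLE_NAMES=1 and NAMES=1 are combined, so the header is written exactly as the caller supplies it.

```python
def format_dataset(rows, entry_names=None, variable_names=None):
    """Return the data set as text: row count, optional header, then rows.

    Hypothetical helper, not part of the applet.
    """
    def check(name):
        # Names may contain any characters except spaces and tabs.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid name (empty or contains whitespace): {name!r}")
        return name

    if entry_names is not None and len(entry_names) != len(rows):
        raise ValueError("need exactly one entry name per row")

    lines = [str(len(rows))]  # first line: number of data entries
    if variable_names is not None:
        # Assumption: the header is emitted as given by the caller.
        lines.append(" ".join(check(n) for n in variable_names))
    for i, row in enumerate(rows):
        fields = [str(v) for v in row]
        if entry_names is not None:
            fields.insert(0, check(entry_names[i]))  # name is the first item
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


text = format_dataset(
    [[0, 0, 1, 1, 0, 1], [0, 3, -1, 3, 0.99, 2.1]],
    entry_names=["first", "no_space_and_tabs!"],
)
print(text)
```

A single space is used as the separator here; since alignment does not matter, tabs or padded columns would be equally valid.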
