Download data

Provided Data

The training and known test set data were used in our previous studies [1-2]. The new "blind" data set was kindly provided by Prof. Schultz. Thus, three datasets are available:

  • Training dataset of 644 molecules -- to develop models
  • Known test dataset of 449 molecules -- for preliminary ranking of methods during the challenge
  • Blind test dataset of 120 molecules -- to identify a winner*

    The experimental data for the Blind set have not been published previously and will be made available by Prof. Schultz only after September 1st. These data cover the structural domain defined by the training dataset. The environmental toxicity is measured as log(IGC50^-1), i.e. the logarithm of the inverse IGC50 (see also Introduction).
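    The toxicity measure above, the logarithm of the inverse IGC50, can be illustrated with a minimal sketch. It assumes a base-10 logarithm and leaves the concentration unit open, since this page states neither; the function names are hypothetical and are not part of the challenge materials.

```python
import math


def to_log_inverse_igc50(igc50: float) -> float:
    """Return log10(1 / IGC50) for a positive concentration.

    Assumption: base-10 logarithm; the concentration unit is whatever
    unit the source data use (not stated on this page).
    """
    if igc50 <= 0:
        raise ValueError("IGC50 must be a positive concentration")
    return -math.log10(igc50)


def from_log_inverse_igc50(value: float) -> float:
    """Invert the transformation: recover IGC50 from log10(1 / IGC50)."""
    return 10.0 ** (-value)
```

    On this scale a larger value corresponds to a smaller IGC50, that is, to a more toxic compound.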

    Structural information for the molecules (SMILES) and several sets of molecular descriptors are available:

  • E-state indices - calculated at the Virtual Computational Chemistry Laboratory site
  • Quantum chemistry - calculated with AM1 in MOPAC 7.1 using optimized structures; also includes logP and logS values
  • DRAGON descriptors - calculated at the Virtual Computational Chemistry Laboratory site using optimized structures
  • SimulationsPlus descriptors - see description
  • MOE descriptors [3] - see description; calculated using optimized structures

    The descriptors can be used to develop models. Participants may also extend the descriptor sets. The structures of the molecules are provided in two formats:

  • SMILES format -- Excel file, which also includes values for the BLIND set
  • MOL2 format -- 3d.tar file with 3D optimized structures
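    To inspect the 3D structures, the tar archive can be read directly with the Python standard library. The sketch below is illustrative only: it assumes the archive has been saved locally and that its structure files carry a .mol2 extension, neither of which is confirmed on this page, and the function names are hypothetical.

```python
import tarfile


def list_mol2_members(tar_path):
    """Return the sorted names of regular files ending in .mol2."""
    with tarfile.open(tar_path, "r") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(".mol2")
        )


def read_mol2_text(tar_path, member_name):
    """Return the text of one archive member without unpacking the tar."""
    with tarfile.open(tar_path, "r") as archive:
        handle = archive.extractfile(member_name)
        if handle is None:
            raise ValueError(f"{member_name} is not a regular file")
        return handle.read().decode("utf-8", errors="replace")
```

    For example, `list_mol2_members("3d.tar")` would list the structure files once the archive is in the working directory.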

    N.B.! Please note that the DRAGON descriptors were updated on Tuesday, July 28th.

    Descriptor set      Excel               ARFF                                TEXT
    E-state indices     download            Training, Known Test, Blind Test    Training, Known Test, Blind Test
    DRAGON              not provided [4]    Training, Known Test, Blind Test    Training, Known Test, Blind Test
    SimulationsPlus     download            Training, Known Test, Blind Test    Training, Known Test, Blind Test
    QuantumChemistry    download            Training, Known Test, Blind Test    Training, Known Test, Blind Test
    MOE                 download            Training, Known Test, Blind Test    Training, Known Test, Blind Test
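    The ARFF files can be loaded with any ARFF-aware tool. As a dependency-free illustration, the sketch below parses a simple dense ARFF document using only the Python standard library. It is a minimal reader, not a full implementation of the format: it does not handle sparse rows, nominal value lists containing spaces, or quoted attribute names. The attribute names and values in the sample are invented for the example and do not come from the challenge files.

```python
import csv

NUMERIC_TYPES = {"numeric", "real", "integer"}

# Toy document; every name and number here is made up for illustration.
SAMPLE_ARFF = """% toy example
@relation toy
@attribute mol_id string
@attribute desc_a numeric
@attribute desc_b numeric
@data
'm1',1.5,-0.21
m2,?,-1.99
"""


def read_arff(text):
    """Parse a simple dense ARFF document into (attribute_names, rows)."""
    names, types, data_lines = [], [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            data_lines.append(line)
        elif line.lower().startswith("@attribute"):
            # Expected shape: @attribute <name> <type>
            _, name, kind = line.split(None, 2)
            names.append(name.strip("'\""))
            types.append(kind.strip().lower())
        elif line.lower().startswith("@data"):
            in_data = True
    rows = []
    for fields in csv.reader(data_lines, quotechar="'", skipinitialspace=True):
        row = {}
        for name, kind, field in zip(names, types, fields):
            if field == "?":
                row[name] = None  # ARFF marks missing values with "?"
            elif kind in NUMERIC_TYPES:
                row[name] = float(field)
            else:
                row[name] = field
        rows.append(row)
    return names, rows
```

    Calling `read_arff(SAMPLE_ARFF)` returns the attribute names in declaration order and one dictionary per data row, with missing values mapped to `None`.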

    [1] Zhu, H.; Tropsha, A.; Fourches, D.; Varnek, A.; Papa, E.; Gramatica, P.; Oberg, T.; Dao, P.; Cherkasov, A.; Tetko, I. V. Combinatorial QSAR Modeling of Chemical Toxicants Tested against Tetrahymena pyriformis. J. Chem. Inf. Model. 2008, 48 (4), 766-784.

    [2] Tetko, I. V.; Sushko, I.; Pandey, A. K.; Zhu, H.; Tropsha, A.; Papa, E.; Oberg, T.; Todeschini, R.; Fourches, D.; Varnek, A. Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection. J. Chem. Inf. Model. 2008, 48 (9), 1733-1746.

    [3] MOE (The Molecular Operating Environment) Version 2008.10, software available from Chemical Computing Group Inc., 1010 Sherbrooke Street West, Suite 910, Montreal, Canada H3A 2R7. http://www.chemcomp.com

    [4] DRAGON descriptors are not provided in this format because of the limit on the maximum number of columns in Excel.