Top datasets
Mutagenesis
The dataset comprises 230 molecules tested for mutagenicity on Salmonella typhimurium. A subset of 188 molecules is learnable using linear regression. This subset was later termed the "regression friendly" dataset. The remaining subset of 42 molecules is named the …
Financial
The PKDD'99 Financial dataset contains 606 successful and 76 unsuccessful loans, along with their details and transactions.
Trains
The East-West challenge (1980) database describes east-bound and west-bound trains.
IMDb
The IMDb database: a moderately large, real database of movies.
Genes
KDD Cup 2001 prediction of gene/protein function and localization.
Hepatitis
The PKDD'02 Hepatitis dataset describes 206 instances of Hepatitis B, contrasted with 484 cases of Hepatitis C.
PTE
A database from the Predictive Toxicology Evaluation Challenge (1997). The task is to predict whether a compound is carcinogenic.
UW-CSE
This dataset lists facts about the Department of Computer Science and Engineering at the University of Washington (UW-CSE), such as entities (e.g., Student, Professor) and their relationships (e.g., AdvisedBy, Publication).
PTC
The Predictive Toxicology Challenge (2000) dataset consists of more than three hundred organic molecules labeled according to their carcinogenicity in male and female mice and rats.
Biodegradability
This is an older dataset of chemical structures containing 328 compounds labeled by their half-life for aerobic aqueous biodegradation (a regression task).
Musk
The Musk database describes molecules occurring in different conformations. Each molecule is either musk or non-musk and one of the conformations determines this property. Such a problem is known as a multiple-instance problem, and is modeled by two tables molecule and…
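The multiple-instance setting described for Musk can be illustrated with a minimal Python sketch. It assumes the standard multiple-instance convention that a molecule (the "bag") is musk if at least one of its conformations (the "instances") is musk. The row layout, identifiers, and per-conformation flags below are hypothetical and are not taken from the actual database; in real multiple-instance data only the molecule-level label is observed.

```python
from collections import defaultdict

# Hypothetical rows: (molecule_id, conformation_id, conformation_is_musk).
# The per-conformation flag is hidden in real multiple-instance data;
# it is shown here only to illustrate how the bag label arises.
conformations = [
    ("mol_a", "a1", False),
    ("mol_a", "a2", True),
    ("mol_b", "b1", False),
    ("mol_b", "b2", False),
]


def bag_labels(rows):
    """Label a molecule as musk if any of its conformations is musk."""
    labels = defaultdict(bool)  # unseen molecules start as non-musk
    for molecule_id, _conformation_id, is_musk in rows:
        labels[molecule_id] = labels[molecule_id] or is_musk
    return dict(labels)


labels = bag_labels(conformations)
print(labels)  # {'mol_a': True, 'mol_b': False}
```

A learner working on such data sees only the molecule-level labels and must infer which conformation is responsible.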
Carcinogenesis
The task is to predict whether a given molecule is carcinogenic. The dataset contains 182 positive carcinogenicity tests and 148 negative tests.