Relational Dataset Repository

Mutagenesis

The dataset comprises 230 molecules tested for mutagenicity on Salmonella typhimurium. A subset of 188 molecules is learnable using linear regression. This subset was later termed the "regression friendly" dataset. The remaining subset of 42 molecules is named the …

Financial

The PKDD'99 Financial dataset contains 606 successful and 76 unsuccessful loans, along with their details and transactions.

Trains

The East-West challenge (1980) database describes east-bound and west-bound trains.

IMDb

The IMDb database: a moderately large, real-world database of movies.

Genes

KDD Cup 2001 prediction of gene/protein function and localization.

Hepatitis

The PKDD'02 Hepatitis dataset describes 206 cases of hepatitis B, contrasted against 484 cases of hepatitis C.

PTE

A database from the Predictive Toxicology Evaluation Challenge (1997). The task is to predict whether a compound is carcinogenic.

UW-CSE

This dataset lists facts about the Department of Computer Science and Engineering at the University of Washington (UW-CSE), such as entities (e.g., Student, Professor) and their relationships (e.g., AdvisedBy, Publication).

PTC

The Predictive Toxicology Challenge (2000) dataset consists of more than three hundred organic molecules labeled by their carcinogenicity in male and female mice and rats.

Biodegradability

This is an older data set of chemical structures containing 328 compounds labeled by their half-life for aerobic aqueous biodegradation (a regression task).

Musk

The Musk database describes molecules occurring in different conformations. Each molecule is either musk or non-musk and one of the conformations determines this property. Such a problem is known as a multiple-instance problem, and is modeled by two tables molecule and…
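
The multiple-instance labeling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `instances` rows, their identifiers, and the per-conformation flags are made up for the example and are not taken from the dataset. In a real multiple-instance setting only the molecule-level label is observed, and the per-conformation flags are what a learner has to infer.

```python
from collections import defaultdict

# Hypothetical rows: (molecule_id, conformation_id, conformation_is_musk).
# The per-conformation flag is an illustrative assumption; the real data
# only labels whole molecules.
instances = [
    ("m1", "c1", False),
    ("m1", "c2", True),
    ("m2", "c1", False),
    ("m2", "c2", False),
]

def bag_labels(rows):
    """Label a molecule as musk if at least one of its conformations is musk."""
    labels = defaultdict(bool)  # every molecule starts as non-musk
    for molecule_id, _conformation_id, is_musk in rows:
        labels[molecule_id] = labels[molecule_id] or is_musk
    return dict(labels)

labels = bag_labels(instances)
```

Here `m1` comes out as musk because one of its two conformations is flagged, while `m2` stays non-musk because none of its conformations are.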

Carcinogenesis

The task is to predict whether a given molecule is carcinogenic. The dataset contains 182 positive carcinogenicity tests and 148 negative tests.

Top datasets