Tools tagged “dataset”
Mol-Instructions
zjunlp/Mol-Instructions
Mol-Instructions is a dataset that contains a large collection of instructions for biomolecular tasks, including molecule-oriented and protein-oriented tasks. It aims to facilitate the development of large language models for generating and understanding molecular and protein-related information.
ProteinFlow
adaptyvbio/ProteinFlow
ProteinFlow is an open-source Python library that streamlines the pre-processing of protein structure data for deep learning applications. It enables users to filter, cluster, and generate datasets from protein structure databases, facilitating various protein design tasks.
plinder
plinder-org/plinder
PLINDER is a dataset and evaluation resource focused on protein-ligand interactions, containing over 400k systems and numerous annotations for training and benchmarking docking algorithms. It aims to standardize how protein-ligand interaction prediction methods are evaluated in computational chemistry.
ProteinWorkshop
a-r-j/ProteinWorkshop
ProteinWorkshop is a benchmarking framework designed for protein representation learning. It includes a variety of pre-training and downstream task datasets, models, and utilities, making it a valuable resource for researchers in molecular biology and computational chemistry.
equidock_public
octavian-ganea/equidock_public
EquiDock is a tool designed for fast rigid protein-protein docking using independent SE(3)-equivariant models. It includes preprocessing steps for datasets and allows for training and inference of docking models, making it relevant for molecular simulations and drug discovery.
resources_2025
PatWalters/resources_2025
This repository serves as a comprehensive resource for machine learning in drug discovery, offering curated datasets, benchmarks, and educational materials. It focuses on enhancing the understanding and application of cheminformatics in predicting molecular properties and interactions.
MoleculeSTM
chao1224/MoleculeSTM
MoleculeSTM is a multi-modal model designed for text-based editing and retrieval of molecular structures. It provides tools for molecular property prediction and includes datasets for training and evaluation, making it a valuable resource in drug discovery and molecular design.
Awesome-Biomolecule-Language-Cross-Modeling
QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling
Awesome-Biomolecule-Language-Cross-Modeling is a curated list of resources that focuses on leveraging biomolecule data and natural language processing through multi-modal learning. It includes various models and datasets that facilitate tasks related to molecular properties and interactions.
geom
learningmatter-mit/geom
GEOM is a dataset containing 37 million molecular conformations annotated by energy and statistical weight for over 450,000 molecules. It is designed for use in property prediction and molecular generation, providing essential data for researchers in computational chemistry.
Ankh
agemagician/Ankh
Ankh is an optimized protein language model that enhances general-purpose modeling for protein engineering. It offers pre-trained models and datasets for various protein-related tasks, including secondary structure prediction and solubility assessment.
misato-dataset
t7morgen/misato-dataset
The MISATO repository offers a machine learning dataset of protein-ligand complexes designed for structure-based drug discovery. It includes molecular dynamics simulations and quantum mechanics data, facilitating the training of AI models for predicting binding affinities and other molecular properties.
ASE_ANI
isayev/ASE_ANI
ASE_ANI is a prototype interface for the ANI-1x and ANI-1ccx neural network potentials, enabling predictions of molecular properties and facilitating molecular dynamics simulations. It is designed for use within the Atomic Simulation Environment (ASE) and supports various applications in computational chemistry.
nablaDFT
AIRI-Institute/nablaDFT
nablaDFT is a comprehensive dataset and benchmark designed for evaluating neural network potentials in molecular property prediction and Hamiltonian prediction. It includes a large collection of drug-like molecules with calculated electronic properties, making it a valuable resource for computational chemistry and machine learning applications in drug discovery.
MolTrans
kexinhuang12345/MolTrans
MolTrans is a tool for predicting drug-target interactions using a transformer-based model. It addresses challenges in molecular representation learning and provides datasets for training and evaluation.
protein-ligand-benchmark
openforcefield/protein-ligand-benchmark
The 'protein-ligand-benchmark' repository offers a dataset for testing parameters and methods used in protein-ligand binding free energy calculations. It includes detailed metadata for various protein targets and ligands, facilitating research in molecular property prediction and computational chemistry.
PocketGen
zaixizhang/PocketGen
PocketGen is a tool that generates full-atom ligand-binding protein pockets using generative models. It is benchmarked on established datasets such as CrossDocked and Binding MOAD, and the repository provides processed datasets for training and evaluating pocket generation methods.
ATOMICA
mims-harvard/ATOMICA
ATOMICA is a geometric AI model that learns universal representations of intermolecular interactions at an atomic scale. It is pretrained on a large dataset of molecular interaction interfaces and can be used for various downstream tasks, including binding site prediction and embedding biomolecular complexes.
ProteinInvBench
A4Bio/ProteinInvBench
ProteinInvBench is an open-source project that benchmarks structure-based protein design methods. It integrates various models, datasets, and evaluation metrics into a unified framework, facilitating the analysis and development of protein design algorithms.
MolT5
blender-nlp/MolT5
MolT5 is a tool that facilitates the translation between molecular representations (like SMILES) and natural language descriptions. It includes pretrained models for tasks such as molecule captioning and generation, along with datasets for training and evaluation.
spice-dataset
openmm/spice-dataset
The SPICE dataset is a collection of quantum mechanical data aimed at training potential functions for simulating drug-like small molecules interacting with proteins. It includes a wide range of chemical space and conformations, making it a valuable resource for molecular machine learning applications.
ConfGF
DeepGraphLearning/ConfGF
ConfGF implements a learned gradient field approach to generating molecular conformations. It provides tools for training models on molecular datasets and generating conformations from SMILES representations, making it useful for molecular design and related applications.
3DInfomax
HannesStark/3DInfomax
3DInfomax enhances graph neural networks for predicting molecular properties by leveraging 3D molecular geometry. It provides tools for pre-training and fine-tuning models on various molecular datasets, enabling better predictions and molecular fingerprint generation.
chemml
hachmannlab/chemml
ChemML is a machine learning and informatics program suite that facilitates the analysis and modeling of chemical and materials data. It provides tools for predicting molecular properties and supports various applications in drug discovery and materials informatics.
DrugOOD
tencent-ailab/DrugOOD
DrugOOD is a dataset curator and benchmark tool designed for AI-aided drug discovery, focusing on generating datasets for ligand-based and structure-based affinity prediction. It supports various noise levels and domain annotations, making it a valuable resource for researchers in molecular property prediction.