# Protify

A low-code solution for computationally predicting the properties of chemicals.

Table of Contents

- About The Project
- Getting Started
  - Installation
- Usage
- Hyperparameter Optimization
- Contributing
- Built With
- License
- Contact
- Cite

## About The Project

Protify is an open source platform designed to simplify and democratize workflows for chemical language models. With Protify, deep learning models can be trained to predict chemical properties at the click of a button, without requiring extensive coding knowledge or computational resources.

### Why Protify?

- **Benchmark multiple models efficiently**: Need to evaluate 10 different protein language models against 15 diverse datasets with publication-ready figures? Protify makes this possible without writing a single line of code.
- **Flexible for all skill levels**: Build custom pipelines with code or use our no-code interface depending on your needs and expertise.
- **Accessible computing**: No GPU? No problem. Synthyra offers precomputed embeddings for many popular datasets, which Protify can download for analysis with scikit-learn on your laptop.
- **Cost-effective solutions**: The upcoming Synthyra API integration will offer affordable GPU training options, while our Colab notebook provides an accessible entry point for GPU-reliant analysis.

Protify is currently in beta. We're actively working to enhance features and documentation to meet our ambitious goals.

## Currently Supported Models

pLM - Protein Language Model

| Model Name | Description | Size (parameters) | Type |
|------------|-------------|------|------|
| ESM2-8 | Very small pLM from Meta AI that learns evolutionary information from millions of protein sequences. | 8M | pLM |
| ESM2-35 | Small-sized pLM trained on evolutionary data. | 35M | pLM |
| ESM2-150 | Medium-sized pLM with improved protein structure prediction capabilities. | 150M | pLM |
| ESM2-650 | Large pLM offering state-of-the-art performance on many protein prediction tasks. | 650M | pLM |
| ESM2-3B | Largest ESM2 pLM with exceptional capability for protein structure and function prediction. | 3B | pLM |
| ESMC-300 | pLM optimized for representation learning. | 300M | pLM |
| ESMC-600 | Larger pLM for representations. | 600M | pLM |
| ProtBert | BERT-based pLM trained on protein sequences from UniRef. | 420M | pLM |
| ProtBert-BFD | BERT-based pLM trained on BFD database with improved performance. | 420M | pLM |
| ProtT5 | T5-based pLM capable of both encoding and generation tasks. | 3B | pLM |
| ANKH-Base | Base version of the ANKH pLM focused on protein structure understanding. | 400M | pLM |
| ANKH-Large | Large version of the ANKH pLM with improved structural predictions. | 1.2B | pLM |
| ANKH2-Large | Improved second generation ANKH pLM. | 1.2B | pLM |
| GLM2-150 | Medium-sized general language model adapted for protein sequences. | 150M | pLM |
| GLM2-650 | Large general language model adapted for protein sequences. | 650M | pLM |
| GLM2-GAIA | Specialized GLM pLM fine-tuned with contrastive learning. | 650M | pLM |
| DPLM-150 | Diffusion pLM focused on protein structure. | 150M | pLM |
| DPLM-650 | Larger diffusion pLM focused on protein structure. | 650M | pLM |
| DPLM-3B | Largest diffusion pLM in the DPLM family. | 3B | pLM |
| DSM-150 | Diffusion sequence model, 150M-parameter version. | 150M | pLM |
| DSM-650 | Diffusion sequence model, 650M-parameter version. | 650M | pLM |
| DSM-PPI | DSM model optimized for protein-protein interactions. | Varies | pLM |
| ProtCLM-1b | Causal (autoregressive) pLM. | 1B | pLM |
| OneHot-Protein | One-hot encoding baseline for protein sequences. | N/A | Baseline |
| OneHot-DNA | One-hot encoding baseline for DNA sequences. | N/A | Baseline |
| OneHot-RNA | One-hot encoding baseline for RNA sequences. | N/A | Baseline |
| OneHot-Codon | One-hot encoding baseline for codon sequences. | N/A | Baseline |
| Random | Baseline model with randomly initialized weights, serving as a negative control. | Varies | Negative control |
| Random-Transformer | Randomly initialized transformer model serving as a homology-based control. | Varies | Homology control |
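
To make the one-hot baselines in the model list concrete, here is a minimal sketch of one-hot encoding a protein sequence. It is an illustration only, not Protify's implementation: the 20-letter alphabet, the `one_hot_encode` name, and the all-zero handling of unknown residues are assumptions made for this example.

```python
from typing import List

# Assumed alphabet: the 20 standard amino acids. Protify's own vocabulary may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Return one row per residue with a single 1 at the residue's alphabet index.

    Residues outside the assumed alphabet (for example 'X') become all-zero rows.
    """
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        index = AA_TO_INDEX.get(residue)
        if index is not None:
            row[index] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("MKV")
print(len(encoded), len(encoded[0]))  # residues x alphabet positions
```

Because such an encoding carries no learned information, it gives a floor against which the pretrained models in the table can be compared.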

## Currently Supported Datasets

BC - Binary Classification | SLC - Single-Label Classification | MCC - Multi-Class Classification | MLC - Multi-Label Classification | R - Regression | TC - Tokenwise classification | TR - Tokenwise regression

| Dataset Name | Description | Type | Task | Tokenwise | Multiple inputs |
|--------------|-------------|------|------|-----------|-------------|
| EC | Enzyme Commission numbers dataset for predicting enzyme function classification. | MLC | Protein function prediction | No | No |
| GO-CC | Gene Ontology Cellular Component dataset for predicting protein localization in cells. | MLC | Protein localization prediction | No | No |
| GO-BP | Gene Ontology Biological Process dataset for predicting protein involvement in biological processes. | MLC | Protein function prediction | No | No |
| GO-MF | Gene Ontology Molecular Function dataset for predicting protein molecular functions. | MLC | Protein function prediction | No | No |
| MB | Metal ion binding dataset for predicting protein-metal interactions. | BC | Protein-metal binding prediction | No | No |
| DeepLoc-2 | Binary classification dataset for predicting protein localization in 2 categories. | BC | Protein localization prediction | No | No |
| DeepLoc-10 | Multi-class classification dataset for predicting protein localization in 10 categories. | MCC | Protein localization prediction | No | No |
| Subcellular | Dataset for predicting subcellular localization of proteins. | MCC | Protein localization prediction | No | No |
| enzyme-kcat | Dataset for predicting enzyme catalytic rate constants (kcat). | R | Enzyme kinetics prediction | No | No |
| solubility | Dataset for predicting protein solubility properties. | BC | Protein solubility prediction | No | No |
| localization | Dataset for predicting subcellular localization of proteins. | MCC | Protein localization prediction | No | No |
| temperature-stability | Dataset for predicting protein stability at different temperatures. | BC | Protein stability prediction | No | No |
| optimal-temperature | Dataset for predicting the optimal temperature for protein function. | R | Protein property prediction | No | No |
| optimal-ph | Dataset for predicting the optimal pH for protein function. | R | Protein property prediction | No | No |
| material-production | Dataset for predicting protein suitability for material production. | BC | Protein application prediction | No | No |
| fitness-prediction | Dataset for predicting protein fitness in various environments. | BC | Protein fitness prediction | No | No |
| number-of-folds | Dataset for predicting the number of structural folds in proteins. | BC | Protein structure prediction | No | No |
| cloning-clf | Dataset for predicting protein suitability for cloning operations. | BC | Protein engineering prediction | No | No |
| stability-prediction | Dataset for predicting overall protein stability. | BC | Protein stability prediction | No | No |
| SecondaryStructure-3 | Dataset for predicting protein secondary structure in 3 classes. | MCC | Protein structure prediction | Yes | No |
| SecondaryStructure-8 | Dataset for predicting protein secondary structure in 8 classes. | MCC | Protein structure prediction | Yes | No |
| fluorescence-prediction | Dataset for predicting protein fluorescence properties. | R | Protein property prediction | Yes | No |
| plastic | Dataset for predicting protein capability for plastic degradation. | BC | Enzyme function prediction | No | No |
| gold-ppi | Gold standard dataset for protein-protein interaction prediction. | SLC | PPI prediction | No | Yes |
| human-ppi-saprot | Human protein-protein interaction dataset from SAProt paper. | SLC | PPI prediction | No | Yes |
| human-ppi-pinui | Human protein-protein interaction dataset from PiNUI. | SLC | PPI prediction | No | Yes |
| yeast-ppi-pinui | Yeast protein-protein interaction dataset from PiNUI. | SLC | PPI prediction | No | Yes |
| peptide-HLA-MHC-affinity | Dataset for predicting peptide binding affinity to HLA/MHC complexes. | SLC | Binding affinity prediction | No | Yes |
| shs27-ppi-raw | Raw SHS27k with single-label labels. | SLC | PPI type prediction | No | Yes |
| shs148-ppi-raw | Raw SHS148k with single-label labels. | SLC | PPI type prediction | No | Yes |
| shs27-ppi-random | SHS27k | MLC | PPI prediction | No | Yes |
| shs148-ppi-random | SHS148k CD-Hit 40%, multi-label labels, randomized data splits. | MLC | PPI type prediction | No | Yes |
| shs27-ppi-dfs | SHS27k CD-Hit 40%, multi-label labels, data splits via depth first search. | MLC | PPI type prediction | No | Yes |
| shs148-ppi-dfs | SHS148k CD-Hit 40%, multi-label labels, data splits via depth first search. | MLC | PPI type prediction | No | Yes |
| shs27-ppi-bfs | SHS27k CD-Hit 40%, multi-label labels, data splits via breadth first search. | MLC | PPI type prediction | No | Yes |
| shs148-ppi-bfs | SHS148k CD-Hit 40%, multi-label labels, data splits via breadth first search. | MLC | PPI type prediction | No | Yes |
| string-ppi-random | STRING CD-Hit 40%, multi-label labels, randomized data splits. | MLC | PPI type prediction | No | Yes |
| string-ppi-dfs | STRING CD-Hit 40%, multi-label labels, data splits via depth first search. | MLC | PPI type prediction | No | Yes |
| string-ppi-bfs | STRING CD-Hit 40%, multi-label labels, data splits via breadth first search. | MLC | PPI type prediction | No | Yes |
| ppi-mutation-effect | Compares wild-type, mutated, and target sequences to determine whether the PPI is stronger. | SLC | PPI effect prediction | No | Yes |
| PPA-ppi | Protein-Protein Affinity dataset from Bindwell. | R | Protein-protein affinity prediction | No | Yes |
| foldseek-fold | Dataset for protein fold classification using Foldseek. | MCC | Protein structure prediction | No | No |
| foldseek-inverse | Inverse protein fold prediction dataset. | MCC | Protein structure prediction | No | No |
| ec-active | Dataset for predicting active enzyme classes. | MCC | Enzyme function prediction | No | No |
| taxon_domain | Taxonomic classification at domain level. | MCC | Taxonomic prediction | No | No |
| taxon_kingdom | Taxonomic classification at kingdom level. | MCC | Taxonomic prediction | No | No |
| taxon_phylum | Taxonomic classification at phylum level. | MCC | Taxonomic prediction | No | No |
| taxon_class | Taxonomic classification at class level. | MCC | Taxonomic prediction | No | No |
| taxon_order | Taxonomic classification at order level. | MCC | Taxonomic prediction | No | No |
| taxon_family | Taxonomic classification at family level. | MCC | Taxonomic prediction | No | No |
| taxon_genus | Taxonomic classification at genus level. | MCC | Taxonomic prediction | No | No |
| taxon_species | Taxonomic classification at species level. | MCC | Taxonomic prediction | No | No |
| diff_phylogeny | Differential phylogeny dataset. | Various | Phylogeny prediction | No | No |
| plddt | AlphaFold pLDDT confidence score prediction. | TR | Confidence prediction | Yes | No |
| realness | Protein realness dataset. | BC | Authenticity prediction | No | No |
| million_full | Large-scale enzyme variant dataset, from the Millionfull preprint (October 2025). | R | Protein fitness prediction | No | No |
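
The difference between the single-label and multi-label task types in the legend comes down to how targets are represented. The sketch below shows one common convention (an integer class index versus a multi-hot vector). The class names and helper functions are hypothetical and are not taken from Protify or its datasets.

```python
from typing import List, Sequence


def encode_single_label(label: str, classes: Sequence[str]) -> int:
    """Single-label target: exactly one class, stored as its integer index."""
    return list(classes).index(label)


def encode_multi_label(labels: Sequence[str], classes: Sequence[str]) -> List[int]:
    """Multi-label target: any subset of classes, stored as a multi-hot vector."""
    present = set(labels)
    return [1 if name in present else 0 for name in classes]


# Hypothetical class names, chosen only for illustration.
classes = ["nucleus", "cytoplasm", "membrane"]

single_target = encode_single_label("cytoplasm", classes)
multi_target = encode_multi_label(["nucleus", "membrane"], classes)
print(single_target, multi_target)
```

Binary classification is the two-class special case of the first form, while regression targets are plain real numbers and need no encoding.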

For more details about supported models and datasets, including programmatic access and command-line utilities, see the [Resource Listing Documentation](docs/resource_listing.md).

### Current Key Features

- **Multiple interfaces**: Run experiments via an intuitive GUI, CLI, or prepared YAML files
- **Efficient embeddings**: Leverage fast and efficient embeddings from ESM2 and ESMC via [FastPLMs](https://github.com/Synthyra/FastPLMs)
  - Coming soon: Additional protein, SMILES, SELFIES, codon, and nucleotide language models
- **Flexible model probing**: Use efficient MLPs for sequence-wise tasks or transformer probes for token-wise tasks
  - Coming soon: Full model fine-tuning, hybrid probing, and LoRA
- **Automated model selection**: Find optimal scikit-learn models for your data with LazyPredict, enhanced by automatic hyperparameter optimization
  - Coming soon: GPU acceleration
- **Hyperparameter optimization**: Integrated Weights & Biases sweeps that conducts a hyperparameter se [README truncated...]
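
The sequence-wise versus token-wise distinction in the probing feature can be illustrated with a small sketch. A token-wise probe consumes one embedding per residue, whereas a sequence-wise probe needs a single vector per sequence. Masked mean pooling is one common way to get that vector. Whether Protify pools this way is an assumption, and `mean_pool` is a hypothetical helper written for this example.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average per-token vectors into one sequence vector, skipping masked (padding) tokens."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy input: three token vectors of dimension 2; the last token is padding.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

sequence_embedding = mean_pool(token_embeddings, attention_mask)
print(sequence_embedding)
```

For a token-wise task such as secondary structure prediction, the unpooled `token_embeddings` would be passed to the probe directly, one prediction per residue.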
