PCCS-PubChem is a dataset for predicting chemical compound structures from chemical compound names. It includes train/dev/test data files and an evaluation script. Each data file contains chemical compound names with their corresponding structures, extracted from PubChem. PCCS-PubChem is used in our AACL 2020 paper.
You can download this dataset below:
Databases of chemical substances are necessary for developing new materials and drugs, and for synthesizing products from new materials. However, only a fraction of the countless chemical compounds is registered in chemical databases. Moreover, databases in the chemical domain are maintained manually, which takes considerable time and cost. Automating the creation of chemical compound databases is therefore required.
When creating a chemical database automatically, it is necessary to merge the knowledge referred to by the different aliases of each chemical compound (e.g., 2-acetyloxybenzoic acid is an alias of aspirin). Therefore, to automate chemical database creation, we have to identify the entity (e.g., the chemical compound structure) that each chemical compound name refers to.
The task is to convert each chemical compound name to its SMILES string.
For example, the SMILES of "aspirin" is "CC(=O)OC1=CC=CC=C1C(=O)O".
A single chemical compound can be represented by several different SMILES strings. We therefore use the canonical SMILES of the structure of each chemical compound name, because it uniquely determines the correspondence between chemical structures and SMILES strings.
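To illustrate why a canonical form is needed, the sketch below uses RDKit to show that two different SMILES strings for aspirin reduce to the same canonical string. `canonical_smiles` is a hypothetical helper written for this illustration and is not part of the dataset's scripts. RDKit's canonical form is RDKit's own, so it may differ from the strings stored in the data files.

```python
def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed.

    Hypothetical helper for illustration; not taken from evaluate.py.
    """
    # Imported lazily so that this snippet can be loaded without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)  # returns None for unparsable input
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # canonical output is RDKit's default


if __name__ == "__main__":
    kekule = "CC(=O)OC1=CC=CC=C1C(=O)O"   # form shown in this document
    aromatic = "O=C(C)Oc1ccccc1C(=O)O"    # same molecule, written differently
    print(canonical_smiles(kekule) == canonical_smiles(aromatic))
```

Comparing canonical strings rather than raw strings is what makes two spellings of the same structure count as equal.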
train.tsv, dev.tsv, test.tsv
dev.filtered.tsv, test.filtered.tsv
conda env create -f environment.yml
python3 evaluate.py -p [PRED-FILE] -r [REF-FILE]
To generate test.smiles, run the following in the root directory of this dataset:
./split_tsv.sh test.tsv test .
To see how to use the evaluation script, run:
python3 path/to/script/evaluate.py --help
To see the usage of split_tsv.sh, run:
./split_tsv.sh

CID | Name | SMILES | InChI |
---|---|---|---|
2244 | aspirin | CC(=O)OC1=CC=CC=C1C(=O)O | InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12) |

CID | Name | SMILES | InChI | Name Type |
---|---|---|---|---|
2244 | aspirin | CC(=O)OC1=CC=CC=C1C(=O)O | InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12) | synonym |

The Name Type column is either synonym or IUPAC.
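Records in the column layouts shown above could be read with a few lines of Python. This is a minimal sketch that assumes the files are tab-separated, have no header row, and use the column order shown. None of these points is confirmed here, so check them against the actual files. `parse_records` and the lowercase field names are hypothetical.

```python
import csv

# Assumed column order, following the example records in this document.
FIELDS = ("cid", "name", "smiles", "inchi", "name_type")


def parse_records(lines):
    """Yield one dict per tab-separated line.

    Four-column lines simply produce dicts without the "name_type" key,
    because zip() stops at the shorter sequence.
    """
    # QUOTE_NONE: chemical names may contain quote characters (primes),
    # which must not be interpreted as CSV quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip empty lines
            yield dict(zip(FIELDS, row))


if __name__ == "__main__":
    sample = ["2244\taspirin\tCC(=O)OC1=CC=CC=C1C(=O)O\tInChI=1S/C9H8O4\tsynonym\n"]
    for record in parse_records(sample):
        print(record["name"], record["smiles"], record.get("name_type"))
```

Because `csv.reader` accepts any iterable of strings, the same function can be given an open file handle, for example `open("dev.tsv", newline="", encoding="utf-8")`.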
Results on the PCCS-PubChem dataset can be evaluated with the following metrics:
Recall is the ratio of correctly predicted structures to all chemical compound names, and represents the ability to predict correct structures from compound names. Here, CORRECT is the number of correct predictions, i.e., the number of matches between the reference SMILES and the generated SMILES. ENTIRE is the total number of chemical compound names in the evaluation data.
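Written as a formula, using the two counts just defined, the recall described above reads:

```latex
\mathrm{Recall} = \frac{\mathrm{CORRECT}}{\mathrm{ENTIRE}}
```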
Precision is the ratio of correctly predicted structures to the predictions that are valid SMILES (i.e., strings that can be converted to some chemical compound structure), and represents how often the predicted structure is correct when a structure is predicted from a chemical compound name at all. Here, MISS is the number of chemical compound names that failed to parse (this occurs only for rule-based methods), and ERROR is the number of outputs that are syntactically incorrect, in other words outputs that RDKit could not convert to a Mol object.
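Read from the description above, where both MISS and ERROR are removed from the denominator, precision can be written as follows. This is a reconstruction from the prose, not a formula copied from the evaluation script:

```latex
\mathrm{Precision} = \frac{\mathrm{CORRECT}}{\mathrm{ENTIRE} - \mathrm{MISS} - \mathrm{ERROR}}
```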
NOTE: In general, ERROR (the number of predicted strings that do not follow the SMILES grammar) should not be subtracted from ENTIRE in the denominator of precision. In this case, however, ungrammatical SMILES can be detected automatically with RDKit (or another SMILES parser). We therefore adopt this definition of precision from the standpoint of practical use.
In our evaluation script, the ERROR value is calculated automatically with RDKit. Please pay attention to library versions when you evaluate in your own environment. (We recommend evaluating in the conda environment built from environment.yml.)
The F-measure can also be calculated from the precision and recall defined above.
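As a worked example, the sketch below computes recall, precision and F-measure from the raw counts. It assumes the balanced F-measure (the harmonic mean of precision and recall) and a precision denominator of ENTIRE minus MISS minus ERROR, as the text describes. The function name and return format are hypothetical and do not come from evaluate.py.

```python
def precision_recall_f(correct, entire, miss=0, error=0):
    """Compute recall, precision and balanced F-measure from counts.

    Assumptions (not confirmed by the evaluation script):
      recall    = CORRECT / ENTIRE
      precision = CORRECT / (ENTIRE - MISS - ERROR)
      F         = harmonic mean of precision and recall
    """
    if entire <= 0:
        raise ValueError("entire must be positive")
    recall = correct / entire
    predicted = entire - miss - error  # outputs that are valid SMILES
    precision = correct / predicted if predicted > 0 else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return {"recall": recall, "precision": precision, "f_measure": f_measure}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(precision_recall_f(correct=60, entire=100, miss=10, error=10))
```

With these made-up counts, recall is 60/100 and precision is 60/80, which shows how excluding MISS and ERROR raises precision above recall.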
Validity is the ratio of predictions that are grammatically valid SMILES, and measures the ability to produce grammatically correct SMILES from chemical compound names. In other words, validity indicates how much of the chemical compound name space a prediction system covers while maintaining its precision (of course, this coverage depends on the test set).
Here, MISTAKE is the number of outputs that are grammatically correct SMILES strings but do not match the reference SMILES strings.
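Read together with the definition of MISTAKE that follows, validity can plausibly be written as the share of names that receive a grammatically valid SMILES, whether or not it matches the reference. This is a reconstruction from the prose rather than a formula taken from the evaluation script:

```latex
\mathrm{Validity} = \frac{\mathrm{CORRECT} + \mathrm{MISTAKE}}{\mathrm{ENTIRE}}
```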
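The four outcome counts used in these metrics can be tallied as in the sketch below. It is an illustration under stated assumptions, not the logic of evaluate.py. An empty prediction is treated as MISS, a prediction the canonicalizer rejects as ERROR, and matching is done on canonicalized strings. `tally_outcomes` is hypothetical, and `canonicalize` is a caller-supplied function that returns a canonical string, or None for unparsable input (for example a thin wrapper around an RDKit parse).

```python
from collections import Counter


def tally_outcomes(predictions, references, canonicalize):
    """Count CORRECT, MISTAKE, ERROR and MISS over paired predictions/references."""
    counts = Counter(CORRECT=0, MISTAKE=0, ERROR=0, MISS=0)
    for pred, ref in zip(predictions, references):
        if not pred:  # assumption: no output at all counts as MISS
            counts["MISS"] += 1
            continue
        canon = canonicalize(pred)
        if canon is None:  # not a parsable SMILES string
            counts["ERROR"] += 1
        elif canon == canonicalize(ref):
            counts["CORRECT"] += 1
        else:  # valid SMILES, but a different structure
            counts["MISTAKE"] += 1
    counts["ENTIRE"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    # Toy canonicalizer for demonstration only: rejects "bad", uppercases the rest.
    toy = lambda s: None if s == "bad" else s.upper()
    c = tally_outcomes(["cco", "ccc", "bad", ""], ["CCO"] * 4, toy)
    print(dict(c))
    print("validity:", (c["CORRECT"] + c["MISTAKE"]) / c["ENTIRE"])
```

If every name falls into exactly one of the four categories, CORRECT + MISTAKE equals ENTIRE - MISS - ERROR, so under these assumptions recall equals precision multiplied by validity.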
split | number of records |
---|---|
dev | 19,610 |
test | 196,660 |
train | 5,000,000 |
dev.filtered | 1,113 |
test.filtered | 11,194 |
@inproceedings{omote-etal-2020-transformer,
title = "Transformer-based Approach for Predicting Chemical Compound Structures",
author = "Omote, Yutaro and
Matsushita, Kyoumoto and
Iwakura, Tomoya and
Tamura, Akihiro and
Ninomiya, Takashi",
booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
month = dec,
year = "2020",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.aacl-main.19",
pages = "154--162",
abstract = "By predicting chemical compound structures from their names, we can better comprehend chemical compounds written in text and identify the same chemical compound given different notations for database creation. Previous methods have predicted the chemical compound structures from their names and represented them by Simplified Molecular Input Line Entry System (SMILES) strings. However, these methods mainly apply handcrafted rules, and cannot predict the structures of chemical compound names not covered by the rules. Instead of handcrafted rules, we propose Transformer-based models that predict SMILES strings from chemical compound names. We improve the conventional Transformer-based model by introducing two features: (1) a loss function that constrains the number of atoms of each element in the structure, and (2) a multi-task learning approach that predicts both SMILES strings and InChI strings (another string representation of chemical compound structures). In evaluation experiments, our methods achieved higher F-measures than previous rule-based approaches (Open Parser for Systematic IUPAC Nomenclature and two commercially used products), and the conventional Transformer-based model. We release the dataset used in this paper as a benchmark for the future research.",
}
2021-01-17: updated the dataset file
2020-12-07: site open