A Dataset for Prediction of Chemical Compound Structures from PubChem (PCCS-PubChem)

Overview

PCCS-PubChem is a dataset for predicting chemical compound structures from chemical compound names. It includes train/dev/test data files and an evaluation script. Each data file contains chemical compound names with their corresponding structures, extracted from PubChem. PCCS-PubChem is used in our AACL 2020 paper.

You can download this dataset below:

Background

Databases of chemical substances are necessary for developing new materials and drugs, and for synthesizing products from new materials. However, only a fraction of the countless chemical compounds is registered in chemical databases. Moreover, databases for the chemical domain are maintained manually, which takes considerable time and cost. Therefore, automation of database creation for chemical compounds is required.
When automatically creating a chemical database, it is necessary to merge the knowledge referred to by the aliases of each chemical compound (e.g., 2-acetyloxybenzoic acid is an alias of aspirin). Therefore, to automate chemical database creation, we have to identify the entity (e.g., the chemical compound structure) that each chemical compound name refers to.

Task

The task is to convert each chemical compound name to its SMILES string.
For example, the SMILES of "aspirin" is "CC(=O)OC1=CC=CC=C1C(=O)O".
A chemical compound can be represented by several different SMILES strings. Therefore, we use the canonical SMILES of the structure of each chemical compound name, because it gives a one-to-one correspondence between chemical structures and SMILES strings.
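
As a rough illustration of why canonicalization matters, the sketch below compares two different SMILES strings for aspirin by converting both to a canonical form. It assumes RDKit is installed. The helper names `canonical_smiles` and `same_structure` are hypothetical and are not part of the dataset's evaluation script. RDKit's canonical form may also differ from the canonical SMILES distributed by PubChem, so this is a sketch under those assumptions, not the procedure used to build the data.

```python
from typing import Optional


def canonical_smiles(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed.

    Hypothetical helper; RDKit is imported lazily so the module loads without it.
    """
    from rdkit import Chem  # assumption: RDKit is available in the environment

    mol = Chem.MolFromSmiles(smiles)  # returns None for unparsable input
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # canonical output by default


def same_structure(smiles_a: str, smiles_b: str) -> bool:
    """True if both strings parse and share one canonical SMILES."""
    canon_a = canonical_smiles(smiles_a)
    canon_b = canonical_smiles(smiles_b)
    return canon_a is not None and canon_a == canon_b


if __name__ == "__main__":
    try:
        # Kekule form from the dataset sample vs. an aromatic spelling of aspirin.
        print(same_structure("CC(=O)OC1=CC=CC=C1C(=O)O", "O=C(O)c1ccccc1OC(C)=O"))
    except ImportError:
        print("RDKit is not installed; install it to run this comparison.")
```

Comparing canonical forms rather than raw strings avoids counting a prediction as wrong only because it spells the same structure differently.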

Data

Files in the data set

Format of the data sets

Sample of training data

CID Name  SMILES  InChI
2244  aspirin CC(=O)OC1=CC=CC=C1C(=O)O  InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)

Sample of dev/test data

CID Name  SMILES  InChI Name Type
2244  aspirin CC(=O)OC1=CC=CC=C1C(=O)O  InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)  synonym
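
The following is a minimal sketch of how records like the samples above might be loaded. It assumes the data files are tab-separated with a header row, which is a guess based on the samples and should be checked against the actual files. The function name `read_records` and the inline `SAMPLE` string are hypothetical and used only for illustration.

```python
import csv
import io
from typing import Dict, List, TextIO

# Assumption: tab-separated columns with a header row, mirroring the dev/test sample.
SAMPLE = (
    "CID\tName\tSMILES\tInChI\tName Type\n"
    "2244\taspirin\tCC(=O)OC1=CC=CC=C1C(=O)O\t"
    "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)\t"
    "synonym\n"
)


def read_records(handle: TextIO) -> List[Dict[str, str]]:
    """Read all rows into dictionaries keyed by the header names."""
    # QUOTE_NONE keeps quote characters in compound names as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


if __name__ == "__main__":
    for record in read_records(io.StringIO(SAMPLE)):
        print(record["Name"], record["SMILES"])
```

For a real file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`.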

Evaluation Metrics

Results on the PCCS-PubChem dataset can be evaluated with the following metrics:

$$\mathrm{recall} = \frac{\mathrm{CORRECT}}{\mathrm{ENTIRE}}$$

Recall is the ratio of correctly predicted structures to all chemical compound names, and represents the ability to predict correct structures from compound names. Here, CORRECT is the number of correct predictions, i.e., the number of matches between the correct SMILES and the generated SMILES. ENTIRE is the total number of chemical compound names in the evaluation data.

$$\mathrm{precision} = \frac{\mathrm{CORRECT}}{\mathrm{ENTIRE} - \mathrm{MISS} - \mathrm{ERROR}}$$

Precision is the ratio of correctly predicted structures to the predictions that are valid SMILES strings (i.e., strings that can be converted to some chemical compound structure). It represents how often the predicted structure is correct when some chemical compound structure is predicted from a chemical compound name. Here, MISS is the number of chemical compound names that could not be parsed (this occurs only for rule-based methods), and ERROR is the number of outputs that are syntactically incorrect, in other words, outputs that fail to convert to a Mol object with RDKit.

NOTE: In general, ERROR (the number of predicted strings that do not follow the SMILES grammar) should not be subtracted from ENTIRE in the denominator of precision. In this case, however, ungrammatical SMILES strings can be detected automatically with RDKit (or another SMILES parser). We therefore adopt this definition of precision from the standpoint of practical use.

In our evaluation script, the ERROR value is calculated automatically with RDKit. Please pay attention to library versions when you evaluate in your own environment. (We recommend evaluating in the environment built from this file with conda.)
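
To make the outcome categories concrete, here is a hedged sketch of how a single prediction could be sorted into CORRECT, MISTAKE, ERROR, or MISS. It is not the released evaluation script. The function `classify`, the convention that a missing prediction is `None` or an empty string, and the comparison by canonical form are all assumptions. The canonicalizer is passed in as a callable that returns `None` for unparsable strings, so an RDKit-based function can be plugged in; the demo uses a toy lookup table instead so that it runs without RDKit.

```python
from typing import Callable, Optional

Canonicalizer = Callable[[str], Optional[str]]


def classify(pred: Optional[str], gold: str, canonicalize: Canonicalizer) -> str:
    """Assign one prediction to an outcome category (hypothetical helper)."""
    if not pred:
        return "MISS"  # the system produced no output for this name
    pred_canon = canonicalize(pred)
    if pred_canon is None:
        return "ERROR"  # output is not a parsable SMILES string
    if pred_canon == canonicalize(gold):
        return "CORRECT"  # same structure as the reference
    return "MISTAKE"  # valid SMILES, but a different structure


if __name__ == "__main__":
    # Toy stand-in for a real SMILES parser: maps known strings to one
    # canonical spelling and returns None for anything else.
    toy_table = {"CCO": "CCO", "OCC": "CCO", "CC": "CC"}
    toy_canonicalize = toy_table.get

    for pred in ["OCC", "CC", "C(C", None]:
        print(repr(pred), classify(pred, "CCO", toy_canonicalize))
```

Passing the canonicalizer as an argument keeps the category logic independent of any particular parser or library version.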

$$\text{F-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

The F-measure is calculated from the precision and recall defined above.

$$\mathrm{validity} = \frac{\mathrm{CORRECT} + \mathrm{MISTAKE}}{\mathrm{ENTIRE}}$$

Validity is the ratio of grammatically correct predicted SMILES strings to all chemical compound names, and measures the ability to predict grammatically correct SMILES from chemical compound names. In other words, the validity value represents the size of the portion of the chemical compound name space over which a prediction system maintains its precision value (of course, the size represented by the validity value depends on the test set).
Here, MISTAKE is the number of outputs that are grammatically correct SMILES strings but do not match the correct SMILES.
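
The four metrics can be computed directly from the outcome counts. The sketch below is an illustration only, not the released evaluation script. It assumes that every name in the evaluation data falls into exactly one of the four categories, so that ENTIRE = CORRECT + MISTAKE + ERROR + MISS. The function name `compute_metrics`, the zero fallbacks for empty denominators, and the example counts are hypothetical.

```python
from typing import Dict


def compute_metrics(correct: int, mistake: int, error: int, miss: int) -> Dict[str, float]:
    """Compute recall, precision, F-measure, and validity from outcome counts."""
    # Assumption: the four categories partition the evaluation data.
    entire = correct + mistake + error + miss
    if entire == 0:
        raise ValueError("evaluation data is empty")

    recall = correct / entire
    # ENTIRE - MISS - ERROR: outputs that are valid SMILES strings.
    valid_outputs = entire - miss - error
    precision = correct / valid_outputs if valid_outputs else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    validity = (correct + mistake) / entire

    return {
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "validity": validity,
    }


if __name__ == "__main__":
    # Made-up counts for illustration; these are not results from the paper.
    for name, value in compute_metrics(correct=60, mistake=20, error=15, miss=5).items():
        print(f"{name}: {value:.4f}")
```

With these made-up counts, ENTIRE is 100, so recall is 60/100, precision is 60/80, and validity is 80/100.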

Dataset statistics

Data file       Number of records
dev             19,610
test            196,660
train           5,000,000
dev.filtered    1,113
test.filtered   11,194

Citation

@inproceedings{omote-etal-2020-transformer,
    title = "Transformer-based Approach for Predicting Chemical Compound Structures",
    author = "Omote, Yutaro  and
      Matsushita, Kyoumoto  and
      Iwakura, Tomoya  and
      Tamura, Akihiro  and
      Ninomiya, Takashi",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.aacl-main.19",
    pages = "154--162",
    abstract = "By predicting chemical compound structures from their names, we can better comprehend chemical compounds written in text and identify the same chemical compound given different notations for database creation. Previous methods have predicted the chemical compound structures from their names and represented them by Simplified Molecular Input Line Entry System (SMILES) strings. However, these methods mainly apply handcrafted rules, and cannot predict the structures of chemical compound names not covered by the rules. Instead of handcrafted rules, we propose Transformer-based models that predict SMILES strings from chemical compound names. We improve the conventional Transformer-based model by introducing two features: (1) a loss function that constrains the number of atoms of each element in the structure, and (2) a multi-task learning approach that predicts both SMILES strings and InChI strings (another string representation of chemical compound structures). In evaluation experiments, our methods achieved higher F-measures than previous rule-based approaches (Open Parser for Systematic IUPAC Nomenclature and two commercially used products), and the conventional Transformer-based model. We release the dataset used in this paper as a benchmark for the future research.",
}

Change Log

2021-01-17: updated the dataset file
2020-12-07: site opened