A Dataset for Prediction of Chemical Compound Structures from PubChem (PCCS-PubChem)

Overview

PCCS-PubChem is a dataset for predicting chemical compound structures from chemical compound names. It includes train/dev/test data files and an evaluation script. Each data file contains chemical compound names with their corresponding structures extracted from PubChem. PCCS-PubChem is used in our AACL 2020 paper.

You can download this dataset below:

pccs-pubchem.tar.gz

Background

Databases of chemical substances are necessary for developing new materials and drugs, and for synthesizing products from new materials. However, only a fraction of the countless chemical compounds are registered in chemical databases. Moreover, chemical databases are maintained manually, which is time-consuming and costly. Therefore, automating the creation of chemical compound databases is required.
When creating a chemical database automatically, knowledge recorded under different aliases of the same chemical compound must be merged (e.g., 2-acetyloxybenzoic acid is an alias of aspirin). Therefore, we have to identify the entity (e.g., the chemical compound structure) that each chemical compound name refers to in order to automate chemical database creation.

Task

The task is to convert each chemical compound name to its SMILES string.
For example, a SMILES string for "aspirin" is "CC(=O)OC1=CC=CC=C1C(=O)O".
A chemical compound can be represented by several different SMILES strings. Therefore, we use the canonical SMILES of the structure of each chemical compound name, because canonicalization uniquely determines the correspondence between chemical structures and SMILES strings.

Data

Files in the data set

train.tsv, dev.tsv, test.tsv
- These files correspond to the training, development, and test sets, respectively.
- train.tsv has 4 columns, and dev.tsv/test.tsv have 5 columns, delimited by tabs.
  - The columns are, in order: PubChem Compound ID (CID), Chemical Compound Name (Name), SMILES, InChI, Name Type
    - train.tsv doesn't have the Name Type column
  - Please see the section "Format of the data sets" below for more information.
dev.filtered.tsv, test.filtered.tsv
- Each [dev/test].filtered.tsv file is created by extracting only the synonym names from [dev/test].tsv and removing database IDs, abbreviations, and so on.
environment.yml
- The conda environment file to run the evaluation script
- To create a virtual environment for the evaluation script, please run conda env create -f environment.yml
  - More detail
evaluate.py
- A Python 3 script to evaluate the outputs of your system
- This script compares SMILES as strings.
- This script requires some third-party libraries.
  - Please create a virtual environment for this script from environment.yml with conda.
    - How to create
- Usage:
  - python3 evaluate.py -p [PRED-FILE] -r [REF-FILE]
    - PRED-FILE: a file consisting of SMILES strings that a system predicted for the evaluation dataset, one per line
    - REF-FILE: a file consisting of correct SMILES strings for the evaluation dataset, one per line
      - You can generate this REF-FILE with split_tsv.sh, which is included in this dataset.
        
        To generate the REF-FILE as test.smiles, run ./split_tsv.sh test.tsv test . in the root directory of this dataset.
- Please run python3 path/to/script/evaluate.py --help to see how to use this script.
split_tsv.sh
- Bash script to split a tsv format dataset file into columns
- Usage
  - split_tsv.sh [TSV] [PREFIX] [DIR]
    - TSV : a tsv format dataset file (e.g. train.tsv)
    - PREFIX : prefix string of the output files
    - DIR : output directory
- To generate the REF-FILE as test.smiles, run ./split_tsv.sh test.tsv test . in the root directory of this dataset.
- Please run ./split_tsv.sh to see the usage.
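
As a rough illustration of what producing a REF-FILE involves, the following Python sketch writes the SMILES column of a dataset file to PREFIX.smiles. It is not the bundled split_tsv.sh. The function name write_smiles_column, the column index, and the skipping of a possible header row are assumptions made for illustration.

```python
import csv
from pathlib import Path

# Assumed column order: CID, Name, SMILES, InChI[, Name Type]
SMILES_COLUMN = 2


def write_smiles_column(tsv_path, prefix, out_dir):
    """Write the SMILES column of a TSV file to <out_dir>/<prefix>.smiles."""
    out_path = Path(out_dir) / f"{prefix}.smiles"
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        # QUOTE_NONE: compound names may contain quote characters,
        # which must not be interpreted as CSV quoting.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip empty lines and a header row, if one is present (assumption).
            if not row or row[0] == "CID":
                continue
            dst.write(row[SMILES_COLUMN] + "\n")
    return out_path


# Example (hypothetical paths):
#   write_smiles_column("test.tsv", "test", ".")  # creates ./test.smiles
```

The resulting file has one SMILES string per line, which is the shape evaluate.py expects for its REF-FILE argument.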

Format of the data sets

CID: PubChem Compound ID (Natural Number)
- The identifier of a chemical compound used in PubChem
Name: (String)
- A chemical compound name extracted from the Synonyms and IUPAC Name sections of PubChem
  - An IUPAC Name is a name based on IUPAC nomenclature.
  - Synonyms are aliases of the chemical compound.
    - Example: Synonyms of CID:2244 (aspirin)
SMILES (Simplified Molecular Input Line Entry System): (String)
- A SMILES corresponding to a CID extracted from PubChem.
  - Explanation of SMILES
- This SMILES string has been canonicalized with RDKit.
  - A canonicalized SMILES string is unique for each chemical compound.
  - Therefore, whether two chemical compounds have the same structure can be judged by comparing their canonicalized SMILES strings as plain strings.
InChI (International Chemical Identifier): (String)
- An InChI corresponding to a CID extracted from PubChem.
  - Explanation of InChI
Name Type: (String: synonym or IUPAC)
- A label indicating the PubChem section from which a name was extracted: the Synonyms section or the IUPAC Name section.
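
The idea of comparing canonicalized SMILES strings, described for the SMILES column above, can be sketched in Python with RDKit as follows. This is a minimal sketch only: the helper names canonicalize and same_structure are hypothetical, and the exact RDKit version and options used to build the dataset are not stated here, so the strings this sketch produces may differ in form from those in the data files.

```python
from typing import Optional


def canonicalize(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    from rdkit import Chem  # imported lazily: RDKit is a third-party dependency

    mol = Chem.MolFromSmiles(smiles)  # returns None for unparsable SMILES
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # canonical output by default


def same_structure(smiles_a: str, smiles_b: str) -> bool:
    """Compare two SMILES strings after canonicalizing both with the same settings."""
    a = canonicalize(smiles_a)
    b = canonicalize(smiles_b)
    return a is not None and a == b
```

For example, same_structure("OCC", "CCO") compares two different spellings of ethanol. Both strings must be canonicalized with the same tool and settings for the string comparison to be meaningful.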

Sample of training data

CID Name  SMILES  InChI
2244  aspirin CC(=O)OC1=CC=CC=C1C(=O)O  InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)

Sample of dev/test data

CID Name  SMILES  InChI Name Type
2244  aspirin CC(=O)OC1=CC=CC=C1C(=O)O  InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)  synonym
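
The row layouts shown in the samples above could be read in Python as sketched below. The Record class and the parse_records function are hypothetical names, and whether the distributed files contain a header row is an assumption: this sketch simply skips a row whose first field is "CID".

```python
import csv
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Record:
    cid: int
    name: str
    smiles: str
    inchi: str
    name_type: Optional[str] = None  # absent in train.tsv


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per tab-delimited row (4 or 5 columns)."""
    # QUOTE_NONE keeps quote characters in compound names as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0] == "CID":  # skip blank lines and a header, if any
            continue
        if len(row) not in (4, 5):
            raise ValueError(f"expected 4 or 5 columns, got {len(row)}")
        name_type = row[4] if len(row) == 5 else None
        yield Record(int(row[0]), row[1], row[2], row[3], name_type)


# Example (hypothetical path):
#   with open("dev.tsv", newline="", encoding="utf-8") as f:
#       records = list(parse_records(f))
```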

Evaluation Metrics

The results on the PCCS-PubChem dataset can be evaluated with the following metrics:

$\begin{aligned} {\rm recall } &= \frac{{\rm CORRECT}}{{\rm ENTIRE}}, \\ \end{aligned}$

Recall is the ratio of correctly predicted structures to all compound names, and represents the ability to predict correct structures from compound names. Here, CORRECT is the number of predictions whose generated SMILES matches the correct SMILES, and ENTIRE is the total number of chemical compound names in the evaluation data.

$\begin{aligned} {\rm precision } &= \frac{{\rm CORRECT}}{{\rm ENTIRE - MISS - ERROR}}, \\ \end{aligned}$

Precision is the ratio of correctly predicted structures to predicted strings that are valid SMILES (i.e., strings that can be converted to some chemical compound structure). It represents how often the predicted structure is correct when some chemical compound structure is predicted from a chemical compound name. Here, MISS is the number of chemical compound names that the system failed to parse (this occurs only for rule-based methods), and ERROR is the number of outputs that are syntactically incorrect, in other words, outputs that could not be converted to a Mol object with RDKit.

NOTE: In general, ERROR (the number of predicted strings that don't follow the SMILES grammar) should not be subtracted from ENTIRE in the denominator of precision. However, in this case, ungrammatical SMILES can be detected automatically with RDKit (or another SMILES parser). Therefore, we adopt this definition of precision from the standpoint of practical use.

In our evaluation script, the ERROR value is automatically calculated with RDKit. Please be careful about library versions when you evaluate in your own environment. (We recommend evaluating in the conda environment built from environment.yml.)

$\begin{aligned} \text{F-measure} &= \frac{2 \times {\rm precision} \times {\rm recall}}{{\rm precision} + {\rm recall}}, \end{aligned}$

F-measure can also be calculated from the precision and recall shown above.

$\begin{aligned} {\rm validity } &= \frac{{\rm CORRECT + MISTAKE}}{{\rm ENTIRE}}, \\ \end{aligned}$

Validity is the ratio of grammatically correct predicted SMILES to all compound names, and measures the ability to predict grammatically correct SMILES from chemical compound names. In other words, the validity value represents the size of the region of the chemical compound name space over which a prediction system can maintain its precision value (of course, the size represented by the validity value depends on the test set).
Here, MISTAKE is the number of outputs that are grammatically correct SMILES but do not match the correct SMILES.
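
The four metrics above can be computed from the counts as in the following Python sketch. It is not the bundled evaluate.py. The function names score and rdkit_is_valid are hypothetical, and treating an empty prediction line as a MISS is an assumption made for illustration. The validity check is passed in as a parameter, with an RDKit-based default that mirrors the "failed to convert to a Mol object" criterion.

```python
from typing import Callable, Dict, Sequence


def rdkit_is_valid(smiles: str) -> bool:
    """Return True if RDKit can parse the string into a Mol object."""
    from rdkit import Chem  # imported lazily: RDKit is a third-party dependency

    return Chem.MolFromSmiles(smiles) is not None


def score(preds: Sequence[str], refs: Sequence[str],
          is_valid: Callable[[str], bool] = rdkit_is_valid) -> Dict[str, float]:
    """Compute recall, precision, F-measure, and validity from aligned lists."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    entire = len(refs)
    correct = miss = error = mistake = 0
    for pred, ref in zip(preds, refs):
        pred, ref = pred.strip(), ref.strip()
        if not pred:
            miss += 1      # assumption: an empty line means no output (MISS)
        elif pred == ref:
            correct += 1   # exact string match with the reference SMILES
        elif is_valid(pred):
            mistake += 1   # parsable SMILES, but a different string
        else:
            error += 1     # not parsable as SMILES

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    recall = ratio(correct, entire)
    precision = ratio(correct, entire - miss - error)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    validity = ratio(correct + mistake, entire)
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "validity": validity}
```

Because ENTIRE = CORRECT + MISTAKE + MISS + ERROR, the denominator of precision equals CORRECT + MISTAKE, which is also the numerator of validity.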

Dataset statistics

	number of records
dev	19,610
test	196,660
train	5,000,000
dev.filtered	1,113
test.filtered	11,194

Citation

@inproceedings{omote-etal-2020-transformer,
    title = "Transformer-based Approach for Predicting Chemical Compound Structures",
    author = "Omote, Yutaro  and
      Matsushita, Kyoumoto  and
      Iwakura, Tomoya  and
      Tamura, Akihiro  and
      Ninomiya, Takashi",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.aacl-main.19",
    pages = "154--162",
    abstract = "By predicting chemical compound structures from their names, we can better comprehend chemical compounds written in text and identify the same chemical compound given different notations for database creation. Previous methods have predicted the chemical compound structures from their names and represented them by Simplified Molecular Input Line Entry System (SMILES) strings. However, these methods mainly apply handcrafted rules, and cannot predict the structures of chemical compound names not covered by the rules. Instead of handcrafted rules, we propose Transformer-based models that predict SMILES strings from chemical compound names. We improve the conventional Transformer-based model by introducing two features: (1) a loss function that constrains the number of atoms of each element in the structure, and (2) a multi-task learning approach that predicts both SMILES strings and InChI strings (another string representation of chemical compound structures). In evaluation experiments, our methods achieved higher F-measures than previous rule-based approaches (Open Parser for Systematic IUPAC Nomenclature and two commercially used products), and the conventional Transformer-based model. We release the dataset used in this paper as a benchmark for the future research.",
}

Change Log

2021-01-17: updated the dataset file
2020-12-07: site open