2023 Oct 30;4(12):100865. doi: 10.1016/j.patter.2023.100865

© 2023 The Authors

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Similarity metrics between the three canonicalized representations of each query molecule

String similarity, measured with gestalt pattern matching, shows that different canonicalizations produce markedly different strings. Shared token ratio and token length ratio indicate that these strings were tokenized into different inputs to the CLM. Feature similarity, computed as the cosine similarity between ChemBERTa vector embeddings, shows that the model interpreted the token vectors of the differently canonicalized queries differently, resulting in increased spread across feature space. For each metric, deviation from 1.0 represents divergence between canonicalized queries.
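As a rough illustration of two of these metrics, the sketch below computes a gestalt-style string similarity and a cosine similarity in plain Python. It makes several assumptions: the caption does not say which implementation the authors used, so Python's `difflib.SequenceMatcher` (an algorithm in the Ratcliff/Obershelp gestalt pattern matching family) stands in for the string metric; the function names are hypothetical; and the short vectors are toy placeholders, not actual ChemBERTa embeddings.

```python
import math
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Gestalt-style similarity in [0, 1]; 1.0 means identical strings."""
    # autojunk=False keeps every character in play, regardless of string length.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Two valid SMILES strings for the same molecule (ethanol), written in
    # different atom orders, as different canonicalizers might emit them.
    smiles_a = "CCO"
    smiles_b = "OCC"
    print("string similarity:", string_similarity(smiles_a, smiles_b))

    # Toy stand-ins for embedding vectors (assumption: not real model output).
    emb_a = [0.2, 0.7, 0.1]
    emb_b = [0.3, 0.6, 0.2]
    print("feature similarity:", cosine_similarity(emb_a, emb_b))
```

Both functions return 1.0 for identical inputs, so values below 1.0 correspond to the divergence the caption describes.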