Protocol for creating representations of molecular structures using a polymer-specific decoder

Yannik Köster; Julian Kimmig; Stefan Zechel; Ulrich S Schubert

doi:10.1016/j.xpro.2024.103055

2024 May 2;5(2):103055. doi: 10.1016/j.xpro.2024.103055

PMCID: PMC11078690 PMID: 38700976

Summary

To supply chemical structures of polymers for machine learning applications, decoding is necessary. Here, we present a protocol for generating polymer fingerprints (PFPs), which are representations of molecular structures, using a polymer-specific decoder. We outline steps for downloading, installing, and basic application of the software. Moreover, we present procedures for processing and analyzing polymer structure data and the preparation for integration into machine learning methods. On this basis, we explain how artificial neural networks can be utilized to predict polymer properties.

For complete details on the use and execution of this protocol, please refer to Köster et al.¹

Subject areas: Chemistry, Material sciences, Computer sciences

Highlights

•
Use of analysis and visualization tools for polymer science data
•
Train and use artificial neural networks (optional: on your own data)
•
Understand the Polyfingerprint module to decode polymer structures for machine learning

Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.

Before you begin

The protocol below describes how to set up and run a polymer property prediction model, more precisely, one for cloud points. However, this platform can also be used to predict any polymer property, based on the chosen input.

Installation of the development environment

Timing: 45 min

Automatically set up a Jupyter-server and its environment via Docker.

1.
Download the program, training data, and models from GitHub under the following link: https://github.com/Bizbalt/PFP/archive/refs/heads/main.zip and unpack it to a location of your choice (from now on this folder is referred to as <PFPDIR>).

Optional: If you have an Nvidia graphics processing unit, you can uncomment the last code block in the docker-compose.yml file in the newly created PFP folder to make it available to the project.

<PFPDIR>\docker-compose.yml, before:

># uncomment to use GPU
>#deploy:
># resources:
>#  reservations:
>#   devices:
>#    - driver: nvidia
>#      capabilities: [gpu]

<PFPDIR>\docker-compose.yml, after:

># uncomment to use GPU
>deploy:
> resources:
>  reservations:
>   devices:
>    - driver: nvidia
>      capabilities: [gpu]

2.
Download and install Docker for your operating system at docker.com/get-started and compose the PFP container:
- a.
  Mac.
  Note: Make sure your downloaded Docker version matches the chip architecture of your computer (Intel or Apple silicon). You can check your chip under “About This Mac” in the Apple menu.
  - i.
    Double-click the docker.dmg and drag it into the applications folder.
  - ii.
    Open the app and accept the terms and conditions. Let it finish configuring.
  - iii.
    Open a terminal (Applications > Utilities > Terminal) in admin mode, or use the “sudo” prefix, and run the following command:
    >cd <PFPDIR>
    Note: You can print your current path with the first command below and list the files it contains with the second.
    
    >pwd
    >ls
  - iv.
    Install the project environment via docker.
    >docker compose up
    Note: If you are experienced with installation or want to install the software over ssh, you can install the docker services from the release channels https://docs.docker.com/engine/install/.
- b.
  Windows:
  - i.
    Double-click the docker installer exe (no admin rights needed).
  - ii.
    Tick “use Windows Subsystem for Linux 2 (WSL2) instead of Hyper-V” if applicable (recommended).
  - iii.
    Close the installer and restart the system as instructed.
  - iv.
    Accept the terms and conditions on the pop-up after start-up and close it.
  - v.
    Open the Windows command line.
  - vi.
    Press Win + R, type “cmd”, and press Enter.
  - vii.
    In the console navigate to the project folder with the change directory command.
    >cd <PFPDIR>
  - viii.
    Install the project environment via docker.
    >docker compose up
    Note: When prompted in the terminal to install git, accept and agree to the license.
- c.
  Linux:
  - i.
    Follow the recommended installation instructions for your distribution of Linux as mentioned on https://docs.docker.com/engine/install/.
  - ii.
    Navigate to the project folder containing the docker image:
    >cd <PFPDIR>
  - iii.
    Install the project environment via docker.
    >sudo docker compose up
    CRITICAL: If the respective docker compose up command gives no output or throws an error, please consult the troubleshooting section of the Docker Engine documentation at https://docs.docker.com/engine/install/troubleshoot/ and note the hardware requirements.
3.
The Jupyter server starts automatically after the installation process is complete. It can be accessed via your browser of choice at “http://localhost:8888”.

Note: If the docker server is running but you cannot see a web interface please refer to troubleshooting 1.

4.
Enter the password “pfppassword”.

Pause point: Everything is now set to run the software. To re-run it after a shutdown, only two commands in the terminal/command line and opening the page via browser (step 3) must be redone:

>cd <PFPDIR>

>docker compose up

If the respective docker compose up command throws an error, please refer to troubleshooting 2.
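
If the browser page does not load, it can help to check whether anything is listening on the Jupyter port at all, which separates a Docker problem from a browser problem. The following is a minimal sketch using only the Python standard library. The helper name `is_port_open` is hypothetical and not part of the PFP software; host and port are taken from the address given in step 3.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises an OSError subclass if nothing accepts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open("localhost", 8888)` on the machine running Docker should return `True` once the container is up. A `False` result suggests that the container is not running or the port is not published, rather than a browser issue.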

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data

Non-curated LCST dataset	Köster et al.¹	Non-curated Dataset.xlsx
Curated LCST dataset	Köster et al.¹	cloud_points_data.csv

Software and algorithms

Docker	Docker, Inc.	https://www.docker.com/
ChemDraw Professional 21.0.0.28	Revvity Signals	https://revvitysignals.com/products/research/chemdraw
Jupyter-notebooks for data curation, analysis, and machine learning	This paper	https://github.com/Bizbalt/PFP
Polyfingerprint decoder	Köster et al.¹	Polyfingerprints.py

Other

Cheminfo SMILES checker	University of Valle	http://www.cheminfo.org/flavor/malaria/Utilities/SMILES_generator___checker/index.html

Materials and equipment

Hardware requirements

A 64-bit processor with CPU support for virtualization.

Further detailed requirements are listed in the official Docker documentation on the page for your operating system at https://docs.docker.com/desktop/.

Exemplary hardware requirements

Minimal hardware requirements for using the trained model:

•
Operating system: Windows, Linux, macOS (tested on Windows 10, Ubuntu 20.04, and macOS Sonoma 14.1 with an Intel processor).
•
At least 4 GB of RAM.
•
20 GB of free disk space.
•
Intel Core i5 4 × 2.60 GHz.

Recommended hardware for training a new model:

•
Operating system: Windows Vista SP2, Linux Ubuntu 20.04 or newer.
•
16 GB of RAM.
•
25 GB of free disk space.
•
Ryzen7 8 × 3.7 GHz processor or dedicated graphics card from Nvidia (4 GB).

Alternatives: Instead of ChemDraw Professional, any other structure drawing software that allows the export and import of chemical structures as SMILES can be used.

Alternatives: Instead of the Cheminfo SMILES checker, any program that can validate the correctness of SMILES strings is suitable.

Step-by-step method details

Data curation and analysis Jupyter-notebook

Timing: 15 min

Load the exemplary data, reformat and understand it.

Note: You can start at this point after a restart (see pause point in the install section). See troubleshooting 3 and 4.

1.
Open the first Notebook.
- a.
  In the Jupyter browser-interface, open the “examples/cloud_point” folder in the folder tree on the left.
- b.
  Double-click the “data_curation_and analysis.ipynb” file.

Note: This is one of the three Jupyter notebooks, in which code can be executed and examined sequentially. Each section has its own cell. You do not need to type in the code; the whole program is already stored in the notebook, so you only execute the cells one by one.

Note: You can pause at any point. When you return, first re-execute all cells above the one you want to run, in order, because the cells build on each other and their output must be loaded into RAM again.

Note: You can always change some parameters in a cell and then run/try out this variation to get a better understanding of the program.

Optional: You can use your own dataset as described under the Optional: Prepare own dataset section, but it is recommended to follow the tutorial with the LCST data first.

Note: Besides the explanation here, the code sometimes contains comments which work as headings or further technical elucidation.

># comments like this one are marked with a hashtag and increase code readability

You can add comments at the end of every line as you like.

2.
Execute the first cell to import the libraries you need to load and manipulate the chemical structure data.

Note: Placing the cursor anywhere in a cell and pressing Shift + Enter executes the cell and jumps to the next one.

3.
The next two cells (2, 3) declare the working directory and load the data from it.
4.
Execute the fourth cell:
- a.
  Run the data-specific curation of our “lower critical solution temperature” dataset.
- b.
  Use the data reader from the PFP library to convert the tabular data into a form that is easier to handle programmatically, while declaring which columns to use for training.
  Note: The reader automatically prepares the columns for later training.
- c.
  Lastly, store the information as an info file.
5.
Run cells five to ten to generate diagrams that relate the chemical structures to each other.

Note: These are also specific to the “lower critical solution temperature” dataset and function as exemplary analysis.

Note: You can find other diagram formats that might suit your data better at Plotly.com.

6.
All further cells (11–17) cover chemical structure representation:
- a.
  How to display the SMILES as graphics.
- b.
  How to get just the fingerprint vectors, with an excursus on comparability.
  Note: General fingerprints are well explained on the Daylight website. For more detailed information about the PFP, read the original article.
- c.
  Creating different PFP set sizes.
- d.
  Displaying them in two cells in two forms.
- e.
  Calculating conventional similarity indexes, coefficients, distances etc.
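
To make the similarity measures in step 6e more concrete, the following sketch computes two conventional coefficients (Tanimoto and Dice) for a pair of fingerprints. It is a stand-alone illustration under the assumption of binary (0/1) fingerprints and does not use the PFP library; the function names and the two short example vectors are invented for this purpose. If PFP vectors contain non-binary entries, generalized forms of these coefficients would be needed.

```python
from typing import Sequence


def _check_lengths(a: Sequence[int], b: Sequence[int]) -> None:
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto (Jaccard) coefficient of two binary fingerprints."""
    _check_lengths(a, b)
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    return both / either if either else 1.0           # two empty fingerprints count as identical


def dice(a: Sequence[int], b: Sequence[int]) -> float:
    """Dice coefficient of two binary fingerprints."""
    _check_lengths(a, b)
    both = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    return 2 * both / total if total else 1.0


# Invented example vectors (far shorter than a real fingerprint).
fp_a = [1, 0, 1, 1, 0, 0, 1, 0]
fp_b = [1, 1, 0, 1, 0, 0, 1, 0]

print(tanimoto(fp_a, fp_b))  # 3 shared bits / 5 bits set in either -> 0.6
print(dice(fp_a, fp_b))      # 2 * 3 / (4 + 4) -> 0.75
```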

Train and use prediction model Jupyter-notebook

Timing: 10 min to 2 h

Neural network set-up and training.

Note: You could start at this point after a restart (see pause point in the install section).

7.
Open the “train_and_use_prediction_model.ipynb” Jupyter notebook next.
8.
As in the last Jupyter notebook, run the first cell to import program libraries that you need to load, train with, save and use the previously curated data set and the corresponding info file.
9.
Set the working directory and load the curated data set back into RAM with the second cell.

Note: This needs to be carried out for this notebook again, as Jupyter notebooks do not share variables, even if both were running. This also means that the “data_curation_and analysis.ipynb” notebook can be closed to save resources.

10.
Run the third cell:
- a.
  Set the PFP-hyperparameters.
- b.
  Create and reduce the PFP.
- c.
  In the last line, store the reduction data separately.

CRITICAL: A model is representation specific. This means that any new data that shall pass through the model must be represented and reduced in exactly the same way as previous training data. For this reason, the reduction data is saved.

11.
Execute the fourth cell to iterate through the training set to check for completeness prior to training.
12.
Examine the contents of cell five:
- a.
  Set individual identification and hyperparameters for the model to be created and trained.
- b.
  Store this information for later lookup when reusing the same model or resuming training.
13.
By running the last cell, cell six, you call the train_model() function, thus, performing training based on the hyperparameters previously defined.

Note: If a model with the specified name already exists, it is loaded and training continues where it stopped in the last run.

CRITICAL: The same hyperparameters must be used in this case. If they are changed, the hyperparameters of the loaded model and the newly defined ones can mismatch, which results in an error. In such cases, we recommend either deleting the model folder completely or simply choosing a new model name (in cell 5, DEFAULT_MODELDATA).
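
The kind of mismatch described above can be spotted before training by comparing the stored settings with the newly defined ones. The sketch below is only an illustration with plain dictionaries: the helper `find_mismatches` and all hyperparameter names and values are hypothetical and are not taken from the PFP code.

```python
def find_mismatches(stored: dict, requested: dict) -> dict:
    """Map each differing key to a (stored, requested) pair; missing keys appear as None."""
    keys = set(stored) | set(requested)
    return {
        key: (stored.get(key), requested.get(key))
        for key in sorted(keys)
        if stored.get(key) != requested.get(key)
    }


# Hypothetical hyperparameter names and values, for illustration only.
stored = {"fp_size": 2048, "hidden_layers": 3, "learning_rate": 0.001}
requested = {"fp_size": 1024, "hidden_layers": 3, "learning_rate": 0.001}

diff = find_mismatches(stored, requested)
if diff:
    print("Mismatch found; choose a new model name or delete the model folder:", diff)
```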

Inference Jupyter-notebook

Timing: 5 min

Utilize saved models and build on prior training.

14.
Open the “inference.ipynb” Jupyter notebook.

Note: This third notebook is built for the utilization of an already trained model.

15.
In the same manner as with the notebooks before, execute the first three cells (1–3) to import the same libraries, set the path and name to the desired model and load a dataset supposedly missing the target value.
16.
In the fourth cell the PFP is created, and additional training information is set and fitted to the chosen model.
17.
Execute the subsequent cell five to reduce the PFP in the model-specific way (see Step 10c).
18.
Load the model in cell six.
19.
By running cell seven the target variable is calculated by the model based on the structural PFP data as input.
20.
Lastly, execute cell eight, to insert the calculated data into the initial data set and display it.

Note: You have now loaded and analyzed a dataset and its PFP representation, and created, trained, and utilized a neural network. The next steps are tweaking the cells to your liking and/or inserting your own data and repeating these steps.
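
Step 20 amounts to writing one predicted value back into each entry of the initial dataset. A minimal sketch of that idea with plain Python dictionaries follows. The helper `attach_predictions` is hypothetical, the rows and numbers are placeholders rather than model output, and the notebook itself may handle this differently. The column name “cloud_point” is the target column of the example dataset.

```python
def attach_predictions(rows, predictions, target_column):
    """Return copies of the rows with one predicted value added to each."""
    if len(rows) != len(predictions):
        raise ValueError("need exactly one prediction per row")
    return [{**row, target_column: value} for row, value in zip(rows, predictions)]


# Placeholder entries and placeholder numbers, not real model output.
rows = [{"Mn": 35370}, {"Mn": 12000}]
predictions = [21.0, 35.5]

for row in attach_predictions(rows, predictions, target_column="cloud_point"):
    print(row)
```

The length check guards against silently misaligned rows, which would assign predictions to the wrong polymers.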

Optional: Prepare own dataset

Timing: 1–5 days

Curate your polymer information to be used in machine learning.

Note: The PFP was designed to also work with small datasets; 500 entries are considered small. Networks trained on smaller datasets may also achieve low error rates, but their applicability to chemistry beyond their training set is likely limited. As this is very case-specific, one must analyze the data first before being able to make a reasonable assumption.

21.
Probably the easiest way to use your own data is to make a copy of the template folder and rename the copy to a descriptive name for your data.
22.
The newly generated folder contains an excel file “excel_template.xlsx” in which you can fill in your own data.
23.
The structure is as follows (an example with data from Etchenausia et al. can be seen in Table 1)²:
- a.
  Pairs of “SMILES_repeating_unit_#” and “molpercent_repeating_unit_#” columns contain the monomer structures and their mole fractions per polymer/entry. You therefore need at least as many column pairs as there are distinct monomers in the polymer with the most monomers. Each # must be unique in the table; we recommend increasing numbers or letters (1, 2, 3, … or A, B, C, …). “SMILES_repeating_unit_#” contains a valid SMILES for the repeating unit, which has to start and end with an atom with incomplete valency (radical).
  CRITICAL: It is crucial that the connection points of the repeating units are written explicitly, like radicals, e.g.,
  
  [CH2]C[CH2]
  
  All side chains and/or backbone atoms must be placed between the two radical groups or appended in brackets, e.g.,
  
  [CH2]CC[CH2] and [CH2]C[CH](C) are valid but [CH2]C[CH]C is not. The reason is the internal concatenation of the SMILES strings. The following example illustrates this with the SMILES strings:
  
  [CH2]CC[CH2]
  
  becomes
  
  [CH2]CC[CH2][CH2]CC[CH2][CH2]CC[CH2]
  
  with three iterations, which is a valid polymer (without end groups), but
  
  [CH2]C[CH]C
  
  becomes
  
  …[CH2]C[CH]C[CH2]C[CH]C[CH2]C[CH]C…
  
  which is also structurally valid but does not correspond to the intended repeating unit with a -CH₃ side group in the polymer.
  
  “molpercent_repeating_unit_#” should contain the amount of the specific repeating unit in the polymer; the values can be either absolute or relative. Each polymer must contain at least one SMILES-molpercentage pair, but it is not necessary to fill all available columns, since empty ones are ignored.
- b.
  Similar to the repeating units, start- and end-groups are defined in “SMILES_start_group” and “SMILES_end_group”. Here the start-group has to end with a radical to be prepended to the repeating unit string and in the same manner the end-group SMILES have to start with a radical to be appended.
  
  Note: If you have no information about the end groups you can bridge this gap by inserting [H].
- c.
  Furthermore, filling in the “Mn” column is mandatory. It contains the number-average molar mass of the polymers (other averages like Mw or even Mz could also work, but Mn has the most descriptive value).
  Note: These steps (23a to c) cover the mandatory structural input; all additional input is unrelated to the PFP and specific to the case study.
- d.
  Additives often have a significant influence on the polymer properties, which is why you can include them as well. Similar to the repeating units, two columns are required for each additive: “additive_#” and “additive_#_weight_percent”, or “additive_#” and “additive_#_concentration_molar”, where # is again a unique identifier. If you use “additive_#_weight_percent”, the respective concentration is calculated automatically using the density property from the data sheet in the Excel file. “additive_#” must again be a valid SMILES, but in this case it can be an arbitrary SMILES. Even salts in the form of [Na+].[Cl-] are possible, where the individual ions/molecules, separated by “.”, are treated as individual additives. If you have additives for which no SMILES structure can be provided, e.g., a certain amount of a complex mixture, we recommend creating a new numerical column for this additive.
  Note: All other columns are treated either as numerical or categorical values. If the column contains numbers, it is treated as a numerical column and will add one extra dimension to the training data. Columns which cannot be parsed as numerical, e.g., strings are interpreted as categorical: Each unique value in this column adds a new dimension to the training data, which is either 1, if it is in the specific formulation, or 0 if it is not (one-hot-encoding). A column with the values A, B and C, would add 3 dimensions to each entry, with the values [1,0,0] for the “A”, [0,1,0] for the “B” and so on.
  
  Note: As the representation of structures as SMILES codes can be tricky, one can use the SMILES generator/checker from the University of Valle or use the paste special (Alt+Ctrl+p) and copy as (Alt+Ctrl+c) SMILES functions of ChemDraw to verify the codes. The third tab (“Namen”) of the Non-curated Dataset.xlsx file also gives many examples of initiators and repeating units.
24.
Tune the hyperparameters of the “training” and “inference” notebooks.

Note: You can always orient yourself on the lower critical solution/cloud point temperature example notebooks and use values similar to those.

25.
Set the TARGET_VALUE to your specific column name of your output value in the “inference” notebook.
26.
Run the modified notebooks in the same way as the example notebooks before.

Optional: The default hyperparameters can also be seen and set manually through the hyperparameter.yml files created.
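
The rule from step 23a, that side groups must be written in brackets, follows directly from joining the repeating-unit strings end to end. The sketch below reproduces the examples from that step with plain string concatenation. It is an illustration only: `concatenate_units` is a hypothetical helper, and the actual decoder may do more than this when it assembles a polymer.

```python
def concatenate_units(repeating_unit: str, n: int,
                      start_group: str = "", end_group: str = "") -> str:
    """Join n copies of a repeating-unit SMILES head to tail by string concatenation."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return start_group + repeating_unit * n + end_group


print(concatenate_units("[CH2]CC[CH2]", 3))
# [CH2]CC[CH2][CH2]CC[CH2][CH2]CC[CH2]

print(concatenate_units("[CH2]C[CH](C)", 2))
# [CH2]C[CH](C)[CH2]C[CH](C)   (the methyl group stays a side group)

print(concatenate_units("[CH2]C[CH]C", 2))
# [CH2]C[CH]C[CH2]C[CH]C       (the intended methyl group becomes part of the backbone)
```

In the last case, the carbon that was meant as a side group is followed directly by the next unit, so it is read as a backbone atom.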

Table 1.

Example of information of an entry for a reversible-addition-fragmentation chain-transfer copolymerization of vinyl acetate and N-vinylcaprolactam (line 434 in the cloud_points_data.csv file). Transformed to a long format for better readability

SMILES_start_group	[C](C)(C)(C#N)
SMILES_end_group	[S]C(=S)OCC
SMILES_repeating_unitA	[CH2][CH1](OC(=O)C)
molpercent_repeating_unitA	0.53
SMILES_repeating_unitB	[CH2][CH](N1C(=O)CCCCC1)
molpercent_repeating_unitB	0.47
Mn	35370
Đ	1.1
polymer_concentration_wpercent	0.003
cloud_point	21.4
def_type	C
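
Columns that hold strings, such as def_type in Table 1, are treated as categorical and one-hot encoded as described in the note on additional columns. The following sketch shows what such an encoding looks like for a column with the values A, B, and C. The function `one_hot` is a hypothetical stand-alone helper, and the alphabetical ordering of the categories is an assumption; the PFP data reader may order them differently.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 vector per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1  # exactly one position is set per entry
        encoded.append(vector)
    return categories, encoded


categories, encoded = one_hot(["A", "B", "C", "A"])
print(categories)  # ['A', 'B', 'C']
print(encoded)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each unique value adds one dimension, so a column with many distinct strings can enlarge the training input considerably.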

Optional: How to use the polyfingerprint library as a module

Timing: 10 min

Brief documentation of the two most important functions.

Note: All functions of the decoder are stored in the polyfingerprints.py file. If you just want to use this descriptor in your own system, this section is of interest to you.

27.
The crucial function is create_pfp(). Create the fingerprint of an example by executing the following code in a new cell or console.

>import polyfingerprints as pfp

>a_polyfingerprint = pfp.create_pfp(end_units={"start": "[C](C)(C)(C#N)", "end": "[S]C(=S)OCC"}, repeating_units={0.53: "[CH2][CH1](OC(=O)C)", 0.47: "[CH2][CH](N1C(=O)CCCCC1)"}, mol_weight=35370, fp_size=2048)

>print(a_polyfingerprint)

Note: The function takes the structure and composition of a polymer or a list of polymers.

Since the dictionaries for the repeating units can be arbitrarily long, copolymers with arbitrarily many different repeating units can be defined here.

Note: More options for all functions can be seen by opening the contextual help in the Jupyter interface (Ctrl + I).

28.
Shorten training times by reducing a list of fingerprints with the reduce_fp_set() function.

>b_polyfingerprint = pfp.create_pfp(end_units={"start": "[C](C=C)(C)(C#N)", "end": "[S]C(=S)OCCCC"}, repeating_units={0.53: "[CH2][CH1](OC(=O)CC)", 0.47: "[CH2][CH](N1C(=O)CCC1)"}, mol_weight=35370, fp_size=2048)

>list_of_pfps = [a_polyfingerprint, b_polyfingerprint]

>print(pfp.reduce_fp_set(*list_of_pfps))

Note: The function checks which parts of the fingerprints do not differ across the set and discards these obsolete parts.

Note: The function also returns two masks. These masks can be used later with the reduce_fp() function to reduce a fingerprint of the same size (e.g., fp_size = 2048) from a polymer outside of the set in the same way, so that it remains comparable. The reduce_fp() function also provides a loss, which describes how much of the information that was cut off was not obsolete.
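
The idea behind such a reduction can be illustrated without the library. The sketch below drops every position that has the same value in all fingerprints of a set, keeps the resulting mask, and applies it to a new fingerprint. It is not the implementation of reduce_fp_set() or reduce_fp(): the helper names are hypothetical, a single mask is used instead of the two returned by the library, and the count of dropped differences is only one possible reading of the “loss” described above.

```python
def varying_position_mask(fingerprints):
    """Return one boolean per position: True where values differ across the set."""
    if not fingerprints:
        raise ValueError("need at least one fingerprint")
    length = len(fingerprints[0])
    if any(len(fp) != length for fp in fingerprints):
        raise ValueError("all fingerprints must have the same length")
    return [len({fp[i] for fp in fingerprints}) > 1 for i in range(length)]


def apply_mask(fingerprint, mask):
    """Keep only the positions marked True in the mask."""
    if len(fingerprint) != len(mask):
        raise ValueError("fingerprint and mask must have the same length")
    return [value for value, keep in zip(fingerprint, mask) if keep]


def dropped_differences(fingerprint, mask, reference):
    """Count masked-out positions where the fingerprint differs from a training fingerprint."""
    return sum(
        1 for value, keep, ref in zip(fingerprint, mask, reference)
        if not keep and value != ref
    )


# Invented toy fingerprints (real ones have e.g. 2048 positions).
training = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1],
]
mask = varying_position_mask(training)
print(mask)                                  # [False, True, True, False, False]
print([apply_mask(fp, mask) for fp in training])  # [[1, 1], [1, 0], [0, 1]]

new_fp = [0, 1, 1, 1, 1]                     # differs from the set at position 3
print(apply_mask(new_fp, mask))              # [1, 1]
print(dropped_differences(new_fp, mask, training[0]))  # 1
```

The last value shows why the stored mask matters: position 3 was constant in the training set and is therefore removed, although it carries information for the new polymer.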

Expected outcomes

When successfully starting the docker image and logging in with the “pfppassword” the user is greeted with an interface like in Figure 1.

Figure 1. Section of the interface of JupyterLab in a browser

The first notebook “data_curation_and analysis.ipynb” was opened through the folder tree on the left.

Though the outcome of the training is random, nearly every example training run with the default parameters and dataset should end with a mean squared error between 20°C² and 40°C². The console outputs that number as “loss” and prints a graphic similar to Figure 2.

Figure 2. Training overview cross-plot generated after a default run

While choosing a large maximum number of training epochs prevents training for too short a time, the “early-stopping” setting (enabled by default) stops the training when the error on the validation set worsens, effectively preventing a loss of generalizability, i.e., fitting the network too specifically to the training set. The technical terms for these two phenomena are under- and overfitting. They would be visible in the cross-plot as a train error curve that is still in decline and as a test loss that is higher than the train loss, respectively.
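
The early-stopping logic can be sketched in a few lines. The version below uses a common patience-based criterion: training stops once the validation loss has not improved for a given number of consecutive epochs. This is an assumption for illustration; the exact criterion and patience used by the protocol's training routine are not stated here, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has not improved for `patience`
    consecutive epochs. If that never happens, the last epoch index is returned.
    """
    if not val_losses:
        raise ValueError("need at least one validation loss")
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Invented validation losses: improvement until epoch 4, then worsening.
val_losses = [900.0, 400.0, 120.0, 60.0, 45.0, 47.0, 50.0, 52.0, 49.0]
print(early_stopping_epoch(val_losses, patience=3))  # 7
```

A larger patience tolerates noisy validation curves at the cost of a few extra epochs; a patience of 1 stops at the first epoch without improvement.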

These machine learning techniques for quantitative structure-property relationships are of ever-increasing interest and often just the starting point in state-of-the-art labs, whether for efficiency-enhancing design-of-experiment approaches or autonomous experimental platforms.³,⁴ Therefore, the ability to utilize such foundational techniques should be of great interest to any scientist planning to optimize or explore new compounds in the chemical space of molecules.

Quantification and statistical analysis

Table 2 below gives an impression of how fast the training process can be and how much the early results vary between runs.

Table 2.

Test for time and speed of training

Time [s]	Temperature – mean square error [K²]
29	864
26	895
29	757
30	824
27	507
26	913
29	2061
25	893
25	602
27	977

The training was stopped after 10 epochs of training. These early training stages are a good forecast for the overall training runs. Default hyperparameters are used.
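
The ten runs in Table 2 can be summarized with the Python standard library, as in the short sketch below. The numbers are copied from the table; only the choice of summary statistics is ours. The median is included because one run (2061 K²) lies far above the others and pulls the mean upward.

```python
import statistics

# Values copied from Table 2 (10 runs, each stopped after 10 epochs).
times_s = [29, 26, 29, 30, 27, 26, 29, 25, 25, 27]
mse_k2 = [864, 895, 757, 824, 507, 913, 2061, 893, 602, 977]

print(f"mean time:  {statistics.mean(times_s):.1f} s")     # 27.3 s
print(f"mean MSE:   {statistics.mean(mse_k2):.1f} K^2")    # 929.3 K^2
print(f"median MSE: {statistics.median(mse_k2):.1f} K^2")  # 878.5 K^2
print(f"MSE sample standard deviation: {statistics.stdev(mse_k2):.1f} K^2")
```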

Limitations

When adding training data, keep in mind that single entries from a completely different subject area, which can hardly be compared with the other entries, are likewise predicted poorly. As all entries are split into training, validation, and test sets (in an 8:1:1 ratio), it is recommended to have at least 10 entries for each subject area. As the program also outputs a csv with all predicted values at the end of training, one can easily see the performance on specific entries there. Additionally, the machine learning approach most likely allows for searching and interpolating between training data points and should not be seen as a general chemistry learning system. Therefore, newly introduced polymers can only be estimated to the extent that they are comparable to the polymers in the dataset.
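
The recommendation of at least 10 entries per subject area follows from the 8:1:1 ratio, as the sketch below illustrates. It shuffles the entries and cuts them into three parts using integer division. The function `split_811` is hypothetical; the rounding, shuffling, and seeding used by the actual training code are not specified here and may differ.

```python
import random


def split_811(entries, seed=0):
    """Shuffle entries and split them into training, validation, and test parts (8:1:1)."""
    items = list(entries)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_val = len(items) // 10
    n_test = len(items) // 10
    n_train = len(items) - n_val - n_test
    train_part = items[:n_train]
    val_part = items[n_train:n_train + n_val]
    test_part = items[n_train + n_val:]
    return train_part, val_part, test_part


for n in (100, 10, 9):
    train_part, val_part, test_part = split_811(range(n))
    print(n, len(train_part), len(val_part), len(test_part))
# 100 entries -> 80 / 10 / 10
# 10 entries  -> 8 / 1 / 1
# 9 entries   -> 9 / 0 / 0 (nothing left for validation or testing)
```

Note that a random split over the whole dataset does not guarantee that every subject area appears in each part; it only makes this possible once enough entries exist.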

Deinstallation

If you want to uninstall the software, remove the Docker PFP environment (e.g., in Docker Desktop or via the console) before you uninstall Docker. See troubleshooting 5.

Troubleshooting

Problem 1

Jupyter cannot be opened in the browser (related to Step 3).

Potential solution

•
Use a supported browser (such as Google Chrome, Mozilla Firefox, or Microsoft Edge) and ensure it is up to date. Clearing the browser cache and cookies can also help resolve compatibility issues.

Problem 2

The docker compose up command does not execute as expected and the following error message might show (related to Step 2).

>ERROR: error during connect: […]

Potential solution

•
Open “Docker Desktop” from your programs before executing the command again. Your system might not be able to start the Docker engine automatically, or autostart is disabled, so you need to start it manually.

Problem 3

The notebook does not run anymore, or you just want to revert to the original state.

Potential solution

•
You can delete the examples folder as it will be reinstated on next start-up/compose up.
•
Alternatively, you can also rebuild the whole environment with

>docker compose up --build

Please note that this takes some time, as the whole program set-up is installed again.

Problem 4

Jupyter cells cannot be executed or seem frozen.

Potential solution

•
Jupyter cells do run one after another. Is another cell still processing and therefore putting the others on hold?
•
Jupyter cells show that they are currently trying to be executed with an asterisk in front of them. The cell you are currently trying to execute might be still running.
- ○
  If a cell (apart from the model training and usage cells) takes longer than approximately 10 min you can restart the kernel by clicking on "Kernel" in the menu bar and selecting "Restart Kernel". Additionally, ensure your code does not have memory leaks or infinite loops that could cause the kernel to crash.

Problem 5

Uninstallation seems incomplete because less free hard drive space is available than before the installation (referring to the deinstallation section).

Potential solution

•
Docker does not delete the environments it created when it is uninstalled.
- ○
  Reinstall Docker, delete the environment, and uninstall Docker again.

•
On Windows, you can open your local application data folder by entering the following in the Windows search:
>%localappdata%
- ○
  Delete the docker folder there manually.
  
  For further troubleshooting feel free to ask and/or raise an issue in the respective GitHub repository: https://github.com/Bizbalt/PFP.

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Prof. Dr. Ulrich S. Schubert (ulrich.schubert@uni-jena.de).

Technical contact

Questions about the technical specifics of performing the protocol should be directed to and will be answered by the technical contact, Yannik Köster (yannik.koester@uni-jena.de).

Materials availability

This study did not generate unique reagents.

Data and code availability

•
The primary research and the dataset on which this study is based are available at https://doi.org/10.1016/j.xcrp.2023.101553.
•
The code generated as a part of this study and published simultaneously with this protocol is available at Zenodo: https://doi.org/10.5281/zenodo.10794407.
•
The newest version will always be available at GitHub https://github.com/Bizbalt/PFP or Zenodo https://zenodo.org/doi/10.5281/zenodo.10794406.

Acknowledgments

The authors would like to thank the Deutsche Forschungsgemeinschaft (SFB 1278 “PolyTarget,” project number 316213987, projects Z01 and A06) as well as the Thüringer Aufbaubank (2021 FGI 0005) for financial support.

Author contributions

Conceptualization, methodology, software, validation, and investigation, Y.K. and J.K.; resources, U.S.S.; writing – original draft, Y.K.; writing – review and editing, U.S.S., S.Z., and J.K.; funding acquisition, U.S.S.; supervision, U.S.S. and S.Z.

Declaration of interests

The authors declare no competing interests.

Contributor Information

Yannik Köster, Email: yannik.koester@uni-jena.de.

Ulrich S. Schubert, Email: ulrich.schubert@uni-jena.de.

References

1. Köster Y., Kimmig J., Zechel S., Schubert U.S. Fingerprint applicable for machine learning tested on LCST behavior of polymers. Cell Rep. Phys. Sci. 2023;4. doi: 10.1016/j.xcrp.2023.101553.
2. Etchenausia L., Rodrigues A.M., Harrisson S., Deniau Lejeune E., Save M. RAFT copolymerization of vinyl acetate and N-vinylcaprolactam: Kinetics, control, copolymer composition, and thermoresponsive self-assembly. Macromolecules. 2016;49:6799–6809. doi: 10.1021/acs.macromol.6b01451.
3. McGibbon M., Shave S., Dong J., Gao Y., Houston D.R., Xie J., Yang Y., Schwaller P., Blay V. From intuition to AI: Evolution of small molecule representations in drug discovery. Brief. Bioinform. 2023;25. doi: 10.1093/bib/bbad422.
4. Xie Y., Sattari K., Zhang C., Lin J. Toward autonomous laboratories: Convergence of artificial intelligence and experimental automation. Prog. Mater. Sci. 2023;132. doi: 10.1016/j.pmatsci.2022.101043.
