Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Valid CXSMILES rejected #2716

Open
johnmay opened this issue Dec 29, 2024 · 3 comments
Open

Valid CXSMILES rejected #2716

johnmay opened this issue Dec 29, 2024 · 3 comments

Comments

@johnmay
Copy link
Contributor

johnmay commented Dec 29, 2024

Summary

Indigo requires all pseudo atoms are specified in a CXSMILES pseudo block $...$. This is overly restrictive and you very in MarvinJS (https://marvinjs-demo.chemaxon.com/latest/demo.html) that you should be able to truncate the block.

*CC |$R1$| (rejected but should be accepted)
*CC |$R1;;$| (accepted)

Here is the error message:

Convert error! Given string could not be loaded as (query or plain) molecule or reaction, see the error messages: 'molecule auto loader: SMILES loader: only 1 atoms found in pseudo-atoms $...$ block', 'scanner: BufferScanner::read() error', 'scanner: BufferScanner::read() error', 'molecule auto loader: SMILES loader: only 1 atoms found in pseudo-atoms $...$ block'

@AlexanderSavelyev
Copy link
Collaborator

AlexanderSavelyev commented Jan 6, 2025

We use CXSMILES available specification from Chemaxon. Inside the aliases/pseudo atom description, it provides ";" as atom separator (to map the atom number) and there are no statements or examples where $..$ can be less restrictive (unless we missed something).
https://docs.chemaxon.com/display/docs/formats_chemaxon-extended-smiles-and-smarts-cxsmiles-and-cxsmarts.md#pseudo-atoms
Unfortunately, we can't follow MarvinJS (current version demo) behavior as a standard (even if it is their format ) because it can lead to misinterpretation (e.g. what if there are several aliases- C*CC - should it match the first atom or first "any" atom? Alias can be assigned to any atom including carbons)

@johnmay
Copy link
Contributor Author

johnmay commented Jan 6, 2025

Unfortunately, we can't follow MarvinJS (current version demo) behavior as a standard (even if it is their format ) because it can lead to misinterpretation (e.g. what if there are several aliases- C*CC - should it match the first atom or first "any" atom? Alias can be assigned to any atom including carbons)

There is no issue, the position in the $...$ is absolute it is only the trailing ones can be omitted.

There are all valid an mean different things:

C*CC |$R$|
C*CC |$;R$|
C*CC |$;;R$|
C*CC |$;;;R$|

This is just Indigo being too fussy and requiring the equivalent:

C*CC |$R;;;$|
C*CC |$;R;;$|
C*CC |$;;R;$|
C*CC |$;;;R$|

Since Indigo has already extended CXSMILES with ha:/hb:

Since * often will appear early in a SMILES string (i.e. due to sorting by atomic number and using 0 for *) it is convenient to omit the trailing ones, particularly on reactions. Although Indigo's CXSMILES is even more broken on reactions with _AV (atom value) getting convoluted with the pseudo labels and generating wrong/invalid results:

Indigo indigo = new Indigo();
IndigoObject mol1 = indigo.loadMolecule("*Cl |$_AV:;3$,$R1;$|");
IndigoObject mol2 = indigo.loadMolecule("*Br |$_AV:;4$,$R2;$|");
IndigoObject rxn = indigo.createReaction();
rxn.addReactant(mol1);
rxn.addProduct(mol2);
System.err.println(rxn.smiles()); // *Cl>>*Br |$R1;3;R2;4$| WRONG!
Indigo indigo = new Indigo();
IndigoObject mol1 = indigo.loadMolecule("*Cl |$_AV:;3$$|");
IndigoObject mol2 = indigo.loadMolecule("*Br |$_AV:;4$$|");
IndigoObject rxn = indigo.createReaction();
rxn.addReactant(mol1);
rxn.addProduct(mol2);
System.err.println(rxn.smiles()); // *Cl>>*Br |$_AV:;3;_AV:;4$| WRONG! Invalid!

It's fine if you don't want to fixing these as I want to migrate away from Indigo anyways so any incorrectness is just more motivation :-).

@AlexanderSavelyev
Copy link
Collaborator

Thank you for the reported bug for the reaction SMILES, we already found that the CXSMILES does not work correctly for reactions (unfortunately, the specification is not provided for reactions at all, but from the Marvin output and generic sense it is clear that absolute atom numbering should be through all the reactants/agents/products)
As for the omitting trailing positions (only for read) - we can actually try to fix it, as it should not be lot difficult

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants