Implementations of various molecular autoencoders.

The repo is organized along the following logic:
- Molecules can be represented as sequences, where the sequence can be made of characters (i.e., SMILES) or of something else (e.g., production rules of a character or graph grammar).
- Given the first point, all autoencoders share the same base `Encoder`.
- What actually differentiates the VAEs (e.g., character-, grammar-, or graph-VAE) is the decoder. As such, there is a submodule for each type of VAE implemented (currently only character and grammar).
- Molecules are represented as SMILES strings.
- SMILES strings are (i) tokenized and then (ii) each token is encoded to an integer based on a supplied vocabulary. There are fairly standard SMILES vocabularies and tokenization schemes taken from Schwaller et al. that we use here, but you can define your own. A vocabulary consists of all the tokens we expect to see in a SMILES string (e.g., "C", "Br", "O", etc.) as well as some "special tokens": `SOS`, `EOS`, `PAD`, and `UNK` ("start of sequence", "end of sequence", "pad character", and "unknown", respectively). The `PAD` token helps square off a jagged list of sequences (i.e., make a list of sequences with unequal lengths all the same length). The `UNK` token covers cases where an input sequence contains a token that you didn't anticipate (e.g., "Pd") without breaking things (see the tokenization sketch after this list).
- Given a sequence of encoded tokens, we use an `Encoder` to embed this sequence into a hidden vector: `h`.
- We feed this hidden vector to two independent linear layers that map it to the mean and the logvar ("log of the variance") of a distribution in latent space: `z_mean` and `z_logvar`.
- `z_mean` and `z_logvar` define a distribution in latent space, so we then sample from this distribution to get a single latent vector `z` (see the encoder sketch after this list). This is where the term "variational" comes in: if we had (i) defined only a single linear layer in the step above, or equivalently (ii) tossed out `z_logvar` and only used `z_mean`, we would be left with a plain autoencoder (not technically true, but a useful way to think about it).
- At generation time, we start with some latent vector `z` and the `SOS` ("start of sequence") token, which we feed into our `CharacterDecoder`. This outputs (1) an unnormalized probability distribution over all tokens (including our special tokens), from which we sample to get the next token in the sequence, and (2) an updated hidden state (see the generation sketch after this list).
- We feed in the next token and the updated hidden state iteratively until we either hit some maximum number of iterations or sample the `EOS` token.
- This output sequence of encoded tokens can then be decoded back to a sequence of tokens, which we join to get a SMILES string again!
- While it is guaranteed that our sequence starts with the `SOS` token, and likely that it ends with the `EOS` token (provided we sample it before running out of iterations), it is technically possible to sample other special tokens in the middle of the sequence (i.e., our sequence could very well have `SOS`, `PAD`, or `UNK` in the middle). However, the VAE quickly learns to avoid this "naughty" behavior.
- Even setting aside the above problem, the CVAE has no "innate" concept of syntactic validity for SMILES strings. We know that if we open an aromatic ring in a SMILES string (e.g., "c1") we must eventually close it, but the CVAE has no guarantee that it will. Formally, these rules constitute a "grammar". However (again), it turns out that the grammar of SMILES strings isn't that hard to learn, and most CVAEs trained on a large enough corpus will learn to avoid invalid outputs (see the validity check after this list).