sequence_unet.models

Load, download and initialise trained and untrained Sequence UNET models.

BIOSTUDIES_FTP = 'biostudies/nfs/S-BSST/732/S-BSST732': FTP path to the BioStudies directory containing the model data

CUSTOM_OBJECTS = {'GraphCNN': <class 'sequence_unet.graph_cnn.GraphCNN'>, 'WeightedMaskedBinaryCrossEntropy': <class 'sequence_unet.metrics.WeightedMaskedBinaryCrossEntropy'>, 'masked_accuracy': <function masked_accuracy>, 'masked_binary_crossentropy': <function masked_binary_crossentropy>}: Dictionary containing the object mappings required to load Sequence UNET Keras models using tf.keras.models.load_model.

MODELS = ['freq_classifier', 'pregraph_freq_classifier', 'pssm_predictor', 'pregraph_pssm_predictor', 'patho_top', 'pregraph_patho_top', 'patho_finetune', 'pregraph_patho_finetune']: List of model IDs for each trained Sequence UNET model. HDF5 (.h5) and SavedModel (.tf.tar.gz) files are available for each model at {BIOSTUDIES_FTP}/Files/{name}.{ext}.
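As a rough illustration of the path template above, the following sketch builds the remote file path for a model ID. The helper name `model_file_path` and the mapping from the 'tf' and 'h5' format names to file extensions are assumptions made for this example, not part of the documented API.

```python
# Constants copied from the module documentation above.
BIOSTUDIES_FTP = "biostudies/nfs/S-BSST/732/S-BSST732"
MODELS = [
    "freq_classifier", "pregraph_freq_classifier",
    "pssm_predictor", "pregraph_pssm_predictor",
    "patho_top", "pregraph_patho_top",
    "patho_finetune", "pregraph_patho_finetune",
]

# Assumption: 'tf' selects the SavedModel archive and 'h5' the HDF5 file.
EXTENSIONS = {"h5": "h5", "tf": "tf.tar.gz"}


def model_file_path(name, model_format="tf"):
    """Hypothetical helper: fill in {BIOSTUDIES_FTP}/Files/{name}.{ext}."""
    if name not in MODELS:
        raise ValueError(f"Unknown model ID: {name}")
    if model_format not in EXTENSIONS:
        raise ValueError(f"Unknown model format: {model_format}")
    return f"{BIOSTUDIES_FTP}/Files/{name}.{EXTENSIONS[model_format]}"


print(model_file_path("patho_top", "h5"))
```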

cnn_top_model(bottom_model, features=True, output_size=20, tune_layers=3, kernel_size=3, activation='sigmoid', kernel_regulariser=None, activity_regulariser=None, dropout=0)

Add a 1D CNN top model to a trained Tensorflow Keras model

Add a fresh prediction head onto a trained model to finetune it for a new application. This was used to create the Sequence UNET ClinVar pathogenicity prediction top model. The new layer is a 1D CNN designed to interface with the Sequence UNET architecture. A similar approach can be used to add a different final classification head.

bottom_model: PSSM model to train on top of. bottom_model=None creates a simple CNN model. features: Start from model features rather than predicted output. tune_layers: Number of top layers to set as trainable.

Parameters:

bottom_model (tf.keras.Model) – Model to train on top of.
features (bool) – Use the final feature layer as input to the top model, replacing the current prediction layer, instead of starting from the current prediction layer.
output_size (int) – Number of filters in the new output layer. This is the number of predictions made for each sequence position.
tune_layers (int) – Number of layers in the bottom model to unfreeze and train. Use 0 to only train the new layer and -1 to keep all layers trainable.
kernel_size (int) – CNN kernel width
activation (str) – Activation function for the top layer, determining the type of prediction made.
kernel_regulariser (str or None) – Kernel regularisation to apply (e.g. l1, l2, l1/l2). See Keras Conv1D kernel_regularizer options for details.
activity_regulariser (str or None) – Activity regularisation to apply (e.g. l1, l2, l1/l2). See Keras Conv1D activity_regularizer options for details.
dropout (float) – Dropout to apply before new CNN layer.

Returns:

The Sequence UNET model

Return type:

tf.keras.Model
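The tune_layers convention described above (0 trains only the new layer, -1 keeps everything trainable, otherwise the top N bottom-model layers are unfrozen) can be sketched without TensorFlow as a list of trainable flags. The function name `trainable_flags` is hypothetical and this is not the library's implementation; it only illustrates the documented semantics, assuming "top" means the layers closest to the output.

```python
def trainable_flags(num_bottom_layers, tune_layers=3):
    """Return one flag per bottom-model layer, ordered from input to output.

    Illustrative only: in a Keras model the flags would typically be applied
    by setting layer.trainable on each layer of the bottom model.
    """
    if tune_layers < -1:
        raise ValueError("tune_layers must be -1, 0 or a positive integer")
    if tune_layers == -1:
        # -1 keeps every layer of the bottom model trainable.
        return [True] * num_bottom_layers
    # Unfreeze at most the layers that actually exist.
    unfrozen = min(tune_layers, num_bottom_layers)
    return [False] * (num_bottom_layers - unfrozen) + [True] * unfrozen


print(trainable_flags(5, tune_layers=0))   # only the new head would train
print(trainable_flags(5, tune_layers=3))   # last three bottom layers unfrozen
print(trainable_flags(5, tune_layers=-1))  # everything trainable
```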

download_all_models(root='.', model_format='tf', use_ftp=True)

Download all trained Sequence UNET models.

Download all trained Sequence UNET models from BioStudies. See download_trained_model for a description of the available models. FTP download is recommended to give the best transfer speeds and lowest load for BioStudies, but an HTTP fallback is also provided if FTP isn’t possible.

Parameters:

root (str) – Root directory to download to.
model_format ({'tf', 'h5'}) – Format to download the models in.
use_ftp (bool) – Download over FTP rather than HTTP.

Returns:

Dictionary of paths to the downloaded models, keyed by model ID.

Return type:

Dict

download_trained_model(model, root='.', model_format='tf', use_ftp=True)

Download a trained Sequence UNET model.

Download the specified trained Sequence UNET model from BioStudies. Models are specified using the IDs listed in models.MODELS, which also map to the BioStudies file names. Each model comes in a sequence-only version (X) and a version accepting additional structural input (pregraph_X). FTP download is recommended to give the best transfer speeds and lowest load for BioStudies, but an HTTP fallback is also provided if FTP isn’t possible.

Available models:

freq_classifier:

Classifier predicting whether variants occur above or below 0.01 observation frequency in a cross-species multiple sequence alignment, as a proxy for deleteriousness.

pssm_predictor:

Model predicting multiple alignment frequencies, which can be converted into a PSSM.

patho_top:

Classifier predicting variant pathogenicity, trained as a new classifier head for freq_classifier on ClinVar data.

patho_finetune:

Classifier predicting variant pathogenicity, trained by finetuning freq_classifier on ClinVar data.

pregraph_freq_classifier:

Equivalent to freq_classifier taking structural input.

pregraph_pssm_predictor:

Equivalent to pssm_predictor taking structural input.

pregraph_patho_top:

Equivalent to patho_top taking structural input.

pregraph_patho_finetune:

Equivalent to patho_finetune taking structural input.

Parameters:

model (str) – Sequence UNET model to download (see options in description and MODELS).
root (str) – Root directory to download to.
model_format ({'tf', 'h5'}) – Format to download the models in.
use_ftp (bool) – Download over FTP rather than HTTP.

Returns:

Path to downloaded model.

Return type:

str
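The X / pregraph_X naming convention makes it straightforward to move between the sequence-only and structural variants of a model. The sketch below checks the convention against the documented MODELS list; the helper names `base_models` and `structural_variant` are hypothetical and not part of the package.

```python
# Model IDs copied from the documented MODELS list.
MODELS = [
    "freq_classifier", "pregraph_freq_classifier",
    "pssm_predictor", "pregraph_pssm_predictor",
    "patho_top", "pregraph_patho_top",
    "patho_finetune", "pregraph_patho_finetune",
]
PREFIX = "pregraph_"


def base_models():
    """Return the sequence-only model IDs (those without the prefix)."""
    return [name for name in MODELS if not name.startswith(PREFIX)]


def structural_variant(name):
    """Return the ID of the structure-aware version of a model."""
    base = name[len(PREFIX):] if name.startswith(PREFIX) else name
    return PREFIX + base


for name in base_models():
    print(name, "->", structural_variant(name))
```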

load_trained_model(model, root='.', download=False, model_format='tf', use_ftp=True)

Load Sequence UNET models.

Load a trained Sequence UNET model downloaded with download_trained_model or directly from BioStudies. This function provides a convenient wrapper around the Keras load_model function, accepting either a path or a model name and passing the appropriate custom_objects dictionary. If using load_model directly, the CUSTOM_OBJECTS dictionary is required so that TensorFlow can locate the custom Sequence UNET layers and metrics.

Parameters:

model (str) – Model to load. This can be a direct path or a model name (see MODELS), with paths taking precedence. If a name is passed it will be searched for in root.
root (str) – Root directory to locate models passed by name. It is ignored if a full path is passed.
download (bool) – Download the requested model if it is not located.
model_format (str) – Format for the model if model is an ID rather than a full path.
use_ftp (bool) – Download over FTP rather than HTTP.

Returns:

The trained Sequence UNET model

Return type:

tf.keras.Model
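The two loading routes described above might look roughly like the following. This is a usage sketch based only on the signatures documented here; it assumes the sequence_unet package and TensorFlow are installed, and the wrapper function names, the "models" directory and the choice of model ID are arbitrary examples. The imports sit inside the functions so that defining them does not require either package.

```python
def load_by_name(model_id="freq_classifier", root="models"):
    """Load a model by ID via the wrapper, downloading it if it is missing."""
    from sequence_unet import models

    return models.load_trained_model(model_id, root=root, download=True)


def load_with_keras(path):
    """Load a model file directly with Keras.

    CUSTOM_OBJECTS must be passed so that Keras can resolve the custom
    Sequence UNET layers and metrics stored in the model.
    """
    import tensorflow as tf
    from sequence_unet.models import CUSTOM_OBJECTS

    return tf.keras.models.load_model(path, custom_objects=CUSTOM_OBJECTS)
```

Calling `load_by_name()` with download=True may trigger a network transfer from BioStudies, so it is best run once and the resulting files reused.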

sequence_unet(filters=8, kernel_size=5, num_layers=4, output_size=20, dropout=0, graph_layers=None, graph_activation='relu', conv_activation='relu', pred_activation='sigmoid', kernel_regulariser=None, batch_normalisation=False)

Initialise a Sequence UNET TensorFlow Keras model.

Generate a functional-style Keras Sequence UNET model. The model takes a one-hot encoded sequence and, optionally, structural information in the form of an inverted residue distance matrix (see graph_cnn.contact_graph), and generates a matrix of predictions for each position in the sequence. It is a 1D CNN based model that uses a U-shaped compression and decompression architecture to spread information through the sequence.

Parameters:

filters (int) – Number of convolutional filters on the top layers. Lower layers have F x 2^N filters, so large numbers of filters in deep networks quickly scale to very many weights.
kernel_size (int) – Width of 1D convolutional kernels.
num_layers (int) – Number of down/up sampling layers.
output_size (int) – Number of output predictions for each position in the sequence, i.e. the number of columns in the output matrix.
dropout (float) – Proportion of neurons dropped out between layers.
graph_layers (int or None) – Number of GraphCNN layers preprocessing structural features to feed into the main network.
graph_activation (str) – Activation function for GraphCNN layers.
conv_activation (str) – Activation function for CNN layers.
pred_activation (str) – Activation function for final layer. Sigmoid was used for frequency classification and softmax for PSSM frequency prediction.
kernel_regulariser (str or None) – Type of kernel regularisation to apply (l1, l2, l1/l2). See Keras Conv1D kernel_regularizer options for details.
batch_normalisation (bool) – Apply batch normalisation between contraction and upsampling layers.

Returns:

The Sequence UNET model

Return type:

tf.keras.Model
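To give a feel for how quickly the filter count grows with depth, this sketch evaluates F x 2^N for each layer. It assumes N is the zero-based depth of the layer, which the parameter description does not state explicitly; the function name `filters_per_layer` is hypothetical.

```python
def filters_per_layer(filters=8, num_layers=4):
    """Return the assumed filter count at each depth, top layer first."""
    return [filters * 2 ** depth for depth in range(num_layers)]


print(filters_per_layer())                          # [8, 16, 32, 64]
print(filters_per_layer(filters=32, num_layers=6))  # [32, 64, 128, 256, 512, 1024]
```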

Notes

Expected input: (B, N, 20) arrays with B batches of Nx20 matrices, representing sequences with one-hot encoding on each row. An additional (B, N, N) array with NxN contact graphs is required when graph_layers is not None. The UNET architecture means N must be divisible by 2^(num_layers - 1). This can be achieved with padding and masking.
Output: (B, N, 20) arrays of B batches of Nx20 arrays, with the predicted features of each AA at each position as rows.
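The divisibility requirement on N can be met by rounding the sequence length up to the next multiple of 2^(num_layers - 1). The sketch below uses plain Python lists; appending all-zero rows at the end and returning a boolean mask are assumptions made for illustration, and the function names are hypothetical. The masking convention expected by the package's own losses and metrics may differ.

```python
def padded_length(n, num_layers=4):
    """Round n up to the next multiple of 2^(num_layers - 1)."""
    multiple = 2 ** (num_layers - 1)
    return -(-n // multiple) * multiple  # ceiling division, then rescale


def pad_one_hot(rows, num_layers=4, width=20):
    """Pad an N x width one-hot matrix with zero rows and build a mask.

    Returns (padded_rows, mask), where mask is True for real positions
    and False for padding (an assumed convention).
    """
    target = padded_length(len(rows), num_layers)
    padding = [[0.0] * width for _ in range(target - len(rows))]
    mask = [True] * len(rows) + [False] * len(padding)
    return rows + padding, mask


print(padded_length(100))  # 104, since N must be a multiple of 8 here
```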