sequence_unet.predict

Functions for predicting with Sequence UNET models.

class SequenceUNETMapFunction(num_layers=4, threshold=None, contact_graph=False, weights=False, freq_input=False)

Bases: LabeledFunction

ProteinNetPy LabeledFunction returning the required data for the sequence UNET model.

LabeledFunction taking a ProteinNetPy record and returning the data required to run Sequence UNET on that record. This can be used with a ProteinNetPy map to generate input data for training or predictions. The function outputs a tuple containing the model input, an N x 20 labels matrix and, optionally, a position weights vector of length N. The model input is itself a tuple containing the N x 20 one-hot encoded sequence and, optionally, the protein’s amino acid contact graph.

num_layers

Expected number of layers in the target Sequence UNET model.

Type:

int

threshold

If set, the function returns categorical output, with variants below this threshold classed as deleterious.

Type:

float or None

contact_graph

The function outputs a contact graph in addition to the one-hot encoded sequence in the model input tuple (the first element of the returned tuple).

Type:

bool

weights

If True, the function outputs per-position sample weights as the third element of the output tuple, with true positions weighted 1 and padded positions weighted 0.

Type:

bool

freq_input

The function outputs true PSSM frequencies instead of one-hot encoded sequences.

Type:

bool

output_shapes

Potentially nested tuple describing the shape of the arrays output by the function.

Type:

tuple

output_types

Potentially nested tuple describing the type of the arrays output by the function.

Type:

tuple
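The position weights described under the weights attribute can be illustrated with a small NumPy sketch. This is not the library's implementation: position_weights is a hypothetical helper, and the loss values are made-up toy numbers. It only shows how a vector of 1s for true positions and 0s for padded positions keeps padding out of an averaged per-position quantity.

```python
import numpy as np


def position_weights(true_length, padded_length):
    """Hypothetical helper: 1 for real sequence positions, 0 for padding rows."""
    weights = np.zeros(padded_length, dtype=np.float32)
    weights[:true_length] = 1.0
    return weights


# A sequence of 3 residues padded to 8 rows.
weights = position_weights(3, 8)

# Toy per-position values (0, 1, ..., 7); only the first 3 should count.
per_position_loss = np.arange(8, dtype=np.float32)
masked_mean = (per_position_loss * weights).sum() / weights.sum()
print(weights, masked_mean)
```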

one_hot_sequence(seq)

Convert a Biopython or string amino acid sequence to a one-hot representation.

Parameters:

seq (BioPython Sequence or str) – Sequence to convert to one-hot matrix

Returns:

One-hot encoded sequence array

Return type:

numpy.array
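As a rough illustration of what a one-hot encoded sequence looks like, the sketch below builds an N x 20 matrix with NumPy. It is not the library's code: one_hot_sketch is a hypothetical helper, the column order (the 20 standard amino acids sorted by one-letter code) is an assumption, and leaving unknown residues as all-zero rows is a choice made for this example only.

```python
import numpy as np

# Assumed column order: 20 standard amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_sketch(seq):
    """Hypothetical helper: return an N x 20 one-hot matrix for a sequence."""
    seq = str(seq)  # accepts plain strings and objects coercible to str
    mat = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for row, aa in enumerate(seq):
        col = AA_INDEX.get(aa)
        if col is not None:  # unknown residues (e.g. "X") stay as all-zero rows
            mat[row, col] = 1.0
    return mat


encoded = one_hot_sketch("ACDX")
print(encoded.shape)
```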

padding_rows(length, layers=6)

Calculate the number of padding rows for an input sequence matrix.

Calculate the number of padding rows required by Sequence UNET for an input sequence matrix. This number of 0 rows should be added to the end of the array so it can be evenly halved a sufficient number of times.

Parameters:
  • length (int) – Sequence length

  • layers (int) – Number of layers in the Sequence UNET model. All pre-trained models have 6 layers.

Returns:

Number of padding rows required.

Return type:

int
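The idea of padding so that the matrix "can be evenly halved a sufficient number of times" can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: padding_rows_sketch and pad_matrix are hypothetical helpers, and the number of halvings is passed explicitly because how it relates to the layers argument is not specified here.

```python
import numpy as np


def padding_rows_sketch(length, halvings):
    """Hypothetical helper: rows needed so `length` divides by 2 ** halvings."""
    block = 2 ** halvings
    return (-length) % block


def pad_matrix(mat, halvings):
    """Append all-zero rows to the end of an N x C matrix."""
    n_pad = padding_rows_sketch(mat.shape[0], halvings)
    # np.pad defaults to constant padding with zeros.
    return np.pad(mat, ((0, n_pad), (0, 0)))


mat = np.ones((100, 20))
padded = pad_matrix(mat, halvings=5)
print(padded.shape)  # (128, 20): 100 rows padded up to the next multiple of 32
```

A length that is already a multiple of the block size needs no padding, which is why the sketch uses `(-length) % block` rather than `block - length % block`.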

predict_proteinnet(model, data, layers=6, contacts=False, wide=False, make_pssm=False)

Generator yielding predictions from a ProteinNet Dataset.

Parameters:
  • model (keras.Model) – Sequence UNET model to predict with (or another with the same input/output signature).

  • data (ProteinNetPy Dataset) – ProteinNetPy Dataset to predict from. Use filter functions on this dataset to control what predictions are made.

  • layers (int) – Number of layers in the Sequence UNET model. All pre-trained models have 6 layers.

  • contacts (bool) – Model requires structural contact graph input.

  • wide (bool) – Return wide format data frames, with one column for each mutant amino acid prediction instead of a long format table with mut and pred columns.

  • make_pssm (bool) – Convert predicted frequencies to PSSM scale scores. If used with predictions other than amino acid frequencies in an N x 20 matrix with alphabetical amino acid columns, this will create nonsensical data.

Yields:

Pandas DataFrame – Data frame of predictions, including columns for pdb_id, chain, position, wt, mut and prediction. If wide=True a column is included for each mutant amino acid prediction instead of mut and pred columns.
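The difference between the long and wide output formats can be shown with a small pandas sketch. The table below is made-up toy data, not real model output, and the column names follow the description above (with pred assumed as the name of the prediction column). The reshaping shown is only an illustration of the two layouts, not the function's internal code.

```python
import pandas as pd

# Toy long-format table: one row per (position, mutant amino acid).
long_df = pd.DataFrame(
    {
        "pdb_id": ["1ABC"] * 4,
        "chain": ["A"] * 4,
        "position": [1, 1, 2, 2],
        "wt": ["M", "M", "K", "K"],
        "mut": ["A", "C", "A", "C"],
        "pred": [0.1, 0.2, 0.3, 0.4],
    }
)

# Wide format: one row per position, one column per mutant amino acid.
id_cols = ["pdb_id", "chain", "position", "wt"]
wide_df = long_df.set_index(id_cols + ["mut"])["pred"].unstack("mut").reset_index()
wide_df.columns.name = None  # drop the leftover "mut" label on the column index

print(wide_df)
```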

predict_sequence(model, sequences, layers=6, variants=None, wide=False, make_pssm=False)

Predict values from an iterable of BioPython SeqRecords, strings or objects coercible to strings.

Parameters:
  • model (keras.Model) – Sequence UNET model to predict with (or another with the same input/output signature).

  • sequences (Iterable of str or BioPython SeqRecords) – Sequences to predict from.

  • layers (int) – Number of layers in the Sequence UNET model. All pre-trained models have 6 layers.

  • variants (Pandas DataFrame or None) – Filter variants to those contained in this data frame. It must have the same identifying columns as the output table: gene, position and wt if wide=True, and gene, position, wt and mut if not.

  • wide (bool) – Return wide format data frames, with one column for each mutant amino acid prediction instead of a long format table with mut and pred columns.

  • make_pssm (bool) – Convert predicted frequencies to PSSM scale scores. If used with predictions other than amino acid frequencies in an N x 20 matrix with alphabetical amino acid columns, this will create nonsensical data.

Yields:

Pandas DataFrame – Data frame of predictions, including columns for gene, position, wt, mut and prediction. If wide=True a column is included for each mutant amino acid prediction instead of mut and pred columns. With SeqRecord input gene contains the ID, otherwise it’s a numerical ID.
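The effect of the variants filter on a long-format table can be approximated with a pandas inner merge on the identifying columns. This is a sketch of the idea only, not the function's internal logic: both tables are made-up toy data, and pred is assumed as the name of the prediction column.

```python
import pandas as pd

# Toy long-format predictions for one sequence.
preds = pd.DataFrame(
    {
        "gene": ["geneA"] * 4,
        "position": [1, 1, 2, 2],
        "wt": ["M", "M", "K", "K"],
        "mut": ["A", "C", "A", "C"],
        "pred": [0.1, 0.2, 0.3, 0.4],
    }
)

# Variants of interest, using the same identifying columns as the output.
variants = pd.DataFrame(
    {
        "gene": ["geneA", "geneA"],
        "position": [1, 2],
        "wt": ["M", "K"],
        "mut": ["C", "A"],
    }
)

# Keep only predictions whose identifying columns appear in `variants`.
keys = ["gene", "position", "wt", "mut"]
filtered = preds.merge(variants[keys].drop_duplicates(), on=keys, how="inner")
print(filtered)
```

For wide-format output the same approach would apply with mut dropped from the key columns.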