sequence_unet.predict
Functions for predicting with Sequence UNET models.
- class SequenceUNETMapFunction(num_layers=4, threshold=None, contact_graph=False, weights=False, freq_input=False)
Bases:
LabeledFunction
ProteinNetPy LabeledFunction returning the data required by the Sequence UNET model.
LabeledFunction taking a ProteinNetPy record and returning the data required to run Sequence UNET on that record. This can be used with a ProteinNetPy map to generate input data for training or predictions. The function outputs a tuple containing the model input, an N x 20 label matrix and, optionally, a position weights vector of length N. The model input is itself a tuple containing the N x 20 one-hot encoded sequence and, optionally, the protein’s amino acid contact graph.
- num_layers
Expected number of layers in the target Sequence UNET model.
- Type:
int
- threshold
If set, the function returns categorical output, with variants below this threshold classed as deleterious.
- Type:
float or None
- contact_graph
If True, the function outputs a contact graph in addition to the one-hot encoded sequence in the model input tuple (the first element of the returned tuple).
- Type:
bool
- weights
If True, the function outputs per-position sample weights as the third element of the output tuple, with true positions weighted 1 and padded positions weighted 0.
- Type:
bool
- freq_input
If True, the function outputs true PSSM frequencies instead of one-hot encoded sequences.
- Type:
bool
- output_shapes
Potentially nested tuple describing the shape of the arrays output by the function.
- Type:
tuple
- output_types
Potentially nested tuple describing the type of the arrays output by the function.
- Type:
tuple
- one_hot_sequence(seq)
Convert a Biopython or string amino acid sequence to a one-hot representation.
- Parameters:
seq (BioPython Sequence or str) – Sequence to convert to one-hot matrix
- Returns:
One-hot encoded sequence array
- Return type:
numpy.array
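As an illustration of the encoding described above, the following is a minimal sketch and not the library's implementation. The name `one_hot_sketch`, the alphabetical one-letter column order and the all-zero treatment of unknown residues are assumptions made for this example.

```python
import numpy as np

# Assumed column order: the 20 standard amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_sketch(seq):
    """Return an N x 20 integer matrix with a single 1 per recognised residue."""
    seq = str(seq).upper()  # str() also accepts objects coercible to strings
    encoded = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=int)
    for row, aa in enumerate(seq):
        col = AA_INDEX.get(aa)
        if col is not None:  # assumption: unknown residues are left as all-zero rows
            encoded[row, col] = 1
    return encoded


matrix = one_hot_sketch("ACX")
```

Here `matrix` has one row per residue and 20 columns; the third row is all zeros because `X` is not one of the 20 standard residues.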
- padding_rows(length, layers=6)
Calculate the number of padding rows for an input sequence matrix.
Calculate the number of padding rows required by Sequence UNET for an input sequence matrix. This number of 0 rows should be added to the end of the array so it can be evenly halved a sufficient number of times.
- Parameters:
length (int) – Sequence length
layers (int) – Number of layers in the Sequence UNET model. All pre-trained models have 6 layers.
- Returns:
Number of padding rows required.
- Return type:
int
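The padding arithmetic can be sketched as follows. This is a hedged illustration and not the library's code: it assumes a model with `layers` layers halves the sequence dimension `layers - 1` times, so the padded length must be a multiple of `2 ** (layers - 1)`. The function names are hypothetical.

```python
def padding_rows_sketch(length, layers=6):
    """Rows of zeros needed so `length` becomes a multiple of 2 ** (layers - 1)."""
    block = 2 ** (layers - 1)  # assumption: layers - 1 halvings are required
    return (-length) % block   # distance up to the next multiple of block


def position_weights_sketch(length, layers=6):
    """Weight vector with 1 for true positions and 0 for padded positions."""
    return [1] * length + [0] * padding_rows_sketch(length, layers)


pad = padding_rows_sketch(100)          # 100 residues, 6 layers, block of 32
weights = position_weights_sketch(100)  # padded up to length 128
```

Under this assumption a sequence whose length is already a multiple of the block size needs no padding.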
- predict_proteinnet(model, data, layers=6, contacts=False, wide=False, make_pssm=False)
Generator yielding predictions from a ProteinNet Dataset
- Parameters:
model (keras.Model) – Sequence UNET model to predict with (or another with the same input/output signature).
data (ProteinNetPy Dataset) – ProteinNetPy Dataset to predict from. Use filter functions on this dataset to control what predictions are made.
layers (int) – Number of layers in the Sequence UNET model. All pre-trained models have 6 layers.
contacts (bool) – Model requires structural contact graph input.
wide (bool) – Return wide format data frames, with one column for each mutant amino acid prediction instead of a long format table with mut and pred columns.
make_pssm (bool) – Convert predicted frequencies to PSSM scale scores. If used with predictions other than amino acid frequencies in an N x 20 matrix with alphabetical amino acid columns, this will create nonsensical data.
- Yields:
Pandas DataFrame – Data frame of predictions, including columns for pdb_id, chain, position, wt, mut and prediction. If wide=True a column is included for each mutant amino acid prediction instead of mut and pred columns.
- predict_sequence(model, sequences, layers=6, variants=None, wide=False, make_pssm=False)
Predict values from an iterable of BioPython SeqRecords, strings or objects coercible to strings.
- Parameters:
model (keras.Model) – Sequence UNET model to predict with (or another with the same input/output signature).
sequences (Iterable of str or BioPython SeqRecords) – Sequences to predict from.
layers (int) – Number of layers in the Sequence UNET model. All pre-trained models have 6 layers.
variants (Pandas DataFrame or None) – Filter variants to those contained in this data frame. It must have the same identifying columns as the output table: gene, position and wt if wide=True, and gene, position, wt and mut if not.
wide (bool) – Return wide format data frames, with one column for each mutant amino acid prediction instead of a long format table with mut and pred columns.
make_pssm (bool) – Convert predicted frequencies to PSSM scale scores. If used with predictions other than amino acid frequencies in an N x 20 matrix with alphabetical amino acid columns, this will create nonsensical data.
- Yields:
Pandas DataFrame – Data frame of predictions, including columns for gene, position, wt, mut and prediction. If wide=True a column is included for each mutant amino acid prediction instead of mut and pred columns. With SeqRecord input, the gene column contains the record ID; otherwise it is a numerical ID.
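To make the difference between the long and wide formats concrete, the sketch below reshapes a small hand-written long table into a wide one with pandas. The data values are made up for illustration, the column names follow the description above, and the reshaping shown is not necessarily how the library builds its wide output.

```python
import pandas as pd

# Hypothetical long-format predictions: one row per (position, mutant) pair.
long_df = pd.DataFrame({
    "gene": ["g1", "g1", "g1", "g1"],
    "position": [1, 1, 2, 2],
    "wt": ["M", "M", "K", "K"],
    "mut": ["A", "C", "A", "C"],
    "pred": [0.1, 0.2, 0.3, 0.4],
})

# Wide format: one row per position, one column per mutant amino acid.
wide_df = (
    long_df.set_index(["gene", "position", "wt", "mut"])["pred"]
    .unstack("mut")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "mut" label on the column axis
```

The wide table keeps the identifying columns gene, position and wt, which matches the columns a `variants` filter needs when wide=True.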