Programming Interface
Data module
-
class
scvae.data.
DataSet
(input_file_or_name, data_format=None, title=None, specifications=None, values=None, labels=None, example_names=None, feature_names=None, batch_indices=None, feature_selection=None, example_filter=None, preprocessing_methods=None, directory=None, **kwargs)¶ Data set class for working with scVAE.
To easily handle values, labels, metadata, and so on for data sets, scVAE uses this class. Other data formats will have to be converted to it.
- Parameters
input_file_or_name (str) – Path to a data set file or a title for a supported data set (see Data sets).
data_format (str, optional) – Format used to store data set (see Custom data sets).
title (str, optional) – Title of data set for use in, e.g., plots.
specifications (dict, optional) – Metadata for data set.
values (2-d NumPy array, optional) – Matrix for (count) values with rows representing examples/cells and columns features/genes.
labels (1-d NumPy array, optional) – List of labels for examples/cells in the same order as for values.
example_names (1-d NumPy array, optional) – List of names for examples/cells in the same order as for values.
feature_names (1-d NumPy array, optional) – List of names for features/genes in the same order as for values.
batch_indices (1-d NumPy array, optional) – List of batch indices for examples/cells in the same order as for values.
feature_selection (list, optional) – Method and parameters for feature selection in a list.
example_filter (list, optional) – Method and parameters for example filtering in a list.
preprocessing_methods (list, optional) – Ordered list of preprocessing methods applied to (count) values: "normalise" (each feature/gene), "log", and "exp".
directory (str, optional) – Directory where data set is saved.
-
name
¶ Short name for data set used in filenames.
-
title
¶ Title of data set for use in, e.g., plots.
-
specifications
¶ Metadata for data set. If a JSON file was provided, this would contain the contents.
-
data_format
¶ Format used to store data set.
-
terms
¶ Dictionary of terms to use for, e.g., "example" (cell), "feature" (gene), and "class" (cell type).
-
values
¶ Matrix for (count) values with rows representing examples/cells and columns features/genes.
-
number_of_examples
¶ The number of examples/cells.
-
number_of_features
¶ The number of features/genes.
-
number_of_classes
¶ The number of classes/cell types.
-
feature_selection_method
¶ The method used for selecting features.
-
feature_selection_parameters
¶ List of parameters for the feature selection method.
-
example_filter_method
¶ The method used for filtering examples.
-
example_filter_parameters
¶ List of parameters for the example filtering method.
-
kind
¶ The kind of data set: "full", "training", "validation", or "test".
-
version
¶ The version of the data set: "original", "reconstructed", or latent ("z" or "y").
-
property
number_of_values
¶ Total number of (count) values in matrix.
-
load
()¶ Load data set.
-
split
(method=None, fraction=None)¶ Split data set into subsets.
The data set is split into a training set to train a model, a validation set to validate the model during training, and a test set to evaluate the model after training.
- Parameters
method (str, optional) – The method to use: "random" or "sequential".
fraction (float, optional) – The fraction to use for training and, optionally, validation.
- Returns
Training, validation, and test sets.
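The following is a minimal sketch of how a three-way split by method and fraction could work. It is not scVAE's implementation: the helper `split_indices` and its `seed` argument are hypothetical, and it assumes that `fraction` first separates training plus validation from test and is then applied again to separate training from validation.

```python
import random


def split_indices(number_of_examples, method="random", fraction=0.9, seed=0):
    """Return (training, validation, test) lists of example indices.

    Assumption (not confirmed by the scVAE documentation): `fraction` is
    applied twice, first to hold out the test set and then to hold out the
    validation set from the remaining examples.
    """
    if method not in ("random", "sequential"):
        raise ValueError(f"Unknown splitting method: {method!r}")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")

    indices = list(range(number_of_examples))
    if method == "random":
        # Shuffle reproducibly; "sequential" keeps the original order.
        random.Random(seed).shuffle(indices)

    n_training_and_validation = int(fraction * number_of_examples)
    n_training = int(fraction * n_training_and_validation)

    training = indices[:n_training]
    validation = indices[n_training:n_training_and_validation]
    test = indices[n_training_and_validation:]
    return training, validation, test


training, validation, test = split_indices(1000, method="sequential")
print(len(training), len(validation), len(test))
```

Returning index lists rather than copies of the value matrix keeps the sketch independent of how values, labels, and names are stored; the same indices can be used to slice each of them consistently.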
-
clear
()¶ Clear data set.
Models module
-
class
scvae.models.
VariationalAutoencoder
(feature_size, latent_size=None, hidden_sizes=None, reconstruction_distribution=None, number_of_reconstruction_classes=None, latent_distribution=None, minibatch_normalisation=None, batch_correction=None, number_of_batches=None, number_of_warm_up_epochs=None, log_directory=None, **kwargs)¶ Variational auto-encoder class.
- Parameters
feature_size (int) – The number of features/genes in the data set to model.
latent_size (int) – The number of dimensions to use for the latent space.
hidden_sizes (list(int)) – A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.
reconstruction_distribution (str, optional) – The name of the reconstruction distribution (or likelihood function; see Training a model).
number_of_reconstruction_classes (int, optional) – The number of counts to model directly, starting from zero (see Training a model).
latent_distribution (str, optional) – The name of the latent prior distribution: "gaussian" or "unit_variance_gaussian" (see Training a model).
minibatch_normalisation (bool, optional) – If True, normalise each random minibatch of data when training or evaluating the model.
batch_correction (bool, optional) – If True, and if batches are present in the data set to model, perform batch correction.
number_of_batches (int, optional) – The number of batches in the data set to model. Required if batch_correction is True.
number_of_warm_up_epochs (int, optional) – The number of epochs at the start of training during which the KL divergence is down-weighted. The weight is increased linearly from 0 to 1 over this number of epochs.
log_directory (str, optional) – Directory where model is saved.
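The ordering rule for hidden_sizes can be illustrated with a small sketch. The helper `layer_sizes` is hypothetical and only lists layer widths; the real networks also produce distribution parameters at their outputs, which is left out here.

```python
def layer_sizes(feature_size, latent_size, hidden_sizes):
    """List layer widths for the inference and generative networks.

    Illustrative only: the inference network uses `hidden_sizes` in the
    given order, and the generative network uses them in reverse.
    """
    inference = [feature_size, *hidden_sizes, latent_size]
    generative = [latent_size, *reversed(hidden_sizes), feature_size]
    return inference, generative


inference, generative = layer_sizes(
    feature_size=2000, latent_size=2, hidden_sizes=[500, 100])
print(inference)
print(generative)
```

With two hidden layers of 500 and 100 units, the inference network narrows from the features towards the latent space, and the generative network widens again in mirror order.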
-
feature_size
¶ The number of features/genes which can be modelled.
-
latent_size
¶ The number of dimensions of the latent space.
-
hidden_sizes
¶ A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.
-
reconstruction_distribution
¶ An instance of the reconstruction distribution (or likelihood function) class used by the model.
-
number_of_reconstruction_classes
¶ The number of counts modelled directly, starting from zero.
-
latent_distribution
¶ An instance of the latent prior distribution class used by the model.
-
minibatch_normalisation
¶ If True, normalise each random minibatch of data when training or evaluating the model.
-
batch_correction
¶ If True, and if batches are present in the data set to model, perform batch correction.
-
number_of_batches
¶ The number of batches in the data set to model, when batch_correction is True.
-
number_of_warm_up_epochs
¶ The number of epochs at the start of training during which the KL divergence is down-weighted. The weight is increased linearly from 0 to 1 over this number of epochs.
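A linear warm-up of the KL weight can be sketched as follows. The function `kl_weight` is hypothetical, and it assumes epochs are counted from zero and that the weight reaches 1 exactly at epoch `number_of_warm_up_epochs`; scVAE may count epochs differently.

```python
def kl_weight(epoch, number_of_warm_up_epochs):
    """Weight on the KL divergence term at a given (zero-based) epoch.

    Assumption: the weight grows linearly from 0 and is capped at 1 once
    the warm-up period is over. With no warm-up, the weight is always 1.
    """
    if not number_of_warm_up_epochs:
        return 1.0
    return min(1.0, epoch / number_of_warm_up_epochs)


# The weighted objective would then take the form
# reconstruction_term - kl_weight(epoch, n) * kl_divergence.
for epoch in (0, 5, 10, 20):
    print(epoch, kl_weight(epoch, number_of_warm_up_epochs=10))
```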
-
property
name
¶ Short name for model used in filenames.
-
property
description
¶ Description of model.
-
property
parameters
¶ Trainable parameters in the model.
-
train
(training_set, validation_set=None, number_of_epochs=None, minibatch_size=None, learning_rate=None, run_id=None, new_run=None, reset_training=None, **kwargs)¶ Train model.
- Parameters
training_set (DataSet) – Data set used to train model.
validation_set (DataSet, optional) – Data set used to validate model during training, if given.
number_of_epochs (int, optional) – The number of epochs to train the model.
minibatch_size (int, optional) – The size of the random minibatches used at each step of training.
learning_rate (float, optional) – The learning rate used at each step of training.
run_id (str, optional) – ID used to identify a certain run of the model.
new_run (bool, optional) – If True, train a model anew as a separate run with an automatically generated ID.
reset_training (bool, optional) – If True, reset model by removing saved parameters for the model.
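To make the roles of number_of_epochs and minibatch_size concrete, here is a minimal sketch of drawing random minibatches for one epoch. The generator `random_minibatches` and its `seed` argument are hypothetical and do not reflect how scVAE batches data internally.

```python
import random


def random_minibatches(number_of_examples, minibatch_size, seed=None):
    """Yield lists of example indices covering the data set once.

    Illustrative only: indices are shuffled, then cut into consecutive
    chunks, so the last minibatch may be smaller than the others.
    """
    indices = list(range(number_of_examples))
    random.Random(seed).shuffle(indices)
    for start in range(0, number_of_examples, minibatch_size):
        yield indices[start:start + minibatch_size]


number_of_epochs = 2
for epoch in range(number_of_epochs):
    sizes = [len(batch) for batch in random_minibatches(250, 100, seed=epoch)]
    print(epoch, sizes)
```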
-
sample
(sample_size=None, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False)¶ Sample from trained model.
- Parameters
sample_size (int, optional) – The number of samples to draw from the model.
minibatch_size (int, optional) – The size of the random minibatches used at each step of training.
run_id (str, optional) – ID used to identify a certain run of the model.
use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.
use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.
- Returns
A data set of generated examples/cells as well as a dictionary of data sets of samples for the two latent variables.
-
evaluate
(evaluation_set, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False, **kwargs)¶ Evaluate trained model.
- Parameters
evaluation_set (DataSet) – Data set used to evaluate model.
minibatch_size (int, optional) – The size of the random minibatches used at each step of training.
run_id (str, optional) – ID used to identify a certain run of the model.
use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.
use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.
- Returns
A data set of reconstructed examples/cells as well as a data set of the latent variable (wrapped in a dictionary for compatibility with GaussianMixtureVariationalAutoencoder).
-
class
scvae.models.
GaussianMixtureVariationalAutoencoder
(feature_size, latent_size=None, hidden_sizes=None, reconstruction_distribution=None, number_of_reconstruction_classes=None, latent_distribution=None, prior_probabilities_method=None, prior_probabilities=None, number_of_latent_clusters=None, minibatch_normalisation=None, batch_correction=None, number_of_batches=None, number_of_warm_up_epochs=None, log_directory=None, **kwargs)¶ Gaussian-mixture variational auto-encoder class.
- Parameters
feature_size (int) – The number of features/genes in the data set to model.
latent_size (int) – The number of dimensions to use for the latent space.
hidden_sizes (list(int)) – A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.
reconstruction_distribution (str, optional) – The name of the reconstruction distribution (or likelihood function; see Training a model).
number_of_reconstruction_classes (int, optional) – The number of counts to model directly, starting from zero (see Training a model).
latent_distribution (str, optional) – The name of the latent prior distribution: "gaussian_mixture" or "full_covariance_gaussian_mixture" (see Training a model).
prior_probabilities_method (str, optional) – Method for setting the mixture coefficients for the latent prior distribution: "uniform" distribution, "custom" (provide probabilities to prior_probabilities), or "learn" during training.
prior_probabilities (1-d array-like, optional) – Prior probabilities required when prior_probabilities_method is "custom".
number_of_latent_clusters (int, optional) – The number of latent clusters, which is also the number of components in the Gaussian-mixture model.
minibatch_normalisation (bool, optional) – If True, normalise each random minibatch of data when training or evaluating the model.
batch_correction (bool, optional) – If True, and if batches are present in the data set to model, perform batch correction.
number_of_batches (int, optional) – The number of batches in the data set to model. Required if batch_correction is True.
number_of_warm_up_epochs (int, optional) – The number of epochs at the start of training during which the KL divergence is down-weighted. The weight is increased linearly from 0 to 1 over this number of epochs.
log_directory (str, optional) – Directory where model is saved.
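The three options for prior_probabilities_method can be sketched as a small resolver. The function `resolve_prior_probabilities` is hypothetical, and starting the "learn" case from a uniform distribution is an assumption rather than documented scVAE behaviour.

```python
import math


def resolve_prior_probabilities(method, number_of_latent_clusters,
                                prior_probabilities=None):
    """Return initial mixture coefficients for the latent prior.

    Illustrative only: "uniform" gives each cluster equal weight and
    "custom" validates user-supplied probabilities. For "learn", a uniform
    starting point is assumed here.
    """
    uniform = [1 / number_of_latent_clusters] * number_of_latent_clusters
    if method == "uniform":
        return uniform
    if method == "custom":
        if prior_probabilities is None:
            raise ValueError('"custom" requires prior_probabilities')
        probabilities = list(prior_probabilities)
        if len(probabilities) != number_of_latent_clusters:
            raise ValueError("one probability is needed per latent cluster")
        if any(p < 0 for p in probabilities):
            raise ValueError("probabilities must be non-negative")
        if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
            raise ValueError("probabilities must sum to 1")
        return probabilities
    if method == "learn":
        return uniform  # Assumed starting point; updated during training.
    raise ValueError(f"Unknown prior probabilities method: {method!r}")


print(resolve_prior_probabilities("uniform", 4))
print(resolve_prior_probabilities("custom", 3, [0.2, 0.3, 0.5]))
```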
-
feature_size
¶ The number of features/genes which can be modelled.
-
latent_size
¶ The number of dimensions of the latent space.
-
hidden_sizes
¶ A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.
-
reconstruction_distribution
¶ An instance of the reconstruction distribution (or likelihood function) class used by the model.
-
number_of_reconstruction_classes
¶ The number of counts modelled directly, starting from zero.
-
latent_distribution
¶ An instance of the latent prior distribution class used by the model.
-
prior_probabilities_method
¶ Method by which the mixture coefficients for the latent prior distribution are set: "uniform" distribution, "custom" (given by prior_probabilities), or "learn" during training.
-
prior_probabilities
¶ Prior probabilities when prior_probabilities_method is "custom".
-
minibatch_normalisation
¶ If True, normalise each random minibatch of data when training or evaluating the model.
-
batch_correction
¶ If True, and if batches are present in the data set to model, perform batch correction.
-
number_of_batches
¶ The number of batches in the data set to model, when batch_correction is True.
-
number_of_warm_up_epochs
¶ The number of epochs at the start of training during which the KL divergence is down-weighted. The weight is increased linearly from 0 to 1 over this number of epochs.
-
property
name
¶ Short name for model used in filenames.
-
property
description
¶ Description of model.
-
property
parameters
¶ Trainable parameters in the model.
-
property
number_of_latent_clusters
¶ The number of latent clusters used in the model.
-
train
(training_set, validation_set=None, number_of_epochs=None, minibatch_size=None, learning_rate=None, run_id=None, new_run=False, reset_training=False, **kwargs)¶ Train model.
- Parameters
training_set (DataSet) – Data set used to train model.
validation_set (DataSet, optional) – Data set used to validate model during training, if given.
number_of_epochs (int, optional) – The number of epochs to train the model.
minibatch_size (int, optional) – The size of the random minibatches used at each step of training.
learning_rate (float, optional) – The learning rate used at each step of training.
run_id (str, optional) – ID used to identify a certain run of the model.
new_run (bool, optional) – If True, train a model anew as a separate run with an automatically generated ID.
reset_training (bool, optional) – If True, reset model by removing saved parameters for the model.
-
sample
(sample_size=None, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False)¶ Sample from trained model.
- Parameters
sample_size (int, optional) – The number of samples to draw from the model.
minibatch_size (int, optional) – The size of the random minibatches used at each step of training.
run_id (str, optional) – ID used to identify a certain run of the model.
use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.
use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.
- Returns
A data set of generated examples/cells as well as a dictionary of data sets of samples for the two latent variables.
-
evaluate
(evaluation_set, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False, **kwargs)¶ Evaluate trained model.
- Parameters
evaluation_set (DataSet) – Data set used to evaluate model.
minibatch_size (int, optional) – The size of the random minibatches used at each step of training.
run_id (str, optional) – ID used to identify a certain run of the model.
use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.
use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.
- Returns
A data set of reconstructed examples/cells as well as a dictionary of data sets of the two latent variables.
Analyses module
-
scvae.analyses.
analyse_data
(data_sets, decomposition_methods=None, highlight_feature_indices=None, analyses_directory=None, **kwargs)¶ Analyse data set and save results and plots.
- Parameters
data_sets (list(DataSet)) – List of data sets to analyse.
decomposition_methods (str or list(str)) – Method(s) used to decompose data set values: "PCA", "SVD", "ICA", and/or "t-SNE".
highlight_feature_indices (int or list(int)) – Index or indices to highlight in decompositions.
analyses_directory (str, optional) – Directory where to save analyses.
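Because decomposition_methods accepts either a single string or a list of strings, calling code often normalises it to a list first. The helper below is a hypothetical sketch of that step, using the four method names listed above; it does not show how scVAE handles the argument internally.

```python
SUPPORTED_DECOMPOSITION_METHODS = ("PCA", "SVD", "ICA", "t-SNE")


def as_method_list(decomposition_methods):
    """Normalise a `str or list(str)` argument to a validated list."""
    if isinstance(decomposition_methods, str):
        decomposition_methods = [decomposition_methods]
    methods = list(decomposition_methods)
    unknown = [m for m in methods if m not in SUPPORTED_DECOMPOSITION_METHODS]
    if unknown:
        raise ValueError(f"Unsupported decomposition method(s): {unknown}")
    return methods


print(as_method_list("PCA"))
print(as_method_list(["PCA", "t-SNE"]))
```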
-
scvae.analyses.
analyse_model
(model, run_id=None, analyses_directory=None, **kwargs)¶ Analyse trained model and save results and plots.
- Parameters
model ((GaussianMixture)VariationalAutoencoder) – Model to analyse.
run_id (str, optional) – ID used to identify a certain run of model.
analyses_directory (str, optional) – Directory where to save analyses.
-
scvae.analyses.
analyse_intermediate_results
(epoch, learning_curves=None, epoch_start=None, model_type=None, latent_values=None, data_set=None, centroids=None, model_name=None, run_id=None, analyses_directory=None)¶ Analyse reconstructions and latent values.
Reconstructions and latent values from evaluating a model on a data set are analysed, and results and plots are saved.
- Parameters
evaluation_set (DataSet) – Data set used to evaluate model.
reconstructed_evaluation_set (DataSet) – Reconstructed data set from evaluating model on evaluation_set.
latent_evaluation_sets (dict(str, DataSet)) – Dictionary of data sets of the two latent variables.
model ((GaussianMixture)VariationalAutoencoder) – Model evaluated on evaluation_set.
run_id (str, optional) – ID used to identify a certain run of model.
sample_reconstruction_set (DataSet) – Reconstruction data set from sampling model.
decomposition_methods (str or list(str)) – Method(s) used to decompose data set values: "PCA", "SVD", "ICA", and/or "t-SNE".
highlight_feature_indices (int or list(int)) – Index or indices to highlight in decompositions.
early_stopping (bool, optional) – If True, use the parameters for model saved when early stopping was triggered during training. Defaults to False.
best_model (bool, optional) – If True, use the parameters for model that resulted in the best performance on the validation set during training. Defaults to False.
analyses_directory (str, optional) – Directory where to save analyses.
Argument defaults
Below are listed the defaults for some optional arguments:
{
"data": {
"format": "infer",
"directory": "data",
"map_features": false,
"feature_selection": [],
"example_filter": [],
"preprocessing_methods": [],
"noisy_preprocessing_methods": [],
"split_data_set": false,
"splitting_method": "default",
"splitting_fraction": 0.9
},
"analyses": {
"directory": "analyses",
"decomposition_method": "PCA",
"decomposition_dimensionality": 2,
"highlight_feature_indices": [],
"included_analyses": "standard",
"analysis_level": "normal",
"export_options": []
},
"models": {
"directory": "models",
"type": "VAE",
"latent_size": 2,
"hidden_sizes": [100],
"number_of_samples": {
"training": 1,
"evaluation": 1
},
"latent_distribution": {
"VAE": "gaussian",
"GMVAE": "gaussian mixture"
},
"number_of_classes": 1,
"parameterise_latent_posterior": false,
"inference_architecture": "MLP",
"generative_architecture": "MLP",
"reconstruction_distribution": "poisson",
"number_of_reconstruction_classes": 0,
"prior_probabilities_method": "uniform",
"number_of_warm_up_epochs": 0,
"kl_weight": 1,
"proportion_of_free_nats_for_y_kl_divergence": 0.0,
"minibatch_normalisation": true,
"batch_correction": false,
"dropout_keep_probabilities": [],
"count_sum": false,
"number_of_epochs": 200,
"minibatch_size": 100,
"learning_rate": 1e-4,
"sample_size": 0,
"run_id": "",
"new_run": false,
"reset_training": false
},
"evaluation": {
"data_set_kind": "test",
"prediction_training_set_kind": "training",
"prediction_method": "",
"model_versions": "all"
},
"cross_analysis": {
"log_summary": false
}
}
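Many arguments in the interface default to None, with the effective value taken from defaults such as those listed above. The sketch below shows one way such a fallback could be resolved. The helper `with_defaults` and the `DEFAULTS` excerpt are illustrative assumptions; the values are copied from the "models" section above, but this is not how scVAE itself loads or applies them.

```python
import json

# Excerpt of the "models" defaults listed above.
DEFAULTS = json.loads("""
{
    "models": {
        "latent_size": 2,
        "hidden_sizes": [100],
        "number_of_epochs": 200,
        "minibatch_size": 100,
        "learning_rate": 1e-4
    }
}
""")


def with_defaults(section, **arguments):
    """Fill in arguments left as None from the given defaults section."""
    resolved = dict(DEFAULTS[section])
    for name, value in arguments.items():
        if value is not None:
            resolved[name] = value
    return resolved


settings = with_defaults("models", latent_size=10, minibatch_size=None)
print(settings["latent_size"], settings["minibatch_size"])
```

Treating None as "not specified" matches the signatures above, where most optional parameters are declared with a default of None.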