Programming Interface

Data module

class scvae.data.DataSet(input_file_or_name, data_format=None, title=None, specifications=None, values=None, labels=None, example_names=None, feature_names=None, batch_indices=None, feature_selection=None, example_filter=None, preprocessing_methods=None, directory=None, **kwargs)

Data set class for working with scVAE.

scVAE uses this class to handle values, labels, and other metadata for data sets, so data in other formats must be converted to it. A minimal construction example follows the parameter list below.

Parameters
  • input_file_or_name (str) – Path to a data set file or a title for a supported data set (see Data sets).

  • data_format (str, optional) – Format used to store data set (see Custom data sets).

  • title (str, optional) – Title of data set for use in, e.g., plots.

  • specifications (dict, optional) – Metadata for data set.

  • values (2-d NumPy array, optional) – Matrix for (count) values with rows representing examples/cells and columns features/genes.

  • labels (1-d NumPy array, optional) – List of labels for examples/cells in the same order as for values.

  • example_names (1-d NumPy array, optional) – List of names for examples/cells in the same order as for values.

  • feature_names (1-d NumPy array, optional) – List of names for features/genes in the same order as for values.

  • batch_indices (1-d NumPy array, optional) – List of batch indices for examples/cells in the same order as for values.

  • feature_selection (list, optional) – Feature-selection method followed by any parameters for it, given as a list.

  • example_filter (list, optional) – Example-filtering method followed by any parameters for it, given as a list.

  • preprocessing_methods (list, optional) – Ordered list of preprocessing methods applied to (count) values: "normalise" (each feature/gene), "log", and "exp".

  • directory (str, optional) – Directory where data set is saved.
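
As a minimal sketch of how these parameters fit together, the following constructs a data set from an in-memory count matrix; the title, array contents, and directory are purely illustrative, and only parameters documented above are used:

import numpy

from scvae.data import DataSet

# Illustrative count matrix: 100 examples/cells by 20 features/genes.
values = numpy.random.poisson(lam=2.0, size=(100, 20))
example_names = numpy.array(["cell_{}".format(i) for i in range(100)])
feature_names = numpy.array(["gene_{}".format(j) for j in range(20)])

data_set = DataSet(
    "example_counts",               # hypothetical title for the data set
    values=values,
    example_names=example_names,
    feature_names=feature_names,
    directory="data"                # where the data set is saved
)

For a supported data set or a data set file, pass its title or path as the first argument instead and call load() to read the values.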

name

Short name for data set used in filenames.

title

Title of data set for use in, e.g., plots.

specifications

Metadata for the data set. If a JSON file was provided, this holds its contents.

data_format

Format used to store data set.

terms

Dictionary of terms to use for, e.g., "example" (cell), "feature" (gene), and "class" (cell type).

values

Matrix for (count) values with rows representing examples/cells and columns features/genes.

labels

List of labels for examples/cells in the same order as for values.

example_names

List of names for examples/cells in the same order as for values.

feature_names

List of names for features/genes in the same order as for values.

batch_indices

List of batch indices for examples/cells in the same order as for values.

number_of_examples

The number of examples/cells.

number_of_features

The number of features/genes.

number_of_classes

The number of classes/cell types.

feature_selection_method

The method used for selecting features.

feature_selection_parameters

List of parameters for the feature selection method.

example_filter_method

The method used for filtering examples.

example_filter_parameters

List of parameters for the example filtering method.

kind

The kind of data set: "full", "training", "validation", or "test".

version

The version of the data set: "original", "reconstructed", or latent ("z" or "y").

property number_of_values

Total number of (count) values in matrix.

load()

Load data set.

split(method=None, fraction=None)

Split data set into subsets.

The data set is split into a training set to train a model, a validation set to validate the model during training, and a test set to evaluate the model after training.

Parameters
  • method (str, optional) – The method to use: "random" or "sequential".

  • fraction (float, optional) – The fraction to use for training and, optionally, validation.

Returns

Training, validation, and test sets.
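
A short sketch of splitting the data set constructed above; the unpacking into three sets assumes the return order described under Returns, and the fraction is only an example:

training_set, validation_set, test_set = data_set.split(
    method="random",    # or "sequential"
    fraction=0.8        # fraction used for training (and, optionally, validation)
)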

clear()

Clear data set.

Models module

class scvae.models.VariationalAutoencoder(feature_size, latent_size=None, hidden_sizes=None, reconstruction_distribution=None, number_of_reconstruction_classes=None, latent_distribution=None, minibatch_normalisation=None, batch_correction=None, number_of_batches=None, number_of_warm_up_epochs=None, log_directory=None, **kwargs)

Variational auto-encoder class.

Parameters
  • feature_size (int) – The number of features/genes in the data set to model.

  • latent_size (int) – The number of dimensions to use for the latent space.

  • hidden_sizes (list(int)) – A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.

  • reconstruction_distribution (str, optional) – The name of the reconstruction distribution (or likelihood function; see Training a model).

  • number_of_reconstruction_classes (int, optional) – The number of counts to model directly, starting from zero (see Training a model).

  • latent_distribution (str, optional) – The name of the latent prior distribution: "gaussian" or "unit_variance_gaussian" (see Training a model).

  • minibatch_normalisation (bool, optional) – If True, normalise each random minibatch of data when training or evaluating the model.

  • batch_correction (bool, optional) – If True, and if batches are present in the data set to model, perform batch correction.

  • number_of_batches (int, optional) – The number of batches in the data set to model. Required if batch_correction is True.

  • number_of_warm_up_epochs (int, optional) – The number of epochs at the start of training during which the weight on the KL divergence is increased linearly from 0 to 1.

  • log_directory (str, optional) – Directory where model is saved.
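
To illustrate the constructor parameters above, a hedged sketch of setting up a model for the data set from the Data module examples; the latent size, hidden sizes, and distribution names are example choices, taken from the descriptions above and the argument defaults at the end of this page:

from scvae.models import VariationalAutoencoder

model = VariationalAutoencoder(
    feature_size=data_set.number_of_features,
    latent_size=10,
    hidden_sizes=[100, 100],                 # two hidden layers in each network
    reconstruction_distribution="poisson",   # see Training a model
    latent_distribution="gaussian",
    minibatch_normalisation=True,
    log_directory="models"
)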

feature_size

The number of features/genes which can be modelled.

latent_size

The number of dimensions of the latent space.

hidden_sizes

A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.

reconstruction_distribution

An instance of the reconstruction distribution (or likelihood function) class used by the model.

number_of_reconstruction_classes

The number of counts modelled directly, starting from zero.

latent_distribution

An instance of the latent prior distribution class used by the model.

minibatch_normalisation

If True, normalise each random minibatch of data when training or evaluating the model.

batch_correction

If True, and if batches are present in the data set to model, perform batch correction.

number_of_batches

The number of batches in the data set to model, when batch_correction is True.

number_of_warm_up_epochs

The number of epochs at the start of training during which the weight on the KL divergence is increased linearly from 0 to 1.

property name

Short name for model used in filenames.

property description

Description of model.

property parameters

Trainable parameters in the model.

train(training_set, validation_set=None, number_of_epochs=None, minibatch_size=None, learning_rate=None, run_id=None, new_run=None, reset_training=None, **kwargs)

Train model.

Parameters
  • training_set (DataSet) – Data set used to train model.

  • validation_set (DataSet, optional) – Data set used to validate model during training, if given.

  • number_of_epochs (int, optional) – The number of epochs to train the model.

  • minibatch_size (int, optional) – The size of the random minibatches used at each step of training.

  • learning_rate (float, optional) – The learning rate used at each step of training.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • new_run (bool, optional) – If True, train a model anew as a separate run with an automatically generated ID.

  • reset_training (bool, optional) – If True, reset model by removing saved parameters for the model.
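
A minimal training sketch, reusing the model constructed above and the sets from the split() example in the Data module; the epoch count, minibatch size, and learning rate match the argument defaults listed at the end of this page:

model.train(
    training_set,
    validation_set=validation_set,
    number_of_epochs=200,
    minibatch_size=100,
    learning_rate=1e-4
)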

sample(sample_size=None, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False)

Sample from trained model.

Parameters
  • sample_size (int, optional) – The number of samples to draw from the model.

  • minibatch_size (int, optional) – The size of the random minibatches used when sampling from the model.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.

  • use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.

Returns

A data set of generated examples/cells as well as a dictionary of data sets of samples for the two latent variables.
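
A sketch of sampling from the trained model; unpacking the result into a generated data set and a dictionary of latent sample sets follows the return description above, and the sample size is illustrative:

generated_set, latent_sample_sets = model.sample(
    sample_size=1000,
    minibatch_size=100
)
print(generated_set.number_of_examples)   # number of generated examples/cells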

evaluate(evaluation_set, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False, **kwargs)

Evaluate trained model.

Parameters
  • evaluation_set (DataSet) – Data set used to evaluate model.

  • minibatch_size (int, optional) – The size of the random minibatches used when evaluating the model.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.

  • use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.

Returns

A data set of reconstructed examples/cells as well as a data set of the latent variable (wrapped in a dictionary for compatibility with GaussianMixtureVariationalAutoencoder).
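
A sketch of evaluating the trained model on the test set from the split() example; the two-value unpacking follows the return description above:

reconstructed_test_set, latent_test_sets = model.evaluate(
    test_set,
    minibatch_size=100
)

# latent_test_sets is a dictionary of data sets; for this model it holds
# the single latent variable (see the return description above).
for latent_variable, latent_set in latent_test_sets.items():
    print(latent_variable, latent_set.number_of_features)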

class scvae.models.GaussianMixtureVariationalAutoencoder(feature_size, latent_size=None, hidden_sizes=None, reconstruction_distribution=None, number_of_reconstruction_classes=None, latent_distribution=None, prior_probabilities_method=None, prior_probabilities=None, number_of_latent_clusters=None, minibatch_normalisation=None, batch_correction=None, number_of_batches=None, number_of_warm_up_epochs=None, log_directory=None, **kwargs)

Gaussian-mixture variational auto-encoder class.

Parameters
  • feature_size (int) – The number of features/genes in the data set to model.

  • latent_size (int) – The number of dimensions to use for the latent space.

  • hidden_sizes (list(int)) – A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.

  • reconstruction_distribution (str, optional) – The name of the reconstruction distribution (or likelihood function; see Training a model).

  • number_of_reconstruction_classes (int, optional) – The number of counts to model directly, starting from zero (see Training a model).

  • latent_distribution (str, optional) – The name of the latent prior distribution: "gaussian_mixture" or "full_covariance_gaussian_mixture" (see Training a model).

  • prior_probabilities_method (str, optional) – Method for setting the mixture coefficients of the latent prior distribution: "uniform" (uniform distribution), "custom" (probabilities provided via prior_probabilities), or "learn" (learnt during training).

  • prior_probabilities (1-d array-like, optional) – Prior probabilities required when prior_probabilities_method is "custom".

  • number_of_latent_clusters (int, optional) – The number of latent clusters, which is also the number of components in the Gaussian-mixture model.

  • minibatch_normalisation (bool, optional) – If True, normalise each random minibatch of data when training or evaluating the model.

  • batch_correction (bool, optional) – If True, and if batches are present in the data set to model, perform batch correction.

  • number_of_batches (int, optional) – The number of batches in the data set to model. Required if batch_correction is True.

  • number_of_warm_up_epochs (int, optional) – The number of epochs at the start of training during which the weight on the KL divergence is increased linearly from 0 to 1.

  • log_directory (str, optional) – Directory where model is saved.
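
As with the plain variational auto-encoder, a hedged construction sketch for the data set from the Data module examples; the sizes are example choices, and the distribution names are taken from the parameter descriptions above:

from scvae.models import GaussianMixtureVariationalAutoencoder

model = GaussianMixtureVariationalAutoencoder(
    feature_size=data_set.number_of_features,
    latent_size=10,
    hidden_sizes=[100, 100],
    reconstruction_distribution="poisson",
    latent_distribution="gaussian_mixture",
    prior_probabilities_method="uniform",
    number_of_latent_clusters=data_set.number_of_classes,  # e.g. one cluster per class/cell type
    log_directory="models"
)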

feature_size

The number of features/genes which can be modelled.

latent_size

The number of dimensions of the latent space.

hidden_sizes

A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.

reconstruction_distribution

An instance of the reconstruction distribution (or likelihood function) class used by the model.

number_of_reconstruction_classes

The number of counts modelled directly, starting from zero.

latent_distribution

An instance of the latent prior distribution class used by the model.

prior_probabilities_method

Method used to set the mixture coefficients of the latent prior distribution: "uniform" (uniform distribution), "custom" (given by prior_probabilities), or "learn" (learnt during training).

prior_probabilities

Prior probabilities when prior_probabilities_method is "custom".

minibatch_normalisation

If True, normalise each random minibatch of data when training or evaluating the model.

batch_correction

If True, and if batches are present in the data set to model, perform batch correction.

number_of_batches

The number of batches in the data set to model, when batch_correction is True.

number_of_warm_up_epochs

The number of epochs at the start of training during which the weight on the KL divergence is increased linearly from 0 to 1.

property name

Short name for model used in filenames.

property description

Description of model.

property parameters

Trainable parameters in the model.

property number_of_latent_clusters

The number of latent clusters used in the model.

train(training_set, validation_set=None, number_of_epochs=None, minibatch_size=None, learning_rate=None, run_id=None, new_run=False, reset_training=False, **kwargs)

Train model.

Parameters
  • training_set (DataSet) – Data set used to train model.

  • validation_set (DataSet, optional) – Data set used to validate model during training, if given.

  • number_of_epochs (int, optional) – The number of epochs to train the model.

  • minibatch_size (int, optional) – The size of the random minibatches used at each step of training.

  • learning_rate (float, optional) – The learning rate used at each step of training.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • new_run (bool, optional) – If True, train a model anew as a separate run with an automatically generated ID.

  • reset_training (bool, optional) – If True, reset model by removing saved parameters for the model.

sample(sample_size=None, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False)

Sample from trained model.

Parameters
  • sample_size (int, optional) – The number of samples to draw from the model.

  • minibatch_size (int, optional) – The size of the random minibatches used when sampling from the model.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.

  • use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.

Returns

A data set of generated examples/cells as well as a dictionary of data sets of samples for the two latent variables.

evaluate(evaluation_set, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False, **kwargs)

Evaluate trained model.

Parameters
  • evaluation_set (DataSet) – Data set used to evaluate model.

  • minibatch_size (int, optional) – The size of the random minibatches used when evaluating the model.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.

  • use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.

Returns

A data set of reconstructed examples/cells as well as a dictionary of data sets of the two latent variables.
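
A sketch of evaluating a trained Gaussian-mixture model; the unpacking into a reconstructed data set and a dictionary of latent data sets follows the return description above, and the dictionary keys are read rather than assumed:

reconstructed_test_set, latent_test_sets = model.evaluate(test_set)

# Unlike VariationalAutoencoder.evaluate(), the dictionary holds data sets
# for two latent variables; list their names as given by the model.
print(sorted(latent_test_sets))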

Analyses module

scvae.analyses.analyse_data(data_sets, decomposition_methods=None, highlight_feature_indices=None, analyses_directory=None, **kwargs)

Analyse data set and save results and plots.

Parameters
  • data_sets (list(DataSet)) – List of data sets to analyse.

  • decomposition_methods (str or list(str)) – Method(s) used to decompose data set values: "PCA", "SVD", "ICA", and/or "t-SNE".

  • highlight_feature_indices (int or list(int)) – Index or indices to highlight in decompositions.

  • analyses_directory (str, optional) – Directory where analyses are saved.
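
A sketch of analysing the data sets from the earlier examples; the decomposition methods are example choices from the list above, and the directory matches the argument defaults below:

from scvae.analyses import analyse_data

analyse_data(
    [training_set, validation_set, test_set],
    decomposition_methods=["PCA", "t-SNE"],
    analyses_directory="analyses"
)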

scvae.analyses.analyse_model(model, run_id=None, analyses_directory=None, **kwargs)

Analyse trained model and save results and plots.

Parameters
  • model ((GaussianMixture)VariationalAutoencoder) – Model to analyse.

  • run_id (str, optional) – ID used to identify a certain run of model.

  • analyses_directory (str, optional) – Directory where analyses are saved.
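
A correspondingly minimal sketch for a trained model from the Models module examples:

from scvae.analyses import analyse_model

analyse_model(model, analyses_directory="analyses")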

scvae.analyses.analyse_results(evaluation_set, reconstructed_evaluation_set, latent_evaluation_sets, model, run_id=None, sample_reconstruction_set=None, decomposition_methods=None, highlight_feature_indices=None, early_stopping=False, best_model=False, analyses_directory=None, **kwargs)

Analyse reconstructions and latent values.

Reconstructions and latent values from evaluating a model on a data set are analysed, and results and plots are saved.

Parameters
  • evaluation_set (DataSet) – Data set used to evaluate model.

  • reconstructed_evaluation_set (DataSet) – Reconstructed data set from evaluating model on evaluation_set.

  • latent_evaluation_sets (dict(str, DataSet)) – Dictionary of data sets of the two latent variables.

  • model ((GaussianMixture)VariationalAutoencoder) – Model evaluated on evaluation_set.

  • run_id (str, optional) – ID used to identify a certain run of model.

  • sample_reconstruction_set (DataSet, optional) – Reconstruction data set from sampling the model.

  • decomposition_methods (str or list(str)) – Method(s) used to decompose data set values: "PCA", "SVD", "ICA", and/or "t-SNE".

  • highlight_feature_indices (int or list(int)) – Index or indices to highlight in decompositions.

  • early_stopping (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.

  • best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.

  • analyses_directory (str, optional) – Directory where analyses are saved.
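
A hedged sketch of analysing evaluation results, assuming the parameter order given above and reusing the evaluation outputs from the Models module examples:

from scvae.analyses import analyse_results

analyse_results(
    test_set,                  # evaluation_set
    reconstructed_test_set,    # from model.evaluate()
    latent_test_sets,          # from model.evaluate()
    model,
    decomposition_methods="PCA",
    analyses_directory="analyses"
)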

Argument defaults

Below are listed the defaults for some optional arguments:

{
	"data": {
		"format": "infer",
		"directory": "data",
		"map_features": false,
		"feature_selection": [],
		"example_filter": [],
		"preprocessing_methods": [],
		"noisy_preprocessing_methods": [],
		"split_data_set": false,
		"splitting_method": "default",
		"splitting_fraction": 0.9
	},
	"analyses": {
		"directory": "analyses",
		"decomposition_method": "PCA",
		"decomposition_dimensionality": 2,
		"highlight_feature_indices": [],
		"included_analyses": "standard",
		"analysis_level": "normal",
		"export_options": []
	},
	"models": {
		"directory": "models",
		"type": "VAE",
		"latent_size": 2,
		"hidden_sizes": [100],
		"number_of_samples": {
		    "training": 1,
		    "evaluation": 1
		},
		"latent_distribution": {
			"VAE": "gaussian",
			"GMVAE": "gaussian mixture"
		},
		"number_of_classes": 1,
		"parameterise_latent_posterior": false,
		"inference_architecture": "MLP",
		"generative_architecture": "MLP",
		"reconstruction_distribution": "poisson",
		"number_of_reconstruction_classes": 0,
		"prior_probabilities_method": "uniform",
		"number_of_warm_up_epochs": 0,
		"kl_weight": 1,
		"proportion_of_free_nats_for_y_kl_divergence": 0.0,
		"minibatch_normalisation": true,
		"batch_correction": false,
		"dropout_keep_probabilities": [],
		"count_sum": false,
		"number_of_epochs": 200,
		"minibatch_size": 100,
		"learning_rate": 1e-4,
		"sample_size": 0,
		"run_id": "",
		"new_run": false,
		"reset_training": false
	},
	"evaluation": {
		"data_set_kind": "test",
		"prediction_training_set_kind": "training",
		"prediction_method": "",
		"model_versions": "all"
	},
	"cross_analysis": {
		"log_summary": false
	}
}
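
These defaults are plain JSON, so they can be inspected programmatically; a small sketch, assuming the listing above has been saved to a hypothetical defaults.json file:

import json

with open("defaults.json") as defaults_file:
    defaults = json.load(defaults_file)

print(defaults["models"]["reconstruction_distribution"])  # "poisson"
print(defaults["data"]["splitting_fraction"])             # 0.9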