Programming Interface

Data module

class scvae.data.DataSet(input_file_or_name, data_format=None, title=None, specifications=None, values=None, labels=None, example_names=None, feature_names=None, batch_indices=None, feature_selection=None, example_filter=None, preprocessing_methods=None, directory=None, **kwargs)

Data set class for working with scVAE.

scVAE uses this class to handle values, labels, and other metadata for data sets, so data in other formats must be converted to it. A minimal construction example follows the parameter list below.

Parameters
  • input_file_or_name (str) – Path to a data set file or a title for a supported data set (see Data sets).

  • data_format (str, optional) – Format used to store data set (see Custom data sets).

  • title (str, optional) – Title of data set for use in, e.g., plots.

  • specifications (dict, optional) – Metadata for data set.

  • values (2-d NumPy array, optional) – Matrix for (count) values with rows representing examples/cells and columns features/genes.

  • labels (1-d NumPy array, optional) – List of labels for examples/cells in the same order as for values.

  • example_names (1-d NumPy array, optional) – List of names for examples/cells in the same order as for values.

  • feature_names (1-d NumPy array, optional) – List of names for features/genes in the same order as for values.

  • batch_indices (1-d NumPy array, optional) – List of batch indices for examples/cells in the same order as for values.

  • feature_selection (list, optional) – Feature-selection method followed by any parameters for it, given as a list.

  • example_filter (list, optional) – Example-filtering method followed by any parameters for it, given as a list.

  • preprocessing_methods (list, optional) – Ordered list of preprocessing methods applied to (count) values: "normalise" (each feature/gene), "log", and "exp".

  • directory (str, optional) – Directory where data set is saved.
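
As a minimal sketch of how these parameters fit together, the following constructs a data set from an in-memory count matrix; the title, array contents, and directory are purely illustrative, and only parameters documented above are used:

import numpy

from scvae.data import DataSet

# Illustrative count matrix: 100 examples/cells by 20 features/genes.
values = numpy.random.poisson(lam=2.0, size=(100, 20))
example_names = numpy.array(["cell_{}".format(i) for i in range(100)])
feature_names = numpy.array(["gene_{}".format(j) for j in range(20)])

data_set = DataSet(
    "example_counts",               # hypothetical title for the data set
    values=values,
    example_names=example_names,
    feature_names=feature_names,
    directory="data"                # where the data set is saved
)

For a supported data set or a data set file, pass its title or path as the first argument instead and call load() to read the values.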

name

Short name for data set used in filenames.

title

Title of data set for use in, e.g., plots.

specifications

Metadata for the data set. If a JSON file was provided, this holds its contents.

data_format

Format used to store data set.

terms

Dictionary of terms to use for, e.g., "example" (cell), "feature" (gene), and "class" (cell type).

values

Matrix for (count) values with rows representing examples/cells and columns features/genes.

labels

List of labels for examples/cells in the same order as for values.

example_names

List of names for examples/cells in the same order as for values.

feature_names

List of names for features/genes in the same order as for values.

batch_indices

List of batch indices for examples/cells in the same order as for values.

number_of_examples

The number of examples/cells.

number_of_features

The number of features/genes.

number_of_classes

The number of classes/cell types.

feature_selection_method

The method used for selecting features.

feature_selection_parameters

List of parameters for the feature selection method.

example_filter_method

The method used for filtering examples.

example_filter_parameters

List of parameters for the example filtering method.

kind

The kind of data set: "full", "training", "validation", or "test".

version

The version of the data set: "original", "reconstructed", or latent ("z" or "y").

property number_of_values

Total number of (count) values in matrix.

load()

Load data set.

split(method=None, fraction=None)

Split data set into subsets.

The data set is split into a training set to train a model, a validation set to validate the model during training, and a test set to evaluate the model after training.

Parameters
  • method (str, optional) – The method to use: "random" or "sequential".

  • fraction (float, optional) – The fraction to use for training and, optionally, validation.

Returns

Training, validation, and test sets.
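
A short sketch of splitting the data set constructed above; the unpacking into three sets assumes the return order described under Returns, and the fraction is only an example:

training_set, validation_set, test_set = data_set.split(
    method="random",    # or "sequential"
    fraction=0.8        # fraction used for training (and, optionally, validation)
)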

clear()

Clear data set.

Models module

class scvae.models.VariationalAutoencoder(feature_size, latent_size=None, hidden_sizes=None, reconstruction_distribution=None, number_of_reconstruction_classes=None, latent_distribution=None, minibatch_normalisation=None, batch_correction=None, number_of_batches=None, number_of_warm_up_epochs=None, log_directory=None, **kwargs)

Variational auto-encoder class.

Parameters
  • feature_size (int) – The number of features/genes in the data set to model.

  • latent_size (int) – The number of dimensions to use for the latent space.

  • hidden_sizes (list(int)) – A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.

  • reconstruction_distribution (str, optional) – The name of the reconstruction distribution (or likelihood function; see Training a model).

  • number_of_reconstruction_classes (int, optional) – The number of counts to model directly, starting from zero (see Training a model).

  • latent_distribution (str, optional) – The name of the latent prior distribution: "gaussian" or "unit_variance_gaussian" (see Training a model).

  • minibatch_normalisation (bool, optional) – If True, normalise each random minibatch of data when training or evaluating the model.

  • batch_correction (bool, optional) – If True, and if batches are present in the data set to model, perform batch correction.

  • number_of_batches (int, optional) – The number of batches in the data set to model. Required if batch_correction is True.

  • number_of_warm_up_epochs (int, optional) – The number of epochs at the start of training during which the weight on the KL divergence is increased linearly from 0 to 1.

  • log_directory (str, optional) – Directory where model is saved.
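
To illustrate the constructor parameters above, a hedged sketch of setting up a model for the data set from the Data module examples; the latent size, hidden sizes, and distribution names are example choices, taken from the descriptions above and the argument defaults at the end of this page:

from scvae.models import VariationalAutoencoder

model = VariationalAutoencoder(
    feature_size=data_set.number_of_features,
    latent_size=10,
    hidden_sizes=[100, 100],                 # two hidden layers in each network
    reconstruction_distribution="poisson",   # see Training a model
    latent_distribution="gaussian",
    minibatch_normalisation=True,
    log_directory="models"
)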

feature_size

The number of features/genes which can be modelled.

latent_size

The number of dimensions of the latent space.

hidden_sizes

A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.

reconstruction_distribution

An instance of the reconstruction distribution (or likelihood function) class used by the model.

number_of_reconstruction_classes

The number of counts modelled directly, starting from zero.

latent_distribution

An instance of the latent prior distribution class used by the model.

minibatch_normalisation

If True, normalise each random minibatch of data when training or evaluating the model.

batch_correction

If True, and if batches are present in the data set to model, perform batch correction.

number_of_batches

The number of batches in the data set to model, when batch_correction is True.

number_of_warm_up_epochs

The number of epochs at the start of training during which the weight on the KL divergence is increased linearly from 0 to 1.

property name

Short name for model used in filenames.

property description

Description of model.

property parameters

Trainable parameters in the model.

train(training_set, validation_set=None, number_of_epochs=None, minibatch_size=None, learning_rate=None, run_id=None, new_run=None, reset_training=None, **kwargs)

Train model.

Parameters
  • training_set (DataSet) – Data set used to train model.

  • validation_set (DataSet, optional) – Data set used to validate model during training, if given.

  • number_of_epochs (int, optional) – The number of epochs to train the model.

  • minibatch_size (int, optional) – The size of the random minibatches used at each step of training.

  • learning_rate (float, optional) – The learning rate used at each step of training.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • new_run (bool, optional) – If True, train a model anew as a separate run with an automatically generated ID.

  • reset_training (bool, optional) – If True, reset model by removing saved parameters for the model.
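
A minimal training sketch, reusing the model constructed above and the sets from the split() example in the Data module; the epoch count, minibatch size, and learning rate match the argument defaults listed at the end of this page:

model.train(
    training_set,
    validation_set=validation_set,
    number_of_epochs=200,
    minibatch_size=100,
    learning_rate=1e-4
)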

sample(sample_size=None, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False)

Sample from trained model.

Parameters
  • sample_size (int, optional) – The number of samples to draw from the model.

  • minibatch_size (int, optional) – The size of the random minibatches used when sampling from the model.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.

  • use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.

Returns

A data set of generated examples/cells as well as a dictionary of data sets of samples for the two latent variables.
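
A sketch of sampling from the trained model; unpacking the result into a generated data set and a dictionary of latent sample sets follows the return description above, and the sample size is illustrative:

generated_set, latent_sample_sets = model.sample(
    sample_size=1000,
    minibatch_size=100
)
print(generated_set.number_of_examples)   # number of generated examples/cells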

evaluate(evaluation_set, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False, **kwargs)

Evaluate trained model.

Parameters
  • evaluation_set (DataSet) – Data set used to evaluate model.

  • minibatch_size (int, optional) – The size of the random minibatches used when evaluating the model.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.

  • use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.

Returns

A data set of reconstructed examples/cells as well as a data set of the latent variable (wrapped in a dictionary for compatibility with GaussianMixtureVariationalAutoencoder).
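
A sketch of evaluating the trained model on the test set from the split() example; the two-value unpacking follows the return description above:

reconstructed_test_set, latent_test_sets = model.evaluate(
    test_set,
    minibatch_size=100
)

# latent_test_sets is a dictionary of data sets; for this model it holds
# the single latent variable (see the return description above).
for latent_variable, latent_set in latent_test_sets.items():
    print(latent_variable, latent_set.number_of_features)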

class scvae.models.GaussianMixtureVariationalAutoencoder(feature_size, latent_size=None, hidden_sizes=None, reconstruction_distribution=None, number_of_reconstruction_classes=None, latent_distribution=None, prior_probabilities_method=None, prior_probabilities=None, number_of_latent_clusters=None, minibatch_normalisation=None, batch_correction=None, number_of_batches=None, number_of_warm_up_epochs=None, log_directory=None, **kwargs)

Gaussian-mixture variational auto-encoder class.

Parameters
  • feature_size (int) – The number of features/genes in the data set to model.

  • latent_size (int) – The number of dimensions to use for the latent space.

  • hidden_sizes (list(int)) – A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.

  • reconstruction_distribution (str, optional) – The name of the reconstruction distribution (or likelihood function; see Training a model).

  • number_of_reconstruction_classes (int, optional) – The number of counts to model directly, starting from zero (see Training a model).

  • latent_distribution (str, optional) – The name of the latent prior distribution: "gaussian_mixture" or "full_covariance_gaussian_mixture" (see Training a model).

  • prior_probabilities_method (str, optional) – Method for setting the mixture coefficients of the latent prior distribution: "uniform" (uniform distribution), "custom" (probabilities provided via prior_probabilities), or "learn" (learnt during training).

  • prior_probabilities (1-d array-like, optional) – Prior probabilities required when prior_probabilities_method is "custom".

  • number_of_latent_clusters (int, optional) – The number of latent clusters, which is also the number of components in the Gaussian-mixture model.

  • minibatch_normalisation (bool, optional) – If True, normalise each random minibatch of data when training or evaluating the model.

  • batch_correction (bool, optional) – If True, and if batches are present in the data set to model, perform batch correction.

  • number_of_batches (int, optional) – The number of batches in the data set to model. Required if batch_correction is True.

  • number_of_warm_up_epochs (int, optional) – The number of epochs at the start of training during which the weight on the KL divergence is increased linearly from 0 to 1.

  • log_directory (str, optional) – Directory where model is saved.
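
As with the plain variational auto-encoder, a hedged construction sketch for the data set from the Data module examples; the sizes are example choices, and the distribution names are taken from the parameter descriptions above:

from scvae.models import GaussianMixtureVariationalAutoencoder

model = GaussianMixtureVariationalAutoencoder(
    feature_size=data_set.number_of_features,
    latent_size=10,
    hidden_sizes=[100, 100],
    reconstruction_distribution="poisson",
    latent_distribution="gaussian_mixture",
    prior_probabilities_method="uniform",
    number_of_latent_clusters=data_set.number_of_classes,  # e.g. one cluster per class/cell type
    log_directory="models"
)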

feature_size

The number of features/genes which can be modelled.

latent_size

The number of dimensions of the latent space.

hidden_sizes

A list of the number of units in each hidden layer of both the inference (encoder) and the generative (decoder) networks. The number of layers in each network is thus the length of this list. For the inference network, the order of the hidden layers is the same as for the list, while for the generative network, it is the reverse.

reconstruction_distribution

An instance of the reconstruction distribution (or likelihood function) class used by the model.

number_of_reconstruction_classes

The number of counts modelled directly, starting from zero.

latent_distribution

An instance of the latent prior distribution class used by the model.

prior_probabilities_method

Method used to set the mixture coefficients of the latent prior distribution: "uniform" (uniform distribution), "custom" (given by prior_probabilities), or "learn" (learnt during training).

prior_probabilities

Prior probabilities when prior_probabilities_method is "custom".

minibatch_normalisation

If True, normalise each random minibatch of data when training or evaluating the model.

batch_correction

If True, and if batches are present in the data set to model, perform batch correction.

number_of_batches

The number of batches in the data set to model, when batch_correction is True.

number_of_warm_up_epochs

The number of epochs at the start of training during which the weight on the KL divergence is increased linearly from 0 to 1.

property name

Short name for model used in filenames.

property description

Description of model.

property parameters

Trainable parameters in the model.

property number_of_latent_clusters

The number of latent clusters used in the model.

train(training_set, validation_set=None, number_of_epochs=None, minibatch_size=None, learning_rate=None, run_id=None, new_run=False, reset_training=False, **kwargs)

Train model.

Parameters
  • training_set (DataSet) – Data set used to train model.

  • validation_set (DataSet, optional) – Data set used to validate model during training, if given.

  • number_of_epochs (int, optional) – The number of epochs to train the model.

  • minibatch_size (int, optional) – The size of the random minibatches used at each step of training.

  • learning_rate (float, optional) – The learning rate used at each step of training.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • new_run (bool, optional) – If True, train a model anew as a separate run with an automatically generated ID.

  • reset_training (bool, optional) – If True, reset model by removing saved parameters for the model.

sample(sample_size=None, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False)

Sample from trained model.

Parameters
  • sample_size (int, optional) – The number of samples to draw from the model.

  • minibatch_size (int, optional) – The size of the random minibatches used when sampling from the model.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.

  • use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.

Returns

A data set of generated examples/cells as well as a dictionary of data sets of samples for the two latent variables.

evaluate(evaluation_set, minibatch_size=None, run_id=None, use_early_stopping_model=False, use_best_model=False, **kwargs)

Evaluate trained model.

Parameters
  • evaluation_set (DataSet) – Data set used to evaluate model.

  • minibatch_size (int, optional) – The size of the random minibatches used when evaluating the model.

  • run_id (str, optional) – ID used to identify a certain run of the model.

  • use_early_stopping_model (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.

  • use_best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.

Returns

A data set of reconstructed examples/cells as well as a dictionary of data sets of the two latent variables.
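
A sketch of evaluating a trained Gaussian-mixture model; the unpacking into a reconstructed data set and a dictionary of latent data sets follows the return description above, and the dictionary keys are read rather than assumed:

reconstructed_test_set, latent_test_sets = model.evaluate(test_set)

# Unlike VariationalAutoencoder.evaluate(), the dictionary holds data sets
# for two latent variables; list their names as given by the model.
print(sorted(latent_test_sets))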

Analyses module

scvae.analyses.analyse_data(data_sets, decomposition_methods=None, highlight_feature_indices=None, analyses_directory=None, **kwargs)

Analyse data set and save results and plots.

Parameters
  • data_sets (list(DataSet)) – List of data sets to analyse.

  • decomposition_methods (str or list(str)) – Method(s) used to decompose data set values: "PCA", "SVD", "ICA", and/or "t-SNE".

  • highlight_feature_indices (int or list(int)) – Index or indices to highlight in decompositions.

  • analyses_directory (str, optional) – Directory where analyses are saved.
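
A sketch of analysing the data sets from the earlier examples; the decomposition methods are example choices from the list above, and the directory matches the argument defaults below:

from scvae.analyses import analyse_data

analyse_data(
    [training_set, validation_set, test_set],
    decomposition_methods=["PCA", "t-SNE"],
    analyses_directory="analyses"
)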

scvae.analyses.analyse_model(model, run_id=None, analyses_directory=None, **kwargs)

Analyse trained model and save results and plots.

Parameters
  • model ((GaussianMixture)VariationalAutoencoder) – Model to analyse.

  • run_id (str, optional) – ID used to identify a certain run of model.

  • analyses_directory (str, optional) – Directory where analyses are saved.
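
A correspondingly minimal sketch for a trained model from the Models module examples:

from scvae.analyses import analyse_model

analyse_model(model, analyses_directory="analyses")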

scvae.analyses.analyse_results(evaluation_set, reconstructed_evaluation_set, latent_evaluation_sets, model, run_id=None, sample_reconstruction_set=None, decomposition_methods=None, highlight_feature_indices=None, early_stopping=False, best_model=False, analyses_directory=None, **kwargs)

Analyse reconstructions and latent values.

Reconstructions and latent values from evaluating a model on a data set are analysed, and results and plots are saved.

Parameters
  • evaluation_set (DataSet) – Data set used to evaluate model.

  • reconstructed_evaluation_set (DataSet) – Reconstructed data set from evaluating model on evaluation_set.

  • latent_evaluation_sets (dict(str, DataSet)) – Dictionary of data sets of the two latent variables.

  • model ((GaussianMixture)VariationalAutoencoder) – Model evaluated on evaluation_set.

  • run_id (str, optional) – ID used to identify a certain run of model.

  • sample_reconstruction_set (DataSet, optional) – Reconstruction data set from sampling the model.

  • decomposition_methods (str or list(str)) – Method(s) used to decompose data set values: "PCA", "SVD", "ICA", and/or "t-SNE".

  • highlight_feature_indices (int or list(int)) – Index or indices to highlight in decompositions.

  • early_stopping (bool, optional) – If True, use the model parameters saved when early stopping was triggered during training. Defaults to False.

  • best_model (bool, optional) – If True, use the model parameters that resulted in the best performance on the validation set during training. Defaults to False.

  • analyses_directory (str, optional) – Directory where analyses are saved.
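
A hedged sketch of analysing evaluation results, assuming the parameter order given above and reusing the evaluation outputs from the Models module examples:

from scvae.analyses import analyse_results

analyse_results(
    test_set,                  # evaluation_set
    reconstructed_test_set,    # from model.evaluate()
    latent_test_sets,          # from model.evaluate()
    model,
    decomposition_methods="PCA",
    analyses_directory="analyses"
)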

Argument defaults

Below are listed the defaults for some optional arguments:

{
	"data": {
		"format": "infer",
		"directory": "data",
		"map_features": false,
		"feature_selection": [],
		"example_filter": [],
		"preprocessing_methods": [],
		"noisy_preprocessing_methods": [],
		"split_data_set": false,
		"splitting_method": "default",
		"splitting_fraction": 0.9
	},
	"analyses": {
		"directory": "analyses",
		"decomposition_method": "PCA",
		"decomposition_dimensionality": 2,
		"highlight_feature_indices": [],
		"included_analyses": "standard",
		"analysis_level": "normal",
		"export_options": []
	},
	"models": {
		"directory": "models",
		"type": "VAE",
		"latent_size": 2,
		"hidden_sizes": [100],
		"number_of_samples": {
		    "training": 1,
		    "evaluation": 1
		},
		"latent_distribution": {
			"VAE": "gaussian",
			"GMVAE": "gaussian mixture"
		},
		"number_of_classes": 1,
		"parameterise_latent_posterior": false,
		"inference_architecture": "MLP",
		"generative_architecture": "MLP",
		"reconstruction_distribution": "poisson",
		"number_of_reconstruction_classes": 0,
		"prior_probabilities_method": "uniform",
		"number_of_warm_up_epochs": 0,
		"kl_weight": 1,
		"proportion_of_free_nats_for_y_kl_divergence": 0.0,
		"minibatch_normalisation": true,
		"batch_correction": false,
		"dropout_keep_probabilities": [],
		"count_sum": false,
		"number_of_epochs": 200,
		"minibatch_size": 100,
		"learning_rate": 1e-4,
		"sample_size": 0,
		"run_id": "",
		"new_run": false,
		"reset_training": false
	},
	"evaluation": {
		"data_set_kind": "test",
		"prediction_training_set_kind": "training",
		"prediction_method": "",
		"model_versions": "all"
	},
	"cross_analysis": {
		"log_summary": false
	}
}
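
These defaults are plain JSON, so they can be inspected programmatically; a small sketch, assuming the listing above has been saved to a hypothetical defaults.json file:

import json

with open("defaults.json") as defaults_file:
    defaults = json.load(defaults_file)

print(defaults["models"]["reconstruction_distribution"])  # "poisson"
print(defaults["data"]["splitting_fraction"])             # 0.9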