pydeseq2.dds.DeseqDataSet

class DeseqDataSet(*, adata=None, counts=None, metadata=None, design='~condition', design_factors=None, continuous_factors=None, ref_level=None, fit_type='parametric', size_factors_fit_type='ratio', control_genes=None, min_mu=0.5, min_disp=1e-08, max_disp=10.0, refit_cooks=True, min_replicates=7, beta_tol=1e-08, n_cpus=None, inference=None, quiet=False, low_memory=False)

Bases: AnnData

A class to implement dispersion and log fold-change (LFC) estimation.

The DeseqDataSet extends the AnnData class. As such, it implements the same methods and attributes, in addition to those that are specific to pydeseq2. Dispersions and LFCs are estimated following the DESeq2 pipeline [LHA14].

Parameters:
  • adata (anndata.AnnData) – AnnData from which to initialize the DeseqDataSet. Must have counts (‘X’) and sample metadata (‘obs’) fields. If None, both counts and metadata arguments must be provided.

  • counts (pandas.DataFrame) – Raw counts. One column per gene, rows are indexed by sample barcodes.

  • metadata (pandas.DataFrame) – DataFrame containing sample metadata. Must be indexed by sample barcodes.

  • design (str or pandas.DataFrame) – Model design. Can be either a pandas DataFrame representing a design matrix, or a formulaic formula in the format 'x + z' or '~x+z'. If a design matrix is provided, DeseqStats built from this DeseqDataSet will only support contrasts in the form of numeric vectors. (Default: '~condition').

  • design_factors (str or list, optional) – Deprecated. An optional list of factors to include in the design matrix. Will be removed in a future release. (default: None).

  • continuous_factors (list, optional) – Deprecated. Continuous factors are now automatically detected from the design, or cast to categorical using the C() operator in the formula. (default: None).

  • ref_level (list, optional) – Deprecated.

  • fit_type (str) – Either "parametric" or "mean" for the type of fitting of dispersions to the mean intensity. "parametric": fit a dispersion-mean relation via a robust gamma-family GLM. "mean": use the mean of gene-wise dispersion estimates. Will set the fit type for the DEA and the vst transformation. If needed, it can be set separately for each method. (default: "parametric").

  • size_factors_fit_type (str) – The normalization method to use: "ratio", "poscounts" or "iterative". "ratio": fit size factors using the median-of-ratios method. "poscounts": fit size factors using the method implemented in DESeq2 for the case where there may be few or no genes which have no zero values. "iterative": fit size factors iteratively. (default: "ratio").

  • control_genes (ndarray, list, or pandas.Index, optional) – Genes to use as control genes for size factor fitting. If provided, size factors will be fit using only these genes. This is useful when certain genes are known to be invariant across conditions (e.g., housekeeping genes). Any valid AnnData indexer (bool array, integer positions, or gene name strings) can be used. (default: None).

  • min_mu (float) – Threshold for mean estimates. (default: 0.5).

  • min_disp (float) – Lower threshold for dispersion parameters. (default: 1e-8).

  • max_disp (float) – Upper threshold for dispersion parameters. Note: The threshold that is actually enforced is max(max_disp, len(counts)). (default: 10).

  • refit_cooks (bool) – Whether to refit Cook’s outliers. (default: True).

  • min_replicates (int) – Minimum number of replicates a condition should have to allow refitting its samples. (default: 7).

  • beta_tol (float) –

    Stopping criterion for IRWLS. (default: 1e-8).

    \[\vert dev_t - dev_{t+1}\vert / (\vert dev \vert + 0.1) < \beta_{tol}.\]

  • n_cpus (int) – Number of cpus to use. If None and if inference is not provided, all available cpus will be used by the DefaultInference. If both are specified (i.e., n_cpus and inference are not None), it will try to override the n_cpus attribute of the inference object. (default: None).

  • inference (Inference) – Instance of an object implementing the inference routines. (default: DefaultInference).

  • quiet (bool) – Suppress deseq2 status updates during fit. (default: False).

  • low_memory (bool) – Remove intermediate data structures from .layers and from .obsm that are no longer necessary after they are used during deseq2 run, such as Cook’s distances. (default: False)
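Two of the numeric parameters above may be easier to read as code. The following is a minimal sketch, not pydeseq2's implementation: the helper names `irwls_converged` and `effective_max_disp` are hypothetical, the unsubscripted `dev` in the denominator of the `beta_tol` criterion is assumed to be the latest deviance, and `len(counts)` is assumed to be the number of samples (rows of `counts`).

```python
def irwls_converged(dev_prev, dev_new, beta_tol=1e-8):
    """Stopping rule |dev_t - dev_{t+1}| / (|dev| + 0.1) < beta_tol.

    Assumption: the denominator uses the latest deviance (dev_new).
    """
    return abs(dev_prev - dev_new) / (abs(dev_new) + 0.1) < beta_tol


def effective_max_disp(max_disp, n_samples):
    """Upper dispersion bound actually enforced: max(max_disp, len(counts))."""
    return max(max_disp, n_samples)


# A relative change of about 1e-3 is far above the default tolerance.
print(irwls_converged(1000.0, 999.0))
# An unchanged deviance satisfies the criterion.
print(irwls_converged(1000.0, 1000.0))
# With 24 samples, the default max_disp of 10 is superseded by the sample count.
print(effective_max_disp(10.0, 24))
```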

X

A ‘number of samples’ x ‘number of genes’ count data matrix.

obs

Key-indexed one-dimensional observations annotation of length ‘number of samples’. Used to store design factors.

var

Key-indexed one-dimensional gene-level annotation of length ‘number of genes’.

uns

Key-indexed unstructured annotation.

obsm

Key-indexed multi-dimensional observations annotation of length ‘number of samples’. Stores “design_matrix” and “size_factors”, among others.

varm

Key-indexed multi-dimensional gene annotation of length ‘number of genes’. Stores “dispersions” and “LFC”, among others.

layers

Key-indexed multi-dimensional arrays aligned to dimensions of X, e.g. “cooks”.

n_processes

Number of cpus to use for multiprocessing.

Type:

int

non_zero_idx

Indices of genes whose counts are not all zero.

Type:

ndarray

non_zero_genes

Index of genes whose counts are not all zero.

Type:

pandas.Index
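As an illustration of what non_zero_idx and non_zero_genes hold, the sketch below recomputes them for a toy samples x genes matrix in plain Python. The gene names and counts are made up, and the real attributes are an ndarray and a pandas.Index rather than lists.

```python
# Toy count matrix: rows are samples, columns are genes (made-up values).
counts = [
    [5, 0, 12],
    [3, 0, 0],
    [8, 0, 7],
]
gene_names = ["geneA", "geneB", "geneC"]

# Keep the positions of genes that have at least one non-zero count.
non_zero_idx = [
    j for j in range(len(gene_names)) if any(row[j] != 0 for row in counts)
]
# The same selection expressed as gene names.
non_zero_genes = [gene_names[j] for j in non_zero_idx]

print(non_zero_idx)
print(non_zero_genes)
```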

counts_to_refit

Read counts after replacement, containing only genes for which dispersions and LFCs must be fitted again.

Type:

anndata.AnnData

new_all_zeroes_genes

Genes which have only zero counts after outlier replacement.

Type:

pandas.Index

quiet

Suppress deseq2 status updates during fit.

Type:

bool

logmeans

Gene-wise mean log counts, computed in preprocessing.deseq2_norm_fit().

Type:

numpy.ndarray

filtered_genes

Genes whose log means are different from -∞, computed in preprocessing.deseq2_norm_fit().

Type:

numpy.ndarray

factor_storage

A dictionary storing metadata for each factor processed by the custom materializer (only if design is input as a formula).

Type:

dict

variable_to_factors

A dictionary mapping variable names to factor names (only if design is input as a formula).

Type:

dict

References

[LHA14]

Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):1–21, 2014. doi:10.1186/s13059-014-0550-8.

Methods

calculate_cooks()

Compute Cook's distance for outlier detection.

cond(**kwargs)

Get a contrast vector representing a specific condition.

deseq2([fit_type])

Perform dispersion and log fold-change (LFC) estimation.

fit_LFC()

Fit log fold change (LFC) coefficients.

fit_MAP_dispersions()

Fit Maximum a Posteriori dispersion estimates.

fit_dispersion_prior()

Fit dispersion variance priors and standard deviation of log-residuals.

fit_dispersion_trend([vst])

Fit the dispersion trend curve.

fit_genewise_dispersions([vst])

Fit gene-wise dispersion estimates.

fit_size_factors([fit_type, control_genes])

Fit sample-wise deseq2 normalization (size) factors.

plot_dispersions([log, save_path])

Plot dispersions.

refit()

Refit Cook outliers.

to_picklable_anndata()

Convert the DeseqDataSet to a picklable AnnData object.

vst([use_design, fit_type])

Fit a variance stabilizing transformation, and apply it to normalized counts.

calculate_cooks()

Compute Cook’s distance for outlier detection.

Measures the contribution of a single entry to the output of LFC estimation.

Return type:

None

cond(**kwargs)

Get a contrast vector representing a specific condition.

Parameters:

**kwargs – Column/value pairs.

Returns:

A contrast vector that aligns to the columns of the design matrix.

Return type:

ndarray
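To make "a contrast vector that aligns to the columns of the design matrix" concrete, here is a hand-built sketch for a single two-level factor. It does not call pydeseq2: the column names, the levels "A" and "B", and the helper `condition_vector` are illustrative assumptions. The idea is that each condition maps to the design-matrix row a sample in that condition would get, and a pairwise contrast is the difference of two such vectors.

```python
# Hypothetical design-matrix columns for a factor "condition" with
# reference level "A" and one other level "B" (names are illustrative).
design_columns = ["Intercept", "condition[T.B]"]


def condition_vector(level):
    """Design-matrix row for a sample whose condition equals `level`."""
    return [1.0, 1.0 if level == "B" else 0.0]


vec_a = condition_vector("A")
vec_b = condition_vector("B")

# Contrast "B vs A": only the coefficient of the "B" column remains.
contrast = [b - a for a, b in zip(vec_a, vec_b)]
print(dict(zip(design_columns, contrast)))
```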

contrast(*args, **kwargs)

Get a contrast for a simple pairwise comparison.

cooks_outlier()

Filter p-values based on Cook’s outliers.

deseq2(fit_type=None)

Perform dispersion and log fold-change (LFC) estimation.

Wrapper for the first part of the PyDESeq2 pipeline.

Parameters:

fit_type (str) –

Either None, "parametric" or "mean" for the type of fitting of dispersions to the mean intensity. "parametric": fit a dispersion-mean relation via a robust gamma-family GLM. "mean": use the mean of gene-wise dispersion estimates.

If None, the fit_type provided at class initialization is used. (default: None).

Return type:

None

disp_function(x)

Return the dispersion trend function at x.
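The two trend types can be sketched as plain functions. This assumes the parametric trend has the form a0 + a1 / mean, as described for DESeq2 [LHA14]. The function names and the coefficient values are hypothetical, and the "mean" variant is simplified to a constant.

```python
def parametric_trend(mean, a0, a1):
    """Assumed parametric form: dispersion decreases towards a0 as the mean grows."""
    return a0 + a1 / mean


def mean_trend(mean, mean_disp):
    """'mean' fit type: the same dispersion value regardless of the mean."""
    return mean_disp


# Hypothetical coefficients, for illustration only.
a0, a1 = 0.05, 2.0
for mu in (1.0, 10.0, 1000.0):
    print(mu, parametric_trend(mu, a0, a1), mean_trend(mu, 0.1))
```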

fit_LFC()

Fit log fold change (LFC) coefficients.

In the 2-level setting, the first coefficient (the intercept) corresponds to the base mean, while the second coefficient is the actual LFC, in natural log scale.

Return type:

None
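Because the fitted coefficients are on the natural log scale, they need to be divided by ln(2) to be read as log2 fold changes. A minimal sketch, with a hypothetical helper name:

```python
import math


def ln_to_log2(lfc_ln):
    """Convert a natural-log fold change to a log2 fold change."""
    return lfc_ln / math.log(2)


# A coefficient of ln(2) means the expression doubles between the two levels.
print(ln_to_log2(math.log(2)))
```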

fit_MAP_dispersions()

Fit Maximum a Posteriori dispersion estimates.

After MAP dispersions are fit, filter genes for which we don’t apply shrinkage.

Return type:

None

fit_dispersion_prior()

Fit dispersion variance priors and standard deviation of log-residuals.

The computation is based on genes whose dispersions are above 100 * min_disp.

Note: when the design matrix has fewer than 3 degrees of freedom, the estimate of log dispersions is likely to be imprecise.

Return type:

None

fit_dispersion_trend(vst=False)

Fit the dispersion trend curve.

Parameters:

vst (bool) – Whether the dispersion trend curve is being fitted as part of the VST pipeline. (default: False).

Return type:

None

fit_genewise_dispersions(vst=False)

Fit gene-wise dispersion estimates.

Fits a negative binomial per gene, independently.

Parameters:

vst (bool) – Whether the dispersion estimates are being fitted as part of the VST pipeline. (default: False).

Return type:

None

fit_size_factors(fit_type=None, control_genes=None)

Fit sample-wise deseq2 normalization (size) factors.

Uses the median-of-ratios method: see pydeseq2.preprocessing.deseq2_norm(), unless each gene has at least one sample with zero read counts, in which case it switches to the iterative method.

Also available is the ‘poscounts’ method implemented in DESeq2 for the single-cell or metagenomics use case where there may be few or no features which have no zero values. In this situation, size factors can depend on a very small number of features (or only one feature) leading to incorrect inference. This method for calculating size factors will only exclude genes which have all-0 values (and are not amenable to inference anyway).

The “poscounts” method calculates the n-th root of the product of the non-zero (positive) counts.
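The modified geometric mean can be sketched for a single gene as follows. This is an illustration rather than the library's code: the function name is hypothetical, and n is assumed to be the total number of samples (including those with zero counts), as in R DESeq2.

```python
import math


def poscounts_geometric_mean(gene_counts):
    """n-th root of the product of the positive counts of one gene.

    Assumption: n is the total number of samples, zeros included.
    """
    positives = [c for c in gene_counts if c > 0]
    if not positives:
        # All-zero genes are excluded from size factor estimation.
        return 0.0
    return math.exp(sum(math.log(c) for c in positives) / len(gene_counts))


# Two of four samples are zero: the result is (4 * 16) ** (1 / 4).
print(poscounts_geometric_mean([4, 0, 16, 0]))
```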

Control genes can be optionally provided; if so, size factors will be fit to only the genes in this argument. This is the same functionality as controlGenes in R DESeq2. Any valid AnnData indexer (bool, int position, var_name string) is accepted.

Parameters:
  • fit_type (str) – The normalization method to use: “ratio”, “poscounts” or “iterative”. (default: "ratio").

  • control_genes (ndarray, list, or pandas.Index, optional) – Genes to use as control genes for size factor fitting. If None, all genes are used. Note that manually passing control genes here will override the DeseqDataSet control_genes attribute. (default: None).

Return type:

None
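For orientation, the median-of-ratios idea can be written in a few lines of plain Python. This is a sketch under stated assumptions and not pydeseq2.preprocessing.deseq2_norm() itself: the function name is hypothetical, genes with any zero count are dropped (their mean log count is -∞), and no control genes are used.

```python
import math
from statistics import median


def median_of_ratios(counts):
    """Size factors for a samples x genes matrix given as a list of lists."""
    n_samples, n_genes = len(counts), len(counts[0])

    # Gene-wise mean of log counts; -inf as soon as one sample has a zero.
    log_means = []
    for j in range(n_genes):
        column = [counts[i][j] for i in range(n_samples)]
        if min(column) > 0:
            log_means.append(sum(math.log(c) for c in column) / n_samples)
        else:
            log_means.append(-math.inf)

    # Only genes with a finite log mean take part in the estimation.
    kept = [j for j in range(n_genes) if log_means[j] != -math.inf]
    if not kept:
        raise ValueError("Every gene has at least one zero count.")

    # Size factor of a sample: exp of its median log ratio to the gene means.
    return [
        math.exp(median(math.log(counts[i][j]) - log_means[j] for j in kept))
        for i in range(n_samples)
    ]


# The second sample is sequenced twice as deeply; the last gene has a zero.
print(median_of_ratios([[10, 20, 30, 0], [20, 40, 60, 5]]))
```

The ValueError branch corresponds to the situation in which the text above says the method switches to another estimator.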

plot_dispersions(log=True, save_path=None, **kwargs)

Plot dispersions.

Make a scatter plot with genewise dispersions, trend curve and final (MAP) dispersions.

Parameters:
  • log (bool) – Whether to log scale x and y axes. (default: True).

  • save_path (str, optional) – The path where to save the plot. If left None, the plot won’t be saved. (default: None).

  • **kwargs – Keyword arguments for the scatter plot.

Return type:

None

refit()

Refit Cook outliers.

Replace values that are filtered out based on Cook’s distance with imputed values, and then re-run the whole DESeq2 pipeline on the replaced values.

Return type:

None
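Which samples are eligible for replacement depends on the min_replicates parameter. The sketch below only illustrates that rule and is not the library's code: the helper name and the condition labels are hypothetical, and "minimum" is read as "at least min_replicates samples".

```python
from collections import Counter


def refittable_conditions(sample_conditions, min_replicates=7):
    """Conditions with enough replicates for their samples to be refitted."""
    replicates = Counter(sample_conditions)
    return {cond for cond, n in replicates.items() if n >= min_replicates}


# Eight replicates of "A" and three of "B": only "A" qualifies by default.
conditions = ["A"] * 8 + ["B"] * 3
print(refittable_conditions(conditions))
```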

to_picklable_anndata()

Convert the DeseqDataSet to a picklable AnnData object.

Builds an AnnData object from the DeseqDataSet with the same data, but converts the design matrix to a DataFrame to remove the formulaic model_spec attribute, which is not picklable.

Returns:

The AnnData object, without DeseqDataSet unpicklable attributes.

Return type:

anndata.AnnData

vst(use_design=False, fit_type=None)

Fit a variance stabilizing transformation, and apply it to normalized counts.

Results are stored in dds.layers["vst_counts"].

Parameters:
  • use_design (bool) – Whether to use the full design matrix to fit dispersions and the trend curve. If False, only an intercept is used. (default: False).

  • fit_type (str) –

    • None: use the fit_type provided at initialization to fit the dispersion trend curve.

    • "parametric": fit a dispersion-mean relation via a robust gamma-family GLM.

    • "mean": use the mean of gene-wise dispersion estimates.

    (default: None).

Return type:

None

vst_fit(use_design=False)

Fit a variance stabilizing transformation.

This method should be called before vst_transform.

Results are stored in dds.layers["vst_counts"].

Parameters:

use_design (bool) – Whether to use the full design matrix to fit dispersions and the trend curve. If False, only an intercept is used. Only useful if fit_type = "parametric". (default: False).

Return type:

None

vst_transform(counts=None)

Apply the variance stabilizing transformation.

Uses the results from the vst_fit method.

Parameters:

counts (numpy.ndarray) – Counts to transform. If None, use the counts from the current dataset. (default: None).

Returns:

Variance stabilized counts.

Return type:

numpy.ndarray

Raises:

RuntimeError – If the size factors were not fitted before calling this method.

property variables

Get the names of the variables used in the model definition.