pydeseq2.dds.DeseqDataSet

class DeseqDataSet(*, adata=None, counts=None, metadata=None, design_factors='condition', continuous_factors=None, ref_level=None, min_mu=0.5, min_disp=1e-08, max_disp=10.0, refit_cooks=True, min_replicates=7, beta_tol=1e-08, inference=None, quiet=False)

Bases: AnnData

A class to implement dispersion and log fold-change (LFC) estimation.

The DeseqDataSet extends the AnnData class. As such, it implements the same methods and attributes, in addition to those that are specific to pydeseq2. Dispersions and LFCs are estimated following the DESeq2 pipeline [LHA14].

Parameters

adata (anndata.AnnData) – AnnData from which to initialize the DeseqDataSet. Must have counts (‘X’) and sample metadata (‘obs’) fields. If None, both counts and metadata arguments must be provided.
counts (pandas.DataFrame) – Raw counts. One column per gene, rows are indexed by sample barcodes.
metadata (pandas.DataFrame) – DataFrame containing sample metadata. Must be indexed by sample barcodes.
design_factors (str or list) – Name of the columns of metadata to be used as design variables. (default: 'condition').
continuous_factors (list or None) – An optional list of continuous (as opposed to categorical) factors. Any factor not in continuous_factors will be considered categorical (default: None).
ref_level (list or None) – An optional list of two strings of the form ["factor", "test_level"] specifying the factor of interest and the reference (control) level against which we’re testing, e.g. ["condition", "A"]. (default: None).
min_mu (float) – Threshold for mean estimates. (default: 0.5).
min_disp (float) – Lower threshold for dispersion parameters. (default: 1e-8).
max_disp (float) – Upper threshold for dispersion parameters. Note: The threshold that is actually enforced is max(max_disp, len(counts)). (default: 10).
refit_cooks (bool) – Whether to refit cooks outliers. (default: True).
min_replicates (int) – Minimum number of replicates a condition should have to allow refitting its samples. (default: 7).
beta_tol (float) –
Stopping criterion for IRWLS. (default: 1e-8).

\[\vert dev_t - dev_{t+1}\vert / (\vert dev \vert + 0.1) < \beta_{tol}.\]
inference (Inference) – Implementation of inference routines object instance. (default: DefaultInference).
quiet (bool) – Suppress deseq2 status updates during fit.

Return type

None

X: A ‘number of samples’ x ‘number of genes’ count data matrix.

obs: Key-indexed one-dimensional observations annotation of length ‘number of samples”. Used to store design factors.

var: Key-indexed one-dimensional gene-level annotation of length ‘number of genes’.

uns: Key-indexed unstructured annotation.

obsm: Key-indexed multi-dimensional observations annotation of length ‘number of samples’. Stores “design_matrix” and “size_factors”, among others.

varm: Key-indexed multi-dimensional gene annotation of length ‘number of genes’. Stores “dispersions” and “LFC”, among others.

layers: Key-indexed multi-dimensional arrays aligned to dimensions of X, e.g. “cooks”.

n_processes

Number of cpus to use for multiprocessing.

Type: int

non_zero_idx

Indices of genes that have non-uniformly zero counts.

Type: ndarray

non_zero_genes

Index of genes that have non-uniformly zero counts.

Type: pandas.Index

counts_to_refit

Read counts after replacement, containing only genes for which dispersions and LFCs must be fitted again.

Type: anndata.AnnData

new_all_zeroes_genes

Genes which have only zero counts after outlier replacement.

Type: pandas.Index

quiet

Suppress deseq2 status updates during fit.

Type: bool

fit_type

Either “parametric” or “mean” for the type of fitting of dispersions to the mean intensity. “parametric”: fit a dispersion-mean relation via a robust gamma-family GLM. “mean”: use the mean of gene-wise dispersion estimates. (default: "parametric").

Type: str

logmeans

Gene-wise mean log counts, computed in preprocessing.deseq2_norm_fit().

Type: numpy.ndarray

filtered_genes

Genes whose log means are different from -∞, computed in preprocessing.deseq2_norm_fit().

Type: numpy.ndarray

References

LHA14: Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome biology, 15(12):1–21, 2014. doi:10.1186/s13059-014-0550-8.

Methods

`calculate_cooks`()	Compute Cook's distance for outlier detection.
`deseq2`()	Perform dispersion and log fold-change (LFC) estimation.
`fit_LFC`()	Fit log fold change (LFC) coefficients.
`fit_MAP_dispersions`()	Fit Maximum a Posteriori dispersion estimates.
`fit_dispersion_prior`()	Fit dispersion variance priors and standard deviation of log-residuals.
`fit_dispersion_trend`()	Fit the dispersion trend coefficients.
`fit_genewise_dispersions`()	Fit gene-wise dispersion estimates.
`fit_size_factors`([fit_type])	Fit sample-wise deseq2 normalization (size) factors.
`plot_dispersions`([log, save_path])	Plot dispersions.
`refit`()	Refit Cook outliers.
`vst`([use_design, fit_type])	Fit a variance stabilizing transformation, and apply it to normalized counts.

calculate_cooks()

Compute Cook’s distance for outlier detection.

Measures the contribution of a single entry to the output of LFC estimation.

Return type: None

deseq2()

Perform dispersion and log fold-change (LFC) estimation.

Wrapper for the first part of the PyDESeq2 pipeline.

Return type: None

fit_LFC()

Fit log fold change (LFC) coefficients.

In the 2-level setting, the intercept corresponds to the base mean, while the second is the actual LFC coefficient, in natural log scale.

Return type: None

fit_MAP_dispersions()

Fit Maximum a Posteriori dispersion estimates.

After MAP dispersions are fit, filter genes for which we don’t apply shrinkage.

Return type: None

fit_dispersion_prior()

Fit dispersion variance priors and standard deviation of log-residuals.

The computation is based on genes whose dispersions are above 100 * min_disp.

Note: when the design matrix has fewer than 3 degrees of freedom, the estimate of log dispersions is likely to be imprecise.

Return type: None

fit_dispersion_trend()

Fit the dispersion trend coefficients.

\(f(\mu) = \alpha_1/\mu + a_0\).

Return type: None

fit_genewise_dispersions()

Fit gene-wise dispersion estimates.

Fits a negative binomial per gene, independently.

Return type: None

fit_size_factors(fit_type='ratio')

Fit sample-wise deseq2 normalization (size) factors.

Uses the median-of-ratios method: see pydeseq2.preprocessing.deseq2_norm(), unless each gene has at least one sample with zero read counts, in which case it switches to the iterative method.

Parameters: fit_type (str) – The normalization method to use (default: "ratio").
Return type: None

plot_dispersions(log=True, save_path=None, **kwargs)

Plot dispersions.

Make a scatter plot with genewise dispersions, trend curve and final (MAP) dispersions.

Parameters

log (bool) – Whether to log scale x and y axes (default=True).
save_path (str or None) – The path where to save the plot. If left None, the plot won’t be saved (default=None).
**kwargs – Keyword arguments for the scatter plot.

Return type

None

refit()

Refit Cook outliers.

Replace values that are filtered out based on the Cooks distance with imputed values, and then re-run the whole DESeq2 pipeline on replaced values.

Return type: None

vst(use_design=False, fit_type='parametric')

Fit a variance stabilizing transformation, and apply it to normalized counts.

Results are stored in dds.layers["vst_counts"].

Parameters

use_design (bool) – Whether to use the full design matrix to fit dispersions and the trend curve. If False, only an intercept is used. (default: False).
fit_type (str) – Either “parametric” or “mean” for the type of fitting of dispersions to the mean intensity. “parametric”: fit a dispersion-mean relation via a robust gamma-family GLM. “mean”: use the mean of gene-wise dispersion estimates. (default: "parametric").

Return type

None

vst_fit(use_design=False, fit_type='parametric')

Fit a variance stabilizing transformation.

Results are stored in dds.layers["vst_counts"].

Parameters

use_design (bool) – Whether to use the full design matrix to fit dispersions and the trend curve. If False, only an intercept is used. (default: False).
fit_type (str) – Either “parametric” or “mean” for the type of fitting of dispersions to the mean intensity. parametric - fit a dispersion-mean relation via a robust gamma-family GLM. mean - use the mean of gene-wise dispersion estimates. (default: "parametric").

Return type

None

vst_transform(counts=None)

Apply the variance stabilizing transformation.

Uses the results from the vst_fit method.

Parameters: counts (numpy.ndarray) – Counts to transform. If None, use the counts from the current dataset. (default: None).
Returns: Variance stabilized counts.
Return type: numpy.ndarray