pydeseq2.dds.DeseqDataSet

class DeseqDataSet(counts, clinical, design_factors='condition', reference_level=None, min_mu=0.5, min_disp=1e-08, max_disp=10.0, refit_cooks=True, min_replicates=7, beta_tol=1e-08, n_cpus=None, batch_size=128, joblib_verbosity=0)

Bases: AnnData

A class to implement dispersion and log fold-change (LFC) estimation.

The DeseqDataSet extends the AnnData class. As such, it implements the same methods and attributes, in addition to those that are specific to pydeseq2. Dispersions and LFCs are estimated following the DESeq2 pipeline [LHA14].

Parameters
  • counts (pandas.DataFrame) – Raw counts. One column per gene, rows are indexed by sample barcodes.

  • clinical (pandas.DataFrame) – DataFrame containing clinical information. Must be indexed by sample barcodes.

  • design_factors (str or list) – Name(s) of the column(s) of clinical to use as design variables. If a list, the last factor is treated as the variable of interest by default. Only bi-level factors are supported. (default: 'condition').

  • reference_level (str) – The factor to use as a reference. Must be one of the values taken by the design. If None, the reference will be chosen alphabetically (last in order). (default: None).

  • min_mu (float) – Threshold for mean estimates. (default: 0.5).

  • min_disp (float) – Lower threshold for dispersion parameters. (default: 1e-8).

  • max_disp (float) – Upper threshold for dispersion parameters. NB: The threshold that is actually enforced is max(max_disp, len(counts)). (default: 10).

  • refit_cooks (bool) – Whether to refit Cook's outliers. (default: True).

  • min_replicates (int) – Minimum number of replicates a condition should have to allow refitting its samples. (default: 7).

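  • beta_tol (float) – Stopping criterion for IRWLS: iterations stop once the relative change in deviance between two successive steps falls below beta_tol. (default: 1e-8).

    \[\vert dev_t - dev_{t+1}\vert / (\vert dev \vert + 0.1) < \beta_{tol}.\]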

  • n_cpus (int) – Number of CPUs to use. If None, all available CPUs will be used. (default: None).

  • batch_size (int) – Number of tasks to allocate to each joblib parallel worker. (default: 128).

  • joblib_verbosity (int) – The verbosity level for joblib tasks. The higher the value, the more updates are reported. (default: 0).
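As a rough illustration of two of the parameters above, the sketch below evaluates the beta_tol stopping rule and the upper dispersion bound that is actually enforced, in plain Python. The helper names are hypothetical and not part of pydeseq2, and the unsubscripted dev in the denominator of the criterion is assumed here to be the most recent deviance.

```python
def irwls_converged(dev_prev, dev_curr, beta_tol=1e-8):
    """Return True when the relative deviance change is below beta_tol."""
    # Assumption: the unsubscripted |dev| in the criterion is the latest deviance.
    return abs(dev_prev - dev_curr) / (abs(dev_curr) + 0.1) < beta_tol


def effective_max_disp(max_disp, n_samples):
    """Upper dispersion bound actually enforced: max(max_disp, len(counts))."""
    # len(counts) is the number of rows of the counts DataFrame, i.e. samples.
    return max(max_disp, n_samples)


print(irwls_converged(100.0, 99.0))    # large relative change
print(irwls_converged(100.0, 100.0))   # no change at all
print(effective_max_disp(10.0, 50))    # more samples than max_disp
```

With the default max_disp of 10, the bound is therefore driven by the sample count as soon as the dataset has more than 10 samples.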

Return type

None

X

A ‘number of samples’ x ‘number of genes’ count data matrix.

obs

Key-indexed one-dimensional observations annotation of length ‘number of samples’. Used to store design factors.

var

Key-indexed one-dimensional gene-level annotation of length ‘number of genes’.

uns

Key-indexed unstructured annotation.

obsm

Key-indexed multi-dimensional observations annotation of length ‘number of samples’. Stores “design_matrix” and “size_factors”, among others.

varm

Key-indexed multi-dimensional gene annotation of length ‘number of genes’. Stores “dispersions” and “LFC”, among others.

layers

Key-indexed multi-dimensional arrays aligned to dimensions of X, e.g. “cooks”.

n_processes

Number of CPUs to use for multiprocessing.

Type

int

non_zero_idx

Indices of genes whose counts are not all zero.

Type

ndarray

non_zero_genes

Index of genes whose counts are not all zero.

Type

pandas.Index

counts_to_refit

Read counts after replacement, containing only genes for which dispersions and LFCs must be fitted again.

Type

anndata.AnnData

new_all_zeroes_genes

Genes which have only zero counts after outlier replacement.

Type

pandas.Index

References

LHA14

Michael I. Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):1–21, 2014. doi:10.1186/s13059-014-0550-8.

Methods

calculate_cooks()

Compute Cook's distance for outlier detection.

deseq2()

Perform dispersion and log fold-change (LFC) estimation.

fit_LFC()

Fit log fold change (LFC) coefficients.

fit_MAP_dispersions()

Fit Maximum a Posteriori dispersion estimates.

fit_dispersion_prior()

Fit dispersion variance priors and standard deviation of log-residuals.

fit_dispersion_trend()

Fit the dispersion trend coefficients.

fit_genewise_dispersions()

Fit gene-wise dispersion estimates.

fit_size_factors()

Fit sample-wise DESeq2 normalization (size) factors.

refit()

Refit Cook's outliers.

calculate_cooks()

Compute Cook’s distance for outlier detection.

Measures the contribution of a single entry to the output of LFC estimation.

Return type

None

deseq2()

Perform dispersion and log fold-change (LFC) estimation.

Wrapper for the first part of the PyDESeq2 pipeline.

Return type

None
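A minimal usage sketch, assuming the constructor signature documented on this page and that counts and clinical are pandas DataFrames indexed by the same sample barcodes. The wrapper function run_first_stage is hypothetical; only DeseqDataSet, deseq2() and the varm keys "dispersions" and "LFC" come from this page.

```python
def run_first_stage(counts, clinical, design_factors="condition"):
    """Fit dispersions and LFCs for the given counts and clinical DataFrames."""
    # Imported lazily so that this helper can be defined without pydeseq2 installed.
    from pydeseq2.dds import DeseqDataSet

    dds = DeseqDataSet(
        counts=counts,
        clinical=clinical,
        design_factors=design_factors,
    )
    # deseq2() returns None; the fitted values are stored on the object itself.
    dds.deseq2()
    return dds.varm["dispersions"], dds.varm["LFC"]
```

Because every fitting method returns None, results are read back from the AnnData slots (for example varm) after the call.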

fit_LFC()

Fit log fold change (LFC) coefficients.

In the 2-level setting, the first coefficient (the intercept) corresponds to the base mean, while the second coefficient is the actual LFC, expressed in natural log scale.

Return type

None
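Since the coefficients are stored in natural log scale, reporting them as log2 fold changes requires a change of base. The sketch below shows that conversion in plain Python; both helper names are hypothetical and not part of pydeseq2.

```python
import math


def ln_lfc_to_log2(lfc_ln):
    """Convert a natural-log fold change to a log2 fold change."""
    return lfc_ln / math.log(2)


def ln_lfc_to_fold_change(lfc_ln):
    """Convert a natural-log fold change to a plain fold change (ratio)."""
    return math.exp(lfc_ln)


# A coefficient of ln(2) corresponds to a doubling between the two levels.
print(ln_lfc_to_log2(math.log(2)))
print(ln_lfc_to_fold_change(math.log(2)))
```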

fit_MAP_dispersions()

Fit Maximum a Posteriori dispersion estimates.

After MAP dispersions are fit, filter genes for which we don’t apply shrinkage.

Return type

None

fit_dispersion_prior()

Fit dispersion variance priors and standard deviation of log-residuals.

The computation is based on genes whose dispersions are above 100 * min_disp.

Return type

None

fit_dispersion_trend()

Fit the dispersion trend coefficients.

\(f(\mu) = \alpha_1/\mu + \alpha_0\).

Return type

None
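To make the shape of the trend concrete, the sketch below evaluates the parametric curve for given coefficients. The function and the coefficient values are illustrative assumptions, with alpha_0 denoting the constant term and alpha_1 the coefficient of 1/mu; this is not the fitting routine itself.

```python
def dispersion_trend(mu, alpha_0, alpha_1):
    """Evaluate the trend f(mu) = alpha_1 / mu + alpha_0 for a positive mean."""
    if mu <= 0:
        raise ValueError("mu must be strictly positive")
    return alpha_1 / mu + alpha_0


# Illustrative coefficients only: the trend decreases towards alpha_0
# as the mean expression grows.
for mean in (1.0, 10.0, 1000.0):
    print(mean, dispersion_trend(mean, alpha_0=0.1, alpha_1=2.0))
```

The trend is dominated by alpha_1 / mu for weakly expressed genes and flattens to alpha_0 for highly expressed ones.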

fit_genewise_dispersions()

Fit gene-wise dispersion estimates.

Fits a negative binomial per gene, independently.

Return type

None
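For orientation, the sketch below writes out a negative binomial log-likelihood in the mean/dispersion parameterisation used by DESeq2, where the variance is mu + alpha * mu^2. It is a simplified illustration with a single shared mean per gene, and the function names are hypothetical; the actual fit uses sample-specific means and is not reproduced here.

```python
import math


def nb_logpmf(y, mu, alpha):
    """Log-probability of count y under NB with mean mu and dispersion alpha."""
    r = 1.0 / alpha  # "size" parameter of the negative binomial
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def nb_loglik(counts, mu, alpha):
    """Log-likelihood of one gene's counts, assuming a single shared mean."""
    return sum(nb_logpmf(y, mu, alpha) for y in counts)


# Compare two candidate dispersions for a toy gene.
toy_counts = [3, 5, 4, 6, 5, 7]
for candidate in (0.01, 1.0):
    print(candidate, nb_loglik(toy_counts, mu=5.0, alpha=candidate))
```

A gene-wise estimate would be the dispersion value that maximises this kind of likelihood for that gene alone.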

fit_size_factors()

Fit sample-wise DESeq2 normalization (size) factors.

Uses the median-of-ratios method: see pydeseq2.preprocessing.deseq2_norm().

Return type

None
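The median-of-ratios idea can be sketched in a few lines of plain Python: each sample's size factor is the median, over genes, of the ratio between its count and the gene's geometric mean across samples. This is a simplified illustration under the assumption that genes with any zero count are excluded; the function name is hypothetical and it is not the code of pydeseq2.preprocessing.deseq2_norm().

```python
import math
from statistics import median


def median_of_ratios(counts):
    """Size factors for counts given as a list of samples (lists of gene counts)."""
    n_genes = len(counts[0])
    # Log geometric mean per gene, keeping only genes with no zero count.
    log_geo_means = {}
    for g in range(n_genes):
        column = [sample[g] for sample in counts]
        if all(c > 0 for c in column):
            log_geo_means[g] = sum(math.log(c) for c in column) / len(column)
    if not log_geo_means:
        raise ValueError("every gene has at least one zero count")
    # One size factor per sample: exp of the median log-ratio over kept genes.
    size_factors = []
    for sample in counts:
        log_ratios = [math.log(sample[g]) - lgm for g, lgm in log_geo_means.items()]
        size_factors.append(math.exp(median(log_ratios)))
    return size_factors


# The second sample is sequenced twice as deeply as the first.
print(median_of_ratios([[10, 20, 30], [20, 40, 60]]))
```

Dividing each sample's counts by its size factor puts the samples on a comparable scale before dispersions are estimated.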

refit()

Refit Cook's outliers.

Replace values that are filtered out based on Cook's distance with imputed values, and then re-run the whole DESeq2 pipeline on the replaced values.

Return type

None