pydeseq2.DeseqDataSet.DeseqDataSet

class pydeseq2.DeseqDataSet.DeseqDataSet(counts, clinical, design_factors='condition', reference_level=None, min_mu=0.5, min_disp=1e-08, max_disp=10, refit_cooks=True, min_replicates=7, beta_tol=1e-08, n_cpus=None, batch_size=128, joblib_verbosity=0)

Bases: object

A class to implement dispersion and log fold-change (LFC) estimation.

Follows the DESeq2 pipeline [1].

Parameters
  • counts (pandas.DataFrame) – Raw counts. One column per gene, rows are indexed by sample barcodes.

  • clinical (pandas.DataFrame) – DataFrame containing clinical information. Must be indexed by sample barcodes.

  • design_factors (str or list[str]) – Name(s) of the column(s) of clinical to be used as design variables. If a list, the last factor will be considered the variable of interest by default. Only bi-level factors are supported. (default: ‘condition’).

  • reference_level (str) – The factor to use as a reference. Must be one of the values taken by the design. If None, the reference will be chosen alphabetically (last in order). (default: None).

  • min_mu (float) – Threshold for mean estimates. (default: 0.5).

  • min_disp (float) – Lower threshold for dispersion parameters. (default: 1e-8).

  • max_disp (float) – Upper threshold for dispersion parameters. NB: The threshold that is actually enforced is max(max_disp, len(counts)). (default: 10).

  • refit_cooks (bool) – Whether to refit Cook's outliers. (default: True).

  • min_replicates (int) – Minimum number of replicates a condition should have to allow refitting its samples. (default: 7).

  • beta_tol (float) – Stopping criterion for IRWLS: abs(dev - old_dev) / (abs(dev) + 0.1) < beta_tol. (default: 1e-8).

  • n_cpus (int) – Number of cpus to use. If None, all available cpus will be used. (default: None).

  • batch_size (int) – Number of tasks to allocate to each joblib parallel worker. (default: 128).

  • joblib_verbosity (int) – The verbosity level for joblib tasks. The higher the value, the more updates are reported. (default: 0).
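
The layout expected for counts and clinical can be sketched with pandas alone. The sample barcodes, gene names and count values below are hypothetical and only illustrate the shapes described above: samples in rows, genes in columns, and both frames indexed by the same barcodes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical sample barcodes and gene names, for illustration only.
samples = [f"sample{i}" for i in range(1, 7)]
genes = [f"gene{j}" for j in range(1, 5)]

# counts: one row per sample, one column per gene, raw integer counts.
counts = pd.DataFrame(
    rng.poisson(lam=50, size=(len(samples), len(genes))),
    index=samples,
    columns=genes,
)

# clinical: indexed by the same barcodes, with a two-level design factor.
clinical = pd.DataFrame(
    {"condition": ["A", "A", "A", "B", "B", "B"]},
    index=samples,
)

# Sanity checks worth running before building the dataset.
assert counts.index.equals(clinical.index)
assert clinical["condition"].nunique() == 2

# With PyDESeq2 installed, the constructor documented above would be called as:
# dds = DeseqDataSet(counts, clinical, design_factors="condition")
```

The constructor call is left as a comment so that the snippet runs without the library.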

design_matrix

A DataFrame with experiment design information (to split cohorts). Indexed by sample barcodes. Unexpanded, with intercept.

Type

pandas.DataFrame

n_processes

Number of cpus to use for multiprocessing.

Type

int

size_factors

DESeq normalization factors.

Type

pandas.Series

non_zero_genes

Index of genes whose counts are not all zero.

Type

pandas.Index

genewise_dispersions

Initial estimates of gene counts dispersions.

Type

pandas.Series

trend_coeffs

Coefficients of the trend curve: \(f(\mu) = \alpha_1/\mu + a_0\).

Type

pandas.Series
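
As an illustration of the trend curve, the sketch below evaluates \(f(\mu) = \alpha_1/\mu + a_0\) for a few mean values. The coefficient values and the index labels "a_0" and "alpha_1" are assumptions made for this example; the labels actually used in trend_coeffs are not specified here.

```python
import numpy as np
import pandas as pd


def dispersion_trend(mu, trend_coeffs):
    """Evaluate f(mu) = alpha_1 / mu + a_0 for positive mean value(s) mu."""
    mu = np.asarray(mu, dtype=float)
    return trend_coeffs["alpha_1"] / mu + trend_coeffs["a_0"]


# Hypothetical coefficients; the index labels are illustrative.
trend_coeffs = pd.Series({"a_0": 0.05, "alpha_1": 2.0})

means = np.array([1.0, 10.0, 100.0, 1000.0])
fitted = dispersion_trend(means, trend_coeffs)
```

With a positive alpha_1, the fitted dispersion decreases with the mean and approaches a_0 for highly expressed genes.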

fitted_dispersions

Genewise dispersions regressed on the trend curve.

Type

pandas.Series

prior_disp_var

Dispersion prior of genewise dispersions, used for dispersion shrinkage towards the trend curve.

Type

float

MAP_dispersions

MAP dispersions, after shrinkage towards the trend curve and before filtering.

Type

pandas.Series

dispersions

Final dispersion estimates, after filtering MAP outliers.

Type

pandas.Series

LFCs

Log-fold change and intercept parameters, in natural log scale.

Type

pandas.DataFrame
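
Since the coefficients are stored in natural log scale, reporting log2 fold changes requires dividing by ln(2). A minimal sketch, with hypothetical gene names, column names and values:

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients in natural log scale; column names are illustrative.
lfcs = pd.DataFrame(
    {
        "intercept": [np.log(100.0), np.log(20.0)],
        "condition": [np.log(2.0), -np.log(4.0)],
    },
    index=["gene1", "gene2"],
)

# Change of base: log2(x) = ln(x) / ln(2).
log2_fold_changes = lfcs["condition"] / np.log(2)
```

Here gene1 is doubled (log2 fold change of 1) and gene2 is divided by four (log2 fold change of -2).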

cooks

Cook's distances, used for outlier detection.

Type

pandas.DataFrame

replaceable

Whether counts are replaceable, i.e. if a given condition has enough samples.

Type

pandas.Series
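
One way to picture this flag, assuming that "enough samples" means at least min_replicates samples in the sample's condition (the exact comparison used by the library is an assumption here):

```python
import pandas as pd


def replaceable_samples(conditions, min_replicates=7):
    """Flag samples whose condition has at least `min_replicates` samples."""
    # Map each sample to the number of samples sharing its condition.
    n_per_condition = conditions.map(conditions.value_counts())
    return n_per_condition >= min_replicates


# Hypothetical cohort: 8 samples in condition A, 3 in condition B.
conditions = pd.Series(
    ["A"] * 8 + ["B"] * 3,
    index=[f"sample{i}" for i in range(11)],
    name="condition",
)
replaceable = replaceable_samples(conditions)
```

With the default min_replicates of 7, only the samples of condition A are flagged as replaceable in this example.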

replaced

Counts which were replaced.

Type

pandas.Series

replace_cooks

Cook's distances after replacement.

Type

pandas.DataFrame

counts_to_refit

Read counts after replacement, for which dispersions and LFCs must be fitted again.

Type

pandas.DataFrame

new_all_zeroes

Genes which have only zero counts after outlier replacement.

Type

pandas.Series

_rough_dispersions

Initial method-of-moments estimates of the dispersions.

Type

pandas.Series
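
For intuition, a textbook method-of-moments estimate for a negative binomial with variance \(\mu + \alpha \mu^2\) is \(\alpha = (s^2 - \bar{x}) / \bar{x}^2\). The sketch below applies it to simulated counts; it is a generic estimator, not necessarily the one the library uses internally, and the simulated data are hypothetical.

```python
import numpy as np
import pandas as pd


def moments_dispersions(normed_counts):
    """Per-gene method-of-moments dispersion: (variance - mean) / mean**2."""
    mean = normed_counts.mean(axis=0)
    var = normed_counts.var(axis=0, ddof=1)
    alpha = (var - mean) / mean**2
    # Negative estimates (variance below the mean) are clipped to zero.
    return alpha.clip(lower=0)


# Simulated counts with true dispersion 0.5 and mean 100 (genes with a zero
# mean would need to be excluded first, which is not handled here).
rng = np.random.default_rng(0)
true_alpha, mu = 0.5, 100.0
n = 1.0 / true_alpha
sim = pd.DataFrame(
    rng.negative_binomial(n=n, p=n / (n + mu), size=(500, 3)),
    columns=["gene1", "gene2", "gene3"],
)
rough = moments_dispersions(sim)
```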

References

[1] Love, M. I., Huber, W., & Anders, S. (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15(12), 1-21. <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8>

__init__(counts, clinical, design_factors='condition', reference_level=None, min_mu=0.5, min_disp=1e-08, max_disp=10, refit_cooks=True, min_replicates=7, beta_tol=1e-08, n_cpus=None, batch_size=128, joblib_verbosity=0)

Initialize the DeseqDataSet instance, computing the design matrix and the number of multiprocessing threads.

Methods

__init__(counts, clinical[, design_factors, ...])

Initialize the DeseqDataSet instance, computing the design matrix and the number of multiprocessing threads.

calculate_cooks()

Compute Cook's distance for outlier detection.

deseq2()

Perform dispersion and log fold-change (LFC) estimation.

fit_LFC()

Fit log fold change (LFC) coefficients.

fit_MAP_dispersions()

Fit Maximum a Posteriori dispersion estimates.

fit_dispersion_prior()

Fit dispersion variance priors and standard deviation of log-residuals.

fit_dispersion_trend()

Fit the dispersion trend coefficients.

fit_genewise_dispersions()

Fit gene-wise dispersion estimates.

fit_size_factors()

Fit sample-wise DESeq2 normalization (size) factors.

refit()

Refit Cook's outliers.

calculate_cooks()

Compute Cook’s distance for outlier detection.

Measures the contribution of a single entry to the output of LFC estimation.

deseq2()

Perform dispersion and log fold-change (LFC) estimation.

Wrapper for the first part of the PyDESeq2 pipeline.

fit_LFC()

Fit log fold change (LFC) coefficients.

In the 2-level setting, the first coefficient (the intercept) corresponds to the base mean, while the second coefficient is the actual LFC, in natural log scale.
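
The beta_tol parameter defines the IRWLS stopping criterion abs(dev - old_dev) / (abs(dev) + 0.1) < beta_tol. Assuming this criterion governs the fit performed here (not stated explicitly on this page), it can be sketched on a hypothetical sequence of deviances:

```python
def irwls_converged(dev, old_dev, beta_tol=1e-8):
    """Relative change in deviance, with 0.1 added to guard against dev == 0."""
    return abs(dev - old_dev) / (abs(dev) + 0.1) < beta_tol


# Hypothetical deviance values over successive iterations.
deviances = [250.0, 120.0, 101.5, 100.01, 100.0, 100.0]

n_iter = None
old_dev = deviances[0]
for i, dev in enumerate(deviances[1:], start=1):
    if irwls_converged(dev, old_dev):
        n_iter = i  # index of the first iteration meeting the criterion
        break
    old_dev = dev
```

The loop stops as soon as the deviance no longer changes relative to its magnitude.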

fit_MAP_dispersions()

Fit Maximum a Posteriori dispersion estimates.

After MAP dispersions are fit, filter genes for which we don’t apply shrinkage.

fit_dispersion_prior()

Fit dispersion variance priors and standard deviation of log-residuals.

The computation is based on genes whose dispersions are above 100 * min_disp.
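
The gene filter described above can be sketched as follows. The dispersion values are hypothetical, and the way the library turns the resulting log-residuals into prior_disp_var is not shown; only the threshold 100 * min_disp comes from the description.

```python
import numpy as np
import pandas as pd

min_disp = 1e-8
genes = ["gene1", "gene2", "gene3", "gene4", "gene5"]

# Hypothetical gene-wise and trend-fitted dispersions.
genewise = pd.Series([1e-8, 5e-7, 0.02, 0.3, 1.5], index=genes)
fitted = pd.Series([0.05, 0.04, 0.03, 0.25, 1.0], index=genes)

# Keep only genes whose dispersion is above 100 * min_disp.
keep = genewise > 100 * min_disp

# Log-residuals of the gene-wise estimates around the trend, on kept genes.
log_residuals = np.log(genewise[keep]) - np.log(fitted[keep])
residual_std = log_residuals.std(ddof=1)
```

Genes whose estimate sits at or near the lower bound are excluded so that they do not distort the spread of the residuals.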

fit_dispersion_trend()

Fit the dispersion trend coefficients.

\[f(\mu) = \alpha_1/\mu + a_0.\]

fit_genewise_dispersions()

Fit gene-wise dispersion estimates.

Fits a negative binomial per gene, independently.
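
To illustrate what fitting each gene independently can look like, the sketch below maximizes a negative binomial log-likelihood over the dispersion for each gene, using a simple grid search between min_disp and max_disp. The grid search, the use of the sample mean as the fitted mean, and the simulated data are all assumptions of this example, not the library's optimizer.

```python
import math

import numpy as np


def nb_neg_log_likelihood(counts, mu, alpha):
    """Negative log-likelihood of counts under NB(mean=mu, dispersion=alpha)."""
    r = 1.0 / alpha
    ll = 0.0
    for y, m in zip(counts, mu):
        ll += (
            math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1.0)
            + r * math.log(r / (r + m))
            + y * math.log(m / (r + m))
        )
    return -ll


def fit_gene_dispersion(counts, mu, min_disp=1e-8, max_disp=10.0, n_grid=200):
    """Grid search for the dispersion minimizing the negative log-likelihood."""
    grid = np.logspace(np.log10(min_disp), np.log10(max_disp), n_grid)
    nll = [nb_neg_log_likelihood(counts, mu, a) for a in grid]
    return float(grid[int(np.argmin(nll))])


# Simulate two genes with mean 100 and true dispersions 0.5 and 0.1.
rng = np.random.default_rng(0)
columns = []
for alpha in (0.5, 0.1):
    n = 1.0 / alpha
    columns.append(rng.negative_binomial(n=n, p=n / (n + 100.0), size=200))
counts = np.column_stack(columns)

# Each gene (column) is fitted on its own, using its sample mean as mu.
dispersions = [
    fit_gene_dispersion(gene, np.full(len(gene), gene.mean()))
    for gene in counts.T
]
```

Because no information is shared between genes at this stage, the loop over columns could be run in parallel.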

fit_size_factors()

Fit sample-wise DESeq2 normalization (size) factors.

Uses the median-of-ratios method.
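
A common formulation of the median-of-ratios method is sketched below: each sample's size factor is the median, over genes, of the ratio between its counts and the per-gene geometric mean across samples. Restricting to genes with no zero counts is an assumption of this sketch, and the data are hypothetical.

```python
import numpy as np
import pandas as pd


def median_of_ratios(counts):
    """Size factors from a samples x genes DataFrame of raw counts."""
    # Use only genes with strictly positive counts in every sample.
    positive = (counts > 0).all(axis=0)
    log_counts = np.log(counts.loc[:, positive])
    # Log of the per-gene geometric mean across samples.
    log_geo_means = log_counts.mean(axis=0)
    # Per-sample median of log ratios, mapped back to the linear scale.
    log_ratios = log_counts.sub(log_geo_means, axis=1)
    return np.exp(log_ratios.median(axis=1))


# Three samples sequenced at relative depths 1, 2 and 0.5; gene5 has a zero.
counts = pd.DataFrame(
    {
        "gene1": [10, 20, 5],
        "gene2": [20, 40, 10],
        "gene3": [40, 80, 20],
        "gene4": [80, 160, 40],
        "gene5": [0, 3, 1],
    },
    index=["sample1", "sample2", "sample3"],
)
size_factors = median_of_ratios(counts)
```

Dividing each sample's counts by its size factor puts the samples on a common scale.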

refit()

Refit Cook's outliers.

Replace values that are filtered out based on the Cooks distance with imputed values, and then re-run the whole DESeq2 pipeline on replaced values.