pydeseq2.DeseqDataSet.DeseqDataSet
- class pydeseq2.DeseqDataSet.DeseqDataSet(counts, clinical, design_factors='condition', reference_level=None, min_mu=0.5, min_disp=1e-08, max_disp=10, refit_cooks=True, min_replicates=7, beta_tol=1e-08, n_cpus=None, batch_size=128, joblib_verbosity=0)
Bases:
object
A class to implement dispersion and log fold-change (LFC) estimation.
Follows the DESeq2 pipeline [1].
- Parameters
counts (pandas.DataFrame) – Raw counts. One column per gene, rows are indexed by sample barcodes.
clinical (pandas.DataFrame) – DataFrame containing clinical information. Must be indexed by sample barcodes.
design_factors (str or list[str]) – Name(s) of the column(s) of clinical to use as design variables. If a list, the last factor is treated as the variable of interest by default. Only bi-level factors are supported. (default: ‘condition’).
reference_level (str) – The factor to use as a reference. Must be one of the values taken by the design. If None, the reference will be chosen alphabetically (last in order). (default: None).
min_mu (float) – Threshold for mean estimates. (default: 0.5).
min_disp (float) – Lower threshold for dispersion parameters. (default: 1e-8).
max_disp (float) – Upper threshold for dispersion parameters. NB: The threshold that is actually enforced is max(max_disp, len(counts)). (default: 10).
refit_cooks (bool) – Whether to refit Cook's outliers. (default: True).
min_replicates (int) – Minimum number of replicates a condition should have to allow refitting its samples. (default: 7).
beta_tol (float) – Stopping criterion for IRWLS: abs(dev - old_dev) / (abs(dev) + 0.1) < beta_tol. (default: 1e-8).
n_cpus (int) – Number of cpus to use. If None, all available cpus will be used. (default: None).
batch_size (int) – Number of tasks to allocate to each joblib parallel worker. (default: 128).
joblib_verbosity (int) – The verbosity level for joblib tasks. The higher the value, the more updates are reported. (default: 0).
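Two of the numeric parameters above can be made concrete with a short sketch. The helper names `irwls_converged` and `effective_max_disp` are hypothetical, chosen for illustration; they are not part of the PyDESeq2 API, and only the formulas are taken from the parameter descriptions.

```python
def irwls_converged(dev, old_dev, beta_tol=1e-8):
    """Relative-deviance stopping rule quoted in the beta_tol description.

    dev and old_dev are the deviances of the current and previous iteration.
    """
    return abs(dev - old_dev) / (abs(dev) + 0.1) < beta_tol


def effective_max_disp(max_disp, n_samples):
    """Upper dispersion bound quoted in the max_disp description.

    Assumption: len(counts) is the number of rows of the counts
    DataFrame, i.e. the number of samples.
    """
    return max(max_disp, n_samples)


# With 50 samples, the default max_disp=10 is overridden by the sample count.
print(effective_max_disp(10, 50))
print(irwls_converged(100.0, 100.0))
```

The constant 0.1 in the denominator keeps the ratio finite when the deviance is close to zero.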
- design_matrix
A DataFrame with experiment design information (to split cohorts). Indexed by sample barcodes. Unexpanded, with intercept.
- Type
- size_factors
DESeq normalization factors.
- Type
- non_zero_genes
Index of genes whose counts are not all zero.
- Type
- genewise_dispersions
Initial estimates of gene counts dispersions.
- Type
- trend_coeffs
Coefficients of the trend curve: \(f(\mu) = \alpha_1/\mu + a_0\).
- Type
- fitted_dispersions
Genewise dispersions regressed on the trend curve.
- Type
- prior_disp_var
Dispersion prior of genewise dispersions, used for dispersion shrinkage towards the trend curve.
- Type
- MAP_dispersions
MAP dispersions, after shrinkage towards the trend curve and before filtering.
- Type
- dispersions
Final dispersion estimates, after filtering MAP outliers.
- Type
- LFCs
Log-fold change and intercept parameters, in natural log scale.
- Type
- cooks
Cook's distances, used for outlier detection.
- Type
- replaceable
Whether counts are replaceable, i.e. if a given condition has enough samples.
- Type
- replaced
Counts which were replaced.
- Type
- replace_cooks
Cook's distances after replacement.
- Type
- counts_to_refit
Read counts after replacement, for which dispersions and LFCs must be fitted again.
- Type
- new_all_zeroes
Genes which have only zero counts after outlier replacement.
- Type
- _rough_dispersions
Initial method-of-moments estimates of the dispersions.
- Type
References
[1] Love, M. I., Huber, W., & Anders, S. (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15(12), 1-21. <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8>
- __init__(counts, clinical, design_factors='condition', reference_level=None, min_mu=0.5, min_disp=1e-08, max_disp=10, refit_cooks=True, min_replicates=7, beta_tol=1e-08, n_cpus=None, batch_size=128, joblib_verbosity=0)
Initialize the DeseqDataSet instance, computing the design matrix and the number of multiprocessing threads.
Methods
__init__(counts, clinical[, design_factors, ...]) – Initialize the DeseqDataSet instance, computing the design matrix and the number of multiprocessing threads.
calculate_cooks() – Compute Cook's distance for outlier detection.
deseq2() – Perform dispersion and log fold-change (LFC) estimation.
fit_LFC() – Fit log fold change (LFC) coefficients.
fit_MAP_dispersions() – Fit Maximum a Posteriori dispersion estimates.
fit_dispersion_prior() – Fit dispersion variance priors and standard deviation of log-residuals.
fit_dispersion_trend() – Fit the dispersion trend coefficients.
fit_genewise_dispersions() – Fit gene-wise dispersion estimates.
fit_size_factors() – Fit sample-wise deseq2 normalization (size) factors.
refit() – Refit Cook outliers.
- calculate_cooks()
Compute Cook’s distance for outlier detection.
Measures the contribution of a single entry to the output of LFC estimation.
- deseq2()
Perform dispersion and log fold-change (LFC) estimation.
Wrapper for the first part of the PyDESeq2 pipeline.
- fit_LFC()
Fit log fold change (LFC) coefficients.
In the 2-level setting, the first coefficient (the intercept) corresponds to the base mean, while the second coefficient is the actual LFC, in natural log scale.
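Because the coefficients are reported in natural log scale, comparing them with log2 fold changes requires a change of base. The following is a minimal sketch; the function names `ln_to_log2` and `ln_to_fold_change` are hypothetical and not part of PyDESeq2.

```python
import math


def ln_to_log2(lfc_ln):
    """Convert a natural-log fold change to a log2 fold change."""
    return lfc_ln / math.log(2)


def ln_to_fold_change(lfc_ln):
    """Convert a natural-log fold change to a plain ratio."""
    return math.exp(lfc_ln)


# A natural-log LFC of ln(2) is a doubling, i.e. a log2 fold change of 1.
lfc = math.log(2)
print(ln_to_log2(lfc), ln_to_fold_change(lfc))
```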
- fit_MAP_dispersions()
Fit Maximum a Posteriori dispersion estimates.
After MAP dispersions are fit, filter genes for which we don’t apply shrinkage.
- fit_dispersion_prior()
Fit dispersion variance priors and standard deviation of log-residuals.
The computation is based on genes whose dispersions are above 100 * min_disp.
- fit_dispersion_trend()
Fit the dispersion trend coefficients.
\[f(\mu) = \alpha_1/\mu + a_0.\]
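To illustrate the shape of this curve, the sketch below evaluates it for given coefficients. The function name `dispersion_trend` and the coefficient values are hypothetical; how PyDESeq2 estimates the coefficients is not shown here.

```python
def dispersion_trend(mu, alpha_1, a_0):
    """Evaluate the trend curve f(mu) = alpha_1 / mu + a_0.

    mu is a mean (normalized) count and must be strictly positive.
    """
    if mu <= 0:
        raise ValueError("mu must be strictly positive")
    return alpha_1 / mu + a_0


# Illustrative coefficients only: the fitted dispersion decreases with the
# mean and approaches a_0 for highly expressed genes.
for mu in (1.0, 10.0, 1000.0):
    print(mu, dispersion_trend(mu, alpha_1=2.0, a_0=0.1))
```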
- fit_genewise_dispersions()
Fit gene-wise dispersion estimates.
Fits a negative binomial per gene, independently.
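For reference, the negative binomial likelihood in the mean/dispersion parameterization used by DESeq2 (variance \(\mu + \alpha \mu^2\)) can be written down directly. The sketch below only evaluates the likelihood for one gene; it does not reproduce PyDESeq2's optimizer, and the function names are hypothetical.

```python
import math


def nb_logpmf(y, mu, alpha):
    """Log-probability of count y under NB with mean mu and dispersion alpha."""
    r = 1.0 / alpha  # NB "size" parameter
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def nb_pmf(y, mu, alpha):
    """Probability of count y (exponentiated log-pmf)."""
    return math.exp(nb_logpmf(y, mu, alpha))


def nb_nll(counts, mus, alpha):
    """Negative log-likelihood of one gene's counts.

    mus holds one mean per sample (e.g. size factor times fitted mean);
    alpha is the single dispersion shared by all samples of the gene.
    """
    return -sum(nb_logpmf(y, mu, alpha) for y, mu in zip(counts, mus))
```

A gene-wise fit would minimize `nb_nll` over `alpha` for each gene separately, which is what makes the step straightforward to parallelize.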
- fit_size_factors()
Fit sample-wise deseq2 normalization (size) factors.
Uses the median-of-ratios method.
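The median-of-ratios idea can be sketched in pure Python as follows. This is an illustration of the general method, assuming genes with a zero count in any sample are excluded from the reference; it is not PyDESeq2's own code, and `median_of_ratios` is a hypothetical name.

```python
import math
from statistics import median


def median_of_ratios(counts):
    """Size factors for a list of samples, each a list of per-gene raw counts."""
    n_samples = len(counts)
    n_genes = len(counts[0])
    # Keep only genes with strictly positive counts in every sample.
    usable = [g for g in range(n_genes) if all(s[g] > 0 for s in counts)]
    if not usable:
        raise ValueError("every gene has at least one zero count")
    # Log of the per-gene geometric mean across samples (pseudo-reference).
    log_geo = {
        g: sum(math.log(s[g]) for s in counts) / n_samples for g in usable
    }
    # Per sample: median of the log ratios to the reference, back-transformed.
    return [
        math.exp(median([math.log(s[g]) - log_geo[g] for g in usable]))
        for s in counts
    ]


# Second sample sequenced twice as deeply as the first.
samples = [[10, 20, 30, 0], [20, 40, 60, 5]]
print(median_of_ratios(samples))
```

Dividing each sample's counts by its size factor puts the samples on a comparable scale.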
- refit()
Refit Cook outliers.
Replace values that are filtered out based on their Cook's distance with imputed values, then re-run the whole DESeq2 pipeline on the replaced values.
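Which samples are eligible for replacement depends on min_replicates. The sketch below shows one way to read that rule, assuming a condition qualifies when it has at least min_replicates samples; the inclusive comparison and the function name `replaceable_samples` are assumptions, not taken from the PyDESeq2 source.

```python
from collections import Counter


def replaceable_samples(conditions, min_replicates=7):
    """Flag each sample whose condition has enough replicates.

    conditions lists the condition label of every sample, in sample order.
    Assumption: "enough" means at least min_replicates samples.
    """
    n_per_condition = Counter(conditions)
    return [n_per_condition[c] >= min_replicates for c in conditions]


# Seven samples in condition A, three in condition B.
labels = ["A"] * 7 + ["B"] * 3
print(replaceable_samples(labels))
```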