pydeseq2.dds.DeseqDataSet
- class DeseqDataSet(*, adata=None, counts=None, metadata=None, design_factors='condition', continuous_factors=None, ref_level=None, trend_fit_type='parametric', min_mu=0.5, min_disp=1e-08, max_disp=10.0, refit_cooks=True, min_replicates=7, beta_tol=1e-08, n_cpus=None, inference=None, quiet=False)
Bases: AnnData
A class to implement dispersion and log fold-change (LFC) estimation.
The DeseqDataSet extends the AnnData class. As such, it implements the same methods and attributes, in addition to those that are specific to pydeseq2. Dispersions and LFCs are estimated following the DESeq2 pipeline [LHA14].
- Parameters:
adata (anndata.AnnData) – AnnData from which to initialize the DeseqDataSet. Must have counts (‘X’) and sample metadata (‘obs’) fields. If None, both counts and metadata arguments must be provided.
counts (pandas.DataFrame) – Raw counts. One column per gene, rows are indexed by sample barcodes.
metadata (pandas.DataFrame) – DataFrame containing sample metadata. Must be indexed by sample barcodes.
design_factors (str or list) – Name of the columns of metadata to be used as design variables. (default: 'condition').
continuous_factors (list or None) – An optional list of continuous (as opposed to categorical) factors. Any factor not in continuous_factors will be considered categorical (default: None).
ref_level (list or None) – An optional list of two strings of the form ["factor", "test_level"] specifying the factor of interest and the reference (control) level against which we’re testing, e.g. ["condition", "A"]. (default: None).
trend_fit_type (str) – Either “parametric” or “mean” for the type of fitting of the dispersions trend curve. (default: "parametric").
min_mu (float) – Threshold for mean estimates. (default: 0.5).
min_disp (float) – Lower threshold for dispersion parameters. (default: 1e-8).
max_disp (float) – Upper threshold for dispersion parameters. Note: The threshold that is actually enforced is max(max_disp, len(counts)). (default: 10).
refit_cooks (bool) – Whether to refit cooks outliers. (default: True).
min_replicates (int) – Minimum number of replicates a condition should have to allow refitting its samples. (default: 7).
beta_tol (float) – Stopping criterion for IRWLS (default: 1e-8):
\[\vert dev_t - dev_{t+1}\vert / (\vert dev \vert + 0.1) < \beta_{tol}.\]
n_cpus (int) – Number of cpus to use. If None and if inference is not provided, all available cpus will be used by the DefaultInference. If both are specified, it will try to override the n_cpus attribute of the inference object. (default: None).
inference (Inference) – Implementation of inference routines object instance. (default: DefaultInference).
quiet (bool) – Suppress deseq2 status updates during fit.
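The stopping rule given for beta_tol can be illustrated with a small standalone sketch. It does not call pydeseq2: the function name `irwls_converged` and the deviance values are hypothetical, and the sketch assumes that the unsubscripted deviance in the denominator is the most recent one, which the formula itself does not state.

```python
def irwls_converged(dev_prev: float, dev_new: float, beta_tol: float = 1e-8) -> bool:
    """Return True when the relative change in deviance is below beta_tol."""
    # Assumption: |dev| in the denominator is the latest deviance.
    # The 0.1 offset keeps the ratio finite when the deviance is near zero.
    return abs(dev_prev - dev_new) / (abs(dev_new) + 0.1) < beta_tol


# Hypothetical deviance trace from successive IRWLS iterations.
deviances = [120.0, 101.5, 100.2, 100.19, 100.19]

# First iteration index (1-based) at which the criterion is met, if any.
stop_at = next(
    (
        t + 1
        for t in range(len(deviances) - 1)
        if irwls_converged(deviances[t], deviances[t + 1])
    ),
    None,
)
```

With the default tolerance of 1e-8, the criterion is only met once two consecutive deviances are essentially identical, so a looser beta_tol trades precision for fewer iterations.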
- X
A ‘number of samples’ x ‘number of genes’ count data matrix.
- obs
Key-indexed one-dimensional observations annotation of length ‘number of samples’. Used to store design factors.
- var
Key-indexed one-dimensional gene-level annotation of length ‘number of genes’.
- uns
Key-indexed unstructured annotation.
- obsm
Key-indexed multi-dimensional observations annotation of length ‘number of samples’. Stores “design_matrix” and “size_factors”, among others.
- varm
Key-indexed multi-dimensional gene annotation of length ‘number of genes’. Stores “dispersions” and “LFC”, among others.
- layers
Key-indexed multi-dimensional arrays aligned to dimensions of X, e.g. “cooks”.
- non_zero_idx
Indices of genes that have non-uniformly zero counts.
- Type:
ndarray
- non_zero_genes
Index of genes that have non-uniformly zero counts.
- Type:
- counts_to_refit
Read counts after replacement, containing only genes for which dispersions and LFCs must be fitted again.
- Type:
- new_all_zeroes_genes
Genes which have only zero counts after outlier replacement.
- Type:
- fit_type
Either “parametric” or “mean” for the type of fitting of dispersions to the mean intensity. “parametric”: fit a dispersion-mean relation via a robust gamma-family GLM. “mean”: use the mean of gene-wise dispersion estimates. (default: "parametric").
- Type:
- logmeans
Gene-wise mean log counts, computed in preprocessing.deseq2_norm_fit().
- Type:
- filtered_genes
Genes whose log means are different from -∞, computed in preprocessing.deseq2_norm_fit().
- Type:
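The gene-level attributes above (non_zero_idx, non_zero_genes, logmeans, filtered_genes) can be illustrated on a toy count matrix. This is a plain-Python sketch of the definitions as worded here, not the library's implementation: the counts and gene names are hypothetical, lists stand in for arrays and indexes, and it assumes that a gene with any zero count has a log mean of -∞ because the mean is taken over log counts.

```python
import math

# Hypothetical toy counts: rows are samples, columns are genes.
counts = [
    [10, 0, 3, 0],
    [12, 0, 0, 0],
    [8, 0, 5, 0],
]
gene_names = ["g1", "g2", "g3", "g4"]
n_samples = len(counts)
n_genes = len(gene_names)

# Genes whose counts are not uniformly zero across samples.
non_zero_idx = [j for j in range(n_genes) if any(row[j] != 0 for row in counts)]
non_zero_genes = [gene_names[j] for j in non_zero_idx]


def log_or_neg_inf(x: float) -> float:
    """Natural log, with log(0) mapped to -inf instead of raising."""
    return math.log(x) if x > 0 else -math.inf


# Gene-wise mean of log counts; a single zero count drives the mean to -inf.
logmeans = [
    sum(log_or_neg_inf(row[j]) for row in counts) / n_samples
    for j in range(n_genes)
]

# Boolean mask of genes whose log mean is finite.
filtered_genes = [math.isfinite(m) for m in logmeans]
```

In this toy example g3 has non-zero counts overall, so it appears in non_zero_genes, but its single zero count gives it a log mean of -∞, so it is excluded by the filtered_genes mask.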
References
[LHA14] Michael I. Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):1–21, 2014. doi:10.1186/s13059-014-0550-8.