pydeseq2.utils

Functions

`build_design_matrix`(clinical_df[, ...])	Build design_matrix matrix for DEA.
`dispersion_trend`(normed_mean, coeffs)	Return dispersion trend from normalized counts.
`dnb_nll`(counts, mu, alpha)	Gradient of the negative log-likelihood of a negative binomial.
`fit_alpha_mle`(counts, design_matrix, mu, ...)	Estimate the dispersion parameter of a negative binomial GLM.
`fit_lin_mu`(counts, size_factors, design_matrix)	Estimate mean of negative binomial model using a linear regression.
`fit_moments_dispersions`(counts, size_factors)	Dispersion estimates based on moments, as per the R code.
`fit_rough_dispersions`(counts, size_factors, ...)	"Rough dispersion" estimates from linear model, as per the R code.
`get_num_processes`([n_cpus])	Return the number of processes to use for multiprocessing.
`irls_solver`(counts, size_factors, ...[, ...])	Fit a NB GLM wit log-link to predict counts from the design matrix.
`load_example_data`([modality, dataset, ...])	Load synthetic or TCGA data (gene raw counts or clinical) for a given cancer type.
`nb_nll`(counts, mu, alpha)	Negative log-likelihood of a negative binomial.
`nbinomFn`(beta, design_matrix, counts, size, ...)	Return the NB negative likelihood with apeGLM prior.
`nbinomGLM`(design_matrix, counts, size, ...)	Fit a negative binomial MAP LFC using an apeGLM prior.
`robust_method_of_moments_disp`(normed_counts, ...)	Perform dispersion estimation using a method of trimmed moments.
`test_valid_counts`(counts_df)	Test that the count matrix contains valid inputs.
`trimmed_cell_variance`(counts, cells)	Return trimmed variance of counts according to condition.
`trimmed_mean`(x[, trim])	Return trimmed mean.
`trimmed_variance`(x[, trim, axis])	Return trimmed variance.
`wald_test`(design_matrix, disp, lfc, mu, ...)	Run Wald test for differential expression.

pydeseq2.utils.build_design_matrix(clinical_df, design_factors='condition', ref=None, expanded=False, intercept=True)

Build design_matrix matrix for DEA.

Only single factor, 2-level designs are currently supported. Unless specified, the reference factor is chosen alphabetically. NOTE: For now, each factor should have exactly two levels.

Parameters

clinical_df (pandas.DataFrame) – DataFrame containing clinical information. Must be indexed by sample barcodes.
design_factors (str or list[str]) – Name of the columns of clinical_df to be used as design_matrix variables. (default: “condition”).
ref (str) – The factor to use as a reference. Must be one of the values taken by the design. If None, the reference will be chosen alphabetically (last in order). (default: None).
expanded (bool) – If true, use one column per category. Else, use a single column. (default: False).
intercept (bool) – If true, add an intercept (a column containing only ones). (default: True).

Returns

A DataFrame with experiment design information (to split cohorts). Indexed by sample barcodes.

Return type

pandas.DataFrame

pydeseq2.utils.dispersion_trend(normed_mean, coeffs)

Return dispersion trend from normalized counts.

\(a_1/ \mu + a_0\).

Parameters

normed_mean (float or ndarray) – Mean of normalized counts for a given gene or set of genes.
coeffs (ndarray or pd.Series) – Fitted dispersion trend coefficients \(a_0\) and \(a_1\).

Returns

Dispersion trend \(a_1/ \mu + a_0\).

Return type

float

pydeseq2.utils.dnb_nll(counts, mu, alpha)

Gradient of the negative log-likelihood of a negative binomial.

Unvectorized.

Parameters

counts (ndarray) – Observations.
mu (float) – Mean of the distribution.
alpha (float) – Dispersion of the distribution, s.t. the variance is \(\mu + \alpha * \mu^2\).

Returns

Derivative of negative log likelihood of NB w.r.t. \(\alpha\).

Return type

float

pydeseq2.utils.fit_alpha_mle(counts, design_matrix, mu, alpha_hat, min_disp, max_disp, prior_disp_var=None, cr_reg=True, prior_reg=False, optimizer='L-BFGS-B')

Estimate the dispersion parameter of a negative binomial GLM.

Note: it is possible to pass counts, design_matrix and mu arguments in the form of pandas Series, but using numpy arrays makes the code significantly faster.

Parameters

counts (ndarray) – Raw counts for a given gene.
design_matrix (ndarray) – Design matrix.
mu (ndarray) – Mean estimation for the NB model.
alpha_hat (float) – Initial dispersion estimate.
min_disp (float) – Lower threshold for dispersion parameters.
max_disp (float) – Upper threshold for dispersion parameters.
prior_disp_var (float) – Prior dispersion variance.
cr_reg (bool) – Whether to use Cox-Reid regularization. (default: True).
prior_reg (bool) – Whether to use prior log-residual regularization. (default: False).
optimizer (str) – Optimizing method to use. Accepted values: ‘BFGS’ or ‘L-BFGS-B’. (default: ‘BFGS’).

Returns

float – Dispersion estimate.
bool – Whether L-BFGS-B converged. If not, dispersion is estimated using grid search.

pydeseq2.utils.fit_lin_mu(counts, size_factors, design_matrix, min_mu=0.5)

Estimate mean of negative binomial model using a linear regression.

Used to initialize genewise dispersion models.

Parameters

counts (ndarray) – Raw counts for a given gene.
size_factors (ndarray) – Sample-wise scaling factors (obtained from median-of-ratios).
design_matrix (ndarray) – Design matrix.
min_mu (float) – Lower threshold for fitted means, for numerical stability. (default: 0.5).

Returns

Estimated mean.

Return type

float

pydeseq2.utils.fit_moments_dispersions(counts, size_factors)

Dispersion estimates based on moments, as per the R code.

Used as initial estimates in DeseqDataSet._fit_MoM_dispersions.

Parameters

counts (pandas.DataFrame) – Raw counts. One column per gene, rows are indexed by sample barcodes.
size_factors (pandas.Series) – DESeq2 normalization factors.

Returns

Estimated dispersion parameter for each gene.

Return type

pandas.Series

pydeseq2.utils.fit_rough_dispersions(counts, size_factors, design_matrix)

“Rough dispersion” estimates from linear model, as per the R code.

Used as initial estimates in DeseqDataSet._fit_MoM_dispersions.

Parameters

counts (pandas.DataFrame) – Raw counts. One column per gene, rows are indexed by sample barcodes.
size_factors (pandas.Series) – DESeq2 normalization factors.
design_matrix (pandas.DataFrame) – A DataFrame with experiment design information (to split cohorts). Indexed by sample barcodes. Unexpanded, with intercept.

Returns

Estimated dispersion parameter for each gene.

Return type

pandas.Series

pydeseq2.utils.get_num_processes(n_cpus=None)

Return the number of processes to use for multiprocessing.

Returns the maximum number of available cpus by default.

Parameters: n_cpus (int) – Desired number of cpus. If None, will return the number of available cpus. (default: None).
Returns: Number of processes to spawn.
Return type: int

pydeseq2.utils.irls_solver(counts, size_factors, design_matrix, disp, min_mu=0.5, beta_tol=1e-08, min_beta=- 30, max_beta=30, optimizer='L-BFGS-B', maxiter=250)

Fit a NB GLM wit log-link to predict counts from the design matrix.

See equations (1-2) in the DESeq2 paper.

Parameters

counts (pandas.Series) – Raw counts for a given gene.
size_factors (pandas.Series) – Sample-wise scaling factors (obtained from median-of-ratios).
design_matrix (pandas.DataFrame) – Design matrix.
disp (pandas.Series) – Gene-wise dispersion priors.
min_mu (float) – Lower bound on estimated means, to ensure numerical stability. (default: 0.5).
beta_tol (float) – Stopping criterion for IRWLS: \(abs(dev - old_dev) / (abs(dev) + 0.1) < beta_tol\). (default: 1e-8).
min_beta (int) – Lower-bound on LFC. (default: -30).
max_beta (int) – Upper-bound on LFC. (default: -30).
optimizer (str) – Optimizing method to use in case IRLS starts diverging. Accepted values: ‘BFGS’ or ‘L-BFGS-B’. NB: only ‘L-BFGS-B’ ensures that LFCS will lay in the [min_beta, max_beta] range. (default: ‘L-BFGS-B’).
maxiter (int) – Maximum number of IRLS iterations to perform before switching to L-BFGS-B. (default: 250).

Returns

beta (ndarray) – Fitted (basemean, lfc) coefficients of negative binomial GLM.
mu (ndarray) – Means estimated from size factors and beta: \(\\mu = s_{ij} \\exp(\\beta^t design_matrix)\).
H (ndarray) – Diagonal of the \(W^{1/2} X (X^t W X)^-1 X^t W^{1/2}\) covariance matrix.

pydeseq2.utils.load_example_data(modality='raw_counts', dataset='synthetic', debug=False, debug_seed=42)

Load synthetic or TCGA data (gene raw counts or clinical) for a given cancer type.

May load either clinical or rna-seq data.The synthetic data is part of this repo, but TCGA data should be downloaded as per the instructions in datasets/.

Parameters

modality (str) – Data modality. “raw_counts” or “clinical”.
dataset (str) – The cancer type for which to return gene expression data. If “synthetic”, will return the synthetic data that is used for CI unit tests. Otherwise, must be a valid TCGA dataset. (default: “synthetic”).
debug (bool) – If true, subsample 10 samples and 100 genes at random. (default: False).
debug_seed (int) – Seed for the debug mode. (default: 42).

Returns

Requested data modality.

Return type

pandas.DataFrame

pydeseq2.utils.nb_nll(counts, mu, alpha)

Negative log-likelihood of a negative binomial.

Unvectorized version.

Parameters

counts (ndarray) – Observations.
mu (float) – Mean of the distribution.
alpha (float) – Dispersion of the distribution, s.t. the variance is \(\mu + \alpha * \mu^2\).

Returns

Negative log likelihood of the observations counts following \(NB(\mu, \alpha)\).

Return type

float

pydeseq2.utils.nbinomFn(beta, design_matrix, counts, size, offset, prior_no_shrink_scale, prior_scale, shrink_index=1)

Return the NB negative likelihood with apeGLM prior.

Use for LFC shrinkage.

Parameters

beta (ndarray) – 2-element array: intercept and LFC coefficients.
design_matrix (ndarray) – Design matrix.
counts (ndarray) – Raw counts.
size (ndarray) – Size parameter of NB family (inverse of dispersion).
offset (ndarray) – Natural logarithm of size factor.
prior_no_shrink_scale (float) – Prior variance for the intercept.
prior_scale (float) – Prior variance for the intercept.
shrink_index (int) – Index of the LFC coordinate to shrink. (default: 1).

Returns

Sum of the NB negative likelihood and apeGLM prior.

Return type

float

pydeseq2.utils.nbinomGLM(design_matrix, counts, size, offset, prior_no_shrink_scale, prior_scale, optimizer='L-BFGS-B', shrink_index=1)

Fit a negative binomial MAP LFC using an apeGLM prior.

Only the LFC is shrinked, and not the intercept.

Parameters

design_matrix (ndarray) – Design matrix.
counts (ndarray) – Raw counts.
size (ndarray) – Size parameter of NB family (inverse of dispersion).
offset (ndarray) – Natural logarithm of size factor.
prior_no_shrink_scale (float) – Prior variance for the intercept.
prior_scale (float) – Prior variance for the LFC parameter.
optimizer (str) – Optimizing method to use in case IRLS starts diverging. Accepted values: ‘L-BFGS-B’, ‘BFGS’ or ‘Newton-CG’. (default: ‘Newton-CG’).
shrink_index (int) – Index of the LFC coordinate to shrink. (default: 1).

Returns

beta (ndarray) – 2-element array, containing the intercept (first) and the LFC (second).
inv_hessian (ndarray) – Inverse of the Hessian of the objective at the estimated MAP LFC.
converged (bool) – Whether L-BFGS-B converged.

pydeseq2.utils.robust_method_of_moments_disp(normed_counts, design_matrix)

Perform dispersion estimation using a method of trimmed moments.

Used for outlier detection based on Cook’s distance.

Parameters

normed_counts (pandas.DataFrame) – DF of deseq2-normalized read counts. Rows: samples, columns: genes.
design_matrix (pandas.DataFrame) – A DataFrame with experiment design information (to split cohorts). Indexed by sample barcodes. Unexpanded, with intercept.

Returns

Trimmed method of moment dispersion estimates. Used for outlier detection based on Cook’s distance.

Return type

pandas.Series

pydeseq2.utils.test_valid_counts(counts_df)

Test that the count matrix contains valid inputs.

More precisely, test that inputs are non-negative integers.

Parameters: counts_df (pandas.DataFrame) – Raw counts. One column per gene, rows are indexed by sample barcodes.

pydeseq2.utils.trimmed_cell_variance(counts, cells)

Return trimmed variance of counts according to condition.

Compute the variance after trimming data of its smallest and largest elements, grouped by cohorts, and return the max across cohorts. The trim factor is a function of data size.

Parameters

counts (pandas.DataFrame) – Sample-wise gene counts.
cells (pandas.DataFrame) – Cohort affiliation of each sample.

Returns

Gene-wise trimmed variance estimate.

Return type

pandas.Series

pydeseq2.utils.trimmed_mean(x, trim=0.1, **kwargs)

Return trimmed mean.

Compute the mean after trimming data of its smallest and largest quantiles.

Parameters

x (ndarray) – Data whose mean to compute.
trim (float) – Fraction of data to trim at each end. (default: 0.1).
**kwargs – Keyword arguments, useful to pass axis.

Returns

Trimmed mean.

Return type

float or ndarray

pydeseq2.utils.trimmed_variance(x, trim=0.125, axis=0)

Return trimmed variance.

Compute the variance after trimming data of its smallest and largest quantiles.

Parameters

x (ndarray) – Data whose trimmed variance to compute.
trim (float) – Fraction of data to trim at each end. (default: 0.125).
axis (int) – Dimension along which to compute variance. (default: 0).

Returns

Trimmed variances.

Return type

ndarray

pydeseq2.utils.wald_test(design_matrix, disp, lfc, mu, ridge_factor, idx=- 1)

Run Wald test for differential expression.

Computes Wald statistics, standard error and p-values from dispersion and LFC estimates.

Parameters

design_matrix (ndarray) – Design matrix.
disp (float) – Dispersion estimate.
lfc (float) – Log-fold change estimate (in natural log scale).
mu (float) – Mean estimation for the NB model.
ridge_factor (ndarray) – Regularization factors.
idx (int) – Index of design factor (in design matrix). (default: -1).

Returns

wald_p_value (float) – Estimated p-value.
wald_statistic (float) – Wald statistic.
wald_se (float) – Standard error of the Wald statistic.