pydeseq2.utils
Functions
|
Build design_matrix matrix for DEA. |
|
Return dispersion trend from normalized counts. |
|
Gradient of the negative log-likelihood of a negative binomial. |
|
Estimate the dispersion parameter of a negative binomial GLM. |
|
Estimate mean of negative binomial model using a linear regression. |
|
Dispersion estimates based on moments, as per the R code. |
|
Rough dispersion estimates from linear model, as per the R code. |
|
Return the number of processes to use for multiprocessing. |
|
Fit a NB GLM wit log-link to predict counts from the design matrix. |
|
Load synthetic example data. |
|
Create a scatter plot using matplotlib. |
|
Neg log-likelihood of a negative binomial of parameters |
|
Return the NB negative likelihood with apeGLM prior. |
|
Fit a negative binomial MAP LFC using an apeGLM prior. |
|
Perform dispersion estimation using a method of trimmed moments. |
|
Test that the count matrix contains valid inputs. |
|
Return trimmed variance of counts according to condition. |
|
Return trimmed mean. |
|
Return trimmed variance. |
|
Run Wald test for differential expression. |
- build_design_matrix(metadata, design_factors='condition', ref_level=None, continuous_factors=None, expanded=False, intercept=True)
Build design_matrix matrix for DEA.
Unless specified, the reference factor is chosen alphabetically.
- Parameters:
metadata (
pandas.DataFrame
) – DataFrame containing metadata information. Must be indexed by sample barcodes.design_factors (
str
orlist
) – Name of the columns of metadata to be used as design_matrix variables. (default:"condition"
).ref_level (
dict
orNone
) – An optional list of two strings of the form["factor", "ref_level"]
specifying the factor of interest and the desired reference level, e.g.["condition", "A"]
. (default:None
).continuous_factors (
list
orNone
) – An optional list of continuous (as opposed to categorical) factors. Any factor not incontinuous_factors
will be considered categorical (default:None
).expanded (
bool
) – If true, use one column per category. Else, use n-1 columns, for each n-level categorical factor. (default:False
).intercept (
bool
) – If true, add an intercept (a column containing only ones). (default:True
).
- Returns:
A DataFrame with experiment design information (to split cohorts). Indexed by sample barcodes.
- Return type:
- dispersion_trend(normed_mean, coeffs)
Return dispersion trend from normalized counts.
\(a_1/ \mu + a_0\).
- dnb_nll(counts, mu, alpha)
Gradient of the negative log-likelihood of a negative binomial.
Unvectorized.
- fit_alpha_mle(counts, design_matrix, mu, alpha_hat, min_disp, max_disp, prior_disp_var=None, cr_reg=True, prior_reg=False, optimizer='L-BFGS-B')
Estimate the dispersion parameter of a negative binomial GLM.
Note: it is possible to pass counts, design_matrix and mu arguments in the form of pandas Series, but using numpy arrays makes the code significantly faster.
- Parameters:
counts (
ndarray
) – Raw counts for a given gene.design_matrix (
ndarray
) – Design matrix.mu (
ndarray
) – Mean estimation for the NB model.alpha_hat (
float
) – Initial dispersion estimate.min_disp (
float
) – Lower threshold for dispersion parameters.max_disp (
float
) – Upper threshold for dispersion parameters.prior_disp_var (
float
) – Prior dispersion variance.cr_reg (
bool
) – Whether to use Cox-Reid regularization. (default:True
).prior_reg (
bool
) – Whether to use prior log-residual regularization. (default:False
).optimizer (
str
) – Optimizing method to use. Accepted values: ‘BFGS’ or ‘L-BFGS-B’. (default:'L-BFGS-B'
).
- Return type:
- Returns:
- fit_lin_mu(counts, size_factors, design_matrix, min_mu=0.5)
Estimate mean of negative binomial model using a linear regression.
Used to initialize genewise dispersion models.
- Parameters:
counts (
ndarray
) – Raw counts for a given gene.size_factors (
ndarray
) – Sample-wise scaling factors (obtained from median-of-ratios).design_matrix (
ndarray
) – Design matrix.min_mu (
float
) – Lower threshold for fitted means, for numerical stability. (default:0.5
).
- Returns:
Estimated mean.
- Return type:
ndarray
- fit_moments_dispersions(normed_counts, size_factors)
Dispersion estimates based on moments, as per the R code.
Used as initial estimates in
DeseqDataSet.fit_genewise_dispersions()
.- Parameters:
normed_counts (
ndarray
) – Array of deseq2-normalized read counts. Rows: samples, columns: genes.size_factors (
ndarray
) – DESeq2 normalization factors.
- Returns:
Estimated dispersion parameter for each gene.
- Return type:
ndarray
- fit_rough_dispersions(normed_counts, design_matrix)
Rough dispersion estimates from linear model, as per the R code.
Used as initial estimates in
DeseqDataSet.fit_genewise_dispersions()
.- Parameters:
normed_counts (
ndarray
) – Array of deseq2-normalized read counts. Rows: samples, columns: genes.design_matrix (
pandas.DataFrame
) – A DataFrame with experiment design information (to split cohorts). Indexed by sample barcodes. Unexpanded, with intercept.
- Returns:
Estimated dispersion parameter for each gene.
- Return type:
ndarray
- get_num_processes(n_cpus=None)
Return the number of processes to use for multiprocessing.
Returns the maximum number of available cpus by default.
- irls_solver(counts, size_factors, design_matrix, disp, min_mu=0.5, beta_tol=1e-08, min_beta=-30, max_beta=30, optimizer='L-BFGS-B', maxiter=250)
Fit a NB GLM wit log-link to predict counts from the design matrix.
See equations (1-2) in the DESeq2 paper.
- Parameters:
counts (
ndarray
) – Raw counts for a given gene.size_factors (
ndarray
) – Sample-wise scaling factors (obtained from median-of-ratios).design_matrix (
ndarray
) – Design matrix.disp (
float
) – Gene-wise dispersion prior.min_mu (
float
) – Lower bound on estimated means, to ensure numerical stability. (default:0.5
).beta_tol (
float
) – Stopping criterion for IRWLS: \(\vert dev - dev_{old}\vert / \vert dev + 0.1 \vert < \beta_{tol}\). (default:1e-8
).min_beta (
float
) – Lower-bound on LFC. (default:-30
).max_beta (
float
) – Upper-bound on LFC. (default:-30
).optimizer (
str
) – Optimizing method to use in case IRLS starts diverging. Accepted values: ‘BFGS’ or ‘L-BFGS-B’. NB: only ‘L-BFGS-B’ ensures that LFCS will lay in the [min_beta, max_beta] range. (default:'L-BFGS-B'
).maxiter (
int
) – Maximum number of IRLS iterations to perform before switching to L-BFGS-B. (default:250
).
- Return type:
- Returns:
beta (
ndarray
) – Fitted (basemean, lfc) coefficients of negative binomial GLM.mu (
ndarray
) – Means estimated from size factors and beta: \(\mu = s_{ij} \exp(\beta^t X)\).H (
ndarray
) – Diagonal of the \(W^{1/2} X (X^t W X)^-1 X^t W^{1/2}\) covariance matrix.converged (
bool
) – Whether IRLS or the optimizer converged. If not and if dimension allows it, perform grid search.
- load_example_data(modality='raw_counts', dataset='synthetic', debug=False, debug_seed=42)
Load synthetic example data.
May load either metadata or rna-seq data. For now, this function may only return the synthetic data provided as part of this repo, but new datasets might be added in the future.
- Parameters:
modality (
str
) – Data modality. “raw_counts” or “metadata”.dataset (
str
) – The dataset for which to return gene expression data. If “synthetic”, will return the synthetic data that is used for CI unit tests. (default:"synthetic"
).debug (
bool
) – If true, subsample 10 samples and 100 genes at random. (Note that the “synthetic” dataset is already 10 features 100.) (default:False
).debug_seed (
int
) – Seed for the debug mode. (default:42
).
- Returns:
Requested data modality.
- Return type:
- lowess(features, targets, frac=0.6666666666666666, iter=3)
Run lowess smoothing: Robust locally weighted regression.
The lowess function fits a nonparametric regression curve to a scatterplot. The arrays features and targets contain an equal number of elements; each pair (features[i], targets[i]) defines a data point in the scatterplot. The function returns the estimated (smooth) values of targets. The smoothing span is given by frac. A larger value for frac will result in a smoother curve. The number of robustifying iterations is given by iter. The function will run faster with a smaller number of iterations.
- Parameters:
- Returns:
Estimated (smooth) values of targets.
- Return type:
ndarray
- make_MA_plot(results_df, padj_thresh=0.05, log=True, save_path=None, lfc_null=0, alt_hypothesis=None, **kwargs)
Create an log ratio (M)-average (A) plot using matplotlib.
Useful for looking at log fold-change versus mean expression between two groups/samples/etc. Uses matplotlib to emulate the
make_MA()
function in DESeq2 in R.- Parameters:
results_df (
pd.DataFrame
) – Resultant dataframe after running DeseqStats() and .summary().padj_thresh (
float
) – P-value threshold to subset scatterplot colors on.log (
bool
) – Whether or not to log scale features and targets axes (default=True
).save_path (
str
orNone
) – The path where to save the plot. If left None, the plot won’t be saved (default=None
).lfc_null (
float
) – The (log2) log fold change under the null hypothesis. (default:0
).alt_hypothesis (
str
orNone
) – The alternative hypothesis for computing wald p-values. (default:None
).**kwargs – Matplotlib keyword arguments for the scatter plot.
- Return type:
- make_scatter(disps, legend_labels, x_val, log=True, save_path=None, **kwargs)
Create a scatter plot using matplotlib.
Used in
pydeseq2.dds.DeseqDataSet.plot_dispersions()
.- Parameters:
disps (
list
) – List of ndarrays to plot.legend_labels (
list
) – List of strings that correspond to plotted targets values for legend.x_val (
ndarray
) – 1D array to plot (example:dds.varm['_normed_means']
).log (
bool
) – Whether or not to log scale features and targets axes (default=True
).save_path (
str
orNone
) – The path where to save the plot. If left None, the plot won’t be saved (default=None
).**kwargs – Matplotlib keyword arguments for the scatter plot.
- Return type:
- mean_absolute_deviation(x)
Compute a scaled estimator of the mean absolute deviation.
- n_or_more_replicates(design_matrix, min_replicates)
Return a series indicating whether samples have a minimum number of replicates.
Checks whether each sample has at least
min_replicates
replicates, based on its combination of design factors.- Parameters:
design_matrix (
pandas.DataFrame
) – A DataFrame with experiment design information (to split cohorts).min_replicates (
int
) – The minimum number of replicates to have to pass the threshold.
- Returns:
A boolean series indicating whether each sample has at least
min_replicates
replicates.- Return type:
- nb_nll(counts, mu, alpha)
Neg log-likelihood of a negative binomial of parameters
mu
andalpha
.Mathematically, if
counts
is a vector of counting entries \(y_i\) then the likelihood of each entry \(y_i\) to be drawn from a negative binomial \(NB(\mu, \alpha)\) is [1]\[p(y_i | \mu, \alpha) = \frac{\Gamma(y_i + \alpha^{-1})}{ \Gamma(y_i + 1)\Gamma(\alpha^{-1}) } \left(\frac{1}{1 + \alpha \mu} \right)^{1/\alpha} \left(\frac{\mu}{\alpha^{-1} + \mu} \right)^{y_i}\]As a consequence, assuming there are \(n\) entries, the total negative log-likelihood for
counts
is\[\ell(\mu, \alpha) = \frac{n}{\alpha} \log(\alpha) + \sum_i \left \lbrace - \log \left( \frac{\Gamma(y_i + \alpha^{-1})}{ \Gamma(y_i + 1)\Gamma(\alpha^{-1}) } \right) + (\alpha^{-1} + y_i) \log (\alpha^{-1} + \mu) - y_i \log \mu \right \rbrace\]This is implemented in this function.
- Parameters:
counts (
ndarray
) – Observations.mu (
ndarray
) – Mean of the distribution \(\mu\).alpha (
float
orndarray
) – Dispersion of the distribution \(\alpha\), s.t. the variance is \(\mu + \alpha \mu^2\).
- Returns:
Negative log likelihood of the observations counts following \(NB(\mu, \alpha)\).
- Return type:
float
orndarray
Notes
[1] https://en.wikipedia.org/wiki/Negative_binomial_distribution
- nbinomFn(beta, design_matrix, counts, size, offset, prior_no_shrink_scale, prior_scale, shrink_index=1)
Return the NB negative likelihood with apeGLM prior.
Use for LFC shrinkage.
- Parameters:
beta (
ndarray
) – 2-element array: intercept and LFC coefficients.design_matrix (
ndarray
) – Design matrix.counts (
ndarray
) – Raw counts.size (
ndarray
) – Size parameter of NB family (inverse of dispersion).offset (
ndarray
) – Natural logarithm of size factor.prior_no_shrink_scale (
float
) – Prior variance for the intercept.prior_scale (
float
) – Prior variance for the intercept.shrink_index (
int
) – Index of the LFC coordinate to shrink. (default:1
).
- Returns:
Sum of the NB negative likelihood and apeGLM prior.
- Return type:
- nbinomGLM(design_matrix, counts, size, offset, prior_no_shrink_scale, prior_scale, optimizer='L-BFGS-B', shrink_index=1)
Fit a negative binomial MAP LFC using an apeGLM prior.
Only the LFC is shrinked, and not the intercept.
- Parameters:
design_matrix (
ndarray
) – Design matrix.counts (
ndarray
) – Raw counts.size (
ndarray
) – Size parameter of NB family (inverse of dispersion).offset (
ndarray
) – Natural logarithm of size factor.prior_no_shrink_scale (
float
) – Prior variance for the intercept.prior_scale (
float
) – Prior variance for the LFC parameter.optimizer (
str
) – Optimizing method to use in case IRLS starts diverging. Accepted values: ‘L-BFGS-B’, ‘BFGS’ or ‘Newton-CG’. (default:'Newton-CG'
).shrink_index (
int
) – Index of the LFC coordinate to shrink. (default:1
).
- Return type:
- Returns:
beta (
ndarray
) – 2-element array, containing the intercept (first) and the LFC (second).inv_hessian (
ndarray
) – Inverse of the Hessian of the objective at the estimated MAP LFC.converged (
bool
) – Whether L-BFGS-B converged.
- replace_underscores(factors)
Replace all underscores from strings in a list by hyphens.
To be used on design factors to avoid bugs due to the reliance on
str.split("_")
in parts of the code.
- robust_method_of_moments_disp(normed_counts, design_matrix)
Perform dispersion estimation using a method of trimmed moments.
Used for outlier detection based on Cook’s distance.
- Parameters:
normed_counts (
pandas.DataFrame
) – DF of deseq2-normalized read counts. Rows: samples, columns: genes.design_matrix (
pandas.DataFrame
) – A DataFrame with experiment design information (to split cohorts). Indexed by sample barcodes. Unexpanded, with intercept.
- Returns:
Trimmed method of moment dispersion estimates. Used for outlier detection based on Cook’s distance.
- Return type:
- test_valid_counts(counts)
Test that the count matrix contains valid inputs.
More precisely, test that inputs are non-negative integers.
- Parameters:
counts (
pandas.DataFrame
orndarray
) – Raw counts. One column per gene, rows are indexed by sample barcodes.- Return type:
- trimmed_cell_variance(counts, cells)
Return trimmed variance of counts according to condition.
Compute the variance after trimming data of its smallest and largest elements, grouped by cohorts, and return the max across cohorts. The trim factor is a function of data size.
- Parameters:
counts (
pandas.DataFrame
) – Sample-wise gene counts.cells (
pandas.Series
) – Cohort affiliation of each sample.
- Returns:
Gene-wise trimmed variance estimate.
- Return type:
- trimmed_mean(x, trim=0.1, **kwargs)
Return trimmed mean.
Compute the mean after trimming data of its smallest and largest quantiles.
- trimmed_variance(x, trim=0.125, axis=0)
Return trimmed variance.
Compute the variance after trimming data of its smallest and largest quantiles.
- wald_test(design_matrix, disp, lfc, mu, ridge_factor, contrast, lfc_null, alt_hypothesis)
Run Wald test for differential expression.
Computes Wald statistics, standard error and p-values from dispersion and LFC estimates.
- Parameters:
design_matrix (
ndarray
) – Design matrix.disp (
float
) – Dispersion estimate.lfc (
ndarray
) – Log-fold change estimate (in natural log scale).mu (
float
) – Mean estimation for the NB model.ridge_factor (
ndarray
) – Regularization factors.contrast (
ndarray
) – Vector encoding the contrast that is being tested.lfc_null (
float
) – The (log2) log fold change under the null hypothesis.alt_hypothesis (
str
orNone
) – The alternative hypothesis for computing wald p-values.
- Return type:
- Returns: