Note

Click here to download the full example code or to run this example in your browser via Binder

Step-by-step PyDESeq2 workflow

This notebook details all the steps of the PyDESeq2 pipeline.

It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository.

If this is your first contact with PyDESeq2, we recommend you first have a look at the standard workflow example.

We start by importing required packages and setting up an optional path to save results.

import os
import pickle as pkl

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data

SAVE = False  # whether to save the outputs of this notebook

if SAVE:
    # Replace this with the path to directory where you would like results to be
    # saved
    OUTPUT_PATH = "../output_files/synthetic_example"
    os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist

Data loading 

Note that we are also explaining this step in the ‘getting started’ example. To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:

A count matrix of shape ‘number of samples’ x ‘number of genes’, containing read counts (non-negative integers),

Metadata (or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data in cohorts.

Both should be provided as pandas dataframes.

To illustrate the required data format, we load a synthetic example dataset that may be obtained through PyDESeq2’s API using utils.load_example_data(). You may replace it with your own dataset.

counts_df = load_example_data(
    modality="raw_counts",
    dataset="synthetic",
    debug=False,
)

metadata = load_example_data(
    modality="metadata",
    dataset="synthetic",
    debug=False,
)

1. Read counts modeling 

Read counts modeling with the DeseqDataSet class

The DeseqDataSet class has two mandatory arguments, counts and metadata, as well as a set of optional keyword arguments, among which:

design_factor: the name of the column of metadata to be used as a design variable
refit_cooks: whether to refit cooks outliers – this is advised, in general.

Note

in the case of the provided synthetic data, there won’t be any Cooks outliers.

inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design_factors="condition",  # compare samples based on the "condition"
    # column ("B" vs "A")
    refit_cooks=True,
    inference=inference,
)

Compute normalization factors 

dds.fit_size_factors()

dds.obsm["size_factors"]

Fitting size factors...
... done in 0.00 seconds.


array([1.22898094, 1.18877375, 0.99722229, 1.00215773, 0.83457743,
       1.10730382, 0.8999001 , 1.15343785, 0.68163849, 1.29764537,
       1.04491511, 1.45930946, 1.14588441, 0.8049275 , 0.88402672,
       0.88402672, 1.32879767, 0.82564657, 1.5978062 , 1.29764537,
       1.31940196, 0.69919197, 1.10697146, 1.10214803, 1.19152118,
       1.0624452 , 0.98548229, 0.76881428, 0.8939601 , 1.27135863,
       1.61101905, 1.55084302, 0.83601298, 0.98213727, 1.27270212,
       1.0510719 , 1.76078144, 1.08132885, 1.50390106, 1.0510719 ,
       0.80280751, 0.70955247, 1.32602392, 0.98031899, 1.1078077 ,
       0.68792508, 0.90429564, 1.56411155, 0.81918767, 1.19364837,
       0.79492024, 1.84963565, 0.79694628, 0.79708276, 0.97287297,
       1.16248554, 1.50489413, 1.41929759, 1.04612122, 1.05720226,
       0.99635345, 1.84224912, 1.03801163, 0.89633874, 0.72952979,
       1.33453944, 0.93968061, 1.14016425, 1.59166589, 1.08554239,
       0.72370261, 0.91558563, 1.14183629, 1.33857618, 0.94450599,
       0.85266438, 1.38005658, 1.01803293, 1.04472988, 1.25372879,
       1.77488931, 0.92641092, 1.06341062, 1.05653145, 0.61973175,
       0.75569423, 0.69781308, 1.38512317, 0.82165798, 0.81330537,
       0.96864497, 1.08526199, 0.78205578, 0.98864287, 1.15322141,
       1.26935914, 0.74399805, 0.73987825, 0.95887987, 0.63570796])

Fit genewise dispersions 

dds.fit_genewise_dispersions()

dds.varm["genewise_dispersions"]

Fitting dispersions...
... done in 0.02 seconds.


array([0.91013511, 0.21371805, 0.81359443, 0.1613621 , 0.24850424,
       0.97307734, 0.23303023, 0.19810261, 0.18363604, 0.64637486])

Fit dispersion trend coefficients 

dds.fit_dispersion_trend()
dds.uns["trend_coeffs"]
dds.varm["fitted_dispersions"]

Fitting dispersion trend curve...
... done in 0.01 seconds.


array([0.65142434, 0.31300062, 1.04986539, 0.13414536, 0.264005  ,
       0.97812827, 0.25676459, 0.20575044, 0.21602633, 0.50274561])

Dispersion priors 

dds.fit_dispersion_prior()
print(
    f"logres_prior={dds.uns['_squared_logres']}, sigma_prior={dds.uns['prior_disp_var']}"
)

logres_prior=0.05592493654750616, sigma_prior=0.25

The fit_MAP_dispersions method filters the genes for which dispersion shrinkage is applied. Indeed, for genes whose MLE dispersions are too high above the trend curve, the original MLE value is kept. The final values of the dispersions that are used for downstream analysis is stored in dds.dispersions.

dds.fit_MAP_dispersions()
dds.varm["MAP_dispersions"]
dds.varm["dispersions"]

Fitting MAP dispersions...
... done in 0.02 seconds.


array([0.88259824, 0.22257849, 0.83723751, 0.15897038, 0.24992574,
       0.97364737, 0.23515474, 0.19878066, 0.18652019, 0.63189957])

Fit log fold changes 

Note that in the DeseqDataSet object, the log-fold changes are stored in natural log scale, but that the results dataframe output by the summary method of DeseqStats displays LFCs in log2 scale (see later on).

dds.fit_LFC()
dds.varm["LFC"]

Fitting LFCs...
... done in 0.02 seconds.

	intercept	condition_B_vs_A
gene1	1.891436	0.438632
gene2	2.851662	0.373296
gene3	1.787780	-0.438645
gene4	4.741958	-0.285647
gene5	3.077798	0.403457
gene6	1.678536	0.001010
gene7	3.291025	0.093116
gene8	3.785129	-0.187604
gene9	3.682882	-0.147443
gene10	2.300515	0.267562

Calculate Cooks distances and refit 

Note that this step is optional.

dds.calculate_cooks()
if dds.refit_cooks:
    # Replace outlier counts
    dds.refit()

Replacing 0 outlier genes.

Save everything

if SAVE:
    with open(os.path.join(OUTPUT_PATH, "dds_detailed_pipe.pkl"), "wb") as f:
        pkl.dump(dds, f)

2. Statistical analysis 

Statistical analysis with the DeseqStats class. The DeseqDataSet class has a unique mandatory arguments, dds, which should be a fitted DeseqDataSet object, as well as a set of optional keyword arguments, among which:

alpha: the p-value and adjusted p-value significance threshold
cooks_filter: whether to filter p-values based on cooks outliers
independent_filter: whether to perform independent filtering to correct p-value trends.

stat_res = DeseqStats(dds, alpha=0.05, cooks_filter=True, independent_filter=True)

Wald tests 

stat_res.run_wald_test()
stat_res.p_values

Running Wald tests...
... done in 0.01 seconds.


gene1     0.028604
gene2     0.000329
gene3     0.032075
gene4     0.000513
gene5     0.000168
gene6     0.996253
gene7     0.370297
gene8     0.047227
gene9     0.110391
gene10    0.114518
dtype: float64

Cooks filtering 

This is optional

Note

Note: in the case of the provided synthetic data, there won’t be any outliers.

if stat_res.cooks_filter:
    stat_res._cooks_filtering()
stat_res.p_values

gene1     0.028604
gene2     0.000329
gene3     0.032075
gene4     0.000513
gene5     0.000168
gene6     0.996253
gene7     0.370297
gene8     0.047227
gene9     0.110391
gene10    0.114518
dtype: float64

P-value adjustment 

if stat_res.independent_filter:
    stat_res._independent_filtering()
else:
    stat_res._p_value_adjustment()

stat_res.padj

gene1     0.064150
gene2     0.001646
gene3     0.064150
gene4     0.001710
gene5     0.001646
gene6     0.996253
gene7     0.411441
gene8     0.078711
gene9     0.143147
gene10    0.143147
Name: 0, dtype: float64

Building a results dataframe 

This dataframe is stored in the results_df attribute of the DeseqStats class.

stat_res.summary()

Log2 fold change & Wald test p-value: condition B vs A
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.632812  0.289101  2.188898  0.028604  0.064150
gene2    21.281239        0.538552  0.149963  3.591236  0.000329  0.001646
gene3     5.010123       -0.632830  0.295236 -2.143476  0.032075  0.064150
gene4   100.517961       -0.412102  0.118629 -3.473868  0.000513  0.001710
gene5    27.142450        0.582065  0.154706  3.762409  0.000168  0.001646
gene6     5.413043        0.001457  0.310311  0.004696  0.996253  0.996253
gene7    28.294023        0.134338  0.149945  0.895917  0.370297  0.411441
gene8    40.358344       -0.270656  0.136401 -1.984261  0.047227  0.078711
gene9    37.166183       -0.212715  0.133243 -1.596437  0.110391  0.143147
gene10   11.589325        0.386011  0.244588  1.578207  0.114518  0.143147

Save everything if SAVE is set to True

if SAVE:
    with open(os.path.join(OUTPUT_PATH, "stat_results_detailed_pipe.pkl"), "wb") as f:
        pkl.dump(stat_res, f)

LFC Shrinkage 

For visualization or post-processing purposes, it might be suitable to perform LFC shrinkage. This is implemented by the lfc_shrink method.

stat_res.lfc_shrink(coeff="condition_B_vs_A")

Fitting MAP LFCs...
... done in 0.02 seconds.

Shrunk log2 fold change & Wald test p-value: condition B vs A
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.408253  0.294276  2.188898  0.028604  0.064150
gene2    21.281239        0.480145  0.151201  3.591236  0.000329  0.001646
gene3     5.010123       -0.396066  0.300796 -2.143476  0.032075  0.064150
gene4   100.517961       -0.374191  0.118703 -3.473868  0.000513  0.001710
gene5    27.142450        0.521487  0.156210  3.762409  0.000168  0.001646
gene6     5.413043        0.000716  0.239203  0.004696  0.996253  0.996253
gene7    28.294023        0.103421  0.141496  0.895917  0.370297  0.411441
gene8    40.358344       -0.226288  0.133477 -1.984261  0.047227  0.078711
gene9    37.166183       -0.175746  0.129138 -1.596437  0.110391  0.143147
gene10   11.589325        0.239935  0.231986  1.578207  0.114518  0.143147

Save everything

if SAVE:
    with open(
        os.path.join(OUTPUT_PATH, "shrunk_stat_results_detailed_pipe.pkl"), "wb"
    ) as f:
        pkl.dump(stat_res, f)

Total running time of the script: ( 0 minutes 0.331 seconds)

Gallery generated by Sphinx-Gallery