Note

Go to the end to download the full example code.

Step-by-step PyDESeq2 workflow

This notebook details all the steps of the PyDESeq2 pipeline.

It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository.

If this is your first contact with PyDESeq2, we recommend you first have a look at the standard workflow example.

We start by importing required packages and setting up an optional path to save results.

import os
import pickle as pkl

import numpy as np

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data

SAVE = False  # whether to save the outputs of this notebook

if SAVE:
    # Replace this with the path to directory where you would like results to be
    # saved
    OUTPUT_PATH = "../output_files/synthetic_example"
    os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist

Data loading 

Note that we are also explaining this step in the ‘getting started’ example. To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:

A count matrix of shape ‘number of samples’ x ‘number of genes’, containing read counts (non-negative integers),

Metadata (or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data in cohorts.

Both should be provided as pandas dataframes.

To illustrate the required data format, we load a synthetic example dataset that may be obtained through PyDESeq2’s API using utils.load_example_data(). You may replace it with your own dataset.

counts_df = load_example_data(
    modality="raw_counts",
    dataset="synthetic",
    debug=False,
)

metadata = load_example_data(
    modality="metadata",
    dataset="synthetic",
    debug=False,
)

1. Read counts modeling 

Read counts modeling with the DeseqDataSet class

The DeseqDataSet class has two mandatory arguments, counts and metadata, as well as a set of optional keyword arguments, among which:

design: a string representing a formulaic formula, or a design matrix (ndarray), that will be used to model the data.
refit_cooks: whether to refit cooks outliers – this is advised, in general.

Note

in the case of the provided synthetic data, there won’t be any Cooks outliers.

inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",  # compare samples based on the "condition"
    # column ("B" vs "A")
    refit_cooks=True,
    inference=inference,
)

Compute normalization factors 

dds.fit_size_factors()

dds.obs["size_factors"]

Fitting size factors...
Using None as control genes, passed at DeseqDataSet initialization
... done in 0.00 seconds.


sample1      1.228981
sample2      1.188774
sample3      0.997222
sample4      1.002158
sample5      0.834577
               ...
sample96     1.269359
sample97     0.743998
sample98     0.739878
sample99     0.958880
sample100    0.635708
Name: size_factors, Length: 100, dtype: float64

Fit genewise dispersions 

dds.fit_genewise_dispersions()

dds.var["genewise_dispersions"]

Fitting dispersions...
... done in 0.02 seconds.


gene1     0.910135
gene2     0.213718
gene3     0.813594
gene4     0.161362
gene5     0.248504
gene6     0.973077
gene7     0.233030
gene8     0.198103
gene9     0.183636
gene10    0.646375
Name: genewise_dispersions, dtype: float64

Fit dispersion trend coefficients 

dds.fit_dispersion_trend()
dds.uns["trend_coeffs"]
dds.var["fitted_dispersions"]

Fitting dispersion trend curve...
... done in 0.01 seconds.


gene1     0.651424
gene2     0.313001
gene3     1.049865
gene4     0.134145
gene5     0.264005
gene6     0.978128
gene7     0.256765
gene8     0.205750
gene9     0.216026
gene10    0.502746
Name: fitted_dispersions, dtype: float64

Dispersion priors 

dds.fit_dispersion_prior()
print(
    f"logres_prior={dds.uns['_squared_logres']}, sigma_prior={dds.uns['prior_disp_var']}"
)

logres_prior=0.05592493654750541, sigma_prior=0.25

The fit_MAP_dispersions method filters the genes for which dispersion shrinkage is applied. Indeed, for genes whose MLE dispersions are too high above the trend curve, the original MLE value is kept. The final values of the dispersions that are used for downstream analysis is stored in dds.dispersions.

dds.fit_MAP_dispersions()
dds.var["MAP_dispersions"]
dds.var["dispersions"]

Fitting MAP dispersions...
... done in 0.02 seconds.


gene1     0.882598
gene2     0.222578
gene3     0.837238
gene4     0.158970
gene5     0.249926
gene6     0.973647
gene7     0.235155
gene8     0.198781
gene9     0.186520
gene10    0.631900
Name: dispersions, dtype: float64

Fit log fold changes 

Note that in the DeseqDataSet object, the log-fold changes are stored in natural log scale, but that the results dataframe output by the summary method of DeseqStats displays LFCs in log2 scale (see later on).

dds.fit_LFC()
dds.varm["LFC"]

Fitting LFCs...
... done in 0.01 seconds.

	Intercept	condition[T.B]
gene1	1.891436	0.438632
gene2	2.851662	0.373296
gene3	1.787780	-0.438645
gene4	4.741958	-0.285647
gene5	3.077798	0.403457
gene6	1.678536	0.001010
gene7	3.291025	0.093116
gene8	3.785129	-0.187604
gene9	3.682882	-0.147443
gene10	2.300515	0.267562

Calculate Cooks distances and refit 

Note that this step is optional.

dds.calculate_cooks()
if dds.refit_cooks:
    # Replace outlier counts
    dds.refit()

Calculating cook's distance...
... done in 0.00 seconds.

Replacing 0 outlier genes.

Save everything

if SAVE:
    with open(os.path.join(OUTPUT_PATH, "dds_detailed_pipe.pkl"), "wb") as f:
        pkl.dump(dds, f)

2. Statistical analysis 

Statistical analysis with the DeseqStats class. The DeseqDataSet class has two unique mandatory arguments, dds, which should be a fitted DeseqDataSet object, and contrast, which should be a list of 3 strings of the form ["factor", "tested_level", "reference_level"] or a numerical contrast vector, as well as a set of optional keyword arguments, among which:

alpha: the p-value and adjusted p-value significance threshold
cooks_filter: whether to filter p-values based on cooks outliers
independent_filter: whether to perform independent filtering to correct p-value trends.

ds = DeseqStats(
    dds,
    contrast=np.array([0, 1]),
    alpha=0.05,
    cooks_filter=True,
    independent_filter=True,
)

Wald tests 

ds.run_wald_test()
ds.p_values

Running Wald tests...
... done in 1.02 seconds.


gene1     0.028604
gene2     0.000329
gene3     0.032075
gene4     0.000513
gene5     0.000168
gene6     0.996253
gene7     0.370297
gene8     0.047227
gene9     0.110391
gene10    0.114518
dtype: float64

Cooks filtering 

This is optional

Note

Note: in the case of the provided synthetic data, there won’t be any outliers.

if ds.cooks_filter:
    ds._cooks_filtering()
ds.p_values

gene1     0.028604
gene2     0.000329
gene3     0.032075
gene4     0.000513
gene5     0.000168
gene6     0.996253
gene7     0.370297
gene8     0.047227
gene9     0.110391
gene10    0.114518
dtype: float64

P-value adjustment 

if ds.independent_filter:
    ds._independent_filtering()
else:
    ds._p_value_adjustment()

ds.padj

gene1     0.064150
gene2     0.001646
gene3     0.064150
gene4     0.001710
gene5     0.001646
gene6     0.996253
gene7     0.411441
gene8     0.078711
gene9     0.143147
gene10    0.143147
Name: 0, dtype: float64

Building a results dataframe 

This dataframe is stored in the results_df attribute of the DeseqStats class.

ds.summary()

Log2 fold change & Wald test p-value, contrast vector: [0 1]
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.632812  0.289101  2.188898  0.028604  0.064150
gene2    21.281239        0.538552  0.149963  3.591236  0.000329  0.001646
gene3     5.010123       -0.632830  0.295236 -2.143476  0.032075  0.064150
gene4   100.517961       -0.412102  0.118629 -3.473868  0.000513  0.001710
gene5    27.142450        0.582065  0.154706  3.762409  0.000168  0.001646
gene6     5.413043        0.001457  0.310311  0.004696  0.996253  0.996253
gene7    28.294023        0.134338  0.149945  0.895917  0.370297  0.411441
gene8    40.358344       -0.270656  0.136401 -1.984261  0.047227  0.078711
gene9    37.166183       -0.212715  0.133243 -1.596437  0.110391  0.143147
gene10   11.589325        0.386011  0.244588  1.578207  0.114518  0.143147

Save everything if SAVE is set to True

if SAVE:
    with open(os.path.join(OUTPUT_PATH, "stat_results_detailed_pipe.pkl"), "wb") as f:
        pkl.dump(ds, f)

LFC Shrinkage 

For visualization or post-processing purposes, it might be suitable to perform LFC shrinkage. This is implemented by the lfc_shrink method.

ds.lfc_shrink(coeff="condition[T.B]")

Fitting MAP LFCs...
... done in 0.01 seconds.

Shrunk log2 fold change & Wald test p-value: condition[T.B]
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.408148  0.294271  2.188898  0.028604  0.064150
gene2    21.281239        0.480228  0.151198  3.591236  0.000329  0.001646
gene3     5.010123       -0.396066  0.300796 -2.143476  0.032075  0.064150
gene4   100.517961       -0.374346  0.118704 -3.473868  0.000513  0.001710
gene5    27.142450        0.521534  0.156204  3.762409  0.000168  0.001646
gene6     5.413043        0.000715  0.239203  0.004696  0.996253  0.996253
gene7    28.294023        0.103473  0.141497  0.895917  0.370297  0.411441
gene8    40.358344       -0.226414  0.133479 -1.984261  0.047227  0.078711
gene9    37.166183       -0.175964  0.129142 -1.596437  0.110391  0.143147
gene10   11.589325        0.239935  0.231986  1.578207  0.114518  0.143147

Save everything

if SAVE:
    with open(os.path.join(OUTPUT_PATH, "shrunk_results_detailed_pipe.pkl"), "wb") as f:
        pkl.dump(ds, f)

Total running time of the script: (0 minutes 1.188 seconds)

Gallery generated by Sphinx-Gallery