Note
Go to the end to download the full example code.
Step-by-step PyDESeq2 workflow
This notebook details all the steps of the PyDESeq2 pipeline.
It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository.
If this is your first contact with PyDESeq2, we recommend you first have a look at the standard workflow example.
We start by importing required packages and setting up an optional path to save results.
import os
import pickle as pkl
import numpy as np
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data
SAVE = False # whether to save the outputs of this notebook
if SAVE:
# Replace this with the path to directory where you would like results to be
# saved
OUTPUT_PATH = "../output_files/synthetic_example"
os.makedirs(OUTPUT_PATH, exist_ok=True) # Create path if it doesn't exist
Data loading
Note that we are also explaining this step in the ‘getting started’ example. To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:
A count matrix of shape ‘number of samples’ x ‘number of genes’, containing read counts (non-negative integers),
Metadata (or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data in cohorts.
Both should be provided as pandas dataframes.
To illustrate the required data format, we load a synthetic example dataset
that may be
obtained through PyDESeq2’s API using utils.load_example_data().
You may replace it with your own dataset.
1. Read counts modeling
Read counts modeling with the DeseqDataSet class
The DeseqDataSet class has two mandatory
arguments, counts and
metadata, as well as a set of optional keyword arguments, among which:
design: a string representing a formulaic formula, or a design matrix (ndarray), that will be used to model the data.refit_cooks: whether to refit cooks outliers – this is advised, in general.
Note
in the case of the provided synthetic data, there won’t be any Cooks outliers.
inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition", # compare samples based on the "condition"
# column ("B" vs "A")
refit_cooks=True,
inference=inference,
)
Compute normalization factors
dds.fit_size_factors()
dds.obs["size_factors"]
Fitting size factors...
Using None as control genes, passed at DeseqDataSet initialization
... done in 0.00 seconds.
sample1 1.228981
sample2 1.188774
sample3 0.997222
sample4 1.002158
sample5 0.834577
...
sample96 1.269359
sample97 0.743998
sample98 0.739878
sample99 0.958880
sample100 0.635708
Name: size_factors, Length: 100, dtype: float64
Fit genewise dispersions
dds.fit_genewise_dispersions()
dds.var["genewise_dispersions"]
Fitting dispersions...
... done in 0.02 seconds.
gene1 0.910135
gene2 0.213718
gene3 0.813594
gene4 0.161362
gene5 0.248504
gene6 0.973077
gene7 0.233030
gene8 0.198103
gene9 0.183636
gene10 0.646375
Name: genewise_dispersions, dtype: float64
Fit dispersion trend coefficients
Fitting dispersion trend curve...
... done in 0.01 seconds.
gene1 0.651424
gene2 0.313001
gene3 1.049865
gene4 0.134145
gene5 0.264005
gene6 0.978128
gene7 0.256765
gene8 0.205750
gene9 0.216026
gene10 0.502746
Name: fitted_dispersions, dtype: float64
Dispersion priors
logres_prior=0.05592493654750541, sigma_prior=0.25
MAP Dispersions
The fit_MAP_dispersions method filters the genes for which dispersion shrinkage is applied. Indeed, for genes whose MLE dispersions are too high above the trend curve, the original MLE value is kept. The final values of the dispersions that are used for downstream analysis is stored in dds.dispersions.
Fitting MAP dispersions...
... done in 0.02 seconds.
gene1 0.882598
gene2 0.222578
gene3 0.837238
gene4 0.158970
gene5 0.249926
gene6 0.973647
gene7 0.235155
gene8 0.198781
gene9 0.186520
gene10 0.631900
Name: dispersions, dtype: float64
Fit log fold changes
Note that in the DeseqDataSet object, the log-fold changes are stored in natural log scale, but that the results dataframe output by the summary method of DeseqStats displays LFCs in log2 scale (see later on).
dds.fit_LFC()
dds.varm["LFC"]
Fitting LFCs...
... done in 0.01 seconds.
Calculate Cooks distances and refit
Note that this step is optional.
dds.calculate_cooks()
if dds.refit_cooks:
# Replace outlier counts
dds.refit()
Calculating cook's distance...
... done in 0.00 seconds.
Replacing 0 outlier genes.
Save everything
if SAVE:
with open(os.path.join(OUTPUT_PATH, "dds_detailed_pipe.pkl"), "wb") as f:
pkl.dump(dds, f)
2. Statistical analysis
Statistical analysis with the DeseqStats class.
The DeseqDataSet class has two unique mandatory arguments, dds, which should
be a fitted DeseqDataSet object, and contrast, which should be a list
of 3 strings of the form ["factor", "tested_level", "reference_level"] or a
numerical contrast vector, as well as a set of optional keyword
arguments, among which:
alpha: the p-value and adjusted p-value significance threshold
cooks_filter: whether to filter p-values based on cooks outliers
independent_filter: whether to perform independent filtering to correct p-value trends.
ds = DeseqStats(
dds,
contrast=np.array([0, 1]),
alpha=0.05,
cooks_filter=True,
independent_filter=True,
)
Wald tests
ds.run_wald_test()
ds.p_values
Running Wald tests...
... done in 1.02 seconds.
gene1 0.028604
gene2 0.000329
gene3 0.032075
gene4 0.000513
gene5 0.000168
gene6 0.996253
gene7 0.370297
gene8 0.047227
gene9 0.110391
gene10 0.114518
dtype: float64
Cooks filtering
This is optional
Note
Note: in the case of the provided synthetic data, there won’t be any outliers.
if ds.cooks_filter:
ds._cooks_filtering()
ds.p_values
gene1 0.028604
gene2 0.000329
gene3 0.032075
gene4 0.000513
gene5 0.000168
gene6 0.996253
gene7 0.370297
gene8 0.047227
gene9 0.110391
gene10 0.114518
dtype: float64
P-value adjustment
if ds.independent_filter:
ds._independent_filtering()
else:
ds._p_value_adjustment()
ds.padj
gene1 0.064150
gene2 0.001646
gene3 0.064150
gene4 0.001710
gene5 0.001646
gene6 0.996253
gene7 0.411441
gene8 0.078711
gene9 0.143147
gene10 0.143147
Name: 0, dtype: float64
Building a results dataframe
This dataframe is stored in the results_df attribute of the DeseqStats class.
ds.summary()
Log2 fold change & Wald test p-value, contrast vector: [0 1]
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.632812 0.289101 2.188898 0.028604 0.064150
gene2 21.281239 0.538552 0.149963 3.591236 0.000329 0.001646
gene3 5.010123 -0.632830 0.295236 -2.143476 0.032075 0.064150
gene4 100.517961 -0.412102 0.118629 -3.473868 0.000513 0.001710
gene5 27.142450 0.582065 0.154706 3.762409 0.000168 0.001646
gene6 5.413043 0.001457 0.310311 0.004696 0.996253 0.996253
gene7 28.294023 0.134338 0.149945 0.895917 0.370297 0.411441
gene8 40.358344 -0.270656 0.136401 -1.984261 0.047227 0.078711
gene9 37.166183 -0.212715 0.133243 -1.596437 0.110391 0.143147
gene10 11.589325 0.386011 0.244588 1.578207 0.114518 0.143147
Save everything if SAVE is set to True
if SAVE:
with open(os.path.join(OUTPUT_PATH, "stat_results_detailed_pipe.pkl"), "wb") as f:
pkl.dump(ds, f)
LFC Shrinkage
For visualization or post-processing purposes, it might be suitable to perform LFC shrinkage. This is implemented by the lfc_shrink method.
ds.lfc_shrink(coeff="condition[T.B]")
Fitting MAP LFCs...
... done in 0.01 seconds.
Shrunk log2 fold change & Wald test p-value: condition[T.B]
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.408148 0.294271 2.188898 0.028604 0.064150
gene2 21.281239 0.480228 0.151198 3.591236 0.000329 0.001646
gene3 5.010123 -0.396066 0.300796 -2.143476 0.032075 0.064150
gene4 100.517961 -0.374346 0.118704 -3.473868 0.000513 0.001710
gene5 27.142450 0.521534 0.156204 3.762409 0.000168 0.001646
gene6 5.413043 0.000715 0.239203 0.004696 0.996253 0.996253
gene7 28.294023 0.103473 0.141497 0.895917 0.370297 0.411441
gene8 40.358344 -0.226414 0.133479 -1.984261 0.047227 0.078711
gene9 37.166183 -0.175964 0.129142 -1.596437 0.110391 0.143147
gene10 11.589325 0.239935 0.231986 1.578207 0.114518 0.143147
Save everything
if SAVE:
with open(os.path.join(OUTPUT_PATH, "shrunk_results_detailed_pipe.pkl"), "wb") as f:
pkl.dump(ds, f)
Total running time of the script: (0 minutes 1.188 seconds)