Note
Click here to download the full example code or to run this example in your browser via Binder
Step-by-step PyDESeq2 workflow
This notebook details all the steps of the PyDESeq2 pipeline.
It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository.
If this is your first contact with PyDESeq2, we recommend you first have a look at the standard workflow example.
We start by importing required packages and setting up an optional path to save results.
import os
import pickle as pkl
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data
SAVE = False # whether to save the outputs of this notebook
if SAVE:
# Replace this with the path to directory where you would like results to be
# saved
OUTPUT_PATH = "../output_files/synthetic_example"
os.makedirs(OUTPUT_PATH, exist_ok=True) # Create path if it doesn't exist
Data loading
Note that we are also explaining this step in the ‘getting started’ example. To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:
A count matrix of shape ‘number of samples’ x ‘number of genes’, containing read counts (non-negative integers),
Metadata (or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data in cohorts.
Both should be provided as pandas dataframes.
To illustrate the required data format, we load a synthetic example dataset
that may be
obtained through PyDESeq2’s API using utils.load_example_data()
.
You may replace it with your own dataset.
1. Read counts modeling
Read counts modeling with the DeseqDataSet
class
The DeseqDataSet
class has two mandatory
arguments, counts
and
metadata
, as well as a set of optional keyword arguments, among which:
design_factor
: the name of the column of metadata to be used as a design variablerefit_cooks
: whether to refit cooks outliers – this is advised, in general.
Note
in the case of the provided synthetic data, there won’t be any Cooks outliers.
inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design_factors="condition", # compare samples based on the "condition"
# column ("B" vs "A")
refit_cooks=True,
inference=inference,
)
Compute normalization factors
dds.fit_size_factors()
dds.obsm["size_factors"]
Fitting size factors...
... done in 0.00 seconds.
array([1.22898094, 1.18877375, 0.99722229, 1.00215773, 0.83457743,
1.10730382, 0.8999001 , 1.15343785, 0.68163849, 1.29764537,
1.04491511, 1.45930946, 1.14588441, 0.8049275 , 0.88402672,
0.88402672, 1.32879767, 0.82564657, 1.5978062 , 1.29764537,
1.31940196, 0.69919197, 1.10697146, 1.10214803, 1.19152118,
1.0624452 , 0.98548229, 0.76881428, 0.8939601 , 1.27135863,
1.61101905, 1.55084302, 0.83601298, 0.98213727, 1.27270212,
1.0510719 , 1.76078144, 1.08132885, 1.50390106, 1.0510719 ,
0.80280751, 0.70955247, 1.32602392, 0.98031899, 1.1078077 ,
0.68792508, 0.90429564, 1.56411155, 0.81918767, 1.19364837,
0.79492024, 1.84963565, 0.79694628, 0.79708276, 0.97287297,
1.16248554, 1.50489413, 1.41929759, 1.04612122, 1.05720226,
0.99635345, 1.84224912, 1.03801163, 0.89633874, 0.72952979,
1.33453944, 0.93968061, 1.14016425, 1.59166589, 1.08554239,
0.72370261, 0.91558563, 1.14183629, 1.33857618, 0.94450599,
0.85266438, 1.38005658, 1.01803293, 1.04472988, 1.25372879,
1.77488931, 0.92641092, 1.06341062, 1.05653145, 0.61973175,
0.75569423, 0.69781308, 1.38512317, 0.82165798, 0.81330537,
0.96864497, 1.08526199, 0.78205578, 0.98864287, 1.15322141,
1.26935914, 0.74399805, 0.73987825, 0.95887987, 0.63570796])
Fit genewise dispersions
dds.fit_genewise_dispersions()
dds.varm["genewise_dispersions"]
Fitting dispersions...
... done in 0.02 seconds.
array([0.91013511, 0.21371805, 0.81359443, 0.1613621 , 0.24850424,
0.97307734, 0.23303023, 0.19810261, 0.18363604, 0.64637486])
Fit dispersion trend coefficients
Fitting dispersion trend curve...
... done in 0.01 seconds.
array([0.65142434, 0.31300062, 1.04986539, 0.13414536, 0.264005 ,
0.97812827, 0.25676459, 0.20575044, 0.21602633, 0.50274561])
Dispersion priors
logres_prior=0.05592493654750616, sigma_prior=0.25
MAP Dispersions
The fit_MAP_dispersions method filters the genes for which dispersion shrinkage is applied. Indeed, for genes whose MLE dispersions are too high above the trend curve, the original MLE value is kept. The final values of the dispersions that are used for downstream analysis is stored in dds.dispersions.
Fitting MAP dispersions...
... done in 0.02 seconds.
array([0.88259824, 0.22257849, 0.83723751, 0.15897038, 0.24992574,
0.97364737, 0.23515474, 0.19878066, 0.18652019, 0.63189957])
Fit log fold changes
Note that in the DeseqDataSet object, the log-fold changes are stored in natural log scale, but that the results dataframe output by the summary method of DeseqStats displays LFCs in log2 scale (see later on).
dds.fit_LFC()
dds.varm["LFC"]
Fitting LFCs...
... done in 0.02 seconds.
Calculate Cooks distances and refit
Note that this step is optional.
dds.calculate_cooks()
if dds.refit_cooks:
# Replace outlier counts
dds.refit()
Replacing 0 outlier genes.
Save everything
if SAVE:
with open(os.path.join(OUTPUT_PATH, "dds_detailed_pipe.pkl"), "wb") as f:
pkl.dump(dds, f)
2. Statistical analysis
Statistical analysis with the DeseqStats
class.
The DeseqDataSet class has a unique mandatory arguments, dds, which should
be a fitted DeseqDataSet object, as well as a set of optional keyword
arguments, among which:
alpha: the p-value and adjusted p-value significance threshold
cooks_filter: whether to filter p-values based on cooks outliers
independent_filter: whether to perform independent filtering to correct p-value trends.
stat_res = DeseqStats(dds, alpha=0.05, cooks_filter=True, independent_filter=True)
Wald tests
stat_res.run_wald_test()
stat_res.p_values
Running Wald tests...
... done in 0.01 seconds.
gene1 0.028604
gene2 0.000329
gene3 0.032075
gene4 0.000513
gene5 0.000168
gene6 0.996253
gene7 0.370297
gene8 0.047227
gene9 0.110391
gene10 0.114518
dtype: float64
Cooks filtering
This is optional
Note
Note: in the case of the provided synthetic data, there won’t be any outliers.
if stat_res.cooks_filter:
stat_res._cooks_filtering()
stat_res.p_values
gene1 0.028604
gene2 0.000329
gene3 0.032075
gene4 0.000513
gene5 0.000168
gene6 0.996253
gene7 0.370297
gene8 0.047227
gene9 0.110391
gene10 0.114518
dtype: float64
P-value adjustment
if stat_res.independent_filter:
stat_res._independent_filtering()
else:
stat_res._p_value_adjustment()
stat_res.padj
gene1 0.064150
gene2 0.001646
gene3 0.064150
gene4 0.001710
gene5 0.001646
gene6 0.996253
gene7 0.411441
gene8 0.078711
gene9 0.143147
gene10 0.143147
Name: 0, dtype: float64
Building a results dataframe
This dataframe is stored in the results_df attribute of the DeseqStats class.
stat_res.summary()
Log2 fold change & Wald test p-value: condition B vs A
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.632812 0.289101 2.188898 0.028604 0.064150
gene2 21.281239 0.538552 0.149963 3.591236 0.000329 0.001646
gene3 5.010123 -0.632830 0.295236 -2.143476 0.032075 0.064150
gene4 100.517961 -0.412102 0.118629 -3.473868 0.000513 0.001710
gene5 27.142450 0.582065 0.154706 3.762409 0.000168 0.001646
gene6 5.413043 0.001457 0.310311 0.004696 0.996253 0.996253
gene7 28.294023 0.134338 0.149945 0.895917 0.370297 0.411441
gene8 40.358344 -0.270656 0.136401 -1.984261 0.047227 0.078711
gene9 37.166183 -0.212715 0.133243 -1.596437 0.110391 0.143147
gene10 11.589325 0.386011 0.244588 1.578207 0.114518 0.143147
Save everything if SAVE is set to True
if SAVE:
with open(os.path.join(OUTPUT_PATH, "stat_results_detailed_pipe.pkl"), "wb") as f:
pkl.dump(stat_res, f)
LFC Shrinkage
For visualization or post-processing purposes, it might be suitable to perform LFC shrinkage. This is implemented by the lfc_shrink method.
stat_res.lfc_shrink(coeff="condition_B_vs_A")
Fitting MAP LFCs...
... done in 0.02 seconds.
Shrunk log2 fold change & Wald test p-value: condition B vs A
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.408253 0.294276 2.188898 0.028604 0.064150
gene2 21.281239 0.480145 0.151201 3.591236 0.000329 0.001646
gene3 5.010123 -0.396066 0.300796 -2.143476 0.032075 0.064150
gene4 100.517961 -0.374191 0.118703 -3.473868 0.000513 0.001710
gene5 27.142450 0.521487 0.156210 3.762409 0.000168 0.001646
gene6 5.413043 0.000716 0.239203 0.004696 0.996253 0.996253
gene7 28.294023 0.103421 0.141496 0.895917 0.370297 0.411441
gene8 40.358344 -0.226288 0.133477 -1.984261 0.047227 0.078711
gene9 37.166183 -0.175746 0.129138 -1.596437 0.110391 0.143147
gene10 11.589325 0.239935 0.231986 1.578207 0.114518 0.143147
Save everything
if SAVE:
with open(
os.path.join(OUTPUT_PATH, "shrunk_stat_results_detailed_pipe.pkl"), "wb"
) as f:
pkl.dump(stat_res, f)
Total running time of the script: ( 0 minutes 0.331 seconds)