Note
Go to the end to download the full example code.
A simple PyDESeq2 workflow
In this example, we show how to perform a simple differential expression analysis on bulk RNAseq data, using PyDESeq2.
We start by importing required packages and setting up an optional path to save results.
import os
import pickle as pkl
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data
SAVE = False # whether to save the outputs of this notebook
if SAVE:
# Replace this with the path to directory where you would like results to be saved
OUTPUT_PATH = "../output_files/synthetic_example"
os.makedirs(OUTPUT_PATH, exist_ok=True) # Create path if it doesn't exist
Data loading
To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:
A count matrix of shape ‘number of samples’ x ‘number of genes’, containing read counts (non-negative integers),
Metadata (or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data in cohorts.
Both should be provided as pandas dataframes.
To illustrate the required data format, we load a synthetic example dataset that may be
obtained through PyDESeq2’s API using utils.load_example_data().
You may replace it with your own dataset.
gene1 gene2 gene3 gene4 ... gene7 gene8 gene9 gene10
sample1 12 21 4 130 ... 16 54 49 3
sample2 1 44 2 63 ... 70 32 57 9
sample3 4 4 11 180 ... 28 34 65 2
sample4 1 10 2 100 ... 28 16 33 9
sample5 1 11 6 135 ... 32 29 31 5
... ... ... ... ... ... ... ... ... ...
sample96 7 26 3 67 ... 41 44 54 1
sample97 1 14 3 71 ... 19 42 25 4
sample98 10 36 2 72 ... 66 27 16 9
sample99 18 14 3 66 ... 32 19 79 11
sample100 21 9 3 42 ... 19 78 30 5
[100 rows x 10 columns]
print(metadata)
condition group
sample1 A X
sample2 A Y
sample3 A X
sample4 A Y
sample5 A X
... ... ...
sample96 B Y
sample97 B X
sample98 B Y
sample99 B X
sample100 B Y
[100 rows x 2 columns]
In this example, the metadata data contains two columns, condition and group,
representing two types of bi-level annotations. In the first part, we will only use the
condition factor. Later on, we’ll see how to use both the condition and the
group factors in our analysis (see Multifactor analysis).
Data filtering
Before proceeding with DEA, it is good practice to preprocess your data, e.g. to remove samples for which annotations are missing and exclude genes with very low levels of expression. This is not necessary in the case of our synthetic data, but don’t forget this step if you are using real data. To this end you can use the code below.
We start by removing samples for which condition is NaN. If you are using
another dataset, do not forget to change “condition” for the column of metadata
you wish to use as a design factor in your analysis.
Note
In the case where the design factor contains NaN entries, PyDESeq2 will throw an
error when intializing a DeseqDataSet.
Next, we filter out genes that have less than 10 read counts in total. Note again that there are no such genes in this synthetic dataset.
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
Now that we have loaded and filtered our data, we may proceed with the differential analysis.
Single factor analysis
In this first analysis, we ignore the group variable and use the condition
column as our design factor. That is, we compare gene expressions of samples that have
condition B to those that have condition A.
Read counts modeling with the DeseqDataSet class
We start by creating a DeseqDataSet
object from the count and metadata data.
A DeseqDataSet fits dispersion and
log-fold change (LFC) parameters from the data, and stores them.
inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True,
inference=inference,
# n_cpus=8, # n_cpus can be specified here or in the inference object
)
A DeseqDataSet has two mandatory
arguments: a counts and a metadata dataframe, like the ones we’ve loaded in the
first part of this tutorial.
Next, we should specify a design, i.e. a Wilkinson formula that describes the
design, or directly a design matrix. Here we provide a formula, which is a string
that formulaic should be able to
parse.
Note
The "condition" factor in design corresponds to a column
from the metadata dataframe we loaded earlier.
You might need to change it according to your own dataset.
Several other arguments may be optionally specified (see the API documentation).
Among those, the refit_cooks argument (set to True by default), controls
whether Cooks outlier should be refitted (which is advised, in general) and n_cpus
sets the number of CPUs to use for computation. Here, we use 8 threads. Feel free to
adapt this to your setup or to set to None to use all available CPUs.
Note
In the case of the provided synthetic data, there won’t be any Cooks outliers.
Once a DeseqDataSet was initialized,
we may run the deseq2() method
to fit dispersions and LFCs.
dds.deseq2()
if SAVE:
with open(os.path.join(OUTPUT_PATH, "dds.pkl"), "wb") as f:
pkl.dump(dds, f)
Fitting size factors...
Using None as control genes, passed at DeseqDataSet initialization
... done in 0.00 seconds.
Fitting dispersions...
... done in 0.07 seconds.
Fitting dispersion trend curve...
... done in 0.10 seconds.
Fitting MAP dispersions...
... done in 0.07 seconds.
Fitting LFCs...
... done in 0.05 seconds.
Calculating cook's distance...
... done in 0.01 seconds.
Replacing 0 outlier genes.
The DeseqDataSet class extends the
AnnData
class.
print(dds)
AnnData object with n_obs × n_vars = 100 × 10
obs: 'condition', 'group', 'size_factors', 'replaceable'
var: '_normed_means', 'non_zero', '_MoM_dispersions', 'genewise_dispersions', '_genewise_converged', 'fitted_dispersions', 'MAP_dispersions', '_MAP_converged', 'dispersions', '_outlier_genes', '_LFC_converged', 'replaced', 'refitted', '_pvalue_cooks_outlier'
uns: 'trend_coeffs', 'disp_function_type', '_squared_logres', 'prior_disp_var'
obsm: 'design_matrix', '_mu_LFC', '_hat_diagonals'
varm: 'LFC'
layers: 'normed_counts', '_mu_hat', 'cooks'
Hence, parameters are stored according to the AnnData data
structure, with key-based data fields. In particular,
Xstores the count data,obsstores 1D sample-level data, such as design factors and"size_factors",obsmstores multi-dimensional sample-level data, such as"design_matrix",varstores 1D gene-level data, such as gene names and"dispersions",varmstores multi-dimensional gene-level data, such as"LFC".
As an example, here is how we would access dispersions and LFCs (in natural log scale):
print(dds.var["dispersions"])
gene1 0.882598
gene2 0.222578
gene3 0.837238
gene4 0.158970
gene5 0.249926
gene6 0.973647
gene7 0.235155
gene8 0.198781
gene9 0.186520
gene10 0.631900
Name: dispersions, dtype: float64
print(dds.varm["LFC"])
Intercept condition[T.B]
gene1 1.891436 0.438632
gene2 2.851662 0.373296
gene3 1.787780 -0.438645
gene4 4.741958 -0.285647
gene5 3.077798 0.403457
gene6 1.678536 0.001010
gene7 3.291025 0.093116
gene8 3.785129 -0.187604
gene9 3.682882 -0.147443
gene10 2.300515 0.267562
Statistical analysis with the DeseqStats class
Now that dispersions and LFCs were fitted, we may proceed with statistical tests to
compute p-values and adjusted p-values for differential expresion. This is the role of
the DeseqStats class. It has two mandatory arguments:
dds, which should be a fittedDeseqDataSetobject,contrast, which is a list of three strings of the form["variable", "tested_level", "control_level"], or directly a contrast vector.
ds = DeseqStats(dds, contrast=["condition", "B", "A"], inference=inference)
It also has a set of optional keyword arguments (see the API documentation), among which:
alpha: the p-value and adjusted p-value significance threshold (0.05by default),cooks_filter: whether to filter p-values based on cooks outliers (Trueby default),independent_filter: whether to perform independent filtering to correct p-value trends (Trueby default).
In the section on multifactor analysis, we will also see how
to use the contrast argument to specify according to which variable samples should
be compared.
Wald test
PyDESeq2 computes p-values using Wald tests. This can be done using the
summary() method, which runs the whole statistical
analysis, cooks filtering and multiple testing adjustement included.
ds.summary()
if SAVE:
with open(os.path.join(OUTPUT_PATH, "ds.pkl"), "wb") as f:
pkl.dump(ds, f)
Running Wald tests...
... done in 0.04 seconds.
Log2 fold change & Wald test p-value: condition B vs A
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.632812 0.289101 2.188898 0.028604 0.064150
gene2 21.281239 0.538552 0.149963 3.591236 0.000329 0.001646
gene3 5.010123 -0.632830 0.295236 -2.143476 0.032075 0.064150
gene4 100.517961 -0.412102 0.118629 -3.473868 0.000513 0.001710
gene5 27.142450 0.582065 0.154706 3.762409 0.000168 0.001646
gene6 5.413043 0.001457 0.310311 0.004696 0.996253 0.996253
gene7 28.294023 0.134338 0.149945 0.895917 0.370297 0.411441
gene8 40.358344 -0.270656 0.136401 -1.984261 0.047227 0.078711
gene9 37.166183 -0.212715 0.133243 -1.596437 0.110391 0.143147
gene10 11.589325 0.386011 0.244588 1.578207 0.114518 0.143147
The results are then stored in the results_df attribute (ds.results_df).
Optional: threshold-based tests
The user can specify a (log2) log fold change under the null hypothesis and an
alternative hypothesis to re-compute Wald statistics and p-values.
The alternative hypothesis corresponds to what the user wants to find rather than the
null hypothesis. It can take one of the values
["greaterAbs", "lessAbs", "greater", "less"].
ds.summary(lfc_null=0.1, alt_hypothesis="greaterAbs")
ds.plot_MA(s=20)

Running Wald tests...
... done in 0.05 seconds.
Log2 fold change & Wald test p-value: condition B vs A
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.632812 0.289101 1.842998 0.065329 0.142223
gene2 21.281239 0.538552 0.149963 2.924404 0.003451 0.017256
gene3 5.010123 -0.632830 0.295236 -1.804763 0.071112 0.142223
gene4 100.517961 -0.412102 0.118629 -2.630904 0.008516 0.028386
gene5 27.142450 0.582065 0.154706 3.116020 0.001833 0.017256
gene6 5.413043 0.001457 0.310311 0.000000 1.000000 1.000000
gene7 28.294023 0.134338 0.149945 0.229004 0.818866 0.909851
gene8 40.358344 -0.270656 0.136401 -1.251130 0.210887 0.346086
gene9 37.166183 -0.212715 0.133243 -0.845931 0.397591 0.496989
gene10 11.589325 0.386011 0.244588 1.169357 0.242260 0.346086
LFC shrinkage
For visualization or post-processing purposes, it might be suitable to perform
LFC shrinkage. This is implemented by the lfc_shrink()
method, which takes as argument the name of the coefficient to shrink (i.e., the name
of one of the columns of the design matrix dds.obsm["design_matrix"]).
For instance, to shrink the LFCs of the condition B samples, we would run:
ds.lfc_shrink(coeff="condition[T.B]")
if SAVE:
with open(os.path.join(OUTPUT_PATH, "shrunk_results.pkl"), "wb") as f:
pkl.dump(ds, f)
Fitting MAP LFCs...
... done in 5.17 seconds.
Shrunk log2 fold change & Wald test p-value: condition[T.B]
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.408148 0.294271 1.842998 0.065329 0.142223
gene2 21.281239 0.480228 0.151198 2.924404 0.003451 0.017256
gene3 5.010123 -0.396066 0.300796 -1.804763 0.071112 0.142223
gene4 100.517961 -0.374346 0.118704 -2.630904 0.008516 0.028386
gene5 27.142450 0.521534 0.156204 3.116020 0.001833 0.017256
gene6 5.413043 0.000715 0.239203 0.000000 1.000000 1.000000
gene7 28.294023 0.103473 0.141497 0.229004 0.818866 0.909851
gene8 40.358344 -0.226414 0.133479 -1.251130 0.210887 0.346086
gene9 37.166183 -0.175964 0.129142 -0.845931 0.397591 0.496989
gene10 11.589325 0.239935 0.231986 1.169357 0.242260 0.346086
Note
Running lfc_shrink() will overwrite a
DeseqStats’ log fold changes (and standard errors) with shrunk values.
This can be checked using the shrunk_LFCs flag.
print(ds.shrunk_LFCs) # Will be True only if lfc_shrink() was run.
True
Multifactor analysis
So far, we have only used the condition column of metadata, which divides
samples between conditions A and B. Yet, metadata contains second
column, which separates samples according to group X and Y.
print(metadata)
condition group
sample1 A X
sample2 A Y
sample3 A X
sample4 A Y
sample5 A X
... ... ...
sample96 B Y
sample97 B X
sample98 B Y
sample99 B X
sample100 B Y
[100 rows x 2 columns]
The goal of multifactor analysis is to use both variables to fit LFCs.
Read counts modeling
To perform multifactor analysis with PyDESeq2, we start by inializing a
DeseqDataSet as previously, but we provide several variables we would like
to use in the design argument.
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~group + condition",
refit_cooks=True,
inference=inference,
)
As for the single-factor analysis, we fit dispersions and LFCs using the
deseq2() method.
dds.deseq2()
Fitting size factors...
Using None as control genes, passed at DeseqDataSet initialization
... done in 0.00 seconds.
Fitting dispersions...
... done in 0.05 seconds.
Fitting dispersion trend curve...
... done in 0.03 seconds.
Fitting MAP dispersions...
... done in 0.03 seconds.
Fitting LFCs...
... done in 0.02 seconds.
Calculating cook's distance...
... done in 0.00 seconds.
Replacing 0 outlier genes.
Now, if we print log fold changes, we will have two columns in addition to the
intercept: one corresponding to the group variable, and the other to condition.
print(dds.varm["LFC"])
Intercept group[T.Y] condition[T.B]
gene1 1.560313 0.525572 0.507013
gene2 2.812255 0.079229 0.371026
gene3 2.059577 -0.602606 -0.467002
gene4 4.919837 -0.385197 -0.293527
gene5 2.973802 0.194078 0.407582
gene6 1.846061 -0.352410 -0.013510
gene7 3.235967 0.106773 0.093654
gene8 3.640464 0.270819 -0.188248
gene9 3.645334 0.074474 -0.148346
gene10 2.136670 0.303009 0.270003
Statistical analysis
P-values are computed as earlier from a DeseqStats object with the
summary() method. The contrast argument will allow us
to determines which variable we want to obtain LFCs and pvalues for.
It is a list of three strings of the form
["variable", "tested level", "reference level"]
As an example, to compare the condition B to the condition A, we set
contrast=["condition", "B", "A"].
ds_B_vs_A = DeseqStats(dds, contrast=["condition", "B", "A"], inference=inference)
Let us fit p-values:
ds_B_vs_A.summary()
Running Wald tests...
... done in 0.02 seconds.
Log2 fold change & Wald test p-value: condition B vs A
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.731466 0.286275 2.555117 0.010615 0.026538
gene2 21.281239 0.535278 0.149824 3.572711 0.000353 0.001178
gene3 5.010123 -0.673742 0.287399 -2.344269 0.019064 0.038129
gene4 100.517961 -0.423471 0.106221 -3.986711 0.000067 0.000592
gene5 27.142450 0.588016 0.152758 3.849332 0.000118 0.000592
gene6 5.413043 -0.019490 0.307845 -0.063313 0.949518 0.949518
gene7 28.294023 0.135114 0.149561 0.903410 0.366308 0.407009
gene8 40.358344 -0.271584 0.131512 -2.065084 0.038915 0.064858
gene9 37.166183 -0.214018 0.132989 -1.609288 0.107553 0.139684
gene10 11.589325 0.389532 0.244929 1.590388 0.111747 0.139684
As we can see, although we are comparing the same cohorts (condition B vs A), the
results differ from the single-factor analysis. This is because the
model uses information from both the condition and group variables.
Let us now evaluate differential expression according to group Y vs X. To do so,
we create a new DeseqStats from the same
DeseqDataSet
with contrast=["group", "Y", "X"], and run the analysis again.
ds_Y_vs_X = DeseqStats(dds, contrast=["group", "Y", "X"], inference=inference)
ds_Y_vs_X.summary()
Running Wald tests...
... done in 0.01 seconds.
Log2 fold change & Wald test p-value: group Y vs X
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.758241 0.286328 2.648155 8.093232e-03 0.020233
gene2 21.281239 0.114304 0.149771 0.763189 4.453508e-01 0.445351
gene3 5.010123 -0.869376 0.287516 -3.023748 2.496642e-03 0.009903
gene4 100.517961 -0.555721 0.106222 -5.231721 1.679390e-07 0.000002
gene5 27.142450 0.279995 0.152731 1.833256 6.676450e-02 0.123874
gene6 5.413043 -0.508421 0.307951 -1.650981 9.874255e-02 0.141061
gene7 28.294023 0.154041 0.149565 1.029927 3.030443e-01 0.378805
gene8 40.358344 0.390709 0.131520 2.970732 2.970913e-03 0.009903
gene9 37.166183 0.107443 0.132982 0.807956 4.191157e-01 0.445351
gene10 11.589325 0.437149 0.244955 1.784611 7.432438e-02 0.123874
LFC shrinkage (multifactor)
In a multifactor setting, LFC shrinkage works as in the single-factor case, but will
only shrink the LFCs of a DeseqStats object based on its
contrast argument.
ds_B_vs_A.lfc_shrink(coeff="condition[T.B]")
Fitting MAP LFCs...
... done in 0.02 seconds.
Shrunk log2 fold change & Wald test p-value: condition[T.B]
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.526859 0.299299 2.555117 0.010615 0.026538
gene2 21.281239 0.479377 0.150834 3.572711 0.000353 0.001178
gene3 5.010123 -0.463489 0.296137 -2.344269 0.019064 0.038129
gene4 100.517961 -0.394805 0.106322 -3.986711 0.000067 0.000592
gene5 27.142450 0.531655 0.154061 3.849332 0.000118 0.000592
gene6 5.413043 -0.008636 0.241570 -0.063313 0.949518 0.949518
gene7 28.294023 0.106695 0.141879 0.903410 0.366308 0.407009
gene8 40.358344 -0.233149 0.129067 -2.065084 0.038915 0.064858
gene9 37.166183 -0.180068 0.129246 -1.609288 0.107553 0.139684
gene10 11.589325 0.251459 0.233141 1.590388 0.111747 0.139684
Total running time of the script: (0 minutes 7.043 seconds)