Note

Click here to download the full example code or to run this example in your browser via Binder

A simple PyDESeq2 workflow

In this example, we show how to perform a simple differential expression analysis on bulk RNAseq data, using PyDESeq2.

We start by importing required packages and setting up an optional path to save results.

import os
import pickle as pkl

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data

SAVE = False  # whether to save the outputs of this notebook

if SAVE:
    # Replace this with the path to directory where you would like results to be saved
    OUTPUT_PATH = "../output_files/synthetic_example"
    os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist

Data loading 

To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:

A count matrix of shape ‘number of samples’ x ‘number of genes’, containing read counts (non-negative integers),

Metadata (or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data in cohorts.

Both should be provided as pandas dataframes.

To illustrate the required data format, we load a synthetic example dataset that may be obtained through PyDESeq2’s API using utils.load_example_data(). You may replace it with your own dataset.

counts_df = load_example_data(
    modality="raw_counts",
    dataset="synthetic",
    debug=False,
)

metadata = load_example_data(
    modality="metadata",
    dataset="synthetic",
    debug=False,
)

print(counts_df)

           gene1  gene2  gene3  gene4  ...  gene7  gene8  gene9  gene10
sample1       12     21      4    130  ...     16     54     49       3
sample2        1     44      2     63  ...     70     32     57       9
sample3        4      4     11    180  ...     28     34     65       2
sample4        1     10      2    100  ...     28     16     33       9
sample5        1     11      6    135  ...     32     29     31       5
...          ...    ...    ...    ...  ...    ...    ...    ...     ...
sample96       7     26      3     67  ...     41     44     54       1
sample97       1     14      3     71  ...     19     42     25       4
sample98      10     36      2     72  ...     66     27     16       9
sample99      18     14      3     66  ...     32     19     79      11
sample100     21      9      3     42  ...     19     78     30       5

[100 rows x 10 columns]

print(metadata)

          condition group
sample1           A     X
sample2           A     Y
sample3           A     X
sample4           A     Y
sample5           A     X
...             ...   ...
sample96          B     Y
sample97          B     X
sample98          B     Y
sample99          B     X
sample100         B     Y

[100 rows x 2 columns]

In this example, the metadata data contains two columns, condition and group, representing two types of bi-level annotations. In the first part, we will only use the condition factor. Later on, we’ll see how to use both the condition and the group factors in our analysis (see Multifactor analysis).

Data filtering 

Before proceeding with DEA, it is good practice to preprocess your data, e.g. to remove samples for which annotations are missing and exclude genes with very low levels of expression. This is not necessary in the case of our synthetic data, but don’t forget this step if you are using real data. To this end you can use the code below.

We start by removing samples for which condition is NaN. If you are using another dataset, do not forget to change “condition” for the column of metadata you wish to use as a design factor in your analysis.

samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]

Note

In the case where the design factor contains NaN entries, PyDESeq2 will throw an error when intializing a DeseqDataSet.

Next, we filter out genes that have less than 10 read counts in total. Note again that there are no such genes in this synthetic dataset.

genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]

Now that we have loaded and filtered our data, we may proceed with the differential analysis.

Single factor analysis 

In this first analysis, we ignore the group variable and use the condition column as our design factor. That is, we compare gene expressions of samples that have condition B to those that have condition A.

Read counts modeling with the `DeseqDataSet` class

We start by creating a DeseqDataSet object from the count and metadata data. A DeseqDataSet fits dispersion and log-fold change (LFC) parameters from the data, and stores them.

inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design_factors="condition",
    refit_cooks=True,
    inference=inference,
    # n_cpus=8, # n_cpus can be specified here or in the inference object
)

A DeseqDataSet has two mandatory arguments: a counts and a metadata dataframe, like the ones we’ve loaded in the first part of this tutorial.

Next, we should specify the design_factor, i.e. the column of the metadata dataframe that will be used to compare samples. This can be a single string as above, or a list of strings, as in the section on multifactor analysis.

Note

The "condition" argument passed to design_factors corresponds to a column from the metadata dataframe we loaded earlier. You might need to change it according to your own dataset.

Several other arguments may be optionally specified (see the API documentation). Among those, the refit_cooks argument (set to True by default), controls whether Cooks outlier should be refitted (which is advised, in general) and n_cpus sets the number of CPUs to use for computation. Here, we use 8 threads. Feel free to adapt this to your setup or to set to None to use all available CPUs.

Note

In the case of the provided synthetic data, there won’t be any Cooks outliers.

Once a DeseqDataSet was initialized, we may run the deseq2() method to fit dispersions and LFCs.

dds.deseq2()


if SAVE:
    with open(os.path.join(OUTPUT_PATH, "dds.pkl"), "wb") as f:
        pkl.dump(dds, f)

Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 0.02 seconds.

Fitting dispersion trend curve...
... done in 0.01 seconds.

Fitting MAP dispersions...
... done in 0.02 seconds.

Fitting LFCs...
... done in 0.02 seconds.

Replacing 0 outlier genes.

The DeseqDataSet class extends the AnnData class.

print(dds)

AnnData object with n_obs × n_vars = 100 × 10
    obs: 'condition', 'group'
    uns: 'trend_coeffs', 'disp_function_type', '_squared_logres', 'prior_disp_var'
    obsm: 'design_matrix', 'size_factors', 'replaceable'
    varm: 'non_zero', '_MoM_dispersions', 'genewise_dispersions', '_genewise_converged', '_normed_means', 'fitted_dispersions', 'MAP_dispersions', '_MAP_converged', 'dispersions', '_outlier_genes', 'LFC', '_LFC_converged', 'replaced', 'refitted'
    layers: 'normed_counts', '_mu_hat', '_mu_LFC', '_hat_diagonals', 'cooks'

Hence, parameters are stored according to the AnnData data structure, with key-based data fields. In particular,

X stores the count data,
obs stores design factors,
obsm stores sample-level data, such as "design_matrix" and "size_factors",
varm stores gene-level data, such as "dispersions" and "LFC".

As an example, here is how we would access dispersions and LFCs (in natural log scale):

print(dds.varm["dispersions"])

[0.88259824 0.22257849 0.83723751 0.15897038 0.24992574 0.97364737
 0.23515474 0.19878066 0.18652019 0.63189957]

print(dds.varm["LFC"])

        intercept  condition_B_vs_A
gene1    1.891436          0.438632
gene2    2.851662          0.373296
gene3    1.787780         -0.438645
gene4    4.741958         -0.285647
gene5    3.077798          0.403457
gene6    1.678536          0.001010
gene7    3.291025          0.093116
gene8    3.785129         -0.187604
gene9    3.682882         -0.147443
gene10   2.300515          0.267562

Statistical analysis with the `DeseqStats` class

Now that dispersions and LFCs were fitted, we may proceed with statistical tests to compute p-values and adjusted p-values for differential expresion. This is the role of the DeseqStats class. It has a unique mandatory argument, dds, which should be a fitted DeseqDataSet object.

stat_res = DeseqStats(dds, inference=inference)

It also has a set of optional keyword arguments (see the API documentation), among which:

alpha: the p-value and adjusted p-value significance threshold (0.05 by default),
cooks_filter: whether to filter p-values based on cooks outliers (True by default),
independent_filter: whether to perform independent filtering to correct p-value trends (True by default).

In the section on multifactor analysis, we will also see how to use the contrast argument to specify according to which variable samples should be compared.

Wald test 

PyDESeq2 computes p-values using Wald tests. This can be done using the summary() method, which runs the whole statistical analysis, cooks filtering and multiple testing adjustement included.

stat_res.summary()

if SAVE:
    with open(os.path.join(OUTPUT_PATH, "stat_results.pkl"), "wb") as f:
        pkl.dump(stat_res, f)

Running Wald tests...
... done in 0.01 seconds.

Log2 fold change & Wald test p-value: condition B vs A
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.632812  0.289101  2.188898  0.028604  0.064150
gene2    21.281239        0.538552  0.149963  3.591236  0.000329  0.001646
gene3     5.010123       -0.632830  0.295236 -2.143476  0.032075  0.064150
gene4   100.517961       -0.412102  0.118629 -3.473868  0.000513  0.001710
gene5    27.142450        0.582065  0.154706  3.762409  0.000168  0.001646
gene6     5.413043        0.001457  0.310311  0.004696  0.996253  0.996253
gene7    28.294023        0.134338  0.149945  0.895917  0.370297  0.411441
gene8    40.358344       -0.270656  0.136401 -1.984261  0.047227  0.078711
gene9    37.166183       -0.212715  0.133243 -1.596437  0.110391  0.143147
gene10   11.589325        0.386011  0.244588  1.578207  0.114518  0.143147

The results are then stored in the results_df attribute (stat_res.results_df).

Optional: threshold-based tests 

The user can specify a (log2) log fold change under the null hypothesis and an alternative hypothesis to re-compute Wald statistics and p-values. The alternative hypothesis corresponds to what the user wants to find rather than the null hypothesis. It can take one of the values ["greaterAbs", "lessAbs", "greater", "less"].

stat_res.summary(lfc_null=0.1, alt_hypothesis="greaterAbs")
stat_res.plot_MA(s=20)

Running Wald tests...
... done in 0.02 seconds.

Log2 fold change & Wald test p-value: condition B vs A
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.632812  0.289101  1.842998  0.065329  0.142223
gene2    21.281239        0.538552  0.149963  2.924404  0.003451  0.017256
gene3     5.010123       -0.632830  0.295236 -1.804763  0.071112  0.142223
gene4   100.517961       -0.412102  0.118629 -2.630904  0.008516  0.028386
gene5    27.142450        0.582065  0.154706  3.116020  0.001833  0.017256
gene6     5.413043        0.001457  0.310311  0.000000  1.000000  1.000000
gene7    28.294023        0.134338  0.149945  0.229004  0.818866  0.909851
gene8    40.358344       -0.270656  0.136401 -1.251130  0.210887  0.346086
gene9    37.166183       -0.212715  0.133243 -0.845931  0.397591  0.496989
gene10   11.589325        0.386011  0.244588  1.169357  0.242260  0.346086

LFC shrinkage 

For visualization or post-processing purposes, it might be suitable to perform LFC shrinkage. This is implemented by the lfc_shrink() method.

stat_res.lfc_shrink(coeff="condition_B_vs_A")

if SAVE:
    with open(os.path.join(OUTPUT_PATH, "shrunk_stat_results.pkl"), "wb") as f:
        pkl.dump(stat_res, f)

Fitting MAP LFCs...
... done in 0.02 seconds.

Shrunk log2 fold change & Wald test p-value: condition B vs A
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.408253  0.294276  1.842998  0.065329  0.142223
gene2    21.281239        0.480145  0.151201  2.924404  0.003451  0.017256
gene3     5.010123       -0.396066  0.300796 -1.804763  0.071112  0.142223
gene4   100.517961       -0.374191  0.118703 -2.630904  0.008516  0.028386
gene5    27.142450        0.521487  0.156210  3.116020  0.001833  0.017256
gene6     5.413043        0.000716  0.239203  0.000000  1.000000  1.000000
gene7    28.294023        0.103421  0.141496  0.229004  0.818866  0.909851
gene8    40.358344       -0.226288  0.133477 -1.251130  0.210887  0.346086
gene9    37.166183       -0.175746  0.129138 -0.845931  0.397591  0.496989
gene10   11.589325        0.239935  0.231986  1.169357  0.242260  0.346086

Note

Running lfc_shrink() will overwrite a DeseqStats’ log fold changes (and standard errors) with shrunk values. This can be checked using the shrunk_LFCs flag.

print(stat_res.shrunk_LFCs)  # Will be True only if lfc_shrink() was run.

True

Multifactor analysis 

So far, we have only used the condition column of metadata, which divides samples between conditions A and B. Yet, metadata contains second column, which separates samples according to group X and Y.

print(metadata)

          condition group
sample1           A     X
sample2           A     Y
sample3           A     X
sample4           A     Y
sample5           A     X
...             ...   ...
sample96          B     Y
sample97          B     X
sample98          B     Y
sample99          B     X
sample100         B     Y

[100 rows x 2 columns]

The goal of multifactor analysis is to use both variables to fit LFCs.

Read counts modeling 

To perform multifactor analysis with PyDESeq2, we start by inializing a DeseqDataSet as previously, but we provide the list of variables we would like to use in the design_factors argument.

dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design_factors=["group", "condition"],
    refit_cooks=True,
    inference=inference,
)

Note

By default, the last variable in the list (here, "condition") will be the one for which LFCs and p-values will be displayed, but this may be changed later on when performing the statistical analysis.

As for the single-factor analysis, we fit dispersions and LFCs using the deseq2() method.

dds.deseq2()

Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 0.02 seconds.

Fitting dispersion trend curve...
... done in 0.01 seconds.

Fitting MAP dispersions...
... done in 0.02 seconds.

Fitting LFCs...
... done in 0.02 seconds.

Replacing 0 outlier genes.

Now, if we print log fold changes, we will have two columns in addition to the intercept: one corresponding to the group variable, and the other to condition.

print(dds.varm["LFC"])

        intercept  group_Y_vs_X  condition_B_vs_A
gene1    1.560313      0.525572          0.507013
gene2    2.812255      0.079229          0.371026
gene3    2.059577     -0.602606         -0.467002
gene4    4.919837     -0.385197         -0.293527
gene5    2.973802      0.194078          0.407582
gene6    1.846061     -0.352410         -0.013510
gene7    3.235967      0.106773          0.093654
gene8    3.640464      0.270819         -0.188248
gene9    3.645334      0.074474         -0.148346
gene10   2.136670      0.303009          0.270003

Statistical analysis 

P-values are computed as earlier from a DeseqStats object with the summary() method, with a new important argument: the contrast. It is a list of three strings of the form ["variable", "tested level", "reference level"] which determines which variable we want to compute LFCs and pvalues for. As an example, to compare the condition B to the condition A, we set contrast=["condition", "B", "A"].

stat_res_B_vs_A = DeseqStats(dds, contrast=["condition", "B", "A"], inference=inference)

Note

If left blank, the variable of interest will be the last one provided in the design_factors attribute of the corresponding DeseqDataSet object, and the reference level will be picked alphabetically. In any case, both variables are still used. This is due to the fact that dds was fit with both as design factors.

Let us fit p-values:

stat_res_B_vs_A.summary()

Running Wald tests...
... done in 0.02 seconds.

Log2 fold change & Wald test p-value: condition B vs A
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.731466  0.286275  2.555117  0.010615  0.026538
gene2    21.281239        0.535278  0.149824  3.572711  0.000353  0.001178
gene3     5.010123       -0.673742  0.287399 -2.344269  0.019064  0.038129
gene4   100.517961       -0.423471  0.106221 -3.986711  0.000067  0.000592
gene5    27.142450        0.588016  0.152758  3.849332  0.000118  0.000592
gene6     5.413043       -0.019490  0.307845 -0.063313  0.949518  0.949518
gene7    28.294023        0.135114  0.149561  0.903410  0.366308  0.407009
gene8    40.358344       -0.271584  0.131512 -2.065084  0.038915  0.064858
gene9    37.166183       -0.214018  0.132989 -1.609288  0.107553  0.139684
gene10   11.589325        0.389532  0.244929  1.590388  0.111747  0.139684

As we can see, although we are comparing the same cohorts (condition B vs A), the results differ from the single-factor analysis. This is because the model uses information from both the condition and group variables.

Let us now evaluate differential expression according to group Y vs X. To do so, we create a new DeseqStats from the same DeseqDataSet with contrast=["group", "Y", "X"], and run the analysis again.

stat_res_Y_vs_X = DeseqStats(dds, contrast=["group", "Y", "X"], inference=inference)
stat_res_Y_vs_X.summary()

Running Wald tests...
... done in 0.01 seconds.

Log2 fold change & Wald test p-value: group Y vs X
          baseMean  log2FoldChange     lfcSE      stat        pvalue      padj
gene1     8.541317        0.758241  0.286328  2.648155  8.093232e-03  0.020233
gene2    21.281239        0.114304  0.149771  0.763189  4.453508e-01  0.445351
gene3     5.010123       -0.869376  0.287516 -3.023748  2.496642e-03  0.009903
gene4   100.517961       -0.555721  0.106222 -5.231721  1.679390e-07  0.000002
gene5    27.142450        0.279995  0.152731  1.833256  6.676450e-02  0.123874
gene6     5.413043       -0.508421  0.307951 -1.650981  9.874255e-02  0.141061
gene7    28.294023        0.154041  0.149565  1.029927  3.030443e-01  0.378805
gene8    40.358344        0.390709  0.131520  2.970732  2.970913e-03  0.009903
gene9    37.166183        0.107443  0.132982  0.807956  4.191157e-01  0.445351
gene10   11.589325        0.437149  0.244955  1.784611  7.432438e-02  0.123874

LFC shrinkage (multifactor)

In a multifactor setting, LFC shrinkage works as in the single-factor case, but will only shrink the LFCs of a DeseqStats object based on its contrast argument.

stat_res_B_vs_A.lfc_shrink(coeff="condition_B_vs_A")

Fitting MAP LFCs...
... done in 0.02 seconds.

Shrunk log2 fold change & Wald test p-value: condition B vs A
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.526936  0.299297  2.555117  0.010615  0.026538
gene2    21.281239        0.479377  0.150834  3.572711  0.000353  0.001178
gene3     5.010123       -0.463492  0.296136 -2.344269  0.019064  0.038129
gene4   100.517961       -0.394989  0.106316 -3.986711  0.000067  0.000592
gene5    27.142450        0.531655  0.154061  3.849332  0.000118  0.000592
gene6     5.413043       -0.008636  0.241570 -0.063313  0.949518  0.949518
gene7    28.294023        0.106695  0.141879  0.903410  0.366308  0.407009
gene8    40.358344       -0.233346  0.129070 -2.065084  0.038915  0.064858
gene9    37.166183       -0.180064  0.129246 -1.609288  0.107553  0.139684
gene10   11.589325        0.251471  0.233140  1.590388  0.111747  0.139684

Total running time of the script: ( 0 minutes 1.699 seconds)

Gallery generated by Sphinx-Gallery

A simple PyDESeq2 workflow

Read counts modeling with the DeseqDataSet class

Statistical analysis with the DeseqStats class

Read counts modeling with the `DeseqDataSet` class

Statistical analysis with the `DeseqStats` class