Loading data and saving results with pandas and pickle

In this example, we show how to load data in order to perform a differential expression analysis (DEA) with PyDESeq2, and how to save its results, using pandas and pickle.

We refer to the getting started example for more details on the analysis itself.

We start by importing required packages and setting up a path to save results.

import os
import pickle as pkl

import pandas as pd

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats

# Replace this with the path to directory where you would like results to be saved
OUTPUT_PATH = "../output_files/synthetic_example/"
os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist

Data loading with pandas

To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:

  • A count matrix of shape ‘number of samples’ x ‘number of genes’, containing read counts (non-negative integers),

  • Metadata (or annotations, or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data in cohorts.

Both should be provided as pandas dataframes.
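As a minimal illustration of the expected shapes (with made-up sample and gene names), the two inputs could look like:

```python
import pandas as pd

# Count matrix: 'number of samples' x 'number of genes', non-negative integers
counts = pd.DataFrame(
    {"gene1": [12, 1], "gene2": [21, 44]},
    index=["sample1", "sample2"],
)

# Metadata: 'number of samples' x 'number of variables', same sample index
metadata = pd.DataFrame(
    {"condition": ["A", "B"]},
    index=["sample1", "sample2"],
)

# The sample indices of the two dataframes should match
assert (counts.index == metadata.index).all()
```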

Here, we show how to load CSV files corresponding to counts and annotations as pandas dataframes, using pandas.read_csv().

We assume that DATA_PATH is a directory containing a test_counts.csv and a test_metadata.csv file.

# Replace this with the path to your dataset
DATA_PATH = "https://raw.githubusercontent.com/owkin/PyDESeq2/main/datasets/synthetic/"
counts_df = pd.read_csv(os.path.join(DATA_PATH, "test_counts.csv"), index_col=0)
print(counts_df)
        sample1  sample2  sample3  ...  sample98  sample99  sample100
gene1        12        1        4  ...        10        18         21
gene2        21       44        4  ...        36        14          9
gene3         4        2       11  ...         2         3          3
gene4       130       63      180  ...        72        66         42
gene5        18       11       21  ...        11        53         13
gene6         0       10        3  ...         2        11         13
gene7        16       70       28  ...        66        32         19
gene8        54       32       34  ...        27        19         78
gene9        49       57       65  ...        16        79         30
gene10        3        9        2  ...         9        11          5

[10 rows x 100 columns]

Note that the counts data is in a ‘number of genes’ x ‘number of samples’ format, whereas ‘number of samples’ x ‘number of genes’ is required. To fix this issue, we transpose the counts dataframe.

counts_df = counts_df.T
print(counts_df)
           gene1  gene2  gene3  gene4  ...  gene7  gene8  gene9  gene10
sample1       12     21      4    130  ...     16     54     49       3
sample2        1     44      2     63  ...     70     32     57       9
sample3        4      4     11    180  ...     28     34     65       2
sample4        1     10      2    100  ...     28     16     33       9
sample5        1     11      6    135  ...     32     29     31       5
...          ...    ...    ...    ...  ...    ...    ...    ...     ...
sample96       7     26      3     67  ...     41     44     54       1
sample97       1     14      3     71  ...     19     42     25       4
sample98      10     36      2     72  ...     66     27     16       9
sample99      18     14      3     66  ...     32     19     79      11
sample100     21      9      3     42  ...     19     78     30       5

[100 rows x 10 columns]
metadata = pd.read_csv(os.path.join(DATA_PATH, "test_metadata.csv"), index_col=0)
print(metadata)
          condition group
sample1           A     X
sample2           A     Y
sample3           A     X
sample4           A     Y
sample5           A     X
...             ...   ...
sample96          B     Y
sample97          B     X
sample98          B     Y
sample99          B     X
sample100         B     Y

[100 rows x 2 columns]

In this example, the metadata contains two columns, condition and group, representing two types of bi-level annotations. Here, we will only use the condition factor.

Data filtering

Before proceeding with DEA, we start by preprocessing the data, as in the getting started example.

Next, we filter out genes that have fewer than 10 read counts in total. Note that there are no such genes in this synthetic dataset.
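Since the preprocessing code itself is not reproduced here, the two steps can be sketched as follows, using small stand-in dataframes (in practice, the counts_df and metadata loaded above would be used):

```python
import pandas as pd

# Stand-ins for the counts and metadata loaded above
counts_df = pd.DataFrame(
    {"gene1": [12, 1, 4], "gene2": [21, 44, 4], "gene3": [0, 1, 2]},
    index=["sample1", "sample2", "sample3"],
)
metadata = pd.DataFrame(
    {"condition": ["A", None, "B"]},
    index=["sample1", "sample2", "sample3"],
)

# Remove samples whose condition annotation is missing
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]

# Filter out genes with fewer than 10 read counts in total
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]

print(list(counts_df.columns))  # gene3 (2 counts in total) is dropped
```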

Now that we have loaded and filtered our data, we may proceed with the differential analysis.

Single factor analysis

As in the getting started example, we ignore the group variable and use the condition column as our design factor.

Read counts modeling with the DeseqDataSet class

We start by creating a DeseqDataSet object from the count and metadata data that were just loaded.

inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design_factors="condition",
    refit_cooks=True,
    inference=inference,
)

Once the DeseqDataSet has been initialized, we may run the deseq2() method to fit dispersions and LFCs.

dds.deseq2()
Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 0.02 seconds.

Fitting dispersion trend curve...
... done in 0.01 seconds.

Fitting MAP dispersions...
... done in 0.02 seconds.

Fitting LFCs...
... done in 0.02 seconds.

Replacing 0 outlier genes.

The DeseqDataSet class extends the AnnData class.

print(dds)
AnnData object with n_obs × n_vars = 100 × 10
    obs: 'condition', 'group'
    uns: 'trend_coeffs', 'disp_function_type', '_squared_logres', 'prior_disp_var'
    obsm: 'design_matrix', 'size_factors', 'replaceable'
    varm: 'non_zero', '_MoM_dispersions', 'genewise_dispersions', '_genewise_converged', '_normed_means', 'fitted_dispersions', 'MAP_dispersions', '_MAP_converged', 'dispersions', '_outlier_genes', 'LFC', '_LFC_converged', 'replaced', 'refitted'
    layers: 'normed_counts', '_mu_hat', '_mu_LFC', '_hat_diagonals', 'cooks'

As such, it can be saved using pickle.dump.

with open(os.path.join(OUTPUT_PATH, "dds.pkl"), "wb") as f:
    pkl.dump(dds, f)

It may be loaded again using pickle.load.

with open(os.path.join(OUTPUT_PATH, "dds.pkl"), "rb") as f:
    dds2 = pkl.load(f)

print(dds2)
AnnData object with n_obs × n_vars = 100 × 10
    obs: 'condition', 'group'
    uns: 'trend_coeffs', 'disp_function_type', '_squared_logres', 'prior_disp_var'
    obsm: 'design_matrix', 'size_factors', 'replaceable'
    varm: 'non_zero', '_MoM_dispersions', 'genewise_dispersions', '_genewise_converged', '_normed_means', 'fitted_dispersions', 'MAP_dispersions', '_MAP_converged', 'dispersions', '_outlier_genes', 'LFC', '_LFC_converged', 'replaced', 'refitted'
    layers: 'normed_counts', '_mu_hat', '_mu_LFC', '_hat_diagonals', 'cooks'

Parameters are stored according to the AnnData data structure, with key-based data fields. In particular,

  • X stores the count data,

  • obs stores design factors,

  • obsm stores sample-level data, such as "design_matrix" and "size_factors",

  • varm stores gene-level data, such as "dispersions" and "LFC".

As an example, here is how we would access dispersions and LFCs (in natural log scale):

print(dds.varm["dispersions"])
[0.88259824 0.22257849 0.83723751 0.15897038 0.24992574 0.97364737
 0.23515474 0.19878066 0.18652019 0.63189957]
print(dds.varm["LFC"])
        intercept  condition_B_vs_A
gene1    1.891436          0.438632
gene2    2.851662          0.373296
gene3    1.787780         -0.438645
gene4    4.741958         -0.285647
gene5    3.077798          0.403457
gene6    1.678536          0.001010
gene7    3.291025          0.093116
gene8    3.785129         -0.187604
gene9    3.682882         -0.147443
gene10   2.300515          0.267562

Statistical analysis with the DeseqStats class

Now that dispersions and LFCs have been fitted, we may proceed with statistical tests to compute p-values and adjusted p-values for differential expression. This is the role of the DeseqStats class.

stat_res = DeseqStats(dds, inference=inference)

PyDESeq2 computes p-values using Wald tests. This can be done using the summary() method, which runs the whole statistical analysis, Cooks filtering and multiple testing adjustment included.

stat_res.summary()
Running Wald tests...
... done in 0.01 seconds.

Log2 fold change & Wald test p-value: condition B vs A
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.632812  0.289101  2.188898  0.028604  0.064150
gene2    21.281239        0.538552  0.149963  3.591236  0.000329  0.001646
gene3     5.010123       -0.632830  0.295236 -2.143476  0.032075  0.064150
gene4   100.517961       -0.412102  0.118629 -3.473868  0.000513  0.001710
gene5    27.142450        0.582065  0.154706  3.762409  0.000168  0.001646
gene6     5.413043        0.001457  0.310311  0.004696  0.996253  0.996253
gene7    28.294023        0.134338  0.149945  0.895917  0.370297  0.411441
gene8    40.358344       -0.270656  0.136401 -1.984261  0.047227  0.078711
gene9    37.166183       -0.212715  0.133243 -1.596437  0.110391  0.143147
gene10   11.589325        0.386011  0.244588  1.578207  0.114518  0.143147

The results are then stored in the results_df attribute (stat_res.results_df). As with the DeseqDataSet, the whole DeseqStats object may be saved using pickle. However, it is often more convenient to have the results as a CSV, so we may export stat_res.results_df using pandas.DataFrame.to_csv().
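As a sketch of that export, here is the to_csv() round trip on a stand-in dataframe written to a temporary directory (in practice, stat_res.results_df and OUTPUT_PATH would be used):

```python
import os
import tempfile

import pandas as pd

# Stand-in for stat_res.results_df, with a similar layout
results_df = pd.DataFrame(
    {"baseMean": [8.541317, 21.281239], "padj": [0.064150, 0.001646]},
    index=["gene1", "gene2"],
)

# Write the results table as CSV; OUTPUT_PATH would replace the temp dir
out_dir = tempfile.mkdtemp()
results_df.to_csv(os.path.join(out_dir, "results.csv"))

# The table can be read back with the gene names as index
reloaded = pd.read_csv(os.path.join(out_dir, "results.csv"), index_col=0)
```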

Total running time of the script: ( 0 minutes 3.582 seconds)
