Note
Go to the end to download the full example code.
Loading data and saving results with pandas and pickle
In this example, we show how load data in order to perform a DEA analysis with PyDESeq2, and how to save its results, using pandas and pickle.
We refer to the getting started example for more details on the analysis itself.
We start by importing required packages and setting up a path to save results.
import os
import pickle as pkl
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
# Replace this with the path to directory where you would like results to be saved
OUTPUT_PATH = "../output_files/synthetic_example/"
os.makedirs(OUTPUT_PATH, exist_ok=True) # Create path if it doesn't exist
Data loading with pandas
To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:
A count matrix of shape ‘number of samples’ x ‘number of genes’, containing read counts (non-negative integers),
Metadata (or annotations, or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data in cohorts.
Both should be provided as pandas dataframes.
Here, we show how to load CSV corresponding to counts and annotations as pandas dataframes, using pandas.read_csv().
We assume that DATA_PATH is a directory containing a test_counts.csv and a
test_metadata.csv file.
# Replace this with the path to your dataset
DATA_PATH = "https://raw.githubusercontent.com/owkin/PyDESeq2/main/datasets/synthetic/"
counts_df = pd.read_csv(os.path.join(DATA_PATH, "test_counts.csv"), index_col=0)
print(counts_df)
sample1 sample2 sample3 ... sample98 sample99 sample100
gene1 12 1 4 ... 10 18 21
gene2 21 44 4 ... 36 14 9
gene3 4 2 11 ... 2 3 3
gene4 130 63 180 ... 72 66 42
gene5 18 11 21 ... 11 53 13
gene6 0 10 3 ... 2 11 13
gene7 16 70 28 ... 66 32 19
gene8 54 32 34 ... 27 19 78
gene9 49 57 65 ... 16 79 30
gene10 3 9 2 ... 9 11 5
[10 rows x 100 columns]
Note that the counts data is in a ‘number of genes’ x ‘number of samples’ format, whereas ‘number of samples’ x ‘number of genes’ is required. To fix this issue, we transpose the counts dataframe.
counts_df = counts_df.T
print(counts_df)
gene1 gene2 gene3 gene4 ... gene7 gene8 gene9 gene10
sample1 12 21 4 130 ... 16 54 49 3
sample2 1 44 2 63 ... 70 32 57 9
sample3 4 4 11 180 ... 28 34 65 2
sample4 1 10 2 100 ... 28 16 33 9
sample5 1 11 6 135 ... 32 29 31 5
... ... ... ... ... ... ... ... ... ...
sample96 7 26 3 67 ... 41 44 54 1
sample97 1 14 3 71 ... 19 42 25 4
sample98 10 36 2 72 ... 66 27 16 9
sample99 18 14 3 66 ... 32 19 79 11
sample100 21 9 3 42 ... 19 78 30 5
[100 rows x 10 columns]
metadata = pd.read_csv(os.path.join(DATA_PATH, "test_metadata.csv"), index_col=0)
print(metadata)
condition group
sample1 A X
sample2 A Y
sample3 A X
sample4 A Y
sample5 A X
... ... ...
sample96 B Y
sample97 B X
sample98 B Y
sample99 B X
sample100 B Y
[100 rows x 2 columns]
In this example, the metadata data contains two columns, condition and group,
representing two types of bi-level annotations. Here, we will only use the
condition factor.
Data filtering
Before proceeding with DEA, we start by preprocessing the data, as in the getting started example.
Next, we filter out genes that have less than 10 read counts in total. Note again that there are no such genes in this synthetic dataset.
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
Now that we have loaded and filtered our data, we may proceed with the differential analysis.
Single factor analysis
As in the getting started example, we ignore
the group variable and use the condition column as our design factor.
Read counts modeling with the DeseqDataSet class
We start by creating a DeseqDataSet
object from the count and metadata data that were just loaded.
inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True,
inference=inference,
)
Once a DeseqDataSet was initialized,
we may run the deseq2() method
to fit dispersions and LFCs.
dds.deseq2()
Fitting size factors...
Using None as control genes, passed at DeseqDataSet initialization
... done in 0.00 seconds.
Fitting dispersions...
... done in 0.04 seconds.
Fitting dispersion trend curve...
... done in 0.02 seconds.
Fitting MAP dispersions...
... done in 0.02 seconds.
Fitting LFCs...
... done in 0.02 seconds.
Calculating cook's distance...
... done in 0.00 seconds.
Replacing 0 outlier genes.
The DeseqDataSet class extends the
AnnData
class.
print(dds)
AnnData object with n_obs × n_vars = 100 × 10
obs: 'condition', 'group', 'size_factors', 'replaceable'
var: '_normed_means', 'non_zero', '_MoM_dispersions', 'genewise_dispersions', '_genewise_converged', 'fitted_dispersions', 'MAP_dispersions', '_MAP_converged', 'dispersions', '_outlier_genes', '_LFC_converged', 'replaced', 'refitted', '_pvalue_cooks_outlier'
uns: 'trend_coeffs', 'disp_function_type', '_squared_logres', 'prior_disp_var'
obsm: 'design_matrix', '_mu_LFC', '_hat_diagonals'
varm: 'LFC'
layers: 'normed_counts', '_mu_hat', 'cooks'
After removing unpicklable DeseqDataSet attributes, we can save the corresponding
AnnData object. This can be done using the
to_picklable_anndata() method.
with open(os.path.join(OUTPUT_PATH, "result_adata.pkl"), "wb") as f:
pkl.dump(dds.to_picklable_anndata(), f)
Parameters are stored according to the AnnData data
structure, with key-based data fields. In particular,
Xstores the count data,obsstores 1D sample-level data, such as design factors and"size_factors",obsmstores multi-dimensional sample-level data, such as"design_matrix",varstores 1D gene-level data, such as gene names and"dispersions",varmstores multi-dimensional gene-level data, such as"LFC".
As an example, here is how we would access dispersions and LFCs (in natural log scale):
print(dds.var["dispersions"])
gene1 0.882598
gene2 0.222578
gene3 0.837238
gene4 0.158970
gene5 0.249926
gene6 0.973647
gene7 0.235155
gene8 0.198781
gene9 0.186520
gene10 0.631900
Name: dispersions, dtype: float64
print(dds.varm["LFC"])
Intercept condition[T.B]
gene1 1.891436 0.438632
gene2 2.851662 0.373296
gene3 1.787780 -0.438645
gene4 4.741958 -0.285647
gene5 3.077798 0.403457
gene6 1.678536 0.001010
gene7 3.291025 0.093116
gene8 3.785129 -0.187604
gene9 3.682882 -0.147443
gene10 2.300515 0.267562
Statistical analysis with the DeseqStats class
Now that dispersions and LFCs were fitted, we may proceed with statistical tests to
compute p-values and adjusted p-values for differential expresion. This is the role of
the DeseqStats class.
ds = DeseqStats(dds, contrast=["condition", "B", "A"], inference=inference)
PyDESeq2 computes p-values using Wald tests. This can be done using the
summary() method, which runs the whole statistical
analysis, cooks filtering and multiple testing adjustement included.
ds.summary()
Running Wald tests...
... done in 0.01 seconds.
Log2 fold change & Wald test p-value: condition B vs A
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.632812 0.289101 2.188898 0.028604 0.064150
gene2 21.281239 0.538552 0.149963 3.591236 0.000329 0.001646
gene3 5.010123 -0.632830 0.295236 -2.143476 0.032075 0.064150
gene4 100.517961 -0.412102 0.118629 -3.473868 0.000513 0.001710
gene5 27.142450 0.582065 0.154706 3.762409 0.000168 0.001646
gene6 5.413043 0.001457 0.310311 0.004696 0.996253 0.996253
gene7 28.294023 0.134338 0.149945 0.895917 0.370297 0.411441
gene8 40.358344 -0.270656 0.136401 -1.984261 0.047227 0.078711
gene9 37.166183 -0.212715 0.133243 -1.596437 0.110391 0.143147
gene10 11.589325 0.386011 0.244588 1.578207 0.114518 0.143147
The results are then stored in the results_df attribute (stat_res.results_df).
As with as DeseqDataSet, the whole
DeseqStats object may be saved using pickle.
However, it is often more convenient to have the results as a CSV.
Hence, we may export stat_res.results_df as CSV, using pandas.DataFrame.to_csv().
ds.results_df.to_csv(os.path.join(OUTPUT_PATH, "results.csv"))
Total running time of the script: (0 minutes 11.762 seconds)