Loading data and saving results with pandas and pickle
In this example, we show how to load data in order to perform a DEA analysis with PyDESeq2, and how to save its results, using pandas and pickle.
We refer to the getting started example for more details on the analysis itself.
We start by importing required packages and setting up a path to save results.
import os
import pickle as pkl
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
# Replace this with the path to directory where you would like results to be saved
OUTPUT_PATH = "../output_files/synthetic_example/"
os.makedirs(OUTPUT_PATH, exist_ok=True) # Create path if it doesn't exist
Data loading with pandas
To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:
A count matrix of shape ‘number of samples’ x ‘number of genes’, containing read counts (non-negative integers),
Metadata (or annotations, or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data in cohorts.
Both should be provided as pandas dataframes.
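Before building a DeseqDataSet, it can help to sanity-check these two inputs. The sketch below is purely illustrative: the check_dea_inputs helper and the toy dataframes are hypothetical, not part of PyDESeq2.

```python
import pandas as pd

# Hypothetical toy data standing in for test_counts.csv / test_metadata.csv
counts_df = pd.DataFrame(
    {"gene1": [12, 1, 4], "gene2": [21, 44, 4]},
    index=["sample1", "sample2", "sample3"],
)
metadata = pd.DataFrame(
    {"condition": ["A", "A", "B"]},
    index=["sample1", "sample2", "sample3"],
)


def check_dea_inputs(counts: pd.DataFrame, metadata: pd.DataFrame) -> None:
    """Basic sanity checks on DEA inputs (illustrative helper)."""
    # Read counts must be non-negative integers
    if (counts.values < 0).any():
        raise ValueError("Counts must be non-negative.")
    if not all(pd.api.types.is_integer_dtype(dt) for dt in counts.dtypes):
        raise ValueError("Counts must be integers.")
    # Both dataframes must be indexed by the same samples, in the same order
    if not counts.index.equals(metadata.index):
        raise ValueError("Counts and metadata indices differ.")


check_dea_inputs(counts_df, metadata)  # passes silently on valid inputs
```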
Here, we show how to load CSV files corresponding to counts and annotations as pandas dataframes, using pandas.read_csv(). We assume that DATA_PATH is a directory containing a test_counts.csv and a test_metadata.csv file.
# Replace this with the path to your dataset
DATA_PATH = "https://raw.githubusercontent.com/owkin/PyDESeq2/main/datasets/synthetic/"
counts_df = pd.read_csv(os.path.join(DATA_PATH, "test_counts.csv"), index_col=0)
print(counts_df)
sample1 sample2 sample3 ... sample98 sample99 sample100
gene1 12 1 4 ... 10 18 21
gene2 21 44 4 ... 36 14 9
gene3 4 2 11 ... 2 3 3
gene4 130 63 180 ... 72 66 42
gene5 18 11 21 ... 11 53 13
gene6 0 10 3 ... 2 11 13
gene7 16 70 28 ... 66 32 19
gene8 54 32 34 ... 27 19 78
gene9 49 57 65 ... 16 79 30
gene10 3 9 2 ... 9 11 5
[10 rows x 100 columns]
Note that the counts data is in a ‘number of genes’ x ‘number of samples’ format, whereas ‘number of samples’ x ‘number of genes’ is required. To fix this issue, we transpose the counts dataframe.
counts_df = counts_df.T
print(counts_df)
gene1 gene2 gene3 gene4 ... gene7 gene8 gene9 gene10
sample1 12 21 4 130 ... 16 54 49 3
sample2 1 44 2 63 ... 70 32 57 9
sample3 4 4 11 180 ... 28 34 65 2
sample4 1 10 2 100 ... 28 16 33 9
sample5 1 11 6 135 ... 32 29 31 5
... ... ... ... ... ... ... ... ... ...
sample96 7 26 3 67 ... 41 44 54 1
sample97 1 14 3 71 ... 19 42 25 4
sample98 10 36 2 72 ... 66 27 16 9
sample99 18 14 3 66 ... 32 19 79 11
sample100 21 9 3 42 ... 19 78 30 5
[100 rows x 10 columns]
metadata = pd.read_csv(os.path.join(DATA_PATH, "test_metadata.csv"), index_col=0)
print(metadata)
condition group
sample1 A X
sample2 A Y
sample3 A X
sample4 A Y
sample5 A X
... ... ...
sample96 B Y
sample97 B X
sample98 B Y
sample99 B X
sample100 B Y
[100 rows x 2 columns]
In this example, the metadata contains two columns, condition and group, representing two types of bi-level annotations. Here, we will only use the condition factor.
Data filtering
Before proceeding with DEA, we preprocess the data, as in the getting started example: we filter out genes that have fewer than 10 read counts in total. Note that there are no such genes in this synthetic dataset.
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
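If the metadata contained missing annotations, the corresponding samples would also typically be dropped at this stage, as in the getting started example. A minimal sketch, with hypothetical toy dataframes standing in for the real ones:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: sample2 has no condition annotation
metadata = pd.DataFrame(
    {"condition": ["A", np.nan, "B"]},
    index=["sample1", "sample2", "sample3"],
)
counts_df = pd.DataFrame(
    {"gene1": [12, 1, 4], "gene2": [21, 44, 4]},
    index=["sample1", "sample2", "sample3"],
)

# Keep only samples with a non-missing condition annotation,
# in both the counts and the metadata
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
```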
Now that we have loaded and filtered our data, we may proceed with the differential analysis.
Single factor analysis
As in the getting started example, we ignore the group variable and use the condition column as our design factor.
Read counts modeling with the DeseqDataSet class
We start by creating a DeseqDataSet object from the count and metadata data that were just loaded.
inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design_factors="condition",
refit_cooks=True,
inference=inference,
)
Once a DeseqDataSet has been initialized, we may run the deseq2() method to fit dispersions and LFCs.
dds.deseq2()
Fitting size factors...
... done in 0.00 seconds.
Fitting dispersions...
... done in 0.02 seconds.
Fitting dispersion trend curve...
... done in 0.01 seconds.
Fitting MAP dispersions...
... done in 0.02 seconds.
Fitting LFCs...
... done in 0.02 seconds.
Replacing 0 outlier genes.
The DeseqDataSet class extends the AnnData class.
print(dds)
AnnData object with n_obs × n_vars = 100 × 10
obs: 'condition', 'group'
uns: 'trend_coeffs', 'disp_function_type', '_squared_logres', 'prior_disp_var'
obsm: 'design_matrix', 'size_factors', 'replaceable'
varm: 'non_zero', '_MoM_dispersions', 'genewise_dispersions', '_genewise_converged', '_normed_means', 'fitted_dispersions', 'MAP_dispersions', '_MAP_converged', 'dispersions', '_outlier_genes', 'LFC', '_LFC_converged', 'replaced', 'refitted'
layers: 'normed_counts', '_mu_hat', '_mu_LFC', '_hat_diagonals', 'cooks'
As such, it can be saved using pickle.dump.
with open(os.path.join(OUTPUT_PATH, "dds.pkl"), "wb") as f:
pkl.dump(dds, f)
It may be loaded again using pickle.load.
with open(os.path.join(OUTPUT_PATH, "dds.pkl"), "rb") as f:
dds2 = pkl.load(f)
print(dds2)
AnnData object with n_obs × n_vars = 100 × 10
obs: 'condition', 'group'
uns: 'trend_coeffs', 'disp_function_type', '_squared_logres', 'prior_disp_var'
obsm: 'design_matrix', 'size_factors', 'replaceable'
varm: 'non_zero', '_MoM_dispersions', 'genewise_dispersions', '_genewise_converged', '_normed_means', 'fitted_dispersions', 'MAP_dispersions', '_MAP_converged', 'dispersions', '_outlier_genes', 'LFC', '_LFC_converged', 'replaced', 'refitted'
layers: 'normed_counts', '_mu_hat', '_mu_LFC', '_hat_diagonals', 'cooks'
Parameters are stored according to the AnnData data structure, with key-based data fields. In particular,
X stores the count data,
obs stores design factors,
obsm stores sample-level data, such as "design_matrix" and "size_factors",
varm stores gene-level data, such as "dispersions" and "LFC".
As an example, here is how we would access dispersions and LFCs (in natural log scale):
print(dds.varm["dispersions"])
[0.88259824 0.22257849 0.83723751 0.15897038 0.24992574 0.97364737
0.23515474 0.19878066 0.18652019 0.63189957]
print(dds.varm["LFC"])
intercept condition_B_vs_A
gene1 1.891436 0.438632
gene2 2.851662 0.373296
gene3 1.787780 -0.438645
gene4 4.741958 -0.285647
gene5 3.077798 0.403457
gene6 1.678536 0.001010
gene7 3.291025 0.093116
gene8 3.785129 -0.187604
gene9 3.682882 -0.147443
gene10 2.300515 0.267562
Statistical analysis with the DeseqStats class
Now that dispersions and LFCs have been fitted, we may proceed with statistical tests to compute p-values and adjusted p-values for differential expression. This is the role of the DeseqStats class.
stat_res = DeseqStats(dds, inference=inference)
PyDESeq2 computes p-values using Wald tests. This can be done using the summary() method, which runs the whole statistical analysis, Cooks filtering and multiple testing adjustment included.
stat_res.summary()
Running Wald tests...
... done in 0.01 seconds.
Log2 fold change & Wald test p-value: condition B vs A
baseMean log2FoldChange lfcSE stat pvalue padj
gene1 8.541317 0.632812 0.289101 2.188898 0.028604 0.064150
gene2 21.281239 0.538552 0.149963 3.591236 0.000329 0.001646
gene3 5.010123 -0.632830 0.295236 -2.143476 0.032075 0.064150
gene4 100.517961 -0.412102 0.118629 -3.473868 0.000513 0.001710
gene5 27.142450 0.582065 0.154706 3.762409 0.000168 0.001646
gene6 5.413043 0.001457 0.310311 0.004696 0.996253 0.996253
gene7 28.294023 0.134338 0.149945 0.895917 0.370297 0.411441
gene8 40.358344 -0.270656 0.136401 -1.984261 0.047227 0.078711
gene9 37.166183 -0.212715 0.133243 -1.596437 0.110391 0.143147
gene10 11.589325 0.386011 0.244588 1.578207 0.114518 0.143147
The results are then stored in the results_df attribute (stat_res.results_df).
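For downstream use, one often wants only the significant genes. As an illustration, with a toy stand-in for stat_res.results_df and a hypothetical 0.05 FDR threshold, the table can be filtered on its padj column:

```python
import pandas as pd

# Toy stand-in for stat_res.results_df (column names follow DESeq2 conventions)
results_df = pd.DataFrame(
    {
        "log2FoldChange": [0.632812, 0.538552, 0.001457],
        "padj": [0.064150, 0.001646, 0.996253],
    },
    index=["gene1", "gene2", "gene6"],
)

# Select genes significant at a 0.05 FDR, sorted by adjusted p-value
sig = results_df[results_df["padj"] < 0.05].sort_values("padj")
print(sig.index.tolist())  # → ['gene2']
```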
As with the DeseqDataSet, the whole DeseqStats object may be saved using pickle. However, it is often more convenient to have the results as a CSV. Hence, we may export stat_res.results_df as a CSV file, using pandas.DataFrame.to_csv().
stat_res.results_df.to_csv(os.path.join(OUTPUT_PATH, "results.csv"))
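The exported CSV can later be reloaded as a dataframe with pandas.read_csv(), passing index_col=0 to restore the gene index. A self-contained round-trip sketch, using a toy stand-in for stat_res.results_df and a temporary directory in place of OUTPUT_PATH:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for stat_res.results_df
results_df = pd.DataFrame(
    {"log2FoldChange": [0.632812, 0.538552], "padj": [0.064150, 0.001646]},
    index=["gene1", "gene2"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.csv")
    results_df.to_csv(path)
    # index_col=0 restores the gene names as the index when reading back
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded.index.tolist())  # → ['gene1', 'gene2']
```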
Total running time of the script: (0 minutes 3.582 seconds)