第2回RNA-seq実戦総括(3)-DESeq 2による遺伝子発現差異分析
11737 ワード
DESeq 2は遺伝子発現の違いを分析するためのRパッケージであり、具体的な操作はR言語で実行する.R言語インストールDESeq 2
2.遺伝子発現量ファイルを読み込み、カラム名を追加する
3.データ統合
4.遺伝子への注釈-geneを取得するSymbol用bioMart対ensembl_idをgeneに変換symbol
5.DEseq 2差異発現遺伝子をスクリーニングし、注釈する(bioMart)
ddsオブジェクトを構築し、DESeqプロセスを開始
差異発現遺伝子を抽出するここで私が用いる方法は倍差法でpadj(p値が多重検査補正された値)が0.05未満であり、発現倍数は2を対数として1以上-1未満の差異発現遺伝子を取得する.
これでRNA-seqのデータ処理が完了し、次はpheatmapでホットマップを描きます
>source("https://bioconductor.org/biocLite.R")
>biocLite("DESeq2")
2.遺伝子発現量ファイルを読み込み、カラム名を追加する
> setwd("C:\\Users\\18019\\Desktop\\counts")
> options(stringsAsFactors=FALSE)
> control1 head(control1)
gene_id control1
1 ENSG00000000003.14_2 1576
2 ENSG00000000005.5_2 0
3 ENSG00000000419.12_2 756
4 ENSG00000000457.13_3 301
5 ENSG00000000460.16_5 764
6 ENSG00000000938.12_2 0
> control2 treat1treat2
3.データ統合
> raw_count head(raw_count)
gene_id control1 control2 treat1 treat2
1 __alignment_not_unique 7440131 2973831 7861484 8676884
2 __ambiguous 976485 412543 1014239 1179051
3 __no_feature 1860117 768637 1289737 1812056
4 __not_aligned 1198545 572588 1256232 1348068
5 __too_low_aQual 0 0 0 0
6 ENSG00000000003.14_2 1576 713 1589 1969
#
>raw_count_filt aENSEMBL row.names(raw_count_filt) raw_count_filt colnames(raw_count_filt)[1] head(raraw_count_filt )
ensembl_gene_id gene_id control1 control2 treat1 treat2
ENSG00000000003 ENSG00000000003 ENSG00000000003.14_2 1576 713 1589 1969
ENSG00000000005 ENSG00000000005 ENSG00000000005.5_2 0 0 0 1
ENSG00000000419 ENSG00000000419 ENSG00000000419.12_2 756 384 806 984
ENSG00000000457 ENSG00000000457 ENSG00000000457.13_3 301 151 217 324
ENSG00000000460 ENSG00000000460 ENSG00000000460.16_5 764 312 564 784
ENSG00000000938 ENSG00000000938 ENSG00000000938.12_2 0 0 0 0
4.遺伝子への注釈-geneを取得するSymbol用bioMart対ensembl_idをgeneに変換symbol
> library("biomaRt")
> library("curl")
> mart my_ensembl_gene_id options(timeout = 4000000)
> hg_symbols head(readcount)
ensembl_gene_id gene_id control1 control2 treat1 treat2 hgnc_symbol chromosome_name start_position
1 ENSG00000000003 ENSG00000000003.14_2 1576 713 1589 1969 TSPAN6 X 100627109
2 ENSG00000000005 ENSG00000000005.5_2 0 0 0 1 TNMD X 100584802
3 ENSG00000000419 ENSG00000000419.12_2 756 384 806 984 DPM1 20 50934867
4 ENSG00000000457 ENSG00000000457.13_3 301 151 217 324 SCYL3 1 169849631
5 ENSG00000000460 ENSG00000000460.16_5 764 312 564 784 C1orf112 1 169662007
6 ENSG00000000938 ENSG00000000938.12_2 0 0 0 0 FGR 1 27612064
end_position band
1 100639991 q22.1
2 100599885 q22.1
3 50958555 q13.13
4 169894267 q24.2
5 169854080 q24.2
6 27635277 p35.3
# count
> write.csv(readcount, file='readcount_all.csv')
> readcount write.csv(readcount, file='readcount.csv')
> head(readcount)
control1 control2 treat1 treat2
ENSG00000000003 1576 713 1589 1969
ENSG00000000005 0 0 0 1
ENSG00000000419 756 384 806 984
ENSG00000000457 301 151 217 324
ENSG00000000460 764 312 564 784
ENSG00000000938 0 0 0 0
5.DEseq 2差異発現遺伝子をスクリーニングし、注釈する(bioMart)
# (countData colData)
> mycounts head(mycounts)
control1 control2 treat1 treat2
ENSG00000000003 1576 713 1589 1969
ENSG00000000005 0 0 0 1
ENSG00000000419 756 384 806 984
ENSG00000000457 301 151 217 324
ENSG00000000460 764 312 564 784
ENSG00000000938 0 0 0 0
> condition condition
[1] control control treat treat
Levels: control treat
> colData colData
condition
control1 control
control2 control
treat1 treat
treat2 treat
ddsオブジェクトを構築し、DESeqプロセスを開始
>library("DESeq2")
> dds dds dds
class: DESeqDataSet
dim: 60880 4
metadata(1): version
assays(4): counts mu H cooks
rownames(60880): ENSG00000000003 ENSG00000000005 ... ENSG00000285993 ENSG00000285994
rowData names(22): baseMean baseVar ... deviance maxCooks
colnames(4): control1 control2 treat1 treat2
colData names(2): condition sizeFactor
#
> res = results(dds, contrast=c("condition", "control", "treat"))
> res = res[order(res$pvalue),]
> head(res)
log2 fold change (MLE): condition control vs treat
Wald test p-value: condition control vs treat
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
ENSG00000178691 1025.66218695436 2.83012875791025 0.225513526042636 12.5497073615672 3.98981786210676e-36
ENSG00000135535 2415.77359618136 1.22406336488047 0.183431131037356 6.67314952460929 2.5037106203736e-11
ENSG00000164172 531.425786834548 1.30449018960413 0.207785830749451 6.27805170785243 3.42841906697453e-10
ENSG00000172239 483.998634607265 1.31701332235233 0.215453141699223 6.11275988813803 9.79226522597759e-10
ENSG00000237296 53.0114998109978 2.70139282483841 0.480033904207378 5.62750422660019 1.82835684560772e-08
ENSG00000196504 3592.67315807893 1.09372324353448 0.200308218929736 5.46020153031335 4.75594407815571e-08
padj
ENSG00000178691 3.90682965057494e-32
ENSG00000135535 1.22581671973492e-07
ENSG00000164172 1.11903598346049e-06
ENSG00000172239 2.39714652731931e-06
ENSG00000237296 NA
ENSG00000196504 9.31404088266014e-05
> summary(res)
out of 33100 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 78, 0.24%
LFC < 0 (down) : 15, 0.045%
outliers [1] : 0, 0%
low counts [2] : 23308, 70%
(mean count < 135)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
# 78 ,15
#
> write.csv(res,file="All_results.csv")
差異発現遺伝子を抽出するここで私が用いる方法は倍差法でpadj(p値が多重検査補正された値)が0.05未満であり、発現倍数は2を対数として1以上-1未満の差異発現遺伝子を取得する.
> diff_gene_deseq2 1)
> dim(diff_gene_deseq2)
[1] 21 6
> head(diff_gene_deseq2)
log2 fold change (MLE): condition control vs treat
Wald test p-value: condition control vs treat
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
ENSG00000178691 1025.66218695436 2.83012875791025 0.225513526042636 12.5497073615672 3.98981786210676e-36
ENSG00000135535 2415.77359618136 1.22406336488047 0.183431131037356 6.67314952460929 2.5037106203736e-11
ENSG00000164172 531.425786834548 1.30449018960413 0.207785830749451 6.27805170785243 3.42841906697453e-10
ENSG00000172239 483.998634607265 1.31701332235233 0.215453141699223 6.11275988813803 9.79226522597759e-10
ENSG00000196504 3592.67315807893 1.09372324353448 0.200308218929736 5.46020153031335 4.75594407815571e-08
ENSG00000163848 633.066990185649 1.15489622775117 0.219655131372136 5.25777030810433 1.45812478575117e-07
padj
ENSG00000178691 3.90682965057494e-32
ENSG00000135535 1.22581671973492e-07
ENSG00000164172 1.11903598346049e-06
ENSG00000172239 2.39714652731931e-06
ENSG00000196504 9.31404088266014e-05
ENSG00000163848 0.000230253090268928
#
> write.csv(diff_gene_deseq2,file= "DEG_treat_vs_control.csv")
# bioMart
> library("biomaRt")
> library("curl")
> hg_symbols head(hg_symbols)
ensembl_gene_id external_gene_name
1 ENSG00000011405 PIK3C2A
2 ENSG00000100731 PCNX1
3 ENSG00000128512 DOCK4
4 ENSG00000135535 CD164
5 ENSG00000140526 ABHD2
6 ENSG00000144228 SPOPL
description
1 phosphatidylinositol-4-phosphate 3-kinase catalytic subunit type 2 alpha [Source:HGNC Symbol;Acc:HGNC:8971]
2 pecanex 1 [Source:HGNC Symbol;Acc:HGNC:19740]
3 dedicator of cytokinesis 4 [Source:HGNC Symbol;Acc:HGNC:19192]
4 CD164 molecule [Source:HGNC Symbol;Acc:HGNC:1632]
5 abhydrolase domain containing 2 [Source:HGNC Symbol;Acc:HGNC:18717]
6 speckle type BTB/POZ protein like [Source:HGNC Symbol;Acc:HGNC:27934]
# :res hg_symbols
> ensembl_gene_id diff_gene_deseq2 colnames(diff_gene_deseq2)[1] diff_name head(diff_name)
DataFrame with 6 rows and 9 columns
ensembl_gene_id baseMean log2FoldChange lfcSE stat pvalue
1 ENSG00000011405 1600.01408863821 1.07722909393382 0.24714564887963 4.35868120202462 1.30848557424083e-05
2 ENSG00000100731 1162.93822827396 1.0006257630015 0.214393389946423 4.66724166846545 3.05270197242525e-06
3 ENSG00000128512 368.442571635954 1.19657846347522 0.262780839813213 4.5535224878867 5.27550292947225e-06
4 ENSG00000135535 2415.77359618136 1.22406336488047 0.183431131037356 6.67314952460929 2.5037106203736e-11
5 ENSG00000140526 796.447227235737 1.05296203760187 0.23492350092969 4.48214858639031 7.38952622958053e-06
6 ENSG00000144228 293.746859588111 1.10903132755747 0.283181091639851 3.91633255290906 8.9906210067011e-05
padj external_gene_name
1 0.00533862114290257 PIK3C2A
2 0.002491004809499 PCNX1
3 0.00319558231320097 DOCK4
4 1.22581671973492e-07 CD164
5 0.00364483675888146 ABHD2
6 0.0214722343652725 SPOPL
description
1 phosphatidylinositol-4-phosphate 3-kinase catalytic subunit type 2 alpha [Source:HGNC Symbol;Acc:HGNC:8971]
2 pecanex 1 [Source:HGNC Symbol;Acc:HGNC:19740]
3 dedicator of cytokinesis 4 [Source:HGNC Symbol;Acc:HGNC:19192]
4 CD164 molecule [Source:HGNC Symbol;Acc:HGNC:1632]
5 abhydrolase domain containing 2 [Source:HGNC Symbol;Acc:HGNC:18717]
6 speckle type BTB/POZ protein like [Source:HGNC Symbol;Acc:HGNC:27934]
#
write.csv(diff_name,file= "diff_gene.csv")
これでRNA-seqのデータ処理が完了し、次はpheatmapでホットマップを描きます