Batch-downloading gene sequences with Biopython

To build your own BLAST database, you first need to download every gene sequence you want to include. Biopython makes this quick.
1. First, use "org.Hs.eg.db" to convert your gene symbols into RefSeq accession IDs.
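library(org.Hs.eg.db)
# Gene symbols to look up
symbol <- c("PDCD1", "CD274", "IL4", "IL7")
# Map each symbol to its first RefSeq accession
accession <- mapIds(org.Hs.eg.db, keys = symbol, column = "REFSEQ", keytype = "SYMBOL", multiVals = "first")
accession <- as.matrix(accession)
# Write one "symbol<TAB>accession" row per gene, without a header line
write.table(accession, "accession.txt", row.names = TRUE, col.names = FALSE, quote = FALSE, sep = "\t")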

The resulting accession.txt looks like this (columns are tab-separated):
PDCD1    NM_005018
CD274    NM_001267706
IL4    NM_000589
IL7    NM_000880
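Before moving on, it can help to check the mapping file from Python. The following is a minimal sketch, not part of the original workflow: `read_accessions` is a hypothetical helper, and `FAKE1` is a made-up symbol used only to show how an unmapped gene might appear, assuming R wrote the missing value as the literal string "NA" (the default for `write.table`).

```python
def read_accessions(lines):
    """Map gene symbol -> accession from tab-separated lines, skipping gaps."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and rows without a usable accession
        # (assumption: a missing value was written as the string "NA").
        if len(fields) < 2 or fields[1] in ("", "NA"):
            continue
        mapping[fields[0]] = fields[1]
    return mapping


# "FAKE1" is a hypothetical symbol standing in for a gene with no RefSeq hit.
sample = ["PDCD1\tNM_005018\n", "CD274\tNM_001267706\n", "FAKE1\tNA\n", "\n"]
print(read_accessions(sample))
# {'PDCD1': 'NM_005018', 'CD274': 'NM_001267706'}
```

With a real file, something like `read_accessions(open("accession.txt"))` should work the same way, since a file object iterates over its lines.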

2. Download the sequences with Biopython.
from Bio import Entrez
from Bio import SeqIO

file_in_name = "accession.txt"
file_out_name = "result.fasta"
Entrez.email = "your.name@example.com"  # placeholder: replace with your own email address

# Each input line is "symbol<TAB>accession"; records are appended to the output file.
with open(file_in_name, "r") as input_file, open(file_out_name, "a") as output_file:
    for line in input_file:
        if not line.strip():
            continue  # skip blank lines
        record_id = line.strip().split("\t")[1]
        # Fetch the GenBank record for this accession from NCBI
        result_handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=record_id)
        seq_record = SeqIO.read(result_handle, format="gb")
        result_handle.close()
        # Convert the record to FASTA and write it out
        output_file.write(seq_record.format("fasta"))
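The script above makes one request per accession. For longer lists, a possible variation is to ask NCBI for several records per call and request FASTA directly. This is only a sketch under assumptions: `chunked` and `fetch_fasta_batches` are hypothetical helper names, the batch size of 100 is an arbitrary choice, the email address is a placeholder, and it relies on `Entrez.efetch` accepting a comma-separated `id` string with `rettype="fasta"`.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_fasta_batches(accessions, out_path, batch_size=100):
    """Write FASTA for all accessions to out_path, one efetch call per batch."""
    # Imported here so chunked() can be tried without Biopython installed.
    from Bio import Entrez

    Entrez.email = "your.name@example.com"  # placeholder: use your own address
    with open(out_path, "w") as out:
        for batch in chunked(accessions, batch_size):
            # Assumption: efetch accepts several IDs joined by commas.
            handle = Entrez.efetch(db="nucleotide", rettype="fasta",
                                   retmode="text", id=",".join(batch))
            out.write(handle.read())
            handle.close()


# Batching alone needs no network access:
print(list(chunked(["NM_005018", "NM_001267706", "NM_000589"], 2)))
# [['NM_005018', 'NM_001267706'], ['NM_000589']]
```

Calling `fetch_fasta_batches(["NM_005018", "NM_000589"], "result.fasta")` would then contact NCBI. Note that this sketch opens the output in write mode, so it overwrites the file rather than appending to it.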

Result (sequences truncated here):
>NM_005018.3 Homo sapiens programmed cell death 1 (PDCD1), mRNA
GCTCACCTCCGCCTGAGCAGTGGAGAAGGCGGCACTCTGGTGGGGCTGCTCCAGGCATGCAGATCCCACAGGCGCCCTGGCCAGTCGTCTGGGCGGTGCTACAACTGGGCTGGCGGCCAG....
>NM_001267706.1 Homo sapiens CD274 molecule (CD274), transcript variant 2, mRNA
GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCAGCTTCCCGAGGCTCCGCACCAGCCGCGCTTCTGTCCGCCTGCAGGGCATTCCAGAAAGATGAGGATATTT...
>NM_000589.4 Homo sapiens interleukin 4 (IL4), transcript variant 1, mRNA
ATCGTTAGCTTCTCCTGATAAACTAATTGCCTCACATTGTCACTGCAAATCGACACCTATTAATGGGTCTCACCTCCCAACTGCTTCCCCCTCTGTTCTTCCTGCTAGCATGTGCCGGCA...
>NM_000880.4 Homo sapiens interleukin 7 (IL7), transcript variant 1, mRNA
ACACTTGTGGCTTCCGTGCACACATTAACAACTCATGGTTCTAGCTCCCAGTCGCCAAGCGTTGCCAAGGCGTTGAGAGATCATCTGGGAAGTCTTTTACCCAGAATTGCTTTGATTCAG...
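Since the FASTA headers carry versioned accessions (for example NM_005018.3) while the input list does not, a quick way to confirm that nothing was skipped is to compare the two after stripping the version suffix. The sketch below assumes headers of the form shown above; `missing_accessions` is a hypothetical helper, and the sequence lines in the example are shortened stand-ins.

```python
def missing_accessions(requested, fasta_lines):
    """Return requested accessions that have no matching FASTA header."""
    found = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                # ">NM_005018.3 Homo sapiens ..." -> "NM_005018"
                found.add(parts[0].split(".")[0])
    return [acc for acc in requested if acc not in found]


fasta = [
    ">NM_005018.3 Homo sapiens programmed cell death 1 (PDCD1), mRNA\n",
    "GCTCACCTCCGCCTGAGCAG\n",
    ">NM_000589.4 Homo sapiens interleukin 4 (IL4), transcript variant 1, mRNA\n",
    "ATCGTTAGCTTCTCCTGATA\n",
]
print(missing_accessions(["NM_005018", "NM_000589", "NM_000880"], fasta))
# ['NM_000880']
```

On real output, passing `open("result.fasta")` as `fasta_lines` should behave the same way; an empty list means every requested accession has a header.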

Done.
Reposted from: https://www.cnblogs.com/pipix/p/10184783.html