問題是計算seq.txt檔案中最長的連續 dna 基因序列數并將其存盤在字典中。字典應該存盤 dna 對應的鍵值對。例如,AGAT:5。這意味著 AGAT 在樣本中連續出現五次,是任何基因序列連續出現的最大次數。然后程式應該將 dna 字典中的值與 csv 檔案進行比較,以檢查它是誰的 dna。
這是 seq.txt 檔案:
TCTATTCTTTGAGGATACGCTCGGCCTAGGCGGGGCTAATGGAAGCCAGGCTAATCCGATGTTGCGGTGCACCTCGATACCGTTCTAAAATATCACATCAACGCGCTCCAGTTGTGTGCCAAGGCCCGCTGAAGAGCAATGGAGCACCTACCCGGCCTTCTAACGCTGTCTAAAACTCCAAGCGAATTGCAGATTTTGGTTAGGACCCGTTTAATCTGTGGGCTTTGGTACTATGCAACCAATGGAACCGGTCGGACTCTGATCAGTCCCGACTGACAGGTCTCAAGTAGTTTGCTTACACGTTCTGACCCCCGTGCGCACCGTTGGGCGTACAGCGGTTCGGTCTATGGAATCAAGGAAAATCATTCGTATGGGGACGTAGTCACATAACAGCTGCAGGGAACTATGGAGATGACGAGGGGTCGTTTAGTGGAACGTCAAATGTCCTAACTGGTTCTGAGCTGTCTGGAACGTTGCAGTCAACGTCTACGATCTGGATTCTACAGTCTAGGCGTTCCAAGGGGCACCAGTAAGCTAAGTTGTTTAAATATGGCGGGTGTCGAAATGACGTCCAAAATCGCAAATAAGACAGATAGCAGGGGTGCAACTTAGGTATCTAAGGTAACTCTGACATACCTCATACAACTATCGAACAGTGGATTCCTTGTCGTCCTGTTGTAAACAGTTCAAGTCGGTACATGTTAGCGGGTGGTTTGGACGAGTATACAGGACCTGGCCTACACGGAATGTTTTAGATTCTATGTCCGGCGGGGACATCGCGTGCCGCTAGGATATAATTGGATTGTGGGAAGAATTTGGCCGGATTTTTGGCCTAGACTCGCGCTTCAGACCATACCGTGCGATCAGCACGATTGCTGACAAGCGTCGGTATTAAAGCAGGCTCCTTCCCAGCCAAACTAACCCAACGAAGACATCATGTTTCGCCGAAGTATCTTTGGGAGATGGGCGAATTAATCGCTTAGCGTGGCCGACTTGGGGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGGTTTAAGGGACTTATCCGACCAGAGGGGCAGTTACTTGTGGCGGTCACACGCCAGGACGAGTCTGTTCTTGCTGTGCGTAGATTAGGCTTGATCTGTGACTACAGGCGAATAGTAGGTGTGGGAAACAGAGGGGGGAGCAATGTGATCCCGGGGGGAGTGCTTCCTATACCTCGGTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGGATTATCCCCACCCAATGATCCGATCGCAAGCCTTAATACCATGGCACACCTCTCAACCTACTGATCTTCCATCCGTTTAACCCAGCACTAAGCTGCTCAGTGGTCACACTATGTTCAAGCTTCCGTGACGTGGGATCCTGGGGTCTTCGCAAGGCTAGTTTTGACCATTATCGACGACCGTCACCCTGTGACTGGTTCCCAACAAGGTGTCAAGTTCTAGCCCGTACCTGCAATCGGGAACCTCCGGTGCTTCATGAACCATGGATATAGGAATTATTGGTCTCCTCTCGCGTAGGTAGCGCGAATACCCCCAAGATGACACACTGTGGTGAACTTTGAGGACTCCCAGAAGGGTGACGGGTTATGTGGTTACGCGAAGTCGGCGTATCCACCGCCTAATTTTAAATTCAGCTCGAGCGACACGCGCGCTTCCTGGAAACGTTAGACGGGAAAAACCCCGCCCGAGAATGCGGGTTCCGCGGCCCACTAGGGGGCCCCCCAAGGATCTGACCGCGTATAAGCAATGCACAGCTGTACCATTTCAAATAGGACAGATAGTACCCCCACCGTGACTCGGCCTCAGATAATGGAATACGACCTGGTGACGGCGGTAGGGGTTCTATCTCAGGTATTCAGAGGGTGCATCCAGGTGATTCGTCACGTCCCGATTTCGACCCCACCACAGGATTTGTGCGATGGTAGTCTTGATGCTGTTTGCAGGCGGCCAAGCATCTAGGAGATGCCTCACTGCGCGAGATGAACCGGCGTTTCACAAGGGGACGCCAGGCCTTGCCGTCTCCATAAACCACGAGAAGGTATCGAACGTCAAACGGATAAATGCCGCGATACCGCTCGTTTCGAAGCGGCACTTCGATGGAAATGAGTAGTATGGCCTCGCCACACGACTACTCATCGGCTTGCGCTGACATCAATCCTGGCTGGCTTGAGGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGCTCCATAGGAAGGTGCGGGATAGCGGACAGCTAATCGGACAGAAGGGCCAGCTTGCACTCTCCTATAATTAGCAAGCGCCATACAATTGTAATCACGTATAAAATACAGCTACGTAAGTAATAGAGAGGCTCCCGGACTGTCCGGCGTCCCGCCAGTCTCGTACCAGGAGGTGGGATGGTAGGCAAACGAGCCTACTAGAATTGGGCCACCCTGTGAATAATATGCAGAGGCAACTACAGACGTCCGTCACCTGCCTAGAATCGAGTTCATTGACGGTGGGATATGCTCCGTTACCTGACTGTAGTTCGACTTTGTGGTGCGCACATAACGAGTGTCTACGATGCACAAAGTGTGAGCAAATTAGGAGTGTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTGCCGAGATGTTGGCGGGAAGTGTACGGCTTTGCGTCGTCGAGTGCTACGCAGTGTGCTACACTCCCGCAGCTGAGGCTAGGGCCCGAAACTAGACATTTTTTCTTTTGGCACTTCGTTCCGTATAATGAGTTCCCTCAATTCCCCGTCCGCAAGCCTCAGGATTACAATTAATTATACGGTTAAAGTTGGCTGCCAAGCCCGTTATTGACCGGTACCTGAGTCGAGGGGGGGTTGGGGATAGGCAATTATAGTATTCACTCACAGGACGCTCAGTAATGCCGCCGTTGTACTTCACGTAAGGGCCACAGTTTTTCTACCACAGAGGATGATCTGAGGACAGCGGTGCGTGAAGCCCGCTATTCAGGACACCCTCGAAACCCGTGGTTCACAGACAAAAAATTCGCCGCGGAAGCTGTTGCCCCTATGCCCCGGGTCAGCAAGGAGTCTGGATTTTATTCCAAGACTGCGTCTTTATTTTCTGGTGAGTATGAAATGACTCTGAGAAAATGGTCGAACCACGAGCTAGCTACAGCCACAGTCCGCTCAACTAACTTACCTCTACTCTAACAGTTACACGGCTTCCCGTTTTATGGGAAGAAGCACCTGTTCCTTTCCCAAGCCCCTTATAGCAGAGGTTGGTATTCGGTTGATTTGGAATAGTTAAACAGCGGCTATTTTGTAATCACTTTCCAGTCGGTAAGACATTCGAACCTCGTTTTGACGCTGCTCGCCATCGCGTTCGACTAGGAGTATTCCACTTTTCGGAGAGATGATTACTCATGACGCGGGGAACTCCATGGCTGTCATGCAGGATCTGGGCTAAATAAGATTAGATGTTCAACTGTCGTATACTTACTGCTACCAGCGGTGCTAGGCCCAGGACCCGCCATACCTGGCTATTGATCACTCTACCAGATGTCTCTTGACGAGTTACGAATTGCTGGGTGCTCTTGGAGACGAGTTGAGTCCGTAGTCGTGGCTGGGGAACGGGCGAGTTCGTACGTACCGTTTCAAAGCCCCACGAACCCAACCTCTTAGCCTTAACCCCACATTAGATACCCAAGTTGCATGACGCATTATGCGAGTACGACACTGGTATCGGCTGATCCGTCACTGCTCAAAGTCCAGTGGTTTCCTTATCTCGGGCTGGAAAGTGTAGCTTGTTCCAAACCTTCGAGAGGTTGATCGATGACCGGTTCTCACACACATCTTGCGGAGGGATGCTTGCGATGTGGCTTTACGTCCACCGACGGGCCGACTAGCTGGAAATCACAAACCCCTGCTCCGATAAGGTATTCTCGTTGACTTAGGGTAAACAAAATGCCCGTTACGTCCTAACCGAGTTTCCGGGCCTTCACTACCCGCGAGGGATGTGTAGTGGGGCCATTTACCTAAGCAGATGTACACCGAGTTACGATAGTCACATGGCCATTCAAAGCGTCTCACATAATCGATCGATAGATGATGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCCCGGAGGAAGCTGCGATTGGAATGCGGCTAACTTCGCTCTGCAACATTCTTGGCAGACGGCCCCAATGGCGTAATTTAGGCGTGTGTACCTAAAGTGGTCTACTCCTATGAACCGAATCGCGGGATAAATCGAGTTGGGACTGCTTTGCCTTAATTACATTCACTGATTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGGTCGAGCACGGCTGGCACGTCCGGCTCCATCGCGTCGTAATCCATCCCTATTCGACCAACAAAACCTCAGGGGACGGGATGTGAGTGGGTATCGATCATTATCGAACGCCCATAAGTACTCCACTCATCTGTCTGAAAAGTTTGTCGAGTGCCGCTCTCTGAAGAGTACGATAACTTACTCCAAACACTCTACGCCTAGTGGTCGAAAACACTAAAGGGAAAATACTCACTGACTTACTCTGTCGCTCTACGATTGCCGCGATACCTTAATAAGACACGTATCGGCTGTCGCAGCGATGGATTCCTTAAGCGATACAACTAAGATCAATCGGTGCCGGGCCTACAGCCTGGGCCCTAGCTCCAAAAGTGATAATGGATAGTCGGTTCAAGCGAATTTACACCAGACTGATCCTTTACGGTCATTCCGACCGCCGCATGATACATGCCAAAAGACACTTGTCTTCTTTCCTCTAAAAGACAGACCTTGTTTGCAAGGAGAGCCCAATCGGCACGACCCAAAGGGATTATCAACTGAACTATTATTGCATACTACTAAGCAGACGGACCGTATAGCATCATTGATACCTATTATATTTCCATACACCAACTCCATACGCGATGGGTCGAAACTACAAGCTTCACTTACGTGTACAGCCGCAGGACCCACTCTCTAATCTAGCCAATGACACTACTAATTTGAACATTCCCCAGCGATGAACAGGCACATGAGCGGTCCTCGTACCCACCACGGCCCGCTCAACTGCAAGGGGCCGCTCGGATCAAAGTTTTTCACTAACTCATGTCGAGCAGATCGGCATGCTCAAGATAGTATTTTAGGAGG
這是csv檔案的副本
name,AGATC,TTTTTTCT,AATG,TCTAG,GATA,TATC,GAAA,TCTG
Albus,15,49,38,5,14,44,14,12
Cedric,31,21,41,28,30,9,36,44
Draco,9,13,8,26,15,25,41,39
Fred,37,40,10,6,5,10,28,8
Ginny,37,47,10,23,5,48,28,23
Hagrid,25,38,45,49,39,18,42,30
Harry,46,49,48,29,15,5,28,40
Hermione,43,31,18,25,26,47,31,36
James,46,41,38,29,15,5,48,22
Kingsley,7,11,18,33,39,31,23,14
Lavender,22,33,43,12,26,18,47,41
Lily,42,47,48,18,35,46,48,50
Lucius,9,13,33,26,45,11,36,39
Luna,18,23,35,13,11,19,14,24
Minerva,17,49,18,7,6,18,17,30
Neville,14,44,28,27,19,7,25,20
Petunia,29,29,40,31,45,20,40,35
Remus,6,18,5,42,39,28,44,22
Ron,37,47,13,25,17,6,13,35
Severus,29,27,32,41,6,27,8,34
Sirius,31,11,28,26,35,19,33,6
Vernon,26,45,34,50,44,30,32,28
Zacharias,29,50,18,23,38,24,22,9
e
這是我的代碼的副本
import csv
import sys
def main():
# TODO: Check for command-line usage
# TODO: Read database file into a variable
with open("dna.csv","r") as csv_file:
csv_reader = csv.reader(csv_file)
header = next(csv_reader)
# TODO: Read DNA sequence file into a variable
fil = open("seq.txt","r")
sequence = fil.read()
fil.close()
# TODO: Find longest match of each STR in DNA sequence
countsequence = {}
numberstr = 0
for r in header[1:]:
times = longest_match(sequence,r)
countsequence[r] = times
numberstr = 1
# TODO: Check database for matching profiles
checkstr = 0
with open("dna.csv" , "r") as csvfile:
csvreader = csv.DictReader(csvfile)
for rows in csvreader:
for key in countsequence:
if rows[key] != countsequence[key]:
break
elif rows[key] == countsequence[key]:
checkstr = 1
if checkstr == numberstr:
print(rows["name"])
return
print(checkstr)
def longest_match(sequence, subsequence):
"""Returns length of longest run of subsequence in sequence."""
# Initialize variables
longest_run = 0
subsequence_length = len(subsequence)
sequence_length = len(sequence)
# Check each character in sequence for most consecutive runs of subsequence
for i in range(sequence_length):
# Initialize count of consecutive runs
count = 0
# Check for a subsequence match in a "substring" (a subset of characters) within sequence
# If a match, move substring to next potential match in sequence
# Continue moving substring and checking for matches until out of consecutive matches
while True:
# Adjust substring start and end
start = i count * subsequence_length
end = start subsequence_length
# If there is a match in the substring
if sequence[start:end] == subsequence:
count = 1
# If there is no match in the substring
else:
break
# Update most consecutive matches found
longest_run = max(longest_run, count)
# After checking for runs at each character in seqeuence, return longest run found
return longest_run
main()
我期待代碼列印出人員 dna 名稱。
uj5u.com熱心網友回復:
您使用 csvreader 創建的字典中有一個細微的錯誤。這些值被讀取為字串而不是整數。如果您列印,您可以看到這一點rows:
OrderedDict([('name', 'Albus'), ('AGATC', '15'), ('TTTTTTCT', '49'), ('AATG', '38'), ('TCTAG', '5'), ('GATA', '14'), ('TATC', '44'), ('GAAA', '14'), ('TCTG', '12')])
因此,當您比較rows[key] != countsequence[key]“AGATC”時,您正在測驗'a_str' != 18即使在rows['AGATC'] is 18. 要么rows[key]需要轉換為 int 或countsequence[key]字串。一旦你解決了這個問題,它應該可以作業(對于這個例子)。
此外,您還有另一個(更微妙的)錯誤與checkstr. 您checkstr = 0在 csvreader 回圈外初始化,然后在 countsequence 回圈內遞增。這僅在 dna 集中沒有其他人的基因匹配時才有效。但是,如果 1 人匹配 1 個基因(僅)會發生什么。答案:您將增加那個人的 checkstr,而不是為下一個人重置為 0。嘗試將 'Albus' 修改為 'AGATC', 18 并查看它是否有效。祝你好運。
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/520932.html
