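I am trying to build two lists holding the "start" and "end" indices of the stretches where two strings match position by position. In this case the two strings have equal length. For example: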
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'
Here, the matching stretches are: GG, CG, CG
I would like output of the following kind:
list = [2,3,6,7,10,11] #list of the matched indices
start = [2,6,10] #start indices of the matched lengths
end = [3,7,11] #end indices of the matched lengths
Right now my code looks like the block below, but what I want are the indices that locate the matched sequences.
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'
result1 = ''
result2 = ''
#handle the case where one string is longer than the other
maxlen=len(str2) if len(str1)<len(str2) else len(str1)
#loop through the characters
for i in range(maxlen):
    letter1=str1[i:i+1]
    letter2=str2[i:i+1]
    if ((letter1 == letter2) and letter1 in ['A','T','C','G'] and letter2 in ['A','T','C','G']):
        result1 +=letter1
        result2 +=letter2
Reply from a uj5u.com user:
This is really a job for zip:
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'
matches = []
for i,(a,b) in enumerate(zip(str1,str2)):
    if a == b:
        # start a new run unless the previous run ended at i-1
        if not matches or matches[-1][1] != i-1:
            matches.append([i,i])
        else:
            matches[-1][1] += 1
print(matches)
starts = [k[0] for k in matches]
ends = [k[1] for k in matches]
Output:
[[2, 3], [6, 7], [10, 11]]
This will also capture single-character matches. If needed, you can filter those out afterwards with a quick loop.
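A minimal sketch of that filtering step, assuming runs shorter than two characters should be dropped. The function name `matching_runs` and the `min_len` parameter are made up for illustration and are not part of the answer. The sketch also builds the flat index list the question asks for.

```python
def matching_runs(s, t, min_len=2):
    """Return inclusive [start, end] pairs where s and t agree position by position."""
    runs = []
    for i, (a, b) in enumerate(zip(s, t)):
        if a == b:
            if runs and runs[-1][1] == i - 1:
                runs[-1][1] = i        # extend the current run
            else:
                runs.append([i, i])    # open a new run
    # drop runs shorter than min_len (min_len=1 keeps single characters)
    return [r for r in runs if r[1] - r[0] + 1 >= min_len]

str1 = 'ATGGATCGATCG'
str2 = 'CGGGCGCGCGCG'
runs = matching_runs(str1, str2)
starts = [r[0] for r in runs]
ends = [r[1] for r in runs]
# every matched index, expanded from the inclusive ranges
flat = [i for s, e in runs for i in range(s, e + 1)]
print(starts, ends, flat)
# [2, 6, 10] [3, 7, 11] [2, 3, 6, 7, 10, 11]
```

Filtering after the loop keeps the run detection itself unchanged, so the same function can serve both cases by changing `min_len`.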
Reply from a uj5u.com user:
You can also do something similar with regular expressions.
import re
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'
pat = 'GG|CG'
matches = [[(m.span()[0],m.span()[1]-1) for m in re.finditer(pat,x)] for x in [str1,str2]]
# sets are unordered, so sort the common spans before splitting them
m = sorted(set(matches[0]) & set(matches[1]))
starts= [x[0] for x in m]
ends= [x[1] for x in m]
print(m,starts,ends, sep='\n')
Output:
[(2, 3), (6, 7), (10, 11)]
[2, 6, 10]
[3, 7, 11]
Reply from a uj5u.com user:
You can also use numpy.split to split the indices wherever they stop being consecutive:
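import numpy as np
lst = [i for i, (s1,s2) in enumerate(zip(str1, str2)) if s1==s2]
# pure-Python equivalent of the split points; not used below
splits = [0] + [idx+1 for idx, (i,j) in enumerate(zip(lst, lst[1:])) if j-i != 1] + [len(lst)]
# int() turns NumPy integers back into plain Python ints
start, end = zip(*[[int(arr[0]), int(arr[-1])] for arr in np.split(lst, np.where(np.diff(lst) != 1)[0]+1)])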
Output:
((2, 6, 10), (3, 7, 11))
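For comparison, the same grouping of consecutive indices can be sketched with only the standard library. This swaps in `itertools.groupby` in place of numpy.split and is not part of the answer; the helper name `consecutive_groups` is hypothetical.

```python
from itertools import groupby

str1 = 'ATGGATCGATCG'
str2 = 'CGGGCGCGCGCG'
lst = [i for i, (a, b) in enumerate(zip(str1, str2)) if a == b]

def consecutive_groups(indices):
    """Collapse a sorted list of ints into inclusive (first, last) runs."""
    groups = []
    # value minus position stays constant inside a run of consecutive integers
    for _, grp in groupby(enumerate(indices), key=lambda p: p[1] - p[0]):
        run = [idx for _, idx in grp]
        groups.append((run[0], run[-1]))
    return groups

groups = consecutive_groups(lst)
start = [g[0] for g in groups]
end = [g[1] for g in groups]
print(lst, start, end)
# [2, 3, 6, 7, 10, 11] [2, 6, 10] [3, 7, 11]
```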
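Reply from a uj5u.com user:
A few corrections to your code: 1) max() is a built-in, so the if expression is unnecessary; 2) strings are already sequences, so "a" in "bbbbabb" already returns True and there is no need to put each letter into a list.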
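A sketch of the question's loop with those two corrections applied. The extra truthiness check on `letter1` is my own addition, made because `'' in 'ATCG'` is also True in Python; it is not something the answer mentions.

```python
str1 = 'ATGGATCGATCG'
str2 = 'CGGGCGCGCGCG'

# max() replaces the hand-written conditional expression
maxlen = max(len(str1), len(str2))

result1 = ''
result2 = ''
for i in range(maxlen):
    # slicing (rather than indexing) yields '' past the end of a shorter string
    letter1 = str1[i:i + 1]
    letter2 = str2[i:i + 1]
    # `in` works on a string directly; `letter1 and ...` rejects the empty slice
    if letter1 == letter2 and letter1 and letter1 in 'ATCG':
        result1 += letter1
        result2 += letter2

print(result1)
# GGCGCG
```

Since both letters are equal whenever the first test passes, checking only `letter1` against the alphabet is enough.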
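It looks like you need a function that counts how many leading characters two strings have in common.
import itertools as it
def f(s,t):
    # count the pairwise-equal characters up to the first mismatch
    return sum(it.takewhile(bool,map(lambda z:z[0]==z[1],zip(s,t))))
With such a function we can now do what you describe and find all simultaneous matches of any length between the strings: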
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'
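# only start a run where the previous position does not match, so a longer run is reported once
matches = [(i,i+l-1) for i,(a,b) in enumerate(zip(str1,str2)) if (i==0 or str1[i-1]!=str2[i-1]) and (l:=f(str1[i:],str2[i:]))>=2]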
print(matches)
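Reply from a uj5u.com user:
Let's start with a helper function that computes the length of the common prefix of the two strings starting at a given index: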
def helper(index, str1, str2):
    length = 0
    try:
        while str1[index] == str2[index]: #and other needed conditions
            length += 1
            index += 1
    except IndexError:
        # ran off the end of the shorter string
        pass
    return length
Now we want to use it while iterating:
index = 0
result = []
while index < min(len(str1), len(str2)):
    length = helper(index, str1, str2)
    if length > 0:
        # store an inclusive (start, end) pair, as in the question
        result.append((index, index + length - 1))
        index += length + 1 # We can omit one character as it was checked in helper
    else:
        index += 1
