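I have two CSV files that I want to combine using Python, as practice for my skills, and it is turning out to be much harder than I expected...
In short: I think my code should be correct, but the edited CSV file does not come out the way I expect.
One file, named chrM_location.csv, is the file I want to edit.
The first file looks like this:
The other file, named chrM_genes.csv, is the one I use as a reference.
The second file looks like this: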

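There are some other columns too, but I am not using them at the moment. The first few rows have the subject "CDS", followed by a blank row, then some rows with the subject "exon", then another blank row, then some "gene" rows (and a few others).
What I am trying to do: read the first file line by line, look at the number in the second column (42 in row 1, not counting the header), and check whether it falls inside the range given by columns 4-5 of the second file (also read line by line). If it does, I record the information from the matching row and paste it back into the first file, at the end of the line; if not, I skip it.
Below is my code. I started by running everything against the CDS section only, so I wrote a function refcds(). It returns:
- whether the value is in range;
- if it is in range, a list of the information I want to paste back into chrM_location.csv.
In the main part of the code everything seems to work: the list final[] holds all the information for the row, and supposedly I only need to write it onto that row, overwriting what was there before. I used print(final) to check the contents, and they look exactly like what I want.
But this is what the result looks like:

I don't know why a new line is inserted, or why some rows end up pasted here, when column 2 should run from small to large by value.
Similar things happen in other places as well.
Thank you very much for your help! I am running out of ideas... No error message is given, so I can't really work out what is going wrong.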
import csv
from csv import reader
from csv import writer

mylist=[]
a=0
final=[]

# Returns (0, info) when value falls inside a range, (1, []) otherwise
def refcds(value):
    mylist=[]
    with open("chrM_genes.csv", "r") as infile:
        r = csv.reader(infile)
        for rows in r:
            for i in range(0,12):
                if value >= rows[3] and value <= rows[4]:
                    mylist = ["CDS",rows[3],rows[4],int(int(value)-int(rows[3])+1)]
                    return 0, mylist
        else:
            return 1,[]

with open('chrM_location.csv','r+') as myfile:
    csv_reader = csv.reader(myfile)
    csv_writer = csv.writer(myfile)
    for row in csv_reader:
        if (row[1]) != 'POS':
            final=[]
            a,mylist = refcds(row[1])
            if a==0:
                lista=[row[0],row[1],row[2],row[3],row[4],row[5]]
                final.extend(lista)
                final.extend(mylist),
                csv_writer.writerow(final)
            if a==1:
                pass
        if (row[1]) == 'END':
            break
myfile.close()
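uj5u.com user reply:
If I understand correctly, your code is trying to read from and write to the same file at the same time.
csv_reader = csv.reader(myfile)
csv_writer = csv.writer(myfile)
I have not run your code, but I am fairly sure this will cause strange things to happen... (If you refactor it to write to a third file, do you still see the same problem?)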
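To make the "third file" suggestion concrete, here is a minimal sketch, not the asker's actual program: the reader and the writer each get their own file handle, so writing can never disturb the reader's position. The function name `copy_with_extra_column` and its parameters are made up for illustration.

```python
import csv


def copy_with_extra_column(src_path, dst_path, extra_value):
    """Read every row from src_path and write it, plus one extra cell, to dst_path.

    The source file is opened read-only and is never modified.
    """
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row + [extra_value])
```

If the output from a separate file looks right, the odd rows in the original run most likely come from the shared file handle rather than from the matching logic.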
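uj5u.com user reply:
I think the problem is that you set up the reader and the writer on the same file, and I don't know what that ends up doing. A cleaner solution is to accumulate the modified rows inside the read loop and then, once you have left the read loop (and the file is closed), open the same file for writing (not appending) and write the accumulated rows.
I made that one major change to fix the problem.
You also said you are trying to improve your Python, so I made some other changes that are more Pythonic.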
import csv

# Return a matched list, or return None
def refcds(value):
    pos = int(value)
    with open('chrM_genes.csv', 'r', newline='') as infile:
        reader = csv.reader(infile)
        for row in reader:
            # Skip blank, short, or non-numeric (e.g. header) rows
            if len(row) < 5 or not (row[3].isdigit() and row[4].isdigit()):
                continue
            start, end = int(row[3]), int(row[4])
            # Compare as integers: csv.reader yields strings, and string
            # comparison is lexicographic ('42' sorts after '3307')
            if start <= pos <= end:
                computed = pos - start + 1  # 1-based offset of value within the range
                return ['CDS', row[3], row[4], computed]
    return None  # if we get to this return, we've evaluated every row and didn't already return (because of a match)

# Accumulate rows here
final_rows = []

with open('chrM_location.csv', 'r', newline='') as myfile:
    reader = csv.reader(myfile)
    # next(reader)  ## if you know your file has a header
    for row in reader:
        # Show unusual conditions first...
        if row[1] == 'POS':
            continue  # skip header??
        if row[1] == 'END':
            break

        # ...and if not met, do desired work
        mylist = refcds(row[1])
        if mylist is not None:
            # no need to declare an empty list and then extend it
            # just create it with initial items...
            final = row[0:6]  # slice notation: indices 0-5 (6 is exclusive), i.e. the first six columns
            final.extend(mylist)
            final_rows.append(final)

# Write accumulated rows here
with open('final.csv', 'w', newline='') as finalfile:
    writer = csv.writer(finalfile)
    writer.writerows(final_rows)
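One thing to watch in the range test: csv.reader returns every cell as a string, so comparing raw cells with >= and <= compares text character by character, not numerically. A small sketch of the difference, using the POS value 42 and the range 3307-4262 from the sample data (the two helper names are made up for illustration):

```python
def in_range_as_text(value, start, end):
    # Lexicographic comparison: '42' sorts after '3307' because '4' > '3'
    return start <= value <= end


def in_range_as_int(value, start, end):
    # Numeric comparison: convert the CSV strings first
    return int(start) <= int(value) <= int(end)


print(in_range_as_text('42', '3307', '4262'))  # True  (a spurious match)
print(in_range_as_int('42', '3307', '4262'))   # False
print(int('42') - int('3307') + 1)             # -3264 (offset from the spurious match)
```

A negative computed offset is therefore a sign that a row was matched on text rather than on numbers.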
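I also tried to work through the whole thing, and came up with the following...
I think you want to look up chrM_genes rows by subject and compare POS (from chrM_location) against each gene's Start and End bounds; if POS lies between Start and End, return the chrM_genes data and use it to fill in some cells that are currently empty in chrM_location.
My first step is to build a data structure from chrM_genes, since we will be reading it over and over. Reading your question closely, I can see a need to "filter" the results by subject ("CDS", "exon", and so on), though I'm not sure about that. Still, I'm going to index this data structure by subject: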
import csv
from collections import defaultdict

# This will create a dictionary, where subject will be the key
# and the value will be a list (of chrM (gene) rows)
chrM_rows_by_subject = defaultdict(list)

# Fill the data structure
with open('chrM_genes.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # read (skip) header
    subject_col = 2
    for row in reader:
        # you mentioned empty rows, that divide subjects, so skip empty rows
        if row == []:
            continue
        subject = row[subject_col]
        chrM_rows_by_subject[subject].append(row)
I mocked up chrM_genes.csv (and added a header, to try to clarify the structure):
Col1,Col2,Subject,Start,End
chrM,ENSEMBL,CDS,3307,4262
chrM,ENSEMBL,CDS,4470,5511
chrM,ENSEMBL,CDS,5904,7445
chrM,ENSEMBL,CDS,7586,8266
chrM,ENSEMBL,exon,100,200
chrM,ENSEMBL,exon,300,400
chrM,ENSEMBL,exon,700,750
Just printing the data structure to get an idea of what it's doing:
import pprint
pprint.pprint(chrM_rows_by_subject)
yields:
defaultdict(<class 'list'>,
            {'CDS': [['chrM', 'ENSEMBL', 'CDS', '3307', '4262'],
                     ['chrM', 'ENSEMBL', 'CDS', '4470', '5511'],
                     ...
                     ],
             'exon': [['chrM', 'ENSEMBL', 'exon', '100', '200'],
                      ['chrM', 'ENSEMBL', 'exon', '300', '400'],
                      ...
                      ],
             })
Next, I want a function to match a row by subject and POS:
# Return a row that matches `subject` with `pos` between Start and End; or return None.
def match_gene_row(subject, pos):
    rows = chrM_rows_by_subject[subject]
    pos = int(pos)
    start_col = 3
    end_col = 4
    for row in rows:
        # csv.reader yields strings, so convert before comparing numerically
        start = int(row[start_col])
        end = int(row[end_col])
        if pos >= start and pos <= end:
            # return just the data we want...
            return row
    # or return nothing at all
    return None
If I run these commands to test:
print(match_gene_row('CDS', '42'))
print(match_gene_row('CDS', '4200'))
print(match_gene_row('CDS', '7586'))
print(match_gene_row('exon', '500'))
print(match_gene_row('exon', '399'))
I get:
None  # CDS: 42 (below every CDS Start; comparing as strings would wrongly return the 3307-4262 row)
['chrM', 'ENSEMBL', 'CDS', '3307', '4262']
['chrM', 'ENSEMBL', 'CDS', '7586', '8266']
None  # exon: 500
['chrM', 'ENSEMBL', 'exon', '300', '400']
Read chrM_location.csv, and build a list of rows with matching gene data.
final_rows = []  # accumulate all rows here, for writing later
with open('chrM_location.csv', newline='') as f:
    reader = csv.reader(f)
    # Modify header
    header = next(reader)
    header.extend(['CDS', 'Start', 'End', 'sequence_cc'])
    final_rows.append(header)
    # Read rows and match to genes
    pos_column = 1
    for row in reader:
        pos = row[pos_column]
        matched_row = match_gene_row('CDS', pos)  # hard-coded to CDS
        if matched_row is not None:
            subj, start, end = matched_row[2:5]
            # offset of POS within the range; negative only if POS < Start, i.e. the match was wrong
            computed = str(int(pos) - int(start) + 1)
            row.extend([subj, start, end, computed])
            final_rows.append(row)
Finally, write.
with open('final.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(final_rows)
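If the goal is to update chrM_location.csv itself rather than produce a separate final.csv, one common approach is to write to a temporary file in the same directory and swap it in once writing has finished. The following is only a sketch of that idea; `rewrite_csv_in_place` and `transform` are hypothetical names, not part of the code above.

```python
import csv
import os
import tempfile


def rewrite_csv_in_place(path, transform):
    """Replace the CSV at `path` with transform(row) applied to every row.

    `transform` returns the new row, or None to drop the row.
    The original file is only replaced after all rows were written.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', newline='') as tmp, open(path, newline='') as src:
            writer = csv.writer(tmp)
            for row in csv.reader(src):
                new_row = transform(row)
                if new_row is not None:
                    writer.writerow(new_row)
        os.replace(tmp_path, path)  # swap the finished file into place
    except Exception:
        os.remove(tmp_path)  # leave the original untouched on failure
        raise
```

Reading and writing still happen on two different files, so the reader is never affected by the writer; the original is simply swapped out at the very end.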
I mocked up chrM_location.csv:
name,POS,id,Ref,ALT,Frequency
chrM,41,.,C,T,0.002498
chrM,42,rs377245343,T,TC,0.001562
chrM,55,.,TA,T,0.00406
chrM,55,.,T,C,0.001874
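Running the whole thing gave me a final.csv that looks like this. Note that these matches are what a string comparison of POS, Start and End produces: as text, '41' sorts between '3307' and '4262', which is also why the computed offset is negative. Compared as integers, none of these four mock positions lies inside a CDS range, so only the header row would be written.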
| name | POS | id | Ref | ALT | Frequency | CDS | Start | End | sequence_cc |
|---|---|---|---|---|---|---|---|---|---|
| chrM | 41 | . | C | T | 0.002498 | CDS | 3307 | 4262 | -3265 |
| chrM | 42 | rs377245343 | T | TC | 0.001562 | CDS | 3307 | 4262 | -3264 |
| chrM | 55 | . | TA | T | 0.00406 | CDS | 4470 | 5511 | -4414 |
| chrM | 55 | . | T | C | 0.001874 | CDS | 4470 | 5511 | -4414 |
I put this all together in a Gist.
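Since none of the mock POS values above actually lies inside a mock CDS range when compared numerically, here is a self-contained sketch of the same pipeline on in-memory data with one extra, hypothetical location row (POS 3310, with made-up Ref, ALT and Frequency values) that does fall inside 3307-4262. The helper names `load_ranges` and `annotate` are illustrative only, and Start/End are converted to integers when loading.

```python
import csv
import io
from collections import defaultdict

GENES_CSV = """Col1,Col2,Subject,Start,End
chrM,ENSEMBL,CDS,3307,4262
chrM,ENSEMBL,CDS,4470,5511

chrM,ENSEMBL,exon,100,200
"""

# The second data row is hypothetical, added so that something matches
LOCATION_CSV = """name,POS,id,Ref,ALT,Frequency
chrM,42,rs377245343,T,TC,0.001562
chrM,3310,.,A,G,0.5
"""


def load_ranges(text):
    """Index (start, end) integer pairs by subject, skipping blank rows."""
    ranges = defaultdict(list)
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip header
    for row in reader:
        if not row:
            continue
        ranges[row[2]].append((int(row[3]), int(row[4])))
    return ranges


def annotate(location_text, ranges, subject='CDS'):
    """Return the header plus every location row whose POS lies in a range."""
    reader = csv.reader(io.StringIO(location_text))
    header = next(reader)
    out = [header + [subject, 'Start', 'End', 'sequence_cc']]
    for row in reader:
        pos = int(row[1])
        for start, end in ranges.get(subject, []):
            if start <= pos <= end:
                out.append(row + [subject, str(start), str(end), str(pos - start + 1)])
                break  # first matching range wins
    return out


result = annotate(LOCATION_CSV, load_ranges(GENES_CSV))
for line in result:
    print(line)
```

Here POS 42 is skipped and POS 3310 is annotated with the 3307-4262 range and an offset of 4 (3310 - 3307 + 1).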
