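I am trying to read a FASTA file. I want to remove/ignore the header/info lines that start with ">" and store each of the sequences that follow in its own string. Below is the code I have for this (partially modified from https://rosettacode.org/wiki/FASTA_format#C because my original attempt worked even less well). They have a good example of what I want to do.
My problem is with this FASTA file:
>sequence_1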
MSTAGKVIKCKAAVLWELHKPFTIEDIEVAPPKAHEVRIKMVATGVCRSDDHVVSGTLVTPLPAVLGHE
GAGIVEGVTCVKPGDKVIPLFSPQCGECRICKHPESNFCSRSDLLMPRGTLREGTSRFSCKGKQIHNFI
STSTFSQYTVVDDIAVAKIDGASPLDKVCLIGCGFSTGYGSAVKVAKVTPGSTCAVFGLGGVGLSVIIG
CKAAGAARIIAVDINKDKFAKAKELGATECIYSKPIQEVLQEMTDGGVDFSFEVIGRLDTMTSALLSCH
AACGVSVVVGVPPNAQNLSMNPMLLLLGRTWKGAIFGGFKSKDSVPKLVAKKFPLDPLITHVLPFEKIN
EAFDLLRSGKSIRTVLTF
>sequence_2
MNQGKVIKCKAAVLWEVKKPFSIEDVEVAPPKAYEVRIKMVAVGICHTDDHVVSGNLVTPLPVILGHEA
AGIVESVGEGVTTVKPGDKVIPLFTCRVCKNPESNYCLKNDLGNPRGTLQDGTRRFTCRGKPIHHFLGT
STFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTGYGSAVNVAKVTPGSTCAVFGLGGVGLSAVMGCK
AAGAARIIAVDINKDKFAKAKELGATECINPQDYKLPIQEVLKEMTDGSTVIGRLDTMMASLLCCGTSV
IVEDTPASQNLSINPMLLLTGRTWKGAVYGGFKSKEGIPKLVADFMAKKFSLDALITHVLPFEKINEGF
DLLHSGKSIRTVLTF
My output:
Sequence 1: MSTAGKVIKCKAAVLWELHKPFTIEDIEVAPPKAHEVRIKMVATGVCRSDDHVVSGTLVTPLPAVLGHEGAGIVEGVTCVKPGDKVIPLFSPQCGECRICKHPESNFCSRSDLLMPRGTLREGTSRFSCKGKQIHNFISTSTFSQYTVVDDIAVAKIDGASPLDKVCLIGCGFSTGYGSAVKVAKVTPGSTCAVFGLGGVGLSVIIGCKAAGAARIIAVDINKDKFAKAKELGATECIYSKPIQEVLQEMTDGGVDFSFEVIGRLDTMTSALLSCHAACGVSVVVGVPPNAQNLSMNPMLLLLGRTWKGAIFGGFKSKDSVPKLVAKKFPLDPLITHVLPFEKINEAFDLLRSGKSIRTVLTF
Sequence 2: MNQGKVIKCKAAVLWEVKKPFSIEDVEVAPPKAYEVRIKMVAVGICHTDDHVVSGNLVTPLPVILGHEAAGIVESVGEGVTTVKPGDKVIPLFTCRVCKNPESNYCLKNDLGNPRGTLQDGTRRFTCRGKPIHHFLGTSTFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTGYGSAVNVAKVTPGSTCAVFGLGGVGLSAVMGCKAAGAARIIAVDINKDKFAKAKELGATECINPQDYKLPIQEVLKEMTDGSTVIGRLDTMMASLLCCGTSVIVEDTPASQNLSINPMLLLTGRTWKGAVYGGFKSKEGIPKLVADFMAKKFSLDALITHVLPFEKINEGF
The last line or so of sequence 2 gets cut off... Any help or solutions?
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

void read_in_Protein(string Protein_filename)
{ // read in the sequences
    fstream myfile;
    myfile.open(Protein_filename, ios::in);
    if (!myfile.is_open()) {
        cerr << "Error can not open file" << endl;
        exit(1);
    }
    string Protein_Sequences{};
    string Protein_Seq_names{};
    // string temp{};
    string Prot_Seq1{};
    string Prot_Seq2{};
    string line{};
    while (getline(myfile, line).good()) {
        //std::cout << "Line input received (" << line.length() << "): " << line << std::endl;
        if (line.empty() || line[0] == '>') { // Identifier marker
            if (!Protein_Seq_names.empty()) { // Print out what we read from the last entry
                //std::cout << "\tReseting to new sequence" << std::endl;
                // cout << Protein_Sequences << endl;
                Protein_Seq_names.clear();
                Prot_Seq1 = Protein_Sequences;
            }
            if (!line.empty()) {
                //std::cout << "\tSetting sequence start" << std::endl;
                Protein_Seq_names = line.substr(1);
            }
            // std::cout << "\tClearing sequences..." << std::endl;
            Protein_Sequences.clear();
        }
        else if (!Protein_Seq_names.empty()) {
            line = line.substr(0, line.length() - 1);
            if (line.find(' ') != string::npos) { // Invalid sequence--no spaces allowed
                //std::cout << "\tSpace found, clearing buffers..." << std::endl;
                Protein_Seq_names.clear();
                Protein_Sequences.clear();
            }
            else {
                //std::cout << "\tAppending line to protein sequence..." << std::endl;
                Protein_Sequences += line;
            }
        }
        //std::cout << "Protein_Sequences: " << Protein_Sequences << std::endl;
    }
    if (!Protein_Seq_names.empty()) { // Print out what we read from the last entry
        // cout << Protein_Sequences << endl;
        Prot_Seq2 = Protein_Sequences;
    }
    cout << "\nSequence 1: " << Prot_Seq1 << endl;
    cout << Prot_Seq1.length();
    cout << "\nSequence 2: " << Prot_Seq2 << endl;
    cout << Prot_Seq2.length();
}
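Answer from a uj5u.com user:
Assuming your file does not end with a newline, the last call to std::getline sets the eof bit to indicate that it reached the end of the file before finding a line ending. The line itself is still read successfully, but because your while loop checks .good(), which is false as soon as the eof bit is set, that last line is discarded. You should check !fail() instead (or simply the boolean value of the stream itself, which is equivalent to !fail()):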
while (getline(myfile, line))
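After the last line has been read, the next iteration of the loop attempts a read while the stream is in the eof state, which fails immediately and breaks out of the loop.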
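To see the difference in isolation, here is a minimal sketch that runs both loop conditions over the same in-memory text via std::istringstream instead of a file. The helper names read_lines_good and read_lines_bool are made up for this illustration and are not part of the question's code.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: loops while the stream is still good() after each
// getline call. If the text has no trailing newline, getline extracts the
// final line but also sets eofbit, so good() is false and the line is lost.
std::vector<std::string> read_lines_good(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line).good()) {
        lines.push_back(line);
    }
    return lines;
}

// Hypothetical helper: loops while the stream converts to true, which is
// equivalent to !fail(). eofbit alone does not end the loop, so the final
// line is kept; the following read attempt fails and terminates the loop.
std::vector<std::string> read_lines_bool(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

With the text ">s\nAAA\nBBB" (no trailing newline), read_lines_good returns two lines and drops "BBB", while read_lines_bool returns all three. With a trailing newline both return three lines, which is why the problem only shows up for some files. One more thing to check once the last line is read: the question's code removes the final character of every sequence line unconditionally, presumably to drop a '\r' from Windows line endings. If so, a last line with no line ending at all would lose a real residue, so it may be safer to strip only when line.back() == '\r'.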
