BeautifulSoup: getting the text between tags on a line

I have a bunch of HTML files produced by the GCOV branch and line coverage tool. The files look like this:

<tr>
<td align="right" class="lineno"><pre>224</pre></td>
<td align="right" class="linebranch"><span class="takenBranch" title="Branch 1 taken 329 times">&check;</span><span class="notTakenBranch" title="Branch 2 not taken">&cross;</span><span class="notTakenBranch" title="Branch 4 not taken">&cross;</span><span class="takenBranch" title="Branch 5 taken 329 times">&check;</span><br/><span class="notTakenBranch" title="Branch 6 not taken">&cross;</span><span class="takenBranch" title="Branch 7 taken 329 times">&check;</span></td>
<td align="right" class="linecount coveredLine"><pre>329</pre></td>
<td align="left" class="src coveredLine"><pre>        line of C   code</pre></td>
</tr>

<tr>
<td align="right" class="lineno"><pre>225</pre></td>
<td align="right" class="linebranch"></td>
<td align="right" class="linecount uncoveredLine"><pre></pre></td>
<td align="left" class="src uncoveredLine"><pre>   another line of  C   code;</pre></td>
</tr>

I want to extract the "(another) line of C code" text and, ideally, the line number as well, so that the output looks like this:

224 line of C   code
225 another line of C   code

I tried using BeautifulSoup, but it does not produce the output I want. My code looks like this:

from itertools import islice
import codecs
import glob
from ntpath import join
import os
from bs4 import BeautifulSoup

lineNo = "<td align=\"right\" class=\"lineNo\"><pre>"
linetextCovered = "<td align=\"left\" class=\"src coveredLine\"><pre>"
linetextNotCovered = "<td align=\"left\" class=\"src uncoveredLine\"><pre>"
open('Output.txt', 'w').close() #Erase any content of Output.txt file

for filepath in glob.iglob('path/To/Reports/*.html'):
    with codecs.open(os.path.join(filepath), "r") as inputFile, open('Output.txt',"a") as outputFile:
        for num, line in enumerate(inputFile, 1):
            if lineNo in line:
                inputSoup = BeautifulSoup(line)
                text = inputSoup.getText()
                outputFile.write("".join(islice(text, 1) + "\t"))
            if linetextCovered or linetextNotCovered in line:
                inputSoup = BeautifulSoup(line)
                text = inputSoup.getText()
                outputFile.write("".join(islice(text, 4)))
            outputFile.write("\n")
print("Done")

But the output looks like this:

/* L
a:li
{

colo
text
}

What am I doing wrong? Any help is much appreciated.

A uj5u.com user replied:
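
It may help to first see why the script in the question prints four-character fragments. The sketch below reproduces three likely causes in isolation. The marker strings are copied from the question; `css_line` and `lineno_line` are hypothetical sample lines, not taken from an actual report.

```python
from itertools import islice

# Marker strings as written in the question's script.
line_no_marker = '<td align="right" class="lineNo"><pre>'
covered = '<td align="left" class="src coveredLine"><pre>'
uncovered = '<td align="left" class="src uncoveredLine"><pre>'

# Hypothetical input lines: one from a <style> block, one line-number cell.
css_line = "a:link { color: blue; }"
lineno_line = '<td align="right" class="lineno"><pre>224</pre></td>'

# `A or B in line` parses as `A or (B in line)`. A non-empty string is truthy,
# so the condition holds for every line, including CSS and other markup.
always_true = bool(covered or uncovered in css_line)

# The intended test checks each marker against the line separately.
intended = covered in css_line or uncovered in css_line

# The marker says "lineNo" but the report uses "lineno", so it never matches.
marker_matches = line_no_marker in lineno_line

# islice over a string yields characters: this keeps only the first four.
first_four = "".join(islice(css_line, 4))

print(always_true, intended, marker_matches, repr(first_four))
```

Taken together, every line of the file passes the second `if`, and only its first four characters are written, which would be consistent with output such as `a:li`.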

You can do it like this:

from bs4 import BeautifulSoup

html = '''
<tr>
<td align="right" class="lineno"><pre>224</pre></td>
<td align="right" class="linebranch"><span class="takenBranch" title="Branch 1 taken 329 times">&check;</span><span class="notTakenBranch" title="Branch 2 not taken">&cross;</span><span class="notTakenBranch" title="Branch 4 not taken">&cross;</span><span class="takenBranch" title="Branch 5 taken 329 times">&check;</span><br/><span class="notTakenBranch" title="Branch 6 not taken">&cross;</span><span class="takenBranch" title="Branch 7 taken 329 times">&check;</span></td>
<td align="right" class="linecount coveredLine"><pre>329</pre></td>
<td align="left" class="src coveredLine"><pre>        line of C   code</pre></td>
</tr>

<tr>
<td align="right" class="lineno"><pre>225</pre></td>
<td align="right" class="linebranch"></td>
<td align="right" class="linecount uncoveredLine"><pre></pre></td>
<td align="left" class="src uncoveredLine"><pre>   another line of  C   code;</pre></td>
</tr>
'''

# Each <tr> is one source line: the "lineno" cell holds the line number,
# the "src" cell holds the code text.
for tr in BeautifulSoup(html, 'html.parser').find_all('tr'):
    lineno = tr.find('td', {'class': 'lineno'}).text.strip()
    src    = tr.find('td', {'class': 'src'}).text.strip()
    print(lineno, src)
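
To apply the same idea to a whole directory of reports, as the script in the question attempts, something like the following might work. It is only a sketch, assuming every relevant row carries a `lineno` cell and a `src` cell as in the sample above. `extract_lines` and `write_report` are hypothetical helper names, and rows that lack either cell are skipped rather than raising an error.

```python
import glob

from bs4 import BeautifulSoup


def extract_lines(html):
    """Return (line number, source text) pairs for rows that have both cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        lineno_td = tr.find("td", class_="lineno")
        src_td = tr.find("td", class_="src")  # matches "src coveredLine" too
        if lineno_td is None or src_td is None:
            continue  # e.g. header rows without a source cell
        rows.append((lineno_td.get_text().strip(), src_td.get_text().strip()))
    return rows


def write_report(pattern, out_path):
    """Parse every file matching `pattern` and write tab-separated lines."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as report:
                for lineno, src in extract_lines(report.read()):
                    out.write(f"{lineno}\t{src}\n")
```

Calling `write_report('path/To/Reports/*.html', 'Output.txt')` would then write one line per source line. Parsing each file as a whole, instead of line by line, lets BeautifulSoup see complete `<tr>` elements.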

Please credit the source when reposting. Link to this article: https://www.uj5u.com/yidong/373181.html

Tags: Python, html, beautifulsoup
