Splitting a large XML file into multiple files by tag in Python

I have a very large XML file that I need to split into multiple files based on a specific tag. The XML file looks like this:

<xml>
<file id="13">
  <head>
    <talkid>2458</talkid>
    <transcription>
      <seekvideo id="645">So in college,</seekvideo>
      ...
    </transcription>
  </head>
  <content> *** This is the content I am trying to save *** </content>
</file>
<file>
      ... 
</file>
</xml>

I want to extract the content of each file element and save it under a name based on its talkid.

This is the code I tried:

import xml.etree.ElementTree as ET

all_talks = 'path\\to\\big\\file'

context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        content = elem.find('content').text
        title = elem.find('talkid').text
        filename = format(title + ".txt")
        with open(filename, 'wb', encoding='utf-8') as f:
            f.write(ET.tostring(content), encoding='utf-8')

But I get the following error:

AttributeError: 'NoneType' object has no attribute 'text'

A uj5u.com user replied:

If you are already using .iterparse(), a more general approach is to rely only on the events:

import xml.etree.ElementTree as ET
from pathlib import Path

all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))

for event, element in context:
    if event == 'end':
        if element.tag == 'talkid':
            title = element.text
        elif element.tag == 'content':
            content = element.text
        elif element.tag == 'file' and title and content:
            with open(all_talks.with_name(title + '.txt'), 'w') as f:
                f.write(content)
    elif element.tag == 'file':
        content = title = None
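
Since the file in question is very large, it may also help to release each file element once it has been handled, because iterparse() still builds the tree as it goes. The following is only a sketch of that idea: the helper name iter_talks and the in-memory sample document are made up for illustration, and the empty file elements still remain attached to the root after clear().

```python
import io
import xml.etree.ElementTree as ET


def iter_talks(source):
    """Yield (talkid, content) pairs, one per <file> element (hypothetical helper)."""
    title = content = None
    for event, element in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            if element.tag == 'file':
                # A new <file> begins: reset the per-file state.
                title = content = None
        elif element.tag == 'talkid':
            title = element.text
        elif element.tag == 'content':
            content = element.text
        elif element.tag == 'file':
            if title and content:
                yield title, content
            # Drop the children and text of the finished <file> to free memory.
            element.clear()


# Small in-memory stand-in for the real file (assumed structure, not real data).
sample = io.BytesIO(
    b'<xml>'
    b'<file id="13"><head><talkid>2458</talkid></head>'
    b'<content>first talk</content></file>'
    b'<file id="14"><head><talkid>2459</talkid></head>'
    b'<content>second talk</content></file>'
    b'</xml>'
)
talks = list(iter_talks(sample))
print(talks)
```

Each pair can then be written out in the same way as in the loop above, for example with open(title + '.txt', 'w', encoding='utf-8').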

Update. In a similar question, @Leila asked how to write the text of all <seekvideo> tags to the file instead of the <content> text, so here is a solution:

import xml.etree.ElementTree as ET
from pathlib import Path

all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))

for event, element in context:
    if event == 'end':
        if element.tag == 'file' and title and parts:
            with open(all_talks.with_name(title + '.txt'), 'w') as f:
                f.write('\n'.join(parts))
        elif element.text:
            if element.tag == 'talkid':
                title = element.text
            elif element.tag == 'seekvideo':
                parts.append(element.text)
    elif element.tag == 'file':
        title = None
        parts = []

A uj5u.com user replied:

Try it like this.

The problem is that talkid is a child of the head tag, not of the file tag.

import xml.etree.ElementTree as ET

all_talks = 'file.xml'

context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        head = elem.find('head')
        content = elem.find('content').text
        title = head.find('talkid').text
        filename = format(title + ".txt")
        with open(filename, 'wb') as f:  # 'wt' or just 'w' if you want to write text instead of bytes
            f.write(content.encode())    # in which case you would remove the .encode()
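
The reason for the original AttributeError is that find() with a bare tag name only looks at direct children. As a small sketch of the same fix, a path expression can reach talkid in one step, and findtext() returns the text directly. The sample element below is made up to mirror the structure in the question.

```python
import xml.etree.ElementTree as ET

# Made-up element mirroring one <file> entry from the question.
file_elem = ET.fromstring(
    '<file id="13">'
    '<head><talkid>2458</talkid></head>'
    '<content>some content</content>'
    '</file>'
)

# A bare tag name only matches direct children, so this is None.
direct = file_elem.find('talkid')

# A path walks through <head> to reach <talkid>.
title = file_elem.findtext('head/talkid')

# './/' searches all descendants, whatever the nesting depth.
anywhere = file_elem.findtext('.//talkid')

# default= avoids None when the element is missing.
content = file_elem.findtext('content', default='')

print(direct, title, anywhere, content)
```

Note that findtext() returns None, or the given default, when nothing matches, so a missing talkid can be checked for before building the filename.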

A uj5u.com user replied:

You can use Beautiful Soup to parse the XML.

It would go like this (I added a second talk id to the XML to demonstrate finding multiple tags):

xml_file = '''<xml>
<file id="13">
  <head>
    <talkid>2458</talkid>
    <transcription>
      <seekvideo id="645">So in college,</seekvideo>
      ...
    </transcription>
     <talkid>second talk id</talkid>
  </head>
  <content> *** This is the content I am trying to save *** </content>
</file>
<file>
      ... 
</file>
</xml>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(xml_file, "xml")

first_talk_id = soup.find('talkid').get_text()
talk_ids = soup.findAll('talkid')

print(first_talk_id)
# prints 2458


for talk in talk_ids:
    print(talk.get_text())

# prints 
# 2458
# second talk id

Note: to parse XML with bs4 you need to install a parser, for example: pip install lxml
