Flattening XML data into a pandas DataFrame

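How can I convert the XML file at this address into a pandas DataFrame? I downloaded the XML as a file named '058com.xml' and ran the code below, but the last column of the resulting DataFrame is a jumble of data arranged as multiple OrderedDicts. The XML structure seems complex and is beyond my knowledge.

The json_normalize documentation confuses me. How can I improve the code so that the XML is fully flattened?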

import pandas as pd
import xmltodict

rawdata = '058com.xml'

with open(rawdata) as fd:
    doc = xmltodict.parse(fd.read(), encoding='ISO-8859-1', process_namespaces=False)

pd.json_normalize(doc['Election']['Departement']['Communes']['Commune'])

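Ideally, the DataFrame should contain the IDs, the names of the geographic entities, the vote results, and the names of the candidates.

When fully flattened, the final DataFrame should contain many columns and is expected to be very close to the CSV below. I have pasted the header and first row of the (semicolon-separated) .csv as a representative sample of what the DataFrame should look like.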

Code du département;Libellé du département;Code de la commune;Libellé de la commune;Etat saisie;Inscrits;Abstentions;% Abs/Ins;Votants;% Vot/Ins;Blancs;% Blancs/Ins;% Blancs/Vot;Nuls;% Nuls/Ins;% Nuls/Vot;Exprimés;% Exp/Ins;% Exp/Vot;N°Panneau;Sexe;Nom;Prénom;Voix;% Voix/Ins;% Voix/Exp
01;Ain;001;L'Abergement-Clémenciat;Complet;645;108;16,74;537;83,26;16;2,48;2,98;1;0,16;0,19;520;80,62;96,83;1;F;ARTHAUD;Nathalie;3;0,47;0,58;2;M;ROUSSEL;Fabien;6;0,93;1,15;3;M;MACRON;Emmanuel;150;23,26;28,85;4;M;LASSALLE;Jean;18;2,79;3,46;5;F;LE PEN;Marine;149;23,10;28,65;6;M;ZEMMOUR;Éric;43;6,67;8,27;7;M;MÉLENCHON;Jean-Luc;66;10,23;12,69;8;F;HIDALGO;Anne;5;0,78;0,96;9;M;JADOT;Yannick;30;4,65;5,77;10;F;PÉCRESSE;Valérie;26;4,03;5,00;11;M;POUTOU;Philippe;3;0,47;0,58;12;M;DUPONT-AIGNAN;Nicolas;21;3,26;4,04

Answer from a uj5u.com user:

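Since the XML at the URL contains two data sections under each <Tour>, namely <Mentions> (which appears to be aggregate vote data) and <Candidats> (which is granular, person-level data) (pardon my French), consider building two separate DataFrames with the newer IO method pandas.read_xml, which supports XSLT 1.0 (via the third-party lxml package). There is no need to migrate to dictionaries for JSON handling.

As a special-purpose language written in XML, XSLT can transform your nested structure into a flatter format for migration to a DataFrame. Specifically, each stylesheet below drills down to the most granular node and then, via the ancestor axis, pulls higher-level information in as sibling columns.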
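
To make the idea concrete without XSLT, here is a minimal standard-library sketch of the same flattening: walk down to each Candidat node and copy the commune-level and tour-level fields onto every row. The tiny XML sample is invented for illustration; it only mimics the element names seen in this thread and is not the real file. The function name flatten_candidats is likewise made up.

```python
import xml.etree.ElementTree as ET

# Invented miniature sample that mimics the nesting of the real file.
SAMPLE = """
<Election>
  <Departement>
    <CodDpt>58</CodDpt>
    <Communes>
      <Commune>
        <CodSubCom>001</CodSubCom>
        <LibSubCom>Achun</LibSubCom>
        <Tours>
          <Tour>
            <NumTour>1</NumTour>
            <Resultats>
              <Candidats>
                <Candidat><NomPsn>ARTHAUD</NomPsn><NbVoix>0</NbVoix></Candidat>
                <Candidat><NomPsn>ROUSSEL</NomPsn><NbVoix>3</NbVoix></Candidat>
              </Candidats>
            </Resultats>
          </Tour>
        </Tours>
      </Commune>
    </Communes>
  </Departement>
</Election>
"""


def flatten_candidats(xml_text):
    """Return one flat dict per <Candidat>, with ancestor fields copied in."""
    root = ET.fromstring(xml_text)
    rows = []
    for commune in root.iter("Commune"):
        # Commune-level leaf fields (skip the nested <Tours> container).
        commune_info = {c.tag: c.text for c in commune if c.tag != "Tours"}
        for tour in commune.iter("Tour"):
            tour_info = {"NumTour": tour.findtext("NumTour")}
            for candidat in tour.iter("Candidat"):
                row = {**commune_info, **tour_info}
                row.update({c.tag: c.text for c in candidat})
                rows.append(row)
    return rows


rows = flatten_candidats(SAMPLE)
for row in rows:
    print(row)
```

A list of dicts like this can be passed to pd.DataFrame(rows). The XSLT route achieves the same result declaratively and lets read_xml do the parsing.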

Mentions (save as an .xsl file, which is a special kind of .xml file, or embed as a string in Python)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  
  <xsl:template match="/">
    <Tours>
      <xsl:apply-templates select="descendant::Tour/Mentions"/>
    </Tours>
  </xsl:template>
  
  <xsl:template match="Mentions/*">
    <Mention>
      <xsl:copy-of select="ancestor::Election/Scrutin/*"/>
      <xsl:copy-of select="ancestor::Departement/*[name()!='Communes']"/>
      <xsl:copy-of select="ancestor::Commune/*[name()!='Tours']"/>
      <xsl:copy-of select="ancestor::Tour/NumTour"/>
      <Mention><xsl:value-of select="name()"/></Mention>
      <xsl:copy-of select="*"/>
    </Mention>
  </xsl:template>
  
</xsl:stylesheet>

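Python (reading directly from the URL)

import pandas as pd

url = (
    "https://www.resultats-elections.interieur.gouv.fr/telechargements/"
    "PR2022/resultatsT1/027/058/058com.xml"
)

# mentions_xsl: path to the stylesheet above, or the stylesheet itself as a string
mentions_df = pd.read_xml(url, stylesheet=mentions_xsl)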

Output

                Type  Annee  CodReg  CodReg3Car                   LibReg  CodDpt  CodMinDpt  CodDpt3Car  LibDpt  CodSubCom    LibSubCom  NumTour      Mention  Nombre RapportInscrit RapportVotant
0     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1     Inscrits     105           None          None
1     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1  Abstentions      24          22,86          None
2     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1      Votants      81          77,14          None
3     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1       Blancs       2           1,90          2,47
4     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1         Nuls       0           0,00          0,00
             ...    ...     ...         ...                      ...     ...        ...         ...     ...        ...          ...      ...          ...     ...            ...           ...
1849  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1  Abstentions      13          14,94          None
1850  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1      Votants      74          85,06          None
1851  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1       Blancs       1           1,15          1,35
1852  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1         Nuls       0           0,00          0,00
1853  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1     Exprimes      73          83,91         98,65

[1854 rows x 16 columns]

Candidats (save as an .xsl file, which is a special kind of .xml file, or embed as a string in Python)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  
  <xsl:template match="/">
    <Candidats>
      <xsl:apply-templates select="descendant::Tour/Resultats/Candidats"/>
    </Candidats>
  </xsl:template>
  
  <xsl:template match="Candidat">
    <xsl:copy>
      <xsl:copy-of select="ancestor::Election/Scrutin/*"/>
      <xsl:copy-of select="ancestor::Departement/*[name()!='Communes']"/>
      <xsl:copy-of select="ancestor::Commune/*[name()!='Tours']"/>
      <xsl:copy-of select="ancestor::Tour/NumTour"/>
      <xsl:copy-of select="*"/>
    </xsl:copy>
  </xsl:template>
  
</xsl:stylesheet>

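Python (reading directly from the URL)

url = (
    "https://www.resultats-elections.interieur.gouv.fr/telechargements/"
    "PR2022/resultatsT1/027/058/058com.xml"
)

# candidats_xsl: path to the stylesheet above, or the stylesheet itself as a string
candidats_df = pd.read_xml(url, stylesheet=candidats_xsl)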

Output

                Type  Annee  CodReg  CodReg3Car                   LibReg  CodDpt  CodMinDpt  CodDpt3Car  LibDpt  CodSubCom    LibSubCom  NumTour  NumPanneauCand         NomPsn PrenomPsn CivilitePsn  NbVoix RapportExprime RapportInscrit
0     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               1        ARTHAUD  Nathalie         Mme       0           0,00           0,00
1     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               2        ROUSSEL    Fabien          M.       3           3,80           2,86
2     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               3         MACRON  Emmanuel          M.      14          17,72          13,33
3     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               4       LASSALLE      Jean          M.       2           2,53           1,90
4     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               5         LE PEN    Marine         Mme      28          35,44          26,67
             ...    ...     ...         ...                      ...     ...        ...         ...     ...        ...          ...      ...             ...            ...       ...         ...     ...            ...            ...
3703  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1               8        HIDALGO      Anne         Mme       0           0,00           0,00
3704  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1               9          JADOT   Yannick          M.       4           5,48           4,60
3705  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              10       PÉCRESSE   Valérie         Mme       6           8,22           6,90
3706  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              11         POUTOU  Philippe          M.       1           1,37           1,15
3707  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              12  DUPONT-AIGNAN   Nicolas          M.       4           5,48           4,60

[3708 rows x 19 columns]

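You can join the resulting DataFrames on the nodes they share from Communes, <CodSubCom> and <LibSubCom>, but you may first have to run pivot_table on the aggregate data so that the merge is one-to-many. The following demonstrates this with the Nombre aggregate: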

mentions_candidats_df = (
    candidats_df.merge(
        mentions_df.pivot_table(
            index=["CodSubCom", "LibSubCom"],
            columns="Mention",
            values="Nombre",
            aggfunc="max"
        ).reset_index(),
        on=["CodSubCom", "LibSubCom"]
    )
)

mentions_candidats_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3708 entries, 0 to 3707
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Type            3708 non-null   object
 1   Annee           3708 non-null   int64 
 2   CodReg          3708 non-null   int64 
 3   CodReg3Car      3708 non-null   int64 
 4   LibReg          3708 non-null   object
 5   CodDpt          3708 non-null   int64 
 6   CodMinDpt       3708 non-null   int64 
 7   CodDpt3Car      3708 non-null   int64 
 8   LibDpt          3708 non-null   object
 9   CodSubCom       3708 non-null   int64 
 10  LibSubCom       3708 non-null   object
 11  NumTour         3708 non-null   int64 
 12  NumPanneauCand  3708 non-null   int64 
 13  NomPsn          3708 non-null   object
 14  PrenomPsn       3708 non-null   object
 15  CivilitePsn     3708 non-null   object
 16  NbVoix          3708 non-null   int64 
 17  RapportExprime  3708 non-null   object
 18  RapportInscrit  3708 non-null   object
 19  Abstentions     3708 non-null   int64 
 20  Blancs          3708 non-null   int64 
 21  Exprimes        3708 non-null   int64 
 22  Inscrits        3708 non-null   int64 
 23  Nuls            3708 non-null   int64 
 24  Votants         3708 non-null   int64 
dtypes: int64(16), object(9)
memory usage: 753.2+ KB

Starting with pandas 1.5, read_xml supports a dtype argument, which in this case allows type conversion after the XSLT transformation.
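
The percentage columns (RapportExprime, RapportInscrit, RapportVotant) come through as object dtype because the source uses a decimal comma. A small helper along the following lines is one way to convert them. The function name is made up for this sketch; it could be applied per column with Series.map, or supplied through the converters argument that read_xml gained alongside dtype.

```python
def parse_decimal_comma(value):
    """Convert a string such as '22,86' to 22.86.

    Non-string values (None, NaN) are passed through unchanged.
    """
    if not isinstance(value, str):
        return value
    return float(value.replace(",", "."))


samples = ["22,86", "0,00", None]
print([parse_decimal_comma(v) for v in samples])  # [22.86, 0.0, None]
```

For example, candidats_df["RapportExprime"].map(parse_decimal_comma) would return the column as numbers, assuming every non-missing value follows the "digits,digits" pattern shown in the output above.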

Answer from a uj5u.com user:

I tried this:

import pandas as pd
import xmltodict

rawdata = '058com.xml'

with open(rawdata) as fd:
    doc = xmltodict.parse(fd.read(), encoding='ISO-8859-1', process_namespaces=False)

df = pd.json_normalize(doc['Election']['Departement']['Communes']['Commune'])


col_length_df = len(df.columns)
# Keep every column except the last, then append the keys of the first candidate's dict
all_columns = list(df.columns[:-1]) + list(df.iloc[0, len(df.columns)-1][0].keys())

new_df = df.reindex(columns = all_columns)

new_df.astype({"RapportExprime": str, "RapportInscrit": str}).dtypes

for index, rows in new_df.iterrows():
    new_df.iloc[index, col_length_df-1:] = list(df.iloc[index, len(df.columns)-1][0].values())

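Because the last column of df holds a list of ordered dictionaries, the code takes the keys of the first one and adds them to new_df as empty columns alongside df's original columns. Finally, it iterates over the rows of df and new_df to fill in those empty columns. Only element [0] of each list is read, so each row carries just the first candidate of its commune.

The code above gives us: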

    CodSubCom           LibSubCom Tours.Tour.NumTour Tours.Tour.Mentions.Inscrits.Nombre Tours.Tour.Mentions.Abstentions.Nombre  ... PrenomPsn CivilitePsn NbVoix RapportExprime RapportInscrit
0         001               Achun                  1                                 105                                     24  ...  Nathalie         Mme      0           0,00           0,00
1         002       Alligny-Cosne                  1                                 696                                    133  ...  Nathalie         Mme      3           0,54           0,43
2         003   Alligny-en-Morvan                  1                                 533                                    123  ...  Nathalie         Mme      5           1,25           0,94
3         004               Alluy                  1                                 263                                     48  ...  Nathalie         Mme      1           0,48           0,38
4         005               Amazy                  1                                 188                                     51  ...  Nathalie         Mme      2           1,53           1,06
..        ...                 ...                ...                                 ...                                    ...  ...       ...         ...    ...            ...            ...
304       309        Villapourçon                  1                                 327                                     70  ...  Nathalie         Mme      1           0,40           0,31
305       310     Villiers-le-Sec                  1                                  34                                      4  ...  Nathalie         Mme      0           0,00           0,00
306       311         Ville-Langy                  1                                 203                                     46  ...  Nathalie         Mme      1           0,64           0,49
307       312  Villiers-sur-Yonne                  1                                 263                                     60  ...  Nathalie         Mme      0           0,00           0,00
308       313         Vitry-Laché                  1                                  87                                     13  ...  Nathalie         Mme      1           1,37           1,15

Finally, new_df.columns is:

Index(['CodSubCom', 'LibSubCom', 'Tours.Tour.NumTour',
       'Tours.Tour.Mentions.Inscrits.Nombre',
       'Tours.Tour.Mentions.Abstentions.Nombre',
       'Tours.Tour.Mentions.Abstentions.RapportInscrit',
       'Tours.Tour.Mentions.Votants.Nombre',
       'Tours.Tour.Mentions.Votants.RapportInscrit',
       'Tours.Tour.Mentions.Blancs.Nombre',
       'Tours.Tour.Mentions.Blancs.RapportInscrit',
       'Tours.Tour.Mentions.Blancs.RapportVotant',
       'Tours.Tour.Mentions.Nuls.Nombre',
       'Tours.Tour.Mentions.Nuls.RapportInscrit',
       'Tours.Tour.Mentions.Nuls.RapportVotant',
       'Tours.Tour.Mentions.Exprimes.Nombre',
       'Tours.Tour.Mentions.Exprimes.RapportInscrit',
       'Tours.Tour.Mentions.Exprimes.RapportVotant', 'NumPanneauCand',
       'NomPsn', 'PrenomPsn', 'CivilitePsn', 'NbVoix', 'RapportExprime',
       'RapportInscrit'],
      dtype='object')

Total number of columns in new_df: 24
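
To get one row per candidate rather than only the first, a sketch like the following could be used. It works on plain nested dicts shaped like the xmltodict output and is shown here on a made-up two-candidate sample. The helper names are illustrative, not part of any library; as_list guards against xmltodict returning a single dict instead of a list when an element has only one child.

```python
def as_list(value):
    """xmltodict yields a dict for a single child and a list for several."""
    return value if isinstance(value, list) else [value]


def expand_candidates(communes):
    """Return one flat row per (commune, tour, candidate)."""
    rows = []
    for commune in as_list(communes):
        for tour in as_list(commune["Tours"]["Tour"]):
            for cand in as_list(tour["Resultats"]["Candidats"]["Candidat"]):
                rows.append({
                    "CodSubCom": commune["CodSubCom"],
                    "LibSubCom": commune["LibSubCom"],
                    "NumTour": tour["NumTour"],
                    **cand,  # candidate-level fields such as NomPsn, NbVoix
                })
    return rows


# Made-up sample shaped like doc['Election']['Departement']['Communes']['Commune']
sample = [{
    "CodSubCom": "001",
    "LibSubCom": "Achun",
    "Tours": {"Tour": {
        "NumTour": "1",
        "Resultats": {"Candidats": {"Candidat": [
            {"NomPsn": "ARTHAUD", "NbVoix": "0"},
            {"NomPsn": "ROUSSEL", "NbVoix": "3"},
        ]}},
    }},
}]

rows = expand_candidates(sample)
print(len(rows))
```

The resulting list can be passed to pd.DataFrame(rows), giving a long table with one line per candidate and commune.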
