Python regex pattern to find strings enclosed in preprocessor directives

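I am trying to read a C source file with Python in order to extract the header files it includes.
The headers are specified between #ifdef TYPEA and either #else or #endif. If there is an #else clause, the headers are always specified before it.

Let's assume an excerpt of the source content looks like this: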

source_content = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    'void abc ( int value) {',
    '  return 5 ** 2.5',
    '}',
    '',
    'abc',
    '#ifdef TYPEA',                                 # <---- begin identifier, may contain leading/trailing whitespaces
    '#include "some_header.h"',                     # <---- I want these lines
    '           #include "some_other_header23.h"',  # <---- I want these lines
    '           #else        ',                     # optional stop identifier, may contain leading/trailing whitespaces
    'double in_fact_int = 5;',                      # some irrelevant content
    '         #endif    ',                          # final stop identifier, may contain leading/trailing whitespaces
    '',
    'a = 5',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

I want to extract the lines marked with comments, excluding #ifdef TYPEA, #else and #endif, so that my result is:

desired_match = '#include "some_header.h"\n           #include "some_other_header23.h"'

print(desired_match)
# Out:    #include "some_header.h"
# Out:               #include "some_other_header23.h"

Stripping the whitespace would be nice, but I can do that separately from the regex.

My current approach is:

import re

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(.*?)(?=((\s*)#else|(\s*)#endif))',
    re.DOTALL
)
match = re.match(pattern, source_content)

print(match.group())
# Out:    #ifdef TYPEZERO
# Out:      int someint = 42;
# Out:    #endif
# Out:    void abc ( int value) {
# Out:      return 5 ** 2.5
# Out:    }
# Out:    
# Out:    abc
# Out:    #ifdef TYPEA
# Out:    #include "some_header.h"
# Out:               #include "some_other_header23.h"

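This works for cutting off at #else or #endif, but as you can see, #ifdef TYPEA and, worse, all the preceding lines are matched as well.
If I remove the leading (\s*.*) from the pattern (or change it to (\s*)), I get no match at all.

How can I exclude the lines before #ifdef TYPEA, and ideally #ifdef TYPEA itself, to get my desired match? Thanks in advance!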

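uj5u.com helpful user replied:

Here is one way to do it using a named group. You had already solved most of the problem.

Note the change to the regex: the part after #ifdef TYPEA is now wrapped in the named group (?P<M>...).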

import re

source_content = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    'void abc ( int value) {',
    '  return 5 ** 2.5',
    '}',
    '',
    'abc',
    '#ifdef TYPEA',                                 # <---- begin identifier, may contain leading/trailing whitespaces
    '#include "some_header.h"',                     # <---- I want these lines
    '           #include "some_other_header23.h"',  # <---- I want these lines
    '           #else        ',                     # optional stop identifier, may contain leading/trailing whitespaces
    'double in_fact_int = 5;',                      # some irrelevant content
    '         #endif    ',                          # final stop identifier, may contain leading/trailing whitespaces
    '',
    'a = 5',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(?P<M>(.*?)(?=((\s*)#else|(\s*)#endif)))',
    re.DOTALL
)
match = re.match(pattern, source_content)

print(match.group( "M" ))
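One practical caveat with this approach: when the pattern does not match, the match object is None and calling .group() on it raises an AttributeError. A minimal sketch of a guarded helper is shown below. The function name extract_ifdef_block, the group name body and the shortened sample are made up for illustration, and the sketch uses search() rather than match() so that no leading catch-all group is needed.

```python
import re

def extract_ifdef_block(source, macro):
    """Return the text between '#ifdef <macro>' and the next #else/#endif, or None."""
    # Hypothetical helper: the named group 'body' holds only the wanted lines,
    # and the lookahead stops before the whitespace that precedes #else/#endif.
    pattern = re.compile(
        r'#ifdef\s+' + re.escape(macro) + r'\b\s*(?P<body>.*?)(?=\s*#(?:else|endif))',
        re.DOTALL,
    )
    match = pattern.search(source)
    # search() returns None when nothing matches, so guard before calling .group()
    return match.group('body') if match else None

# Shortened stand-in for source_content (an assumption, not the original data)
sample = '\n'.join([
    'int x = 1;',
    '#ifdef TYPEA',
    '#include "some_header.h"',
    '   #include "some_other_header23.h"',
    '   #else   ',
    'double y = 5;',
    '#endif',
])

print(extract_ifdef_block(sample, 'TYPEA'))
print(extract_ifdef_block(sample, 'TYPEB'))  # None: the sample has no such block
```

Using re.escape keeps the macro name from being interpreted as regex syntax, and the \b prevents TYPEA from also matching a longer name such as TYPEAB.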

uj5u.com helpful user replied:

You can use re.search instead of re.match, and use a group number to get the part of the regex result you need.

pattern = re.compile(
    r'\s+#ifdef\s+TYPEA\s*(.*?)(?=(\s*#else|\s*#endif))',
    re.DOTALL
)
match = re.search(pattern, source_content)

print(match.group(1))

Does that solve your problem?
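The reason the switch matters is that re.match only attempts a match at the very beginning of the string, while re.search scans forward for the first position where the pattern fits. The short sketch below illustrates the difference; the text and the simplified pattern are made up for this example and are not the original source_content.

```python
import re

# Illustrative input: the directive is not at the start of the string
text = 'abc\n#ifdef TYPEA\n#include "some_header.h"\n#endif'
pattern = re.compile(r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#endif)', re.DOTALL)

# match() only tries position 0, where the text starts with 'abc', so it fails.
print(pattern.match(text))   # None

# search() scans forward until the pattern fits somewhere in the string.
found = pattern.search(text)
print(found.group(1))        # #include "some_header.h"
```

This is also why the original pattern needed a leading (\s*.*) to match anything at all: with re.match, something has to consume the text in front of #ifdef TYPEA.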

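uj5u.com helpful user replied:

If you only want what is between #ifdef TYPEA and #else or #endif, you can match the whole thing and create a group between those keywords. re.findall will return the groups: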

import re
comment_pattern = re.compile(r'#ifdef TYPEA(.*?)(?:#else|#endif)', re.MULTILINE | re.DOTALL)
print(*re.findall(comment_pattern, source_content), sep='\n-------------\n')

Output:

#include "some_header.h"
           #include "some_other_header23.h"
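The group captured this way still carries the surrounding newlines and indentation. Since the question mentions stripping whitespace as a separate step, here is a minimal sketch of that post-processing. The variable block is an assumed stand-in for one captured group, and include_re is a hypothetical helper pattern, not part of any answer above.

```python
import re

# Assumed shape of one captured group: leading newline, indented lines, trailing spaces
block = '\n#include "some_header.h"\n           #include "some_other_header23.h"\n           '

# Strip each line and drop the empty ones left over from the surrounding newlines.
lines = [line.strip() for line in block.splitlines() if line.strip()]
print(lines)

# Optionally pull out just the file names from quoted or angle-bracket includes.
include_re = re.compile(r'#\s*include\s*[<"]([^>"]+)[>"]')
headers = include_re.findall(block)
print(headers)
```

Keeping this step separate from the main regex leaves the block-matching pattern simple and makes it easy to decide later whether the full include lines or only the file names are needed.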
