Perl錨定正則運算式性能-有解無憂

問題和資料

這篇文章的底部是生成此 NYTProf 資料的整個腳本。該腳本構建一個散列，然后嘗試洗掉包含某些錯誤模式的鍵。通過 NYTProf 運行代碼生成以下內容

delete @$hash{ grep { /\Q$bad_pattern\E/ } sort keys %$hash };
# spent  7.29ms making    2 calls to main::CORE:sort, avg 3.64ms/call
# spent   808μs making 7552 calls to main::CORE:match, avg 107ns/call
# spent   806μs making 7552 calls to main::CORE:regcomp, avg 107ns/call

有超過 7,000 次呼叫 main::CORE:match 和 main::CORE:regcomp。假設這是足夠數量的呼叫來降低噪音水平。

繼續！如果壞模式出現在鍵的開頭，則只需洗掉它們。聽起來很棒！添加 ^ 以錨定正則運算式應該可以提高性能。但是，NYTProf 生成以下內容。NYTprof 已經運行了很多次了，這很一致

delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
# spent  7.34ms making    2 calls to main::CORE:sort, avg 3.67ms/call
# spent  1.62ms making 7552 calls to main::CORE:regcomp, avg 214ns/call
# spent   723μs making 7552 calls to main::CORE:match, avg 96ns/call

問題

錨定的正則運算式在這些 main::CORE:* 方法中花費的時間幾乎翻了一番。但是錨定的正則運算式應該可以提高性能。這個資料集有什么獨特之處導致錨定的正則運算式花費這么多額外的時間？

整個腳本

use strict;
use Devel::NYTProf;

my @states = qw(KansasCity MississippiState ColoradoMountain IdahoInTheNorthWest AnchorageIsEvenFurtherNorth);
my @cities = qw(WitchitaHouston ChicagoDenver);
my @streets = qw(DowntownMainStreetInTheCity CenterStreetOverTheHill HickoryBasketOnTheWall);
my @seasoncode = qw(8000S 8000P 8000F 8000W);
my @historycode = qw(7000S 7000P 7000F 7000W 7000A 7000D 7000G 7000H);
my @sides = qw(left right up down);

my $hash;
for my $state (@states) {
    for my $city (@cities) {
        for my $street (@streets) {
            for my $season (@seasoncode) {
                for my $history (@historycode) {
                    for my $side (@sides) {
                        $hash->{$state . '[0].' . $city . '[1].' . $street . '[2].' . $season . '.' . $history . '.' . $side} = 1;
                    }
                }
            }
        }
    }
}

sub CleanseHash {
    my @bad_patterns = (
        'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]',
        'ColoradoMountain[0].ChicagoDenver[1].HickoryBasketOnTheWall[2].8000F'
    );

    for my $bad_pattern (@bad_patterns) {
        delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
    }
}

DB::enable_profile();
CleanseHash();
DB::finish_profile();

uj5u.com熱心網友回復：

您不太可能優化正則運算式引擎。但是，如果性能是您的目標，您可以專注于代碼的其他部分。例如，試試這個：

for my $bad_pattern (@bad_patterns) {
    my $re = qr/^\Q$bad_pattern\E/;
    delete @$hash{ grep /$re/, sort keys %$hash };
}

在我的機器上，它運行得更快（不管錨點是否存在），因為 grep 的運算式形式不必創建范圍，并且對于每個壞模式，正則運算式的復雜編譯只發生一次。

uj5u.com熱心網友回復：

這是一個相當簡單的匹配，模式是一個固定的字串。所以錨定模式通常必須更快。分析證實了這一點，96 ns/call vs 107 ns/call。

但是當我對代碼的錨定和非錨定版本進行基準測驗時，它們并駕齊驅。這是關于代碼的其余部分，它壓倒了正則運算式的匹配：sortof 鍵不需要進行比較，并且正則運算式正在grep的回圈內編譯，不需要。

當這松了一口氣時，我確實讓錨定呼叫速度提高了 11--15%（多次運行）

use warnings;
use strict;
use feature 'say';

use Data::Dump;
use Storable qw(dclone);
use Benchmark qw(cmpthese);

my $runfor = shift // 3;

my @states = qw(KansasCity MississippiState ColoradoMountain IdahoInTheNorthWest AnchorageIsEvenFurtherNorth);
my @cities = qw(WitchitaHouston ChicagoDenver);
my @streets = qw(DowntownMainStreetInTheCity CenterStreetOverTheHill HickoryBasketOnTheWall);
my @seasoncode = qw(8000S 8000P 8000F 8000W);
my @historycode = qw(7000S 7000P 7000F 7000W 7000A 7000D 7000G 7000H);
my @sides = qw(left right up down);

my @bad_patterns = ( 
    'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]',
    'ColoradoMountain[0].ChicagoDenver[1].HickoryBasketOnTheWall[2].8000F'
);

my $hash_1;
for my $state (@states) {
    for my $city (@cities) {
        for my $street (@streets) {
            for my $season (@seasoncode) {
                for my $history (@historycode) {
                    for my $side (@sides) {
                        $hash_1->{$state . '[0].' . $city . '[1].' . $street . '[2].' . $season . '.' . $history . '.' . $side} = 1;
                    }
                }
            }
        }
    }   
}    
my $hash_2 = dclone $hash_1;

#say for @bad_patterns; say '---'; dd $hash_1; exit;

sub no_anchor {
    for my $bad_pattern (@bad_patterns) {
        my $re = qr/\Q$bad_pattern\E/;
        delete @$hash_2{ grep { /$re/ } keys %$hash_2 };
    }   
}   
sub w_anchor {
    for my $bad_pattern (@bad_patterns) {
        my $re = qr/^\Q$bad_pattern\E/;
        delete @$hash_1{ grep { /$re/ } keys %$hash_1 };
    }
}


cmpthese( -$runfor, {
    'no_anchor' => sub { no_anchor() },
    'w_anchor'  => sub { w_anchor() },
});

我讓比較 subs 使用外部資料（不像通常那樣傳遞給測驗 subs），以減少任何額外的作業，然后我使用單獨的 hashref 副本Storable::dclone。

上述基準測驗的輸出運行 10 秒（10運行時傳遞給程式）：

           Rate no_anchor  w_anchor
no_anchor 296/s        --      -13%
w_anchor  341/s       15%        --

所以錨定版本確實贏了，盡管利潤微薄：有了這個資料，正則運算式匹配在大約 96% 的情況下失敗，而且所有這些，非錨定版本做了更多的作業，不得不搜索整個字串，我d 預計會有更大的差異。這是由于代碼的其余部分，以及正則運算式編譯成本，這進一步稀釋了匹配效率本身的差異。

這給了我們關于計時代碼的重要教訓：它可能很微妙。人們需要確保只比較相關部分，并且是公平的（在同等情況下）。

轉載請註明出處，本文鏈接：https://www.uj5u.com/qiye/351487.html

標籤：正则表达式表现 perl 剖析

上一篇：將網路應用上傳到Firebase的程序

下一篇：必須將基因組質量轉換為ASCII時，如何使用Regexs/進行搜索和替換？