用于計算字串中三個不同ID的出現次數的Perl腳本，其中出現次數可能為零-有解無憂

我在下面附上了一個示例，其中包含來自我的輸入檔案的 5 行（制表符分隔）：

G157    G157.2  535 3   344 344:m64019_201112_211057/51839190/ccs,m64019_201112_211057/167772263/ccs,m64019_201112_211057/152963146/ccs
G157    G157.6  535 42  276,344,365 276:m54312U_201103_152606/5964842/ccs,m54312U_201103_152606/78907467/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/124453202/ccs,m54312U_201103_152606/117441369/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs,m54312U_201103_152606/137956132/ccs,m54312U_201103_152606/26020309/ccs;344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/20840619/ccs,m64019_201112_211057/32964769/ccs,m64019_201112_211057/119801176/ccs,m64019_201112_211057/62327216/ccs,m64019_201112_211057/155584613/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/40042591/ccs,m64019_201112_211057/129304850/ccs,m64019_201112_211057/16450019/ccs,m64019_201112_211057/127666695/ccs,m64019_201112_211057/29427856/ccs,m64019_201112_211057/171181539/ccs,m64019_201112_211057/175898871/ccs,m64019_201112_211057/28771811/ccs,m64019_201112_211057/167051372/ccs,m64019_201112_211057/25428057/ccs;365:m64019_201101_022708/103875458/ccs,m64019_201101_022708/24576259/ccs,m64019_201101_022708/67961035/ccs,m64019_201101_022708/149356854/ccs,m64019_201101_022708/5767478/ccs,m64019_201101_022708/155123744/ccs,m64019_201101_022708/125829415/ccs,m64019_201101_022708/137232674/ccs,m64019_201101_022708/83232122/ccs,m64019_201101_022708/126617353/ccs,m64019_201101_022708/64619288/ccs,m64019_201101_022708/64751219/ccs,m64019_201101_022708/132055970/ccs,m64019_201101_022708/34539631/ccs
G157    G157.9  535 4   344 344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/27198805/ccs
G157    G157.11 535 6   276 276:m54312U_201103_152606/156304839/ccs,m54312U_201103_152606/15336676/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs

第二列包含轉錄本 ID，第五列包含讀取映射到該轉錄本的樣本 ID，第六列包含所有這些讀取的串列。下面解釋一下第六列的結構：

SAMPLEID:readinfo/separated/by/forwardslashes,readinfo/separated/by/forwardslashes;SAMPLEID:readinfo/separated/by/forwardslashes

有三個樣本 ID（276,344 和 365），但每個轉錄本可以覆寫任何一個、兩個或所有三個樣本。

這就是我希望輸出的樣子：

transcript_id   276 344 365
G157.1  0   0   2
G157.2  0   3   0
G157.6  9   18  15
G157.9  0   4   0
G157.11 6   0   0

我已經能夠讓它作業，但是因為我對 Perl 比較陌生，所以我無法弄清楚如何完全在 Perl 中完成這項任務。最后，我使用第二個 R 腳本拼湊出一個矩陣。這是我的 Perl 腳本：

#!/usr/bin/perl -w
use strict;
my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash, 
    $inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron, 
    $countGene, %unique, );

open(INFILE, $ARGV[0]) or die"File1 is Dead\n";
while(<INFILE>) {
    $line=$_;
    chomp $line;
    if ($line=~m/\S /) {
        my ($transcript) = ($line=~m/\S \s (\S )/);
        my ($transcript_info) = ($line=~m/\S \s \S \s \S \s \S \s \S \s (\S )/);
        my ($unique_donor_counts)= ($line=~m/\S \s \S \s \S \s \S \s (\S )/);
        my ($unique_donor_ID) = ($transcript_info =~m/(\S )\:/);
    
       if ($unique_donor_counts !~ m/,/) {
           my $commas = $transcript_info =~ y/,//;
           my $counts = $commas 1;
           print "$transcript\t$unique_donor_ID\t$counts\n";
       } else { 
           my @spl = split(';', $unique_donor_ID);  
           foreach my $i (@spl) {
               my $commas = $i =~ y/,//;
               my $counts = $commas 1;
               my ($donor) = ($i=~m/(\d\d\d)/);
               print "$transcript\t$donor\t$counts\n";
           }
       }
    }
}

輸出：

G157.1  2   365
G157.2  3   344
G157.6  9   276
G157.6  18  344
G157.6  15  365
G157.9  4   344
G157.11 6   276

我沒有列印，而是將其保存到輸出檔案并運行此 R 腳本：

library(tidyr)

DF <- read.delim("output", sep = "\t", header = F)
df <- tidyr::pivot_wider(output, id_cols = V1, names_from = V2, values_from = V3)
write.table(df, file = "FLcount", sep = "\t", col.names = T, row.names = F, quote = F, dec = ".")

The problem I was encountering in Perl is that I can't figure out how to get a count of 0 when the sample ID doesn't occur in that line. I really want to get better at Perl, so I'd like to figure out how to do this all in one script instead of manipulating the matrix using R. Thank you for taking the time to help a Perl newbie learn!

edited post from my original with some progress

uj5u.com熱心網友回復：

我將瀏覽您的程式檔案并在需要時發表評論。

#!/usr/bin/perl -w
use strict;

該-w開關會在全域范圍內打開警告。您應該使用警告的詞法版本：use warnings. 通常認為這是一種更好的做法，因為它允許您在需要時關閉警告，例如：no warnings 'experimental'.

my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash, 
    $inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron, 
    $countGene, %unique, );

所有這些變數都未使用。所以洗掉所有未使用的變數。

您不應該在程式頂部宣告所有變數并將它們全部設為全域變數。您應該將它們宣告為盡可能靠近您將要使用它們的地方，并且在盡可能小的范圍內。用宣告的變數的范圍my是最近的封閉塊：{ block here }。這將封裝您的代碼，并使其更易于閱讀。

open(INFILE, $ARGV[0]) or die"File1 is Dead\n";

這是舊的兩個引數 open，通常被認為是不好的做法，因為它允許插入代碼。您希望以顯式打開模式打開三個引數。您還希望$!在 die 陳述句中包含錯誤報告：

open my $fh, "<", $ARGV[0] or die "Cannot open '$ARGV[0]' for reading: $!";

您也不希望在die陳述句末尾添加換行符，因為這會抑制錯誤中的行號。

while(<INFILE>) {
    $line=$_;
    chomp $line;

您不必跳過所有這些障礙。你可以做

while (my $line = <$fh>) {    # declared in the smallest possible scope
    chomp $line;

或者

while (<$fh>) {
    chomp;        # chomps $_

The latter has the benefit of not having to type out what variable you want to match your regexes against. E.g. instead of $line =~ m/..../ just type m/..../. It is a commonly used Perl idiom.

    if ($line=~m/\S /) {
        my ($transcript)         = ($line=~m/\S \s (\S )/);
        my ($transcript_info)    = ($line=~m/\S \s \S \s \S \s \S \s \S \s (\S )/);
        my ($unique_donor_counts)= ($line=~m/\S \s \S \s \S \s \S \s (\S )/);

You could just match once, and capture three values at the same time:

my ($trans,$udc,$trans_info) = $line =~ /^\S \s (\S )\s \S \s (\S )\s (\S )\s (\S )/;

Note that you have to have the variables in the same order as the captured values.

This might also be somewhat simpler if you use split

my @vals = split ' ', $line;
my $trans = $vals[1];        # and so on...

This part

       my ($unique_donor_ID) = ($transcript_info =~m/(\S )\:/);

is wrong. You think it just matches a number in front of the first colon, e.g. 344:. If you only have one colon, that is what it does. Otherwise it matches all the characters before the last colon in the line, because \S is greedy and will try to match as much as possible. When you later split $unique_donor_ID on ;, you will miss all the data after the last colon. You can see the result here. Notice line 2 of the output which begins with 276 and ends with 365. 365 is the start of the last record.

       if ($unique_donor_counts !~ m/,/) {            # checks for 311,322, ..
           my $commas = $transcript_info =~ y/,//;    # count by destroying data
           my $counts = $commas 1;                    # unnecessary extra variable
           print "$transcript\t$unique_donor_ID\t$counts\n";
       } else { 
           my @spl = split(';', $unique_donor_ID);    # splitting the wrong variable
           foreach my $i (@spl) {
               my $commas = $i =~ y/,//;              
               my $counts = $commas 1;
               my ($donor) = ($i=~m/(\d\d\d)/);       # risky way to get the first 3 numbers
               print "$transcript\t$donor\t$counts\n";
        .....

Your logic is to check for commas in the donor id short list, and handle the single values separately. That is most likely a flawed approach. Why not just split the last field on ; and handle the results from there?

use strict;
use warnings;
use feature 'say';

my 





        
      轉載請註明出處，本文鏈接：https://www.uj5u.com/yidong/425569.html
      標籤：perl unix count find-occurrences 
      上一篇：替換大檔案中參考字串中的換行符
下一篇：為什么JSON編碼需要參考Perl版本字串？