Piping curl into awk to download and unzip files

I want to download all the files linked from this part of an HTML page:

    <td><a class="xm" name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
    <td><a class="xm" name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>
    <td><a class="xm" name="item_3" type="dd" href="/data/75239/dd">Item 3</a></td>

The download link for the first file is https://foo.bar/data/24765/dd. Since it is a zip file, I also want to unzip it.

My script looks like this:

#!/bin/bash
curl -s "https://foo.bar/path/to/page" > data.html

gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' data.html > data.txt

for f in $(cat data.txt); do 
    curl -s "https://foo.bar/$f" > data.zip
    unzip data.zip
done

Is there a more elegant way to write this script? I would like to avoid saving the html, txt, and zip files.

Answer:

The bsdtar command can extract archives from standard input, which lets you do the following:

curl -s "https://foo.bar/$f" | bsdtar -xf-

Of course, you can pipe the first curl command directly into awk:

curl -s "https://foo.bar/path/to/page" |
gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' > data.txt
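
The three-argument form of match() used here is a gawk extension, so this step fails under other awk implementations. As a rough sketch of a portable alternative, the same capture can be done with POSIX sed. The function name extract_paths is made up for illustration, and the sketch assumes at most one matching link per line, as in the sample HTML:

```shell
#!/bin/bash
# Hypothetical helper: print the data/NNNNN/dd path from each matching
# href read on stdin. Uses only POSIX sed, so gawk is not required.
# Assumes at most one matching link per input line.
extract_paths() {
    sed -n 's|.*href="/\(data/[0-9]\{5\}/dd\)".*|\1|p'
}

# Offline check against the HTML fragment from the question.
extract_paths <<'EOF'
    <td><a class="xm" name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
    <td><a class="xm" name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>
    <td><a class="xm" name="item_3" type="dd" href="/data/75239/dd">Item 3</a></td>
EOF
```

Running the regex against a here-document like this is also a convenient way to test it without touching the network.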

In fact, you can also pipe the output of the curl and gawk pipeline straight into a loop:

curl -s "https://foo.bar/path/to/page" |
gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' |
while read archive; do
    curl -s "https://foo.bar/$archive" | bsdtar -xf-
done
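
The loop can be made slightly more defensive. The following is only a sketch under a few assumptions: bsdtar is installed, and the names extract_all and BASE_URL are hypothetical, chosen for illustration. It reads paths with IFS= and read -r so that backslashes and surrounding whitespace are left untouched, and adds curl's -f and -S options so that an HTTP error is reported on stderr instead of the server's error page being piped into bsdtar:

```shell
#!/bin/bash
# Host taken from the question; BASE_URL itself is a hypothetical name.
BASE_URL="https://foo.bar"

# Hypothetical helper: read archive paths on stdin, one per line,
# then download each archive and extract it without saving it to disk.
extract_all() {
    while IFS= read -r archive; do
        # Skip empty lines.
        [ -n "$archive" ] || continue
        # -f: fail on HTTP errors, -s: no progress meter,
        # -S: still print error messages despite -s.
        curl -fsS "$BASE_URL/$archive" | bsdtar -xf -
    done
}
```

It would be used as the last stage of the pipeline, in place of the while loop: the output of the curl and gawk commands is piped into extract_all.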

Answer:

"I would like to avoid saving (...) zip files."

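In general, many Linux terminal commands accept - in place of a file name, meaning "use stdin". After a cursory search, it seems that some versions of unzip do not support this (see "How to redirect output of wget as input to unzip?" at unix.stack.exchange), while others do, as described at freebsd.org: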

If the specified filename is "-", then data is read from stdin.

So if you are using a version that does support this, then

curl -s "https://foo.bar/$f" > data.zip
unzip data.zip

could be improved to

curl -s "https://foo.bar/$f" | unzip -

If it does not, but you still want to use unzip, then according to the answer at unix.stack.exchange, prefixing unzip with busybox will fix it, i.e.

curl -s "https://foo.bar/$f" | busybox unzip -
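
If neither a stdin-capable unzip nor busybox is available, one possible fallback is to spool the archive into a temporary file that is removed right after extraction, so nothing is left behind. This is a minimal sketch rather than a recommended implementation, and the helper name unzip_stdin is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: extract a zip archive that arrives on stdin.
# The archive is written to a temporary file, which is deleted again
# whether or not the extraction succeeds.
unzip_stdin() {
    local tmp status
    tmp=$(mktemp) || return 1
    # Copy stdin to the temporary file, then extract it quietly.
    cat > "$tmp" && unzip -q "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

The download step then becomes curl -s "https://foo.bar/$f" | unzip_stdin. A temporary file still exists for a moment, but the script no longer has to manage a data.zip itself.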
