How can I also get the description of a download file link?

Example link:

<img src="https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg" alt="This is the description i want to get too" >

And this is the method I use to parse the links out of the downloaded HTML source files:

public List<string> GetLinks(string message)
        {
            List<string> list = new List<string>();
            string txt = message;
            foreach (Match item in Regex.Matches(txt, @"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"))
            {
                if (item.Value.Contains("thumbs"))
                {
                    int index1 = item.Value.IndexOf("mp4");

                    string news = ReplaceLastOccurrence(item.Value, "thumbs", "videos");

                    if (index1 != -1)
                    {
                        string result = news.Substring(0, index1 + 3);
                        if (!list.Contains(result))
                        {
                            list.Add(result);
                        }
                    }
                }
            }

            return list;
        }

But this only gives me the links. I also want to get the link description, which in this example is:

This is a test

Then I use it like this:

string[] files = Directory.GetFiles(@"D:\Videos\");
            foreach (string file in files)
            {
                foreach(string text in GetLinks(File.ReadAllText(file)))
                {
                    if (!videosLinks.Contains(text))
                    {
                        videosLinks.Add(text);
                    }
                }
               
            }

And when downloading the links:

private async void btnStartDownload_Click(object sender, EventArgs e)
        {
            if (videosLinks.Count > 0)
            {
                for (int i = 0; i < videosLinks.Count; i++)
                {
                    string fileName = System.IO.Path.GetFileName(videosLinks[i]);
                    await DownloadFile(videosLinks[i], @"D:\Videos\videos\" + fileName);
                }
            }
        }

But I want each file name to be the description of its link.

Answer:

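You can use Html Agility Pack, an HTML parser written in C# that reads and writes the DOM and supports plain XPath or XSLT. In the example below, the description is retrieved from the alt attribute, and other attributes can be read the same way.

Implementation: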

using HtmlAgilityPack;
using System;
                    
public class Program
{
    public static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        var html = "<img src=\"https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg\" alt=\"This is the description i want to get too\" >";
        doc.LoadHtml(html);
        HtmlNode image = doc.DocumentNode.SelectSingleNode("//img");

        Console.WriteLine("Source: {0}", image.Attributes["src"].Value);
        Console.WriteLine("Description: {0}", image.Attributes["alt"].Value);
        Console.Read();
    }
}

Demo:
https://dotnetfiddle.net/nAAZDL

Output:

Source: https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg
Description: This is the description i want to get too
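
Once the src and alt values are available (from whichever parser), the remaining step in the question is turning each pair into a download URL and a file name. The following is only a minimal sketch of that step, not part of the answer above: DownloadPlanner, ToVideoUrl, ToSafeFileName and BuildPlan are hypothetical names, and ToVideoUrl assumes the question's ReplaceLastOccurrence helper (which is not shown) swaps the last "thumbs" for "videos". Since alt text can contain characters that are not legal in file names, those are replaced with underscores.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class DownloadPlanner
{
    // Platform-specific invalid characters plus the Windows set listed explicitly,
    // so the result does not depend on the OS the code runs on.
    private static readonly char[] InvalidChars = Path.GetInvalidFileNameChars()
        .Concat("<>:\"/\\|?*")
        .Distinct()
        .ToArray();

    // Assumption: mirrors the question's logic (last "thumbs" -> "videos", cut after the first "mp4").
    public static string ToVideoUrl(string thumbnailUrl)
    {
        int last = thumbnailUrl.LastIndexOf("thumbs", StringComparison.Ordinal);
        int mp4 = thumbnailUrl.IndexOf("mp4", StringComparison.Ordinal);
        if (last < 0 || mp4 < 0) return null;

        string swapped = thumbnailUrl.Substring(0, last) + "videos"
            + thumbnailUrl.Substring(last + "thumbs".Length);
        return swapped.Substring(0, mp4 + 3);
    }

    // Replaces characters that cannot appear in a file name with underscores.
    public static string ToSafeFileName(string description, string extension)
    {
        string cleaned = new string(description
            .Select(c => InvalidChars.Contains(c) ? '_' : c)
            .ToArray()).Trim();
        if (cleaned.Length == 0) cleaned = "untitled";
        return cleaned + extension;
    }

    // Pairs each distinct video URL with a file name taken from the alt text,
    // falling back to the URL's own file name when alt is missing or empty.
    public static List<(string Url, string FileName)> BuildPlan(IEnumerable<(string Src, string Alt)> images)
    {
        var plan = new List<(string Url, string FileName)>();
        var seenUrls = new HashSet<string>();
        foreach (var image in images)
        {
            string url = ToVideoUrl(image.Src);
            if (url == null || !seenUrls.Add(url)) continue;

            string fileName = string.IsNullOrWhiteSpace(image.Alt)
                ? Path.GetFileName(url)
                : ToSafeFileName(image.Alt, Path.GetExtension(url));
            plan.Add((url, fileName));
        }
        return plan;
    }

    public static void Main()
    {
        var images = new[]
        {
            (Src: "https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg", Alt: "This is the description i want to get too"),
            (Src: "https://thumbs.com/thumbs/clip.mp4/clip.mp4-1.jpg", Alt: "What? A/B: test"),
        };

        foreach (var (url, fileName) in BuildPlan(images))
        {
            Console.WriteLine($"{url} -> {fileName}");
        }
    }
}
```

In the download loop, each FileName could then be combined with the target folder (for example via Path.Combine) in place of Path.GetFileName(videosLinks[i]). Two different links with the same description would still collide, so a real implementation would probably need to append a counter or similar.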

Answer:

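Ibrahim's answer shows how simple it is to use a proper HTML parser, but I'd say that if you only want to pull a single tag out of a single page, or you don't want to take on an external dependency, then a regular expression is not unreasonable, especially if you can make certain assumptions about the HTML you are matching.

Note that the pattern and code below are for demonstration purposes only and are not meant to be a robust, exhaustive tag parser; readers can extend them as needed to handle the various HTML quirks and idiosyncrasies they may encounter on the wild web. For example, the pattern will not match image tags whose attribute values use single quotes or no quotes at all, and the code will throw an exception if a tag has multiple attributes with the same name.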

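The way I would do this is with a pattern that matches an <img /> tag and all of its attribute pairs...

<img(?:\s+(?<name>[a-z]+)="(?<value>[^"]*)")*\s*/?>

...which you can then query to find the attributes you care about. You would use that pattern to extract the image attributes into a Dictionary<string, string> like this...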

static IEnumerable<Dictionary<string, string>> EnumerateImageTags(string input)
{
    const string pattern =
@"
<img                     # Start of tag
    (?:                  # Attribute name/value pair: noncapturing group
        \s+              # One or more whitespace characters
        (?<name>[a-z]+)  # Attribute name: one or more letters
        =                # Literal equals sign
        ""               # Literal double quote
        (?<value>[^""]*) # Attribute value: zero or more non-double quote characters
        ""               # Literal double quote
    )*                   # Zero or more attributes are allowed
    \s*                  # Zero or more whitespace characters
/?>                      # End of tag with optional forward slash
";

    foreach (Match match in Regex.Matches(input, pattern, RegexOptions.IgnorePatternWhitespace))
    {
        string[] attributeValues = match.Groups["value"].Captures
            .Cast<Capture>()
            .Select(capture => capture.Value)
            .ToArray();
        // Create a case-insensitive dictionary mapping from each capture of the "name" group to the same-indexed capture of the "value" group
        Dictionary<string, string> attributes = match.Groups["name"].Captures
            .Cast<Capture>()
            .Select((capture, index) => new KeyValuePair<string, string>(capture.Value, attributeValues[index]))
            .ToDictionary(pair => pair.Key, pair => pair.Value, StringComparer.OrdinalIgnoreCase);

        yield return attributes;
    }
}

Given SO74133924.html...

<html>
    <body>
        <p>This image comes from https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg:
        <img src="https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg" alt="This is the description i want to get too">
        <p>This image has additional attributes on multiple lines in a self-closing tag:
        <img
            first="abc"
            src="https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg"
            empty=""
            alt="This image has additional attributes on multiple lines in a self-closing tag"
            last="xyz"
        />
        <p>This image has empty alternate text:
        <img src="https://example.com/?message=This image has empty alternate text" alt="">
        <p>This image has no alternate text:
        <img src="https://example.com/?message=This image has no alternate text">
    </body>
</html>

...you would use each tag's attribute dictionary like this...

static void Main()
{
    string input = File.ReadAllText("SO74133924.html");

    foreach (Dictionary<string, string> imageAttributes in EnumerateImageTags(input))
    {
        foreach (string attributeName in new string[] { "src", "alt" })
        {
            string displayValue = imageAttributes.TryGetValue(attributeName, out string attributeValue)
                ? $"\"{attributeValue}\"" : "(null)";
            Console.WriteLine($"{attributeName}: {displayValue}");
        }
        Console.WriteLine();
    }
}

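...which outputs this...

src: "https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg"
alt: "This is the description i want to get too"

src: "https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg"
alt: "This image has additional attributes on multiple lines in a self-closing tag"

src: "https://example.com/?message=This image has empty alternate text"
alt: ""

src: "https://example.com/?message=This image has no alternate text"
alt: (null)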
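
The two limitations called out earlier (single-quoted values and duplicate attribute names) could be relaxed along the following lines. This is only a sketch under stated assumptions, not the answer's own code: TolerantImageTags and Enumerate are hypothetical names, the pattern reuses the value group name in both alternatives (which .NET permits), hyphens are allowed in attribute names, and a repeated attribute keeps its first value, which is how HTML parsers treat duplicates. Unquoted values are still not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TolerantImageTags
{
    private const string Pattern = @"
<img                             # Start of tag
    (?:                          # Attribute name/value pair
        \s+                      # One or more whitespace characters
        (?<name>[a-z-]+)         # Attribute name: letters and hyphens
        =                        # Literal equals sign
        (?:
            ""(?<value>[^""]*)""   # Double-quoted value
            |
            '(?<value>[^']*)'    # Single-quoted value
        )
    )*                           # Zero or more attributes are allowed
    \s*                          # Zero or more whitespace characters
/?>                              # End of tag with optional forward slash
";

    private const RegexOptions Options =
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase;

    public static IEnumerable<Dictionary<string, string>> Enumerate(string input)
    {
        foreach (Match match in Regex.Matches(input, Pattern, Options))
        {
            // Each attribute iteration captures exactly one name and one value,
            // so the two capture lists line up by index.
            CaptureCollection names = match.Groups["name"].Captures;
            CaptureCollection values = match.Groups["value"].Captures;

            var attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            for (int i = 0; i < names.Count; i++)
            {
                // First occurrence wins; later duplicates are ignored instead of throwing.
                if (!attributes.ContainsKey(names[i].Value))
                {
                    attributes.Add(names[i].Value, values[i].Value);
                }
            }

            yield return attributes;
        }
    }

    public static void Main()
    {
        string html = "<img src='a.jpg' alt=\"first\" alt=\"second\">";
        foreach (Dictionary<string, string> attributes in Enumerate(html))
        {
            Console.WriteLine($"src: {attributes["src"]}");
            Console.WriteLine($"alt: {attributes["alt"]}");
        }
    }
}
```

Every extra case handled this way makes the pattern harder to read, which is a fair argument for switching to a real HTML parser once the input stops being predictable.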

Answer:

If you use regex-based code, it will take more CPU cycles and run slowly. Use a library like AngleSharp instead.

I tried rewriting your code with AngleSharp. This is how I did it.

        string test = "<img src=\"https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg\" alt=\"This is the description i want to get too\" >\r\n";
        var configuration = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(configuration);
        using var doc = await context.OpenAsync(req => req.Content(test));

        string description = doc.QuerySelector("img").Attributes["alt"].Value;
