How to implement parallel and asynchronous web crawling using BFS

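This question will be long, but I want to explain my code and my thought process as thoroughly as possible, so here goes...

I'm writing a web crawler in C# that should search Wikipedia starting from a given source link and find a way to reach a destination link. For example, you can give it a link to the toaster Wiki page and a link to the pancake Wiki page, and it should output a route that takes you from toaster to pancake. In other words, I want to find the shortest path between two Wiki articles.

I think I have coded it correctly. I created two classes. The first one is called CrawlerPage, and this is its body: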

using HtmlAgilityPack;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

namespace Wikipedia_Crawler
{
    internal class CrawlerPage
    {
        public string mainLink;
        private List<CrawlerPage> _pages = new();
        public CrawlerPage(string mainLink)
        {
            this.mainLink = mainLink;
        }

        public async Task<List<CrawlerPage>> GetPages()
        {
            var pagesLinks = await Task.Run(() => GetPages(this));
            
            foreach(var page in pagesLinks)
            {
                _pages.Add(new CrawlerPage(page));
            }

            return _pages;
        }

        private HashSet<string> GetPages(CrawlerPage page)
        {
            string result = "";

            using (HttpClient client = new HttpClient())
            {
                using (HttpResponseMessage response = client.GetAsync(page.mainLink).Result)
                {
                    using (HttpContent content = response.Content)
                    {
                        result = content.ReadAsStringAsync().Result;
                    }
                }
            }

            var wikiLinksList = ParseLinks(result)
                .Where(x => x.Contains("/wiki/") && !x.Contains("https://") && !x.Contains(".jpg") &&
                            !x.Contains(".png"))
                .AsParallel()
                .ToList();

            var wikiLinksHashSet = new HashSet<string>();
            foreach(var wikiLink in wikiLinksList)
            {
                wikiLinksHashSet.Add("https://en.wikipedia.org" + wikiLink);
            }

            HashSet<string> ParseLinks(string html)
            {
                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
                return nodes == null ? new HashSet<string>() : nodes.AsParallel().ToList().ConvertAll(
                       r => r.Attributes.AsParallel().ToList().ConvertAll(
                       i => i.Value)).SelectMany(j => j).AsParallel().ToHashSet();
            }

            return wikiLinksHashSet;
        }
    }
}

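The class above is meant to represent a single Wiki article. It holds its own link (the mainLink field) and a list of all the other pages linked from it (the _pages field). The GetPages() method reads the page's HTML and parses it into a HashSet containing only the links I'm interested in (links to other articles, so that any other junk links are discarded).

The second class is Crawler, which performs the BFS (breadth-first search). The code is below: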

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace Wikipedia_Crawler
{
    internal class Crawler
    {
        private int _maxDepth;
        private int _currDepth;

        public Crawler(int maxDepth)
        {
            _currDepth = 0;
            _maxDepth = maxDepth;
        }
        
        public async Task CrawlParallelAsync(string sourceLink, string destinationLink)
        {
            var sourcePage = new CrawlerPage(sourceLink);
            var destinationPage = new CrawlerPage(destinationLink);
            var visited = new HashSet<string>();
            
            Queue <CrawlerPage> queue = new();
            queue.Enqueue(sourcePage);

            while (queue.Count > 0)
            {
                var currPage = queue.Dequeue();
                Console.WriteLine(currPage.mainLink);
                
                var currPageSubpages = await Task.Run(() => currPage.GetPages());

                if (currPage.mainLink == destinationPage.mainLink || _currDepth == _maxDepth)
                {
                    visited.Add(currPage.mainLink);
                    break;
                }

                if (visited.Contains(currPage.mainLink))
                    continue;

                visited.Add(currPage.mainLink);
                
                foreach (var page in currPageSubpages)
                {
                    if (!visited.Contains(page.mainLink))
                    {
                        queue.Enqueue(page);
                    }
                }
            }

            foreach (var visitedPage in visited)
            {
                Console.WriteLine(visitedPage);
            }
        }
    }
}

Note that I am not incrementing _currDepth yet. I want the search to stop once it goes too deep, because the route would then be too long. The class above works as follows: it enqueues the page with sourceLink and performs a standard BFS. It dequeues a page, checks whether it has been visited, checks whether it is the destination page, and then gets every subpage of that page (using currPage.GetPages()) and adds them to the queue. I believe the algorithm works, but it is extremely slow and therefore of no practical use.

My conclusion: it absolutely needs to run asynchronously and in parallel to be efficient. I have tried using Tasks, as you can tell, but that doesn't improve performance at all. My intuition tells me that every time we read the subpages of a page we should do it asynchronously and in parallel, and every time we start crawling a page we should do that asynchronously and in parallel as well. I have no idea how to achieve that. Do I need to completely refactor my code? Should I create a new crawler every time I enqueue a subpage?

I'm lost, can you help me figure it out?

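A uj5u.com user replied:

You could consider using the new (.NET 6) API Parallel.ForEachAsync. This method accepts an enumerable sequence and invokes an asynchronous delegate for each element of the sequence, with a specific degree of parallelism. One overload of this method is particularly interesting, because it accepts an IAsyncEnumerable<T> as input, which is essentially an asynchronous stream of data. You could create such a stream dynamically with an iterator method (a method that yields), but you can also use a Channel<T>, which exposes its contents as an IAsyncEnumerable<T>. Here is a rough demonstration of the idea: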

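using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// The channel serves as the work queue; seed it with the source page.
var channel = Channel.CreateUnbounded<CrawlerPage>();
channel.Writer.TryWrite(new CrawlerPage(sourceLink));

var cts = new CancellationTokenSource();
var options = new ParallelOptions()
{
    MaxDegreeOfParallelism = 10,
    CancellationToken = cts.Token
};

// Pass options so the parallelism limit and the cancellation token take effect.
// GetPagesAsync is not defined here; it is assumed to be an asynchronous
// version of CrawlerPage.GetPages.
await Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), options, async (page, ct) =>
{
    CrawlerPage[] subpages = await GetPagesAsync(page);
    foreach (var subpage in subpages) channel.Writer.TryWrite(subpage);
});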

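The parallel loop will keep processing pages until the channel.Writer.Complete() method is called and all pages remaining in the channel have been consumed, or until the CancellationTokenSource is canceled.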
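
The demonstration leaves open when channel.Writer.Complete() should be called. One possible approach, sketched here under assumptions, is to count the pages that have been written but not yet fully processed, and to complete the channel when that count drops to zero. ChannelCrawlSketch, CrawlAsync, and the getLinksAsync delegate are hypothetical names that do not appear in the answer; the delegate stands in for fetching the links of a page. Because pages are processed in parallel, strict BFS order is not guaranteed, so the depth at which a page is first seen may be larger than its true shortest distance.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ChannelCrawlSketch
{
    // Visits pages reachable from source within maxDepth link hops
    // and returns the pages that were seen. Requires .NET 6 or later.
    public static async Task<ICollection<string>> CrawlAsync(
        string source, int maxDepth, int maxParallelism,
        Func<string, Task<IEnumerable<string>>> getLinksAsync)
    {
        var channel = Channel.CreateUnbounded<(string Page, int Depth)>();
        var visited = new ConcurrentDictionary<string, int>();
        int pending = 0; // pages written to the channel but not yet fully processed

        void Enqueue(string page, int depth)
        {
            // Skip pages that are too deep or that were already seen.
            if (depth > maxDepth || !visited.TryAdd(page, depth)) return;
            Interlocked.Increment(ref pending);
            channel.Writer.TryWrite((page, depth));
        }

        Enqueue(source, 0);
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

        await Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), options, async (item, ct) =>
        {
            try
            {
                foreach (var link in await getLinksAsync(item.Page))
                    Enqueue(link, item.Depth + 1);
            }
            finally
            {
                // Children are counted before this decrement, so reaching zero
                // means no page is queued or in progress anywhere.
                if (Interlocked.Decrement(ref pending) == 0)
                    channel.Writer.Complete();
            }
        });

        return visited.Keys;
    }
}
```

Each page in progress keeps the counter above zero until its own links have been enqueued, which is why the channel cannot be completed while work is still being produced.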

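A uj5u.com user replied:

Calling client.GetAsync(page.mainLink).Result makes your code wait synchronously. Use await client.GetAsync(page.mainLink) instead. Once you do that, you should not use Task.Run; Task.Run is meant for running synchronous work asynchronously.

If you want parallelism, you can use Task.WhenAll.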
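
As a minimal sketch of that change, the fetch step could look roughly like this. AsyncFetchSketch, ToArticleUrls, GetLinksAsync, and the parseHrefs parameter are hypothetical names introduced for illustration; parseHrefs stands in for the question's ParseLinks helper, and the filter copies the conditions from the question's GetPages method. Sharing a single HttpClient is an additional assumption about what would suit a crawler, not something the answer states.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class AsyncFetchSketch
{
    // One shared HttpClient for the whole crawl instead of one per request.
    private static readonly HttpClient Client = new HttpClient();

    // Keeps only relative article links and turns them into absolute URLs
    // (the same conditions as in the question's GetPages method).
    public static HashSet<string> ToArticleUrls(IEnumerable<string> hrefs)
    {
        return hrefs
            .Where(x => x.Contains("/wiki/") && !x.Contains("https://")
                        && !x.Contains(".jpg") && !x.Contains(".png"))
            .Select(x => "https://en.wikipedia.org" + x)
            .ToHashSet();
    }

    // Downloads a page and returns the article URLs it links to.
    public static async Task<HashSet<string>> GetLinksAsync(
        string url, Func<string, IEnumerable<string>> parseHrefs)
    {
        // await frees the calling thread while the request is in flight,
        // so neither .Result nor a Task.Run wrapper is needed.
        string html = await Client.GetStringAsync(url);
        return ToArticleUrls(parseHrefs(html));
    }
}
```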
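
One way to combine Task.WhenAll with BFS, sketched here under assumptions, is to process the search level by level: fetch the links of every page in the current level concurrently, then build the next level from the results. Counting levels gives the depth limit the question asks about, and recording the page on which each link was first found allows the route to be rebuilt. LevelBfsSketch, FindPathAsync, getLinksAsync, and maxConcurrency are hypothetical names; the SemaphoreSlim throttle is an addition that the answer does not mention, included because a level can contain a very large number of pages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class LevelBfsSketch
{
    // Returns the shortest route from source to target (both included),
    // or an empty list if target is not reached within maxDepth link hops.
    public static async Task<List<string>> FindPathAsync(
        string source, string target, int maxDepth, int maxConcurrency,
        Func<string, Task<IEnumerable<string>>> getLinksAsync)
    {
        if (source == target) return new List<string> { source };

        // parent[x] is the page on which x was first found; it doubles as the visited set.
        var parent = new Dictionary<string, string> { [source] = source };
        var frontier = new List<string> { source };
        using var gate = new SemaphoreSlim(maxConcurrency);

        async Task<IEnumerable<string>> FetchThrottledAsync(string page)
        {
            await gate.WaitAsync();
            try { return await getLinksAsync(page); }
            finally { gate.Release(); }
        }

        for (int depth = 0; depth < maxDepth && frontier.Count > 0; depth++)
        {
            // Start the fetches for the whole level, then await them together.
            IEnumerable<string>[] results =
                await Task.WhenAll(frontier.Select(page => FetchThrottledAsync(page)));

            var next = new List<string>();
            for (int i = 0; i < frontier.Count; i++)
                foreach (var link in results[i])
                    if (parent.TryAdd(link, frontier[i])) next.Add(link);

            if (parent.ContainsKey(target)) return BuildPath(parent, source, target);
            frontier = next;
        }
        return new List<string>();
    }

    // Walks the parent links back from target to source and reverses the result.
    private static List<string> BuildPath(
        Dictionary<string, string> parent, string source, string target)
    {
        var path = new List<string> { target };
        while (path[path.Count - 1] != source)
            path.Add(parent[path[path.Count - 1]]);
        path.Reverse();
        return path;
    }
}
```

Because a level is fully expanded before the next one starts, the first route found has the fewest hops, at the cost of waiting for the slowest request in each level.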

Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/410860.html
