How do I extract data from a website by specifying search criteria?

Khalil Khalaf

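I have a new project in an area I am not familiar with. One of the tasks requires me to go through several websites and collect some data. One example site is: https://www.hudhomestore.com/Home/Index.aspx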


I have read and watched tutorials on "harvesting" data from web pages.

My question is: how do we usually set our preferences, run a "search" based on them, and then load the results into my code the way those tutorials do?

Edit

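This works for setting the search criteria based on my selections. However, when I run the search manually for the state MI, the total number of results is 223, whereas when I run the following code, tdNodeCollection contains only 121 items. Can you tell me where I am going wrong?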

    // Requires the HtmlAgilityPack NuGet package, plus:
    // using System.Threading.Tasks; using HtmlAgilityPack;
    HtmlWeb web = new HtmlWeb();

    // Search criteria: empty strings and "0" leave a filter unset.
    string zipCode = "", city = "", county = "", street = "", sState = "MI", fromPrice = "0", toPrice = "0", fcaseNumber = "",
           bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
           stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";

    // HtmlWeb.Load is synchronous, so run it on a worker thread and await the result.
    var doc = await Task.Run(() => web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
        "zipCode=" + zipCode + "&city=" + city + "&county=" + county + "&street=" + street + "&sState=" + sState +
        "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
        "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
        "&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities +
        "&outdoorAmenities=" + outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories +
        "&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage));

    // Select every <td> inside the results table.
    HtmlNodeCollection tdNodeCollection = doc
                             .DocumentNode
                             .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");

M. Adeel Khalid

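You can use HtmlAgilityPack for this. I wrote a small piece of test code and tested it against the second page you want to scrape, the results page whose search criteria you can set.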

        // Requires the HtmlAgilityPack NuGet package and: using HtmlAgilityPack;
        HtmlWeb web = new HtmlWeb();
        //string InitialUrl = "https://www.hudhomestore.com/Home/Index.aspx";
        // Set these variables to whatever the user enters; they are then
        // appended to the search-results URL as query-string parameters.
        string zipCode = "", city = "", county = "", street = "", sState = "AK", fromPrice = "0", toPrice = "0", fcaseNumber = "",
               bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
               stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";
        HtmlAgilityPack.HtmlDocument document = web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
            "zipCode=" + zipCode + "&city=" + city + "&county=" + county + "&street=" + street + "&sState=" + sState +
            "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
            "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
            "&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities +
            "&outdoorAmenities=" + outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories +
            "&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage);
        // Select every <td> inside the results table.
        HtmlNodeCollection tdNodeCollection = document
                                 .DocumentNode
                                 .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");
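
The URL above is assembled by plain string concatenation, which breaks as soon as a value contains a space or an ampersand (a city such as "Ann Arbor", for instance). As a rough sketch, the same query string could be built from a list of key/value pairs with each part escaped. The SearchUrl class and its Build method are hypothetical names introduced here, only three of the parameters are shown, and whether the site accepts a partial parameter list is an assumption that would need checking.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SearchUrl
{
    // Base address taken from the code above.
    private const string BaseUrl =
        "https://www.hudhomestore.com/Listing/PropertySearchResult.aspx";

    // Joins the criteria into "key=value&key=value", escaping each part so
    // that spaces or '&' inside a value cannot corrupt the URL.
    public static string Build(IEnumerable<KeyValuePair<string, string>> criteria)
    {
        var pairs = criteria.Select(kv =>
            Uri.EscapeDataString(kv.Key) + "=" + Uri.EscapeDataString(kv.Value ?? ""));
        return BaseUrl + "?" + string.Join("&", pairs);
    }
}

public static class Program
{
    public static void Main()
    {
        // A list (not a Dictionary) keeps the parameters in insertion order.
        var criteria = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("city", "Ann Arbor"),
            new KeyValuePair<string, string>("sState", "MI"),
            new KeyValuePair<string, string>("sLanguage", "ENGLISH"),
        };
        Console.WriteLine(SearchUrl.Build(criteria));
    }
}
```

The resulting string can be passed to web.Load in place of the concatenated one.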

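If you look at the count your expression returns, there are exactly 121 td elements under the element with id="dgPropertyList". Next, inspect those td nodes manually, work out which of them hold what you need, and read the data from them.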

            foreach (HtmlAgilityPack.HtmlNode node in tdNodeCollection)
            {
                // Suppose you want the <h2> or <p> inside this cell. You can do:
                HtmlNode h2Node = node.SelectSingleNode("./h2"); // first <h2> that is a direct child
                HtmlNodeCollection allH2Nodes = node.SelectNodes(".//h2"); // searches at any depth too

                // You can also look at the children without using XPath (like in a tree):
                HtmlNode h2Node_ = node.ChildNodes["h2"];
            }

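I have tested the code; it works and parses the whole document to reach the required table. It gives you all the rows of that table inside the div. From there you can dig further into those rows, find your td, and get the content you need.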
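
Digging into the rows could look roughly like the sketch below, which groups the cells by their tr instead of flattening every td into one collection. TableReader and ReadRows are hypothetical names, and the markup in Main is made up to stand in for the real page, whose actual structure (nested tables, header rows) is an assumption here. Note that "./td" only matches direct children of each row, unlike the "//tr//td" expression used earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class TableReader
{
    // Returns one list of trimmed cell texts per <tr> that has at least one <td>.
    public static List<List<string>> ReadRows(HtmlDocument doc, string tableId)
    {
        var rows = new List<List<string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection trNodes =
            doc.DocumentNode.SelectNodes("//*[@id='" + tableId + "']//tr");
        if (trNodes == null) return rows;

        foreach (HtmlNode tr in trNodes)
        {
            HtmlNodeCollection tdNodes = tr.SelectNodes("./td");
            if (tdNodes == null) continue; // e.g. a header row made only of <th>

            rows.Add(tdNodes
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToList());
        }
        return rows;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up markup standing in for the real results page.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table id='dgPropertyList'>" +
            "<tr><th>Case</th><th>City</th></tr>" +
            "<tr><td>000-000001</td><td>Detroit</td></tr>" +
            "<tr><td>000-000002</td><td>Lansing</td></tr>" +
            "</table>");

        foreach (List<string> row in TableReader.ReadRows(doc, "dgPropertyList"))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

Keeping the cells grouped per row makes it easier to tell how many listings a page holds, as opposed to how many td elements it contains.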

Another option could be Selenium WebDriver (see "Hands on with Selenium").

If you do not want the browser to be visible but still want Selenium-like functionality, you can use PhantomJS.

Hope this helps.
