Regex gets stuck with this call

user3689167

I'm working on a movie scraper / auto-downloader that iterates over my current movie collection, finds new recommendations, and downloads the new goods.

There is a part where I scrape IMDb for metadata, and it seems to get stuck in one spot; I can't figure out why. The same code has run on other IMDb pages just fine (this is the 29th iteration, on a new page).

I am using C#.

The code:

    private string Match(string regex, string html, int i = 1)
    {
        return new Regex(regex, RegexOptions.Multiline).Match(html).Groups[i].Value.Trim();
    }

Contents of the regex parameter string:

 <title>.*?\\(.*?(\\d{4}).*?\\).*?</title>

Contents of the html parameter string: too big to paste here, but it is literally the HTML source of http://www.imdb.com/title/tt4422748/combined

In Chrome, you can view it easily with:

view-source:http://www.imdb.com/title/tt4422748/combined

I have paused execution in Visual Studio and stepped forward; it continues to run but just hangs (it doesn't let me step, it just runs). If I hit pause again, it returns to the same spot with the same parameter values (and no, I am not calling it in an infinite loop). I'm pretty new to regex, so any help would be appreciated!

ΩmegaMan

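Using .* is like saying "I want to match everything, yet nothing": it can match any amount of text, including none at all. Each use of it in a pattern multiplies the number of possibilities the regex engine has to backtrack through when the overall match fails, to the point where it becomes unresponsive and appears to lock up.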
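One way to keep a slow pattern from hanging the scraper while the pattern itself gets fixed is to give the Regex a match timeout. This is a minimal sketch, assuming .NET 4.5 or later; the `SafeMatch` name, the two-second budget, and the choice to return an empty string on timeout are illustrative assumptions, not part of the original code:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexHelpers
{
    // Hypothetical variant of the question's Match helper with a time budget.
    public static string SafeMatch(string regex, string html, int i = 1)
    {
        try
        {
            // The third constructor argument is the match timeout.
            var re = new Regex(regex, RegexOptions.Multiline, TimeSpan.FromSeconds(2));
            return re.Match(html).Groups[i].Value.Trim();
        }
        catch (RegexMatchTimeoutException)
        {
            // The pattern exceeded its budget; treated here as "no match".
            return string.Empty;
        }
    }
}
```

A timeout only contains the symptom. The pattern still does the same amount of backtracking until it is rewritten.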

Does the person designing the pattern really not know whether there will be text inside the title? I bet 99% of the time the title has text, so why use .* at all? How about .+ at least?

If you want text between the delimiters, use this

    title\>(?<Title>[^<]+)\</title

Then extract the matched text through the named group "Title" (Groups["Title"]) instead of Groups[0], which holds the entire match. Groups[1] holds the same captured text, if one loathes named match captures.
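The pattern above could be used roughly as follows. This is only a sketch: the `TitleExtractor` class, its method names, and the second pattern that pulls a four-digit year out of the parentheses are assumptions added for illustration, based on the year capture in the question's original regex:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // [^<]+ stops at the next '<', so there is very little to backtrack over.
    private static readonly Regex TitleRegex =
        new Regex(@"title\>(?<Title>[^<]+)\</title", RegexOptions.IgnoreCase);

    // Hypothetical second step: a four-digit year inside one pair of parentheses.
    private static readonly Regex YearRegex =
        new Regex(@"\([^)]*?(?<Year>\d{4})[^)]*\)");

    public static string GetTitle(string html)
    {
        Match m = TitleRegex.Match(html);
        // Groups["Title"] and Groups[1] refer to the same capture here.
        return m.Success ? m.Groups["Title"].Value.Trim() : string.Empty;
    }

    public static string GetYear(string html)
    {
        // Search only the short title text rather than the whole page.
        Match m = YearRegex.Match(GetTitle(html));
        return m.Success ? m.Groups["Year"].Value : string.Empty;
    }
}
```

Splitting the work into two small patterns means the year search only ever runs over the title text, not the full HTML string.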

Answer for Regex Haters

Use the HTML Agility Pack.
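A rough sketch of what that might look like, assuming the HtmlAgilityPack NuGet package is referenced; the `TitleParser` class and `GetTitle` method are hypothetical names used only for this example:

```csharp
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

public static class TitleParser
{
    // Parses the page and returns the text of the first <title> element.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath lookup; SelectSingleNode returns null when nothing matches.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//title");
        return node == null
            ? string.Empty
            : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

With the title text in hand, any remaining regex work (such as finding the year) runs over a short string instead of the whole page.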
