Link Scraping Methods

Some of us love those electrons just a little too much
Post Reply
Freecare Spiritwise
Grand Pontificator
Posts: 3015
Joined: Thu Mar 13, 2003 5:35 pm

Link Scraping Methods

Post by Freecare Spiritwise »

I saw Tax mention working on a link scraper, and I thought it might be a good blog post, so I wrote some C# code. I went old school with a one-pass, finite state machine parser. Then I got to thinking that it looked primitive and maybe I should be using regular expressions.

Then I looked at some code that uses regular expressions and it looked inefficient. If I wanted to scrape multiple tags then I would need lots of passes with the regular expressions. And what if I was parsing malformed HTML? The code I looked at wouldn't handle that. If I'm looking at scraping with lots of recursion, this could could be executed millions of times.

Here's the rough draft of my parser method, which takes a block of HTML as a string and builds a list of links. The next version will scrape links out of content too, not just tags. Anything that looks like a link and can be verified.

Thoughts? I'm inclined to think my code would execute faster than plucking the tags I want out using regex. It seems to work well.

Code: Select all

public void ParsePage(string scrapeUrl, string page)
        {
            try
            {
                string curTagName = "";
                string curAttrName = "";
                string curAttrVal = "";

                Dictionary<string, string>   attributes = null;

                foreach (char c in page)
                {
                    switch (c)
                    { 
                        case '<':
                            // Change state regardless of current state.
                            NewState(WebScraperStates.InTag);
                            attributes = new Dictionary<string, string>();
                            curTagName = "";
                            break;

                        case '>':
                            NewState(WebScraperStates.InContent);
                            string tagName = curTagName.ToLower();

                            if (( tagName == "a" || tagName == "img") && attributes != null && attributes.Count > 0)
                            {
              
                                try
                                {
                                    string link = "";

                                    if (tagName == "a")
                                    {
                                        link = attributes["href"];
                                    }
                                    if (tagName == "img")
                                    {
                                        link = attributes["src"];                                    
                                    }

                                    if (link.Length > 0)
                                    {
                                        link = link.StartsWith("//") ? "http:" + link : link;
                                        link = link.StartsWith("/") ? scrapeUrl + link : link;

                                        if (!link.StartsWith("http://") && !link.StartsWith("https://"))
                                        {
                                            link = MergeUrl(scrapeUrl, link);
                                        }

                                        NewWebLink(link, tagName, "", "", "");
                                    }
                                }
                                catch (Exception)
                                {
                             
                                }
                             }

                            if (tagName == "meta")
                            { 
                            
                            }

                            break;

                        case '=':
                            if (ScrapeState == WebScraperStates.InTagAttr)
                            {
                                NewState(WebScraperStates.InAttrValStart);
                            }
                            break;

                        case '\"':
                            switch (ScrapeState)
                            { 
                                case WebScraperStates.InAttrValStart:
                                    NewState(WebScraperStates.InAttrValContent);
                                    curAttrVal = "";
                                    break;

                                // Add a new attribute to the list.
                                case WebScraperStates.InAttrValContent:
                                    NewState(WebScraperStates.InTag);
                                    attributes.Add(curAttrName, curAttrVal);
                                    break;
                        
                            }
                            break;

                        case ' ':
                            if (ScrapeState == WebScraperStates.InTag)
                            {
                                // Have the tag name, now get the attributes.
                                NewState(WebScraperStates.InTagAttr);
                                curAttrName = "";
                            }
                            break;
                    
                        default:
                            switch (ScrapeState)
                            { 
                                // Add the character to the tag name.
                                case WebScraperStates.InTag:
                                    curTagName += c;
                                    break;

                                // Add the character to the attribute name.
                                case WebScraperStates.InTagAttr:
                                    curAttrName += c;
                                    break;

                                // Add the character to the attribute value.
                                case WebScraperStates.InAttrValContent:
                                    curAttrVal += c;
                                    break;

                            }
                            break;

                    }
                 }
            }
            catch (Exception)
            {
              
            }
        }
Here's some regex code I found. Seems like a lot of passes I want to pull <img> and <meta> tags, too.

Code: Select all

static class LinkFinder
{
    public static List<LinkItem> Find(string file)
    {
	List<LinkItem> list = new List<LinkItem>();

	// 1.
	// Find all matches in file.
	MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
	    RegexOptions.Singleline);

	// 2.
	// Loop over each match.
	foreach (Match m in m1)
	{
	    string value = m.Groups[1].Value;
	    LinkItem i = new LinkItem();

	    // 3.
	    // Get href attribute.
	    Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
		RegexOptions.Singleline);
	    if (m2.Success)
	    {
		i.Href = m2.Groups[1].Value;
	    }

	    // 4.
	    // Remove inner tags from text.
	    string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
		RegexOptions.Singleline);
	    i.Text = t;

	    list.Add(i);
	}
	return list;
    }
}
Freecare Spiritwise
Grand Pontificator
Posts: 3015
Joined: Thu Mar 13, 2003 5:35 pm

Re: Link Scraping Methods

Post by Freecare Spiritwise »

It's looking good. I may have the blog post up tonight. One thing my little app does is request every link and grab its content type. It's also recursive and you can set the search depth. I've never tested it past 2, which has gotten me > 10K links on a couple sites before I hit the stop button.

Here's a sample of what the blog post app will output. I ran it on a domain I own.

Image
Freecare Spiritwise
Grand Pontificator
Posts: 3015
Joined: Thu Mar 13, 2003 5:35 pm

Re: Link Scraping Methods

Post by Freecare Spiritwise »

Ddrak
Save a Koala, deport an Australian
Posts: 17516
Joined: Thu Jan 02, 2003 3:00 pm
Location: Straya mate!
Contact:

Re: Link Scraping Methods

Post by Ddrak »

On the regex side, you should be using capturing groups and alternation:

Code: Select all

<(?:a\s+href=|img\s+src=|link\s+href=)"(.*?)".*>
will strip out the url component from any "a href", "img src" and "link href" tag and return it to you as capture group 1 (0 is the whole thing). Could make it more generic with optional whitespace around the place and handle the href somewhere but the first if I got creative.

(Not that I mind a well-constructed FSM. People don't do that enough these days.)

Dd
Image
Ddrak
Save a Koala, deport an Australian
Posts: 17516
Joined: Thu Jan 02, 2003 3:00 pm
Location: Straya mate!
Contact:

Re: Link Scraping Methods

Post by Ddrak »

I should add that Expresso is your friend for regexes. And trial-and-error. Lots of trial-and-error. Insane amounts, until your regex-fu is good.

Dd
Image
User avatar
Taxious
Rum Guzzler
Posts: 5056
Joined: Fri Apr 18, 2003 10:16 am
Location: Denver, CO

Re: Link Scraping Methods

Post by Taxious »

Nice script there Free! I did mine in PHP with Curl. Did you look at any HTML parsers out there already for C#? This one?
Ddrak wrote:I should add that Expresso is your friend for regexes.
Oh, nice! I'm such a slow regex writer, this could help a lot.
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Freecare Spiritwise
Grand Pontificator
Posts: 3015
Joined: Thu Mar 13, 2003 5:35 pm

Re: Link Scraping Methods

Post by Freecare Spiritwise »

Ddrak wrote:On the regex side, you should be using capturing groups and alternation:

Code: Select all

<(?:a\s+href=|img\s+src=|link\s+href=)"(.*?)".*>
will strip out the url component from any "a href", "img src" and "link href" tag and return it to you as capture group 1 (0 is the whole thing). Could make it more generic with optional whitespace around the place and handle the href somewhere but the first if I got creative.

(Not that I mind a well-constructed FSM. People don't do that enough these days.)

Dd

Thanks Dd! Much appreciated. I think what I might do is benchmark both versions and see which executes faster. But the regex certainly is more elegant. It's a skill I should have either way.

So you think the FSM looks OK? The main purpose of this blog (other than maybe some ad traffic) is just in case I'm ever back in the job market. Not being the hotshot kid is kind of freaking me out, so I not only need to stay sharp, but I need to show the world. I was a lot like Tax and then I blinked one day and I was old.
Freecare Spiritwise
Grand Pontificator
Posts: 3015
Joined: Thu Mar 13, 2003 5:35 pm

Re: Link Scraping Methods

Post by Freecare Spiritwise »

Ddrak wrote:I should add that Expresso is your friend for regexes. And trial-and-error. Lots of trial-and-error. Insane amounts, until your regex-fu is good.

Dd
Awesome, thank you. I definitely need to increase my regex mojo. Though, I'm a big fan of Espresso, so it's hard to see people butcher good coffee terminology haha.
Freecare Spiritwise
Grand Pontificator
Posts: 3015
Joined: Thu Mar 13, 2003 5:35 pm

Re: Link Scraping Methods

Post by Freecare Spiritwise »

Taxious wrote:Nice script there Free! I did mine in PHP with Curl. Did you look at any HTML parsers out there already for C#? This one?
Thanks for the linkage. I had only looked at a few link scraper examples, not full blown HTML parsers. I figured it would be higher performance just to scrape a few tags than parse the whole thing. Though, I guess my parser is touching most of these HTML documents anyway.

I still wonder what's faster to scrape a million pages.
User avatar
Taxious
Rum Guzzler
Posts: 5056
Joined: Fri Apr 18, 2003 10:16 am
Location: Denver, CO

Re: Link Scraping Methods

Post by Taxious »

Freecare Spiritwise wrote:I still wonder what's faster to scrape a million pages.
Do some benchmarks and post it on your blog! :wink:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Freecare Spiritwise
Grand Pontificator
Posts: 3015
Joined: Thu Mar 13, 2003 5:35 pm

Re: Link Scraping Methods

Post by Freecare Spiritwise »

Taxious wrote:
Freecare Spiritwise wrote:I still wonder what's faster to scrape a million pages.
Do some benchmarks and post it on your blog! :wink:
Exactly. I'll do a regex version of the parser class and add a stopwatch to the test app. In fact, I'll put both classes in the test app and the user can choose which version to run. Perfect.

The code blog started blowing up recently, once I got a domain for it. It was an easy 10 bucks spent because Google SEO seems to like my content, proving some of my SEO theories true. In fact, if you Google "best AAA flashlights", I believe my link still comes up #1. A small victory, but still...
Ddrak
Save a Koala, deport an Australian
Posts: 17516
Joined: Thu Jan 02, 2003 3:00 pm
Location: Straya mate!
Contact:

Re: Link Scraping Methods

Post by Ddrak »

Freecare Spiritwise wrote:So you think the FSM looks OK? The main purpose of this blog (other than maybe some ad traffic) is just in case I'm ever back in the job market. Not being the hotshot kid is kind of freaking me out, so I not only need to stay sharp, but I need to show the world. I was a lot like Tax and then I blinked one day and I was old.
Yep, FSM looks fine, right down to separate methods for state changes. I actually find it a sign of maturity in programming that people drop back to a well structured FSM when shit gets real. Junior programmers tend to create infinite quasi-state machines. The difference in RL between falling back to a structured plan or dropping into panic when it gets hard.

Dd
Image
Freecare Spiritwise
Grand Pontificator
Posts: 3015
Joined: Thu Mar 13, 2003 5:35 pm

Re: Link Scraping Methods

Post by Freecare Spiritwise »

Thanks, Dd. Yep, I was that junior programmer, but I had a couple really good mentors.

It's a pretty good life writing well structured code and having to maintain what I write. I don't get the 5 problem phone calls a day that other programmers at my company get from clients. I take pride in my shit working. But it's been too long since I've worked on a "real" team, where everyone on the team is at the same level. And I'm now kind of in a management position, where people are taking my code as gospel and are afraid to tell me if it sucks. I'm starting to feel out of touch. But the blog is helping, and so are you guys, thanks :)
User avatar
Taxious
Rum Guzzler
Posts: 5056
Joined: Fri Apr 18, 2003 10:16 am
Location: Denver, CO

Re: Link Scraping Methods

Post by Taxious »

Every time you describe your work environment Free, I just imagine you sitting in a completely dark room with a padlocked door typing furiously with the family upstairs. It must be nice to be in that manager position though, intimidating the little code monkeys.
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Freecare Spiritwise
Grand Pontificator
Posts: 3015
Joined: Thu Mar 13, 2003 5:35 pm

Re: Link Scraping Methods

Post by Freecare Spiritwise »

Taxious wrote:Every time you describe your work environment Free, I just imagine you sitting in a completely dark room with a padlocked door typing furiously with the family upstairs. It must be nice to be in that manager position though, intimidating the little code monkeys.
Hah! My house is 3 stories and my office is on the main floor next to the kitchen and living room. My mother in law lives in the basement but there's no bathroom there. Every time she comes up to use the bathroom, my 7 dogs flip out because she puts off some weird energy. Usually on a conference call with clients. So my work environment is a living hell of trying to get concentration for long enough between distractions to actually get a little bit of work done.

Sometimes I move up to my top floor man cave with my laptop and tablet if it gets too noisy. I do my blog code work late at night when it's quiet.

Being a manager sucks, at least I don't like it much. My outsourced code monkeys in Manila aren't very good programmers. They try hard, but they also try hard to avoid any human interaction, which is hard enough with them so far away. I will send them an email, and they will only reply when they think I've gone to bed. So, then, a simple discussion spanning several emails takes several days.

The pay is good, but I'd almost trade places with you, starting a job at a new company and the excitement of being in an environment that maybe has some fucking talent and semi-normal people. I like the being a slob and eating ribeye for lunch part of working at home, but that's about it.
Post Reply