Link Scraping Methods

Freecare Spiritwise · Post by **Freecare Spiritwise** » Sun May 05, 2013 1:31 pm

I saw Tax mention working on a link scraper, and I thought it might be a good blog post, so I wrote some C# code. I went old school with a one-pass, finite state machine parser. Then I got to thinking that it looked primitive and maybe I should be using regular expressions.

Then I looked at some code that uses regular expressions and it looked inefficient. If I wanted to scrape multiple tags then I would need lots of passes with the regular expressions. And what if I was parsing malformed HTML? The code I looked at wouldn't handle that. If I'm looking at scraping with lots of recursion, this could could be executed millions of times.

Here's the rough draft of my parser method, which takes a block of HTML as a string and builds a list of links. The next version will scrape links out of content too, not just tags. Anything that looks like a link and can be verified.

Thoughts? I'm inclined to think my code would execute faster than plucking the tags I want out using regex. It seems to work well.

Code: Select all

public void ParsePage(string scrapeUrl, string page)
        {
            try
            {
                string curTagName = "";
                string curAttrName = "";
                string curAttrVal = "";

                Dictionary<string, string>   attributes = null;

                foreach (char c in page)
                {
                    switch (c)
                    { 
                        case '<':
                            // Change state regardless of current state.
                            NewState(WebScraperStates.InTag);
                            attributes = new Dictionary<string, string>();
                            curTagName = "";
                            break;

                        case '>':
                            NewState(WebScraperStates.InContent);
                            string tagName = curTagName.ToLower();

                            if (( tagName == "a" || tagName == "img") && attributes != null && attributes.Count > 0)
                            {
              
                                try
                                {
                                    string link = "";

                                    if (tagName == "a")
                                    {
                                        link = attributes["href"];
                                    }
                                    if (tagName == "img")
                                    {
                                        link = attributes["src"];                                    
                                    }

                                    if (link.Length > 0)
                                    {
                                        link = link.StartsWith("//") ? "http:" + link : link;
                                        link = link.StartsWith("/") ? scrapeUrl + link : link;

                                        if (!link.StartsWith("http://") && !link.StartsWith("https://"))
                                        {
                                            link = MergeUrl(scrapeUrl, link);
                                        }

                                        NewWebLink(link, tagName, "", "", "");
                                    }
                                }
                                catch (Exception)
                                {
                             
                                }
                             }

                            if (tagName == "meta")
                            { 
                            
                            }

                            break;

                        case '=':
                            if (ScrapeState == WebScraperStates.InTagAttr)
                            {
                                NewState(WebScraperStates.InAttrValStart);
                            }
                            break;

                        case '\"':
                            switch (ScrapeState)
                            { 
                                case WebScraperStates.InAttrValStart:
                                    NewState(WebScraperStates.InAttrValContent);
                                    curAttrVal = "";
                                    break;

                                // Add a new attribute to the list.
                                case WebScraperStates.InAttrValContent:
                                    NewState(WebScraperStates.InTag);
                                    attributes.Add(curAttrName, curAttrVal);
                                    break;
                        
                            }
                            break;

                        case ' ':
                            if (ScrapeState == WebScraperStates.InTag)
                            {
                                // Have the tag name, now get the attributes.
                                NewState(WebScraperStates.InTagAttr);
                                curAttrName = "";
                            }
                            break;
                    
                        default:
                            switch (ScrapeState)
                            { 
                                // Add the character to the tag name.
                                case WebScraperStates.InTag:
                                    curTagName += c;
                                    break;

                                // Add the character to the attribute name.
                                case WebScraperStates.InTagAttr:
                                    curAttrName += c;
                                    break;

                                // Add the character to the attribute value.
                                case WebScraperStates.InAttrValContent:
                                    curAttrVal += c;
                                    break;

                            }
                            break;

                    }
                 }
            }
            catch (Exception)
            {
              
            }
        }

Here's some regex code I found. Seems like a lot of passes I want to pull <img> and <meta> tags, too.

Code: Select all

static class LinkFinder
{
    public static List<LinkItem> Find(string file)
    {
	List<LinkItem> list = new List<LinkItem>();

	// 1.
	// Find all matches in file.
	MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
	    RegexOptions.Singleline);

	// 2.
	// Loop over each match.
	foreach (Match m in m1)
	{
	    string value = m.Groups[1].Value;
	    LinkItem i = new LinkItem();

	    // 3.
	    // Get href attribute.
	    Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
		RegexOptions.Singleline);
	    if (m2.Success)
	    {
		i.Href = m2.Groups[1].Value;
	    }

	    // 4.
	    // Remove inner tags from text.
	    string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
		RegexOptions.Singleline);
	    i.Text = t;

	    list.Add(i);
	}
	return list;
    }
}

Freecare Spiritwise · Post by **Freecare Spiritwise** » Sun May 05, 2013 4:54 pm

It's looking good. I may have the blog post up tonight. One thing my little app does is request every link and grab its content type. It's also recursive and you can set the search depth. I've never tested it past 2, which has gotten me > 10K links on a couple sites before I hit the stop button.

Here's a sample of what the blog post app will output. I ran it on a domain I own.

Freecare Spiritwise · Post by **Freecare Spiritwise** » Sun May 05, 2013 6:24 pm

I got the blog post finished:

http://www.coderskitchen.com/2013/05/c- ... raper.html

Post by **Ddrak** » Mon May 06, 2013 2:54 am

On the regex side, you should be using capturing groups and alternation:

Code: Select all

<(?:a\s+href=|img\s+src=|link\s+href=)"(.*?)".*>

will strip out the url component from any "a href", "img src" and "link href" tag and return it to you as capture group 1 (0 is the whole thing). Could make it more generic with optional whitespace around the place and handle the href somewhere but the first if I got creative.

(Not that I mind a well-constructed FSM. People don't do that enough these days.)

Dd

Post by **Ddrak** » Mon May 06, 2013 2:58 am

I should add that Expresso is your friend for regexes. And trial-and-error. Lots of trial-and-error. Insane amounts, until your regex-fu is good.

Dd

Taxious · Post by **Taxious** » Mon May 06, 2013 9:35 am

Nice script there Free! I did mine in PHP with Curl. Did you look at any HTML parsers out there already for C#? This one?

Ddrak wrote:I should add that Expresso is your friend for regexes.

Oh, nice! I'm such a slow regex writer, this could help a lot.

Freecare Spiritwise · Mon May 06, 2013 10:02 am

Ddrak wrote:On the regex side, you should be using capturing groups and alternation:
Code: Select all
<(?:a\s+href=|img\s+src=|link\s+href=)"(.*?)".*>
will strip out the url component from any "a href", "img src" and "link href" tag and return it to you as capture group 1 (0 is the whole thing). Could make it more generic with optional whitespace around the place and handle the href somewhere but the first if I got creative.

(Not that I mind a well-constructed FSM. People don't do that enough these days.)

Dd

Thanks Dd! Much appreciated. I think what I might do is benchmark both versions and see which executes faster. But the regex certainly is more elegant. It's a skill I should have either way.

So you think the FSM looks OK? The main purpose of this blog (other than maybe some ad traffic) is just in case I'm ever back in the job market. Not being the hotshot kid is kind of freaking me out, so I not only need to stay sharp, but I need to show the world. I was a lot like Tax and then I blinked one day and I was old.

Freecare Spiritwise · Mon May 06, 2013 10:16 am

Ddrak wrote:I should add that Expresso is your friend for regexes. And trial-and-error. Lots of trial-and-error. Insane amounts, until your regex-fu is good.

Dd

Awesome, thank you. I definitely need to increase my regex mojo. Though, I'm a big fan of Espresso, so it's hard to see people butcher good coffee terminology haha.

Freecare Spiritwise · Mon May 06, 2013 10:20 am

Taxious wrote:Nice script there Free! I did mine in PHP with Curl. Did you look at any HTML parsers out there already for C#? This one?

Thanks for the linkage. I had only looked at a few link scraper examples, not full blown HTML parsers. I figured it would be higher performance just to scrape a few tags than parse the whole thing. Though, I guess my parser is touching most of these HTML documents anyway.

I still wonder what's faster to scrape a million pages.

Taxious · Post by **Taxious** » Mon May 06, 2013 11:16 am

Freecare Spiritwise wrote:I still wonder what's faster to scrape a million pages.

Do some benchmarks and post it on your blog!

Freecare Spiritwise · Post by **Freecare Spiritwise** » Mon May 06, 2013 1:44 pm

Taxious wrote:
Freecare Spiritwise wrote:I still wonder what's faster to scrape a million pages.
Do some benchmarks and post it on your blog!

Exactly. I'll do a regex version of the parser class and add a stopwatch to the test app. In fact, I'll put both classes in the test app and the user can choose which version to run. Perfect.

The code blog started blowing up recently, once I got a domain for it. It was an easy 10 bucks spent because Google SEO seems to like my content, proving some of my SEO theories true. In fact, if you Google "best AAA flashlights", I believe my link still comes up #1. A small victory, but still...

Post by **Ddrak** » Mon May 06, 2013 3:49 pm

Freecare Spiritwise wrote:So you think the FSM looks OK? The main purpose of this blog (other than maybe some ad traffic) is just in case I'm ever back in the job market. Not being the hotshot kid is kind of freaking me out, so I not only need to stay sharp, but I need to show the world. I was a lot like Tax and then I blinked one day and I was old.

Yep, FSM looks fine, right down to separate methods for state changes. I actually find it a sign of maturity in programming that people drop back to a well structured FSM when shit gets real. Junior programmers tend to create infinite quasi-state machines. The difference in RL between falling back to a structured plan or dropping into panic when it gets hard.

Dd

Freecare Spiritwise · Post by **Freecare Spiritwise** » Mon May 06, 2013 8:41 pm

Thanks, Dd. Yep, I was that junior programmer, but I had a couple really good mentors.

It's a pretty good life writing well structured code and having to maintain what I write. I don't get the 5 problem phone calls a day that other programmers at my company get from clients. I take pride in my shit working. But it's been too long since I've worked on a "real" team, where everyone on the team is at the same level. And I'm now kind of in a management position, where people are taking my code as gospel and are afraid to tell me if it sucks. I'm starting to feel out of touch. But the blog is helping, and so are you guys, thanks

Taxious · Post by **Taxious** » Tue May 07, 2013 10:18 am

Every time you describe your work environment Free, I just imagine you sitting in a completely dark room with a padlocked door typing furiously with the family upstairs. It must be nice to be in that manager position though, intimidating the little code monkeys.

Freecare Spiritwise · Tue May 07, 2013 12:09 pm

Taxious wrote:Every time you describe your work environment Free, I just imagine you sitting in a completely dark room with a padlocked door typing furiously with the family upstairs. It must be nice to be in that manager position though, intimidating the little code monkeys.

Hah! My house is 3 stories and my office is on the main floor next to the kitchen and living room. My mother in law lives in the basement but there's no bathroom there. Every time she comes up to use the bathroom, my 7 dogs flip out because she puts off some weird energy. Usually on a conference call with clients. So my work environment is a living hell of trying to get concentration for long enough between distractions to actually get a little bit of work done.

Sometimes I move up to my top floor man cave with my laptop and tablet if it gets too noisy. I do my blog code work late at night when it's quiet.

Being a manager sucks, at least I don't like it much. My outsourced code monkeys in Manila aren't very good programmers. They try hard, but they also try hard to avoid any human interaction, which is hard enough with them so far away. I will send them an email, and they will only reply when they think I've gone to bed. So, then, a simple discussion spanning several emails takes several days.

The pay is good, but I'd almost trade places with you, starting a job at a new company and the excitement of being in an environment that maybe has some fucking talent and semi-normal people. I like the being a slob and eating ribeye for lunch part of working at home, but that's about it.

Brell Rants

Link Scraping Methods

Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods

Re: Link Scraping Methods