Screen Scraping
by Tiberius OsBurn | Published 09/08/2002 | Web Development, Web Services
Tiberius OsBurn

Tiberius OsBurn is a Senior Developer/System Analyst for The Gallup Organization (http://www.gallup.com). He recently completed a huge data warehousing project that archived data and documents from 1935 to the present - all coded in C#, SQL Server and ASP.NET.

Tiberius has extensive experience in VB, VB.NET, C#, SQL Server, ASP.NET and various other web technologies. Be sure to visit his site for his latest articles of interest to .NET developers.

http://tiberi.us

 


Article source code: screen_scrape.zip

Not too long ago, if you wanted some particular information off of a particular web site, you'd have to snake the HTML off a page and incorporate it into yours. Whether you did that manually via cut and paste or with a homegrown process was up to you - usually it involved some pain and misery to get it right.

Even today, as we teeter on the 'new age' of web services, we still have problems getting what we want from our favorite web pages - maybe we need some information that isn't exposed via a web service, and until the Frito chomping, Jolt drinking programmer that wrote the page shuts off 'Star Trek', gets up off the sofa and writes a web service, we'll have to do their job for them.

The idea of screen scraping isn't new; in fact, many unsavory types use some sort of screen scraping to retrieve email addresses and harvest images from unsuspecting sites. This is common practice on the web, and one that is nefarious and ill received by most of the Internet community.

No, I'm not going to show you how to screen scrape email addresses off of pages, so don't ask me - instead, we'll do a little constructive scraping in order to put more content out on the web.

A word of caution:

In reality, you can scrape ANY site on the web. Now, just a quick warning, this may not be the most 'legal' thing to do, especially if you haven't received permission from the owner of the content. Just make sure that you get the 'okey-dokey' from the owner of the content if you are going to redistribute their content.

Coding Offensively and Defensively

Over the years I picked up a nice habit of adding comments to my HTML code. I'd always get lost in the many table and td tags, so I'd mark off sections of HTML with a begin and end comment. For instance, the section on my site called 'HIP' is marked off with <!-- BEGIN HIP --> and <!-- END HIP -->. What we want to scrape is whatever is between those HTML comments: the layout and images of that section.

I'll go on record saying that if you mark off your HTML code, you'll have a hell of a lot easier time setting up a screen scraper for your site. If you'd rather not have scrapers snatching information off your site, I would strongly encourage you to 'bunch' up your code - making it as difficult to scrape as possible - in other words, don't format your code and don't add comments. One of the easiest ways to ward off a 'scraper' is to put your entire HTML (or the HTML output) on ONE line. This'll keep even the most ardent of content scrapers busy for hours scouring your code for a nice break.

Remember, you can scrape ANY site, so if you don't like that idea, you'll have to take measures to ensure that it's more pain than gain.

Viewing Source

If you want to scrape, you'll have to view the HTML source of the site. Let's take a quick look at the source of my default.aspx page...

<!-- BEGIN HIP -->
<tr>
<td align="left" valign="center" width="100"><br>
<br>
<IMG src="http://tiberi.us/images/hip.gif">
</td>
</tr>
<tr>
<td>
<IMG SRC="http://tiberi.us/images/hip/microsoftphone.gif">
This is very, very sweet...
Microsoft's new phone, the Pocket PC Phone Edition is sure to
ring your bell.
<br>
<a href="http://www.microsoft.com/mobile/pocketpc/phoneedition/">
Pocket PC Phone</a>
</td>
</tr>
<!-- END HIP -->

Here we can clearly see where my 'HIP' section begins and ends. This is important, because if you want to capture the content on a site, you'll have to find where it begins and where it ends. Look hard for a unique demarcation - somewhere there is a clear beginning to the content and a clear ending - or you'll end up with a lot of garbage that you don't want.
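Before writing any pattern, it can help to confirm that the markers you picked really are unique in the page. The following is only a sketch: the class name, the CountOccurrences helper and the sample HTML string are made up for illustration and are not part of the article's download.

```csharp
using System;

public static class MarkerCheck
{
    // Hypothetical helper: counts non-overlapping, case-insensitive
    // occurrences of marker inside html.
    public static int CountOccurrences(string html, string marker)
    {
        if (string.IsNullOrEmpty(marker))
            throw new ArgumentException("marker must not be empty", "marker");

        int count = 0;
        int index = 0;
        while ((index = html.IndexOf(marker, index, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            index += marker.Length; // continue searching after this hit
        }
        return count;
    }

    public static void Main()
    {
        // Made-up sample page, standing in for the HTML you viewed.
        string html = "<table><!-- BEGIN HIP --><tr><td>Hi</td></tr><!-- END HIP --></table>";

        // A marker is only a safe anchor if it shows up exactly once.
        Console.WriteLine(CountOccurrences(html, "<!-- BEGIN HIP -->"));
        Console.WriteLine(CountOccurrences(html, "<!-- END HIP -->"));
    }
}
```

If either count comes back as anything other than 1, pick a different pair of markers before going further.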

Once you've become familiar with the HTML source, you're ready to craft a regular expression.

Firing up RegEx

So, with that in mind, we'll fire up the regular expression object, Regex, and parse out the Hip section quite painlessly.

If you're not a fan of Regular Expressions, you soon will be. If you've been a Java or C++ programmer, you've been spoiled by how nice regular expressions are. If you were a Visual Basic programmer, you were stuck with some crappy OCX or a DLL Library or regular expressions in VBScript that didn't quite work right. Now that .NET is on the scene, have no fear - you'll be using RegEx plenty.

Let's take a peek at our regular expression that we use to get out the content we want from tiberi.us:

Regex regex = new Regex("<!-- BEGIN HIP -->((.|\n)*?)<!-- END HIP -->",
    RegexOptions.IgnoreCase);

Look confusing? Naw. It's simple.

We want to get out whatever is between <!-- BEGIN HIP --> and <!-- END HIP -->. The ((.|\n)*?) part of the expression, as foreign and weird as it looks, actually isn't that bad.

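The period matches any character except a newline, and the | (alternation) followed by \n adds the newline back in, so (.|\n) matches any character at all, line breaks included. The asterisk means zero or more occurrences, and the question mark after it makes the match lazy: the engine grabs as little as it can, so it stops at the first <!-- END HIP --> rather than the last.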
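To see what the trailing question mark buys you, here is a small sketch that runs the same pattern with and without it. The sample string, the class name and the FirstGroup helper are invented for this illustration; RegexOptions.Singleline is a standard .NET option that lets a plain dot match newlines, which is one alternative to writing (.|\n).

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVsGreedyDemo
{
    // Made-up sample: two marked sections, the first one spanning two lines.
    public const string SampleHtml =
        "<!-- BEGIN HIP -->one\ntwo<!-- END HIP -->junk" +
        "<!-- BEGIN HIP -->three<!-- END HIP -->";

    // Returns what the first capture group caught, or null if nothing matched.
    public static string FirstGroup(string pattern, RegexOptions options)
    {
        Match m = Regex.Match(SampleHtml, pattern, options);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Lazy (*?): stops at the first END marker.
        string lazy = FirstGroup("<!-- BEGIN HIP -->((.|\n)*?)<!-- END HIP -->",
            RegexOptions.IgnoreCase);

        // Greedy (*): runs on to the last END marker and swallows "junk" too.
        string greedy = FirstGroup("<!-- BEGIN HIP -->((.|\n)*)<!-- END HIP -->",
            RegexOptions.IgnoreCase);

        // Singleline: a plain dot now matches newlines, so (.|\n) is not needed.
        string single = FirstGroup("<!-- BEGIN HIP -->(.*?)<!-- END HIP -->",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        Console.WriteLine("lazy:   [" + lazy + "]");
        Console.WriteLine("greedy: [" + greedy + "]");
        Console.WriteLine("single equals lazy: " + (single == lazy));
    }
}
```

The lazy and Singleline versions capture only the first section, while the greedy one runs through to the final END marker and drags the text in between along with it.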

It's beyond the scope of this article to delve too deep into regular expressions, but there are plenty of resources out there if you'd like to learn more.

Getting down to Business

If we look at our code, you'll see that we're using a StreamReader, the web Request and Response objects and the ubiquitous Regex object.

Coding our Screen Scraper:

//Needs: using System.IO; using System.Net; 
//       using System.Text.RegularExpressions;
private string getHip() {

    string strContent;

    //Here's the work horse of what we're doing, the WebRequest object 
    //fetches the URL
    WebRequest objRequest = WebRequest.Create("http://tiberi.us");

    //The WebResponse object gets the Request's response (the HTML). 
    //The using blocks close the response and the reader when we're done.
    using (WebResponse objResponse = objRequest.GetResponse())
    //Now dump the contents of our HTML in the Response object to a 
    //Stream reader
    using (StreamReader oSR = new StreamReader(objResponse.GetResponseStream()))
    {
        //And dump the StreamReader into a string...
        strContent = oSR.ReadToEnd();
    }

    //Here we set up our Regular expression to snatch what's between the 
    //BEGIN and END
    Regex regex = new Regex("<!-- BEGIN HIP -->((.|\n)*?)<!-- END HIP -->",
        RegexOptions.IgnoreCase);

    //Here we apply our regular expression to our string using the 
    //Match object. 
    Match oM = regex.Match(strContent);

    //Bam! We return the value from our Match (the two marker comments 
    //included, or an empty string if nothing matched), and we're in business. 
    return oM.Value;
}

I've done some pretty liberal commenting, so you'll be able to figure out what's going on. I fill my WebRequest object with the URL to my site, then fill the WebResponse object up with the resultant HTML. After that, I dump the WebResponse object into a StreamReader and then into a string, which is, in turn, parsed by the regular expression engine.

Not much to it, is there?
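If you'd like to reuse the idea with other markers, one possible variation is to pull the matching step out into a helper that works on any HTML string. This is a sketch only: the Scraper class and ExtractBetween method are hypothetical names, not part of the article's download. It escapes the markers with Regex.Escape, uses RegexOptions.Singleline in place of (.|\n), and returns just the captured group, so the marker comments themselves are left out.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Scraper
{
    // Hypothetical helper: returns the text between begin and end,
    // or null when the markers are not found in html.
    public static string ExtractBetween(string html, string begin, string end)
    {
        // Regex.Escape keeps characters in the markers from being read as regex syntax.
        string pattern = Regex.Escape(begin) + "(.*?)" + Regex.Escape(end);

        Match m = Regex.Match(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Groups[1] is only the content; m.Value would include both markers.
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Made-up sample, standing in for the string read from the StreamReader.
        string html = "<table><!-- BEGIN HIP --><tr>\n<td>Hi</td></tr><!-- END HIP --></table>";

        string hip = ExtractBetween(html, "<!-- BEGIN HIP -->", "<!-- END HIP -->");
        Console.WriteLine(hip ?? "(markers not found)");
    }
}
```

Returning null on a miss lets the caller tell "the markers disappeared from the page" apart from "the section is empty", which is worth knowing when the site you scrape changes its layout.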

Web Service

Now that we've done the tough part, we can have a little cake with our code. Transforming a method into a full-blown web service is simple. Essentially, all we need to do is whip a [WebMethod] attribute above a public method in our web service class and, magically, we have a web service ready for the world to use.

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;
using System.Web.Services;

namespace screenscrape {

    public class getHip : System.Web.Services.WebService {

        //[WebMethod] is what exposes this public method to web service callers
        [WebMethod]
        public string getHipWS() {
            string strURL = "http://tiberi.us";
            string strContent;
            WebRequest objRequest = WebRequest.Create(strURL);

            //The using blocks close the response and the reader when we're done
            using (WebResponse objResponse = objRequest.GetResponse())
            using (StreamReader oSR = new StreamReader(objResponse.GetResponseStream())) {
                strContent = oSR.ReadToEnd();
            }

            Regex regex = new Regex("<!-- BEGIN HIP -->((.|\n)*?)<!-- END HIP -->",
                RegexOptions.IgnoreCase);
            Match oM = regex.Match(strContent);

            //Wrap the scraped rows in a table so they render on their own
            return "<table width=100 border=0 align=center>" + oM.Value + "</table>";
        }

    }

}

Please feel free to download the code for this project.

Finally, if you're interested in doing some serious screen scraping, I'd suggest that you bone up on regular expressions - you'll need them. Most sites have a usage policy that you'll have to slog through as well; you don't want to be getting threatening email from angry lawyers and livid webmasters.

Comments

Comment #1  (Posted by Justin Magaram on 01/07/2003)

I've experimented with this a bit and find it to be far more complicated than I expected. For example, suppose the site you are going to tries to redirect you to a different URL. This is very common on the more popular sites. Try www.budget.com or www.etrade.com. When I tried downloading from the initial URL I got a somewhat useless web page back with redirection instructions. OK, so now I have to parse the page I received for redirection instructions and issue a second download. In the end I found it easier to automate Internet Explorer. Am I overstating the complications? I'd be interested in having an e-mail conversation with someone about this.
 
Comment #2  (Posted by Tiberius on 01/08/2003)

Yes, Screen Scraping isn't easy. It does take a lot of elbow grease to get what you want.
 
Comment #3  (Posted by Joe Crawford on 01/15/2003)

anyone here that knows where an article just like this one is but in VB?

i use vb and i tried to convert the C# however i don't know a thing about c# so i have failed.

anyone willing to convert this for me and post it as a comment here?

Joe
 
Comment #4  (Posted by pete b on 01/16/2003)

Joe,

I am not sure if you are referring to the creation of a web service or screen scraping, but here is an article that covers screen scraping that has c#, vb, and javascript versions that do the same thing...

http://www.aspalliance.com/stevesmith/articles/netscrape.asp

enjoy!
 
Comment #5  (Posted by Sinead on 01/21/2003)

Might sound a bit stupid but do you put this in your HTML doc?
 
Comment #6  (Posted by Priyanka on 02/09/2003)

I was wondering if it is possible to determine a url once a page has been loaded. I am parsing a site for specific information, one of which is a station url. This url is not given directly but is given as a database lookup like http://www.comfm.com/php/radio/?id=26607
when the above url is loaded, automatically the url is resolved and loads the following url which is
http://www.clando.fr.fm/

Is there a way to determine the url after the page has been loaded, in my case clando.fr.fm? I tried using
HttpWebResponse.Server, which just gives me the type of server, meaning "apache"

thanks


 
Comment #7  (Posted by ramprasad on 03/19/2003)

Is there any way to save the image into the drive? Pl help me out in this regard.. Thanks in advance..Ram
 
Comment #8  (Posted by Gregg on 04/08/2003)

What is the best way to parse the url that is returned on a webresponse? When its a posted url, how do I retrieve the values out of a name/value pair?

Thanks
 
Comment #9  (Posted by Michael Orozco on 03/02/2004)

Looking at this code, it sure does look a lot like java. With that being said, can you do this also in java?
 
Comment #10  (Posted by Tiberius on 03/03/2004)

You bet... You could easily convert this over to be Java code.
C# and Java are like cousins.
 
Comment #11  (Posted by mcgants on 03/09/2004)

do you have a version of this in VB.NET? i am interested in researching screen scraping as i am researching VB.NET web services at the moment and find this a good demonstration of useful technologies.
 
Comment #12  (Posted by Morten Solberg on 04/02/2004)

I am trying to find the same type of code but in java, i am having difficulty finding any screen scraping code in java, can anyone help me? Just contact me using my email...
 
Comment #13  (Posted by Venkatesh on 04/07/2004)

Hi,
Request.Url.ToString();
 
Comment #14  (Posted by Venkatesh on 04/07/2004)

Hi,
you can get URL using this.

Request.Url.ToString();


 
Comment #15  (Posted by Brian Gaines on 07/26/2004)

I tried implementing the code, but I get the following exception ("The remote server returned an error: (401) Unauthorized."). I am using this technique for an internal application. Do you know the security levels/permissions I should use on the directory which contains the page I am scraping? Thanks

 
Comment #16  (Posted by Cary S. Magaram on 08/03/2004)

Hello, my name is Cary S. Magaram. I am a 20 year old college student living in Staten Island, NY. For you west coasters, ( I think that's what you guys like to be called) Staten Island is one of the 5 boroughs comprising NYC. Yes, NYC is more than just Manhattan and Brooklyn. Anyways, I don't have anything to contribute on the subject of screen scraping so to anyone but Justin Magaram who is reading this, I apologize. I was surfing the web searching random stuff when I came up with the idea to do some research on my family etymology. I was astonished to learn that there are people on the West Coast of the United States with my same, unique last name. Up until now, I have never met another person with the same family last name I have- that is until now. I am curious if we are relatives somewhere along the line or if this all just a strange coincidence. Justin, please reply back to me at magaman69@earthlink.net if you would like to set up a connection to a possible relative on the east coast. Thank you all for your time. I hope to hear from you, Justin very soon.
 
Comment #17  (Posted by Code Not ready4Newbs on 08/25/2004)

Great article but I couldn't get your code in default.aspx.cs to work as downloaded. So 4 any other newbies out there, here's what I had to do:

I called gethip() instead of gethipWS() in line 28
Removed the response.end in line 29

Then proceeded to:
Setup the folder this app would reside in as an "Application" in IIS Admin
Added all files to a new project in VS and rebuilt the whole thing.

Then it worked.


 
Comment #18  (Posted by dev on 09/28/2004)

in the default.aspx.cs file
oztop.getHip oGH = new oztop.getHip();
Response.Write(oGH.getHipWS("http://tiberi.us").ToString());

i am getting the error

"oztop" namespace is not found and the method "getHipWS" does not take any argument
please help me
 
Comment #19  (Posted by K Mills on 10/20/2004)

How would you go about getting a page that requires a form submission for a username