Hey guys,
This is my first tutorial on writing web bots and scrapers! Today we'll be writing a basic bot that runs from the command line; it accepts one parameter (a subreddit), scrapes the front page of that subreddit, and prints out all of the post titles and the links to those posts. We'll be working in PHP and using the webBot.class.php file I've posted in the zentrixplus pastebin, so grab a copy and let's get started.
A couple of quick things to note: when writing web bots you have to forget a lot of the preconceptions you have about how web pages work based on your experience with a browser. Often things like JavaScript can be avoided entirely, so if you see some really intricate AJAX-y interface and aren't sure what to do, take a step back and bust out a program like Live HTTP Headers, because everything we do can be broken down into the core HTTP requests being sent back and forth. For this tutorial we'll only be focusing on GET-based requests, so things should be pretty straightforward.
So the first thing we want to do is parse our options and set the URL of the page we want to scrape. You'll find when scraping web pages that the vast majority of the time the HTML looks like a monkey mashed it out on a keyboard, which can make extracting the specific data you want difficult. To make things easier on ourselves for this tutorial we're going to use Reddit's RSS feed. The first part of our code will be:
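Something along these lines (the exact method name for fetching a page depends on webBot.class.php, so check the class and adjust):

<?php
// Pull in the bot class so we can create webBot objects
include 'webBot.class.php';

// Grab -s <subreddit> from the command line
$options   = getopt('s:');
$subreddit = $options['s'];

// Tacking /.rss on the end gets us a much cleaner document to parse
$url = 'http://www.reddit.com/r/' . $subreddit . '/.rss';

// Create the bot and scrape the page; the returned markup lands in $page
// ($bot->get() is a guess at the class's fetch method - use whatever
// webBot.class.php actually calls it)
$bot  = new webBot();
$page = $bot->get($url);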
A quick overview of what's being done here:
We're including our webBot so that we can create webBot objects in this script.
Next we grab the parameter -s from argv and assign its value to $options['s'] (this is our subreddit name), then to simplify things we create an alias to that value called $subreddit.
Now that we have our subreddit we can craft our URL, appending /.rss on the end to make things a little more parseable.
Next we create an instance of the webBot object, $bot, and scrape our URL; the returned markup is stored in $page.
As far as making web requests goes, we're done. We now have all the data we need for this particular script. Easy, huh? But getting the data is only half the work (usually less than half). Pick a subreddit to test with (circlejerk is always a good time) and load the feed in your browser manually (don't forget to add /.rss to the end of the URL). Now view the source of the page and look up one of the post titles you saw in the browser; try to find common anchors that appear ONLY around the post titles.
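Whatever anchor you settle on, the mechanics look something like this (a rough sketch with preg_match_all; if webBot.class.php has its own parse-into-array helper, that works just as well):

// Pull every <title>...</title> block into an array, tags included,
// so we can strip them off later with return_between().
// If the feed's channel-level <title> sneaks in, tighten the anchor so
// it only matches the post titles.
preg_match_all('#<title>.*?</title>#s', $page, $matches);
$titles = $matches[0];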
OK, so titles were easy; links are going to be a tad more difficult. If you try using just <link></link> as your tags you'll find you end up with a few more links than you have titles, and that your links are offset from the titles when you go to print them out. Look at what we did with the titles and think hard; try to solve this problem on your own. If you completely give up, the solution is below.
Spoiler
We need to include the <link> tag as well as the URL up through /$subreddit/comments/ to ensure that we are ONLY grabbing the links to the posts listed and not a link to the subreddit itself.
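Something like this (again, preg_match_all is standing in for whatever parse helper the class provides):

// Anchor on the /$subreddit/comments/ portion of the URL so we only
// grab links that point at individual posts
$start = '<link>http://www.reddit.com/r/' . $subreddit . '/comments/';
preg_match_all('#' . preg_quote($start, '#') . '.*?</link>#s', $page, $matches);
$links = $matches[0];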
At this point the vast majority of our work is done. For something like this, where there are two values that I need to line up, I usually include a check just to make sure everything's on the up and up:
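Nothing fancy, just a count comparison:

// If the two arrays don't line up one-to-one, something about the page
// layout changed and we shouldn't trust the output
if (count($titles) !== count($links)) {
    die("Title/link count mismatch\n");
}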
Now all that's left is our cleanup and presentation which can be done with a simple for loop:
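Roughly like this (I'm going from memory on return_between()'s argument order, so check webBot.class.php and adjust; if it's a method rather than a plain function, call it as $bot->return_between()):

// $i is the counter shown to the user (1-25), $x walks the arrays (0-24)
for ($i = 1, $x = 0; $x < count($titles); $i++, $x++) {
    // The array element itself is the haystack; return_between() strips
    // the surrounding tags off
    $title = return_between($titles[$x], '<title>', '</title>');
    // Swap the &quot; entities back into real quote characters
    $title = str_replace('&quot;', '"', $title);
    $link  = return_between($links[$x], '<link>', '</link>');
    print "$i: $title\n   $link\n";
}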
Nothing particularly dumbfounding here. The purpose of $i and $x is to maintain two iterators: one is a visual counter that starts at 1 and goes to 25 (this is the number displayed in the print statements to the user); the other is the actual iterator through the arrays and runs from 0 to 24. For each value in the $titles and $links arrays we use return_between() to remove the surrounding text; notice we can use the array values themselves as the haystack to keep things simple. In $titles we also do a quick str_replace() to substitute any instances of &quot; with an actual quote character.
So our whole bot put together becomes:
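Here's the full thing, with the same caveat that the webBot method names and return_between() signature are my best guesses at the class's API:

<?php
include 'webBot.class.php';

// -s <subreddit> from the command line
$options   = getopt('s:');
$subreddit = $options['s'];

// /.rss gives us a much cleaner document to parse than the HTML page
$url = 'http://www.reddit.com/r/' . $subreddit . '/.rss';

// Fetch the feed ($bot->get() is an assumed method name)
$bot  = new webBot();
$page = $bot->get($url);

// Titles: every <title>...</title> block, tags included; tighten the
// anchor if the channel's own <title> sneaks in
preg_match_all('#<title>.*?</title>#s', $page, $m);
$titles = $m[0];

// Links: anchor on /$subreddit/comments/ so we skip the feed's own
// <link> back to the subreddit itself
$start = '<link>http://www.reddit.com/r/' . $subreddit . '/comments/';
preg_match_all('#' . preg_quote($start, '#') . '.*?</link>#s', $page, $m);
$links = $m[0];

// The two arrays should line up one-to-one
if (count($titles) !== count($links)) {
    die("Title/link count mismatch\n");
}

// $i is the counter shown to the user (1-25), $x walks the arrays (0-24)
for ($i = 1, $x = 0; $x < count($titles); $i++, $x++) {
    $title = return_between($titles[$x], '<title>', '</title>');
    $title = str_replace('&quot;', '"', $title);
    $link  = return_between($links[$x], '<link>', '</link>');
    print "$i: $title\n   $link\n";
}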
I hope this was helpful for you guys
I'll write up another one that will go through automating the login and messaging process using POST requests!