What we're going to do is create a class for instantiating a web bot capable of making POST and GET requests and returning the output. It will also include several subroutines for parsing the returned data, taken from Michael Schrenk's book 'Webbots, Spiders, and Screen Scrapers'. The bot will let you specify a custom user agent and has proxy support, and with this class, writing much more complex scrapers becomes extremely easy. I've posted something similar in the pastebin before, but in this tutorial we'll build it step by step so that it makes a little more sense. I always write bots as CLI scripts, so reading my previous tutorial on the basics of PHP CLI can be a benefit, but it isn't required.
Creating our class
If you're not familiar with Object Oriented Programming, some of these concepts may seem a little weird at first, but stick with it and it'll make sense. Think of a class as a custom data type that you define. That isn't exactly true, but it helps to visualize it: you create a variable of your class's type, which then has access to all of the properties (variables) and methods (functions) that the class holds. This isn't meant to be a tutorial on OOP, so I won't go into a lot of detail, but things should be fairly straightforward with some examples.
The first thing we'll need to do after declaring our class is to declare our basic properties:
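The original listing isn't reproduced here, so this is a minimal sketch of what such a declaration might look like; the class and property names are my own assumptions, not necessarily the ones from the book:

```php
<?php
// Sketch of the bot class declaration. Property names here are
// illustrative assumptions, not necessarily the book's.
class WebBot
{
    // Flags used by the parsing methods to include/exclude delimiters
    const INCL = true;
    const EXCL = false;

    public $user_agent = 'Mozilla/5.0 (compatible; WebBot/1.0)'; // custom UA
    public $proxy      = '';   // e.g. '127.0.0.1:8080'; empty = no proxy
    public $timeout    = 30;   // request timeout in seconds
    public $last_url   = '';   // last URL fetched, handy for debugging
}

$bot = new WebBot();
echo $bot->user_agent, "\n";
```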
I'll give a brief overview of these and highlight some of the more useful ones:
split_string()
This method takes a string and a delineator and returns either all of the text before the marker or all of the text after it. It can be set to include or exclude the marker itself via the INCL/EXCL constants.
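A sketch of how split_string() might work; the BEFORE/AFTER and INCL/EXCL constants mirror the description above, though their exact values in the book's library may differ:

```php
<?php
// Sketch of split_string(): return text before or after $delineator,
// optionally keeping the delineator itself.
define('BEFORE', true);
define('AFTER',  false);
define('INCL',   true);
define('EXCL',   false);

function split_string($string, $delineator, $desired, $type)
{
    $pos = stripos($string, $delineator);        // case-insensitive search
    if ($pos === false) {
        return $string;                          // marker not found
    }
    if ($desired === BEFORE) {
        $end = ($type === INCL) ? $pos + strlen($delineator) : $pos;
        return substr($string, 0, $end);
    }
    $start = ($type === INCL) ? $pos : $pos + strlen($delineator);
    return substr($string, $start);
}

echo split_string('name=value', '=', BEFORE, EXCL), "\n"; // name
echo split_string('name=value', '=', AFTER,  EXCL), "\n"; // value
```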
return_between()
This method accepts a string, two anchors, and a boolean value as parameters. The string is parsed and the text at the first occurrence of the opening-anchor/closing-anchor pair is returned. The boolean at the end determines whether the anchors are included in the returned string: a value of true excludes the anchors, a value of false includes them.
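A sketch of that behavior, built from strpos/substr (the real implementation may differ):

```php
<?php
// Sketch of return_between(): first substring bounded by $start/$stop.
// $exclude = true drops the anchors, false keeps them, per the text above.
function return_between($string, $start, $stop, $exclude)
{
    $s = stripos($string, $start);
    if ($s === false) return '';
    $e = stripos($string, $stop, $s + strlen($start));
    if ($e === false) return '';

    if ($exclude) {
        $from = $s + strlen($start);
        return substr($string, $from, $e - $from);
    }
    return substr($string, $s, $e + strlen($stop) - $s);
}

$html = '<title>My Page</title>';
echo return_between($html, '<title>', '</title>', true),  "\n"; // My Page
echo return_between($html, '<title>', '</title>', false), "\n"; // <title>My Page</title>
```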
parse_array()
Similar to return_between(), parse_array() accepts a string and two anchors as parameters. It returns an array containing each occurrence of the delimited text, always including the anchors. Scraping a page with parse_array() and then looping through the array with return_between() to clean and extract the data can be very effective; I use these two methods more often than any other parsing methods.
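Here's a sketch of parse_array() (using a regex internally, which the book's version may not) together with the parse_array()/return_between() loop described above:

```php
<?php
// Sketch of parse_array(): every occurrence of $open...$close,
// anchors included, returned as an array.
function parse_array($string, $open, $close)
{
    $pattern = '/' . preg_quote($open, '/') . '.*?' . preg_quote($close, '/') . '/is';
    preg_match_all($pattern, $string, $matches);
    return $matches[0];
}

// Sketch of return_between() (see above) so this example is self-contained.
function return_between($string, $start, $stop, $exclude)
{
    $s = stripos($string, $start);
    $e = ($s === false) ? false : stripos($string, $stop, $s + strlen($start));
    if ($s === false || $e === false) return '';
    return $exclude
        ? substr($string, $s + strlen($start), $e - $s - strlen($start))
        : substr($string, $s, $e + strlen($stop) - $s);
}

// The pattern described above: grab every list item, then clean each one.
$html = '<li>one</li><li>two</li><li>three</li>';
foreach (parse_array($html, '<li>', '</li>') as $item) {
    echo return_between($item, '<li>', '</li>', true), "\n";
}
```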
get_attributes()
Accepts a tag in HTML format and attempts to extract all of its attributes.
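A sketch of how get_attributes() might pull name/value pairs out of a single tag; a production version (like the book's) would handle more edge cases:

```php
<?php
// Sketch of get_attributes(): extract name="value" pairs from one HTML tag.
function get_attributes($tag)
{
    $attributes = array();
    // Match name="value", name='value', or bare name=value pairs
    preg_match_all('/([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/',
                   $tag, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $attributes[strtolower($m[1])] = trim($m[2], "\"'");
    }
    return $attributes;
}

$attrs = get_attributes('<img src="logo.png" width=100 alt=\'A logo\'>');
print_r($attrs);
```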
remove()
remove() accepts three parameters: a string and two anchors. It's similar to parse_array(), except that it removes all occurrences of the anchors, and the text between them, from the string.
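A sketch of that, again leaning on a regex (the book's version may be built differently):

```php
<?php
// Sketch of remove(): strip every $open...$close span, anchors included.
function remove($string, $open, $close)
{
    $pattern = '/' . preg_quote($open, '/') . '.*?' . preg_quote($close, '/') . '/is';
    return preg_replace($pattern, '', $string);
}

// Classic use: drop script blocks before parsing the rest of the page.
echo remove('keep<script>evil()</script> this', '<script', '</script>'), "\n";
```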
tidy_html()
Accepts a single string. This method requires the Tidy library to be installed; if it isn't, it simply returns the same string it was given rather than raising an error.
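A sketch of that graceful fallback, using PHP's Tidy extension when it's available:

```php
<?php
// Sketch of tidy_html(): clean up markup with the Tidy extension when
// present; otherwise pass the string through untouched, as described.
function tidy_html($input_string)
{
    if (function_exists('tidy_repair_string')) {
        return tidy_repair_string($input_string, array('output-xhtml' => true), 'utf8');
    }
    return $input_string; // Tidy not installed: no error, just a passthrough
}

echo tidy_html('<p>unclosed paragraph'), "\n";
```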
validate_url()
This is the one parsing method I wrote myself; it just uses a regex I googled to validate a URL. Pretty self-explanatory.
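The original regex isn't reproduced here; as a sketch, PHP's built-in URL filter does the same job without one:

```php
<?php
// Sketch of validate_url(). The original uses a found regex; PHP's
// built-in FILTER_VALIDATE_URL achieves the same goal.
function validate_url($url)
{
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(validate_url('https://example.com/path?q=1')); // bool(true)
var_dump(validate_url('not a url'));                    // bool(false)
```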
With this you should be able to start writing some fairly non-trivial web bots. I'm going to follow up with a tutorial on using the bot for basic GET requests and a second on using it for POST requests (logging into a website). After that, if these are received well, I intend to write a complementary set of tutorials covering the same content in Python. Have fun guys! If you keep reading 'em, I'll keep writing 'em!