Previously, I would simply use PHP's fsockopen to quickly grab a web document or two, or to masquerade as a browser, but this time fsockopen simply wasn't going to cut it. I needed something much easier to set up, and something that could survive the robot traps these kinds of sites usually have.
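For context, here is a minimal sketch of that old fsockopen approach: you build a raw HTTP/1.1 GET request by hand, write it to the socket, and read the response back. The host, path, and user-agent string here are placeholders for illustration, not anything from a real site.

```php
<?php
// Build a raw HTTP/1.1 GET request by hand (placeholder host/path).
function build_get_request(string $host, string $path): string
{
    return "GET $path HTTP/1.1\r\n"
         . "Host: $host\r\n"
         . "User-Agent: Mozilla/5.0 (compatible)\r\n" // masquerade as a browser
         . "Connection: close\r\n\r\n";
}

// Send the request over a plain socket and read the whole response.
function fetch_via_socket(string $host, string $path): string
{
    $fp = fsockopen($host, 80, $errno, $errstr, 30);
    if (!$fp) {
        return ''; // connection failed
    }
    fwrite($fp, build_get_request($host, $path));
    $response = '';
    while (!feof($fp)) {
        $response .= fgets($fp, 4096);
    }
    fclose($fp);
    return $response;
}
```

It works for grabbing a document or two, but you are on your own for cookies, redirects, and compression, which is exactly where it stops cutting it.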
Curl and Wget
When amateurs like me want to develop software that will masquerade as a web robot or crawler, the obvious choices are, of course, Wget and Curl. I have had limited experience with Wget, mostly from setting up cron jobs on my web servers, but none at all with Curl. After some quick research on both, I concluded that Curl was the better fit for my needs.

It took me nearly three weeks, but today I completed my "web robot". It successfully crawls all the necessary web sites, grabs any document I want, extracts just the information I need, and puts it all, very neatly, into a MySQL database!
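The extract-and-store step might look something like the sketch below: pull the values out of the fetched HTML with a regular expression, verify them, and insert them via PDO. The table name, columns, and pattern are made up for illustration; the post doesn't say what the real data looks like.

```php
<?php
// Pull date/value pairs out of the fetched HTML (illustrative pattern).
function extract_rows(string $html): array
{
    // e.g. match "<td>2005-01-03</td><td>42.5</td>" style cells
    preg_match_all('#<td>(\d{4}-\d{2}-\d{2})</td><td>([\d.]+)</td>#', $html, $m, PREG_SET_ORDER);
    $rows = [];
    foreach ($m as $match) {
        // verify the value parses as a number before keeping it
        if (is_numeric($match[2])) {
            $rows[] = ['date' => $match[1], 'value' => (float) $match[2]];
        }
    }
    return $rows;
}

// Save the verified rows with a prepared statement (hypothetical table).
function store_rows(PDO $db, array $rows): void
{
    $stmt = $db->prepare('INSERT INTO readings (reading_date, reading_value) VALUES (?, ?)');
    foreach ($rows as $row) {
        $stmt->execute([$row['date'], $row['value']]);
    }
}
```

Using PDO with a prepared statement keeps the insert loop fast and avoids quoting problems with scraped data.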
My custom web crawler, powered by PHP and Curl, can connect to a web site, manage cookies, send referrer data, request compressed web pages, navigate its way around the site to get to the good parts, fetch the document containing the data I want, parse it, extract just the data I need, verify that it is correct, and save it all to the database. And it does all of this at a rate of about 1.5 minutes per month's worth of data from one web site.
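A cURL handle configured along those lines might look like this sketch: cookie handling, a Referer header, and compressed responses are all single options. The URL, referrer, and cookie-jar path are illustrative placeholders.

```php
<?php
// Configure a cURL handle the way the crawler needs (placeholder values).
function make_crawler_handle(string $url, string $referer, string $cookieJar)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow redirects while navigating the site
        CURLOPT_COOKIEJAR      => $cookieJar, // write cookies here when the handle closes
        CURLOPT_COOKIEFILE     => $cookieJar, // and send them back on later requests
        CURLOPT_REFERER        => $referer,   // send referrer data
        CURLOPT_ENCODING       => '',         // empty string: accept any supported compression
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible)', // look like a browser
        CURLOPT_TIMEOUT        => 30,
    ]);
    return $ch;
}

// Usage (a live network call, so not executed here):
// $ch   = make_crawler_handle('http://example.com/data', 'http://example.com/', '/tmp/cookies.txt');
// $html = curl_exec($ch);
// curl_close($ch);
```

Reusing the same handle (and cookie jar) across requests is what lets the crawler stay "logged in" to a session as it navigates from page to page.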
Considering that I have over 20 years of data to fetch, and from more than one web site at that, that's not bad at all, if you ask me! :)
At this rate, GIDApp No. 2 should be ready in 3 months. :)