As I said in my last post, I’ve written many a scraper in PHP using curl or fsockopen, building automated tools and scraping data. Along the way I’ve tried plenty of tools for sniffing HTTP traffic so I could emulate it in PHP as quickly as possible. I started with Wireshark (or Ethereal, as it was called at the time), which was complete overkill: it’s built for network troubleshooting and captures every TCP/UDP packet, which is information overload when all you want is the HTTP data. Then I used the Live HTTP Headers addon for Firefox, which was pretty limited. After that came a Java program called Burp Suite, which was powerful, but I hit a wall trying to automate MySpace MyAds submissions because I couldn’t figure out what HTTP requests the MyAds Flash file was sending over HTTPS. I ran the gamut of every proxy tool out there until I came across Charles Web Proxy.
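To make the workflow concrete, here’s a minimal sketch of the kind of PHP/curl replay I mean: take the headers, cookies, and POST body a sniffer captured and reproduce the request from a script. The URL, cookie value, and form fields below are made-up placeholders, not anything from a real site.

```php
<?php
// Replay a sniffed POST request with curl.
// The URL, cookie, and form fields are hypothetical placeholders --
// substitute whatever your proxy/sniffer actually captured.
$ch = curl_init('http://www.example.com/login.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow any redirect after login
curl_setopt($ch, CURLOPT_HTTPHEADER, array(       // headers copied from the captured request
    'User-Agent: Mozilla/5.0 (Windows NT 5.1) Firefox/3.0',
    'Referer: http://www.example.com/',
));
curl_setopt($ch, CURLOPT_COOKIE, 'session=abc123'); // captured session cookie, if any
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'username' => 'me',
    'password' => 'secret',
)));
$response = curl_exec($ch);
curl_close($ch);
echo $response;
```

The closer your headers match what the browser sent, the less likely the site is to notice the difference, which is exactly why a good sniffer matters.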
It’s simply the best tool out there. It sits as a proxy between your browser and the web, capturing all traffic as it passes through. That would normally break SSL, but Charles ships its own certificate that you manually add to your browser, letting it log HTTPS traffic with no warnings. It can also grab Flash traffic, since it seems to act as a system-wide Windows proxy rather than just a browser one. It presents HTTP data in many different views so you can understand what’s going on faster: a multipart form upload, for example, can be shown as the raw HTTP data sent, just the headers, just the cookies, the text body, or the individual form fields. I won’t list all the features here as they’re on the site. If you’re using any other tool for automation/scraping, you’re wasting time.
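One handy trick once Charles is running: point your PHP script through it too, so you can compare your script’s requests against the browser’s side by side in the same session log. A rough sketch, assuming Charles is listening on its default address of 127.0.0.1 port 8888 (adjust if you’ve changed it):

```php
<?php
// Route a curl request through Charles so it shows up in the session log.
// 127.0.0.1:8888 is Charles's default proxy address, an assumption here.
$ch = curl_init('https://www.example.com/');      // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8888');
// Charles re-signs HTTPS with its own certificate, so either trust that
// cert via CURLOPT_CAINFO or, for quick local debugging only, skip verification:
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$response = curl_exec($ch);
curl_close($ch);
```

Disabling peer verification is fine while you’re debugging against your own proxy, but don’t leave it that way in anything that runs unattended.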