I’ve been wondering for some time now which pages and paths are visited the most by “bad” bots – scrapers, data harvesters and other automated scanners which disregard the exclusions set in robots.txt. To determine this, I’ve set up a little experiment – I placed a robots.txt file on one of my domains, which disallowed access to commonly used paths and PHP pages which might be of interest to bots (login.php, /wp-admin/, etc.), configured the server to return an HTTP 200 response for these paths and pages, and started logging details about requests sent to them.
To avoid as much legitimate or manually generated traffic as possible, I’ve done this on a domain which pointed to a server on which none of the common content management systems was used.
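For readers who would like to try something similar, a setup along these lines can be sketched in a few lines of Python. This is only an illustration and not the configuration used for the experiment: the `DECOY_PATHS` list, the port, the log file name and the helper names are all assumptions, and only GET requests are handled.

```python
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed decoy list for illustration; the full list used in the
# experiment is not given in the text.
DECOY_PATHS = ["/login.php", "/wp-admin/", "/wp-login.php"]


def build_robots_txt(paths):
    """Return a robots.txt body that disallows every decoy path."""
    lines = ["User-agent: *"] + [f"Disallow: {p}" for p in paths]
    return "\n".join(lines) + "\n"


def is_decoy(raw_path, paths=DECOY_PATHS):
    """True if the request path (query string ignored) hits a decoy."""
    path = raw_path.split("?", 1)[0]
    return any(
        path == p or (p.endswith("/") and path.startswith(p)) for p in paths
    )


class HoneypageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/robots.txt":
            body = build_robots_txt(DECOY_PATHS).encode()
        elif is_decoy(self.path):
            # Well-behaved crawlers never get here; log whoever does.
            logging.info(
                "%s %s UA=%r",
                self.client_address[0],
                self.path,
                self.headers.get("User-Agent"),
            )
            body = b"OK\n"
        else:
            self.send_error(404)
            return
        # Decoy paths and robots.txt both answer with HTTP 200.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    logging.basicConfig(
        filename="honeypage.log",  # assumed file name
        level=logging.INFO,
        format="%(asctime)s %(message)s",
    )
    HTTPServer(("0.0.0.0", 8080), HoneypageHandler).serve_forever()
```

Since the decoy paths are listed under `Disallow`, any request that ends up in the log comes from a client that either ignored robots.txt or never read it.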
The captured requests were a mixed bag, as one might expect. Some of them were simple one-shot HTTP GET requests, while others were part of multi-request scans. Some had no parameters set, while others carried generic SQL injection or XSS payloads or tried to “blindly” exploit vulnerabilities specific to common content management systems.
For our purposes, however, this is beside the point as we’re more interested in finding out which pages were looked for the most. I went over the logs and put the “top 10” most commonly requested pages for the past 12 months in the following table, along with the number of times each path or page was hit.
Although finding wp-login.php in first place is hardly surprising, the results are interesting. Given the fairly large early drop in the number of requests, it seems that one might be able to catch a significant portion of interesting “bad” bot behavior with just a single-page (or four- or five-page) honeypot… In other words, if you’ve ever wondered where to place a “honeypage” on your server in order for it to be effective, the top paths mentioned in the table above would probably be a good start.
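If you collect your own logs and want to produce a similar “top 10”, the tally itself is simple. The following is a rough sketch that assumes each log line has the form `<date> <time> <client-ip> <path> ...`; that layout and the `top_paths` helper are assumptions made for the example, not the format of the logs from the experiment.

```python
from collections import Counter


def top_paths(log_lines, n=10):
    """Count requests per path (query strings ignored) and return the
    n most frequently requested paths with their hit counts."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip lines that do not match the assumed layout
        path = parts[3].split("?", 1)[0]
        counts[path] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # "honeypage.log" is an assumed file name.
    with open("honeypage.log", encoding="utf-8", errors="replace") as fh:
        for path, hits in top_paths(fh):
            print(f"{hits:8d}  {path}")
```

Stripping the query string before counting matters here, since requests carrying SQL injection or XSS payloads would otherwise each be counted as a separate page.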
(c) SANS Internet Storm Center. https://isc.sans.edu Creative Commons Attribution-Noncommercial 3.0 United States License.