ZiNgA BuRgA Wrote:The idea is this, a lot of websites have dynamic pages. Take these forums for example, the "Forum" link at the top leads to a certain page, with the same "Forum" link at the top - so if a spider kept going through the link, it would be stuck in an infinite loop.
Kinda. You have a good heart, I can tell... To understand this, you need to "think evil".
This page/link/script/whateveryouwannacallit targets "email harvesters". Most often it's simple code on a dedicated box in a second-world country, pointed at one (or multiple) user-defined site(s) with the sole intention of extracting potentially active email addresses from page source. Upon finding a potential email address, it is logged while the bot continues to scan. Those addresses are sent sPa/\/\. Upon reply, they are sent more sPa/\/\. Upon finding a link, it is followed and the process is repeated (in the case of this page, filling the harvester's DB with garbage emails), unless...
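To "think evil" concretely, here's a minimal sketch of the kind of loop such a harvester runs. Everything in it (the regexes, the limits, the URL) is my own assumption for illustration, and it stops at collecting; no sending involved:

```python
# Hypothetical sketch of a harvester's core loop -- purely illustrative.
import re
import urllib.request
from urllib.parse import urljoin, urlparse

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LINK_RE = re.compile(r'href="([^"]+)"')

def harvest(start_url, max_pages=100):
    base = urlparse(start_url).netloc
    queue, found = [start_url], set()
    while queue and max_pages > 0:
        url = queue.pop(0)
        max_pages -= 1                     # a crude connection/page limit
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue
        found.update(EMAIL_RE.findall(html))   # log every address seen
        for link in LINK_RE.findall(html):
            link = urljoin(url, link)
            if urlparse(link).netloc == base:  # many bots stay on the base domain
                queue.append(link)
    return found
```

Against a loop like this, the trap works as long as its links stay reachable and its queue of "new" pages never runs dry, which leads to the next point.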
ZiNgA BuRgA Wrote:...it uses a priority system...
In other words, if the bot won't follow links (or log emails) outside the base domain and has a connection limit, the trap won't work, because the bot will never visit the five links at the top.
ZiNgA BuRgA Wrote:Thus I guess the idea is that the spider won't continually re-index pages it has indexed before.
The thing about it is, the page is randomly generated and never repeats, so there's nothing for a spider to recognize as already indexed... Try visiting the link: the top five are links and the rest are bogus emails. Now refresh the page, or visit one of the links... now they're all different. Cool concept, no?
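For the curious, here's a minimal sketch of how a page like that could be generated. The handler name, word list, and port are my own assumptions, not the actual script behind the link:

```python
# Sketch of a spider trap: every request gets fresh random emails
# and fresh links that lead right back into the trap.
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["info", "sales", "admin", "webmaster", "support", "billing"]

def rand_token(n=8):
    return "".join(random.choices(string.ascii_lowercase, k=n))

def trap_page():
    # Five links back into the trap (random paths, same handler)...
    links = [f'<a href="/{rand_token()}.html">{rand_token()}</a>' for _ in range(5)]
    # ...followed by a pile of bogus addresses for the harvester to swallow.
    emails = [f"{random.choice(WORDS)}.{rand_token()}@{rand_token()}.com" for _ in range(30)]
    return "<html><body>" + "<br>".join(links + emails) + "</body></html>"

class Trap(BaseHTTPRequestHandler):
    def do_GET(self):                      # self.path is ignored: any URL works
        body = trap_page().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), Trap).serve_forever()
```

Since every request returns a fresh page, a spider can't dedupe by URL or by content; it just keeps finding "new" pages forever.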
ZiNgA BuRgA Wrote:Other argument is that I'm sure people have tried it with Google and Yahoo, and I doubt those multi-billion dollar companies would be brought down by such a simple trick.
Hmmm... never thought about that... I wonder what happens when G00gleb0t visits that page... Can't be good... The site doesn't have a robots.txt either (not that it would matter to a harvester, but)... How would it avoid it?
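Best guess on the "how would it avoid it" question: a well-behaved crawler like G00gleb0t asks robots.txt before fetching anything. A minimal sketch, assuming a hypothetical site that parks its trap under a /trap/ path:

```python
# Sketch: how a polite crawler avoids the trap, assuming the site publishes
# a robots.txt like:
#
#   User-agent: *
#   Disallow: /trap/
#
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://example.com/robots.txt")  # hypothetical site
rp.read()
print(rp.can_fetch("Googlebot", "http://example.com/trap/page.html"))  # False if disallowed
```

The harvester never asks, of course, which is the whole point: robots.txt filters out the polite bots and leaves the trap for the rude ones.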
ZiNgA BuRgA Wrote:Might work on lesser intelligent spiders I guess.
That's the idea, yup.
edit: You could code/embed your own in every page, and that would be ultimate, but think of the bandwidth if one (or 1000) got stuck, basically refreshing over and over... augghh... Consider it a proof of concept, rendered useless in due time... Think DoS (not the Microsoft one), unless everyone had one.
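One way to blunt that bandwidth problem (my own addition, not part of the original trick) is to tarpit the trap: keep each page tiny and dribble it out slowly, so a stuck bot ties up its own connection instead of your pipe:

```python
# Sketch: a tarpit variant of the trap (an assumption, not the original script);
# the page is small and trickled out, so even 1000 stuck bots cost very little.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class SlowTrap(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'<a href="/next.html">next</a> bogus.address@nowhere.invalid'
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        for i in range(0, len(body), 16):
            self.wfile.write(body[i:i + 16])
            self.wfile.flush()
            time.sleep(1)                  # ~16 bytes/sec per connection

if __name__ == "__main__":
    HTTPServer(("", 8081), SlowTrap).serve_forever()
```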
Otherwise you could stash the link at the top of every page. Or even bait like: "To contact me: admin@website.com", for AI harvesters that ignore source and use an, ummm, (in layman's terms) automated "read the page like a person" technique. Or combine it with an SMTP server honeypot... nevermind... I could go on forever... Either way, it helps fight sPa/\/\.
How's that for milk? ;P