How can I force crawlers to run JavaScript on my pages?
I want to implement an anti-crawler mechanism to protect the data on my site. After reading many related topics on SO, I am going to focus on "enforcing running JavaScript".
My plan is (see the sketch right after this list):
- Implement a special function f (e.g. md5sum) in a JavaScript file c.
  - Input: the cookie string of the current user (the cookie changes in each response).
  - Output: a verification string v.
- Send v along with the other parameters to the sensitive backend interface when requesting the valuable data.
- The backend server has a validation function t to check whether v is correct.
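A minimal sketch of that handshake could look like this (f, v, and t are named as in the plan above; the toy FNV-1a hash stands in for md5sum, and the endpoint and parameter names are invented for illustration):

```javascript
// --- shipped to the browser in the JavaScript file c ---
// f: cookie string -> verification string v
function f(cookie) {
  let h = 0x811c9dc5;                       // FNV-1a offset basis
  for (let i = 0; i < cookie.length; i++) {
    h ^= cookie.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;     // FNV prime, kept to 32 bits
  }
  return h.toString(16);
}

// the client attaches v to each request for valuable data, e.g.:
// fetch('/api/data?v=' + f(document.cookie));

// --- backend validation t (Node.js), sharing the same f ---
function t(cookie, v) {
  return f(cookie) === v;                   // recompute and compare
}
```

Since the server issues the cookie, it can recompute f(cookie) on every request and reject mismatches.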
The difficult part is how to obfuscate f. If crawlers can understand f, they can compute v without running c at all, bypassing the JavaScript entirely.
Indeed, there are many JS obfuscators, but I am going to achieve the goal by implementing a generator function g that does not itself appear in c. g(k) generates f, where k is a large integer. f should be complicated enough that crawler writers have to spend many hours understanding it, and given a different k', g(k') = f' should be a new function to some extent, so that, again, crawler writers have to spend hours cracking it.
A possible implementation of g might map the integer to a digital circuit of many connected logic gates (like a maze) and use JavaScript grammar to represent the resulting f; a sketch of this idea follows. Since f must run in JavaScript, crawlers would have to run something like PhantomJS. Furthermore, I can insert sleeps in f to slow crawlers down, while normal users hardly notice a 50-100ms delay.
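As a rough illustration of that idea (the LCG seeding and the particular gate set are my own assumptions, not a hardened design), g could emit a randomly wired chain of bitwise gates:

```javascript
// g(k): map the integer k to a chain of bitwise "logic gates" and
// return a freshly generated f. Different k's wire different circuits.
function g(k) {
  let state = k >>> 0;
  const rand = () => (state = (Math.imul(state, 1664525) + 1013904223) >>> 0); // LCG
  const gates = [
    (a, b) => `(${a} ^ ${b})`,   // XOR
    (a, b) => `(${a} & ${b})`,   // AND
    (a, b) => `(${a} | ${b})`,   // OR
    (a, b) => `(~${a} & ${b})`,  // NOT/AND mix
  ];
  let expr = 'h';
  for (let i = 0; i < 32; i++) { // wire 32 gates into a maze-like chain
    expr = gates[rand() % gates.length](expr, String(rand() % 0xffff));
  }
  // the emitted f folds the cookie into h, then runs h through the circuit
  return new Function('cookie', `
    let h = 0;
    for (let i = 0; i < cookie.length; i++) h = (h * 31 + cookie.charCodeAt(i)) >>> 0;
    return ((${expr}) >>> 0).toString(16);
  `);
}

// g(42) and g(43) produce structurally different functions; the server
// can validate because it knows k and can rebuild the same circuit.
```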
I know there is a whole group of methods to detect crawlers, and they will be applied too, but let's keep this discussion focused on the "enforce running JavaScript" topic. Can you give me any advice? Is there a better solution?
Using a login to prevent the whole world from seeing the data is one option.
If you do not want your logged-in users to fetch all the data you make available to them, you can limit the number of requests per minute per user, adding a delay to the page load once the limit has been reached. Since the user is logged in, you can track the requests server-side even if they manage to change cookies/localStorage/IP/browser and whatnot.
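A rough sketch of that as Express-style middleware (the window size, the limit, the delay, and the req.user.id shape are all assumptions):

```javascript
const hits = new Map(); // userId -> timestamps of recent requests

function rateLimit(req, res, next) {
  const now = Date.now();
  const windowMs = 60 * 1000;  // one-minute window
  const maxPerWindow = 30;     // allowed requests per window
  const recent = (hits.get(req.user.id) || []).filter(t => now - t < windowMs);
  recent.push(now);
  hits.set(req.user.id, recent);
  if (recent.length > maxPerWindow) {
    setTimeout(next, 2000);    // over the limit: stall the page load
  } else {
    next();
  }
}
```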
You can serve texts as images, forcing crawlers to use resource-heavy mechanics such as OCR to translate them back into usable information.
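For example, rendering a sensitive string into a PNG server-side (this sketch uses the node-canvas package; the sizes and font are arbitrary):

```javascript
const { createCanvas } = require('canvas');

// Render text into a PNG buffer; serve it with Content-Type: image/png
// and put an <img> where the text used to be.
function textToPng(text) {
  const canvas = createCanvas(12 * text.length + 20, 40);
  const ctx = canvas.getContext('2d');
  ctx.font = '20px sans-serif';
  ctx.fillText(text, 10, 27);
  return canvas.toBuffer('image/png');
}
```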
You can also add hidden texts to defeat copy/paste: insert spans filled with 3-4 random letters after every 3-4 real letters and give them font-size 0. That way they aren't seen, but they are still copied, and still get picked up by the crawler.
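A small sketch of that decoy trick (the 3-character junk and the every-third-letter rhythm follow the description above; the function name is made up):

```javascript
// Interleave invisible junk spans with the real text: the page renders
// normally, but copy/paste and naive scrapers pick up the garbage.
function poison(text) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    out += text[i];
    if (i % 3 === 2) {
      const junk = Math.random().toString(36).slice(2, 5); // ~3 random chars
      out += '<span style="font-size:0">' + junk + '</span>';
    }
  }
  return out;
}
// e.g. element.innerHTML = poison('valuable data');
```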
You can refuse connections from known crawler HTTP header signatures, although any crawler can mock those; and a Greasemonkey or similar scripting extension can turn a regular browser into a crawler, so header signatures have little incidence anyway.
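That filter can be as simple as this Express-style middleware (the signature list is a tiny illustrative sample, and, again, any crawler can spoof its User-Agent):

```javascript
const knownCrawlers = /curl|wget|python-requests|scrapy|phantomjs/i;

function blockKnownCrawlers(req, res, next) {
  if (knownCrawlers.test(req.headers['user-agent'] || '')) {
    return res.status(403).end(); // refuse the connection
  }
  next();
}
```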
Now, to force the use of JavaScript

The problem is that you cannot truly force JavaScript execution. Whatever the JavaScript sees and has access to on the page, a crawler has access to as well; if all you'd accomplish is some kind of MD5 hash, it can be reimplemented in any language. That makes the scheme unfeasible: the crawler has access to everything the client's JavaScript has access to.

Forcing the use of a JavaScript-enabled crawler can be circumvented, and even when it cannot, the computing power available nowadays makes it easy to launch a PhantomJS instance... And as said above, anyone with slight JavaScript knowledge can automate clicks on your website using a regular browser, making the crawler undetectable.
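To make that concrete, this is roughly all a crawler needs in order to execute your page's JavaScript with PhantomJS and read the result (the URL is a placeholder):

```javascript
var page = require('webpage').create();
page.open('http://example.com/protected', function (status) {
  // by the time the callback fires, the page's own scripts have run
  var html = page.evaluate(function () {
    return document.body.innerHTML;
  });
  console.log(html); // the "protected" data, JavaScript checks and all
  phantom.exit();
});
```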
What should be done
The only bulletproof way to prevent crawlers from leeching your data, and to prevent automation, is to ask for something only a human can do. A CAPTCHA comes to mind.
Think about your real users
The first thing you should keep in mind is that if your website starts annoying normal users, they will not come back. Having to type an 8-character CAPTCHA on every page request, just because somebody might want to pump the data, becomes tedious for anyone. Also, blocking unknown browser agents might lock legit users out of your website because, for some reason X or Y, they are using a weird browser.
The impact on legit users, and the time you'd spend fighting crawlers, might be too high a price; it may be better to accept that crawling will happen. Your best bet is to rewrite your ToS to explicitly forbid crawling of any sort, to log every HTTP access of every user, and to take action when needed.
Disclaimer: I'm scraping over a hundred websites monthly, following external links to a total of some 3000 domains. At the time of posting, none of them is resisting, even though they employ one or more of the techniques above. When a scraping error is detected, it does not take long to fix it...
The one thing to do is to crawl respectfully: do not over-crawl or make too many requests in a small time frame. Doing just that will circumvent most popular anti-crawler measures.