I need a robots.txt Disallow rule that prevents crawlers from following the template tags inside my <script type="text/template"> blocks.
When crawled, the URL errors look like, e.g.:
404 /foo/bar/<%=%20getpublicurl%20%>
A template like this:

<script type="text/template">
  <a href="<%= my_var %>" target="_blank">test</a>
</script>

should be blocked with something like:

Disallow: <%*%>

Any ideas?
I did notice that it only seems to happen on anchors with target="_blank"; I'm not sure why that is.
This is a bit tricky.

Many crawlers, including Google, silently URL-encode unsafe characters in a URL before checking it against robots.txt, which means you have to block the encoded version.
For example, if the URL is:

http://example.com/foo/bar/<% my_var %>

then the URL Google actually checks against robots.txt will be:

http://example.com/foo/bar/%3C%%20my_var%20%%3E

The spaces and angle brackets have been silently URL-encoded, so you need to block this:

User-agent: *
Disallow: */%3C%*%%3E

If you instead try to block this:

# Does NOT work:
User-agent: *
Disallow: */<%*%>

then nothing is blocked, because the crawler is comparing "%3C" and "%3E" against the literal "<" and ">".
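The encoding step can be sketched with Python's standard library: `urllib.parse.quote` approximates what the crawler does to the path before the robots.txt comparison. Google's exact normalization isn't documented, so treat this as an illustration, not a spec:

```python
from urllib.parse import quote

# Raw path as it appears inside the template tag.
raw_path = "/foo/bar/<% my_var %>"

# Percent-encode unsafe characters; "%" is kept literal here so the
# template delimiters survive, mirroring the encoded URL shown above.
encoded_path = quote(raw_path, safe="/%")

print(encoded_path)  # /foo/bar/%3C%%20my_var%20%%3E
```

Note that `quote()` emits uppercase hex digits; percent-encoding is case-insensitive per RFC 3986, but writing the rule with the same case the crawler uses is the safest bet.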
I have verified that the above works with Google, but YMMV with other crawlers. Note also that some crawlers don't support wildcards in robots.txt at all.
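To see why the raw-character rule never matches, here is a minimal sketch that stands in for the crawler's wildcard matching using Python's `fnmatch` (a real robots.txt matcher anchors patterns differently, so this only illustrates the encoding mismatch):

```python
from fnmatch import fnmatchcase

# The path as the crawler sees it, after percent-encoding.
encoded_path = "/foo/bar/%3C%%20my_var%20%%3E"

# Leading * lets the rule match any path prefix, as in robots.txt.
encoded_rule = "*/%3C%*%%3E"   # matches the encoded URL
raw_rule = "*/<%*%>"           # never matches the encoded URL

print(fnmatchcase(encoded_path, encoded_rule))  # True  -> blocked
print(fnmatchcase(encoded_path, raw_rule))      # False -> not blocked
```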