Preventing crawlers from reading Underscore templates


I need a robots.txt Disallow rule that prevents crawlers from following template URLs inside <script type="text/template"> tags.

When crawled, the URL errors look like:

404 /foo/bar/<%=%20 getpublicurl %20% 

e.g.

<script type="text/template">   <a href="<%= my_var %>" target="_blank">test</a> </script> 

I tried blocking it like this:

Disallow: <%*%> 

Any ideas?

I did notice it only seems to happen on anchors with target="_blank"; not sure why that is.

This is a bit tricky.

Many crawlers, including Google, silently URL-encode unsafe characters in a URL before checking it against robots.txt. That means you have to block the encoded version.

For example, if the URL is:

http://example.com/foo/bar/<% my_var %> 

then the URL Google checks against robots.txt will be:

http://example.com/foo/bar/%3c%%20my_var%20%%3e 

The spaces and angle brackets are silently URL-encoded, so you need to block this:

User-agent: *
Disallow: */%3c%*%%3e 
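The encoding step can be sketched in Python. Using urllib.parse.quote with "%" kept in the safe set is an assumption that approximates Google's observed behavior here: "<", ">", and spaces get percent-encoded, but bare "%" characters are left alone (hex-digit case may differ from what Google logs).

```python
from urllib.parse import quote

# Approximate how Google appears to encode the raw template URL before
# matching it against robots.txt. Keeping "/" and "%" in `safe` is an
# assumption based on the encoded URL shown above.
raw_path = "/foo/bar/<% my_var %>"
encoded_path = quote(raw_path, safe="/%")
print(encoded_path)  # /foo/bar/%3C%%20my_var%20%%3E
```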

If you instead try to block this:

# does not work:
User-agent: *
Disallow: */<%*%> 

then nothing will be blocked, because the crawler is comparing "<" and ">" against "%3c" and "%3e".
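That comparison can be sketched with a small helper (my own rough model of Google-style wildcard matching, not any crawler's actual code): "*" matches any run of characters, a trailing "$" anchors the pattern, and otherwise it is a prefix match.

```python
import re

def robots_match(pattern: str, path: str) -> bool:
    """Rough model of Google-style robots.txt path matching."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Translate the robots.txt pattern into a regex: '*' -> '.*',
    # everything else matched literally.
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex + ("$" if anchored else ""), path) is not None

encoded = "/foo/bar/%3c%%20my_var%20%%3e"    # what Google compares
print(robots_match("*/%3c%*%%3e", encoded))  # True  - encoded rule blocks it
print(robots_match("*/<%*%>", encoded))      # False - raw rule never matches
```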

I have verified that the above works with Google, but YMMV with other crawlers. Note that some crawlers don't support wildcards at all.
