python - Scrapy: avoid circular re-crawling


I'm building a scraper for attractions near hotels on TripAdvisor. The scraper parses URLs like: http://www.tripadvisor.com/attractionsnear-g55711-d1218038-oa30-dallas_addison_marriott_quorum_by_the_galleria-dallas_texas.html

I wrote two rules for these URLs; the second one follows the "next attractions page" links on the destination pages:

    Rule(SgmlLinkExtractor(allow=(".*attractionsnear-g.*",),
                           restrict_xpaths=('.//div[@class="nearby_links wrap"]/a',),
                           unique=True),
         callback='parse_item', follow=True),
    Rule(SgmlLinkExtractor(allow=(".*attractionsnear-g.*",),
                           restrict_xpaths=('.//div[@class="pglinks"]/a[contains(@class, "pagenext")]',),
                           unique=True),
         callback='parse_item', follow=True),

But on the destination URLs the first rule matches again, so the scraper re-crawls already-parsed URLs and begins the process from the start.

I tried to avoid the circular crawling with a DownloaderMiddleware:

    from scrapy.exceptions import IgnoreRequest

    class LocationsDownloaderMiddleware(object):
        def process_request(self, request, spider):
            if request.url.encode('ascii', errors='ignore') in deny_domains:
                raise IgnoreRequest()
            else:
                return None

and by managing the deny_domains list in the response parsing:

    def parse_item(self, response):
        deny_domains.append(response.url.encode('ascii', errors='ignore'))

But now the middleware is blocking every URL I want to parse.

How can I manage this? Thanks

SgmlLinkExtractor has been discontinued; you should use scrapy.linkextractors.LinkExtractor instead.

Now your rules should look like this:

    rules = (
        Rule(
            LinkExtractor(
                restrict_xpaths=['xpath_to_category'],
                allow=('regex_for_links')
            ),
            follow=True,
        ),
        Rule(
            LinkExtractor(
                restrict_xpaths=['xpath_to_items'],
                allow=('regex_to_links')
            ),
            callback='some_parse_method',
        ),
    )

When you specify follow=True it means you're not using a callback; instead, you're specifying which links should be "followed", and the rules are then applied again to the responses of those links. You can check the docs here.
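To make the follow/callback distinction concrete, here is a toy simulation in plain Python (this is not Scrapy internals; the `apply_rules` helper and the `attraction_review-g` pattern are hypothetical, invented for illustration):

```python
import re

# Toy model: the first rule whose pattern matches a link decides
# whether that link is scheduled for further crawling (follow=True)
# and/or handed to a callback for item parsing (callback set).
def apply_rules(url, rules):
    for pattern, follow, callback in rules:
        if re.search(pattern, url):
            return {"followed": follow, "parsed_by": callback}
    return {"followed": False, "parsed_by": None}

rules = [
    # category pages: follow their links, extract no items
    (r"attractionsnear-g", True, None),
    # item pages (hypothetical pattern): parse, don't follow further
    (r"attraction_review-g", False, "parse_item"),
]

print(apply_rules("http://www.tripadvisor.com/attractionsnear-g55711-oa30.html", rules))
# -> {'followed': True, 'parsed_by': None}
```

Splitting the rules this way (follow-only for category pages, callback-only for item pages) is what removes the loop: category pages are never re-parsed as items, and item pages never spawn new category requests.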

Also, it won't make duplicate requests, because Scrapy filters those for you by default.
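As a rough sketch of what that built-in filtering does (Scrapy's actual RFPDupeFilter canonicalizes URLs and fingerprints the method, body, and optionally headers; the `SimpleDupeFilter` class below is a simplified stand-in, not the real implementation):

```python
import hashlib

# Simplified version of the idea behind Scrapy's default dupefilter:
# keep a set of request fingerprints and report any request whose
# fingerprint has been seen before, so the scheduler can drop it.
class SimpleDupeFilter:
    def __init__(self):
        self.seen = set()

    def fingerprint(self, method, url):
        # Hash the method + URL into a fixed-size fingerprint.
        return hashlib.sha1(f"{method} {url}".encode()).hexdigest()

    def request_seen(self, method, url):
        fp = self.fingerprint(method, url)
        if fp in self.seen:
            return True  # duplicate: caller should drop the request
        self.seen.add(fp)
        return False  # first time: let the request through

f = SimpleDupeFilter()
print(f.request_seen("GET", "http://example.com/page1"))  # False: first time
print(f.request_seen("GET", "http://example.com/page1"))  # True: duplicate
```

This is why the hand-rolled deny_domains middleware is unnecessary for avoiding re-crawls: an already-visited URL is filtered before it is downloaded again, unless you explicitly pass dont_filter=True on a request.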

