I am building a scraper for attractions near hotels on TripAdvisor. The scraper parses URLs like: http://www.tripadvisor.com/attractionsnear-g55711-d1218038-oa30-dallas_addison_marriott_quorum_by_the_galleria-dallas_texas.html
I wrote two rules for these URLs; the second one follows the "next page" links on the attraction pages:

```python
Rule(SgmlLinkExtractor(allow=(".*attractionsnear-g.*",),
                       restrict_xpaths=('.//div[@class="nearby_links wrap"]/a',),
                       unique=True),
     callback='parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=(".*attractionsnear-g.*",),
                       restrict_xpaths=('.//div[@class="pglinks"]/a[contains(@class, "pagenext")]',),
                       unique=True),
     callback='parse_item', follow=True),
```

But on the destination URLs the first rule matches again, so the scraper re-crawls already-parsed URLs and begins the whole process from the start.
I tried to avoid the circular crawling with a downloader middleware:

```python
class LocationsDownloaderMiddleware(object):
    def process_request(self, request, spider):
        if request.url.encode('ascii', errors='ignore') in deny_domains:
            raise IgnoreRequest()
        return None
```

and I manage the deny_domains list in the response parsing:

```python
def parse_item(self, response):
    deny_domains.append(response.url.encode('ascii', errors='ignore'))
```

But the middleware ends up blocking every URL I want to parse.
How can I manage this? Thanks.
SgmlLinkExtractor has been discontinued; you should use scrapy.linkextractors.LinkExtractor instead.

Now your rules should look like this:
```python
rules = (
    Rule(
        LinkExtractor(
            restrict_xpaths=['xpath_to_category'],
            allow=('regex_for_links'),
        ),
        follow=True,
    ),
    Rule(
        LinkExtractor(
            restrict_xpaths=['xpath_to_items'],
            allow=('regex_to_links'),
        ),
        callback='some_parse_method',
    ),
)
```

When you specify follow=True, it means you are not using a callback; instead, you are specifying that those links should be "followed", and the rules still apply to the pages they lead to. You can check the docs here.
This also won't make duplicate requests, because Scrapy filters those out by default.
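To make the deduplication point concrete, here is a simplified, standalone sketch of the idea behind Scrapy's duplicate filter. This is not Scrapy's actual RFPDupeFilter (which fingerprints the method, body, and optionally headers as well); it is a toy version keyed on the URL alone:

```python
import hashlib


class SimpleDupeFilter:
    """Toy illustration of request deduplication by URL fingerprint.

    Scrapy's real RFPDupeFilter hashes more of the request than just
    the URL; this sketch only shows the seen-set mechanism.
    """

    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        # Fingerprint the URL and check it against previously seen ones.
        fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


dupefilter = SimpleDupeFilter()
first = dupefilter.request_seen("http://www.tripadvisor.com/attractionsnear-g55711")
repeat = dupefilter.request_seen("http://www.tripadvisor.com/attractionsnear-g55711")
print(first, repeat)  # False True: the second request would be dropped
```

Because the scheduler drops requests whose fingerprint was already seen, the first rule re-matching links on destination pages does not cause pages to be downloaded twice.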