My spider crawls no pages and finishes in less than a second, but it throws no errors.
I've checked the code and compared it against a similar project I ran a few weeks ago, but I still can't figure out what the problem might be.
I'm using Scrapy 1.0.1 and scrapy-redis 0.6.
These are the logs:
2015-07-21 11:33:20 [scrapy] INFO: Scrapy 1.0.1 started (bot: demo)
2015-07-21 11:33:20 [scrapy] INFO: Optional features available: ssl, http11
2015-07-21 11:33:20 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'demo.spiders', 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['demo.spiders'], 'RETRY_HTTP_CODES': [500, 502, 503, 504, 400, 408, 404, 302, 403], 'BOT_NAME': 'demo', 'SCHEDULER': 'scrapy_redis.scheduler.Scheduler', 'DEFAULT_ITEM_CLASS': 'demo.items.DemoItem', 'REDIRECT_ENABLED': False}
2015-07-21 11:33:20 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-21 11:33:20 [scrapy] INFO: Enabled downloader middlewares: CustomUserAgentMiddleware, CustomHttpProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-21 11:33:20 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-21 11:33:20 [scrapy] INFO: Enabled item pipelines: RedisPipeline, DemoPipeline
2015-07-21 11:33:20 [scrapy] INFO: Spider opened
2015-07-21 11:33:20 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-21 11:33:20 [scrapy] INFO: Closing spider (finished)
2015-07-21 11:33:20 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 21, 3, 33, 20, 301371),
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 7, 21, 3, 33, 20, 296941)}
2015-07-21 11:33:20 [scrapy] INFO: Spider closed (finished)

This is spider.py:
# -*- coding: utf-8 -*-
import scrapy
from demo.items import DemoItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisMixin


class DemoCrawler(RedisMixin, CrawlSpider):
    name = "demo"
    redis_key = "democrawler:start_urls"

    rules = (
        Rule(LinkExtractor(allow=r'/shop/\d+?/$',
                           restrict_xpaths=u"//ul/li/div[@class='txt']/div[@class='tit']/a"),
             callback='parse_demo'),
        Rule(LinkExtractor(restrict_xpaths=u"//div[@class='shop-wrap']/div[@class='page']/a[@class='next']"),
             follow=True),
    )

    def parse_demo(self, response):
        item = DemoItem()

        temp = response.xpath(u"//div[@id='basic-info']/div[@class='action']/a/@href").re(r"\d.+\d")
        item['id'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='page-header']/div[@class='container']/a[@class='city j-city']/text()").extract()
        item['city'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='breadcrumb']/span/text()").extract()
        item['name'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='main']/div[@id='sales']/text()").extract()
        item['deals'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='main-nav']/div[@class='container']/a[1]/text()").extract()
        item['category'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='main']/div[@id='basic-info']/div[@class='expand-info address']/a/span/text()").extract()
        item['region'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='main']/div[@id='basic-info']/div[@class='expand-info address']/span/text()").extract()
        item['address'] = temp[0] if temp else ''

        yield item

To start the spider, I type two commands in a shell:
redis-cli lpush democrawler:start_urls url
scrapy crawl demo
Here url is the specific URL I'm scraping, for example http://google.com.
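As a side sanity check (a minimal sketch, separate from the project; the sample URLs below are made up), the allow pattern from the spider's first rule can be exercised on its own with Python's re module, the same way LinkExtractor applies it, to confirm it matches the shop detail pages I expect:

```python
import re

# The same pattern passed to LinkExtractor(allow=...) in spider.py.
ALLOW = r'/shop/\d+?/$'

# Hypothetical shop URLs, only to exercise the pattern.
assert re.search(ALLOW, 'http://example.com/shop/12345/')          # matches
assert not re.search(ALLOW, 'http://example.com/shop/12345')       # no trailing slash
assert not re.search(ALLOW, 'http://example.com/shop/abc/')        # no digits
print('allow pattern behaves as expected')
```

If these assertions hold, the pattern itself is not the reason no pages are scraped.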