I want to extract product information from the Amazon website using Scrapy and Selenium.
The following source code connects to the Amazon website and performs a search for the letter "a". It then recovers the links of the search results. But when I loop over the results to visit each one of them, nothing happens (it only connects to the first result). Thank you for helping me correct this code.
Source code "spider":
    from scrapy.contrib.spiders import CrawlSpider
    from selenium import webdriver
    from selenium.webdriver.support.select import Select
    from time import sleep
    import selenium.webdriver.support.ui as ui
    from scrapy.xlib.pydispatch import dispatcher
    #from runner.items import RunnerItem
    from extraction.items import ProduitItem

    class RunnerSpider(CrawlSpider):
        name = 'products'
        allowed_domains = ['amazon.com']
        start_urls = ['http://www.amazon.com']

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            items = []
            self.driver.get(response.url)
            recherche = self.driver.find_element_by_xpath('//*[@id="twotabsearchtextbox"]')
            recherche.send_keys("a")
            recherche.submit()
            #time.sleep(2.5)
            # search results links
            resultas = self.driver.find_elements_by_xpath('//ul[@id="s-results-list-atf"]/li/div/div/div/div[2]/div[1]/a')
            for result in resultas:
                item = ProduitItem()
                lien = result
                lien.click()
                # example of data extracted
                item['nom'] = self.driver.find_element_by_xpath('//h1[@id="aiv-content-title"]').text()
                item['image'] = self.driver.find_element_by_xpath('//*[@id="dv-dp-left-content"]/div[1]/div/div/img/@src').text()
                items.append(item)
            self.driver.close()
            yield items

Source code "item":
    # -*- coding: utf-8 -*-
    import scrapy

    class ProduitItem(scrapy.Item):
        nom = scrapy.Field()
        image = scrapy.Field()

Source code "pipelines":
    from scrapy.exceptions import DropItem

    class DuplicatesPipeline(object):
        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.ids_seen.add(item['id'])
                return item
If you look at the source code of the results page in your browser using the developer tools (for example in Chrome), you can see that the code

    resultas = self.driver.find_elements_by_xpath('//ul[@id="s-results-list-atf"]/li/div/div/div/div[2]/div[1]/a')

returns only one element. This is because all the results are in the same ul block, and the path only matches inside the first li element.
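You should get the ul[@id="s-results-list-atf"] element and iterate over every list item with element.xpath('.//li') to get the URL of the detail page. Note the leading dot: without it, '//li' searches the whole document instead of only the ul element. Alternatively, you can skip strolling through the divs and find the URL inside the li blocks by matching on the class.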
You can also get this data without Selenium, if all you want is the search results.
Update
The code above is plain old Scrapy: you apply the xpath call to the response. Selenium works a bit differently, because you get Selenium elements in return, and you can then call find_elements_by_xpath on each element in the list.
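As a rough sketch of the Selenium variant, the helper below calls find_elements_by_xpath first on the driver and then on each returned element. The function name collect_result_links and its default XPath are assumptions made for illustration. It uses the find_elements_by_xpath method that appears in the question; Selenium 4.3 and later removed that method in favour of find_elements(By.XPATH, ...), so adjust the calls accordingly on a newer version.

```python
def collect_result_links(driver, item_xpath='//ul[@id="s-results-list-atf"]/li'):
    """Collect the href of the first link inside every result list item.

    `driver` is assumed to be a Selenium WebDriver that already shows
    the search results page.
    """
    links = []
    for li in driver.find_elements_by_xpath(item_xpath):
        # Called on an element, so './/a' only searches inside this li.
        anchors = li.find_elements_by_xpath('.//a')
        if not anchors:
            continue
        href = anchors[0].get_attribute('href')
        if href:
            links.append(href)
    return links
```

Collecting the URLs as plain strings before navigating is a deliberate choice here: once the browser leaves the results page, the elements found on it can no longer be used, so a loop that clicks the first link cannot continue with the remaining ones. With the list of URLs, each detail page can be opened afterwards with driver.get(url).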