How to extract product information from the Amazon website with Scrapy and Selenium?


I want to extract product information from the Amazon website using Scrapy and Selenium.
The following source code connects to the Amazon website and performs a search for the letter "a". It recovers the links of the search result set. But when I make a loop to go through each one of the search results, nothing happens (it only connects to the first result). Thank you for helping me correct the code.

source code "spider"

    from scrapy.contrib.spiders import CrawlSpider
    from selenium import webdriver
    from selenium.webdriver.support.select import Select
    from time import sleep
    import selenium.webdriver.support.ui as ui
    from scrapy.xlib.pydispatch import dispatcher
    #from runner.items import RunnerItem
    from extraction.items import ProduitItem

    class RunnerSpider(CrawlSpider):
        name = 'products'
        allowed_domains = ['amazon.com']
        start_urls = ['http://www.amazon.com']

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            items = []
            self.driver.get(response.url)
            recherche = self.driver.find_element_by_xpath('//*[@id="twotabsearchtextbox"]')
            recherche.send_keys("a")
            recherche.submit()
            #time.sleep(2.5)

            # search results links
            resultas = self.driver.find_elements_by_xpath('//ul[@id="s-results-list-atf"]/li/div/div/div/div[2]/div[1]/a')

            for result in resultas:
                item = ProduitItem()
                lien = result
                lien.click()
                # example of data extracted
                item['nom'] = self.driver.find_element_by_xpath('//h1[@id="aiv-content-title"]').text
                item['image'] = self.driver.find_element_by_xpath('//*[@id="dv-dp-left-content"]/div[1]/div/div/img').get_attribute('src')
                items.append(item)

            self.driver.close()
            yield items

source code "item"

    # -*- coding: utf-8 -*-
    import scrapy

    class ProduitItem(scrapy.Item):
        nom = scrapy.Field()
        image = scrapy.Field()

source code "piplines"

    from scrapy.exceptions import DropItem

    class DuplicatesPipeline(object):
        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.ids_seen.add(item['id'])
            return item

If you look at the source code of the results page in your browser using the developer tools (for example in Chrome), you can see that

resultas = self.driver.find_elements_by_xpath('//ul[@id="s-results-list-atf"]/li/div/div/div/div[2]/div[1]/a') 

returns only 1 element. That is because the results are all in the same ul block and this expression matches only the first li element.

You should get the ul[@id="s-results-list-atf"] element, iterate over every list item with element.xpath('//li'), and take the URL of the detail page from each one. Alternatively, you can skip strolling through the divs and find the URL inside the li blocks by matching on the class.
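A minimal Selenium sketch of that approach (the XPaths and class layout are assumptions, not verified against Amazon's current markup): collect every detail-page URL up front, then visit them one by one. Clicking inside the loop leaves you stuck on the first detail page, where the remaining result elements are no longer usable.

```python
# Sketch: collect every detail-page URL first, then visit them one by one.
# The XPaths below are assumptions about Amazon's markup at the time.

RESULTS_UL_XPATH = '//ul[@id="s-results-list-atf"]'
# Relative to the ul element -- note the leading "." so Selenium searches
# inside the element instead of the whole page.
RESULT_LINK_XPATH = './/li//a'

def collect_hrefs(link_elements):
    """Pure helper: the href attribute of every link element."""
    return [e.get_attribute('href') for e in link_elements]

if __name__ == '__main__':
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('http://www.amazon.com')
    box = driver.find_element_by_xpath('//*[@id="twotabsearchtextbox"]')
    box.send_keys('a')
    box.submit()

    ul = driver.find_element_by_xpath(RESULTS_UL_XPATH)
    links = collect_hrefs(ul.find_elements_by_xpath(RESULT_LINK_XPATH))

    for url in links:
        driver.get(url)  # visit each detail page in turn
        # ... extract nom, image, etc. from the detail page here ...
    driver.quit()
```

Because the hrefs are plain strings, navigating away from the results page no longer invalidates anything.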

And you can get at this data without Selenium -- if all you want is the search.

Update

The suggestion above is plain old Scrapy, where you apply the xpath call on the response. Selenium works a bit differently, because you get Selenium elements in return -- and on these you can apply find_elements_by_xpath again, on each element in the list.

