Update: scrapy-selenium now supports Remote WebDriver upstream; see #55.
scrapy-selenium is a Scrapy middleware that lets Scrapy fetch pages through a selenium driver. By default, however, it only supports a local WebDriver.
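For reference, the stock middleware is configured entirely around a local browser. A minimal settings.py sketch, using the setting names from the scrapy-selenium README (the executable path here is a hypothetical example):

```python
# settings.py: stock scrapy-selenium setup, which drives a *local* browser only.
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/usr/local/bin/chromedriver'  # hypothetical path
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # passed to the local driver on startup
```

There is no setting here for a remote endpoint, which is the gap the subclass below fills.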
scrapy-selenium's session management around selenium is fairly complete and stable, so it is enough to derive a subclass of SeleniumMiddleware and override the constructor and from_crawler:
from scrapy import signals
from scrapy_selenium import SeleniumMiddleware
from selenium import webdriver


class RemoteSeleniumMiddleware(SeleniumMiddleware):
    """Drive a remote Selenium node instead of a local browser."""

    def __init__(self, command_executor, desired_capabilities):
        # Deliberately skip super().__init__(), which would start a local driver.
        self.driver = webdriver.Remote(
            command_executor=command_executor,
            desired_capabilities=desired_capabilities,
        )

    @classmethod
    def from_crawler(cls, crawler):
        command_executor = crawler.settings.get('SELENIUM_COMMAND_EXECUTOR')
        desired_capabilities = crawler.settings.get('SELENIUM_DESIRED_CAPABILITIES')
        middleware = cls(
            command_executor=command_executor,
            desired_capabilities=desired_capabilities,
        )
        # Reuse the parent's teardown so the remote session is quit on spider close.
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware
Then add the matching connection parameters to settings.py:
from selenium.webdriver import DesiredCapabilities

SELENIUM_COMMAND_EXECUTOR = "http://docker-host:4444/wd/hub"
SELENIUM_DESIRED_CAPABILITIES = DesiredCapabilities.CHROME
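With the subclass in place, wiring it up is the usual Scrapy routine: register it in DOWNLOADER_MIDDLEWARES in place of the stock SeleniumMiddleware, and keep yielding SeleniumRequest from spiders unchanged. The module path and priority below are assumptions; adjust them to your project layout:

```python
# settings.py: swap in the remote variant.
# 'myproject.middlewares' is a hypothetical module path; 800 is an arbitrary priority.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RemoteSeleniumMiddleware': 800,
}
```

Spider code does not change at all; only the driver endpoint moves from the local machine to the remote grid.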