kuma35/README-JP.md

## README-JP.md

      
    Scrapy with selenium

   
Scrapy middleware that uses Selenium to handle JavaScript pages.
Installation

$ pip install scrapy-selenium

You need python>=3.6.
You also need one of the Selenium-compatible browsers.
Configuration


Add the browser to use, the path to the driver executable, and the arguments to pass to the executable to the Scrapy settings:
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS=['-headless']  # use '--headless' if you are using Chrome instead of Firefox
(Translator's note: with firefox (69.0.2) (geckodriver 0.25.0 (bdb64cf16b68 2019-09-10)), SELENIUM_DRIVER_ARGUMENTS had to be set despite the comment above.)

Optionally, set the path to the browser executable:
SELENIUM_BROWSER_EXECUTABLE_PATH = which('firefox')

Add SeleniumMiddleware to the downloader middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
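The settings above can be gathered into a single fragment of the project's settings module. The following is a minimal sketch that uses only the setting names shown above. The `headless_arguments` helper, the `DRIVER_EXECUTABLES` mapping, and the `BROWSER` variable are hypothetical names introduced for illustration and are not part of scrapy-selenium. The helper picks the headless flag according to the comment above (Firefox takes `-headless`, Chrome takes `--headless`).

```python
from shutil import which


def headless_arguments(driver_name):
    """Hypothetical helper: return the headless flag for the given browser."""
    flags = {'firefox': '-headless', 'chrome': '--headless'}
    return [flags[driver_name]]


# Hypothetical mapping from browser name to the usual driver executable name.
DRIVER_EXECUTABLES = {'firefox': 'geckodriver', 'chrome': 'chromedriver'}

BROWSER = 'firefox'  # assumption: set to 'chrome' to use Chrome instead

SELENIUM_DRIVER_NAME = BROWSER
# which() returns None when the driver executable is not on PATH.
SELENIUM_DRIVER_EXECUTABLE_PATH = which(DRIVER_EXECUTABLES[BROWSER])
SELENIUM_DRIVER_ARGUMENTS = headless_arguments(BROWSER)

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
```

Keeping the browser name in one variable means the driver name, driver path, and headless flag cannot drift apart when switching browsers.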


Usage

Use scrapy_selenium.SeleniumRequest instead of Scrapy's built-in Request, as follows:
from scrapy_selenium import SeleniumRequest

yield SeleniumRequest(url=url, callback=self.parse_result)
The request is handled by selenium, and the request object gets an additional meta key named driver, which contains the selenium driver that processed the request.
def parse_result(self, response):
    print(response.request.meta['driver'].title)
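For details on the available driver methods and attributes, see the selenium python documentation (http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.remote.webdriver).
The selector response attribute works as usual (but contains the html processed by the selenium driver).
def parse_result(self, response):
    # text() selects the text node of <title>; @text would look for an attribute named "text"
    print(response.selector.xpath('//title/text()'))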
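The two ways of reading the page described above (through the driver in the request meta, and through the response selector) can be combined in one callback helper. This is a minimal sketch: `titles_from_response` is a hypothetical name, and it assumes a Scrapy version whose selector lists provide `.get()` (older versions use `.extract_first()`).

```python
def titles_from_response(response):
    """Return the page title as seen by the driver and by the selector."""
    # 'driver' is the extra meta key that holds the selenium driver.
    driver = response.request.meta['driver']
    return {
        'driver_title': driver.title,
        # text() selects the text node inside <title>.
        'selector_title': response.selector.xpath('//title/text()').get(),
    }
```

A spider callback such as `parse_result` could yield the returned dict as an item. Both values should normally agree, since the selector is built from the html that the driver rendered.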
Additional arguments

scrapy_selenium.SeleniumRequest accepts 4 additional arguments:
wait_time / wait_until

When these are used, selenium performs an explicit wait (http://selenium-python.readthedocs.io/waits.html#explicit-waits) before returning the response to the spider.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    wait_time=10,
    wait_until=EC.element_to_be_clickable((By.ID, 'someid'))
)
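Selenium's `WebDriverWait.until` accepts any callable that takes the driver and returns a truthy value once the wait should end, so the conditions in `expected_conditions` are not the only option. Assuming scrapy-selenium hands `wait_until` to that method unchanged, a custom condition might look like the sketch below. `document_ready` is a hypothetical name introduced here.

```python
def document_ready(driver):
    """Wait condition: true once the browser reports the page as fully loaded."""
    # execute_script runs JavaScript in the page and returns its result.
    return driver.execute_script('return document.readyState') == 'complete'
```

Under that assumption it would be passed as `wait_until=document_ready` together with a `wait_time`, in place of the `EC.element_to_be_clickable(...)` condition above.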
screenshot

When this is used, selenium takes a screenshot of the page, and the binary data of the captured PNG image is added to the response meta.
yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    screenshot=True
)

def parse_result(self, response):
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])
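Because the screenshot arrives as raw PNG bytes, a callback can check the data before writing it to disk. The following is a minimal sketch using only the standard library. `save_screenshot` is a hypothetical helper, and the check relies on the fixed 8-byte signature that begins every PNG file.

```python
from pathlib import Path

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # first 8 bytes of every PNG file


def save_screenshot(png_bytes, path):
    """Write screenshot bytes to path, refusing data that is not a PNG."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError('screenshot data does not start with the PNG signature')
    target = Path(path)
    target.write_bytes(png_bytes)
    return target
```

Inside a callback this could be called as `save_screenshot(response.meta['screenshot'], 'image.png')`.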
script

When this is used, selenium executes custom JavaScript code.
yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    script='window.scrollTo(0, document.body.scrollHeight);',
)
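Since `script` is a plain string of JavaScript, any value coming from Python has to be embedded in it safely. One possible approach, sketched below, is to quote the value with `json.dumps`, which produces a string literal that JavaScript also accepts. `scroll_into_view_script` is a hypothetical helper, not part of scrapy-selenium.

```python
import json


def scroll_into_view_script(css_selector):
    """Build JavaScript that scrolls the first element matching css_selector into view."""
    # json.dumps adds the surrounding quotes and escapes any quotes inside the selector.
    return 'document.querySelector({}).scrollIntoView();'.format(json.dumps(css_selector))
```

The result can be passed as `script=scroll_into_view_script('#footer')`, where `#footer` is only an example selector.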