There are three ways to crawl Ajax pages:

Grab the JSON payload directly: simple and fast; the first choice whenever you can find the underlying URL.

Use the Splash plugin: fast crawling, but it requires Docker, which makes deployment more involved.

Use the Selenium plugin: slow crawling; requires PhantomJS.
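
The first approach boils down to finding the Ajax request in the browser's Network panel and parsing its JSON response yourself. A minimal sketch of that parsing step (the payload and field names below are made up for illustration; a real payload comes from the captured URL):

```python
import json

# Stand-in for a captured Ajax response body; the "items"/"title"
# structure is a hypothetical example, not a real API.
payload = '{"items": [{"title": "First post"}, {"title": "Second post"}]}'

data = json.loads(payload)
titles = [item["title"] for item in data["items"]]
print(titles)  # → ['First post', 'Second post']
```

Once the URL pattern is known, a plain `scrapy.Request` to it with a JSON-parsing callback is usually all a spider needs, with no browser rendering at all.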

Splash plugin

Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API.

First, install scrapy-splash:

```shell
# Python 2
pip install scrapy-splash
# Python 3
pip3 install scrapy-splash
```

Install Docker: see the download page on the Docker website.

Pull the Splash Docker image:

```shell
docker pull scrapinghub/splash
```

Run the image:

```shell
docker run -p 8050:8050 scrapinghub/splash
```
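
With the container listening on port 8050, Splash's HTTP API is reachable directly; its `render.html` endpoint returns a page's HTML after JavaScript has run. A sketch of how such a request URL is built (the target URL and `wait` value are placeholders):

```python
from urllib.parse import urlencode

SPLASH_URL = "http://localhost:8050"  # where the Docker container listens

# render.html takes the target page URL plus options such as how long
# to wait for JavaScript to finish.
params = {"url": "http://example.com", "wait": 0.5}
render_url = f"{SPLASH_URL}/render.html?{urlencode(params)}"

print(render_url)
# → http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```

Fetching `render_url` (e.g. with `requests.get`, while the container is running) is a quick way to confirm Splash works before wiring it into Scrapy.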

Configure Scrapy:

```python
# settings.py

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

Spider file:

```python
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'my_spider'  # every Scrapy spider needs a name
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            # SplashRequest routes the page through Splash;
            # 'wait' gives JavaScript time to run before rendering.
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        ...
```

Links to the other two parts:

Scraping Ajax Pages with Scrapy (Part 1): Grabbing the JSON payload
Scraping Ajax Pages with Scrapy (Part 3): the Selenium plugin