There are three ways to scrape Ajax pages:

Grab the JSON packet: simple and fast; the first choice whenever you can find the request URL

Use the Splash plugin: scrapes quickly, but requires Docker and is a hassle to deploy

Use the Selenium plugin: scrapes slowly, and requires PhantomJS

Grabbing the JSON packet

This post scrapes a document-release site of the Ministry of Industry and Information Technology (MIIT), crawling both the list pages and the documents inside them:

http://xxgk.miit.gov.cn/gdnps/wjfbindex.jsp

Use the Network tab in Chrome DevTools to find the request URL of the JSON packet:

# Page 1
http://xxgk.miit.gov.cn/gdnps/searchIndex.jsp?params=%257B%2522goPage%2522%253A1%252C%2522orderBy%2522%253A%255B%257B%2522orderBy%2522%253A%2522publishTime%2522%252C%2522reverse%2522%253Atrue%257D%252C%257B%2522orderBy%2522%253A%2522orderTime%2522%252C%2522reverse%2522%253Atrue%257D%255D%252C%2522pageSize%2522%253A10%252C%2522queryParam%2522%253A%255B%257B%257D%252C%257B%257D%252C%257B%2522shortName%2522%253A%2522fbjg%2522%252C%2522value%2522%253A%2522%252F1%252F29%252F1146295%252F1652858%252F1652930%2522%257D%255D%257D&callback=jQuery111108763108125362828_1521373431680&_=1521373431681
# Page 2
http://xxgk.miit.gov.cn/gdnps/searchIndex.jsp?params=%257B%2522goPage%2522%253A2%252C%2522orderBy%2522%253A%255B%257B%2522orderBy%2522%253A%2522publishTime%2522%252C%2522reverse%2522%253Atrue%257D%252C%257B%2522orderBy%2522%253A%2522orderTime%2522%252C%2522reverse%2522%253Atrue%257D%255D%252C%2522pageSize%2522%253A10%252C%2522queryParam%2522%253A%255B%257B%257D%252C%257B%257D%252C%257B%2522shortName%2522%253A%2522fbjg%2522%252C%2522value%2522%253A%2522%252F1%252F29%252F1146295%252F1652858%252F1652930%2522%257D%255D%257D&callback=jQuery111106871472981573403_1521373463476&_=1521373463478

Comparing the page-1 and page-2 request URLs, the only difference is this fragment:

%253A1%252C
%253A2%252C
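The params value is a JSON object that has been URL-encoded twice (`%25` is itself an encoded `%` sign), so unquoting it twice reveals that this fragment is the goPage field. A quick check (the shortened params string below is trimmed by me for readability):

```python
from urllib.parse import unquote

# The fragment that differs between page 1 and page 2
fragment = "%253A1%252C"
print(unquote(unquote(fragment)))  # -> :1,

# The whole params value decodes to plain JSON the same way
# (shortened here to just two fields for illustration)
params = "%257B%2522goPage%2522%253A1%252C%2522pageSize%2522%253A10%257D"
print(unquote(unquote(params)))  # -> {"goPage":1,"pageSize":10}
```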
So writing start_urls like this requests the contents of all the tables on pages 1 through 9:

start_urls = ['http://xxgk.miit.gov.cn/gdnps/searchIndex.jsp?params=%257B%2522goPage%2522%253A' + str(d) + '%252...omitted...' for d in range(1, 10)]
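Instead of splicing the pre-encoded string, you can also build the params JSON yourself and double-encode it. A sketch assuming the parameter layout visible in the page-1 request above (the helper name build_url is mine, not part of the site's API):

```python
import json
from urllib.parse import quote


def build_url(page):
    # Parameter layout taken from double-decoding the page-1 request above
    params = {
        "goPage": page,
        "orderBy": [{"orderBy": "publishTime", "reverse": True},
                    {"orderBy": "orderTime", "reverse": True}],
        "pageSize": 10,
        "queryParam": [{}, {}, {"shortName": "fbjg",
                                "value": "/1/29/1146295/1652858/1652930"}],
    }
    # The site URL-encodes the JSON twice, hence quote() applied twice;
    # safe="" makes sure "/" and ":" are encoded as well
    payload = json.dumps(params, separators=(",", ":"))
    encoded = quote(quote(payload, safe=""), safe="")
    return "http://xxgk.miit.gov.cn/gdnps/searchIndex.jsp?params=" + encoded


start_urls = [build_url(d) for d in range(1, 10)]
```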

Because the response body is not a bare JSON document (it is wrapped in a JSONP callback), calling json.loads on the page content raises an error.

So we first need a regular expression to extract the JSON and discard the wrapper:

def parse(self, response):
    # Decode the body bytes first, then strip the JSONP callback wrapper
    # (the pattern matches the tail of the callback name in the request URL)
    jsonbody = re.findall(r"96955\((.*)\);", response.body.decode("utf-8"))[0]
    body = json.loads(jsonbody)
    for r in body['resultMap']:
        item['title'] = r['tile']
        ...
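To see the extraction in isolation, here is a minimal, runnable sketch against a made-up JSONP body (the callback name and the resultMap/tile field names mirror the snippet above; the sample data is invented):

```python
import json
import re

# Invented sample body imitating the site's JSONP response
sample = ('jQuery12345_96955({"resultMap": '
          '[{"tile": "Some document", "publishTime": "2018-03-18"}]});')

# Keep only the JSON between the opening "(" and the trailing ");"
jsonbody = re.findall(r"\((.*)\);", sample)[0]
body = json.loads(jsonbody)
print(body["resultMap"][0]["tile"])  # -> Some document
```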

Also recommended: JSON-Handle, a Chrome extension that displays JSON data in a very readable way.

JSON-Handle download link:


Links to the other two posts in this series:

Scraping Ajax Pages with Scrapy (2): the Splash Plugin
Scraping Ajax Pages with Scrapy (3): the Selenium Plugin