You can do this with a custom retry middleware. You only need to override the process_response method of the current RetryMiddleware:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        # Respect the per-request opt-out, as the stock middleware does.
        if request.meta.get('dont_retry', False):
            return response
        # Standard behaviour: retry on the configured HTTP status codes.
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        # this is your check: retry a 200 response that matches the spider's XPath
        retry_xpath = getattr(spider, 'retry_xpath', None)
        if response.status == 200 and retry_xpath and response.xpath(retry_xpath):
            reason = 'response got xpath "{}"'.format(retry_xpath)
            return self._retry(request, reason, spider) or response
        return response
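To make the order of the checks easier to see, here is a simplified, Scrapy-free model of the decision that process_response makes. The function name `retry_reason`, its parameters, and the set of status codes are hypothetical and only for illustration; they are not part of Scrapy.

```python
# Illustrative set of retryable status codes (an assumption, not Scrapy's defaults).
EXAMPLE_RETRY_CODES = frozenset({500, 502, 503, 504, 408})


def retry_reason(status, xpath_matched, dont_retry=False,
                 retry_codes=EXAMPLE_RETRY_CODES):
    """Return a reason string if the response should be retried, else None."""
    # 1. A request flagged with dont_retry is never retried.
    if dont_retry:
        return None
    # 2. Retryable HTTP status codes come first.
    if status in retry_codes:
        return "http status {}".format(status)
    # 3. The extra check: a 200 response whose body matched the XPath.
    if status == 200 and xpath_matched:
        return "response matched retry_xpath"
    return None
```

Note that the XPath check only runs for responses with status 200, so a matching XPath on any other status has no effect.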
Then enable it in place of the default RetryMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}
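Because the custom class inherits from RetryMiddleware, it should still pick up the standard retry settings, so the XPath-based retries are also capped by RETRY_TIMES. A minimal sketch of the related settings follows; the values are illustrative assumptions, not recommendations.

```python
# settings.py (illustrative values)
RETRY_ENABLED = True   # the retry middleware is disabled when this is False
RETRY_TIMES = 3        # maximum number of retries per request, on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]  # statuses retried before the XPath check
```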
Now you have a middleware for which you can configure the XPath that triggers a retry, using the retry_xpath attribute inside your Spider:
from scrapy import Spider


class MySpider(Spider):
    name = "myspidername"
    retry_xpath = '//h2[@class="tadasdop-cat"]'
    ...
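This will not necessarily retry when a field of your Item is empty, but you can point the retry_xpath attribute at the same path as that field to make it work.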