
python – Scrapy run from a script doesn't work



Overview

I am trying to run a Scrapy spider that works perfectly with scrapy crawl single, but I can't run it from within a Python script.

I know the documentation explains how: https://scrapy.readthedocs.org/en/0.18/topics/practices.html. I have also read this already-answered question (How to run Scrapy from within a Python script), but I haven't been able to make it work.

The main problem is that the SingleBlogSpider.parse method is never executed, while start_requests is.

Here are the code and the output of running the script. I also tried moving the crawler setup into a separate file, but the same thing happens.

from urlparse import urlparse
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class SingleBlogSpider(BaseSpider):
    name = 'single'

    def __init__(self, **kwargs):
        super(SingleBlogSpider, self).__init__(**kwargs)

        # Accept either a full URL or a bare domain, defaulting to seaofshoes.com
        url = kwargs.get('url') or kwargs.get('domain') or 'seaofshoes.com'
        if not url.startswith('http://') and not url.startswith('https://'):
            url = 'http://%s/' % url

        self.url = url
        # (note: lstrip('www.') strips any leading 'w'/'.' characters, not the literal prefix)
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.link_extractor = SgmlLinkExtractor()
        self.cookies_seen = set()

        print 0, self.url

    def start_requests(self):
        print '1', self.url
        return [Request(self.url, callback=self.parse)]

    def parse(self, response):
        print '2'
        # Actual scraper code, which is never executed

if __name__ == '__main__':
    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log, signals

    spider = SingleBlogSpider(domain='scrapinghub.com')

    crawler = Crawler(Settings())
    # Stop the Twisted reactor once the spider finishes
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

    log.start()
    reactor.run()

Output

 0 http://scrapinghub.com/
 1 http://scrapinghub.com/
 2013-09-13 14:21:46-0500 [single] INFO: Closing spider (finished)
 2013-09-13 14:21:46-0500 [single] INFO: Dumping Scrapy stats:
     {'downloader/request_bytes': 221,
      'downloader/request_count': 1,
      'downloader/request_method_count/GET': 1,
      'downloader/response_bytes': 9403,
      'downloader/response_count': 1,
      'downloader/response_status_count/200': 1,
      'finish_reason': 'finished',
      'finish_time': datetime.datetime(2013, 9, 13, 19, 21, 46, 563184),
      'response_received_count': 1,
      'scheduler/dequeued': 1,
      'scheduler/dequeued/memory': 1,
      'scheduler/enqueued': 1,
      'scheduler/enqueued/memory': 1,
      'start_time': datetime.datetime(2013, 9, 13, 19, 21, 46, 328961)}
 2013-09-13 14:21:46-0500 [single] INFO: Spider closed (finished)

The program never reaches SingleBlogSpider.parse and never prints '2', so it doesn't scrape anything. But you can see in the output that it does make the request, so I'm not sure what's wrong.

Scrapy version == 0.18.2

I really can't spot the error, and I would greatly appreciate any help.

Thanks!

Answer

Just for testing, put a = b inside parse():

def parse(self, response):
    a = b

And you will see exceptions.NameError: global name 'b' is not defined. Since that exception is raised from inside parse(), this shows the callback is in fact being executed.
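A less destructive way to confirm the same thing, as a minimal sketch, is to send the message through Scrapy's own logging instead of a bare print, so it appears among the crawl's other log lines (self.log is the BaseSpider logging helper in Scrapy 0.18; the message text here is just illustrative):

from scrapy import log

def parse(self, response):
    # self.log routes the message through Scrapy's logging system,
    # so it shows up alongside the [single] INFO lines in the output.
    self.log('parse() reached for %s' % response.url, level=log.INFO)

If the log line appears while the bare print does not, the spider itself is working and only the debugging output was being missed.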
