Overview
import urllib2
import re

# 'line' is the next URL read from the input file; 'nf' is the open output file
htmlfile = urllib2.urlopen(line)
htmltext = htmlfile.read()
regexName = '"></a>(.+?)</dd><dt>'
patternName = re.compile(regexName)
name = re.findall(patternName, htmltext)
if name:
    text = name[0]
else:
    text = 'unkNown'
nf.write(text)
This works, but it is very, very slow: scraping all 160,000 pages takes more than four days. Any suggestions for speeding it up?
When you compile a regular expression, make sure you actually use the compiled object, and avoid recompiling the pattern in every iteration of the processing loop:
pattern = re.compile('"></a>(.+?)</dd><dt>')
# ...
links = pattern.findall(html)
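To make the difference concrete, here is a minimal sketch of the two approaches; the sample HTML and the pages list are placeholder assumptions, not from the original post:

import re

# Placeholder input standing in for the downloaded pages
pages = ['<a href="x"></a>Some Name</dd><dt>'] * 100000

# Slow: a string pattern passed to re.findall is looked up in the
# module cache and, on a cache miss, recompiled on every call
for html in pages:
    names = re.findall('"></a>(.+?)</dd><dt>', html)

# Faster: compile once outside the loop and reuse the object
pattern = re.compile('"></a>(.+?)</dd><dt>')
for html in pages:
    names = pattern.findall(html)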
If you want to avoid pulling in another framework, the best way to speed this up is to use the standard threading library so that several HTTP connections run in parallel.
Something like this:
from Queue import Queue
from threading import Thread
import urllib2
import re

# Work queue that the URLs get pushed onto, bounded at 10 entries
url_queue = Queue(10)
pattern = re.compile('"></a>(.+?)</dd><dt>')

def worker():
    '''Gets the next url from the queue and processes it'''
    while True:
        url = url_queue.get()
        print url
        html = urllib2.urlopen(url).read()
        print html[:10]
        links = pattern.findall(html)
        if len(links) > 0:
            print links
        url_queue.task_done()

# Start a pool of 20 workers
for i in xrange(20):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

# Change this to read your links and queue them for processing
for url in xrange(100):
    url_queue.put("http://www.ravn.co.uk")

# Block until everything is finished.
url_queue.join()
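As an aside, not part of the original answer: the snippet above is Python 2 (urllib2, Queue, xrange). On Python 3 the same fan-out can be sketched with concurrent.futures; the URL list and the worker count of 20 are placeholder assumptions here:

import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

pattern = re.compile('"></a>(.+?)</dd><dt>')

def fetch(url):
    # Download one page and extract the matches; return the error
    # instead of raising so one bad page does not kill the pool
    try:
        html = urlopen(url).read().decode('utf-8', 'replace')
        return pattern.findall(html)
    except Exception as exc:
        return exc

urls = ["http://www.ravn.co.uk"] * 100  # replace with your real URL list

# 20 threads, mirroring the pool size in the answer above
with ThreadPoolExecutor(max_workers=20) as pool:
    for result in pool.map(fetch, urls):
        print(result)

The with-block joins the pool automatically once all downloads finish, so no explicit queue, daemon flags, or join() call is needed.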