
Extracting Google search results


Regular expressions are a poor fit for parsing HTML: the patterns are cryptic to read, and they break unless the HTML is well-formed.

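Try BeautifulSoup for Python. Here is a sample script that returns the URLs from the first 10 pages of results for a site:domain.com Google query. The script targets Python 2 (it uses urllib2 and the print statement).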

import sys # Used to add the BeautifulSoup folder to the import path
import urllib2 # Used to read the html document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, I have the BeautifulSoup folder in the level of this Python script
    ### So I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from BeautifulSoup import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### the "start" variable will be used to iterate through 10 pages.
    for start in range(0,10):
        url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start*10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### Looks like google contains URLs in <cite> tags.
        ### So for each cite tag on each page (10), print its contents (url)
        for cite in soup.findAll('cite'):
            print cite.text

Output

stackoverflow.com/
stackoverflow.com/questions
stackoverflow.com/unanswered
stackoverflow.com/users
meta.stackoverflow.com/
blog.stackoverflow.com/
chat.meta.stackoverflow.com/
...

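Of course, you could append each result to a list so that you can parse it for subdomains. I only started with Python and scraping a few days ago, but this should get you started.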
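As a rough sketch of that last step, the snippet below collects results in a list and groups them by subdomain. Unlike the script above, it assumes Python 3 and uses only the standard library. The helper names `host_of` and `group_by_subdomain` and the hard-coded sample list are illustrative assumptions, not part of the original script.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(result):
    """Return the lowercased host of a result such as 'stackoverflow.com/users'."""
    # The <cite> text usually has no scheme, so prefix '//' to make urlsplit
    # treat the leading part as the network location.
    if "://" not in result:
        result = "//" + result
    return (urlsplit(result).hostname or "").lower()


def group_by_subdomain(results, domain):
    """Map each subdomain label ('' for the bare domain) to its results."""
    groups = defaultdict(list)
    for result in results:
        host = host_of(result)
        if host == domain:
            groups[""].append(result)
        elif host.endswith("." + domain):
            # Strip the trailing '.<domain>' to keep only the subdomain part.
            groups[host[: -len(domain) - 1]].append(result)
    return dict(groups)


# Hypothetical sample based on the output shown above.
results = [
    "stackoverflow.com/",
    "stackoverflow.com/questions",
    "meta.stackoverflow.com/",
    "blog.stackoverflow.com/",
    "chat.meta.stackoverflow.com/",
]
groups = group_by_subdomain(results, "stackoverflow.com")
print(sorted(groups))
```

In the scraping loop, you would append `cite.text` to `results` instead of printing it, then call `group_by_subdomain` once after the loop finishes.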

Go 2022/1/1 18:27:16, 315 views
