Python: seeking in an HTTP response stream

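I'm not sure how the C# implementation works, but since network streams are generally not seekable, my guess is that it downloads all the data into a local file or an in-memory object and seeks within that. The Python equivalent is to do as Abafei suggested: write the data to a file or a StringIO object and seek within that.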
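
As a rough illustration of the buffering approach, the sketch below reads a whole response into memory and wraps it in a seekable object. The helper name `make_seekable` is made up for this example, and `io.BytesIO` is used here as the binary-safe counterpart of StringIO; any object with a `read()` method, such as the result of `urllib2.urlopen`, should work as input.

```python
import io


def make_seekable(stream):
    """Read everything from a non-seekable stream into a seekable buffer.

    `stream` is any object with a read() method, e.g. an HTTP response.
    The entire body is held in memory, so this only suits modest sizes.
    """
    return io.BytesIO(stream.read())


# Stand-in for a response object; a real one would come from urlopen().
fake_response = io.BytesIO(b"hello world")
buf = make_seekable(fake_response)
buf.seek(6)          # jump to byte offset 6
tail = buf.read(5)   # read five bytes from there
```

For large downloads, writing to a temporary file instead of memory is the same idea with a different backing store.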

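However, if, as your comment on Abafei's answer suggests, you only want to retrieve a particular part of the file (rather than seeking backwards and forwards through the returned data), there is another possibility. urllib2 can be used to retrieve a certain section (or "range" in HTTP parlance) of a web page, provided the server supports this behaviour.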

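When you send a request to a server, the parameters of the request are given in various headers. One of these is the Range header, defined in section 14.35 of RFC2616 (the specification defining HTTP/1.1). This header allows you to do things such as retrieve all data starting from the 10,000th byte, or the data between bytes 1,000 and 1,500.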
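
To make the header format concrete, here is a small sketch that builds the two kinds of Range value described above. The function name `build_range_header` is hypothetical; the point is only that byte positions are zero-based and that the end position is inclusive, so a length of N starting at `start` ends at `start + N - 1`.

```python
def build_range_header(start, length=None):
    """Return a Range header value for a single byte range.

    With no length, ask for everything from `start` to the end.
    Otherwise ask for `length` bytes; the end offset is inclusive.
    """
    if start < 0:
        raise ValueError("start must be non-negative")
    if length is None:
        return "bytes=%d-" % start
    if length <= 0:
        raise ValueError("length must be positive")
    return "bytes=%d-%d" % (start, start + length - 1)


open_ended = build_range_header(10000)    # everything from byte 10,000 on
bounded = build_range_header(1000, 501)   # bytes 1,000 through 1,500
```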

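A server is not required to support range retrieval. Some servers return the Accept-Ranges header (section 14.5 of RFC2616) along with a response to report whether they support ranges. This can be checked using a HEAD request. However, there is no particular need to do this; if a server does not support ranges, it returns the entire page and we can then extract the desired portion of the data in Python as before.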
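
If you do want to check first, a HEAD request might look like the sketch below. The names `request_lib`, `head_accept_ranges` and `accepts_byte_ranges` are invented for this example, and the import fallback is only there so the same code runs where urllib2 has been renamed to `urllib.request`. Note that a missing Accept-Ranges header does not prove that ranges are unsupported; it only means the server did not advertise them.

```python
try:
    import urllib2 as request_lib          # Python 2, as in this answer
except ImportError:
    import urllib.request as request_lib   # Python 3 equivalent


def head_accept_ranges(url):
    """Send a HEAD request and return the Accept-Ranges value, or None."""
    request = request_lib.Request(url)
    # Override the HTTP method so that only headers are transferred.
    request.get_method = lambda: "HEAD"
    response = request_lib.urlopen(request)
    try:
        return response.headers.get("Accept-Ranges")
    finally:
        response.close()


def accepts_byte_ranges(accept_ranges_value):
    """Return True only if the header value explicitly lists 'bytes'."""
    if accept_ranges_value is None:
        return False
    units = [unit.strip().lower() for unit in accept_ranges_value.split(",")]
    return "bytes" in units
```

A call such as `accepts_byte_ranges(head_accept_ranges(url))` would then tell you whether the server advertised byte ranges for that URL.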

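If a server returns a range, it must send the Content-Range header (section 14.16 of RFC2616) along with the response. If this header is present in the response, we know a range was returned; if it is absent, the entire page was returned.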
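
For reference, a Content-Range value looks like `bytes 6000-6399/19387`, with `*` in place of the total when the size is unknown. The sketch below pulls such a value apart into integers; `parse_content_range` is a hypothetical helper, and it only handles the `bytes` unit.

```python
def parse_content_range(value):
    """Split a Content-Range value into (first, last, total).

    `total` is None when the server reports it as '*'. `first` and
    `last` are None for the 'bytes */total' form, which carries no range.
    """
    unit, _, spec = value.strip().partition(" ")
    if unit.lower() != "bytes" or "/" not in spec:
        raise ValueError("unsupported Content-Range: %r" % value)
    byte_range, _, total = spec.strip().partition("/")
    total = None if total == "*" else int(total)
    if byte_range == "*":
        return None, None, total
    first, _, last = byte_range.partition("-")
    return int(first), int(last), total


known_total = parse_content_range("bytes 6000-6399/19387")
unknown_total = parse_content_range("bytes 17387-19386/*")
```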

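urllib2 lets us add headers to a request, which lets us ask the server for a range rather than the entire page. The following script takes a URL, a start position, and (optionally) a length on the command line, and tries to retrieve the given section of the page.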

import sys
import urllib2

# Check command line arguments.
if len(sys.argv) < 3:
    sys.stderr.write("Usage: %s url start [length]\n" % sys.argv[0])
    sys.exit(1)

# Create a request for the given URL.
request = urllib2.Request(sys.argv[1])

# Add the header to specify the range to download.
if len(sys.argv) > 3:
    # Only the first two numeric arguments are used; extras are ignored.
    start, length = map(int, sys.argv[2:4])
    request.add_header("range", "bytes=%d-%d" % (start, start + length - 1))
else:
    request.add_header("range", "bytes=%d-" % int(sys.argv[2]))

# Try to get the response. This will raise a urllib2.URLError if there is a
# problem (e.g., invalid URL).
response = urllib2.urlopen(request)

# If a content-range header is present, partial retrieval worked.
if "content-range" in response.headers:
    print "Partial retrieval successful."

    # The header contains the string 'bytes', followed by a space, then the
    # range in the format 'start-end', followed by a slash and then the total
    # size of the page (or an asterisk if the total size is unknown). Let's get
    # the range and total size from this.
    byte_range, total = response.headers['content-range'].split(' ')[-1].split('/')

    # Print a message giving the range information.
    if total == '*':
        print "Bytes %s of an unknown total were retrieved." % byte_range
    else:
        print "Bytes %s of a total of %s were retrieved." % (byte_range, total)

# No header, so partial retrieval was unsuccessful.
else:
    print "Unable to use partial retrieval."

# And for good measure, let's check how much data we downloaded.
data = response.read()
print "Retrieved data size: %d bytes" % len(data)

Using this, I can retrieve the final 2,000 bytes of the Python homepage:

blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 17387
Partial retrieval successful.
Bytes 17387-19386 of a total of 19387 were retrieved.
Retrieved data size: 2000 bytes

Or 400 bytes from the middle of the homepage:

blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 6000 400
Partial retrieval successful.
Bytes 6000-6399 of a total of 19387 were retrieved.
Retrieved data size: 400 bytes

However, the Google homepage does not support ranges:

blair@blair-eeepc:~$ python retrieverange.py http://www.google.com/ 1000 500
Unable to use partial retrieval.
Retrieved data size: 9621 bytes

In this case, it would be necessary to extract the data of interest in Python prior to any further processing.
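
A minimal sketch of that fallback, assuming the whole body has already been read into memory: if the server honoured the range, the body is already the requested slice; if not, slice it locally. The name `select_bytes` and its `partial` flag (true when a Content-Range header was present) are made up for this example.

```python
def select_bytes(data, partial, start, length=None):
    """Return the requested portion of a downloaded body.

    `partial` says whether the server returned a range. If it did,
    `data` is already the wanted slice; otherwise cut it out here.
    """
    if partial:
        return data
    if length is None:
        return data[start:]
    return data[start:start + length]


whole_page = b"0123456789"
middle = select_bytes(whole_page, False, 3, 4)   # server ignored Range
tail = select_bytes(whole_page, False, 7)        # open-ended range
already_cut = select_bytes(b"3456", True, 3, 4)  # server honoured Range
```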

python 2022/1/1 18:36:26 234 views
