Printing UTF-8 characters in Python 2.7

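When Unicode code points (characters) are encoded as UTF-8, some code points are converted to a single byte, but many take more than one byte. Characters in the standard 7-bit ASCII range are encoded as single bytes, but more exotic characters generally need more bytes.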
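To make the byte counts concrete, here is a small Python 3 sketch (not the Python 2.7 used in the question) that encodes a few sample characters and records how many bytes each one takes. The sample characters and the name `widths` are chosen only for illustration.

```python
# Python 3 sketch: how many bytes does each character need in UTF-8?
samples = ["A", "\u00a9", "\u2122", "\U0001f600"]  # A, ©, ™ and an emoji

widths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    widths[ch] = len(encoded)
    # Print the code point, the byte count and the raw bytes.
    print("U+%04X" % ord(ch), len(encoded), encoded)
```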

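So you are getting those weird characters because you are breaking up those multi-byte UTF-8 sequences into single bytes. Sometimes those bytes will correspond to normal printable characters, but often they won't, so a replacement glyph (typically �) gets printed instead.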
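The effect is easy to reproduce. The following Python 3 sketch (the variable names are made up for illustration) cuts the 3-byte sequence for ™ after its first byte. A strict decode of that fragment fails, and decoding with `errors="replace"` yields U+FFFD, the replacement character, which is roughly what a UTF-8 terminal shows when it is handed a lone byte from the middle of a sequence.

```python
tm = "\u2122".encode("utf-8")   # the three bytes b'\xe2\x84\xa2'
fragment = tm[:1]               # keep only the first byte

try:
    fragment.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print("strict decode failed:", exc.reason)

# Ask the codec to substitute U+FFFD instead of raising.
replaced = fragment.decode("utf-8", errors="replace")
print(ascii(replaced))
```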

Here's a short demo using the ©, ® and ™ characters, which are encoded as 2, 2 and 3 bytes respectively in UTF-8. My terminal is set to use UTF-8.

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
print utfbytes, len(utfbytes)
for b in utfbytes:
    print b, repr(b)

uni = utfbytes.decode('utf-8')
print uni, len(uni)
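For comparison, here is a rough Python 3 port of the same demo. One difference worth noting: iterating over a `bytes` object in Python 3 yields integers rather than 1-character strings, so each byte is shown in hex here.

```python
utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
print(utfbytes, len(utfbytes))
for b in utfbytes:
    # b is an int in the range 0-255, not a 1-character string
    print(b, hex(b))

uni = utfbytes.decode("utf-8")
print(uni, len(uni))
```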

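Stack Overflow co-founder Joel Spolsky has written a good article on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)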

You should also take a look at the Unicode HOWTO article in the Python docs, and at Ned Batchelder's Pragmatic Unicode article, also known as "Unipain".

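Here's a short example of extracting individual characters from a UTF-8 encoded byte string. As I mentioned in the comments, to do this correctly you need to know how many bytes each character is encoded as.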

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    print "%d %d [%s]" % (start, w, utfbytes[start:start+w])
    start += w

0 2 [©]
2 1 [ ]
3 2 [®]
5 1 [ ]
6 3 [™]

FWIW, here's a Python 3 version of that code:

utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    s = utfbytes[start:start+w]
    print("%d %d [%s]" % (start, w, s.decode()))
    start += w

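If we don't know the byte widths of the characters in our UTF-8 string, then we need to do a little more work. Each UTF-8 sequence encodes the width of the sequence in its first byte, as described in the Wikipedia article on UTF-8.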
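One way to see this is to count the leading 1 bits of the first byte. The Python 3 sketch below does that with a hypothetical helper, `leading_ones` (not part of any library): zero leading ones means a 1-byte ASCII character, one means a continuation byte, and two, three or four give the length of the sequence.

```python
def leading_ones(byte):
    """Count the 1 bits at the top of an 8-bit value before the first 0 bit."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count

# First bytes of "A", "©", "™", a 4-byte sequence, and a continuation byte.
for first in (0x41, 0xc2, 0xe2, 0xf0, 0xa9):
    print(format(first, "08b"), leading_ones(first))
```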

The following Python 2 demo shows how to extract that width information. It produces the same output as the two fixed-width snippets above.

# UTF-8 code widths
# width  starting byte
# 1      0xxxxxxx
# 2      110xxxxx
# 3      1110xxxx
# 4      11110xxx
# C      10xxxxxx  (continuation byte)

def get_width(b):
    if b <= '\x7f':
        return 1
    elif '\x80' <= b <= '\xbf':
        #Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif '\xc0' <= b <= '\xdf':
        return 2
    elif '\xe0' <= b <= '\xef':
        return 3
    elif '\xf0' <= b <= '\xf7':
        return 4
    else:
        raise ValueError('%r is not a valid UTF-8 start byte' % b)


utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
start = 0
while start < len(utfbytes):
    b = utfbytes[start]
    w = get_width(b)
    s = utfbytes[start:start+w]
    print "%d %d [%s]" % (start, w, s)
    start += w

In general, it should not be necessary to do this sort of thing: just use the provided decoding methods.
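For instance, the same offset/width listing can be rebuilt from the built-in codec alone: decode the whole byte string once, then re-encode each character to learn its width. This is a Python 3 sketch, and the list name `pieces` is only for illustration.

```python
utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"

pieces = []
start = 0
for ch in utfbytes.decode("utf-8"):
    # Re-encoding a single character tells us how many bytes it occupied.
    w = len(ch.encode("utf-8"))
    pieces.append((start, w, ch))
    print("%d %d [%s]" % (start, w, ch))
    start += w
```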

For the curious, here's a Python 3 version of get_width, and a function that decodes a UTF-8 byte string by hand.

def get_width(b):
    if b <= 0x7f:
        return 1
    elif 0x80 <= b <= 0xbf:
        #Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif 0xc0 <= b <= 0xdf:
        return 2
    elif 0xe0 <= b <= 0xef:
        return 3
    elif 0xf0 <= b <= 0xf7:
        return 4
    else:
        raise ValueError('%r is not a valid UTF-8 start byte' % b)

def decode_utf8(utfbytes):
    start = 0
    uni = []
    while start < len(utfbytes):
        b = utfbytes[start]
        w = get_width(b)
        if w == 1:
            n = b
        else:
            n = b & (0x7f >> w)
            for b in utfbytes[start+1:start+w]:
                if not 0x80 <= b <= 0xbf:
                    raise ValueError('Not a continuation byte: %r' % b)
                n <<= 6
                n |= b & 0x3f
        uni.append(chr(n))
        start += w
    return ''.join(uni)


utfbytes = b'\xc2\xa9 \xc2\xae \xe2\x84\xa2'
print(utfbytes.decode('utf8'))
print(decode_utf8(utfbytes))

© ® ™
© ® ™

python 2022/1/1 18:39:54 253 views
