Unicode table information about a character in Python

The standard module unicodedata defines a lot of properties, but not all of them. A quick look at its source confirms this.
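To see what the standard module does cover, here is a minimal sketch that queries it for one character. All the calls are regular unicodedata functions; the choice of U+0204 as the test character is mine. As far as I can tell, the module has no function that returns a character's block or its raw case-mapping fields, which is the gap the rest of this answer fills.

```python
import unicodedata

ch = u'\u0204'  # LATIN CAPITAL LETTER E WITH DOUBLE GRAVE

# Properties the standard module exposes directly
properties = {
    'name': unicodedata.name(ch),
    'category': unicodedata.category(ch),
    'bidirectional': unicodedata.bidirectional(ch),
    'combining': unicodedata.combining(ch),
    'mirrored': unicodedata.mirrored(ch),
    'decomposition': unicodedata.decomposition(ch),
}
for key in sorted(properties):
    print('%s: %r' % (key, properties[key]))

# Numeric lookups accept a default for characters without a numeric value.
quarter = unicodedata.numeric(u'\u00bc', None)  # VULGAR FRACTION ONE QUARTER
no_value = unicodedata.numeric(ch, None)
print(quarter, no_value)
```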

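Fortunately UnicodeData.txt, the data file it is built from, is not hard to parse. Each line consists of exactly 15 elements separated by semicolons (;), which makes it ideal for parsing. Using the description of the elements at ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I have taken the names of the class attributes from that list; the meaning of each element is explained on the same page.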
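To make the 15-field layout concrete, here is a small sketch that splits a single line into named fields. The UnicodeRecord tuple and the split_record helper are hypothetical names of my own, and the sample line is what I believe the entry for U+0204 looks like in the file, so check it against your own copy.

```python
from collections import namedtuple

# Hypothetical helper type; the field names mirror the attributes of
# the UnicodeCharacter class used in this answer.
UnicodeRecord = namedtuple('UnicodeRecord', [
    'code', 'name', 'category', 'combining', 'bidirectional',
    'decomposition', 'asDecimal', 'asDigit', 'asNumeric', 'mirrored',
    'uc1Name', 'comment', 'uppercase', 'lowercase', 'titlecase'])

def split_record(line):
    """Split one UnicodeData.txt line into its 15 raw string fields."""
    fields = line.rstrip('\r\n').split(';')
    if len(fields) != 15:
        raise ValueError('expected 15 fields, got %d' % len(fields))
    return UnicodeRecord(*fields)

# Assumed sample line (believed to match the entry for U+0204)
sample = ('0204;LATIN CAPITAL LETTER E WITH DOUBLE GRAVE;Lu;0;L;0045 030F;'
          ';;;N;LATIN CAPITAL LETTER E DOUBLE GRAVE;;;0205;')
record = split_record(sample)
print(record.name)
print(int(record.code, 16), record.lowercase)
```

Every field is still a string at this point; empty fields come back as empty strings, which is why the full code converts each one conditionally.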

Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them in the same folder as this program.

The code (tested with Python 2.7 and 3.6):

# -*- coding: utf-8 -*-

class UnicodeCharacter:
    def __init__(self):
        self.code = 0
        self.name = 'unnamed'
        self.category = ''
        self.combining = ''
        self.bidirectional = ''
        self.decomposition = ''
        self.asDecimal = None
        self.asDigit = None
        self.asNumeric = None
        self.mirrored = False
        self.uc1Name = None
        self.comment = ''
        self.uppercase = None
        self.lowercase = None
        self.titlecase = None
        self.block = None

    def __getitem__(self, item):
        return getattr(self, item)

    def __repr__(self):
        return '{'+self.name+'}'

class UnicodeBlock:
    def __init__(self):
        self.first = 0
        self.last = 0
        self.name = 'unnamed'

    def __repr__(self):
        return '{'+self.name+'}'

class BlockList:
    def __init__(self):
        self.blocklist = []
        with open('Blocks.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.split(';')
                    block = UnicodeBlock()
                    block.name = rawdata[1].strip()
                    rawdata = rawdata[0].split('..')
                    block.first = int(rawdata[0],16)
                    block.last = int(rawdata[1],16)
                    self.blocklist.append(block)
            # make 100% sure it's sorted, for quicker look-up later
            # (it is usually sorted in the file, but better make sure)
            self.blocklist.sort (key=lambda x: x.first)

    def lookup(self,code):
        for item in self.blocklist:
            if code >= item.first and code <= item.last:
                return item.name
        return None

class UnicodeList:
    """UnicodeList loads Unicode data from the external files
    'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org

    These files must appear in the same directory as this program.

    UnicodeList is a new interpretation of the standard library
    'unicodedata'; you may first want to check if its functionality
    suffices.

    As UnicodeList loads its data from an external file, it does not depend
    on the local build from Python (in which the Unicode data gets frozen
    to the then 'current' version).

    Initialize with

        uclist = UnicodeList()
    """
    def __init__(self):

        # we need this first
        blocklist = BlockList()
        bpos = 0

        self.codelist = []
        with open('UnicodeData.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.strip().split(';')
                    parsed = UnicodeCharacter()
                    parsed.code = int(rawdata[0],16)
                    parsed.name = rawdata[1]
                    parsed.category = rawdata[2]
                    parsed.combining = rawdata[3]
                    parsed.bidirectional = rawdata[4]
                    parsed.decomposition = rawdata[5]
                    parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
                    parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
                    # the following value may contain a slash:
                    #  ONE QUARTER ... 1/4
                    # divide as floats so Python 2.7 gives the same result
                    if '/' in rawdata[8]:
                        numerator, denominator = rawdata[8].split('/')
                        parsed.asNumeric = float(numerator) / float(denominator)
                    else:
                        parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
                    parsed.mirrored = rawdata[9] == 'Y'
                    parsed.uc1Name = rawdata[10]
                    parsed.comment = rawdata[11]
                    parsed.uppercase = int(rawdata[12],16) if rawdata[12] else None
                    parsed.lowercase = int(rawdata[13],16) if rawdata[13] else None
                    parsed.titlecase = int(rawdata[14],16) if rawdata[14] else None
                    while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
                        bpos += 1
                    parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
                    self.codelist.append(parsed)

    def find_code(self,codepoint):
        """Find the Unicode information for a codepoint (as int).

        Returns:
            a UnicodeCharacter class object or None.
        """
        # the list is unlikely to contain duplicates but I have seen Unicode.org
        # doing that in similar situations. Again, better make sure.
        val = [x for x in self.codelist if codepoint == x.code]
        return val[0] if val else None

    def find_char(self,text):
        """Find the Unicode information for a codepoint (as character).

        Returns:
            for a single character: a UnicodeCharacter class object or
            None.
            for a multicharacter string: a list of the above, one element
            per character.
        """
        # 'text' rather than 'str', to avoid shadowing the built-in type
        if len(text) > 1:
            return [self.find_code(ord(x)) for x in text]
        else:
            return self.find_code(ord(text))

Once loaded, you can look up a character code with

>>> ul = UnicodeList()     # ONLY NEEDED ONCE!
>>> print (ul.find_code(0x204))
{LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}

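By default this prints the name of the character (Unicode calls the character itself a 'code point'), but you can retrieve other properties as well: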

>>> print ('%04X' % ul.find_code(0x204).lowercase)
0205
>>> print (ul.find_code(0x204).block)
Latin Extended-B

and (as long as you don't get None) even chain them:

>>> print (ul.find_code(ul.find_code(0x204).lowercase))
{LATIN SMALL LETTER E WITH DOUBLE GRAVE}

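It does not depend on your particular version of Python; you can always download an updated list from unicode.org and be sure of getting the most recent information: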

>>> import unicodedata
>>> print (unicodedata.name('\U0001F903'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> print (ul.find_code(0x1f903))
{LEFT HALF CIRCLE WITH FOUR DOTS}

(Tested with Python 3.5.3.)
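If you want to check which Unicode version your own Python build has frozen in, unicodedata.unidata_version reports it. The sketch below assumes Python 3 (chr only covers the full Unicode range there); builtin_name is a hypothetical helper of mine that returns None instead of raising ValueError.

```python
import sys
import unicodedata

# Unicode version that was compiled into this Python build
print('Python %d.%d uses Unicode %s' % (
    sys.version_info[0], sys.version_info[1], unicodedata.unidata_version))

def builtin_name(codepoint):
    """Return the built-in name for a code point, or None if the build has none."""
    # Python 3 only: chr() accepts any code point up to 0x10FFFF.
    return unicodedata.name(chr(codepoint), None)

print(builtin_name(0x204))
print(builtin_name(0x1F903))  # None on builds whose Unicode data predates this character
```

A None here for a code point that UnicodeData.txt does list is a sign that the file is newer than the interpreter's built-in tables.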

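Two lookup functions are currently defined: find_code, which takes a code point as an integer, and find_char, which takes a character or a string.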

After importing the class with from unicodelist import UnicodeList (assuming you saved the code as unicodelist.py), you can use

>>> ul = UnicodeList()
>>> hex(ul.find_char(u'è').code)
'0xe8'

to look up the hex code of any character, and a list comprehension such as

>>> l = [hex(ul.find_char(x).code) for x in 'Hello']
>>> l
['0x48', '0x65', '0x6c', '0x6c', '0x6f']

for longer strings. Note that you don't actually need any of this if all you want is the hex representation of a string! This suffices:

 l = [hex(ord(x)) for x in 'Hello']

The purpose of this module is to give you easy access to the other Unicode properties. A longer example:

text = 'Héllo...'
dest = ''
for ch in text:
    # look each character up once; keep it unchanged if it has no uppercase
    upper = ul.find_char(ch).uppercase
    dest += chr(upper) if upper is not None else ch
print (dest)

HÉLLO...

and, following your example, displaying a list of properties for a character:

letter = u'?'
print ('Name > '+ul.find_char(letter).name)
print ('Unicode number > U+%04x' % ul.find_char(letter).code)
print ('Bloc > '+ul.find_char(letter).block)
lower = ul.find_char(letter).lowercase
# not every character has a lowercase mapping
print ('Lowercase > %s' % (chr(lower) if lower is not None else 'none'))

(I left out the HTML entity names; those names are not defined in the Unicode standard.)
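One possible refinement: find_code scans the whole list on every call, and the block list is sorted but only ever searched linearly. A dictionary keyed by code point and a binary search over the block starts would make both lookups much faster. The sketch below uses a few hard-coded sample entries in place of the parsed files; lookup_block, block_starts and names_by_code are hypothetical names and not part of the code above.

```python
import bisect

# Sample (first, last, name) block ranges standing in for Blocks.txt,
# sorted by their first code point.
blocks = [
    (0x0000, 0x007F, 'Basic Latin'),
    (0x0080, 0x00FF, 'Latin-1 Supplement'),
    (0x0100, 0x017F, 'Latin Extended-A'),
    (0x0180, 0x024F, 'Latin Extended-B'),
]
block_starts = [first for first, last, name in blocks]

def lookup_block(code):
    """Binary-search the sorted block ranges for a code point."""
    pos = bisect.bisect_right(block_starts, code) - 1
    if pos >= 0:
        first, last, name = blocks[pos]
        if first <= code <= last:
            return name
    return None

# A dictionary keyed by code point replaces the list scan in find_code.
names_by_code = {
    0x0204: 'LATIN CAPITAL LETTER E WITH DOUBLE GRAVE',
    0x0205: 'LATIN SMALL LETTER E WITH DOUBLE GRAVE',
}

print(lookup_block(0x204))
print(names_by_code.get(0x204))
print(lookup_block(0x10FFFF))  # outside the sample ranges
```

In the real class the dictionary could be filled while the file is parsed, mapping each code to its UnicodeCharacter object.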

python 2022/1/1 18:32:00, 206 views
