Simple implementation of N-Gram, tf-idf and cosine similarity in Python

Check out the NLTK package: http://www.nltk.org. It has everything you need.

For the cosine_similarity:

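import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|. Despite the name, this is a similarity (1.0 for parallel
    vectors), not a distance.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))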
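
If numpy is not available, the same formula can be written with the standard library alone. This is only a minimal sketch: the name `cosine_similarity` and the decision to return 0.0 when either vector is all zeros are assumptions made here, not something NLTK defines.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric sequences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Identical directions give a value close to 1, orthogonal vectors give 0.
print(cosine_similarity([1, 0, 1], [2, 0, 2]))
print(cosine_similarity([1, 0], [0, 1]))
```

Because of floating-point rounding, the first value may come out marginally below 1.0, so compare with a tolerance rather than with `==`.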

For ngrams:

from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function.  Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """

    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]
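
The docstring above mentions an iterator version. One way such a version might look is sketched below; the name `iter_ngrams` and the `deque`-based sliding window are choices made for this illustration and are not taken from NLTK. It assumes `n >= 1`.

```python
from collections import deque
from itertools import chain

def iter_ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """Yield n-grams one at a time; accepts any iterable, including generators."""
    padding = (pad_symbol,) * (n - 1)
    if pad_left:
        sequence = chain(padding, sequence)
    if pad_right:
        sequence = chain(sequence, padding)
    window = deque(maxlen=n)  # sliding window holding the last n items
    for item in sequence:
        window.append(item)
        if len(window) == n:
            yield tuple(window)

words = "the quick brown fox".split()
print(list(iter_ngrams(words, 2)))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

Unlike the list version, this never materializes the whole input, which helps when the tokens come from a large file or a stream.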

For tf-idf, you will have to compute the distribution first. I am using Lucene to do that, but you can very well do something similar with NLTK; use FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
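
As a rough illustration of "compute the distribution first", the sketch below counts terms with `collections.Counter` from the standard library (in NLTK 3, `FreqDist` is a `Counter` subclass, so it could be swapped in). The helper name `tf_idf`, the toy documents, and the weighting `tf * log(N / df)` are assumptions chosen for the example; Lucene and other libraries apply different smoothing.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    The weight is the raw term count times log(N / df), one common variant.
    """
    n_docs = len(docs)
    # Per-document term counts (the frequency distribution).
    counts = [Counter(tokens) for tokens in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]

docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "a cat and a dog".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

With this variant a term that occurs in every document gets weight 0, which is the intended effect of the idf factor: such terms do not help tell documents apart.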

If you like pylucene, this will show you how to compute tf.idf:

    # Fragment: searcher, sim, fieldname, maxtokensperdoc, max_doc, tmap and
    # cosine_normalization are assumed to be defined elsewhere.
    # reader = lucene.IndexReader(FSDirectory.open(index_loc))
    docs = reader.numDocs()
    for i in range(docs):
        tfv = reader.getTermFreqVector(i, fieldname)
        if tfv:
            rec = {}
            terms = tfv.getTerms()
            frequencies = tfv.getTermFrequencies()
            # the range() in the zip caps the number of terms taken per document
            for (t, f, x) in zip(terms, frequencies, range(maxtokensperdoc)):
                df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
                tmap.setdefault(t, len(tmap))  # give each new term an integer id
                rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
            # and normalize the values using cosine normalization
            if cosine_normalization:
                denom = sum([x**2 for x in rec.values()])**0.5
                for k, v in rec.items():
                    rec[k] = v / denom
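
The point of the cosine normalization step at the end of the snippet above is that, once every record has unit length, the cosine similarity of two documents reduces to a plain dot product. The sketch below shows this on sparse `{term: weight}` dicts; the helper names `cosine_normalize` and `sparse_dot` and the sample weights are made up for illustration.

```python
def cosine_normalize(rec):
    """Scale a {term: weight} dict to unit Euclidean length."""
    denom = sum(w ** 2 for w in rec.values()) ** 0.5
    if denom == 0:
        return dict(rec)  # assumption: leave all-zero records unchanged
    return {t: w / denom for t, w in rec.items()}

def sparse_dot(a, b):
    """Dot product of two sparse {term: weight} vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(w * b[t] for t, w in a.items() if t in b)

doc1 = cosine_normalize({"cat": 3.0, "mat": 4.0})
doc2 = cosine_normalize({"cat": 1.0, "dog": 1.0})

# After normalization the dot product is the cosine similarity.
print(sparse_dot(doc1, doc2))
print(sparse_dot(doc1, doc1))  # a document compared with itself: about 1.0
```

Only terms shared by both documents contribute to the sum, so the comparison costs time proportional to the smaller record rather than to the vocabulary size.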

python 2022/1/1 18:36:46 443 views
