How to see the top n entries of a term-document matrix after tf-idf in scikit-learn

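Since version 0.15, the global term weighting of the features learnt by a TfidfVectorizer can be accessed through the attribute idf_, which returns an array whose length equals the number of features. Sort the features by this weighting to get the top-weighted features: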

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lectures)
# Sort feature indices by global idf weight, highest first
indices = np.argsort(vectorizer.idf_)[::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = vectorizer.get_feature_names_out()
top_n = 2
top_features = [str(features[i]) for i in indices[:top_n]]
print(top_features)

Output:

['food', 'drink']
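
To see why "food" and "drink" come out on top, it may help to recompute the weights by hand. The sketch below assumes the vectorizer's defaults (smooth_idf=True, lowercasing, tokens of two or more word characters), for which scikit-learn documents idf(t) = ln((1 + n) / (1 + df(t))) + 1. The helper name smoothed_idf is made up for this illustration and is not part of scikit-learn.

```python
import math
import re
from collections import Counter


def smoothed_idf(docs):
    """Return {term: idf} using idf(t) = ln((1 + n) / (1 + df(t))) + 1."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document. Tokens are 2+ word characters,
        # mirroring TfidfVectorizer's default token_pattern (an assumption
        # that holds only while the defaults are left unchanged).
        doc_freq.update(set(re.findall(r"\b\w\w+\b", doc.lower())))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


lectures = ["this is some food", "this is some drink"]
idf = smoothed_idf(lectures)
for term in sorted(idf, key=idf.get, reverse=True):
    print(f"{term}: {idf[term]:.3f}")
```

Terms that occur in only one of the two documents ("food", "drink") get ln(3/2) + 1, roughly 1.405, while terms that occur in both get ln(3/3) + 1 = 1. Because idf_ is a corpus-wide weight, tied terms can appear in either order.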

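The second question, getting the top features when using n-grams, can be handled with the same idea plus a few extra steps that split the features into groups by n-gram length: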

from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(lectures)
# Group (feature, idf weight) pairs by n-gram length
features_by_gram = defaultdict(list)
for f, w in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    features_by_gram[len(f.split(' '))].append((str(f), w))
top_n = 2
for gram, features in features_by_gram.items():
    top_features = sorted(features, key=lambda x: x[1], reverse=True)[:top_n]
    top_features = [f[0] for f in top_features]
    print('{}-gram top:'.format(gram), top_features)

Output:

1-gram top: ['drink', 'food']
2-gram top: ['some drink', 'some food']
Category: Other, posted 2022/1/1 18:39:07
