Unsurprisingly, NLTK is slow:
>>> tfidf = stemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 4.89 s per loop
>>> tfidf = TfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 415 ms per loop
You can speed this up by using a smarter implementation of the Snowball stemmer, such as PyStemmer:
>>> import Stemmer  # PyStemmer installs its module as "Stemmer" (capital S)
>>> english_stemmer = Stemmer.Stemmer('en')
>>> class stemmedTfidfVectorizer(TfidfVectorizer):
...     def build_analyzer(self):
...         # Reuse the default analyzer, then stem every token it returns
...         analyzer = super(stemmedTfidfVectorizer, self).build_analyzer()
...         return lambda doc: english_stemmer.stemWords(analyzer(doc))
...
>>> tfidf = stemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 650 ms per loop
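The build_analyzer override above works by wrapping one callable in another: the default analyzer turns a document into a list of tokens, and the wrapper passes that list through the stemmer. Here is a minimal standalone sketch of that pattern, using only the standard library so it runs without scikit-learn or PyStemmer. The names toy_stem_words, base_analyzer and build_stemming_analyzer are hypothetical stand-ins invented for illustration, and the toy stemmer is a crude suffix stripper, not the Snowball algorithm.

```python
import re

# Hypothetical stand-in for a real stemmer such as PyStemmer: it only strips
# a few common English suffixes and is NOT the Snowball algorithm.
def toy_stem_words(tokens):
    suffixes = ("ing", "ed", "s")
    stemmed = []
    for token in tokens:
        for suffix in suffixes:
            # Strip at most one suffix, and only from reasonably long tokens
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed

# Hypothetical stand-in for the callable that build_analyzer() returns:
# lowercase the text and keep word tokens of two or more characters.
def base_analyzer(doc):
    return re.findall(r"\b\w\w+\b", doc.lower())

def build_stemming_analyzer(analyzer, stem_words):
    """Wrap an analyzer so that every token it produces is stemmed."""
    return lambda doc: stem_words(analyzer(doc))

analyze = build_stemming_analyzer(base_analyzer, toy_stem_words)
print(analyze("Parsing imaged postings"))  # prints ['pars', 'imag', 'posting']
```

Because the wrapper stems a whole token list in one call, the cost of the stemmer call itself is what dominates, which is why swapping in a faster stemmer changes the timing so much.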
NLTK is a teaching toolkit. It is slow by design, because it is optimized for readability.