您好, 欢迎来到 !    登录 | 注册 | | 设为首页 | 收藏本站

Python-如何下载NLTK数据?

Python-如何下载NLTK数据?

要下载特定的数据集/模型,请使用nltk.download()函数,例如,如果你要下载punkt句子标记器,请使用:

$ python3
>>> import nltk
>>> nltk.download('punkt')

如果不确定所需的数据/模型,则可以使用以下数据和模型的基本列表开始:

>>> import nltk
>>> nltk.download('popular')

它将下载“流行”资源的列表,其中包括

<collection id="popular" name="Popular packages">
      <item ref="cmudict" />
      <item ref="gazetteers" />
      <item ref="genesis" />
      <item ref="gutenberg" />
      <item ref="inaugural" />
      <item ref="movie_reviews" />
      <item ref="names" />
      <item ref="shakespeare" />
      <item ref="stopwords" />
      <item ref="treebank" />
      <item ref="twitter_samples" />
      <item ref="omw" />
      <item ref="wordnet" />
      <item ref="wordnet_ic" />
      <item ref="words" />
      <item ref="maxent_ne_chunker" />
      <item ref="punkt" />
      <item ref="snowball_data" />
      <item ref="averaged_perceptron_tagger" />
    </collection>

已编辑如果有人避免nltk从https://stackoverflow.com/a/38135306/610569上从下载较大的数据集而避免错误

$ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
$ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
$ python

>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index to treat panlex_lite as it's already installed.
>>> dler.download('popular')

更新

从v3.2.5起,当nltk_data找不到资源时,NLTK会提供更多信息,例如:

>>> from nltk import word_tokenize
>>> word_tokenize('x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load
    opened_resource = _open(resource_url)
  File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open
    return find(path_, path + ['']).open()
  File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Searched in:
    - '/Users/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
python 2022/1/1 18:24:44 有179人围观

撰写回答


你尚未登录,登录后可以

和开发者交流问题的细节

关注并接收问题和回答的更新提醒

参与内容的编辑和改进,让解决方法与时俱进

请先登录

推荐问题


联系我
置顶