
python – How to train a classifier on a large dataset

5b51 · 2022/1/14 8:22:35 · python · 2,579 words · 560 reads · source: www.jb51.cc/python

I have a training dataset of 1,600,000 tweets. How can I train on data this large?

Overview

I tried using nltk.NaiveBayesClassifier. When I run it, training takes more than 5 days.

import nltk
import nltk.classify.util

def extract_features(tweet):
    # For every word in the global featureList, record whether the tweet contains it
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features


# apply_features builds the feature dicts lazily rather than all at once
training_set = nltk.classify.util.apply_features(extract_features, tweets)

NBClassifier = nltk.NaiveBayesClassifier.train(training_set)  # This takes lots of time
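To see where the time goes: `extract_features` builds one dict entry per `featureList` word for every tweet, so 1,600,000 tweets against even a 10,000-word `featureList` means roughly 16 billion dict entries over a full pass. A tiny self-contained illustration of that per-tweet cost (toy inputs, with `featureList` passed explicitly instead of read as a global):

```python
def extract_features(tweet, featureList):
    # One boolean entry per featureList word, regardless of tweet length
    tweet_words = set(tweet)
    return {'contains(%s)' % w: (w in tweet_words) for w in featureList}

feats = extract_features(["love", "movie"], ["love", "hate", "movie"])
print(len(feats))               # 3 entries: one per featureList word
print(feats['contains(hate)'])  # False: the tweet has no "hate"
```

The dict size is fixed by the feature list, not the tweet, which is why the cost scales as tweets × features.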

What should I do?

I need to classify my dataset using SVM and naive Bayes.

The dataset I want to use: Link

Sample (training dataset):

Label     Tweet
0         url aww bummer you shoulda got david carr third day
4         thankyou for your reply are you coming england again anytime soon

Sample (test dataset):

Label     Tweet
4         love lebron url
0         lebron beast but still cheering the til the end
I have to predict Label 0/4 only.
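Since the labels are only 0 and 4 (and label 2, neutral, gets dropped from the test file in the answer below), predicting them reduces to a binary problem; the usual move is to map 4 to 1. A minimal sketch of that mapping:

```python
# Map sentiment label 4 (positive) to 1, keeping 0 (negative) as-is
raw_labels = [0, 4, 4, 0, 4]
binary = [0 if x == 0 else 1 for x in raw_labels]
print(binary)  # [0, 1, 1, 0, 1]
```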

How can I train on this huge dataset efficiently?

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

train_df_raw = pd.read_csv('train.csv', header=None, names=['label','tweet'])
test_df_raw = pd.read_csv('test.csv', header=None, names=['label','tweet'])
train_df_raw = train_df_raw[train_df_raw['tweet'].notnull()]
test_df_raw = test_df_raw[test_df_raw['tweet'].notnull()]
test_df_raw = test_df_raw[test_df_raw['label'] != 2]  # drop neutral tweets

# Map label 4 (positive) to 1 so the task is binary
y_train = [x if x == 0 else 1 for x in train_df_raw['label'].tolist()]
y_test = [x if x == 0 else 1 for x in test_df_raw['label'].tolist()]
X_train = train_df_raw['tweet'].tolist()
X_test = test_df_raw['tweet'].tolist()

print('At vectorizer')
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
print('At vectorizer for test data')
X_test = vectorizer.transform(X_test)  # reuse the vocabulary fitted on train

print('at Classifier')
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))

cm = confusion_matrix(y_test, predictions)  # don't shadow the imported function
print(cm)

Accuracy: 0.8
[[135  42]
 [ 30 153]]
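The question asked for SVM and naive Bayes specifically; both slot into the same TF-IDF pipeline in place of LogisticRegression. A sketch on a toy corpus (stand-ins for the real `X_train`/`y_train` built from train.csv above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-ins for the real tweet lists
texts = ["aww bummer shoulda got david carr",
         "thankyou for your reply coming england soon",
         "love lebron",
         "bummer game again today"]
labels = [0, 1, 1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Fit and evaluate both requested model families on the same features
for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))
```

If even fitting the TF-IDF vocabulary is too heavy, HashingVectorizer (which needs no vocabulary pass) combined with SGDClassifier's partial_fit lets the 1,600,000 tweets be fed in chunks instead of held in memory at once.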
