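Debugging Chinese word cloud code

Word clouds are fun.

Use jieba for word segmentation. The novel's text is stored in "mori.txt" and the stop-word list in "stopword.txt". The quality of the segmentation result depends heavily on the stop words, which need to be adjusted and extended continually.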

from wordcloud import WordCloud
import jieba
import matplotlib.pyplot as plt

# Read the novel and the stop-word list (assumes both files are UTF-8 encoded).
with open('mori.txt', 'r', encoding='utf-8') as fh:
    text = fh.read()
with open('stopword.txt', 'r', encoding='utf-8') as fh:
    stopwords = set(fh.read().split())  # whitespace-separated stop words

final = []
for seg in jieba.cut(text):
    seg = seg.strip()
    if not seg or seg[0] in '0123456789':  # skip whitespace tokens and numbers
        continue
    if seg not in stopwords:
        final.append(seg)

# A font with Chinese glyphs is required to display Chinese text.
font = r"c:/Windows/Fonts/simsun.ttc"
wordcloud = WordCloud(font_path=font, background_color="white",
                      width=1000, height=860, margin=2).generate(" ".join(final))

plt.imshow(wordcloud)
plt.axis("off")
plt.show()

wordcloud.to_file('test.png')

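Result: [image of the generated word cloud]

The code below counts word frequencies and sorts the words by frequency and by word length.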

# Count word frequencies
freqD2 = {}
for word2 in final:
    freqD2[word2] = freqD2.get(word2, 0) + 1

# Write the words sorted by frequency, highest first
counter_list = sorted(freqD2.items(), key=lambda x: x[1], reverse=True)
_2000 = counter_list[0][1] + 1  # highest frequency + 1, used below as the weight of word length
print(_2000)
with open('sort.txt', 'w', encoding='utf-8') as fp:
    for d in counter_list:
        fp.write(d[0] + ':' + str(d[1]) + '\n')

# Write the words sorted by length first, then by frequency
counter_list = sorted(freqD2.items(), key=lambda x: len(x[0]) * _2000 + x[1], reverse=True)
with open('sortlen.txt', 'w', encoding='utf-8') as fp:
    for d in counter_list:
        fp.write(d[0] + ':' + str(d[1]) + '\n')
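
The length-then-frequency ordering above works by multiplying the word length by a number larger than any frequency. As a minimal sketch (not the author's code), the same ordering can probably be expressed more directly with `collections.Counter` and a tuple sort key. The function name `sort_words` and the sample word list are made up for illustration.

```python
from collections import Counter


def sort_words(words):
    """Return (by_frequency, by_length_then_frequency) lists of (word, count)."""
    counts = Counter(words)
    # Highest count first; sorted() is stable, so ties keep first-seen order.
    by_freq = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Longer words first; among words of equal length, higher count first.
    by_len = sorted(counts.items(), key=lambda kv: (len(kv[0]), kv[1]), reverse=True)
    return by_freq, by_len


if __name__ == "__main__":
    # Hypothetical segmentation output, for illustration only.
    sample = ["森林", "森林", "我", "我", "我", "直子", "挪威的森林"]
    by_freq, by_len = sort_words(sample)
    print(by_freq)
    print(by_len)
```

The tuple key avoids having to compute the `_2000` weight first, at the cost of departing slightly from the original code.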

The sorting code is handy and worth borrowing. Python is a great tool: powerful and easy to reuse.

 


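Importing RSS feeds in the naive Bayes classification example from Machine Learning in Action (《机器学习实战》)

I got stuck at this point while typing in the code from the book. Other readers have probably run into the same problem, so I am writing it down to share.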

 


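How do I install feedparser?

Installing feedparser directly from the site given in the book fails with an error saying that setuptools is missing. The official advice for setuptools on Windows is to install it with ez_setup.py, but I could not download ez_setup.py from the official site. This post gives a workaround:

ez_setup.py

Copy this file into the C:\python27 folder and run: python ez_setup.py install

Then change to the folder that contains the feedparser installation files and run: python setup.py install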
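
Once both commands have finished, it can help to confirm that Python actually finds the packages. The following is a small sketch using only the standard library. `is_installed` is a hypothetical helper name, and the snippet is written for Python 3 rather than the Python 2.7 setup described above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("setuptools", "feedparser"):
        status = "found" if is_installed(name) else "missing"
        print(name, status)
```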

 

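A quick gripe first: I believe most readers get stuck here mainly because of the Great Firewall (GFW). Anyone who already gets around it, whether with software or by living abroad, will probably never see this post. If you would rather not read on, feel free to use a circumvention tool yourself.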


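The RSS feed links provided by the book's author

The author's intent is that articles from one feed are treated as class 1 and articles from the other feed as class 0.

To get the sample code to run, you can substitute any two working RSS feeds.

I used these two (their URLs appear in testRSS() in the code):

NASA Image of the Day
Yahoo Sports – NBA – Houston Rockets News

In other words, if the algorithm works correctly, every article from NASA will be classified as 1, and every Houston Rockets news item from Yahoo Sports will be classified as 0.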
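
To make the labelling concrete, here is a minimal Python 3 sketch of how summaries from two feeds can be paired up and labelled 1 and 0. It works on plain lists of strings so that it runs without network access. `label_documents` and the sample summaries are illustrative assumptions, not the book's code and not real feed content.

```python
def label_documents(class1_summaries, class0_summaries):
    """Pair up summaries from two sources and label them 1 and 0.

    Only the first min(len(a), len(b)) items of each list are used, so both
    classes contribute the same number of documents.
    """
    n = min(len(class1_summaries), len(class0_summaries))
    docs, labels = [], []
    for i in range(n):
        docs.append(class1_summaries[i])
        labels.append(1)
        docs.append(class0_summaries[i])
        labels.append(0)
    return docs, labels


if __name__ == "__main__":
    # Made-up summaries standing in for feed entries.
    nasa = ["Hubble captures a spiral galaxy", "Sunrise seen from orbit", "A rover selfie"]
    rockets = ["Rockets win at home", "Injury update before the road trip"]
    docs, labels = label_documents(nasa, rockets)
    print(len(docs), labels)
```

Truncating to the shorter feed keeps the two classes balanced, which matters when the feeds are as small as the ones discussed here.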

 

 


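With my own RSS feeds, the program raises an error when it reaches trainNB0(array(trainMat),array(trainClasses)). What should I do?

Judging from the book's example, the author's feeds contained many articles: len(ny['entries']) is 100, whereas the RSS feeds I found have only about 10 to 20 entries each.

>>> import feedparser
>>> ny = feedparser.parse('')
>>> ny['entries']
>>> len(ny['entries'])
100

Because the author's RSS feed has 100 articles, the book's code can afford to remove the 30 most frequent words and randomly pick 20 articles as the test set. With a substitute feed that has only 10 articles, drawing 20 test articles clearly fails (with 10 entries per feed there are only 20 documents in total, so nothing is left for training). Adjusting the test-set size yourself is enough to make the code run. If the articles contain few words, removing fewer "stop words" can also improve the algorithm's accuracy.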
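
One way to avoid this failure is to derive the test-set size from the number of documents actually available instead of hard-coding it. Below is a minimal Python 3 sketch. The helper name `split_indices` and the 20% hold-out fraction are my own assumptions, not values from the book.

```python
import random


def split_indices(num_docs, test_fraction=0.2, seed=None):
    """Randomly split range(num_docs) into (training, test) index lists."""
    if num_docs < 2:
        raise ValueError("need at least 2 documents to split")
    # At least one test document, and always leave at least one for training.
    num_test = max(1, int(num_docs * test_fraction))
    num_test = min(num_test, num_docs - 1)
    rng = random.Random(seed)
    test = sorted(rng.sample(range(num_docs), num_test))
    held_out = set(test)
    training = [i for i in range(num_docs) if i not in held_out]
    return training, test


if __name__ == "__main__":
    # 10 entries per feed gives 20 documents in total.
    training, test = split_indices(20, seed=0)
    print(len(training), len(test))
```

The posted code takes the simpler route of hard-coding 5 test documents, which works as long as the two feeds together supply more than 5 entries.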

 


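If you don't want to remove the 30 most frequent words, how else can you get rid of "stop words"?

You can put the stop words to remove in a txt file and read it at run time (in place of the code that removes the high-frequency words). For which words to treat as stop words, see the list referenced in the comment of stopWords() in the code.

The code below needs the stop words saved in stopword.txt to run properly.

My txt file contains the following words, and the results are decent: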

a
about
above
after
again
against
all
am
an
and
any
are
aren’t
as
at
be
because
been
before
being
below
between
both
but
by
can’t
cannot
could
couldn’t
did
didn’t
do
does
doesn’t
doing
don’t
down
during
each
few
for
from
further
had
hadn’t
has
hasn’t
have
haven’t
having
he
he’d
he’ll
he’s
her
here
here’s
hers
herself
him
himself
his
how
how’s
i
i’d
i’ll
i’m
i’ve
if
in
into
is
isn’t
it
it’s
its
itself
let’s
me
more
most
mustn’t
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan’t
she
she’d
she’ll
she’s
should
shouldn’t
so
some
such
than
that
that’s
the
their
theirs
them
themselves
then
there
there’s
these
they
they’d
they’ll
they’re
they’ve
this
those
through
to
too
under
until
up
very
was
wasn’t
we
we’d
we’ll
we’re
we’ve
were
weren’t
what
what’s
when
when’s
where
where’s
which
while
who
who’s
whom
why
why’s
with
won’t
would
wouldn’t
you
you’d
you’ll
you’re
you’ve
your
yours
yourself
yourselves
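
For reference, here is a minimal Python 3 sketch of loading such a file and removing its words from a vocabulary list. It is not the book's code: `load_stop_words` and `remove_stop_words` are hypothetical helper names, the file is assumed to hold one word per line in UTF-8, and normalizing curly apostrophes to straight ones is an extra assumption of mine.

```python
def load_stop_words(path):
    """Read one stop word per line and return them as a lower-case set."""
    words = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            # Normalize curly apostrophes so "don’t" and "don't" compare equal.
            word = line.strip().lower().replace("\u2019", "'")
            if word:
                words.add(word)
    return words


def remove_stop_words(vocab_list, stop_words):
    """Return the vocabulary without stop words, preserving order."""
    return [w for w in vocab_list if w.lower() not in stop_words]


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stopword.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("a\nabout\ndon\u2019t\n\nThe\n")
        stop_words = load_stop_words(path)
        print(remove_stop_words(["the", "rocket", "about", "nasa"], stop_words))
```

Reading line by line sidesteps a portability issue: since Python 3.7, `re.split(r'\W*', text)` also splits on empty matches, so that pattern no longer yields whole words the way it does under Python 2.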

 

'''
Created on Oct 19, 2010

@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help','my','dog', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones() 
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts,listClasses = loadDataSet()
    print 'listOPost:\n',listOPosts
    print 'listClasses:\n',listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n',myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:',trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

def textParse(bigString):    #input is big string, #output is word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = range(50); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True) 
    return sortedFreq[:30]       

def stopWords():
    import re
    wordList =  open('stopword.txt').read() # see http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W*', wordList)
    stopWordList = [tok.lower() for tok in listOfTokens if tok]   # drop empty tokens
    print 'read stop word from \'stopword.txt\':',stopWordList
    return stopWordList

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    print '\nVocabList is ',vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ',stopWord
##    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ',pairW[0]
    trainingSet = range(2*minLen); testSet=[]           #create test set
    print '\n\nBegin to create a test set: \ntrainingSet:',trainingSet,'\ntestSet',testSet
    for i in range(5):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:',trainingSet,'\ntestSet',testSet
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:',len(trainMat)
    print '\ntrainClasses',trainClasses
    print '\n\ntrainNB0:'
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    #print '\np0V:',p0V,'\np1V',p1V,'\npSpam',pSpam
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector),p0V,p1V,pSpam)
        originalClass = classList[docIndex]
        result =  classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n',docList[docIndex],'\nis classified as: ',classifiedClass,', while the original class is: ',originalClass,'. --',not result
    print '\nthe error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

def testRSS():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList,pSF,pNY = localWords(ny,sf)

def testTopWords():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny,sf)

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts,listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses

    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList

    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList,postinDoc))
    print 'Train Matrix:\n', trainMat

    print '\n*** Train ***'
    p0V,p1V,pAb = trainNB0(trainMat,listClasses)
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
