TF-IDF: Algorithm Principles and Source Code Implementation

TF-IDF (Term Frequency-Inverse Document Frequency) measures how important a word is to a document. Let's walk through the TF-IDF formulas.

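First comes TF, the term frequency, which measures how often a word appears in a document. Suppose a word occurs \( n \) times in a document that contains \( N \) words in total; its TF is then defined as:

$$TF(t, d) = \frac{n}{N}$$

Note: in \( (t, d) \), \( t \) is a word in the document and \( d \) is the document, that is, its collection of words. Computing TF is simply a matter of counting word frequencies, so let's look at the code.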

from collections import Counter


def compute_tf(word_dict, doc_words):
    """
    :param word_dict: number of occurrences of each word
    :param doc_words: list of the words in the document
    :return: dict mapping each word to its term frequency
    """
    tf_dict = {}
    words_len = len(doc_words)
    for word_i, count_i in word_dict.items():
        tf_dict[word_i] = count_i / words_len
    return tf_dict


# Sample documents
doc1 = "this is a sample"
doc2 = "this is another example example example"
doc3 = "this is a different example example"

# Split into words
doc1_words = doc1.split()
doc2_words = doc2.split()
doc3_words = doc3.split()

# Count the words in each document
word_dict1 = Counter(doc1_words)
word_dict2 = Counter(doc2_words)
word_dict3 = Counter(doc3_words)

# Compute TF
tf1 = compute_tf(word_dict1, doc1_words)
tf2 = compute_tf(word_dict2, doc2_words)
tf3 = compute_tf(word_dict3, doc3_words)

print(f'tf1:{tf1}')
print(f'tf2:{tf2}')
print(f'tf3:{tf3}')
# tf1:{'this': 0.25, 'is': 0.25, 'a': 0.25, 'sample': 0.25}
# tf2:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'another': 0.16666666666666666, 'example': 0.5}
# tf3:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'a': 0.16666666666666666, 'different': 0.16666666666666666, 'example': 0.3333333333333333}

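With TF covered, let's turn to the definition, formula, and implementation of IDF. IDF stands for inverse document frequency and reflects how rare a word is: the higher the IDF, the rarer the word. The idea is that the fewer documents in the collection a word appears in, the more distinctive it is. Text is full of words such as "的" and "了" (common Chinese function words) that carry little importance, whereas words that appear rarely matter more. The IDF formula is:

$$IDF(t) = \log\frac{D}{df_t + 1}$$

where \( D \) is the total number of documents and \( df_t \) is the number of documents that contain the word \( t \). The 1 in the denominator matches the implementation below and keeps the denominator from ever being zero. Taking the logarithm keeps the values from growing too large while preserving the property that IDF decreases monotonically as \( df_t \) increases. Here is the implementation: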

import math
from collections import Counter


def compute_idf(doc_list):
    """
    :param doc_list: list of documents, each a list of words
    :return: dict mapping each word to its inverse document frequency
    """
    # Vocabulary: every distinct word across all documents
    sum_list = list(set([word_i for doc_i in doc_list for word_i in doc_i]))

    # Count how many documents contain each word
    idf_dict = {word_i: 0 for word_i in sum_list}
    for word_j in sum_list:
        for doc_j in doc_list:
            if word_j in doc_j:
                idf_dict[word_j] += 1
    return {k: math.log(len(doc_list) / (v + 1)) for k, v in idf_dict.items()}


# Sample documents
doc1 = "this is a sample"
doc2 = "this is another example example example"
doc3 = "this is a different example example"

# Split into words
doc1_words = doc1.split()
doc2_words = doc2.split()
doc3_words = doc3.split()

# Count the words in each document
word_dict1 = Counter(doc1_words)
word_dict2 = Counter(doc2_words)
word_dict3 = Counter(doc3_words)

# Compute the IDF over the whole document collection
idf = compute_idf([doc1_words, doc2_words, doc3_words])
print(f'idf:{idf}')
# idf:{'different': 0.4054651081081644, 'another': 0.4054651081081644, 'a': 0.0, 'example': 0.0, 'this': -0.2876820724517809, 'sample': 0.4054651081081644, 'is': -0.2876820724517809}

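The results show that different, another, and sample have higher IDF values than common words such as is and a, which marks them as more important. Note that this and is come out negative: they appear in all 3 documents, so the ratio is 3 / (3 + 1) = 0.75, and the logarithm of a number below 1 is negative.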
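If the negative values are unwanted, one common alternative is a smoothed IDF that adds 1 to both the numerator and the denominator and then adds 1 to the result, which is the form scikit-learn documents for its `smooth_idf=True` option. The sketch below is not part of the original code: `compute_idf_smooth` is a hypothetical name introduced here for illustration, and the formula is a swapped-in variant rather than the one used in this article.

```python
import math


def compute_idf_smooth(doc_list):
    """Hypothetical variant: log((1 + D) / (1 + df_t)) + 1, which is never below 1."""
    n_docs = len(doc_list)
    # Vocabulary: every distinct word across all documents
    vocab = {word for doc in doc_list for word in doc}
    # Document frequency: number of documents containing each word
    df = {word: sum(1 for doc in doc_list if word in doc) for word in vocab}
    return {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}


docs = [s.split() for s in (
    "this is a sample",
    "this is another example example example",
    "this is a different example example",
)]

idf_smooth = compute_idf_smooth(docs)
# Print from the most common word to the rarest
for word in sorted(idf_smooth, key=idf_smooth.get):
    print(f"{word}: {idf_smooth[word]:.4f}")
```

With this variant, this and is (present in every document) get exactly 1.0 instead of a negative value, and the ordering from common to rare words is preserved.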

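Finally, here is the TF-IDF formula:

$$\text{TF-IDF}(t, d) = TF(t, d) \times IDF(t)$$

TF-IDF is simply the product of TF and IDF, giving a combined measure of how important a word is to a document.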
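As a quick check of the product, here is the calculation done by hand for the word sample in the sentence "this is a sample", assuming a collection of 3 documents of which only 1 contains that word. The variable names are chosen for this illustration only, and the IDF uses the same +1 denominator as the article's `compute_idf`.

```python
import math

# Word "sample" in the document "this is a sample"
n, N = 1, 4      # occurrences of the word, total words in the document
D, df_t = 3, 1   # documents in the collection, documents containing the word

tf = n / N
idf = math.log(D / (df_t + 1))  # same +1 denominator as compute_idf
tfidf = tf * idf

print(round(tf, 4), round(idf, 4), round(tfidf, 4))  # 0.25 0.4055 0.1014
```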

Finally, here is the complete code:

import math
from collections import Counter


def compute_tfidf(tf_dict, idf_dict):
    # Multiply each word's TF by its IDF
    tfidf = {}
    for word, tf_value in tf_dict.items():
        tfidf[word] = tf_value * idf_dict[word]
    return tfidf


def compute_tf(word_dict, doc_words):
    """
    :param word_dict: number of occurrences of each word
    :param doc_words: list of the words in the document
    :return: dict mapping each word to its term frequency
    """
    tf_dict = {}
    words_len = len(doc_words)
    for word_i, count_i in word_dict.items():
        tf_dict[word_i] = count_i / words_len
    return tf_dict


def compute_idf(doc_list):
    """
    :param doc_list: list of documents, each a list of words
    :return: dict mapping each word to its inverse document frequency
    """
    # Vocabulary: every distinct word across all documents
    sum_list = list(set([word_i for doc_i in doc_list for word_i in doc_i]))

    # Count how many documents contain each word
    idf_dict = {word_i: 0 for word_i in sum_list}
    for word_j in sum_list:
        for doc_j in doc_list:
            if word_j in doc_j:
                idf_dict[word_j] += 1
    return {k: math.log(len(doc_list) / (v + 1)) for k, v in idf_dict.items()}


# Sample documents
doc1 = "this is a sample"
doc2 = "this is another example example example"
doc3 = "this is a different example example"

# Split into words
doc1_words = doc1.split()
doc2_words = doc2.split()
doc3_words = doc3.split()

# Count the words in each document
word_dict1 = Counter(doc1_words)
word_dict2 = Counter(doc2_words)
word_dict3 = Counter(doc3_words)

# Compute TF
tf1 = compute_tf(word_dict1, doc1_words)
tf2 = compute_tf(word_dict2, doc2_words)
tf3 = compute_tf(word_dict3, doc3_words)
print(f'tf1:{tf1}')
print(f'tf2:{tf2}')
print(f'tf3:{tf3}')

# Compute the IDF over the whole document collection
idf = compute_idf([doc1_words, doc2_words, doc3_words])
print(f'idf:{idf}')

# Compute the TF-IDF of each document
tfidf1 = compute_tfidf(tf1, idf)
tfidf2 = compute_tfidf(tf2, idf)
tfidf3 = compute_tfidf(tf3, idf)
print("TF-IDF for Document 1:", tfidf1)
print("TF-IDF for Document 2:", tfidf2)
print("TF-IDF for Document 3:", tfidf3)

"""
tf1:{'this': 0.25, 'is': 0.25, 'a': 0.25, 'sample': 0.25}
tf2:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'another': 0.16666666666666666, 'example': 0.5}
tf3:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'a': 0.16666666666666666, 'different': 0.16666666666666666, 'example': 0.3333333333333333}
idf:{'example': 0.0, 'different': 0.4054651081081644, 'this': -0.2876820724517809, 'another': 0.4054651081081644, 'is': -0.2876820724517809, 'a': 0.0, 'sample': 0.4054651081081644}
TF-IDF for Document 1: {'this': -0.07192051811294523, 'is': -0.07192051811294523, 'a': 0.0, 'sample': 0.1013662770270411}
TF-IDF for Document 2: {'this': -0.047947012075296815, 'is': -0.047947012075296815, 'another': 0.06757751801802739, 'example': 0.0}
TF-IDF for Document 3: {'this': -0.047947012075296815, 'is': -0.047947012075296815, 'a': 0.0, 'different': 0.06757751801802739, 'example': 0.0}
"""
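A typical use of these scores is picking the most characteristic word of each document. The following is a minimal, self-contained sketch of that step; `tfidf_scores` and `top_terms` are hypothetical helper names introduced here, not part of the original code, and the scoring reuses the same TF and log(D / (df + 1)) definitions as above in a more compact form.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {word: score} dict per document, using TF * log(D / (df + 1))."""
    n_docs = len(docs)
    # Document frequency: each document counts a word at most once
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / (df[word] + 1))
            for word, count in counts.items()
        })
    return scores


def top_terms(score_dict, k=1):
    """Return the k words with the highest score."""
    return sorted(score_dict, key=score_dict.get, reverse=True)[:k]


docs = [s.split() for s in (
    "this is a sample",
    "this is another example example example",
    "this is a different example example",
)]

all_scores = tfidf_scores(docs)
for i, score_dict in enumerate(all_scores, start=1):
    print(f"Document {i}:", top_terms(score_dict))
```

For the three sample documents this selects sample, another, and different, the words that occur in only one document each.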
