Time flies: it has been over four weeks since I started the class. I have
learned a few new things since my last blog post, so it is time to summarize
what the past two weeks covered.
I picked out four key topics from the last two classes: “document comparison”,
“text classification”, “text clustering”, and “sentiment analysis and opinion
mining”. Let’s go through them in order. When it comes to the similarity
between documents, in my view we only need to pay attention to a handful of
keywords. If two papers share several words that appear frequently in both,
there is a high probability that they focus on the same research area. This is
only a rule of thumb, though. There are quantitative methods with higher
accuracy.
The TF*IDF rule is one of these methods, and it is simple and effective. TF
(term frequency) is simply the number of times a given keyword appears within a
specific document. IDF (inverse document frequency) is obtained by dividing the
number of documents in the whole collection by the number of documents
containing the given keyword; in practice the logarithm of this ratio is
usually taken. The larger the IDF, the fewer documents in the collection
contain the keyword, so such a keyword is highly specific, that is,
distinctive. If two documents have several distinctive keywords in common, they
are probably similar and can be placed in the same group.
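To make the TF*IDF idea concrete, here is a minimal sketch in Python. It is
only an illustration under my own assumptions: the three toy documents, the
whitespace tokenizer, the function names, and the use of cosine similarity to
compare the weighted documents are not from the class, and the IDF uses the
common logarithmic form.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace splitting is enough for this toy example.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one {keyword: TF*IDF weight} dictionary per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each keyword occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # TF: raw count within this document
        vectors.append({
            word: count * math.log(n_docs / doc_freq[word])  # TF * IDF
            for word, count in term_freq.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse keyword-weight vectors."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy collection, made up for illustration only.
docs = [
    "text mining finds patterns in text documents",
    "text clustering groups similar text documents",
    "my cat sleeps on the sofa all day",
]
vectors = tf_idf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])  # documents share keywords
sim_02 = cosine_similarity(vectors[0], vectors[2])  # documents share nothing
print(sim_01, sim_02)
```

The first two documents share the keywords "text" and "documents", so their
similarity is above zero, while the first and third share no keyword at all and
score zero. Note that a keyword appearing in every document gets an IDF of
log(1) = 0 and therefore contributes nothing to the comparison.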
In fact, the first three topics can be discussed together. In this era of
information explosion, documents are generated at an amazing speed. Under
these circumstances, people have realized the importance of comparing documents
and putting similar files into the same group. Doing so increases efficiency
and makes work easier.
Apart from text classification, analyzing people’s moods is also a useful tool
in social network analytics. People like to share their feelings with their
friends on platforms such as Facebook and Twitter. Some people express
themselves explicitly, while others prefer implicit expressions. In most cases,
people’s moods can be divided into three types: positive, negative and
objective (neutral). The three types of mood are three classes, and people’s
posts are the text documents, so the analyst’s task is text classification, as
discussed above. We may create a dictionary for each class, each containing
several typical keywords. By comparing a person’s post with the keywords in
each dictionary, we can estimate that person’s current feelings.
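The dictionary approach can be sketched in a few lines of Python. This is a
rough illustration, not the method taught in class: the two word lists and the
function name are made up, and I assume that a post with no keyword hits, or
with equally many positive and negative hits, falls into the neutral class
instead of keeping a separate neutral dictionary.

```python
# Hypothetical keyword dictionaries, far too small for real use.
POSITIVE_WORDS = {"happy", "love", "great", "wonderful", "excited"}
NEGATIVE_WORDS = {"sad", "hate", "terrible", "awful", "angry"}


def classify_mood(text):
    """Label a post as 'positive', 'negative' or 'neutral' by keyword counts."""
    # Strip common punctuation so that "awful." still matches "awful".
    tokens = [token.strip(".,!?;:").lower() for token in text.split()]
    positive_hits = sum(token in POSITIVE_WORDS for token in tokens)
    negative_hits = sum(token in NEGATIVE_WORDS for token in tokens)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    # Assumption: ties and posts without any keyword count as neutral.
    return "neutral"


print(classify_mood("I love this wonderful day!"))
print(classify_mood("I hate Mondays, they are awful."))
print(classify_mood("The meeting starts at noon."))
```

A keyword count like this only catches explicit expressions. Implicit ones,
such as sarcasm or negation ("not happy"), would need more than a dictionary
lookup.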
All in all, the content of the last two classes was interesting. Although it
is simple to understand, it still needs to be backed by rigorous mathematical
methods.
Obviously, you have understood the key points of what we learned in class, and it is a good summary as well.
Hi Weibin, thanks for your praise. I just wrote down the important points after class. We can share notes with each other some time.
Anyhow, text clustering remains immature for the time being. Hoping for an update.