Research Summaries

Tools for Topic Analysis, Phase II

Fiscal Year 2010
Division Graduate School of Operational & Information Sciences
Department Computer Science
Investigator(s) Martell, Craig H.
Sponsor National Reconnaissance Office (DoD)
Summary Over the past four decades or so, the analysis of documents and newswire has progressed from infancy to a robust field with massive amounts of linguistic data and algorithms capable of performing a number of tasks important to the analyst. However, these gains have come from decades of work, a great deal of which was spent hand-labeling the data used by these algorithms with the "truth" of the matter. That is, for each of the many tasks, a gold-standard data set was created (mostly) by hand and used for building the respective algorithms. On the other hand, those who wish to analyze newer forms of communication do not have analogous resources available to them; chat, blogs, SMS, etc. are sufficiently different forms of communication that the pre-existing data (built for document or newsfeed analysis) does not work. As an example, Forsyth and Martell (2007) show that a system trained to do part-of-speech tagging on chat achieved only 57.4% accuracy. In that same work, however, it was shown that adding just a relatively small amount of data from the chat domain (10,000 posts, added to over a million words of Wall Street Journal data) allowed for algorithms that produced 87.1% accuracy. This was extended to over 90% accuracy in Forsyth's 2007 NPS master's thesis. That is, although we do not have massive amounts of data, as do the document and newsfeed analysis communities, we believe we can leverage the data created in these other communities to build systems that allow for robust analysis of these newer forms of communication. In short, we wish to build good models of these domains from data obtained from other domains (that is, from the "wrong" data).
Keywords
Publications Publications, theses (not shown) and data repositories will be added to the portal record when information is available in FAIRS and brought back to the portal