Intro to Topic Modeling

As Underwood mentions, topic modeling is something we already do as readers: “human beings are already pretty good at inferring the latent structure in (say) a single writer’s oeuvre.” However, when assessing a great mass of oeuvres, we end up needing more computing power and algorithms, and that’s where topic modeling done by computers comes in. I was surprised to read that “topics” had a very vague definition: “a ‘topic’ can be understood as a collection of words that have different probabilities of appearance in passages discussing the topic.” I had assumed “topics” were static things, like genres but a bit more specific: for example, a topic of surviving in the outdoors, under the genre of instructional manual. Question: can you always group these collections of words into these more specific topics, or is that presumptuous? For instance, Topic 1 in the Underwood reading includes the words “organize,” “lead,” “committee,” and “direct.” Instead of “Topic 1,” could we just say “Leadership and Management”? Perhaps vagueness is better, to keep things open-ended.
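To make that definition concrete for myself, here is a minimal sketch in Python of a topic as nothing more than words with probabilities attached. Only the four words come from Underwood's Topic 1; the probability values, the `top_words` helper, and the `labels` dictionary are my own inventions for illustration, not output from any real model.

```python
# A topic as a mapping from words to probabilities of appearance.
# The four words are from Underwood's "Topic 1"; the numbers are made up.
# In a real model the remaining probability is spread over the rest of
# the vocabulary, so these values do not need to sum to 1.
topic_1 = {
    "organize": 0.04,
    "lead": 0.03,
    "committee": 0.025,
    "direct": 0.02,
}


def top_words(topic, n=3):
    """Return the n most probable words in a topic, highest first."""
    return sorted(topic, key=topic.get, reverse=True)[:n]


# A readable name is an interpretation a person adds afterwards;
# the model itself only knows "Topic 1".
labels = {"Topic 1": "Leadership and Management"}

print(top_words(topic_1))
```

Seen this way, “Leadership and Management” is a label I attach after the fact, which is probably why the readings stick with the neutral “Topic 1.”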

Some good reminders from these readings:

  • Close reading is still important; these tools just provide more computing power and perhaps also a different perspective, allowing us to ask different questions
  • There are still many subjective choices in data, and with visualization it’s no different: “Topic modeling is not an exact science by any means” (Brett), and:

    “[probabilistic techniques] require you to make a series of judgment calls that deeply shape the results you get (from choosing stopwords, to the number of topics produced, to the scope of the collection). The resulting model ends up being tailored in difficult-to-explain ways by a researcher’s preferences.” (Underwood)
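One of those judgment calls, the stopword list, is easy to see in a toy example. This is only a word count in Python, not a topic model, and the sentence, the stopword list, and the `top_terms` helper are all invented for illustration.

```python
from collections import Counter

# A made-up sentence standing in for a whole collection of texts.
text = (
    "the committee will lead the work and the committee "
    "will direct the work of the committee"
)


def top_terms(text, stopwords, n=2):
    """Count the words not in the stopword list; return the n most common."""
    words = [w for w in text.split() if w not in stopwords]
    return [w for w, _ in Counter(words).most_common(n)]


# Same text, two different researcher choices about what to ignore.
no_stoplist = top_terms(text, stopwords=set())
with_stoplist = top_terms(text, stopwords={"the", "and", "will", "of"})

print(no_stoplist)
print(with_stoplist)
```

Without a stopword list, “the” comes out on top; with one, “committee” and “work” do. The text is identical in both runs, and only the researcher's choice changed.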

Has anyone designed a UI for interacting with MALLET yet, or is it all in-console? It looks like it can do a lot with natural language processing, but you really have to know how to talk to it.


01 March 2017