Our goal is to use statistical topic modeling to find topics or themes that occur in the corpus, for three purposes. First, it provides metadata on each document, since each document will be tagged with topics. Second, documents can be clustered based on the topics they cover, and this information can be summarized in a visualization, which is valuable in a mostly qualitative analysis. Finally, this high-level analysis shows which topics tend to occur together in documents and can surface latent topics that are more pervasive in the corpus than expected.
Topic modeling is a type of statistical modeling in natural language processing that aims to find topics in a corpus, group topics together by looking for similarity and co-occurrence, and categorize documents in the corpus based on the topic probabilities assigned to them. We are specifically using a statistical method called latent Dirichlet allocation (LDA). LDA builds topics from the words and documents in the corpus and assumes that each document covers one or more of these topics.
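For background, the standard LDA generative story (general to LDA, not specific to this analysis) assumes each document is produced as follows, where theta_d is a document's topic mixture, phi_k is a topic's distribution over words, and alpha and beta are Dirichlet hyperparameters:

```latex
\begin{align*}
&\text{for each topic } k = 1,\dots,K: && \phi_k \sim \mathrm{Dirichlet}(\beta) \\
&\text{for each document } d: && \theta_d \sim \mathrm{Dirichlet}(\alpha) \\
&\quad \text{for each word position } n \text{ in } d: && z_{d,n} \sim \mathrm{Multinomial}(\theta_d) \\
&  && w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})
\end{align*}
```

Fitting the model inverts this story: given only the observed words, it infers the topic mixtures and word distributions most likely to have generated the corpus.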
Before running the code below, a script converted the PDFs into text files; this preprocessing code can be examined here. Research papers stored as images were not included in this analysis. The list of papers included in the model can be found in the repo.
LDA assumes that each document in the corpus is a mixture of topics, where each topic is a structure built from words and documents. LDA requires us to define the number of topics we expect to see across the corpus, so we ran the model multiple times, evaluating the results for 3 to 20 topics, and picked the topic count that best captures the content of this corpus.
After some basic preprocessing like removing numbers, removing common English words (like “and”), and stemming/lemmatizing words (truncating words to their roots, such as ‘responding’ to ‘respond’), we built a document-term matrix that lists every word in the corpus and counts its occurrences in each document.
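The analysis itself used R, but as an illustration, here is a minimal stdlib-only Python sketch of the same preprocessing and document-term-matrix construction. The stop-word list and suffix-stripping rule below are toy stand-ins for a full English stop-word list and a real stemmer (e.g. Porter):

```python
import re
from collections import Counter

STOPWORDS = {"and", "the", "of", "a", "to", "in"}  # toy stand-in for a full stop-word list

def stem(word):
    # crude suffix stripping, a stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())  # keeps letters only, dropping numbers
    return [stem(t) for t in tokens if t not in STOPWORDS]

def document_term_matrix(docs):
    # one Counter per document, then a shared sorted vocabulary across the corpus
    counts = {name: Counter(preprocess(text)) for name, text in docs.items()}
    terms = sorted({t for c in counts.values() for t in c})
    return terms, {name: [c[t] for t in terms] for name, c in counts.items()}

docs = {
    "doc1": "Responding to the survey",
    "doc2": "Survey responses and surveys",
}
terms, dtm = document_term_matrix(docs)
# terms -> ['respond', 'response', 'survey']
# dtm   -> {'doc1': [1, 0, 1], 'doc2': [0, 1, 2]}
```

Each row of the resulting matrix is a document, each column a (stemmed) term, and each cell a raw count, i.e. term-frequency weighting.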
Here are the first few rows and columns of the document-term matrix, to give a sense of what data we are using to do our LDA:
```
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 4/21
## Sparsity           : 84%
## Maximal term length: 7
## Weighting          : term frequency (tf)
##
##                                                    Terms
## Docs                                               aaron abbrevi abil
##   A Needs Assessment Study of LBT Women.txt            0       0    1
##   berg and lien 2006.txt                               0       0    4
##   Cahill et al. Do ask.journal.pone.0107104 (1).txt    0       0    1
##   Clark et al, 2005.txt                                0       0    0
##   Coffman_Underreporting_NBERreport_2013.txt           0       0    0
##                                                    Terms
## Docs                                               abnorm about
##   A Needs Assessment Study of LBT Women.txt             0     0
##   berg and lien 2006.txt                                0     0
##   Cahill et al. Do ask.journal.pone.0107104 (1).txt     0     0
##   Clark et al, 2005.txt                                 0     1
##   Coffman_Underreporting_NBERreport_2013.txt            0     0
```
And the twenty most frequent words:
```
##    sexual    gender       sex    survey    health  question     ident
##      3300      2822      2800      2350      2172      2070      2003
## transgend   respond    report    orient      data    measur  research
##      1743      1553      1491      1399      1325      1298      1117
##     studi   respons      male     women  identifi    differ
##      1082      1055      1019      1011       986       942
```
After running the latent Dirichlet allocation on our data, our output includes:
- A mapping of documents to six topics
- A definition of each topic (i.e. what words make up that topic)
- The probabilities with which each topic is assigned to a document (so if a document covers two topics, we should see a high probability for both topics in that document)
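As a schematic illustration of how the document-topic output is read (the probabilities below are made up, not the model's actual output), the primary topic for each document is simply the topic with the highest probability:

```python
# Toy document-topic probabilities (each row sums to 1);
# the real values come from the fitted LDA model.
doc_topics = {
    "paper_a.txt": [0.49, 0.16, 0.12, 0.10, 0.08, 0.05],
    "paper_b.txt": [0.22, 0.33, 0.15, 0.12, 0.10, 0.08],
}

def primary_topic(probs):
    # topics are 1-indexed to match "Topic 1" .. "Topic 6" in the write-up
    return max(range(len(probs)), key=lambda k: probs[k]) + 1

assignments = {doc: primary_topic(p) for doc, p in doc_topics.items()}
# assignments -> {'paper_a.txt': 1, 'paper_b.txt': 2}
```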
First, let’s take a look at the terms that were assigned to each topic to get a better sense of what each topic is capturing. We will also name the topics so we can refer to them later.
Terms are assigned to a topic with probabilities, so every term in the corpus is given a probability per topic. So, as an example, these are the top 10 terms for the first topic along with their probabilities:
```
##              Term   Topic Probability
## 3790          sex Topic 1  0.05826209
## 917         coupl Topic 1  0.03889637
## 3518       report Topic 1  0.02841547
## 3489 relationship Topic 1  0.02719181
## 2524        marri Topic 1  0.01867941
## 1930    household Topic 1  0.01857300
## 579        census Topic 1  0.01437000
## 2975  oppositesex Topic 1  0.01288033
## 2540        match Topic 1  0.01266752
## 3112      percent Topic 1  0.01229510
```
At some point the term probabilities drop off rapidly, so the lower-probability terms contribute little to a topic's meaning. However, we can use the top terms to get a sense of what each topic covers.
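Pulling the top terms out of a topic's term-probability table is a simple sort. A sketch, using a handful of the Topic 1 probabilities shown above as toy input (the real table covers the whole vocabulary):

```python
# A few (term, probability) pairs from Topic 1; the fitted model assigns
# a probability to every term in the corpus for every topic.
topic1 = {
    "sex": 0.0583,
    "coupl": 0.0389,
    "report": 0.0284,
    "relationship": 0.0272,
    "percent": 0.0123,
}

def top_terms(term_probs, n=3):
    # terms sorted by descending probability, truncated to the n most likely
    return sorted(term_probs, key=term_probs.get, reverse=True)[:n]

top_terms(topic1)  # -> ['sex', 'coupl', 'report']
```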
To summarize, our topics are:
```
##     Topic                        Description
## 1 Topic 1 Couples, Relationships, Households
## 2 Topic 2                Gender, Transgender
## 3 Topic 3                             Gender
## 4 Topic 4        Health, Sexual, Orientation
## 5 Topic 5                 Sexual Orientation
## 6 Topic 6         Question, Survey, Response
```
Similar to how terms are assigned to topics with a probability, every topic is assigned to documents with a probability as well. The topic assigned the highest probability for a document is the primary topic. However, it is possible for a document to cover multiple topics. A good way to assess how well a single topic describes a document is to look at the ratio of the highest topic probability to the second-highest. For example, if a research paper has a primary-topic probability of 0.49 and a secondary-topic probability of 0.16, the ratio of 3.06 indicates the first topic is about three times as likely. However, if the first two topics have probabilities of 0.33 and 0.22, respectively, then the ratio is 1.5, indicating both topics might apply to this document.
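The two worked examples above can be checked directly. A small sketch of the ratio calculation (the probability vectors are illustrative):

```python
def topic_ratio(probs):
    """Ratio of the highest to the second-highest topic probability for a document."""
    top, second = sorted(probs, reverse=True)[:2]
    return top / second

# The two cases from the text:
focused = topic_ratio([0.49, 0.16, 0.12, 0.10, 0.08, 0.05])  # 0.49 / 0.16 ~ 3.06
mixed   = topic_ratio([0.33, 0.22, 0.15, 0.12, 0.10, 0.08])  # 0.33 / 0.22 = 1.5
```

A large ratio means one topic dominates the document; a ratio near 1 means the top topics are nearly interchangeable descriptions of it.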
Primary Topic Frequency
Here is a look at the primary topics by frequency, i.e. number of documents per primary topic assignment.
Relationship Between Topics and Papers
We ran this same analysis after removing words that represent the dominant topics across the corpus. This makes it easier to see any latent topic signal rather than the topics we expect to see (like survey). The words removed include: sexual, gender, sex, survey, health, question, ident, transgend, respond, report, orient, measur, research, studi, respons, male, women, femal, men.
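One way to express that filtering step (the word list is copied from the text; the function name is ours, and in the actual analysis the removal happens before rebuilding the document-term matrix):

```python
# Stemmed forms of the dominant-topic words removed before the second model run
DOMINANT = {
    "sexual", "gender", "sex", "survey", "health", "question", "ident",
    "transgend", "respond", "report", "orient", "measur", "research",
    "studi", "respons", "male", "women", "femal", "men",
}

def strip_dominant(tokens):
    # drop the expected, corpus-wide words so latent topics can surface
    return [t for t in tokens if t not in DOMINANT]

strip_dominant(["gender", "household", "survey", "census"])  # -> ['household', 'census']
```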
To see the full analysis, check out the sogi repo.
Here are the slides and Python code from the hands-on workshop at Open Data Day DC 2016 where we did k-means clustering after scraping text from NIST's newsfeed:
Slides and code for the Data Analysis with R class I taught at the Data Academy at the US Department of Commerce.
It focuses on using R to clean, wrangle, summarize/aggregate/visualize data, and select features and models.