Introduction

When discussing political issues, we consider each argument as a sum of ideologies and assumptions captured in carefully chosen words. These latent qualities exist as broad frames or ideological perspectives, communicated through a particular lexical register or the repetition of certain phrases intended to influence an audience. Within parliament, political parties align along these frames, debating particular topics according to their ideologies. By classifying text according to frames, we can uncover evidence of these implicit political stances. In our research, we chose immigration as the main topic and compared the effectiveness of several text classification methods in uncovering these frames, varying a number of parameters to sharpen our results. The chief parameter under investigation was the role played by different syntactic units in forming a topic model of frames.

Topic models are a computational linguistics approach that allows us to identify these frames. A topic model treats each document as a mixture of latent topics and tries to recover those topics by identifying covariant terms within the corpus. Once trained, such a model can assign unseen documents to topics based on their word content, and can identify the party of unseen documents based on word emphasis and choice. We explored three approaches in depth: Structural Topic Models (STM) [1], Latent Dirichlet Allocation (LDA) [2] and Multinomial Naïve Bayes classification (MNB) [3]. Taking these in reverse order: Multinomial Naïve Bayes allowed for a basic classification of the texts by tf-idf; Latent Dirichlet Allocation grouped topics by covariant terms exclusively; and Structural Topic Models extended LDA by incorporating party and date metadata during classification.
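
As a concrete illustration of the document-as-mixture idea, the following minimal sketch fits a two-topic LDA model using scikit-learn [3]; our LDA experiments themselves used GibbsLDA++ [2], and the toy speeches and parameter values here are illustrative only.

    # Minimal LDA sketch (illustrative; the study's LDA used GibbsLDA++).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    speeches = [
        "immigration policy border quota settler",
        "railway settler prairie homestead land",
        "refugee asylum deportation hearing appeal",
    ]

    # Bag-of-words representation: one row per speech, one column per term.
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(speeches)

    # Fit a two-topic model; each speech becomes a mixture over the topics.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(counts)

    # Recover the top terms per topic from covariant (co-occurring) words.
    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        print(f"topic {k}:", [terms[i] for i in weights.argsort()[::-1][:3]])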

The corpus used was the Canadian House of Commons debates since the 9$^{th}$ Canadian Parliament (1901-1904), with particular emphasis on the 36$^{th}$ and 39$^{th}$ Canadian Parliaments, for which our metadata – debate date, speaker party and recorded debate topic – was most complete. The corpus was stored in XML, which allowed metadata to be integrated in-line with the debate text. Some modifications were also made to the scope of the data considered. As the Canadian parliament has historically been contested between two parties – the Liberal Party of Canada and the conservative party of the day – the data was simplified to correct for the Liberal Party's long continuous presence, and hence its much larger number of speeches than any single other party. All conservative parties were relabelled as “conservative” (see Appendix A for a list) to achieve parity in speech counts with the Liberals. As parties beyond these two groups have traditionally had little success in the Canadian parliament, we did not consider them during classification.
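
The relabelling step can be sketched as follows; the XML tag and attribute names are hypothetical, as the corpus schema is not reproduced here, and the party set shown is only a sample of the Appendix A list.

    # Hypothetical sketch of the party-relabelling step; element/attribute
    # and file names are assumptions, not the corpus's actual schema.
    import xml.etree.ElementTree as ET

    # Sample of the Appendix A mapping; the full list is longer.
    CONSERVATIVE_LABELS = {
        "Conservative Party of Canada",
        "Progressive Conservative Party",
    }

    tree = ET.parse("debates.xml")          # hypothetical file name
    for speech in tree.iter("speech"):      # hypothetical element name
        if speech.get("party") in CONSERVATIVE_LABELS:
            speech.set("party", "conservative")
    tree.write("debates_relabelled.xml")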

Approach

The general approach across classification methods was as follows: extract a filtered subset of the Canadian parliamentary debates from the XML corpus; preprocess this subset to exclude all non-immigration debates and any irrelevant data (explained hereafter); and perform classification on the remaining data. The retained metadata allowed us to examine how political parties differed in their language use. The preprocessing phase began with the removal of all speeches which did not fall under the topic of immigration, refugees or asylum seekers. This was followed by the removal of stop words and common parliamentary words and abbreviations – such as “hon”, “member” and “party” – the removal of numbers and punctuation, stemming, and the removal of very rare or very frequent words. In all cases, the extracted data was represented as a “bag of words”, with word frequencies calculated over the remaining subset.
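
The sketch below outlines these steps, assuming NLTK for stop words and stemming; the two sample speeches and the parliamentary stop-word set are illustrative stand-ins for the filtered corpus and the full list actually used.

    # Sketch of the preprocessing pipeline; requires nltk.download("stopwords").
    import re
    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    PARLIAMENTARY = {"hon", "member", "party"}  # sample of the domain stop words
    STOP = set(stopwords.words("english")) | PARLIAMENTARY
    stemmer = PorterStemmer()

    def preprocess(speech):
        # Lowercase, strip numbers and punctuation, tokenise on whitespace.
        tokens = re.sub(r"[^a-z\s]", " ", speech.lower()).split()
        # Drop stop words, then stem what remains.
        return [stemmer.stem(t) for t in tokens if t not in STOP]

    # Illustrative stand-ins for the filtered immigration speeches.
    immigration_speeches = [
        "The hon. member asked about the deportation of immigrants.",
        "Employment of new immigrants remains a question for this House.",
    ]

    # Bag-of-words frequencies over the retained subset; very rare and very
    # frequent words would then be cut at the thresholds given below.
    bag = Counter()
    for speech in immigration_speeches:
        bag.update(preprocess(speech))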

In terms of what was particular to each classification, STM was performed on all available parliaments, and hence words which occurred in 57% or more of the documents were removed, as were words which occurred fewer than 50 times – 20 when considering only nouns. These thresholds removed most extraneous content, although some general parliamentary language remained. LDA focussed on the 36$^{th}$ and 39$^{th}$ parliaments, removing the 100 most common words and any words which occurred five times or fewer. Lastly, MNB examined the 33$^{rd}$, 36$^{th}$ and 39$^{th}$ parliaments individually, excluding all words occurring fewer than five times or appearing in more than 80% of the documents.
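
For the MNB case, these cut-offs map naturally onto scikit-learn's vectoriser parameters [3]. The sketch below assumes a hypothetical load_speeches() helper returning the preprocessed speech strings and their party labels; note that min_df counts documents rather than raw occurrences, so it is only a close analogue of the fewer-than-five-occurrences cut.

    # Sketch of the MNB configuration with scikit-learn [3].
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Hypothetical helper: preprocessed speeches with "Liberal"/"conservative" labels.
    speeches, parties = load_speeches(parliament=36)

    # max_df=0.8 drops words in more than 80% of documents; min_df=5 is a
    # close analogue of the fewer-than-five-occurrences threshold.
    X = TfidfVectorizer(min_df=5, max_df=0.8).fit_transform(speeches)

    X_train, X_test, y_train, y_test = train_test_split(X, parties, test_size=0.2)
    model = MultinomialNB().fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))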

Syntactic parsing was performed using two different methods: initially the NLTK interface to the Stanford POS tagger, and later Google's SyntaxNet with its English parser, Parsey McParseface. For the NLTK route, the XML data was extracted in Python and the text tagged with the Stanford tagger before the desired words were written out to a CSV file for classification. For SyntaxNet, the XML data was extracted in Python, piped to the SyntaxNet process, and the output written back into the XML. Using this system, part-of-speech and syntactic role information was determined for each word in the proceedings.
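
The noun-extraction step of the NLTK route can be sketched as follows; NLTK's default perceptron tagger stands in here for the Stanford tagger, which requires a separate Java installation.

    # Sketch of noun extraction; NLTK's default tagger stands in for the
    # Stanford POS tagger used in the study. Requires nltk.download("punkt")
    # and nltk.download("averaged_perceptron_tagger").
    import csv
    import nltk

    speech = "The member asked about the deportation of immigrants."
    tagged = nltk.pos_tag(nltk.word_tokenize(speech))

    # Keep only Penn Treebank noun tags: NN, NNS, NNP, NNPS.
    nouns = [word for word, tag in tagged if tag.startswith("NN")]

    # Write the retained words to CSV for classification, as in the NLTK route.
    with open("nouns.csv", "w", newline="") as f:
        csv.writer(f).writerow(nouns)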

Analysis

The LDA and MNB results demonstrated a relatively high rate of success in distinguishing the political parties by language. Using a party-as-document approach, LDA generated topics in line with those from STM. The MNB classifier, trained on an approximately equal number of Liberal and conservative speeches, achieved an average classification accuracy of 80% across all tests and parliaments.
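
The party-as-document approach amounts to concatenating each party's speeches into one document before modelling, as in this brief sketch (the records and field names are illustrative assumptions):

    # Party-as-document: merge each party's speeches into a single document.
    from collections import defaultdict

    speeches = [  # illustrative records with assumed field names
        {"party": "Liberal", "text": "immigration to the prairies"},
        {"party": "conservative", "text": "border controls and quotas"},
    ]

    party_texts = defaultdict(list)
    for speech in speeches:
        party_texts[speech["party"]].append(speech["text"])

    party_documents = {p: " ".join(texts) for p, texts in party_texts.items()}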

The STM results obtained for the all-parts-of-speech model and the nouns-only model emphasised the difficulties inherent in measuring the role of syntactic units in frame identification. While similar topics occurred across both models when dealing with regional immigration issues such as Japanese-Canadian internment or British Commonwealth immigration, in many cases removing non-nouns did not meaningfully alter the results. This is likely because nouns and their verbal counterparts share common roots – vote as both noun and verb, deportation versus deport, employment versus employ – which stemming typically collapsed into a single token.
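
This collapsing effect is easy to reproduce with NLTK's Porter stemmer:

    # Noun and verb forms often reduce to the same stem, so removing
    # non-nouns leaves the stemmed vocabulary largely unchanged.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["deportation", "deport", "employment", "employ"]:
        print(word, "->", stemmer.stem(word))
    # deportation and deport both yield "deport";
    # employment and employ both yield "employ".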

Conclusion

Our analysis demonstrated the effectiveness of classification in identifying party membership across two categories, a useful result for identifying frames: classification performed between 20 and 25% better than random choice in all cases. Nonetheless, no evidence was found that a nouns-only corpus produces meaningfully different topic results when compared to an all-parts-of-speech corpus.

Many potential developments follow from this study. Proper elimination of noise remains an area for improvement, and more tests are needed to better understand the role of particular parts-of-speech in determining topic. The LDA and MNB methods can also be explored further: additional preprocessing could divide the political documents into narrower groups – such as government orders or oral questions – to better investigate differences in topic formation. Models could also be compared against one another to check for possible overfitting. These future directions emphasise the continued value of exploring text-as-data and the wide variety of available text classification methods.

References

  1. Lucas, C., Nielsen, R. A., Roberts, M. E., Stewart, B. M., Storer, A., and Tingley, D. “Computer-Assisted Text Analysis for Comparative Politics.” Political Analysis 23.2 (2015): 254-277.

  2. Phan, X.-H., and Nguyen, C.-T. “GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation (LDA) Using Gibbs Sampling for Parameter Estimation and Inference.” SourceForge, 2007. Web.

  3. Pedregosa, F., et al. “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research 12 (2011): 2825-2830.

Appendix A

Parties labelled as “conservative”: