What is tm.plugin.koRpus?

While the koRpus package focusses mostly on the analysis of individual texts, tm.plugin.koRpus adds several new object classes and respective methods which can be used to analyse complete text corpora in a single step. These classes are also a first step towards combining object classes of both the koRpus and tm packages.

There are three basic classes, which are hierarchically nested:

  1. class kRp.topicCorpus holds a list (named by topics) of objects of

  2. class kRp.sourcesCorpus, which in its sources slot holds a list of objects of

  3. class kRp.corpus, which in turn contains objects of both koRpus and tm classes.

The idea behind this is to be able to categorize corpora on at least two levels. The default assumes that these levels represent different sources and different topics, but apart from this naming (which is coded into the classes), you can actually use them for whatever levels you like.

If you don’t need these hierarchical levels, you can just use the function simpleCorpus() to create objects of class kRp.corpus, which simply represent a collection of texts. To distinguish texts that came from different sources, use the function sourcesCorpus(), which will generate a sub-corpus for each given source. One level higher up, use the function topicCorpus() to sort kRp.sourcesCorpus objects by different topics. Objects of this class are only valid if there are texts on each topic from each source.
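For the simplest case, a flat corpus can be built from a single directory of text files. This is a minimal sketch: the path is a placeholder, and the exact argument names of simpleCorpus() may differ (see ?simpleCorpus):

# tokenize all texts found in one directory into a kRp.corpus object;
# "path/to/texts" is a placeholder for a directory of plain text files
flatTexts <- simpleCorpus("path/to/texts", tagger="tokenize", lang="de")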

Tokenizing corpora

As with koRpus, the first step for text analysis is tokenizing and possibly POS tagging. This step is performed by the functions mentioned above: simpleCorpus(), sourcesCorpus(), or topicCorpus(). The package includes four sample texts taken from Wikipedia in its tests directory, which we can use for a demonstration:

library(tm.plugin.koRpus)
library(koRpus.lang.de)
# set the root path to the sample files
sampleRoot <- file.path(path.package("tm.plugin.koRpus"), "tests", "testthat", "samples")
# now we can define the topics (names of the vector elements)
# and their main path; i.e., these are subdirectories below sampleRoot
samplePaths <- c(
  C3S=file.path(sampleRoot, "C3S"),
  GEMA=file.path(sampleRoot, "GEMA")
)
# we also define the sources; again, these are subdirectories, this time
# below the topic directories, and they contain all texts to analyze
sampleSources <- c(
  wpa="Wikipedia_alt",
  wpn="Wikipedia_neu"
)
# and finally, we can tokenize all texts
sampleTexts <- topicCorpus(paths=samplePaths, sources=sampleSources, tagger="tokenize", lang="de")
processing topic "C3S", source "Wikipedia_alt", 1 text...
processing topic "C3S", source "Wikipedia_neu", 1 text...
processing topic "GEMA", source "Wikipedia_alt", 1 text...
processing topic "GEMA", source "Wikipedia_neu", 1 text...

Should you need to get hold of the nested objects inside kRp.sourcesCorpus or kRp.topicCorpus class objects, or replace them with updated ones, you can do so by using the methods corpusTagged(), corpusSources(), or corpusTopics():

allC3SSources <- corpusSources(corpusTopics(sampleTexts, "C3S"))
names(allC3SSources)
[1] "wpa" "wpn"

Since we’re using the koRpus package for all actual analysis, you can also set up your environment with set.kRp.env() and POS-tag all texts with TreeTagger.
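A minimal sketch, assuming the tagger argument defaults to "kRp.env" (i.e., the settings made via set.kRp.env()); the TreeTagger path below is a placeholder for your local installation:

# point koRpus to a local TreeTagger installation (placeholder path)
set.kRp.env(TT.cmd="~/bin/treetagger/cmd/tree-tagger-german", lang="de")
# without tagger="tokenize", the texts are now POS-tagged with TreeTagger
sampleTexts <- topicCorpus(paths=samplePaths, sources=sampleSources, lang="de")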

Analysing corpora

After the initial tokenizing, we can analyse the corpus by calling the provided analysis methods, for instance lex.div() for lexical diversity:

sampleTexts <- lex.div(sampleTexts)
corpusSummary(sampleTexts)
             doc_id topic source stopwords    a    C CTTR   HDD     K lgV0
wpaC3S01   wpaC3S01   C3S    wpa        NA 0.16 0.95 6.13 38.14 49.92 6.21
wpnC3S01   wpnC3S01   C3S    wpn        NA 0.17 0.94 6.82 38.05 54.88 6.10
wpaGEMA01 wpaGEMA01  GEMA    wpa        NA 0.17 0.94 7.07 37.61 65.08 6.11
wpnGEMA01 wpnGEMA01  GEMA    wpn        NA 0.16 0.94 7.13 37.87 60.14 6.24
          MATTR MSTTR   MTLD MTLDMA     R    S  TTR     U
wpaC3S01   0.81  0.79 100.16     NA  8.68 0.93 0.78 39.92
wpnC3S01   0.82  0.76 123.01     NA  9.65 0.92 0.73 36.46
wpaGEMA01  0.80  0.78 106.94    192 10.00 0.92 0.71 35.96
wpnGEMA01  0.81  0.79 111.64     NA 10.08 0.92 0.73 37.47

As you can see, corpusSummary() returns a data.frame object with the summarised results of all texts below the given object level. That is, if you are only interested in the results for texts of the first topic, simply apply corpusSummary() to the result of corpusTopics():

corpusSummary(corpusTopics(sampleTexts, "C3S"))
           doc_id topic source stopwords    a    C CTTR   HDD     K lgV0 MATTR
wpaC3S01 wpaC3S01   C3S    wpa        NA 0.16 0.95 6.13 38.14 49.92 6.21  0.81
wpnC3S01 wpnC3S01   C3S    wpn        NA 0.17 0.94 6.82 38.05 54.88 6.10  0.82
         MSTTR   MTLD MTLDMA    R    S  TTR     U
wpaC3S01  0.79 100.16     NA 8.68 0.93 0.78 39.92
wpnC3S01  0.76 123.01     NA 9.65 0.92 0.73 36.46

There are quite a number of corpus*() getter/setter methods for slots of these objects, e.g., corpusReadability() to get the readability() results from objects of class kRp.corpus.
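As a hedged sketch, assuming readability() has a corpus method analogous to lex.div() (the hyphenation patterns some measures need ship with koRpus.lang.de):

# run readability analyses on all texts of the corpus
sampleTexts <- readability(sampleTexts)
# fetch the readability results of the "wpa" source for topic "C3S"
corpusReadability(corpusSources(corpusTopics(sampleTexts, "C3S"))[["wpa"]])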

As we explained earlier, the nested S4 object classes used by tm.plugin.koRpus are rather complex. Two methods can be especially helpful for further analysis. The first one, tif_as_tokens_df(), returns a data.frame including all texts of the tokenized corpus in a format that is compatible with the Text Interchange Formats standards.
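For instance, to flatten the whole corpus into one large data.frame with one row per token (a minimal sketch; see ?tif_as_tokens_df for the exact columns returned):

# turn the tokenized corpus into a single TIF-compatible data.frame
sampleTokens <- tif_as_tokens_df(sampleTexts)
head(sampleTokens)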

The second one is a family of [, [<-, [[ and [[<- shortcuts to directly interact with the data.frame object you get via corpusSummary(). Here’s an example of how to use this to plot interactions:

library(sciplot)
# plot mean MTLD scores by source, with one line per topic
lineplot.CI(
  x.factor=sampleTexts[["source"]],  # shortcut for corpusSummary(sampleTexts)[["source"]]
  response=sampleTexts[["MTLD"]],
  group=sampleTexts[["topic"]],
  type="l",
  main="MTLD",
  xlab="Media source",
  ylab="Lexical diversity score",
  col=c("grey", "black"),
  lwd=2
)