Using the koRpus Package for Text Analysis

m.eik michalke

2018-07-29

Abstract

The R package koRpus aims to be a versatile tool for text analysis, with an emphasis on scientific research on that topic. It implements dozens of formulae to measure readability and lexical diversity. On a more basic level, koRpus can be used as an R wrapper for third-party products, like the tokenizer and POS tagger TreeTagger or language corpora of the Leipzig Corpora Collection. This vignette takes a brief tour around its core components, shows how they can be used, and gives some insight into design decisions.

What is koRpus?

Work on koRpus started in February 2011, primarily with the goal of examining how similar different texts are. Since then, it quickly grew into an R package which implements dozens of formulae for readability and lexical diversity, and wrappers for language corpus databases and a tokenizer/POS tagger.

Recommendations

TreeTagger

At the very beginning of almost every analysis with this package, the text you want to examine has to be sliced into its components, and the components must be identified and named. That is, it has to be split into its semantic parts (tokens): words, numbers, and punctuation marks. After that, each token is tagged with its part of speech (POS). For both of these steps, koRpus can use the third-party software TreeTagger (Schmid, 1994).

Especially for Windows users, installing TreeTagger might be a little more complex – e.g., it depends on Perl1, and you need a tool to extract .tar.gz archives.2 Detailed installation instructions are beyond the scope of this vignette.

If you don’t want to use TreeTagger, koRpus provides a simple tokenizer of its own called tokenize(). While the tokenizing itself works quite well, tokenize() is not as elaborate as TreeTagger when it comes to POS tagging, as it can merely tell words from numbers, punctuation and abbreviations. Although this is sufficient for most readability formulae, you can’t evaluate word classes in detail. If that’s what you want, a TreeTagger installation is needed.
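
To make this coarse distinction concrete, the following base R sketch splits a sentence into tokens and labels each one as a word, a number or a punctuation mark. It is only an illustration under simplifying assumptions (one regular expression, no abbreviation handling) and is not how tokenize() or TreeTagger are implemented; the sample sentence and the names txt, tokens and coarse.class are made up for this example.

```r
# a made-up sample sentence
txt <- "Defense mechanisms: 20 species exhibit them."

# split into runs of letters, runs of digits, or single punctuation marks
tokens <- regmatches(
  txt,
  gregexpr("[[:alpha:]]+|[[:digit:]]+|[[:punct:]]", txt)
)[[1]]

# assign one of three coarse classes to each token
coarse.class <- ifelse(
  grepl("^[[:alpha:]]+$", tokens), "word",
  ifelse(grepl("^[[:digit:]]+$", tokens), "number", "punctuation")
)

# one row per token, similar in spirit to the tagged text tables shown later
data.frame(token=tokens, class=coarse.class)
```

A table like this is enough to count words and sentences, but it cannot tell a noun from a verb, which is where a full POS tagger comes in.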

Word lists

Some of the readability formulae depend on special word lists (like Bormuth, 1968; Dale & Chall, 1948; Spache, 1953). For copyright reasons these lists are not included as of now. This means that, as long as you don’t have copies of these lists, you can’t calculate these particular measures; all other measures remain available. To be used with this package, a list is expected to be a simple text file with one word per line, preferably in UTF-8 encoding.
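
As a quick way to check that a file follows this format, the sketch below writes a tiny word list to a temporary file and reads it back with base R. The three words and the names wl.file and word.list are placeholders invented for this example; they are not taken from any of the lists cited above.

```r
# write a tiny, made-up word list to a temporary file, one word per line
wl.file <- tempfile(fileext=".txt")
writeLines(c("apple", "house", "water"), con=wl.file)

# read it back, declaring the expected encoding
word.list <- readLines(wl.file, encoding="UTF-8")

# basic sanity checks: no empty lines, no duplicated entries
stopifnot(all(nzchar(word.list)), !anyDuplicated(word.list))

word.list
```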

Language corpora

The frequency analysis functions in this package can look up how often each word in a text is used in its language, given that a corpus database is provided. Databases in Celex format are supported, as is the Leipzig Corpora Collection (Quasthoff, Richter, & Biemann, 2006) file format. To use such a database with this package, you simply need to download one of the .zip/.tar files.

Translated Human Rights Declaration

If you want to estimate the language of a text, reference texts in known languages are needed. In koRpus, the Universal Declaration of Human Rights with its more than 350 translations is used.

A sample session

From now on it is assumed that the above requirements are correctly installed and working. If an optional component is used it will be noted. Further, we’ll need a sample text to analyze. We’ll use the section on defense mechanisms of Phasmatodea from Wikipedia for this purpose.

Loading a language package

In order to do some analysis, you need to load a language support package for each language you would like to work with. For instance, in this vignette we’re analyzing an English sample text. Language support packages for koRpus are named koRpus.lang.**, where ** is a two-character ID for the respective language, like en for English.3

# install the language support package
install.koRpus.lang("en")
# load the package
library(koRpus.lang.en)

When koRpus itself is loaded, it will list all language packages found on your system. To get a list of all installable packages, call available.koRpus.lang().

Tokenizing and POS tagging

As explained earlier, splitting the text up into its basic components can be done by TreeTagger. To achieve this and have the results available in R, the function treetag() is used.

treetag()

At the very least you must provide it with the text, of course, and name the language it is written in. In addition to that you must specify where you installed TreeTagger. If you look at the package documentation you’ll see that treetag() understands a number of options to configure TreeTagger, but in most cases using one of the built-in presets should suffice. TreeTagger comes with batch/shell scripts for installed languages, and the presets of treetag() are basically just R implementations of these scripts.

tagged.text <- treetag(
  "sample_text.txt",
  treetagger="manual",
  lang="en",
  TT.options=list(
    path="~/bin/treetagger/",
    preset="en"
  ),
  doc_id="sample"
)

The first argument (file name) and lang should explain themselves. The treetagger option can either take the full path to one of the original TreeTagger scripts mentioned above, or the keyword “manual”, which tells treetag() to use the settings defined in TT.options. To use a preset, just put the path to your local TreeTagger installation and a valid preset name here.4 The document ID is optional and can be omitted.

The resulting S4 object is of a class called kRp.tagged. If you call the object directly you get a shortened view of its main content:

tagged.text
##     doc_id       token  tag     lemma lttr      wclass desc stop stem idx sntc
## 1   sample     Defense   NN   defense    7        noun <NA> <NA> <NA>   1    1
## 2   sample  mechanisms  NNS mechanism   10        noun <NA> <NA> <NA>   2    1
## 3   sample Phasmatodea   NP <unknown>   11        name <NA> <NA> <NA>   3    1
## 4   sample     species   NN   species    7        noun <NA> <NA> <NA>   4    1
## 5   sample     exhibit   NN   exhibit    7        noun <NA> <NA> <NA>   5    1
## 6   sample  mechanisms  NNS mechanism   10        noun <NA> <NA> <NA>   6    1
##                                                  [...]                        
## 612 sample  considered  VBN  consider   10        verb <NA> <NA> <NA> 612   18
## 613 sample    inedible   JJ  inedible    8   adjective <NA> <NA> <NA> 613   18
## 614 sample          by   IN        by    2 preposition <NA> <NA> <NA> 614   18
## 615 sample        some   DT      some    4  determiner <NA> <NA> <NA> 615   18
## 616 sample   predators  NNS  predator    9        noun <NA> <NA> <NA> 616   18
## 617 sample           . SENT         .    1    fullstop <NA> <NA> <NA> 617   18

Once you’ve come this far, i.e., having a valid object of class kRp.tagged, all following analyses should run smoothly.

Troubleshooting

If treetag() should fail, you should first re-run it with the extra option debug=TRUE. Most interestingly, that will print the contents of sys.tt.call, which is the TreeTagger command given to your operating system for execution. With that it should be possible to examine where exactly the erroneous behavior starts.

Alternative: tokenize()

If you don’t need detailed word class analysis, you should be fine using koRpus’ own function tokenize(). As you can see, tokenize() comes to the same results regarding the tokens, but is rather limited in recognizing word classes:

(tokenized.text <- tokenize(
    "sample_text.txt",
    lang="en",
    doc_id="sample"
))
##     doc_id       token      tag lemma lttr   wclass desc stop stem idx sntc
## 1   sample     Defense word.kRp          7     word <NA> <NA> <NA>   1    1
## 2   sample  mechanisms word.kRp         10     word <NA> <NA> <NA>   2    1
## 3   sample Phasmatodea word.kRp         11     word <NA> <NA> <NA>   3    1
## 4   sample     species word.kRp          7     word <NA> <NA> <NA>   4    1
## 5   sample     exhibit word.kRp          7     word <NA> <NA> <NA>   5    1
## 6   sample  mechanisms word.kRp         10     word <NA> <NA> <NA>   6    1
##                                               [...]                        
## 620 sample  considered word.kRp         10     word <NA> <NA> <NA> 620   20
## 621 sample    inedible word.kRp          8     word <NA> <NA> <NA> 621   20
## 622 sample          by word.kRp          2     word <NA> <NA> <NA> 622   20
## 623 sample        some word.kRp          4     word <NA> <NA> <NA> 623   20
## 624 sample   predators word.kRp          9     word <NA> <NA> <NA> 624   20
## 625 sample           .     .kRp          1 fullstop <NA> <NA> <NA> 625   20

Accessing data from koRpus objects

For this class of objects, koRpus provides some convenient methods to extract the portions you’re interested in. For example, the main results are to be found in the slot TT.res. In addition to TreeTagger’s original output (token, tag and lemma), treetag() also automatically counts letters and assigns tokens to global word classes. To get these results as a data.frame, use the getter method taggedText():

taggedText(tagged.text)[26:34,]
##    doc_id     token tag    lemma lttr      wclass desc stop stem idx sntc
## 26 sample       and  CC      and    3 conjunction   NA   NA   NA  26    1
## 27 sample       are VBP       be    3        verb   NA   NA   NA  27    1
## 28 sample  deployed VBN   deploy    8        verb   NA   NA   NA  28    1
## 29 sample     after  IN    after    5 preposition   NA   NA   NA  29    1
## 30 sample        an  DT       an    2  determiner   NA   NA   NA  30    1
## 31 sample    attack  NN   attack    6        noun   NA   NA   NA  31    1
## 32 sample       has VBZ     have    3        verb   NA   NA   NA  32    1
## 33 sample      been VBN       be    4        verb   NA   NA   NA  33    1
## 34 sample initiated VBN initiate    9        verb   NA   NA   NA  34    1

In case you want to access a subset of the data in the resulting object, e.g., only the column with the number of letters or the first five rows of TT.res, you’ll be happy to know there are special [ and [[ methods for these kinds of objects:

head(tagged.text[["lttr"]], n=50)
##  [1]  7 10 11  7  7 10  3  7  4  9  4  4  7  2  6  4  9  2  3  5  5  1  7  7  1  3  3
## [28]  8  5  2  6  3  4  9  1  9  8  1  3  7  9  4  7 12  4 11  2 10  1  4
tagged.text[1:5,]
##   doc_id       token tag     lemma lttr wclass desc stop stem idx sntc
## 1 sample     Defense  NN   defense    7   noun   NA   NA   NA   1    1
## 2 sample  mechanisms NNS mechanism   10   noun   NA   NA   NA   2    1
## 3 sample Phasmatodea  NP <unknown>   11   name   NA   NA   NA   3    1
## 4 sample     species  NN   species    7   noun   NA   NA   NA   4    1
## 5 sample     exhibit  NN   exhibit    7   noun   NA   NA   NA   5    1

The [ and [[ methods are basically useful shortcuts for taggedText().
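
Since the result is an ordinary data.frame, the usual base R tools apply to it. The sketch below retypes three columns of rows 26 to 34 from the output above by hand, so that it runs without a TreeTagger installation; the object names tt, verbs and class.counts are made up for this example.

```r
# rows 26 to 34 of the sample output, retyped as a plain data.frame
tt <- data.frame(
  token=c("and", "are", "deployed", "after", "an", "attack", "has", "been", "initiated"),
  tag=c("CC", "VBP", "VBN", "IN", "DT", "NN", "VBZ", "VBN", "VBN"),
  wclass=c("conjunction", "verb", "verb", "preposition", "determiner",
           "noun", "verb", "verb", "verb"),
  stringsAsFactors=FALSE
)

# keep only the tokens that were classified as verbs
verbs <- tt[tt[["wclass"]] == "verb", ]

# count how many tokens fall into each global word class
class.counts <- table(tt[["wclass"]])

verbs
class.counts
```

With a real tagged object, the same subsetting should work on the data.frame returned by taggedText(tagged.text).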

Descriptive statistics

All results of both treetag() and tokenize() also provide various descriptive statistics calculated from the analyzed text. You can get them by calling describe() on the object:

describe(tagged.text)
## $all.chars
## [1] 3554
## 
## $lines
## [1] 10
## 
## $normalized.space
## [1] 3549
## 
## $chars.no.space
## [1] 2996
## 
## $punct
## [1] 78
## 
## $digits
## [1] 4
## 
## $letters
##  all   l1   l2   l3   l4   l5   l6   l7   l8   l9  l10  l11  l12  l13  l14  l15  l16 
## 2918   19   92   74   80   51   49   65   43   35   22   15    6    3    0    1    1 
## 
## $letters.only
## [1] 2914
## 
## $char.distrib
##                 1         2         3         4          5          6         7
## num      80.00000  92.00000  74.00000  80.00000  51.000000  49.000000  65.00000
## cum.sum  80.00000 172.00000 246.00000 326.00000 377.000000 426.000000 491.00000
## cum.inv 537.00000 445.00000 371.00000 291.00000 240.000000 191.000000 126.00000
## pct      12.96596  14.91086  11.99352  12.96596   8.265802   7.941653  10.53485
## cum.pct  12.96596  27.87682  39.87034  52.83630  61.102107  69.043760  79.57861
## pct.inv  87.03404  72.12318  60.12966  47.16370  38.897893  30.956240  20.42139
##                  8          9         10         11          12          13
## num      43.000000  35.000000  22.000000  15.000000   6.0000000   3.0000000
## cum.sum 534.000000 569.000000 591.000000 606.000000 612.0000000 615.0000000
## cum.inv  83.000000  48.000000  26.000000  11.000000   5.0000000   2.0000000
## pct       6.969206   5.672609   3.565640   2.431118   0.9724473   0.4862237
## cum.pct  86.547812  92.220421  95.786062  98.217180  99.1896272  99.6758509
## pct.inv  13.452188   7.779579   4.213938   1.782820   0.8103728   0.3241491
##                  14          15          16
## num       0.0000000   1.0000000   1.0000000
## cum.sum 615.0000000 616.0000000 617.0000000
## cum.inv   2.0000000   1.0000000   0.0000000
## pct       0.0000000   0.1620746   0.1620746
## cum.pct  99.6758509  99.8379254 100.0000000
## pct.inv   0.3241491   0.1620746   0.0000000
## 
## $lttr.distrib
##                  1         2         3         4          5         6         7
## num      19.000000  92.00000  74.00000  80.00000  51.000000  49.00000  65.00000
## cum.sum  19.000000 111.00000 185.00000 265.00000 316.000000 365.00000 430.00000
## cum.inv 537.000000 445.00000 371.00000 291.00000 240.000000 191.00000 126.00000
## pct       3.417266  16.54676  13.30935  14.38849   9.172662   8.81295  11.69065
## cum.pct   3.417266  19.96403  33.27338  47.66187  56.834532  65.64748  77.33813
## pct.inv  96.582734  80.03597  66.72662  52.33813  43.165468  34.35252  22.66187
##                  8          9         10         11          12          13
## num      43.000000  35.000000  22.000000  15.000000   6.0000000   3.0000000
## cum.sum 473.000000 508.000000 530.000000 545.000000 551.0000000 554.0000000
## cum.inv  83.000000  48.000000  26.000000  11.000000   5.0000000   2.0000000
## pct       7.733813   6.294964   3.956835   2.697842   1.0791367   0.5395683
## cum.pct  85.071942  91.366906  95.323741  98.021583  99.1007194  99.6402878
## pct.inv  14.928058   8.633094   4.676259   1.978417   0.8992806   0.3597122
##                  14          15          16
## num       0.0000000   1.0000000   1.0000000
## cum.sum 554.0000000 555.0000000 556.0000000
## cum.inv   2.0000000   1.0000000   0.0000000
## pct       0.0000000   0.1798561   0.1798561
## cum.pct  99.6402878  99.8201439 100.0000000
## pct.inv   0.3597122   0.1798561   0.0000000
## 
## $words
## [1] 556
## 
## $sentences
## [1] 18
## 
## $avg.sentc.length
## [1] 30.88889
## 
## $avg.word.length
## [1] 5.248201
## 
## $doc_id
## [1] "sample"

Amongst others, you will find several indices describing the number of characters:

  • all.chars: Counts each character, including all space characters
  • normalized.space: Like all.chars, but clusters of space characters (incl. line breaks) are counted only as one character
  • chars.no.space: Counts all characters except any space characters
  • letters.only: Counts only letters, excluding(!) digits (which are counted separately as digits)

You’ll also find the number of words and sentences, as well as average word and sentence lengths, and tables describing how the word length is distributed throughout the text (lttr.distrib). For instance, we see that the text has 74 words with three letters, 185 with three or fewer, and 371 with more than three. The last three lines show the respective percentages.
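
The rows of such a distribution table follow from simple cumulative sums. The sketch below rebuilds lttr.distrib in base R from the word counts per length shown in the output above (the num row); it is a reconstruction for illustration, not the package’s own code, and the intermediate variable names are chosen to mirror the row labels.

```r
# number of words with 1, 2, ..., 16 letters, copied from the output above
num <- c(19, 92, 74, 80, 51, 49, 65, 43, 35, 22, 15, 6, 3, 0, 1, 1)
names(num) <- seq_along(num)

cum.sum <- cumsum(num)            # words with this many letters or fewer
cum.inv <- sum(num) - cum.sum     # words with more letters than this
pct     <- num * 100 / sum(num)   # share of each word length in percent

lttr.distrib <- rbind(
  num=num,
  cum.sum=cum.sum,
  cum.inv=cum.inv,
  pct=pct,
  cum.pct=cumsum(pct),
  pct.inv=100 - cumsum(pct)
)

# the column for three-letter words discussed in the text
lttr.distrib[, "3"]
```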

Lexical diversity (type token ratios)

To analyze the lexical diversity of our text we can now simply hand over the tagged text object to the lex.div() method. You can call it on the object with no further arguments (like lex.div(tagged.text)), but in this example we’ll limit the analysis to a few measures:5

lex.div(
  tagged.text,
  measure=c("TTR", "MSTTR", "MATTR","HD-D", "MTLD", "MTLD-MA"),
  char=c("TTR", "MATTR","HD-D", "MTLD", "MTLD-MA")
)
## 
## Total number of tokens: 556
## Total number of types:  294
## Total number of lemmas: 283
## 
## Type-Token Ratio
##    TTR: 0.53 
## 
## TTR characteristics:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5297  0.5466  0.5930  0.6188  0.6491  1.0000 
##    SD
##  0.0907
## 
## 
## Mean Segmental Type-Token Ratio
##                MSTTR: 0.72
##           SD of TTRs: 0.03
##         Segment size: 100
##       Tokens dropped: 56 
## 
## Hint: A segment size of 92 would reduce the drop rate to 4.
##       Maybe try ?segment.optimizer()
## 
## 
## Moving-Average Type-Token Ratio
##                MATTR: 0.74
##           SD of TTRs: 0.03
##          Window size: 100 
## 
## MATTR characteristics:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7138  0.7239  0.7308  0.7290  0.7341  0.7368 
##    SD
##  0.0066
## 
## 
## HD-D
##           HD-D: 35.54
##           ATTR: 0.85
##    Sample size: 42 
## 
## HD-D characteristics:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   35.46   35.62   34.31   35.82   36.16 
##    SD
##  5.0648
## 
## 
## Measure of Textual Lexical Diversity
##                 MTLD: 97.38
##    Number of factors: 5.71
##          Factor size: 0.72
##     SD tokens/factor: 36.08 (all factors)
##                       26.06 (complete factors only)
## 
## MTLD characteristics:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.00   79.07   88.37   83.21   92.92  104.40       1 
##    SD
##  15.9015
## 
## 
## Moving-Average Measure of Textual Lexical Diversity
##              MTLD-MA: 102.73
##     SD tokens/factor: 26.74
##            Step size: 1
##          Factor size: 0.72
##          Min. tokens: 9 
## 
## MTLD-MA characteristics:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   63.00   92.99   96.57   95.89  102.76  107.99      12 
##    SD
##  9.6766
## 
## Note: Analysis was conducted case insensitive.

Let’s look at some particular parts: First we are informed of the total number of types, tokens and lemmas (if available). After that the actual results are printed, using the package’s show() method for this particular kind of object. As you can see, it prints the actual value of each measure before a summary of the characteristics.6
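
The simplest of these measures, the type-token ratio, is the number of distinct tokens (types) divided by the total number of tokens. The following base R sketch computes it for a made-up token vector, ignoring case as in the analysis above, and then redoes the division with the totals printed in the output. The token vector and variable names are assumptions for this example.

```r
# a made-up token vector
tokens <- c("The", "insect", "hides", "and", "the", "predator", "waits", "and", "waits")

# types are the distinct tokens; lower-casing makes the comparison case insensitive
types <- unique(tolower(tokens))
ttr <- length(types) / length(tokens)
ttr

# the same division with the totals from the sample text (294 types, 556 tokens)
ttr.sample <- 294 / 556
round(ttr.sample, 2)
```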

Some measures return more information than just their actual index value. For instance, when the Mean Segmental Type-Token Ratio is calculated, you’ll be informed how much of your text was dropped and hence not examined. A small tool included in koRpus, segment.optimizer(), automatically recommends a different segment size if this could decrease the number of lost tokens.
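
The number of dropped tokens reported above is consistent with plain integer division: whatever does not fill a complete segment is left out. The sketch below only redoes this arithmetic for the two segment sizes mentioned in the output; it does not try to reproduce how segment.optimizer() searches for its recommendation, and the helper dropped() is a hypothetical name.

```r
# total number of tokens in the sample text
n.tokens <- 556

# hypothetical helper: tokens left over after filling complete segments
dropped <- function(n, segment) n %% segment

dropped(n.tokens, 100)  # 56 tokens dropped with the default segment size
dropped(n.tokens, 92)   # 4 tokens dropped with the recommended size

# number of complete segments that enter the mean for a segment size of 92
n.tokens %/% 92         # 6
```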

By default, lex.div() calculates every measure of lexical diversity that was implemented. Of course this is fully configurable, e.g., to completely skip the calculation of characteristics just add the option char=NULL. If you’re only interested in one particular measure, it might be more convenient to call the corresponding wrapper function instead of lex.div(). For example, to calculate only the measures proposed by Maas (1972):

maas(tagged.text)
## Language: "en"
## 
## Total number of tokens: 556
## Total number of types:  294
## Total number of lemmas: 283
## 
## Maas' Indices
##        a: 0.19 
##     lgV0: 5.64 
##    lgeV0: 12.99 
## 
## Relative vocabulary growth (first half to full text)
##        a: 0.81 
##     lgV0: 6.75 
##       V': 0.43 (43 new types every 100 tokens)
## 
## Note: Analysis was conducted case insensitive.

All wrapper functions have characteristics turned off by default. The following example demonstrates how to calculate and plot the classic type-token ratio with characteristics. The resulting plot shows the typical degradation of TTR values with increasing text length:

ttr.res <- TTR(tagged.text, char=TRUE)
plot(ttr.res@TTR.char, type="l", main="TTR degradation over text length")
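
To see why the ratio tends to fall as a text grows, it can help to compute a running TTR by hand. The base R sketch below does this token by token for a made-up vector; the characteristics calculated by the package may use a different step size, so this only illustrates the idea.

```r
# a made-up token vector with repetitions
tokens <- c("a", "b", "a", "a", "c", "b", "a", "d")

# number of distinct types seen so far, divided by the number of tokens so far
running.ttr <- cumsum(!duplicated(tokens)) / seq_along(tokens)
round(running.ttr, 2)

# the same kind of line plot as above, for the toy data
plot(running.ttr, type="l", xlab="tokens", ylab="TTR", main="Running TTR (toy data)")
```

New types become rarer as more of the text has been seen, while the token count keeps growing, which is what pulls the curve down.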

Since this package is intended for research, it is possible to directly influence all relevant values of each measure and examine the effects. For example, as mentioned before, segment.optimizer() recommended a change of segment size for MSTTR to drop fewer tokens, which is easily done:

MSTTR(tagged.text, segment=92)
## Language: "en"
## 
## Total number of tokens: 556
## Total number of types:  294
## Total number of lemmas: 283
## 
## Mean Segmental Type-Token Ratio
##                MSTTR: 0.75
##           SD of TTRs: 0.04
##         Segment size: 92
##       Tokens dropped: 4
## 
## Note: Analysis was conducted case insensitive.

Please refer to the documentation for more detailed information on the available measures and their references.

Frequency analysis

Importing language corpora data

This package has rudimentary support for importing corpus databases.7 That is, it can read frequency data for words into an R object and use this object for further analysis. Next to the Celex database format (read.corp.celex()), it can read the LCC flatfile format8 (read.corp.LCC()). The latter might be of special interest, because the needed database archives can be freely downloaded. Once you’ve downloaded one of these archives, it can be conveniently imported:

LCC.en <- read.corp.LCC("~/downloads/corpora/eng_news_2010_1M-text.tar")

read.corp.LCC() will automatically extract the files it needs from the archive. Alternatively, you can specify the path to the unpacked archive as well. To work with the imported data directly, the tool query() was added to the package. It helps you to conveniently look up certain words, or ranges of interesting values:

query(LCC.en, "word", "what")
##     num word  freq         pct pmio    log10 rank.avg rank.min rank.rel.avg
## 160 210 what 16396 0.000780145  780 2.892095   260759   260759     99.95362
##     rank.rel.min
## 160     99.95362
query(LCC.en, "pmio", c(780, 790))
##     num  word  freq          pct pmio    log10 rank.avg rank.min rank.rel.avg
## 156 206  many 16588 0.0007892806  789 2.897077   260763   260763     99.95515
## 157 207   per 16492 0.0007847128  784 2.894316   260762   260762     99.95477
## 158 208  down 16468 0.0007835708  783 2.893762   260761   260761     99.95439
## 159 209 since 16431 0.0007818103  781 2.892651   260760   260760     99.95400
## 160 210  what 16396 0.0007801450  780 2.892095   260759   260759     99.95362
##     rank.rel.min
## 156     99.95515
## 157     99.95477
## 158     99.95439
## 159     99.95400
## 160     99.95362

Conduct a frequency analysis

We can now conduct a full frequency analysis of our text:

freq.analysis.res <- freq.analysis(tagged.text, corp.freq=LCC.en)

The resulting object holds a lot of information, even if no corpus data was used (i.e., corp.freq=NULL). To begin with, it contains the two slots TT.res and lang, which are copied from the analyzed tagged text object. In this way analysis results can always be converted back into kRp.tagged objects.9 However, if corpus data was provided, the tagging results gained three new columns:

taggedText(freq.analysis.res)
##        token tag     lemma lttr  [...] pmio rank.avg rank.min
[...]
## 30        an  DT        an    2        3817 99.98735 99.98735
## 31    attack  NN    attack    6         163 99.70370 99.70370
## 32       has VBZ      have    3        4318 99.98888 99.98888
## 33      been VBN        be    4        2488 99.98313 99.98313
## 34 initiated VBN  initiate    9          11 97.32617 97.32137
## 35         (   (         (    1         854 99.96013 99.96013
## 36 secondary  JJ secondary    9          21 98.23846 98.23674
## 37   defense  NN   defense    7         210 99.77499 99.77499
## 38         )   )         )    1         856 99.96052 99.96052
[...]

Perhaps most informatively, pmio shows how often the respective token appears in a million tokens, according to the corpus data. Adding to this, the previously introduced slot desc now contains some more descriptive statistics on our text, and if we provided a corpus database, the slot freq.analysis lists summaries of various frequency information that was calculated.
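
The relation between the relative frequency and the per-million value can be checked against the query() output shown earlier. In that output, pmio is consistent with the pct column multiplied by one million and truncated to a whole number; this is inferred from the printed values only, not from the package source, so treat the use of floor() below as an assumption.

```r
# pct values copied from the query() output above
pct <- c(
  many=0.0007892806,
  per=0.0007847128,
  down=0.0007835708,
  since=0.0007818103,
  what=0.0007801450
)

# occurrences per million tokens, truncated to whole numbers (assumed rule)
pmio <- floor(pct * 1e6)
pmio
```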

If the corpus object also provided inverse document frequency (i.e., values in column idf) data, freq.analysis() will automatically compute tf-idf statistics and put them in a column called tfidf.

New to the desc slot

Amongst others, the descriptives now also give easy access to character vectors with all words ($all.words) and all lemmata ($all.lemmata), all tokens sorted10 into word classes (e.g., all verbs in $classes$verb), or the number of words in each sentence:

describe(freq.analysis.res)[["sentc.length"]]
##  [1] 34 10 37 16 44 31 14 31 34 23 17 43 40 47 22 19 65 29
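
These sentence lengths tie in with the descriptive statistics shown earlier. The short base R check below copies the vector from the output above and recovers the number of sentences, the number of words and the average sentence length; the variable name sentc.length simply mirrors the list element.

```r
# words per sentence, copied from the output above
sentc.length <- c(34, 10, 37, 16, 44, 31, 14, 31, 34, 23, 17, 43, 40, 47, 22, 19, 65, 29)

length(sentc.length)  # number of sentences: 18
sum(sentc.length)     # number of words: 556
mean(sentc.length)    # average sentence length, cf. avg.sentc.length
```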

As a practical example, the list $classes has proven to be very helpful for debugging the results of TreeTagger, which is remarkably accurate, but of course not free from making a mistake now and then. By looking through $classes, where all tokens are grouped by the global word class TreeTagger assigned to them, at least obvious errors (like names mistakenly taken for a pronoun) are easily found:11

describe(freq.analysis.res)$classes
## $conjunction
## [1] "both" "and"  "and"  "and"  "and"  "or"   "or"   "and"  "and"  "or"  
## [11] "and"  "or"   "and"  "or"   "and"  "and"  "and"  "and" 
## 
## $number
## [1] "20"  "one"
## 
## $determiner
##  [1] "an"      "the"     "an"      "The"     "the"     "the"     "some"   
##  [8] "that"    "Some"    "the"     "a"       "a"       "a"       "the"    
## [15] "that"    "the"     "the"     "Another" "which"   "the"     "a"      
## [22] "that"    "a"       "The"     "a"       "the"     "that"    "a"      
[...]

Readability

The package comes with implementations of several readability formulae. Some of them depend on the number of syllables in the text.12 To achieve this, the method hyphen() takes objects of class kRp.tagged and applies a hyphenation algorithm (Liang, 1983) to each word. This algorithm was originally developed for automatic word hyphenation in LaTeX, and is gracefully misused here to fulfill a slightly different service.13

hyph.txt.en <- hyphen(tagged.text)
hyph.txt.en
##     syll           word
## 1      2       De-fense
## 2      3   mech-a-nisms
## 3      4 Phasm-a-to-dea
## 4      2       spe-cies
## 5      3      ex-hib-it
## 6      3   mech-a-nisms
##       NA      [...]    
## 551    1             is
## 552    3   con-sid-ered
## 553    4    in-ed-i-ble
## 554    1             by
## 555    1           some
## 556    3    pred-a-tors

This separate hyphenation step can actually be skipped, as readability() will do it automatically if needed. But similar to TreeTagger, hyphen() will most likely not produce perfect results. As a rule of thumb, if in doubt it seems to behave rather conservatively, that is, it underestimates the real number of syllables in a text. This, however, would of course affect the results of several readability formulae.

So, the more accurate the end results should be, the less you should rely on the automatic hyphenation alone. But it sure is a good starting point, for there is a method called correct.hyph() to help you clean these results of errors later on. The most straightforward way to do this is to call hyphenText(hyph.txt.en), which will get you a data frame with two columns, word (the hyphenated words) and syll (the number of syllables), that you can inspect, e.g., in a spreadsheet editor:14

head(hyphenText(hyph.txt.en))
##   syll           word
## 1    2       De-fense
## 2    3   mech-a-nisms
## 3    4 Phasm-a-to-dea
## 4    2       spe-cies
## 5    3      ex-hib-it
## 6    3   mech-a-nisms

You can then manually correct wrong hyphenations by removing or inserting “-” as hyphenation indicators, and call correct.hyph() without further arguments, which will cause it to recount all syllables:

hyph.txt.en <- correct.hyph(hyph.txt.en)

But the method can also be used to alter entries directly, which might be simpler and cleaner than manual changes:

hyph.txt.en <- correct.hyph(hyph.txt.en, word="mech-a-nisms", hyphen="mech-a-ni-sms")
## Changed
## 
##   syll         word
## 2    3 mech-a-nisms
## 6    3 mech-a-nisms
## 
##   into
## 
##   syll          word
## 2    4 mech-a-ni-sms
## 6    4 mech-a-ni-sms
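
The recount itself is easy to picture: the number of syllables is the number of parts a word falls into when split at its hyphenation marks. The base R sketch below shows this for a few of the words above; it is a simplified stand-in for what correct.hyph() does internally, and the names hyph.words and syll are chosen for this example.

```r
# hyphenated words as they appear in the tables above, plus one unhyphenated word
hyph.words <- c("De-fense", "mech-a-nisms", "Phasm-a-to-dea", "mech-a-ni-sms", "by")

# one syllable per part between hyphenation marks
syll <- lengths(strsplit(hyph.words, "-", fixed=TRUE))

data.frame(syll=syll, word=hyph.words)
```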

The hyphenated text object can now be given to readability(), to calculate the measures of interest:15

readbl.txt <- readability(tagged.text, hyphen=hyph.txt.en)

Similar to lex.div(), by default readability() calculates almost16 all available measures:

readbl.txt
## 
## Automated Readability Index (ARI)
##   Parameters: default 
##        Grade: 18.73 
## 
## 
## Coleman Formulas
##   Parameters: default 
##     Pronouns: 1.62 (per 100 words)
##      Prepos.: 13.49 (per 100 words)
##    Formula 1: 39% cloze completions
##    Formula 2: 37% cloze completions
##    Formula 3: 35% cloze completions
##    Formula 4: 36% cloze completions
## 
## 
## Coleman-Liau
##   Parameters: default 
##          ECP: 33% (estimted cloze percentage)
##        Grade: 14.1 
##        Grade: 14.1 (short formula)
## 
## 
## Danielson-Bryan
##   Parameters: default 
##          DB1: 9.86 
##          DB2: 26.39 
##        Grade: >= 13 (college) 
## 
## 
## Dickes-Steiwer's Handformel
##   Parameters: default 
##          TTR: 0.53 
##        Score: 32.21 
## 
## 
## Easy Listening Formula
##   Parameters: default 
##       Exsyls: 222 
##        Score: 12.33 
## 
## 
## Farr-Jenkins-Paterson
##   Parameters: default 
##           RE: 33.19 
##        Grade: >= 13 (college) 
## 
## 
## Flesch Reading Ease
##   Parameters: en (Flesch) 
##           RE: 33.98 
##        Grade: >= 13 (college) 
## 
## 
## Flesch-Kincaid Grade Level
##   Parameters: default 
##        Grade: 16.19 
##          Age: 21.19 
## 
## 
## Gunning Frequency of Gobbledygook (FOG)
##   Parameters: default 
##        Grade: 18.69 
## 
## 
## FORCAST
##   Parameters: default 
##        Grade: 10.99 
##          Age: 15.99 
## 
## 
## Fucks' Stilcharakteristik
##        Score: 162.11 
##        Grade: 12.73 
## 
## 
## Linsear Write
##   Parameters: default 
##   Easy words: 80.4 
##   Hard words: 19.6 
##        Grade: 21.5 
## 
## 
## Läsbarhetsindex (LIX)
##   Parameters: default 
##        Index: 65.24 
##       Rating: very difficult 
##        Grade: > 11 
## 
## 
## Neue Wiener Sachtextformeln
##   Parameters: default 
##        nWS 1: 10.57 
##        nWS 2: 11.07 
##        nWS 3: 10.58 
##        nWS 4: 11.89 
## 
## 
## Readability Index (RIX)
##   Parameters: default 
##        Index: 10.61 
##        Grade: > 12 (college) 
## 
## 
## Simple Measure of Gobbledygook (SMOG)
##   Parameters: default 
##        Grade: 17.19 
##          Age: 22.19 
## 
## 
## Strain Index
##   Parameters: default 
##        Index: 15.5 
## 
## 
## Tränkle-Bailer Formulas
##    Parameters: default 
##  Prepositions: 13%
##  Conjunctions: 3%
##          TB 1: 18.59 
##          TB 2: 27.15 
## 
## 
## Kuntzsch's Text-Redundanz-Index
##   Parameters: default 
##  Short words: 334 
##  Punctuation: 78 
##      Foreign: 0 
##        Score: -56.88 
## 
## 
## Tuldava's Text Difficulty Formula
##   Parameters: default 
##        Index: 5.74 
## 
## 
## Wheeler-Smith
##   Parameters: default 
##        Score: 123.33 
##        Grade: > 4 
## 
## Text language: en
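
Several of these indices need nothing more than the counts already reported by describe(). As a plausibility check, the base R sketch below recomputes ARI, LIX and RIX from those counts, using the formulae as they are commonly published; that the package’s default parameters are the same is an assumption, supported only by the fact that the rounded results agree with the output above.

```r
# counts taken from the describe() output shown earlier
letters.all <- 2918   # $letters, column "all"
words       <- 556    # $words
sentences   <- 18     # $sentences
long.words  <- 191    # words with more than six letters (cum.inv at 6 in lttr.distrib)

# Automated Readability Index
ari <- 0.5 * (words / sentences) + 4.71 * (letters.all / words) - 21.43

# Laesbarhetsindex: average sentence length plus percentage of long words
lix <- (words / sentences) + (100 * long.words / words)

# Readability Index: long words per sentence
rix <- long.words / sentences

round(c(ARI=ari, LIX=lix, RIX=rix), 2)
```

Indices that depend on syllable counts, such as Flesch Reading Ease, cannot be checked this way without the hyphenation results.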

To get a more condensed overview of the results try the summary() method:

summary(readbl.txt)
## Text language: en
##                    index     flavour    raw           grade  age
## 1                    ARI                              18.73     
## 2             Coleman C1                 39                     
## 3             Coleman C2                 37                     
## 4             Coleman C3                 35                     
## 5             Coleman C4                 36                     
## 6           Coleman-Liau                 33            14.1     
## 7    Danielson-Bryan DB1               9.86                     
## 8    Danielson-Bryan DB2              26.39 >= 13 (college)     
## 9         Dickes-Steiwer              32.21                     
## 10                   ELF              12.33                     
## 11 Farr-Jenkins-Paterson              33.19 >= 13 (college)     
## 12                Flesch en (Flesch)  33.98 >= 13 (college)     
## 13        Flesch-Kincaid                              16.19 21.2
## 14                   FOG                              18.69     
## 15               FORCAST                              10.99   16
## 16                 Fucks             162.11           12.73     
## 17         Linsear-Write                               21.5     
## 18                   LIX              65.24            > 11     
## 19                  nWS1                              10.57     
## 20                  nWS2                              11.07     
## 21                  nWS3                              10.58     
## 22                  nWS4                              11.89     
## 23                   RIX              10.61  > 12 (college)     
## 24                  SMOG                              17.19 22.2
## 25                Strain               15.5                     
## 26   Traenkle-Bailer TB1              18.59                     
## 27   Traenkle-Bailer TB2              27.15                     
## 28                   TRI             -56.88                     
## 29               Tuldava               5.74                     
## 30         Wheeler-Smith             123.33             > 4

The summary() method also supports a flat format, which turns the table into a named numeric vector of the raw values (all indices have raw values, but only a few provide more than that). This format comes in very handy when you want to use the output in further calculations:

summary(readbl.txt, flat=TRUE)
##                   ARI            Coleman.C1            Coleman.C2 
##                 18.73                 39.00                 37.00 
##            Coleman.C3            Coleman.C4          Coleman.Liau 
##                 35.00                 36.00                 33.00 
##   Danielson.Bryan.DB1   Danielson.Bryan.DB2        Dickes.Steiwer 
##                  9.86                 26.39                 32.21 
##                   ELF Farr.Jenkins.Paterson                Flesch 
##                 12.33                 33.19                 33.98 
##        Flesch.Kincaid                   FOG               FORCAST 
##                 16.19                 18.69                 10.99 
##                 Fucks         Linsear.Write                   LIX 
##                162.11                 21.50                 65.24 
##                  nWS1                  nWS2                  nWS3 
##                 10.57                 11.07                 10.58 
##                  nWS4                   RIX                  SMOG 
##                 11.89                 10.61                 17.19 
##                Strain   Traenkle.Bailer.TB1   Traenkle.Bailer.TB2 
##                 15.50                 18.59                 27.15 
##                   TRI               Tuldava         Wheeler.Smith 
##                -56.88                  5.74                123.33
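As a small illustration of such further calculations, the sketch below rebuilds part of the flat vector by hand (values copied from the output above) and averages three of the measures that report a grade level. The object names flat.res, grade.measures and mean.grade are only chosen for this example; in a real session flat.res would simply be the result of summary(readbl.txt, flat=TRUE).

```r
# Subset of the flat summary, typed in by hand so the example runs without koRpus
flat.res <- c(Flesch.Kincaid=16.19, FOG=18.69, SMOG=17.19, LIX=65.24)

# A named numeric vector can be indexed by measure name
grade.measures <- c("Flesch.Kincaid", "FOG", "SMOG")
flat.res[grade.measures]

# Average of the three selected grade level estimates
mean.grade <- mean(flat.res[grade.measures])
mean.grade
```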

If you’re interested in a particular formula, again a wrapper function might be more convenient:

flesch.res <- flesch(tagged.text, hyphen=hyph.txt.en)
lix.res <- LIX(tagged.text)   # LIX doesn't need syllable count
lix.res
## 
## Läsbarhetsindex (LIX)
##   Parameters: default 
##        Index: 65.24 
##       Rating: very difficult 
##        Grade: > 11 
## 
## Text language: en

Readability from numeric data

It is possible to calculate the readability measures directly from the relevant key values, rather than analyzing an actual text, by using readability.num() instead of readability(). If you need to reanalyze a particular text, this can be considerably faster. To support this, all objects returned by readability() can be fed directly to readability.num(), since all relevant data is present in the desc slot.
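To illustrate what calculating from key values means, here is a minimal base R sketch of the LIX formula (words per sentence plus the percentage of words longer than six letters). The function lix.from.counts() and the counts passed to it are made up for this example and are not part of koRpus; readability.num() expects its key values in its own format, which is described in its documentation.

```r
# Hypothetical helper, not a koRpus function: LIX from three plain counts
lix.from.counts <- function(words, sentences, long.words){
  # average sentence length plus percentage of long words (> 6 letters)
  (words / sentences) + (100 * long.words / words)
}

# Made-up counts: 1000 words, 40 sentences, 300 long words
lix.example <- lix.from.counts(words=1000, sentences=40, long.words=300)
lix.example
```

No text has to be tokenized or tagged for this, which is why working from stored key values can save a lot of time.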

Language detection

Another feature of this package is detecting the language a text was (most probably) written in. This is done by gzipping reference texts in known languages, gzipping them again with a small sample of the text in the unknown language appended, and picking the language for which the additional sample causes the smallest increase in file size (as described in Benedetto, Caglioti, & Loreto, 2002). By default, the compressed objects are created in memory only.
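The compression idea can be sketched in a few lines of base R. This is only an illustration of the principle under simplifying assumptions, not the code used by guess.lang(): the helpers gz.size() and zip.guess() are hypothetical, and the two reference snippets are far too short to give reliable guesses.

```r
# Size in bytes of a string after gzip compression in memory
gz.size <- function(txt){
  length(memCompress(charToRaw(txt), type="gzip"))
}

# Hypothetical helper: for each reference text, measure how much the
# compressed size grows when the sample is appended; smallest growth first
zip.guess <- function(sample.txt, references){
  growth <- sapply(references, function(ref){
    gz.size(paste(ref, sample.txt)) - gz.size(ref)
  })
  sort(growth)
}

# Toy reference snippets; real use needs much longer reference texts
refs <- c(
  en="All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience.",
  de="Alle Menschen sind frei und gleich an Wuerde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt."
)

guess <- zip.guess("Everyone has the right to life, liberty and security of person.", refs)
names(guess)[1]  # identifier of the reference with the smallest increase
```

The sorted vector plays the role of the diff column in the summary of a guess.lang() result: the smaller the increase, the better the sample matches the reference language.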

To use the function guess.lang(), you first need to download the reference material. In this implementation, the Universal Declaration of Human Rights in Unicode formatting is used, because the document holds the world record for being the text translated into the most languages, and is publicly available. Please get the zipped archive with all translations in .txt format. You can, but don’t have to, unzip the archive. The text whose language is to be detected must also be in a Unicode .txt file:

guessed <- guess.lang(
  file.path(find.package("koRpus"),"tests","testthat","sample_text.txt"),
  udhr.path="~/downloads/udhr_txt.zip"
)
summary(guessed)
##   Estimated language: English
##           Identifier: eng
##               Region: Europe
## 
## 435 different languages were checked.
## 
## Distribution of compression differences:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   136.0   168.0   176.0   190.7   184.0   280.0 
## 
##   SD: 38.21 
## 
## Top 5 guesses:
##                         name iso639-3 bcp47 region diff  diff.std
## 1                    English      eng    en Europe  136 -1.430827
## 2                      Scots      sco   sco Europe  136 -1.430827
## 3           Pidgin, Nigerian      pcm   pcm Africa  144 -1.221473
## 4   Catalan-Valencian-Balear      cat    ca Europe  152 -1.012119
## 5                     French      fra    fr Europe  152 -1.012119
## 
## Last 5 guesses:
##                         name iso639-3   bcp47 region diff diff.std
## 431                  Burmese      mya      my   Asia  280 2.337547
## 432                     Shan      shn     shn   Asia  280 2.337547
## 433                    Tamil      tam      ta   Asia  280 2.337547
## 434     Vietnamese (Han nom)      vie vi-Hani   Asia  280 2.337547
## 435             Chinese, Yue      yue     yue   Asia  280 2.337547

Extending koRpus

The language support of this package has a modular design. There are some pre-built language packages in the l10n repository, and with a little effort you should be able to add new languages yourself. You need the package sources for this; basically, you will then have to add a new file to them and rebuild/reinstall the package. More details on this topic can be found in inst/README.languages. Once you have got a new language to work with koRpus, I’d be happy to include your module in the official distribution.

Analyzing full corpora

Despite its name, the scope of koRpus is single texts. If you would like to analyze a full corpus of texts, have a look at the plugin package tm.plugin.koRpus.

Acknowledgements

The APA style used in this vignette was kindly provided by the CSL project and is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

References

Benedetto, D., Caglioti, E., & Loreto, V. (2002). Language trees and zipping. Physical Review Letters, 88(4), 048702.

Bormuth, J. R. (1968). Cloze Test Readability: Criterion Reference Scores. Journal of Educational Measurement, 5(3), 189–196.

Dale, E., & Chall, J. S. (1948). A formula for predicting readability. Educational Research Bulletin, 11–28.

Liang, F. M. (1983). Word Hy-phen-a-tion by Com-put-er (Dissertation). Stanford University, Dept. Computer Science, Stanford.

Maas, H. D. (1972). Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes. Zeitschrift Für Literaturwissenschaft Und Linguistik, 2(8), 73–79.

McCarthy, P. M., & Jarvis, S. (2007). vocd – A theoretical and empirical evaluation. Language Testing, 24(4), 459–488.

McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392.

Quasthoff, U., Richter, M., & Biemann, C. (2006). Corpus Portal for Search in Monolingual Corpora. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (pp. 1799–1802). Genoa.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing (pp. 44–49). Manchester, UK.

Spache, G. (1953). A new readability formula for primary-grade reading materials. The Elementary School Journal, 53(7), 410–413.

Tweedie, F. J., & Baayen, R. H. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352.


  1. For a free implementation try http://strawberryperl.com

  2. Like http://7-zip.org

  3. Unfortunately, these language packages did not get the approval of the CRAN maintainers and are officially hosted at https://undocumeantit.github.io/repos/l10n. For your convenience the function install.koRpus.lang() can be used to easily install them anyway.

  4. Presets are defined in the language support packages, usually named like their respective two-character language identifier. Refer to their documentation.

  5. For information on the measures shown see Tweedie & Baayen (1998), McCarthy & Jarvis (2007), McCarthy & Jarvis (2010).

  6. Characteristics can be looked at to examine each measure’s dependency on text length. They are calculated by computing each measure repeatedly, beginning with only the first token, then adding the next, progressing until the full text was analyzed.

  7. The package also has a function called read.corp.custom() which can be used to process language corpora yourself, and store the results in an object of class kRp.corp.freq, which is the class returned by read.corp.LCC() and read.corp.celex() as well. That is, if you can’t get any already analyzed corpus database but have a huge language corpus at hand, you can create your own frequency database. But be warned that depending on corpus size and your hardware, this might take ages. On the other hand, read.corp.custom() will provide inverse document frequency (idf) values for all types, which are necessary to compute tf-idf with freq.analysis().

  8. Actually, it understands two different LCC formats, both the older .zip and the newer .tar archive format.

  9. This can easily be done by calling as(freq.analysis.res, "kRp.tagged").

  10. This sorting depends on proper POS-tagging, so this will only contain useful data if you used treetag() instead of tokenize().

  11. And can then be corrected by using the function correct.tag().

  12. Whether this is the case can be looked up in the documentation.

  13. The hyphen() method was originally implemented as part of the koRpus package, but was later split off into its own package called sylly.

  14. For example, this can be comfortably done with RKWard: https://rkward.kde.org

  15. Please note that as of version 0.04-18, the correctness of some of these calculations has not been extensively validated yet. The package was released nonetheless, also to find outstanding bugs in the implemented measures. Any information on the validity of its results is very welcome!

  16. Measures which rely on word lists will be skipped if no list is provided.