koRpus: an R package for text analysis
koRpus is an R package i originally wrote to measure similarities/differences between texts. over time it grew into what it is now, a hopefully versatile tool to analyze text material in various ways, with an emphasis on scientific research, including readability and lexical diversity features.
web application
to demonstrate some of the core features of koRpus, there is a public web application hosted by the heinrich heine university of düsseldorf. it was realised using the shiny package. the source files for the app come with the koRpus package, so you can also run it locally and change it to your needs.
getting koRpus
the most recent stable release should be available via CRAN. the most recent development release of koRpus can be installed from my own package repository http://R.reaktanz.de, e.g. directly from an R session:
#stable:
install.packages("koRpus")
#development:
install.packages("koRpus", repos="http://R.reaktanz.de")
there's also a debian/ubuntu package (needs recent R packages from CRAN as a dependency).
the package has its own tokenizer, which should suffice for many use cases, but to use all available features, an additional installation of TreeTagger is strongly recommended! that is, koRpus can be used as an R wrapper for TreeTagger.
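the two ways of getting a tokenized text can be sketched roughly as follows. this is only a sketch modelled on the calls shown in the package vignette: the sample text, the variable names and the TreeTagger path `~/bin/treetagger/` are placeholders, and it assumes koRpus is installed. in newer releases, language support lives in separate packages (e.g. koRpus.lang.en), which then need to be loaded as well.

```r
# write a tiny sample text to a temporary file (placeholder content)
sample.file <- tempfile(fileext = ".txt")
writeLines("This is a short sample text. It has two sentences.", sample.file)

if (requireNamespace("koRpus", quietly = TRUE)) {
  library(koRpus)
  # newer koRpus releases ship english support separately; load it if present
  if (requireNamespace("koRpus.lang.en", quietly = TRUE)) {
    library(koRpus.lang.en)
  }

  # built-in tokenizer: no TreeTagger needed
  tokenized.text <- tokenize(sample.file, lang = "en")

  # TreeTagger wrapper: tt.path is a placeholder for your local installation
  tt.path <- "~/bin/treetagger/"
  if (dir.exists(tt.path)) {
    tagged.text <- treetag(sample.file, treetagger = "manual", lang = "en",
      TT.options = list(path = tt.path, preset = "en"))
  }
}
```

the resulting objects are what the analysis functions of the package work on; check the documentation of your installed version for the exact arguments.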
- download: koRpus_0.05-1.tar.gz (05.05.2013, 967,2 kb)
- NEWS/ChangeLog
- license: GPL >= 3
- documentation: koRpus.pdf
- vignette: koRpus_vignette.pdf
- example analysis: psycholinguistics (text in russian from wikipedia, CC-BY-SA)
features
this is a probably incomplete list of implemented features:
- extensive documentation (including a vignette for an overview)
- R wrapper for the tokenizer & POS tagger TreeTagger
- automatic detection of the language a text is written in (testing >350 languages)
- various text transformations (e.g., jumbled words, cloze deletion or C-test format)
- stopword detection
- supported languages (full analysis)
- english
- french (contributed by alexandre brulet)
- italian (contributed by alberto mirisola)
- german
- russian
- spanish (contributed by earl brown)
- language support is modularized, can be extended easily
- readability
- ARI
- bormuth mean cloze
- coleman (1--4)
- coleman-liau
- dale-chall
- danielson-bryan (1--2)
- degrees of reading power
- dickes-steiwer
- easy listening formula
- farr-jenkins-paterson
- flesch reading ease (incl. de, es, fr & nl)
- flesch-kincaid grade level
- FORCAST
- fucks
- gunning FOG
- harris-jacobson (1--5)
- linsear-write
- LIX
- neue wiener sachtextformeln (1--4)
- RIX
- SMOG
- spache
- strain index
- text-redundanz-index
- tränkle-bailer
- wheeler-smith
- lexical diversity
- carroll's corrected TTR
- dugast's uber index
- guiraud's root TTR
- HD-D (vocd-D)
- herdan's C
- maas' indices (a, lg(V0), lge(V0) & V')
- mean segmental TTR (MSTTR)
- measure of textual lexical diversity (MTLD)
- moving-average measure of textual lexical diversity (MTLD-MA)
- moving-average TTR (MATTR)
- summer's index
- type-token ratio (TTR)
- yule's K
- frequency analysis
- access to existing corpus databases (celex, leipziger corpora collection (LCC))
- creation of your own corpus databases (read.corp.custom())
- access to valence databases (read.BAWL() to import BAWL-R)
- various statistics (e.g., mean sentence and word length, distribution of word classes)
- supports the use of stemming functions from other packages
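to give an idea of what some of the listed measures compute, here is a small base R sketch of four lexical diversity indices and one readability formula (ARI). it uses the commonly cited textbook definitions and a deliberately naive tokenization. it is not the code koRpus uses, and results from the package may differ, e.g. because of how tokens and sentences are detected. all variable names are made up for this example.

```r
# toy text; the splitting below is deliberately naive
txt <- "The cat sat on the mat. The dog sat on the log."

# sentences: split at ., ! or ?; tokens: lowercase runs of letters
sentences <- unlist(strsplit(txt, "[.!?]+\\s*"))
sentences <- sentences[nzchar(sentences)]
tokens <- tolower(unlist(regmatches(txt, gregexpr("[[:alpha:]]+", txt))))

N <- length(tokens)          # number of tokens
V <- length(unique(tokens))  # number of types (distinct tokens)

# lexical diversity
ttr      <- V / N            # type-token ratio (TTR)
root.ttr <- V / sqrt(N)      # guiraud's root TTR
cttr     <- V / sqrt(2 * N)  # carroll's corrected TTR
herdan.c <- log(V) / log(N)  # herdan's C

# readability: automated readability index (ARI), based on
# characters per word and words per sentence
chars <- sum(nchar(tokens))
ari <- 0.5 * (N / length(sentences)) + 4.71 * (chars / N) - 21.43

print(c(TTR = ttr, R = root.ttr, CTTR = cttr, C = herdan.c, ARI = ari))
```

on a text this short the values are not meaningful; most of these measures are sensitive to text length, which is one reason the list above also includes length-corrected variants like MSTTR, MATTR and MTLD.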
citation information
in case you need to cite koRpus for reference, consider the CITATION file or use the following:
Michalke, M. (2012, April). koRpus -- ein R-paket zur textanalyse. Paper presented at the Tagung experimentell arbeitender Psychologen (TeaP), Mannheim.
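the CITATION file can also be read from within an R session. a minimal sketch, assuming koRpus is installed (the variable name is just an example):

```r
# citation() and toBibtex() come with R's utils package
if (requireNamespace("koRpus", quietly = TRUE)) {
  kRp.cite <- citation("koRpus")
  print(kRp.cite)            # formatted reference
  print(toBibtex(kRp.cite))  # the same entry as BibTeX
}
```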
work in progress
i'm still working on koRpus (see the ChangeLog). as a price for progress, it is possible that things sometimes won't work at all, return faulty results, or behave differently in future releases. in general, however, i consider the package to be useful and usable, and i have received several reports from various places where it was used successfully. any feedback is most welcome!
some work still needs to be done to fully validate the implementations of various measures for readability and lexical diversity. until then, those functions will trigger a warning to interpret the results with caution. help would be appreciated!
RKWard plugin: graphical user interface for koRpus
to make working with koRpus as comfortable as possible, i'm also working on a plugin for RKWard.
the plugin gets installed/updated automatically with the R package, and recent versions of RKWard will automatically add the plugin to their configuration.




