reaktanz.de :: m.eik michalke

koRpus: an R package for text analysis

koRpus is an R package i originally wrote to measure similarities/differences between texts. over time it grew into what it is now, a hopefully versatile tool to analyze text material in various ways, with an emphasis on scientific research, including readability and lexical diversity features.

web application

to demonstrate some of the core features of koRpus, there is a public web application hosted by the heinrich heine university of düsseldorf. it was realised using the shiny package. the source files for the app come with the koRpus package, so you can also run it locally and change it to your needs.

mailing list

to ask for help, report bugs, suggest feature improvements, or discuss the global development of the package, please use the issue tracker or subscribe to the koRpus-dev mailing list.

getting koRpus

the most recent stable release should be available via CRAN. the most recent development release of koRpus can be installed from my own package repository https://reaktanz.de/R, e.g. directly from an R session:

# stable:
install.packages("koRpus")

# development:
install.packages("koRpus", repos=c(getOption("repos"), reaktanz="https://reaktanz.de/R"))

the package has its own tokenizer, which should suffice for a lot of use cases, but to use all available features an additional installation of TreeTagger is strongly recommended! this means koRpus can be used as an R wrapper for TreeTagger.

download: koRpus_0.13-8.tar.gz (18.05.2021, 458,9 kb)
source code: https://github.com/unDocUMeantIt/koRpus
debian/ubuntu package (depends on recent R packages from CRAN)
NEWS/ChangeLog
license: GPL >= 3
documentation: koRpus.pdf
vignette: using the koRpus package for text analysis
example analysis: psycholinguistics (text in russian from wikipedia, CC-BY-SA)

features

this is a probably incomplete list of implemented features:

extensive documentation (including a vignette for an overview)
R wrapper for the tokenizer & POS tagger TreeTagger
compliant with the text interchange formats proposed by rOpenSci
automatic detection of the language a text is written in (testing >350 languages)
various text transformations (e.g., jumbled words, cloze deletion or C-test format)
stopword detection
fully modularized language support (see ?install.koRpus.lang function in the package)
- dutch
- english
- french (contributed by alexandre brulet)
- german
- italian (contributed by alberto mirisola)
- portuguese
- russian
- spanish (contributed by earl brown)
readability
- ARI
- bormuth mean cloze
- coleman (1--4)
- coleman-liau
- dale-chall
- danielson-bryan (1--2)
- degrees of reading power
- dickes-steiwer
- easy listening formula
- farr-jenkins-paterson
- flesch reading ease (incl. de, es, fr & nl)
- flesch-kincaid grade level
- FORCAST
- fucks
- gunning FOG
- harris-jacobson (1--5)
- linsear-write
- LIX
- neue wiener sachtextformeln (1--4)
- RIX
- SMOG
- spache
- strain index
- text-redundanz-index
- tränkle-bailer
- tuldava
- wheeler-smith
lexical diversity
- carroll's corrected TTR
- dugast's uber index
- guiraud's root TTR
- HD-D (vocd-D)
- herdan's C
- maas' indices (a, lg(V0), lge(V0) & V')
- mean segmental TTR (MSTTR)
- measure of textual lexical diversity (MTLD)
- moving-average measure of textual lexical diversity (MTLD-MA)
- moving-average TTR (MATTR)
- summer's index
- type-token ratio (TTR)
- yule's K
frequency analysis
- access to existing corpus databases (celex, leipziger corpora collection (LCC))
- creation of your own corpus databases (read.corp.custom())
- access to valence databases (read.BAWL() to import BAWL-R)
- tf-idf
- various statistics (e.g., mean sentence and word length, distribution of word classes)
supports the use of stemming functions from other packages
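
to give an idea of what the simpler lexical diversity measures in the list above compute, here is a minimal base R sketch. it is not the koRpus implementation: the helper names simple_tokens() and ttr_family() are made up for this illustration, and the naive tokenizer is only an assumption to keep the example self-contained. the formulas are the textbook definitions based on the number of tokens (N) and types (V).

```r
# base R sketch of four type-token based measures; NOT koRpus's implementation.
# simple_tokens() and ttr_family() are hypothetical helper names.

simple_tokens <- function(txt) {
  # lower-case, then split on anything that is not a letter, digit or apostrophe
  tokens <- unlist(strsplit(tolower(txt), "[^[:alnum:]']+"))
  # drop empty strings that can appear at the start of the text
  tokens[nzchar(tokens)]
}

ttr_family <- function(tokens) {
  n <- length(tokens)          # number of tokens (N)
  v <- length(unique(tokens))  # number of types (V)
  c(
    TTR  = v / n,              # type-token ratio
    RTTR = v / sqrt(n),        # guiraud's root TTR
    CTTR = v / sqrt(2 * n),    # carroll's corrected TTR
    C    = log(v) / log(n)     # herdan's C
  )
}

tokens <- simple_tokens("the cat sat on the mat and the dog sat on the rug")
result <- ttr_family(tokens)
print(round(result, 3))
```

all four values depend on text length, which is one reason why the list also contains segment and window based measures like MSTTR, MATTR and MTLD.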

the koRpus-dev mailing list

you are invited to subscribe to our mailing list to discuss the development of the R package 'koRpus'.

in order to subscribe, send an e-mail to majordomo /at/ r.reaktanz.de with the only line "subscribe koRpus-dev-r-reaktanz-de@r.reaktanz.de YOUR@MAIL.ADDRESS", replacing the latter with the address you would like to subscribe.

to unsubscribe, do the same, but replace "subscribe" with "unsubscribe".

citation information

in case you need to cite koRpus for reference, consider the CITATION file or use the following:

Michalke, M. (2018, March). "Entschuldigen Sie, dass ich Ihnen einen komplizierten Artikel schreibe, für einen lesbaren habe ich keine Zeit" -- Textanalyse mit den R-Paketen koRpus & tm.plugin.koRpus. Paper presented at the Tagung experimentell arbeitender Psychologen (TeaP), Marburg.

Michalke, M. (2012, April). koRpus -- ein R-paket zur textanalyse. Paper presented at the Tagung experimentell arbeitender Psychologen (TeaP), Mannheim.

full text corpus support and 'tm' integration

in case you would like to analyze full corpora instead of single texts, or use both koRpus and the tm package for text analysis, check out this add-on package:

# development:
install.packages("tm.plugin.koRpus", repos=c(getOption("repos"), reaktanz="https://reaktanz.de/R"))

download: tm.plugin.koRpus_0.4-2.tar.gz (18.05.2021, 564 kb)
source code: https://github.com/unDocUMeantIt/tm.plugin.koRpus
debian/ubuntu package (depends on recent R packages from CRAN)
NEWS/ChangeLog
license: GPL >= 3
vignette: using the tm.plugin.koRpus package for corpus analysis
documentation: tm.plugin.koRpus.pdf

a token of gratitude

if you appreciate my work and want to say "thanks", please check my wantlist on discogs (just have records sent to the address you find in the imprint/impressum). you're awesome!

work in progress

i'm still working on koRpus (see the ChangeLog). that is, as a price for progress it is possible that sometimes things won't work at all, return faulty results or will behave differently in future releases. however, in general i consider the package to be useful and usable, and i received several reports from various places where it was successfully used. any feedback is most welcome!

still some work needs to be done to fully validate the implementations of various measures for readability and lexical diversity. until then, those functions will trigger a warning to interpret the results with caution. help would be appreciated!
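
one simple way to cross-check a single result is to recompute a formula by hand from raw counts. the following base R sketch does this for the english flesch reading ease score, using the commonly published formula. flesch_re_by_hand() is a hypothetical helper written for this illustration and not part of koRpus, and the counts are made up.

```r
# hand calculation of the (english) flesch reading ease score from raw counts.
# flesch_re_by_hand() is a hypothetical helper, not part of koRpus.
flesch_re_by_hand <- function(words, sentences, syllables) {
  stopifnot(words > 0, sentences > 0, syllables > 0)
  asl <- words / sentences   # average sentence length in words
  asw <- syllables / words   # average number of syllables per word
  206.835 - 1.015 * asl - 84.6 * asw
}

# made-up counts for illustration: 100 words, 5 sentences, 150 syllables
score <- flesch_re_by_hand(words = 100, sentences = 5, syllables = 150)
print(score)
```

if a value computed like this differs from what the package returns for the same counts, the difference most likely comes from how words, sentences or syllables were counted, so it is worth comparing those numbers first.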

RKWard plugin: graphical user interface for koRpus

to make working with koRpus as comfortable as possible, i've also written a plugin for RKWard.

the plugin gets installed/updated automatically with the R package, and recent versions of RKWard will automatically add the plugin to their configuration.