koRpus: an R package for text analysis
koRpus is an R package i originally wrote to measure similarities/differences between texts. over time it grew into what it is now, a hopefully versatile tool to analyze text material in various ways, with an emphasis on scientific research, including readability and lexical diversity features.
web application
to demonstrate some of the core features of koRpus, there is a public web application hosted by the heinrich heine university of düsseldorf. it was realised using the shiny package. the source files for the app come with the koRpus package, so you can also run it locally and change it to your needs.
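running the app locally could look roughly like the following sketch. it assumes that the shiny package is installed and that the app files live in a folder named `shiny` inside the installed koRpus package; that folder name is an assumption and not stated above, and `run_korpus_app` is just a hypothetical helper name.

```r
# hypothetical helper: start the shiny demo app that ships with koRpus
# assumption: the app files are in the "shiny" folder of the installed package
run_korpus_app <- function(app_dir = system.file("shiny", package = "koRpus")) {
  # system.file() returns "" if the package or the folder cannot be found
  if (!nzchar(app_dir) || !dir.exists(app_dir)) {
    stop("could not find the app folder; set 'app_dir' to where the app files are")
  }
  if (!requireNamespace("shiny", quietly = TRUE)) {
    stop("the 'shiny' package is required to run the app")
  }
  shiny::runApp(app_dir)
}

# call this from an interactive R session to launch the app:
# run_korpus_app()
```

to adapt the app, you could copy the folder somewhere else, edit the files, and pass the new location as `app_dir`.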
mailing list
to ask for help, report bugs, suggest feature improvements, or discuss the global development of the package, please use the issue tracker or subscribe to the koRpus-dev mailing list.
getting koRpus
the most recent stable release should be available via CRAN. the most recent development release of koRpus can be installed from my own package repository https://reaktanz.de/R, e.g. directly from an R session:
#stable:
install.packages("koRpus")
#development:
install.packages("koRpus", repos=c(getOption("repos"), reaktanz="https://reaktanz.de/R"))
the package has its own tokenizer, which should suffice for a lot of use cases, but to use all available features an additional installation of TreeTagger is strongly recommended! in that case, koRpus serves as an R wrapper for TreeTagger.
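as a rough illustration of the built-in tokenizer, the following sketch tokenizes a short english string without TreeTagger. it assumes that koRpus and the language support package koRpus.lang.en are installed; the sample sentence and object names are made up for this example, and the TreeTagger path in the commented call is only a placeholder.

```r
# assumption: koRpus and koRpus.lang.en are installed
# (language support can be added with install.koRpus.lang("en"))
if (requireNamespace("koRpus", quietly = TRUE) &&
    requireNamespace("koRpus.lang.en", quietly = TRUE)) {
  library(koRpus)
  library(koRpus.lang.en)

  # made-up sample text for this sketch
  sample_text <- "This is a short sample text. It only has two sentences."

  # format="obj" tells tokenize() to treat the input as a character vector
  # instead of a path to a text file
  tokenized <- tokenize(sample_text, format = "obj", lang = "en")

  # the tokens and their basic tags as a data frame
  print(head(taggedText(tokenized)))

  # with TreeTagger installed, treetag() can be used instead; the path below
  # is a placeholder and must point to your own TreeTagger installation:
  # tagged <- treetag(
  #   "sample_text.txt",
  #   treetagger = "manual",
  #   lang = "en",
  #   TT.options = list(path = "~/bin/treetagger", preset = "en")
  # )
}
```

the resulting object is the kind of input that the analysis functions of the package, such as readability() and lex.div(), work on.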
- download: koRpus_0.13-8.tar.gz (18.05.2021, 458,9 kb)
- source code: https://github.com/unDocUMeantIt/koRpus
- debian/ubuntu package (depends on recent R packages from CRAN)
- NEWS/ChangeLog
- license: GPL >= 3
- documentation: koRpus.pdf
- vignette: using the koRpus package for text analysis
- example analysis: psycholinguistics (text in russian from wikipedia, CC-BY-SA)
features
this is a probably incomplete list of implemented features:
- extensive documentation (including a vignette for an overview)
- R wrapper for the tokenizer & POS tagger TreeTagger
- compliant with the text interchange formats proposed by rOpenSci
- automatic detection of the language a text is written in (testing >350 languages)
- various text transformations (e.g., jumbled words, cloze deletion or C-test format)
- stopword detection
- fully modularized language support (see ?install.koRpus.lang function in the package)
- readability
  - ARI
  - bormuth mean cloze
  - coleman (1--4)
  - coleman-liau
  - dale-chall
  - danielson-bryan (1--2)
  - degrees of reading power
  - dickes-steiwer
  - easy listening formula
  - farr-jenkins-paterson
  - flesch reading ease (incl. de, es, fr & nl)
  - flesch-kincaid grade level
  - FORCAST
  - fucks
  - gunning FOG
  - harris-jacobson (1--5)
  - linsear-write
  - LIX
  - neue wiener sachtextformeln (1--4)
  - RIX
  - SMOG
  - spache
  - strain index
  - text-redundanz-index
  - tränkle-bailer
  - tuldava
  - wheeler-smith
- lexical diversity
  - carroll's corrected TTR
  - dugast's uber index
  - guiraud's root TTR
  - HD-D (vocd-D)
  - herdan's C
  - maas' indices (a, lg(V0), lge(V0) & V')
  - mean segmental TTR (MSTTR)
  - measure of textual lexical diversity (MTLD)
  - moving-average measure of textual lexical diversity (MTLD-MA)
  - moving-average TTR (MATTR)
  - summer's index
  - type-token ratio (TTR)
  - yule's K
- frequency analysis
  - access to existing corpus databases (celex, leipziger corpora collection (LCC))
  - creation of your own corpus databases (read.corp.custom())
  - access to valence databases (read.BAWL() to import BAWL-R)
  - tf-idf
- various statistics (e.g., mean sentence and word length, distribution of word classes)
- supports the use of stemming functions from other packages
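to give an idea of what a few of the simpler measures listed above compute, here is a minimal base R sketch. it is not the koRpus implementation: the naive tokenizer, the sentence counting and all function names are assumptions made for this illustration, and results from koRpus may differ because its tokenizer and options are more elaborate. the formulas are the common definitions of TTR, guiraud's root TTR, carroll's corrected TTR, herdan's C, ARI and LIX.

```r
# naive tokenizer (assumption for this sketch): lower-case the text,
# strip punctuation and split on whitespace
naive_tokens <- function(txt) {
  txt <- paste(txt, collapse = " ")
  cleaned <- gsub("[[:punct:]]+", "", tolower(txt))
  tokens <- unlist(strsplit(cleaned, "[[:space:]]+"))
  tokens[nzchar(tokens)]
}

# a few lexical diversity measures based on type and token counts
lexdiv_sketch <- function(txt) {
  tokens <- naive_tokens(txt)
  n_tokens <- length(tokens)
  n_types <- length(unique(tokens))
  c(
    TTR = n_types / n_tokens,                      # type-token ratio
    root_TTR = n_types / sqrt(n_tokens),           # guiraud
    corrected_TTR = n_types / sqrt(2 * n_tokens),  # carroll
    herdan_C = log(n_types) / log(n_tokens)        # herdan
  )
}

# two readability indices that need no syllable counts
readability_sketch <- function(txt) {
  txt <- paste(txt, collapse = " ")
  tokens <- naive_tokens(txt)
  n_words <- length(tokens)
  # runs of sentence-ending punctuation count as sentences (at least one)
  n_sentences <- max(1, lengths(regmatches(txt, gregexpr("[.!?]+", txt))))
  n_chars <- sum(nchar(tokens))
  n_long <- sum(nchar(tokens) > 6)  # LIX: words with more than six letters
  c(
    ARI = 4.71 * n_chars / n_words + 0.5 * n_words / n_sentences - 21.43,
    LIX = n_words / n_sentences + 100 * n_long / n_words
  )
}

example_text <- "The cat sat on the mat. The dog sat too."
print(lexdiv_sketch(example_text))
print(readability_sketch(example_text))
```

for real analyses, the corresponding koRpus functions are the better choice, as they also take care of proper tokenization, tagging and language-specific details.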
the koRpus-dev mailing list
you are invited to subscribe to our mailing list to discuss the development of the R package 'koRpus'.
in order to subscribe, send an e-mail to majordomo /at/ r.reaktanz.de with the only line "subscribe koRpus-dev-r-reaktanz-de@r.reaktanz.de YOUR@MAIL.ADDRESS", replacing the latter with the address you would like to subscribe.
to unsubscribe, do the same, but replace "subscribe" with "unsubscribe".
citation information
in case you need to cite koRpus for reference, consider the CITATION file or use the following:
Michalke, M. (2018, March). "Entschuldigen Sie, dass ich Ihnen einen komplizierten Artikel schreibe, für einen lesbaren habe ich keine Zeit" -- Textanalyse mit den R-Paketen koRpus & tm.plugin.koRpus. Paper presented at the Tagung experimentell arbeitender Psychologen (TeaP), Marburg.
Michalke, M. (2012, April). koRpus -- ein R-paket zur textanalyse. Paper presented at the Tagung experimentell arbeitender Psychologen (TeaP), Mannheim.
full text corpus support and 'tm' integration
in case you would like to analyze full corpora instead of single texts, or use both koRpus and the tm package for text analysis, check out this add-on package:
#development:
install.packages("tm.plugin.koRpus", repos=c(getOption("repos"), reaktanz="https://reaktanz.de/R"))
- download: tm.plugin.koRpus_0.4-2.tar.gz (18.05.2021, 564 kb)
- source code: https://github.com/unDocUMeantIt/tm.plugin.koRpus
- debian/ubuntu package (depends on recent R packages from CRAN)
- NEWS/ChangeLog
- license: GPL >= 3
- vignette: using the tm.plugin.koRpus package for corpus analysis
- documentation: tm.plugin.koRpus.pdf
a token of gratitude
if you appreciate my work and want to say "thanks", please check my wantlist on discogs (just have records sent to the address you find in the imprint/impressum). you're awesome!
work in progress
i'm still working on koRpus (see the ChangeLog). as a price for progress, it is possible that things sometimes won't work at all, return faulty results, or behave differently in future releases. however, in general i consider the package to be useful and usable, and i have received several reports from various places where it was used successfully. any feedback is most welcome!
some work still needs to be done to fully validate the implementations of the various readability and lexical diversity measures. until then, those functions will trigger a warning to interpret the results with caution. help would be appreciated!
RKWard plugin: graphical user interface for koRpus
to make working with koRpus as comfortable as possible, i've also written a plugin for RKWard.
the plugin gets installed/updated automatically with the R package, and recent versions of RKWard will automatically add the plugin to their configuration.