tm.plugin.koRpus

Changes in tm.plugin.koRpus version 0.4-2

Mon, 17 May 2021 00:00:00 +0000

fixed

updated test standards after changes to koRpus' internal calculation of the number of lines in texts imported from TIF data frames

changed

kRp.corpus: replaced prototype() in class definition with initialize method
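
The switch from prototype() to an initialize method can be pictured with a minimal S4 sketch. The class "demoCorpus" and its slots are hypothetical and are not the actual kRp.corpus definition; the sketch only shows where default values move to.

```r
library(methods)

# Hypothetical class for illustration; NOT the real kRp.corpus definition.
# Note that no prototype() is given here.
setClass(
  "demoCorpus",
  representation(
    lang = "character",
    tokens = "data.frame"
  )
)

# Default values are built in the initialize method instead.
setMethod(
  "initialize", "demoCorpus",
  function(.Object, lang = character(), tokens = NULL, ...) {
    if (is.null(tokens)) {
      # assumed default: an empty token table with two columns
      tokens <- data.frame(
        doc_id = character(),
        token = character(),
        stringsAsFactors = FALSE
      )
    }
    # hand the prepared slot values to the default initialize method
    callNextMethod(.Object, lang = lang, tokens = tokens, ...)
  }
)

empty_corpus <- new("demoCorpus")
```

With this pattern, defaults are computed each time new() is called rather than being fixed once in the class definition.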

Changes in tm.plugin.koRpus version 0.4-1

Thu, 17 Dec 2020 00:00:00 +0000

fixed

docTermMatrix() : results were wrong because numbers were assigned to wrong columns; now fixed in koRpus
unit tests failed on Windows due to a UTF-8 issue

changed

the nested object class kRp.hierarchy was replaced by kRp.corpus; instead of reproducing the file hierarchy in the object structure, kRp.corpus has a flat structure with all texts in one single data frame, which was also renamed from "TT.res" to "tokens"
the class name kRp.corpus was used in tm.plugin.koRpus before and is just being recycled ;)
kRp.corpus inherits from class kRp.text as defined in the koRpus package
status messages are currently shown only when a single CPU is used
corpusTagged() : now called taggedText() as in koRpus
corpusDesc() : now called describe() as in koRpus
[, [<-, [[ and [[<- methods no longer apply to the summary data frame but to the tokens slot, as in koRpus (where they apply to the TT.res slot)
show() : kRp.corpus objects now list all available features
read.corp.custom() : removed unused mc.cores argument
docTermMatrix() : by default now behaves like most other methods and adds its result to the input object rather than returning just the matrix; also, the generic is now defined by the koRpus package, so it was removed from this package, including all of the actual function code
adjusted unit tests and vignette
updated all examples to use a new sample corpus (see "added"), which allowed many "\dontrun{}" cases to be removed
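
The flat layout described above for kRp.corpus can be sketched in base R, without koRpus installed. The column "source" stands in for one hierarchy level; all column names and values are illustrative assumptions, not the package's actual layout.

```r
# Hypothetical miniature of a flat "tokens" data frame:
# all texts share one table and are told apart by doc_id.
tokens <- data.frame(
  doc_id = c("a.txt", "a.txt", "a.txt", "b.txt", "b.txt"),
  token  = c("This", "is", "text", "Another", "one"),
  source = c("blog", "blog", "blog", "news", "news"),
  stringsAsFactors = FALSE
)

# all tokens of a single document
tokens_a <- tokens[tokens$doc_id == "a.txt", ]

# number of tokens per document
tokens_per_doc <- table(tokens$doc_id)

# one data frame per document, comparable in spirit to split_by_doc_id()
by_doc <- split(tokens, tokens$doc_id)
```

Because everything lives in one table, filtering by document or by hierarchy level is plain row subsetting rather than a walk through nested objects.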

added

readCorpus() : the hierarchy levels of a text corpus can now be assumed directly from the directory structure by setting "hierarchy=TRUE"
corpusHasFeatures() , corpusHasFeatures()<-, corpusFeatures() , corpusFeatures()<-, corpusHierarchy() , corpusHierarchy()<-, corpusCorpFreq() , corpusCorpFreq()<-, diffText() , diffText()<-, originalText() : new getter/setter methods for kRp.corpus objects
split_by_doc_id() : new method transforms a kRp.corpus object into a list of kRp.text objects
corpusDocTermMatrix() : new method to get/set the sparse document term matrix in kRp.corpus objects
[[/[[<-: gained new argument "doc_id" to limit the scope to particular documents
describe() /describe()<-: now support filtering by doc_id
new sample corpus for use in examples
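
As a rough illustration of the "hierarchy=TRUE" idea in readCorpus() , hierarchy levels can be read off relative file paths. The paths and the level names "source" and "topic" below are made up, and this is not the package's actual implementation.

```r
# Hypothetical relative paths with the layout <source>/<topic>/<file>
paths <- c(
  "blog/politics/text01.txt",
  "blog/sports/text02.txt",
  "news/politics/text03.txt"
)

# split each path into its directory components
parts <- strsplit(paths, "/", fixed = TRUE)

# one row per document; each directory level becomes a category column
hierarchy <- data.frame(
  doc_id = vapply(parts, function(p) p[length(p)], character(1)),
  source = vapply(parts, function(p) p[1], character(1)),
  topic  = vapply(parts, function(p) p[2], character(1)),
  stringsAsFactors = FALSE
)
```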

removed

removed all classes and methods dealing with kRp.hierarchy
removed deprecated methods of the pre-kRp.hierarchy era
removed generic of tif_as_tokens_df() as it was moved to the koRpus package

Changes in tm.plugin.koRpus version 0.3-1

Tue, 14 May 2019 00:00:00 +0000

fixed

readCorpus() : solved a cryptic warning when more than one text was tokenized

added

docTermMatrix() : new method to generate document-term matrices, either with absolute frequencies or tf-idf values
query() : new method, extending the generic of koRpus >= 0.12-1
filterByClass() : new method, extending the generic of koRpus >= 0.12-1
jumbleWords() : new method, extending the generic of koRpus >= 0.12-1
clozeDelete() : new method, extending the generic of koRpus >= 0.12-1
cTest() : new method, extending the generic of koRpus >= 0.12-1
textTransform() : new method, extending the generic of koRpus >= 0.12-1
show() : new method for objects of class kRp.hierarchy
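
What a sparse document-term matrix with absolute frequencies or tf-idf values looks like can be sketched with the Matrix package alone. The token table is invented, and the weighting tf * log(N / df) is an assumed textbook variant; docTermMatrix() may weight differently.

```r
library(Matrix)

# Hypothetical token table: one row per token
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3"),
  token  = c("cat", "dog", "cat", "dog", "fish", "cat"),
  stringsAsFactors = FALSE
)

docs  <- factor(tokens$doc_id)
terms <- factor(tokens$token)

# absolute frequencies; sparseMatrix() adds up the values
# of duplicated (i, j) pairs, which yields the counts
dtm <- sparseMatrix(
  i = as.integer(docs),
  j = as.integer(terms),
  x = rep(1, nrow(tokens)),
  dims = c(nlevels(docs), nlevels(terms)),
  dimnames = list(levels(docs), levels(terms))
)

# assumed tf-idf variant: term frequency * log(N / document frequency)
doc_freq <- colSums(dtm > 0)
idf <- log(nrow(dtm) / as.numeric(doc_freq))
tfidf <- dtm %*% Diagonal(x = idf)
dimnames(tfidf) <- dimnames(dtm)
```

Keeping the matrix sparse matters for real corpora, where most terms occur in only a few documents.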

changed

depends on koRpus >= 0.12-1 now
depends on the Matrix package now (for docTermMatrix() )
adjusted test standards to include the additional POS tags from koRpus >= 0.12-1

Changes in tm.plugin.koRpus version 0.02-2

Fri, 18 Jan 2019 00:00:00 +0000

fixed

readCorpus() , kRpSource() : added missing imports from packages tm, NLP and parallel
readCorpus() : fixed status message formatting
corpusTm() : removed useless "level" argument and corrected the output
readCorpus() : removed unused "level" argument
corpusFiles() : now also works with flat hierarchy objects

added

readCorpus() : can now also import data frames in TIF format, including support for hierarchical categories
tif_as_corpus_df() : new S4 method to transform a kRp.hierarchy object into a TIF compliant data frame
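
For orientation, a corpus data frame in the Text Interchange Format (TIF) has the character columns "doc_id" and "text" first, with unique document IDs; further columns may carry document-level metadata. The sketch below builds one and checks these basics. The helper is_tif_corpus_df() and the "source" column are hypothetical, and how readCorpus() maps extra columns to hierarchy levels is not shown here.

```r
# Minimal corpus data frame in the TIF layout
corpus_df <- data.frame(
  doc_id = c("text01", "text02"),
  text   = c("First short text.", "Second short text."),
  source = c("blog", "news"),  # hypothetical category column
  stringsAsFactors = FALSE
)

# Hypothetical helper: checks the basic TIF requirements for corpus data frames
is_tif_corpus_df <- function(x) {
  is.data.frame(x) &&
    ncol(x) >= 2 &&
    identical(names(x)[1:2], c("doc_id", "text")) &&
    is.character(x$doc_id) &&
    is.character(x$text) &&
    !anyDuplicated(x$doc_id)
}

valid <- is_tif_corpus_df(corpus_df)
```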

changed

readCorpus() : the tm corpora now include full hierarchy metadata
removed pre-hierarchy portions from internal function whatIsAvailable()

Changes in tm.plugin.koRpus version 0.02-1

Sun, 29 Jul 2018 00:00:00 +0000

changed

vignette: also includes info on readCorpus()
tests: adjusted test standards to new object class

added

kRp.hierarchy: new S4 class to replace kRp.sourcesCorpus and kRp.topicCorpus to allow more generic nesting of hierarchical levels
readCorpus() : new function to generate kRp.hierarchy objects recursively
many corpus*() getter functions can now filter by hierarchy level or category ID
removed all code regarding simpleCorpus() , sourcesCorpus() and topicCorpus() , their object classes and methods; this is all handled much more flexibly by kRp.hierarchy and readCorpus() now

Changes in tm.plugin.koRpus version 0.01-4

Wed, 07 Mar 2018 00:00:00 +0000

fixed

sourcesCorpus() : speak of "text" instead of "texts" if it's only one

changed

adjusted package to support koRpus >= 0.11 and sylly, especially with regard to summary() , hyphen() , and new class constructors
summary() : for more coherence with the koRpus package the "text" column in the summary slot was renamed to "doc_id"
reaktanz.de supports HTTPS now, updated references
vignette is now in RMarkdown/HTML format; the Sweave/PDF version was dropped
hyphen() /lex.div() /readability(): 'quiet' is now TRUE by default
lex.div() : 'char' is now an empty string by default; computing all characteristics was not a useful default for large text corpora

added

README.md
new [, [<-, [[ and [[<- methods added for corpus object classes
new methods tif_as_tokens_df() to export corpus objects as a single data.frame in fully TIF compliant format
summary() : now also includes the total number of stopwords (if available)
new class object constructors kRp_corpus() , kRp_sourcesCorpus() , and kRp_topicCorpus() can be used instead of new("kRp.corpus", ...) etc.

Changes in tm.plugin.koRpus version 0.01-3

Tue, 12 Jul 2016 00:00:00 +0000

fixed

the arguments that simpleCorpus() was supposed to pipe to DirSource() weren't used

changed

the "paths" argument of topicCorpus() now expects a list, not a vector
using the parallel package to be able to use more CPU cores

added

new argument "format" for simpleCorpus() , sourcesCorpus() , and topicCorpus() , to be able to work with text objects directly, instead of files

Changes in tm.plugin.koRpus version 0.01-2

Wed, 08 Jul 2015 00:00:00 +0000

changed

using the S4 methods of koRpus 0.06-1 now, therefore renamed all methods removing the *.corpus suffix (e.g., lex.div.corpus() is now lex.div() )
renamed classes into kRp.corpus, kRp.sourcesCorpus and kRp.topicCorpus, and their generator functions accordingly

added

new methods read.corp.custom() , freq.analysis() and summary()
new getter/setter methods: corpusSources() , corpusTopics() , corpusFreq() , corpusSummary()
first basic unit tests, using the testthat package
new option "summary" for lex.div() and readability() , to automatically update the summary data.frames
first notes in a vignette

Changes in tm.plugin.koRpus version 0.01-1

Mon, 29 Jun 2015 00:00:00 +0000

added

initial release