NEWS {tm.plugin.koRpus} | R Documentation |
updated test standards after changes to koRpus' internal calculations of numer of lines in texts imported from TIF data frames
kRp.corpus: replaced prototype()
in class definition with initialize
method
docTermMatrix()
: results were wrong because numbers were assigned to
wrong columns; now fixed in koRpus
unit tests failed on windows due to an UTF-8 issue
the nested object class kRp.hierarchy was replaced by kRp.corpus; instead
of reproducing the file hierarchy in the object structure, kRp.corpus has
a flat structure with all texts in one single data frame; this data frame
was also renamed from "TT.res"
into "tokens"
the class name kRp.corpus
was used in tm.plugin.koRpus before and is just being recycled ;) kRp.corpus
inherits from class kRp.text as defined in the koRpus package
status messages are currently only shown when only one CPU is used
corpusTagged()
: now called taggedText()
as in koRpus
corpusDesc()
: now called describe()
as in koRpus
[, [<-, [[ and [[<- methods no longer apply to the summary data frame but tokens slot as in koRpus (where it applies to the TT.res slot)
show()
: kRp.corpus objects now list all available features
read.corp.custom()
: removed unused mc.cores argument
docTermMatrix()
: by default behaves like most other methods and adds its
result to the input object rather than returning just the matrix; also,
the generic is now defined by the koRpus package and was removed, including
all of the actual function code
adjusted unit tests and vignette
updated all examples to use a new sample corpus (see added), to the benefit that many "\dontrun{}" cases could be removed
readCorpus()
: the hierarchy levels of a text corpus can now be assumed
directly from the directory structure by setting "hierarchy=TRUE"
corpusHasFeatures()
, corpusHasFeatures()
<-, corpusFeatures()
,
corpusFeatures()
<-, corpusHierarchy()
, corpusHierarchy()
<-, corpusCorpFreq()
,
corpusCorpFreq()
<-, diffText()
, diffText()
<-, originalText()
: new getter/setter
methods for kRp.corpus objects
split_by_doc_id()
: new method transforms a kRp.corpus object into a list
of kRp.text objects
corpusDocTermMatrix()
: new method to get/set the sparse document term
matrix in kRp.corpus objects
[[/[[<-: gained new argument "doc_id"
to limit the scope to particular
documents
describe()
/describe()<-: now support filtering by doc_id
new sample corpus for use in examples
removed all classes and methods dealing with kRp.hierarchy
removed deprecated methods of the pre-kRp.hierarchy era
removed generic of tif_as_tokens_df()
as it was moved to the koRpus
package
readCorpus()
: solved a cryptic warning when more than one text was
tokenized
docTermMatrix()
: new method to generate document-term matrices, either
with absolute frequencies or tf-idf values
query()
: new method, extending the generic of koRpus >= 0.12-1
filterByClass()
: new method, extending the generic of koRpus >= 0.12-1
jumbleWords()
: new method, extending the generic of koRpus >= 0.12-1
clozeDelete()
: new method, extending the generic of koRpus >= 0.12-1
cTest()
: new method, extending the generic of koRpus >= 0.12-1
textTransform()
: new method, extending the generic of koRpus >= 0.12-1
show()
: new method for objects of class kRp.hierarchy
depends on koRpus >= 0.12-1 now
depends on the Matrix package now (for docTermMatrix()
)
adjusted test standards to include the additional POS tags from koRpus >= 0.12-1
readCorpus()
, kRpSource()
: added missing imports from packages tm, NLP
and parallel
readCorpus()
: fixed status message formatting
corpusTm()
: removed useless "level"
argument and corrected the output
readCorpus()
: removed unused "level"
argument
corpusFiles()
: now also works with flat hierarchy objects
readCorpus()
: can now also import data frames in TIF format, including
support for hierarchal categories
tif_as_corpus_df()
: new S4 method to transform a kRp.hierarchy object
into a TIF compliant data frame
readCorpus()
: the tm corpora now include full hierarchy metadata
removed pre-hierarchy portions from internal function whatIsAvailable()
vignette: also includes info on readCorpus()
tests: adjusted test standards to new object class
kRp.hierarchy: new S4 class to replace kRp.sourcesCorpus and kRp.topicCorpus to allow more generic nesting of hierarchical levels
readCorpus()
: new function to generate kRp.hierarchy objects recursively
many corpus*() getter functions can now filter by hierarchy level or category ID
removed all code regarding simpleCorpus()
, sourcesCorpus()
and
topicCorpus()
, their object classes and methods; this is all handled much more
flexible by kRp.hierarchy and readCorpus()
now
sourcesCorpus()
: speak of "text"
instead of "texts"
if it's only one
adjusted package to support koRpus >= 0.11 and sylly, especially with
regards to summary()
, hyphen()
, and new class contructors
summary()
: for more coherence with the koRpus package the "text"
column
in the summary slot was renamed into "doc_id"
reaktanz.de supports HTTPS now, updated references
vignette is now in RMarkdown/HTML format; the SWeave/PDF version was dropped
hyphen()
/lex.div()
/readability(): 'quiet' is now TRUE by default
lex.div()
: 'char' is now an emtpy string by default; computing all
characteristics was not a useful default for large text corpora
README.md
new [, [<-, [[ and [[<- methods added for corpus object classes
new methods tif_as_tokens_df()
to export corpus objects as a single
data.frame in fully TIF compliant format
summary()
: now also includes the total number of stopwords (if available)
new class object contructors kRp_corpus()
, kRp_sourcesCorpus()
, and
kRp_topicCorpus()
can be used instead of new("kRp.corpus"
, ...) etc.
the arguments that simpleCorpus()
was supposed to pipe to DirSource()
weren't used
the "paths"
argument of topicCorpus()
now expects a list, not a vector
using the parallel package to be able to use more CPU cores
new argument "format"
for simpleCorpus()
, sourceCorpus()
, and
topicCorpus()
, to be able to work with text objects directly, instead of files
using the S4 methods of koRpus 0.06-1 now, therefore renamed all methods
removing the *.corpus suffix (e.g., lex.div.corpus()
is now lex.div()
)
renamed classes into kRp.corpus, kRp.sourcesCorpus and kRp.topicCorpus, and their generator functions accordingly
new methods read.corp.custom()
, freq.analysis()
and summary()
new getter/setter methods: corpusSources()
, corpusTopics()
, corpusFreq()
,
corpusSummary()
first basic unit tests, using the testthat package
new option "summary"
for lex.div()
and readability()
, to automatically
update the summary data.frames
first notes in a vignette
initial release