Change Log

1.2.1 (2014-11-19)

Changed license to Apache License version 2. (Issue 17)
Note that TreeTagger itself is subject to different license terms, which are available from the TreeTagger website.

Support for getting multiple tags/lemmas and their probabilities. This feature requires a TreeTagger binary newer than 2012-04-25. When used with earlier versions, it will just hang. At the time of writing, the TreeTagger versions for OS X (Intel), Windows and Linux support this feature. The versions for Solaris and OS X (PPC) may not be updated to support it. TT4J continues to work with other/older TreeTagger versions as long as this feature is not used. (Issue 13)
Improved parsing of TreeTagger output.

Changed default flush sequence to work with the TreeTagger model for Chinese. (Issue 6 - thanks Jérôme)

Added detection of communication with TreeTagger running out of sync due to odd characters appearing in tokens. This check can be disabled, but strict mode is on by default.
Added a setting for the maximum token length (default 90000 bytes). TreeTagger seems to have a limit of 99998 bytes per token and crashes when this is exceeded.
Improved handling of a crashed TreeTagger process.
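
The token length limit is counted in bytes, not characters, so a token with non-ASCII characters reaches it sooner than its character count suggests. The following is a minimal sketch of such a check. It is not TT4J's actual code: the class TokenLengthCheck and its method are hypothetical, and UTF-8 is assumed as the encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of TT4J.
final class TokenLengthCheck {
    // Default limit as stated in the change log entry (in bytes).
    static final int DEFAULT_MAX_BYTES = 90000;

    // Returns true if the token, encoded as UTF-8 (an assumption),
    // occupies at most maxBytes bytes.
    static boolean fitsLimit(String token, int maxBytes) {
        return token.getBytes(StandardCharsets.UTF_8).length <= maxBytes;
    }
}
```

A caller could skip or truncate any token for which fitsLimit returns false before sending it to the tagger.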

Fixed bug: Resource not properly destroyed when an exception is thrown in reader/writer thread.
Improvement: Try harder to get to end-of-text mark.
Improvement: Added tracing of start and end marks.

Improvement: Massively improved throughput when processing a large number of documents.
Improvement: Try to gracefully handle cases where TT does not produce a “token tag lemma” line. Return null for tag and lemma in these cases.
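
The graceful handling described above can be sketched as follows, assuming the output line uses tab-separated fields. The class TaggerLine is hypothetical and only illustrates the idea. It is not TT4J's actual parser.

```java
// Hypothetical sketch for illustration only; not TT4J's actual parser.
final class TaggerLine {
    final String token;
    final String tag;   // null when the line is not a full "token tag lemma" line
    final String lemma; // null when the line is not a full "token tag lemma" line

    private TaggerLine(String token, String tag, String lemma) {
        this.token = token;
        this.tag = tag;
        this.lemma = lemma;
    }

    // Parses one output line. Fields are assumed to be separated by tabs.
    static TaggerLine parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 3) {
            // Incomplete line: keep the first field as the token and
            // report tag and lemma as null instead of failing.
            return new TaggerLine(fields[0], null, null);
        }
        return new TaggerLine(fields[0], fields[1], fields[2]);
    }
}
```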

Improvement: Added tracing.
Improvement: Improved robustness ignoring illegal tokens (e.g. containing tabs or line breaks).
Improvement: Added performance mode which does not check for illegal tokens.
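
Tokens containing tabs or line breaks are presumably a problem because the tagger is fed one token per line and its output is split into fields. A minimal sketch of such a filter is shown below. The class TokenFilter is hypothetical and not part of TT4J. A performance mode would simply skip this pass.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of TT4J.
final class TokenFilter {
    // A token is treated as legal if it contains no tab and no line break.
    static boolean isLegal(String token) {
        return token.indexOf('\t') < 0
            && token.indexOf('\n') < 0
            && token.indexOf('\r') < 0;
    }

    // Returns a new list containing only the legal tokens, in order.
    static List<String> dropIllegal(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (isLegal(token)) {
                result.add(token);
            }
        }
        return result;
    }
}
```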

Improvement: Allow setting the parameters -eps and -hyphen-heuristics needed to use TT4J with chunker models. Now a chunker can be built on top of TT4J.