Simple tokenizer using the Java BreakIterator

Sometimes users ask for TT4J to include a tokenizer. I will not include a ready-to-use tokenizer with TT4J, since there are other libraries that do a much better job here. For example, a good tokenizer for English is included with the Stanford Parser.

If you do not wish to look for a good tokenizer for your task, you may find this method useful. It uses BreakIterator (java.text.BreakIterator), a class that ships with Java and locates word boundaries in text, as a simple tokenizer.

	import java.text.BreakIterator;
	import java.util.ArrayList;
	import java.util.List;

	List<String> tokenize(
			final String aString)
	{
		List<String> tokens = new ArrayList<String>();
		BreakIterator bi = BreakIterator.getWordInstance();
		// Attach the text to be segmented to the iterator
		bi.setText(aString);
		int begin = bi.first();
		int end;
		for (end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {
			String t = aString.substring(begin, end);
			// Skip segments that consist only of whitespace
			if (t.trim().length() > 0) {
				tokens.add(t);
			}
			begin = end;
		}
		return tokens;
	}
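
As a usage sketch, the method above can be wrapped in a small class and called on a sample sentence. The class name BreakIteratorTokenizerDemo and the sample input are made up for illustration, and the explicit Locale.ENGLISH is an assumption added here so that the result does not depend on the default locale of the machine.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the name is not part of TT4J.
final class BreakIteratorTokenizerDemo {

    // Splits a string at the word boundaries reported by BreakIterator
    // and drops segments that contain only whitespace.
    static List<String> tokenize(final String aString) {
        List<String> tokens = new ArrayList<String>();
        // Assumption: English boundary rules, fixed for reproducibility
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ENGLISH);
        bi.setText(aString);
        int begin = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {
            String t = aString.substring(begin, end);
            if (t.trim().length() > 0) {
                tokens.add(t);
            }
            begin = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one token per line
        for (String token : tokenize("Hello, world!")) {
            System.out.println(token);
        }
    }
}
```

For the input "Hello, world!" this yields the four tokens Hello, the comma, world, and the exclamation mark. Punctuation becomes a token of its own, which is usually what a tagger expects, but the boundary rules are generic and know nothing about abbreviations or other language-specific cases, so a dedicated tokenizer remains the better choice for serious work.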