Sometimes users ask for TT4J to include a tokenizer. I will not include a ready-to-use tokenizer with TT4J, since there are other libraries that do a much better job here. A good tokenizer for English for example is included with the Stanford Parser.
If you do not wish to look for a good tokenizer for your task, you may find this method useful. It uses a simple tokenizer called !BreakIterator which ships with Java.
public
List<String> tokenize(
final String aString)
{
List<String> tokens = new ArrayList<String>();
BreakIterator bi = BreakIterator.getWordInstance();
bi.setText(aString);
int begin = bi.first();
int end;
for (end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {
String t = aString.substring(begin, end);
if (t.trim().length() > 0) {
tokens.add(aString.substring(begin, end));
}
begin = end;
}
if (end != -1) {
tokens.add(aString.substring(end));
}
return tokens;
}