Is it possible that CG is deliberately hiding some details of their requirements? I mean how exactly to count occurrences of the word "world". CG says the words are separated by punctuation marks, but I still see some unclear cases. E.g. if the file contains:

"world'abcd(world!worldly,worlds;" // is that 2 hits or 4?
"world:WORLD" // is that 1 or 2?

It is also unclear what exactly counts as a punctuation mark. Of the 65536 values a Java char can hold, which ones exactly should I treat as punctuation marks that separate words? According to one description on the net, these are the ASCII punctuation marks:

! " # $ % & ' ( ) * + , - . / : ; ? @ [ \ ] ^ _ ` { | } ~

However, if I use string.replaceAll("\\p{Punct}", "") or string.split("\\p{Punct}"), which are supposed to remove punctuation, more characters are removed than just those in the list above. They also remove characters that the Character.getType() method classifies as MATH_SYMBOL (and not as any of the *_PUNCTUATION types).

In summary, in the example below I have not used typical punctuation to separate the 11 occurrences of "world", but various other characters from the 8-bit (extended ASCII) code table (but not letters). (And that only covers 256 characters; what about the other 65280?) So how do I handle the string below? Is it 11 hits of "world", or is it just a single string with a lot of meaningless characters that counts as 0 hits?

"worldworldworldworldworldˆworld‰worldŠworld«world»world¿world"
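
To make the ambiguity concrete, here is a minimal sketch of two possible readings in Java. Everything in it is my own assumption rather than anything CG specifies: the class name WorldCount, the helper names, and above all the two separator rules (split on ASCII \p{Punct} only, or split on anything that is not a letter). It also shows how Character.getType() disagrees with \p{Punct} for a character like '<'.

```java
import java.util.regex.Pattern;

final class WorldCount {
    // Reading 1 (assumption): separators are ASCII punctuation only.
    // Without the UNICODE_CHARACTER_CLASS flag, \p{Punct} is ASCII-only.
    static final Pattern ASCII_PUNCT = Pattern.compile("\\p{Punct}+");

    // Reading 2 (assumption): anything that is not a Unicode letter separates words.
    static final Pattern NON_LETTERS = Pattern.compile("[^\\p{L}]+");

    // Counts tokens that are exactly "world" after splitting on the given separator.
    static int count(String text, Pattern separator, boolean ignoreCase) {
        int hits = 0;
        for (String token : separator.split(text)) {
            boolean match = ignoreCase
                    ? token.equalsIgnoreCase("world")
                    : token.equals("world");
            if (match) {
                hits++;
            }
        }
        return hits;
    }

    // True if Character.getType() puts c in one of the *_PUNCTUATION categories.
    static boolean isPunctuationType(char c) {
        switch (Character.getType(c)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        String first = "world'abcd(world!worldly,worlds;";
        String second = "world:WORLD";

        // "worldly" and "worlds" are whole tokens, so they are not counted here.
        System.out.println(count(first, ASCII_PUNCT, false));   // prints 2
        System.out.println(count(second, ASCII_PUNCT, false));  // prints 1
        System.out.println(count(second, ASCII_PUNCT, true));   // prints 2

        // '<' is matched by \p{Punct} but is MATH_SYMBOL for Character.getType().
        System.out.println(ASCII_PUNCT.matcher("<").matches()); // prints true
        System.out.println(isPunctuationType('<'));             // prints false
    }
}
```

Under reading 1 a string whose separators are all outside ASCII stays a single token and gives 0 hits, while reading 2 counts every letter run that equals "world". Which of these (if either) CG intends is exactly the open question.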