Language or not


What statistical characteristics can be used to identify language?

Suppose I hand you a text in a language that you don't know. Perhaps it's in a Latin-based orthography, perhaps not. Let's assume for the sake of simplicity that it has enough spaces to be broken, apparently, into words.

Can you tell me whether or not that text is actually language?

After all, I can generate random strings of letters (and spaces) that will more or less look like language. Let's assume for the time being that this "generator" works on principles that we don't know. It's a black box: you push the button, and out comes "some language."
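As a starting point, here is a minimal sketch of the crudest possible generator: each character is a space with some fixed probability, otherwise a uniformly random letter. The function name `fake_text`, the lowercase Latin alphabet, and the `space_prob` value are all assumptions made for illustration; the real black box is supposed to work on principles we don't know.

```python
import random
import string

def fake_text(n_chars, alphabet=string.ascii_lowercase, space_prob=0.18, seed=None):
    """Return n_chars random characters drawn from alphabet plus spaces.

    space_prob is an arbitrary illustrative value, not a measured one.
    """
    rng = random.Random(seed)  # private RNG so a seed gives repeatable output
    chars = []
    for _ in range(n_chars):
        if rng.random() < space_prob:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    return "".join(chars)

print(fake_text(60, seed=0))
```

Output like this is easy to spot as fake, which is the point: a more convincing generator would need to mimic more of the statistics of real text.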

Now, let's make it more interesting. Suppose that sometimes, our black box outputs real text from real language, but uses some simple encryption algorithm to disguise it. In other words, it might eat some Spanish text, and then go about substituting "a" for "o", "m" for "n", whatever, until out the other end comes something that looks nothing like Spanish.
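The disguise described here amounts to a simple substitution cipher. A minimal sketch, assuming a lowercase Latin alphabet and hypothetical helper names `make_key` and `encode`: the key is a random one-to-one mapping of letters, and anything outside the alphabet (spaces, punctuation, accented letters) passes through unchanged.

```python
import random
import string

def make_key(alphabet=string.ascii_lowercase, seed=None):
    """Build a random one-to-one letter substitution over alphabet."""
    rng = random.Random(seed)
    shuffled = list(alphabet)
    rng.shuffle(shuffled)
    return dict(zip(alphabet, shuffled))

def encode(text, key):
    """Substitute each letter via key; other characters are left as they are."""
    return "".join(key.get(ch, ch) for ch in text.lower())

key = make_key(seed=42)
faux = encode("el perro come la manzana", key)
print(faux)
```

Because the key is a bijection, inverting the dictionary gives a decoder, so no information is lost in the disguise.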

Yes, this Faux Spanish will have exactly the same statistical properties as the original Spanish text, up to a relabeling of the letters. (The catch is that it can no longer be compared directly to other Spanish texts, since we wouldn't know to compare it to Spanish in the first place.)
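To make "same statistical properties" concrete, the sketch below checks two quantities that a letter substitution cannot change: the sorted list of letter counts (which letter carries which count is lost, but the shape of the distribution is not) and the sequence of word lengths. The sample sentence, the helper names, and the reading of the substitution as a pairwise swap (a with o, m with n) are assumptions for illustration.

```python
from collections import Counter

def frequency_profile(text):
    """Letter counts sorted from largest to smallest, ignoring letter identity."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(counts.values(), reverse=True)

def word_lengths(text):
    """Length of each space-separated word, in order."""
    return [len(word) for word in text.split()]

original = "el perro come la manzana en la casa"
# Hypothetical key: swap a<->o and m<->n, leave everything else alone.
swap = str.maketrans("aomn", "oanm")
faux = original.translate(swap)

print(faux)
print(frequency_profile(original) == frequency_profile(faux))
print(word_lengths(original) == word_lengths(faux))
```

Any metric built only from such relabeling-invariant quantities gives the same answer for the original and the faux text.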

Here's the question: given such a black box, is it possible to determine whether a given output is an encoded "real" language or not?

The question is partly rhetorical: if the investigator succeeds in finding some statistical metric that identifies the fake (not faux) language, that metric could be incorporated into the generation algorithm.

Nonetheless, it seems like a fun game. Let's try building the black box, and then devising identifiers that try to tell fake from faux.
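One candidate for a first identifier, offered only as a sketch and not as a settled method, is the Shannon entropy of the letter distribution. It is unchanged by letter substitution, and a generator that picks letters uniformly should score close to the maximum (log2 of the alphabet size), while real text tends to score lower because some letters are much more common than others. The function name `letter_entropy` is an assumption, and where to put the threshold between "fake" and "faux" is left open.

```python
import math
from collections import Counter

def letter_entropy(text):
    """Shannon entropy, in bits per letter, of the letter distribution (spaces ignored)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "el perro come la manzana en la casa"
print(letter_entropy(sample))
print(math.log2(26))  # upper bound for a 26-letter alphabet
```

As noted above, a generator that samples letters from a realistic frequency table would defeat this particular metric, so it is a first move in the game and nothing more.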