Conversation with #blogamundo at 2007-03-23 14:50:24 on patbam@irc.freenode.net (irc)
(14:50:24) The topic for #blogamundo is: in which patbam attempts to woo people into a wikipedia project; http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Endangered_languages#Perhaps_we_should_all_pick_a_single_language_to_work_on__for_a_while.3F
(14:50:26) patbam: meep
(14:50:38) patbam: i pingeth sbp chrys_desk
(14:54:25) chrys_desk: plon
(14:55:26) patbam: hey chrys
(14:55:32) patbam: i did sometin neato :P
(14:55:45) patbam: it is not that complex but it's interesting
(14:55:48) patbam: some python code
(14:56:25) chrys_desk: ah?
(15:02:59) patbam: http://ruphus.com/stash/out.txt
(15:03:01) patbam: yeah that's the output
(15:03:10) patbam: the input looks like this:
(15:03:36) patbam: http://ruphus.com/stash/samplein.txt
(15:03:55) patbam: it's a suuuuper dumb attempt at finding a transliteration correspondence
(15:04:45) patbam: not bad for 25 lines hehe
(15:14:38) chrys_desk: sorry had to chat about a client problem
(15:17:21) patbam: no worries
(15:19:05) chrys_desk: was is das letzte in jeder zeile?
(15:20:00) chrys_desk: for your amusement -- how my clients talk to me (this channel isn't publicly logged is it?)
(15:20:16) chrys_desk: "Thanks for this explanation but it is not acceptable for me. I can understand that you “locked” the number when the message is archived and they might be a difference when you get the data but we are talking of 30 000 emails less than the initial targeting with “your” technical approach !!!   Please Find a solution or a quick fix to make things happened !"
(15:21:28) patbam: what is the last in every headline?
(15:21:36) chrys_desk: oops
(15:21:44) chrys_desk: that was german wasn't it :)
(15:21:53) chrys_desk: DEVANAGARI LETTER RA र d
(15:21:57) chrys_desk: the last in every line
(15:21:58) chrys_desk: the d
(15:22:08) patbam: yeah that one is wrong :)
(15:22:18) patbam: RA and VIRAMA like to screw things up
(15:22:19) chrys_desk: DEVANAGARI LETTER RA र n
(15:22:26) chrys_desk: DEVANAGARI VOWEL SIGN AA ा l
(15:22:32) chrys_desk: DEVANAGARI DIGIT ZERO ० 0
(15:22:39) patbam: heh, that's right!
(15:22:40) patbam: yay :)
(15:22:42) chrys_desk: the last is a transliteration ... that's what you're heading at?
(15:22:49) patbam: so like
(15:23:06) patbam: the assumptions go like this:
(15:23:32) patbam: IF two words are in fact transliterations of each other, then they are sort of the same length right
(15:23:33) patbam: i mean
(15:23:44) chrys_desk: hm
(15:23:46) chrys_desk: depends
(15:23:47) chrys_desk: like
(15:23:48) patbam: they won't be like, a factor of forty bazillion different in length
(15:24:06) chrys_desk: there's loads of stuff that gets transliterated in different numbers of chars
(15:24:09) chrys_desk: like greek
(15:24:11) chrys_desk: X = ch
(15:24:15) patbam: well, let's put it this way: they're more likely to be of similar length than randomly chosen english and hindi terms
(15:24:19) patbam: yep, that's right
(15:24:22) chrys_desk: ok yeah
(15:24:30) patbam: and in the case of devanagari, it's kind of a syllabary
(15:24:33) chrys_desk: but that's a bit too .. statistical innit :)
(15:24:33) patbam: so that comes into play too
(15:24:43) patbam: oh i'm all about things statistical :)
(15:24:49) patbam: so then
(15:25:01) patbam: let's pretend instead of hindi, we're going from "lowercase" to "UPPERCASE"
(15:25:10) chrys_desk: ok
(15:25:15) patbam: so the "transliteration" might be, say, house -> HOUSE
(15:25:23) patbam: in that case, you can just do:
(15:25:32) patbam: zip(list('house'), list('HOUSE'))
(15:25:45) patbam: >>> zip(list('house'), list('HOUSE'))
(15:25:45) patbam: [('h', 'H'), ('o', 'O'), ('u', 'U'), ('s', 'S'), ('e', 'E')]
(15:25:47) patbam: obviously
(15:25:52) patbam: but
(15:25:54) patbam: what about this:
(15:26:04) patbam: >>> zip(list('house'), list('hause'))
(15:26:04) patbam: [('h', 'h'), ('o', 'a'), ('u', 'u'), ('s', 's'), ('e', 'e')]
(15:26:05) patbam: er
(15:26:09) patbam: is that a german form for house? heh
(15:26:11) patbam: ooops
(15:26:15) chrys_desk: not QUITE
(15:26:18) chrys_desk: it's Haus
(15:26:19) patbam: >>> zip(list('house'), list('Haus'))
(15:26:19) patbam: [('h', 'H'), ('o', 'a'), ('u', 'u'), ('s', 's')]
(15:26:32) patbam: pretty good match there, as it happens
(15:26:33) chrys_desk: that'll be kinda english to kinda german then
(15:26:39) patbam: yah, kinda pointless
(15:26:48) chrys_desk: giggle ... no!
(15:26:49) patbam: i mean, it's much more interesting to do across alphabets
(15:26:56) chrys_desk: strange thing .. like ...
(15:27:05) chrys_desk: english != what's in the books
(15:27:25) chrys_desk: but english ~= what a literate native speaker can decipher
(15:28:15) patbam: true dat
(15:28:28) patbam: wait wait let me annoy you more with my thingie
(15:28:28) patbam: haha
(15:28:30) patbam: preeeeze
(15:28:33) patbam: i yearn for approval
(15:28:34) patbam: haha
(15:28:38) patbam: APPROVE ME
(15:28:39) patbam: heh
(15:29:09) patbam: so here's the interesting bit
(15:29:30) patbam: because the words are of different lengths (as you mention, c might correspond to 'ch' or something)
(15:29:43) patbam: they might align at the beginning, but then get thrown off by the end
(15:29:47) patbam: behold:
(15:30:15) patbam: >>> zip(reversed(list('house')), reversed(list('Haus')))
(15:30:15) patbam: [('e', 's'), ('s', 'u'), ('u', 'a'), ('o', 'H')]
(15:30:34) patbam: crummy results in that case, just because of the way house and Haus happen to be
(15:30:44) patbam: but sometimes you catch more real correspondences that way
(15:30:46) ***chrys_desk approves patbam
(15:31:04) chrys_desk: huh
(15:31:15) patbam: so, what i do is, take the first style of correspondences, then the reversed style, push them in a huge list, count, sort, output
(15:31:16) patbam: that's it
(15:32:29) chrys_desk: uh ...
(15:32:33) ***chrys_desk has a fried brain
(15:35:58) patbam: http://ruphus.com/stash/hinditranslit_py.txt
(15:43:31) chrys_desk: http://itre.cis.upenn.edu/~myl/languagelog/archives/004331.html this is cool
(15:44:14) chrys_desk: sys.stdin = codecs.getreader('utf-8')(sys.stdin) that's kinda cool too
(15:44:14) patbam: wow, that rules
(15:44:20) patbam: yeah sbp taught me that
(15:45:42) chrys_desk: that's kinda ... you redefine an object there, innit?
(15:45:59) patbam: yeah seems to
(15:46:10) patbam: but if sbp says it's ok i don't doubt that it is, heh
(15:47:21) chrys_desk: my poor team member who's a native speaker of seychellois creole had part of his newfound assurance in french destroyed
(15:47:50) chrys_desk: when our french account management team trashed his grammatical spelling in a client communication. poor boy.
(15:48:11) chrys_desk: yeah he needs to improve his written french, but ... those ppl are just so insensitive.
(15:49:43) patbam: he should write everything in seychellois heh
(15:50:05) patbam: http://ruphus.com/stash/out3.txt english to french, heh
(15:51:45) patbam: wow, even japanese catches some
(15:51:49) patbam: i don't really know what the point of this is
(15:51:52) patbam: but it's addictive
(15:51:52) patbam: heh
(15:52:36) patbam: pat@gwin:~/Desktop$ cat out-enja.txt |grep 'LETTER RA'
(15:52:36) patbam: KATAKANA LETTER RA ラ
(15:52:36) patbam: KATAKANA LETTER RA ラ l
(15:52:36) patbam: KATAKANA LETTER RA ラ r
(15:52:36) patbam: KATAKANA LETTER RA ラ a
(15:52:49) patbam: sorry, i'll stop spitting random data at you hehe
(16:04:19) patbam has changed the topic to: http://christianflury.com/blog/2007/03/quite_some_characters_a_unicod.html
(18:58:35) chrys_desk left the room (quit: "Leaving.").
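
A rough sketch of the approach patbam describes at 15:31:15 (pair characters from the front of each word pair, pair them again from the back, pool everything, count, sort). This is not the linked hinditranslit_py.txt script, which is not reproduced in the log; the function name `correspondences` and the sample word pairs are made up for illustration, and the sketch is written for Python 3 rather than the Python 2 of the session above.

```python
from collections import Counter


def correspondences(pairs):
    """Count character pairings from forward and reversed alignments.

    `pairs` is an iterable of (source_word, target_word) tuples.
    Returns a list of ((source_char, target_char), count), most frequent first.
    """
    counts = Counter()
    for source, target in pairs:
        # Forward alignment: pair characters starting from the front.
        counts.update(zip(source, target))
        # Reversed alignment: pair characters starting from the end,
        # which can recover matches after the words drift out of step.
        counts.update(zip(reversed(source), reversed(target)))
    return counts.most_common()


# Hypothetical sample input; the real input file is not shown in the log.
sample = [("house", "HOUSE"), ("mouse", "MOUSE")]
for (src, dst), n in correspondences(sample):
    print(src, dst, n)
```

Since zip stops at the shorter word, unequal lengths simply drop the leftover characters, which is the same behaviour as in the house/Haus example in the session.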