This started with Coding Horror's post on Markov chains. While not as entertaining as the Nietzschean Family Circus, the technique is not without its charm.
At the letter level you can play fun games with Markov chains - for example, coming up with similar-sounding but not identical names. I started by throwing in the top American cities. I got a few that were believable and coherent but not identical to the inputs (Bernard, Costa Maria, and Pennectico), but unless you have a large corpus, that zone of coherent-but-not-identical output is pretty narrow. Even so, you can still get some clever portmanteaus out of the deal: Thorntonio, Mobilene, and Charleston-Salem. Even better, there was an obvious name for a gay men's magazine about the scene in Alaska - Manchorage (I'm sure someone's ahead of me on that) - and my personal favorite Markovian city name, Allen West. Going from 300-ish city names down to 50 states gave only one novel name in a number of tries, Monsaskana, plus a few clever portmanteaus, like Coloridaho, Florado, and my favorite, Tennsylabamaska.
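For anyone who wants to play the same game, here's a minimal sketch of the letter-level version in Python. To be clear, this is not the exact script I used; the city list and function names are just illustrative, and any corpus of names will do.

```python
import random
from collections import defaultdict

def build_model(names, order=2):
    """Map every `order`-letter window to the letters observed right after it."""
    model = defaultdict(list)
    for name in names:
        padded = "^" * order + name.lower() + "$"   # ^ marks the start, $ the end
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate_name(model, order=2, max_len=20):
    """Walk the chain from the start state until it hits the end marker."""
    state, out = "^" * order, []
    while len(out) < max_len:
        nxt = random.choice(model[state])
        if nxt == "$":
            break
        out.append(nxt)
        state = state[1:] + nxt
    return "".join(out).title()

cities = ["houston", "san antonio", "charlotte", "anchorage", "mobile",
          "abilene", "thornton", "winston-salem", "allentown"]
model = build_model(cities)
print(generate_name(model))   # sometimes a portmanteau, sometimes a verbatim city
```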
From now on, when I need something like this, I'm going to automate it this way. For president names, no such luck - there are only forty-four of those. I did the same thing with another Markov generator, one that operates at the sentence level (far more interesting), and tried it on one of my own blog posts, where I discuss the zoo hypothesis response to the Fermi paradox. At a mere 367 words, using a 2-element state, it has the same problem of either spitting out the input sentences verbatim or giving sentence portmanteaus: "That we've been staring them in the face the whole universe is A kind of frightening abjectly humbling realization is in fact the best case scenario I expect because it means they will seem incomprehensible if we even recognize them." (Compare to the original post.) It's more interesting but less coherent with a 1-element state: "To spoil the whole universe is exactly what we haven't noticed aliens Ever try to do This wanting as alive even if they're trying to get our own ignorance." The Markov process is producing grammatical sentences even at this level, run-ons and semantic incoherencies notwithstanding, whether green, colorless or otherwise.
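The word-level version is the same trick with words as the unit and the state size as a knob. A rough sketch, assuming a local text file (the filename below is made up); a library like markovify would also do the job:

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, order=2, length=40):
    """Start at a random state and follow observed transitions until a dead end."""
    state = random.choice(list(chain.keys()))
    out = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:          # this state only occurred at the very end of the text
            break
        out.append(random.choice(followers))
        state = tuple(out[-order:])
    return " ".join(out)

post = open("zoo_hypothesis_post.txt").read()   # hypothetical path to the 367-word post
print(babble(build_chain(post, order=2), order=2))
print(babble(build_chain(post, order=1), order=1))
```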
In case you wanted input text from a more middlebrow writer than myself, I used Hemingway's The Old Man and the Sea for a bigger corpus, and with 2-element states got a few nice ones: "Now there was no one that they rose and they had razor-sharp cutting edges on both sides." "Knowing it was a little later to save the blood in the sky." "Borrow two dollars and a skiff in the bow he could not fail myself and die on A Monday morning." "Strong enough Now for the fish made He was letting the current made against the line and he is too wise to jump." And then with 3-element states: "Shoulders and braced his left hand and arm he took the bait just now." "The skiff shake as he jerked and pulled on the fish and he had found a way of leaning forward against the bow he could not remember the prayer and then He would say them fast so that they made a half-garland on the projecting steel."
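With the sketch above, switching between those runs is just the order argument - reusing build_chain and babble, and again with a made-up filename:

```python
hemingway = open("old_man_and_the_sea.txt").read()        # hypothetical local copy of the novel
print(babble(build_chain(hemingway, order=2), order=2))   # looser, stranger sentences
print(babble(build_chain(hemingway, order=3), order=3))   # closer to quoting Hemingway outright
```

The longer the window, the more of the original's rhythm survives; on a small corpus it eventually collapses into verbatim quotation.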
J.G. Ballard wrote characters who were poets and spent their days programming, and who were celebrated for the brilliance of the verse their algorithms produced. (To read this as dystopian is to oversimplify Ballard with assumptions he exploded - it was a challenge to artists. That is: how is this not what you're doing, except that you're using the hardware inside your skull and you're not sure how it works?) While the text above isn't about to fool a publisher into thinking you're in possession of a lost Hemingway manuscript, it gets you partway there. That is to say, if I were an undergraduate in some bullshit post-modern theory class I needed for elective credit, for my term papers I would get hold of a bunch of secondary literature, stuff it into the Markov meat machine, and then curate the sentences and run a spell check to make sure the result reads coherently. That takes the concentration and understanding out of it; at that point you're just editing.
It's worth pointing out that humans differ from these chains in how easily they create new words like this, mostly because we're organized semantically rather than phonetically. It's not easy, at least for healthy people. If I ask you to name words that start with p, you'll start slowing down after 5-10 words, but if I ask you to name things that have to do with a palace, you'll have a much easier time. In some pathological states, like Wernicke's aphasia, you can't help but make up words, but the words people make up still follow the rules of their native language. Interestingly, when people "speak in tongues" during religious ceremonies, the tongues in which they speak to their gods follow the same phonological rules as their native language.
Of all statistical techniques, Markov chains have done the most to make me wish I were a programmer. For instance, it doesn't seem that it would be any harder to reverse this process to recognize affixes, rather than to predict the following character. That is to say: in English we use "-s" (or "-es") and "-ing" as suffixes on nouns and verbs. In any corpus, these will appear more often than other endings, and they will be less well predicted by the letters preceding them than other clusters are. Doing this, you could feed in a corpus in any language and have it pick out the likely affixes and particles based on these properties. In fact, it's not implausible that this is how children decode language when learning their first one.
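A crude sketch of that idea, with one simplification I should flag: instead of the chain's actual conditional probabilities, it uses the number of distinct letters a candidate ending follows as a stand-in for "poorly predicted by what precedes it." The corpus path is again a placeholder.

```python
from collections import Counter, defaultdict

def candidate_affixes(words, max_len=3, min_stem=3):
    """Score word endings by how common they are and how many different
    preceding letters they attach to - real affixes should rank high on both."""
    counts = Counter()
    contexts = defaultdict(set)
    for w in set(words):                         # word types, not tokens
        for n in range(1, max_len + 1):
            if len(w) > n + min_stem:
                ending = w[-n:]
                counts[ending] += 1
                contexts[ending].add(w[-n - 1])  # the letter just before the ending
    score = {e: counts[e] * len(contexts[e]) for e in counts}
    return sorted(score, key=score.get, reverse=True)

words = open("corpus.txt").read().lower().split()   # hypothetical corpus file
print(candidate_affixes(words)[:10])                # expect things like 's', 'ing', 'ed'
```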