Posted 11 March 2013 by admin
The discipline of linguistics calls a collected sampling of ‘real world’ language a ‘corpus’ — a body of data. A corpus is thought of as yielding information about how language operates. As samples of language as language irresistibly happens in the world as opposed to how language dreams of itself — as actually mangled and not abstractly perfect — corpora can be annotated and analysed by linguists in order to probe certain linguistic features, tricks, or properties. Actually, it’s almost as if a corpus is less a body, than a digestive system: corpora metabolize the soup of words sloshing through the world on a daily basis. An individual corpus can tell us, for instance, quite a lot about ‘collocation’, words that co-occur for no good reason — sequences of words that get put together and then get stuck that way, as if the wind changed. A sense of collocation can be the making and breaking of someone attempting to pass as a native speaker. To English ears, a ‘quick shower’ is ‘natural’ but a ‘fast shower’ is odd, while a ‘fast train’ sounds right but a ‘quick train’ sounds slightly weird. You ‘make a noise’ — you don’t ‘do a noise’ — but some kinds of noises you do ‘do’. A corpus, a sample of real world language, can tell us much about the frequency and so on of these sorts of linguistic happenings. But corpora don’t only give us the habits of language — they also give us the hiccups.
According to the ‘User’s Manual’ — like a machine, it comes with one — ‘Switchboard’ is a corpus of ‘spontaneous conversations’ addressing ‘the growing need for large multispeaker databases of telephone bandwidth speech’. In other words, it’s a sampling of American telephone conversations, recorded and then transcribed. In total, it contains ‘about 2430 conversations averaging 6 minutes in length’, or ‘240 hours of recorded speech, and about 3 million words of text, spoken by over 500 speakers of both sexes from every major dialect of American English’. The calls took place under relatively normal conditions — over the public telephone network, but with some recorded instructions for the callers. The transcripts were, we are told by the manual, produced by court reporters. The prepared resource contains some demographic information about the participants, including the area code of each call’s origin. Transcribers were also asked to rate the calls according to the amount of echo, crosstalk, static — noise caused by the collection system itself — and background, or noise emerging out of the environments of the two callers. This background noise might be additional voices not belonging to those bodies with their mouths at the telephone piece, or it might be the extra sounds those bodies make simultaneous to talking (washing dishes, shutting doors). In the transcribed conversations, one caller is called A, the other B. Also provided is information as to the duration of all the words spoken: the Switchboard Corpus tells us how long, exactly, each spoken word lasts.
The discussions of the participants are often stuttery. In the segment of the corpus that’s available to download online under a Creative Commons license, there are 36 conversations, and within these the uhs and uh-huhs are already innumerable. One .txt file isolates the moments when the speakers lapse into ‘disfluency’ (the title of the .txt file) by annotating them with brackets. But the actual look of the resulting text makes it seem as if disfluency is the rule and not the exception — as if it’s via the stutters that language works itself into the world rather than by any crystal clear flow of intention. The soup has bits in. The soup is soupy. The marks of disfluency become punctuation marks, structure: the repetitions, the fillers, the irregularities, the stray phonemes that aren’t supposed to mean anything — by being marked out and eliminated, they even become the substance. There’s so much nonsense in the sentences that it starts to feel as if the nonsense might mean something.
Language is scrapped, becomes a scrap landscape. Language is held together by language falling apart. Words don’t flow, disfluency does. Or — or — is this only what language looks like when judged by court reporters? Is a transcription always necessarily a judgement? How do you decide how to transcribe something? And anyway, leaving aside those hiccups linguists call non-lexical vocables, the bits and pieces we are always saying that, again, are not supposed to have any semantic meaning — leaving aside the hiccups, who decided where to put the spaces between our words in the first place?
Honor Gavin is a writer, musician, and academician born in Birmingham and now based in Berlin. She is currently developing a new work, 0121 Stimmtausch, for EVP. It will be premiered on 18th May at Rich Mix, London.