A collection of texts, spoken and/or written, which has been designed and compiled based on a set of clearly defined criteria.

CORPUS [13c: from Latin corpus body. The plural is usually corpora]. A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse. Plural also corpuses. In linguistics and lexicography, a body of texts, utterances, or other specimens considered more or less representative of a language, and usually stored as an electronic database. Currently, computer corpora may store many millions of running words, whose features can be analysed by means of tagging (the addition of identifying and classifying tags to words and other formations) and the use of concordancing programs. Corpus linguistics studies data in any such corpus. (The Oxford Companion to the English Language, ed. McArthur & McArthur, 1992)

A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language. (David Crystal, A Dictionary of Linguistics and Phonetics, Blackwell, 3rd Edition, 1991)

A collection of naturally occurring language text, chosen to characterize a state or variety of a language. (John Sinclair, Corpus, Concordance, Collocation, OUP, 1991)

Monitor corpus - attempts to be a representative cross-section of the spoken and/or written language to be studied (e.g. The Bank of English (COBUILD) and the British National Corpus), and by its very nature it has to be very large (The Bank of English is about 400 million words of written and spoken texts and continues to grow). Monitor corpora have to be continually updated with 'new' texts, and 'old' texts must be discarded, if they are to be truly representative.

Sample corpus - does not pretend to be representative of the whole spoken and/or written forms of the language to be investigated.
Sample corpora are much more common and are the norm in most corpus-based studies (e.g. International Corpus of English and the Hong Kong Corpus of Spoken English at PolyU).

Definition of corpus linguistics

Corpus linguistics is simply the study of language through corpus-based research, but it differs from traditional linguistics in its insistence on the systematic study of authentic examples of language in use (i.e. corpus = evidence).

Language cannot be invented; it can only be captured. (Sinclair, 1997: 31)

Examples of English language corpora

The Bank of English - written and spoken English (used extensively by researchers and for the COBUILD series of English language books)
The BNC - written and spoken British English (used extensively by researchers and for the Oxford University Press, Chambers and Longman publishing houses)
CANCODE (Cambridge and Nottingham Corpus of Discourse in English) - spoken British English (used extensively by researchers and Cambridge University Press)
ICE (International Corpus of English) - international varieties of spoken and written English (most of the corpus is not yet available)
Brown University Corpus & LOB (Lancaster-Oslo-Bergen) Corpus - parallel corpora of written texts (but now rather outdated)
London-Lund Corpus (Survey of English Usage) - spoken British English (used very extensively by researchers, but it is now quite old)
Santa Barbara Corpus - spoken American English (most of the corpus is not yet available)
Hong Kong Corpus of Spoken English (still being compiled; 1 million of the target 1.5 million words have been collected so far)
ICAME (International Computer Archive of Modern English) - a centre which aims to coordinate and facilitate the sharing of computer-based corpora.

Examples of corpus linguistic studies

Three main types of corpus-based linguistic study:
1. Lexical: e.g. word use, idioms, irregular plurals
2. Syntactic: sentence level features, e.g. use of prepositions, verb forms, pronouns, agreement
3. Discourse: the structure of text, e.g. cohesion above the sentence level

Test the hypothesis of co-selection

Study the verbal environment of a word or phrase and examine the nature and extent of multi-word choices in the clause.

Inspect co-text
Sinclair (in Wichmann et al., 1997) claims that meaning has an important effect on structure. If a word has two meanings, it is possible to predict that it also has at least two structures (Sinclair, 1997: 35-36), and this prediction can only be checked by studying examples of language in use.
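Inspecting the co-text of a word is what a concordancer does when it prints key-word-in-context (KWIC) lines. The following is a minimal sketch only, not the method of any particular concordancing program: the function name `kwic`, the `width` parameter and the sample sentences are all invented for illustration, and the text is assumed to be a single plain string without line breaks.

```python
import re

def kwic(text, node, width=30):
    """Return key-word-in-context lines for every occurrence of `node`.

    `kwic` and `width` are hypothetical names used for this sketch only.
    """
    lines = []
    # \b stops 'few' from matching inside 'fewer'; IGNORECASE also finds
    # sentence-initial occurrences such as 'Few'.
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left co-text so that the node words line up.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, not taken from any corpus.
sample = ("Few people came to the talk. A few people stayed for questions. "
          "There were fewer questions than expected.")
for line in kwic(sample, "few", width=20):
    print(line)
```

Reading down the aligned column of node words makes it easier to see which words recur immediately to the left and right, which is the kind of evidence needed to compare the structures associated with different meanings.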
Corpus-based language studies enable the researcher to identify and describe various realisations of the productivity of language - what Sinclair (1997: 37-38) terms 'permissible variety'. For example, a search for the first noun X in the structure a(n) X of Y allows the researcher to uncover the productive opportunities in language.

Schematic knowledge (see Aston, in Wichmann et al., 1997)

To enable the researcher to become aware of the many kinds of regularities in discourse and the extent to which you can operate with fixed or semi-fixed associations.

Based on a distinction between syntagmatic associations at the levels of meanings (informational/rhetorical structure) and of forms (collocation/colligation/semantic/pragmatic/prosodic), and paradigmatic associations between situation and meaning (genre and register conventions), between meaning and form (conventional speech act and referring procedures), and between situation and form (routine formulae, technical terminology).

Multiple texts, multiple contexts: Recurrent patternings among multiple texts (e.g. newspaper articles):
Look for regularities (patterning): collocation, colligation, connotation, discourse structuring, e.g. any, will and would, irregular verbs, etc. (Sinclair, in Wichmann et al., 1997; Sinclair, Collins Cobuild English Grammar, Chapter 10)

Exploring texts through the concordancer

An example of this kind of study for discourse features is detailed below: Study a list of words (occurring at least four times) taken from an article and make hypotheses about the text-topic.
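A word list of this kind can be produced with a few lines of code. The sketch below is one possible way to do it, under stated assumptions: the function name `frequent_words`, the `min_count` and `ignore` parameters, the simple letters-only tokenisation and the sample text are all hypothetical and are not taken from any concordancing package.

```python
import re
from collections import Counter

def frequent_words(text, min_count=4, ignore=frozenset()):
    """Return (word, count) pairs for words occurring at least `min_count` times.

    `ignore` is an optional set of words to leave out (for example,
    grammatical words such as 'the'); it is an assumption of this sketch.
    """
    # Crude tokenisation: runs of letters, with an optional apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(t for t in tokens if t not in ignore)
    return [(w, n) for w, n in counts.most_common() if n >= min_count]

# Invented sample "article", for illustration only.
article = ("The bank raised rates. The bank said rates would rise. "
           "Rates and the bank: the bank, rates.")
for word, n in frequent_words(article, ignore={"the"}):
    print(word, n)
```

With grammatical words filtered out, the remaining frequent items are usually the ones that support a hypothesis about the text-topic.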
(Choose 15 words from your text that occur only once and create a cloze test.)

Some fundamental precepts in corpus-based research (Sinclair, in Wichmann et al., 1997)

Some implications of corpus-based study

The operations of text and context retrieval rarely provide simply what the user was expecting (unexpected or unthought of usage): three spin-offs:
Why use a corpus

The computer "gives us the ability to comprehend, and to account for, the contents of such corpora in a way which was not dreamed of in the pre-computational era of corpus" (Leech 1992, in Svartvik (ed.), p. 106).

Linguistics: to study linguistic competence or performance as revealed in naturally occurring data. Most applications will require or lead to the creation of annotated text.

Language teaching/learning: language for specific purposes (e.g. use newspaper corpora, corpora of scientific texts); to prepare vocabulary lists based on high-frequency lexical items; to prepare CLOZE tests; to answer ad hoc learner questions ('What's the difference between few and a few?'); to discover facts about language ...
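Preparing a cloze test from a text can be partly automated. The sketch below blanks out words that occur only once in the text; it is an illustration under assumptions, not a standard procedure. The function name `make_cloze`, the `n_gaps` and `seed` parameters, the gap format and the sample sentence are all invented here, and the default of 15 gaps is simply a convenient number for a short classroom text.

```python
import random
import re
from collections import Counter

# Crude word pattern: runs of letters, with an optional apostrophe part.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def make_cloze(text, n_gaps=15, seed=0):
    """Blank out up to `n_gaps` words that occur exactly once in `text`.

    Returns the gapped text and the answer key in order of appearance.
    All names here are hypothetical and used for this sketch only.
    """
    counts = Counter(w.lower() for w in WORD.findall(text))
    once_only = [w for w, n in counts.items() if n == 1]
    rng = random.Random(seed)  # fixed seed so the same test can be regenerated
    chosen = set(rng.sample(once_only, min(n_gaps, len(once_only))))

    answers = []

    def blank(match):
        word = match.group(0)
        if word.lower() in chosen:
            answers.append(word)
            return f"({len(answers)}) ______"
        return word

    return WORD.sub(blank, text), answers

# Invented sample text, for illustration only.
gapped, key = make_cloze("The cat sat on the mat. The dog sat on the rug.")
print(gapped)
print(key)
```

A teacher would normally review the automatically chosen gaps, since a word that occurs only once is not always a sensible item to test.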
Constructing a corpus

Finding and cleaning up text

Scanning can be another method of collecting electronic texts, but you need to have access to a scanner and learn how to use the OCR software. It is still much faster and more reliable than typing.

Useful sites and homepages on corpora and concordancers