COMPUTER CORPUS LINGUISTICS RESEARCH BASED ON THE
INTERNATIONAL CORPUS OF ENGLISH
N.V., Tolstoy V.V.
metallurgical academy of Ukraine
Abstract: the article focuses on the structure of
various modern corpus-based dictionaries. The reduction in size of
published dictionaries is also considered.
Key words: corpus design and compilation, linguistic
descriptors, corpus-based dictionaries.
As the number of words in the English language continues to grow,
dictionaries are getting thicker, and publishing costs are soaring, lexicographers
around the world have offered a revolutionary solution to the problem. In 2000 members
of the Association of Creative Lexicographers unanimously voted for a 15
percent, and later a more significant, reduction of all published dictionaries.
The reduction was applied proportionally across all letters and all levels of
vocabulary. Thus, by 2014 every new dictionary, from school to academic editions,
had been cut by 25%. This solution has been described as "ecologically
correct": smaller dictionaries save forests.
Since English is currently
a lingua franca in many spheres of the modern world, including technology and
the Internet, one may assume that this will remain the
case even when other languages such as Mandarin Chinese, Spanish, Arabic and
Hindi gain in importance. Therefore, the question 'What kind of
English do we expect in the world tomorrow?' will have to be considered. We can
already see how quickly current electronic sources such as the Microsoft Encarta World
English Dictionary Online and The Literary Encyclopedia (released in 2001) will become
outdated, and what possibilities the future might hold. Especially in the field
of technology, the growth of vocabulary is accelerating all the time, and
printed reference works cannot keep up with the pace of change.
Thus, the possibility for
users to have direct access to a corpus, or to update a personalized dictionary
automatically, together with rapid globalization and the need to
communicate easily and efficiently across linguistic and cultural borders,
ensures a bright future for digital lexicography.
Although corpus-based linguistics has been associated since the 1960s
with texts stored and analyzed on computers, the use of texts as the
basis for linguistic description goes back well before electronic computers
became available; such collections are today often called pre-electronic corpora [2, 3]. In
particular, we may note the use of texts in lexicography, the study of the
meaning and use of words. Thus, the first edition of the Oxford English
Dictionary, completed in 1928 after 70 years' work, was based mainly on the
analysis of a collection of citations comprising about 50 million words from
texts belonging to the canon of English literature and representing the use of written
English over a period of 800 years.
Linguists and educationalists in the USA in the first half
of the twentieth century also used corpora of up to 19 million words to
discover the most frequently used words in English. The aim was to help develop
better curricula for literacy education. This work was also
very influential in the development of language teaching methodology and had a major
impact on the methods used for teaching English to speakers of other
languages in substantial parts of Eastern Europe, Africa and Asia.
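The idea of a frequency list can be illustrated with a minimal Python sketch. It is not the procedure used by the early researchers; the tokenization rule, the function name word_frequencies and the sample sentence are assumptions made purely for illustration.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count lower-cased word tokens in a text.

    Assumption: a token is a run of ASCII letters, optionally followed
    by an apostrophe and more letters (e.g. "don't").
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


# Invented sample sentence, not taken from any real corpus.
sample = "The cat sat on the mat. The dog sat too."
freq = word_frequencies(sample)

# The two most frequent word forms with their counts.
top = freq.most_common(2)
print(top)
```

On a real corpus the same counting would be run over every text, and the resulting list sorted by frequency would indicate which words to prioritize in teaching materials.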
In addition to the central focus on linguistic description,
the study of the phonology, morphology, syntax, and discourse structure of
languages, contemporary work in corpus linguistics has focused on four other
main areas of activity:
corpus design and compilation;
the development of automatic grammatical annotation of corpora by means of word-class tagging and parsing to assign constituent structures;
methodology for linguistic analysis;
applications of corpus-based linguistic description.
The following diagrams, representing the structure of the
International Corpus of English (ICE), are based on 500 samples, each of 2,000 words.
Diagram 1. Spoken, dialogue and private language
In total 300 texts were examined.
Diagram 2. Written, non-professional and non-printed language
In total 200 texts were examined.
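The figures given for the two diagrams can be checked with a few lines of Python. This is only an arithmetic sketch: it assumes each sample holds roughly 2,000 words, and the constant names are chosen here for illustration.

```python
# Figures taken from the text; the per-sample size is approximate.
SAMPLES_TOTAL = 500
WORDS_PER_SAMPLE = 2_000
SPOKEN_TEXTS = 300
WRITTEN_TEXTS = 200

# Overall size of one corpus and the share of spoken material.
total_words = SAMPLES_TOTAL * WORDS_PER_SAMPLE
spoken_share = SPOKEN_TEXTS / SAMPLES_TOTAL
written_share = WRITTEN_TEXTS / SAMPLES_TOTAL

print(total_words, spoken_share, written_share)
```

Under these assumptions one corpus amounts to about one million words, with three fifths of the texts coming from spoken sources and two fifths from written ones.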
The ICE project, when completed, will include 20
one-million-word corpora of English (60 percent from spoken sources) compiled
in different parts of the world from texts produced in the 1990s. In addition
to the whole corpus forming a representative sample of English as an international
language, the 20 individual subcorpora can serve as the basis for comparative studies
of regional varieties of English. The British section of ICE, the first section
to become publicly available, comes equipped with sophisticated search and retrieval
software for automatic lexical and grammatical analysis.
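A comparative study of regional varieties usually begins by comparing how often a word occurs in each subcorpus, normalized to a common base such as one million words. The Python sketch below shows that normalization only. It is not the ICE retrieval software; the function per_million and both toy texts are invented for illustration and are not ICE data.

```python
import re
from collections import Counter


def per_million(text: str, word: str) -> float:
    """Return the frequency of `word` per million tokens of `text`.

    Assumption: tokens are runs of ASCII letters, compared in lower case.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens) * 1_000_000


# Invented toy texts standing in for two regional subcorpora.
variety_a = "we shall go now and we shall return"
variety_b = "we will go now and we will return soon"

for name, text in (("A", variety_a), ("B", variety_b)):
    print(name, per_million(text, "shall"))
```

Because every ICE subcorpus is designed to have the same size, raw counts are already roughly comparable; normalization becomes important when corpora of different sizes are set side by side.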
1. Henrik Gottlieb and Jens Erik Mogensen (Editors).
Dictionary Visions, Research and Practice. Selected Papers from the 12th
International Symposium on Lexicography, Copenhagen 2007.
2. Kennedy, Graeme. An Introduction to Corpus Linguistics.
3. Thorndike E.L. The psychology of the school dictionary /
Bulletin of the School of Education, Indiana University. V. 4.– Bloomington,
4. Ooi V. Computer Corpus Lexicography. Edinburgh, 1998.
5. David Crystal, The Cambridge Encyclopedia of the English
Language (Second Edition) ‑ Cambridge University Press, 2003.