Mathematical and computer linguistics. Computational linguistics: methods, resources, applications

The term "computational linguistics" usually refers to the broad use of computer tools (programs and computer technologies for organizing and processing data) to model how language functions in particular conditions, situations, and problem areas, as well as to the application of computer models of language not only in linguistics but also in related disciplines. Strictly speaking, only the latter is applied linguistics in the narrow sense, since computer modeling of language can also be regarded as an application of computer science theory to linguistics. Nevertheless, in common practice computational linguistics covers almost everything related to the use of computers in linguistics: “The term “computational linguistics” defines a general orientation toward using computers to solve a variety of scientific and practical problems related to language, without in any way limiting the ways of solving these problems.”

Institutional aspect of computational linguistics. Computational linguistics took shape as a distinct scientific field in the 1960s. The flow of publications in this area is very large. In addition to thematic collections, there is the journal “Computational Linguistics”. Extensive organizational and scientific work is carried out by the Association for Computational Linguistics, which has regional structures around the world (in particular, a European branch). International conferences on computational linguistics, COLING, are held every two years. Relevant issues are also widely represented at international conferences on artificial intelligence at various levels.

Cognitive tools for computational linguistics

Computational linguistics as a distinct applied discipline is characterized primarily by its instrument, that is, by the use of computer tools for processing language data. Since computer programs that model particular aspects of the functioning of language can use a wide variety of programming tools, it might seem that there is no point in speaking of a common metalanguage. This is not the case, however. There are general principles of computer modeling of thinking that are implemented, in one way or another, in any computer model. This metalanguage is based on the theory of knowledge developed in artificial intelligence, which forms an important branch of cognitive science.

The main thesis of the theory of knowledge states that thinking is a process of processing and generating knowledge. “Knowledge” itself is treated as an undefinable category. The human cognitive system acts as a “processor” that processes knowledge. In epistemology and cognitive science, two main types of knowledge are distinguished: declarative (“knowing what”) and procedural (“knowing how”). Declarative knowledge is usually presented as a set of propositions, statements about something. A typical example of declarative knowledge is the definition of a word in an ordinary explanatory dictionary. For example, cup: “a small round vessel for drinking, usually with a handle, made of porcelain, earthenware, etc.” Declarative knowledge can be verified in terms of “true” and “false”. Procedural knowledge is presented as a sequence (list) of operations or actions to be performed; it is a general instruction for acting in some situation. A typical example of procedural knowledge is the instructions for using a household appliance.

Unlike declarative knowledge, procedural knowledge cannot be verified as true or false. It can be assessed only by the success or failure of the algorithm.

Most of the concepts in the cognitive toolkit of computational linguistics are homonymous: they simultaneously designate real entities of the human cognitive system and ways of representing these entities in particular metalanguages. In other words, the elements of the metalanguage have an ontological and an instrumental aspect. Ontologically, the division into declarative and procedural knowledge corresponds to different types of knowledge in the human cognitive system. Thus, knowledge about specific objects of reality is predominantly declarative, while a person's functional abilities to walk, run, or drive a car are realized in the cognitive system as procedural knowledge. Instrumentally, knowledge (both ontologically procedural and ontologically declarative) can be represented either as a set of descriptions or as an algorithm or instruction. In other words, ontologically declarative knowledge about the object “table” can be represented procedurally as a set of instructions or algorithms for making and assembling it (the creative aspect of procedural knowledge) or as an algorithm for its typical use (the functional aspect of procedural knowledge). In the first case, this could be a guide for a novice carpenter, and in the second, a description of the capabilities of an office desk. The reverse is also true: ontologically procedural knowledge can be represented declaratively.

Whether any ontologically declarative knowledge can be represented as procedural, and any ontologically procedural knowledge as declarative, requires separate discussion. Researchers agree that any declarative knowledge can, in principle, be represented procedurally, although this may turn out to be very wasteful for the cognitive system. The reverse is unlikely to be true. The point is that declarative knowledge is much more explicit and easier for a person to become aware of than procedural knowledge. In contrast to declarative knowledge, procedural knowledge is predominantly implicit. Thus, the language ability, being procedural knowledge, is hidden from the speaker and is not consciously accessible. An attempt to make the mechanisms of language functioning explicit leads to dysfunction. Specialists in lexical semantics know, for example, that the prolonged semantic introspection needed to study the content of a word leads to the researcher partially losing the ability to distinguish between correct and incorrect uses of the word being analyzed. Other examples can be given. It is known that from the point of view of mechanics, the human body is a complex system of two interacting pendulums.

In the theory of knowledge, various knowledge structures are used to study and represent knowledge: frames, scenarios, and plans. According to M. Minsky, “a frame is a data structure designed to represent a stereotypical situation” [Minsky 1978, p. 254]. In more detail, a frame is a conceptual structure for the declarative representation of knowledge about a typified, thematically unified situation; it contains slots interconnected by certain semantic relationships. For clarity, a frame is often represented as a table whose rows form slots. Each slot has its own name and content (see Table 1).

Table 1

Fragment of the "table" frame in a table view

Depending on the specific task, frame structuring can be significantly more complex; a frame may contain nested subframes and references to other frames.

Instead of a table, a predicate form of representation is often used. In this case, the frame takes the form of a predicate or a function with arguments. There are other ways to represent a frame. For example, it can be represented as a tuple of the following form: ((frame name) (slot name 1) (slot value 1), ..., (slot name n) (slot value n)).

Typically, frames in knowledge representation languages have this form.
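The tuple form, together with nested subframes, can be sketched in Python. This is a minimal illustration, not a real knowledge representation language: the `Frame` class, the `as_tuple` method, and all slot names and fillers are assumptions made for the example and do not reproduce the contents of Table 1.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Frame:
    """A named collection of slots; a slot value may itself be a Frame."""
    name: str
    slots: Dict[str, Any] = field(default_factory=dict)

    def as_tuple(self):
        """Return the tuple form: (frame name, (slot name, slot value), ...)."""
        parts = []
        for slot_name, value in self.slots.items():
            if isinstance(value, Frame):
                # A nested subframe is rendered recursively in the same form.
                value = value.as_tuple()
            parts.append((slot_name, value))
        return (self.name, *parts)


# Hypothetical slot names and fillers, chosen only for illustration.
tabletop = Frame("tabletop", {"shape": "rectangular", "material": "wood"})
table = Frame(
    "table",
    {"number_of_legs": 4, "function": "support for objects", "top": tabletop},
)
table_tuple = table.as_tuple()
```

Here `table_tuple` starts with the frame name and continues with one (slot name, slot value) pair per slot; the `top` slot holds the tuple form of the nested `tabletop` frame.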

Like other cognitive categories of computational linguistics, the concept of a frame is homonymous. Ontologically, it is part of the human cognitive system, and in this sense the frame can be compared with such concepts as gestalt, prototype, stereotype, and schema. In cognitive psychology, these categories are considered precisely from the ontological point of view. Thus, D. Norman distinguishes two main ways in which knowledge exists and is organized in the human cognitive system: semantic networks and schemas. “Schemas,” he writes, “are organized packages of knowledge assembled to represent individual, independent units of knowledge. My schema for Sam might contain information describing his physical features, his activities, and personality traits. This schema relates to other schemas that describe his other sides” [Norman 1998, p. 359]. Taken from the instrumental side, a frame is a structure for the declarative representation of knowledge. In existing AI systems, frames can form complex knowledge structures; frame systems allow hierarchy, so that one frame can be part of another frame.

In content, the concept of a frame is very close to the category of interpretation. Indeed, a slot is an analogue of a valence, and the filler of a slot is an analogue of an actant. The main difference between them is that an interpretation contains only linguistically relevant information about the content of a word, whereas a frame, first, is not necessarily tied to a word and, second, includes all information relevant to a given problem situation, including extralinguistic information (knowledge about the world).

A scenario is a conceptual structure for the procedural representation of knowledge about a stereotypical situation or stereotypical behavior. The elements of a scenario are the steps of an algorithm or instruction. Typical examples are the “restaurant visit scenario”, the “purchase scenario”, etc.
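A stereotypical sequence such as the restaurant visit can be sketched as an ordered list of steps applied to a state. This is only an illustrative sketch: the step names, the state keys, and the `run_scenario` helper are hypothetical and are not taken from any particular AI system.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]
Step = Tuple[str, Callable[[State], State]]


def run_scenario(steps: List[Step], state: State) -> Tuple[State, List[str]]:
    """Apply each step in order; return the final state and the names of executed steps."""
    trace: List[str] = []
    for name, action in steps:
        state = action(state)
        trace.append(name)
    return state, trace


# Hypothetical steps of a "restaurant visit"; names and state keys are illustrative.
RESTAURANT_VISIT: List[Step] = [
    ("enter", lambda s: {**s, "seated": True}),
    ("order", lambda s: {**s, "ordered": True}),
    ("eat", lambda s: {**s, "hungry": False}),
    ("pay", lambda s: {**s, "paid": True}),
    ("leave", lambda s: {**s, "seated": False}),
]

final_state, trace = run_scenario(RESTAURANT_VISIT, {"hungry": True})
```

The structure carries no truth value of its own: it can only be run, and the outcome (here, whether `final_state["hungry"]` is false and `final_state["paid"]` is true) shows whether the sequence succeeded.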

Initially, the frame was also used for procedural representation (cf. the term “procedural frame”), but now the term “scenario” is more often used in this sense. A scenario can be represented not only as an algorithm but also as a network whose vertices correspond to situations and whose arcs correspond to connections between situations. Along with the concept of a scenario, some researchers use the category of script for the computer modeling of intelligence. According to R. Schank, a script is a generally accepted, well-known sequence of causal relationships. For example, understanding the dialogue

It's pouring buckets outside.

You will still have to go to the store: there is nothing to eat in the house, the guests ate everything up yesterday.

is based on implicit semantic connections such as “if it is raining, it is undesirable to go outside because you can get sick.” These connections form a script, which native speakers use to understand each other's verbal and non-verbal behavior.

Applying a scenario to a specific problem situation yields a plan. A plan is used to represent procedurally the knowledge about possible actions that lead to the achievement of a specific goal. A plan relates a goal to a sequence of actions.

In general, a plan includes a sequence of procedures that transform the initial state of the system into the final state and lead to the achievement of a certain subgoal and goal. In AI systems, a plan arises from the planning activity of a dedicated module, the planning module. The planning process may be based on adapting data from one or more scenarios, activated by testing procedures, to resolve a problem situation. The plan is carried out by the executive module, which controls the cognitive procedures and physical actions of the system. In the elementary case, a plan in an intelligent system is a simple sequence of operations; in more complex versions, the plan is tied to a specific subject, its resources, capabilities, and goals, detailed information about the problem situation, etc. A plan emerges through communication between the world model (part of which is formed by scenarios), the planning module, and the executive module.

Unlike a scenario, a plan is tied to a specific situation and a specific performer and pursues a specific goal. The choice of plan is governed by the performer's resources. Feasibility is a necessary condition for generating a plan in the cognitive system, whereas the notion of feasibility does not apply to a scenario.
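The relation between goal, actions, and resources can be sketched with a toy planner. The sketch below uses a plain breadth-first search over states, which is a stand-in technique chosen for brevity and not the method of any system described here; the `Action` record, the `make_plan` function, the facts, and the costs are all hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, NamedTuple, Optional


class Action(NamedTuple):
    pre: FrozenSet[str]     # facts that must hold before the action
    add: FrozenSet[str]     # facts the action makes true
    delete: FrozenSet[str]  # facts the action makes false
    cost: int               # resources the performer spends


def make_plan(initial: FrozenSet[str], goal: FrozenSet[str],
              actions: Dict[str, Action], budget: int) -> Optional[List[str]]:
    """Breadth-first search for an action sequence that reaches the goal within budget."""
    start = (initial, budget)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (facts, left), path = queue.popleft()
        if goal <= facts:
            return path
        for name, act in actions.items():
            if act.pre <= facts and act.cost <= left:
                nxt = ((facts - act.delete) | act.add, left - act.cost)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [name]))
    return None  # no feasible plan for this performer


# Hypothetical actions for the "go to the store" situation.
ACTIONS = {
    "go_to_store": Action(frozenset({"at_home"}), frozenset({"at_store"}),
                          frozenset({"at_home"}), 1),
    "buy_food": Action(frozenset({"at_store"}), frozenset({"has_food"}),
                       frozenset(), 5),
    "go_home": Action(frozenset({"at_store"}), frozenset({"at_home"}),
                      frozenset({"at_store"}), 1),
}
GOAL = frozenset({"at_home", "has_food"})

plan_rich = make_plan(frozenset({"at_home"}), GOAL, ACTIONS, budget=7)
plan_poor = make_plan(frozenset({"at_home"}), GOAL, ACTIONS, budget=6)
```

With a budget of 7 the search finds a three-step sequence, while with a budget of 6 it returns `None`: the same set of available actions yields a plan for one performer and none for another, which is the sense in which feasibility belongs to the plan.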

Another important concept is the model of the world. A world model is usually understood as a body of knowledge about the world, organized in a certain way and characteristic of a cognitive system or its computer model. In a somewhat more general sense, a world model is the part of a cognitive system that stores knowledge about the structure of the world, its regularities, etc. In another understanding, a world model is associated with the results of understanding a text or, more broadly, a discourse. In the process of understanding a discourse, a mental model of it is built, which results from the interaction between the content plane of the text and the knowledge about the world characteristic of a given subject [Johnson-Laird 1988, p. 237 ff]. The first and second understandings are often combined. This is typical of linguists working within cognitive linguistics and cognitive science.

Closely related to the category of frame is the concept of scene. In the literature, the category of scene is mainly used to designate a conceptual structure for the declarative representation of situations and their parts that are actualized in a speech act and highlighted by linguistic means (lexemes, syntactic constructions, grammatical categories, etc.). Being associated with linguistic forms, a scene is often actualized by a particular word or expression. In story grammars (see below), a scene appears as part of an episode or narrative. Typical examples of scenes are a set of blocks that an AI system works with, or the location of the action in a story together with the participants in the action. In artificial intelligence, scenes are used in image recognition systems, as well as in programs oriented toward the study (analysis, description) of problem situations. The concept of a scene has become widespread in theoretical linguistics and in logic, in particular in situation semantics, in which the meaning of a lexical unit is directly associated with a scene.

Modern computational linguistics relies heavily on mathematical models. There is even a common belief that linguists are not particularly needed for the automatic modeling of natural language. A well-known catchphrase is attributed to Frederick Jelinek, head of the Johns Hopkins University Speech Recognition Center: “Anytime a linguist leaves the group, the recognition rate goes up.”

However, the more complex and multi-level the linguistic modeling tasks set for developers of automatic systems, the more obvious it becomes that they cannot be solved without linguistic theory, an understanding of how language functions, and expert linguistic competence. At the same time, it has become clear that automatic methods for analyzing and modeling language data can significantly enrich theoretical linguistic research, serving both as a means of collecting language data and as a tool for testing the validity of a particular linguistic hypothesis.

Forum on Evaluation of Automatic Text Processing Systems

S.Yu.Toldova, O.N. Lyashevskaya, A.A. Bonch-Osmolovskaya

How can lexical meaning be formalized and made “machine readable”? An answer is offered by distributional models of language, in which the meaning of a word is the sum of its contexts in a sufficiently large corpus. Artificial neural networks make it possible to train such models quickly and efficiently.
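The idea of meaning as a sum of contexts can be illustrated with a count-based sketch. Note the assumptions: this uses plain co-occurrence counts and cosine similarity rather than a neural network, and the toy corpus, the window size, and the function names `context_vectors` and `cosine` are invented for the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# A toy corpus; a real model needs a sufficiently large one.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the cat",
    "linguists study language",
]
vecs = context_vectors(corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_language = cosine(vecs["cat"], vecs["language"])
```

In this corpus “cat” and “dog” share contexts such as “drinks” and “chases”, so their similarity is positive, while “cat” and “language” share no context words and their similarity is zero.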

Denis Kiryanov, Tanya Panova (scientific supervisor B.V. Orekhov)

This program has two functions: a) normalization of Yiddish text and b) transliteration from the Hebrew square script into the Latin alphabet. Both tasks are pressing: until now no normalizer has existed, unless spell checkers are counted as such. Meanwhile, almost every publishing house that published books in Yiddish followed its own spelling practice. A normalizer is needed for work on the Yiddish language corpus, to reduce all texts to a single spelling recognized by the parser. Transliteration will allow typologists to work with Yiddish material.


Linguistic informatics is part of the theory of information services. The theory of information services arose in connection with the computerization of speech, that is, in connection with the use of computers as a means of recording, accounting and storing linguistic information. Thanks to technology, it was possible to combine the functions of a library, archive and office.

Large classes of texts are processed through automatic summarization. The continuously growing volume of scientific and technical information, which is becoming ever more laborious to search, gave rise to the idea of searching through so-called secondary texts, which present the information of a primary document in condensed form: a bibliographic description, an annotation, an abstract, a scientific translation.

The primary text is condensed by compressing it. Special methods for condensing the primary text have been developed:

a) statistical-distributional methods, which identify the most informative sentences, those in which the linguistic signs most significant for a given text are concentrated;

b) methods based on semantic indicators, which mark the most meaningful “points” of the text (the subject of research, purpose, methods, relevance, scope, conclusions, results);

c) the method of textual connections, in which taking interphrase connections into account makes the abstract coherent.
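The first of these approaches, selecting the most informative sentences by word statistics, can be sketched as follows. This is a simplified illustration under stated assumptions: informativeness is approximated by the average corpus frequency of a sentence's non-stopword tokens, and the `summarize` function, the stopword list, and the sample text are all invented for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "on", "for", "in", "and", "was", "is", "to"}


def content_words(sentence):
    """Lowercased word tokens of a sentence, without stopwords."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so that long sentences are not favored by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


text = (
    "Machine translation relies on automatic dictionaries. "
    "Automatic dictionaries store terms for machine translation systems. "
    "The weather was pleasant yesterday."
)
summary_one = summarize(text, 1)
summary_two = summarize(text, 2)
```

Dividing by sentence length is a design choice made here to avoid rewarding long sentences; a sum of frequencies or a different weighting scheme would be equally plausible variants.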

3. Practical terminology.
Practical terminology includes sections:

a) lexicographic terminology, which deals with the theory and practice of creating special dictionaries, unifying term systems, translating terms, creating terminological data banks, automating their storage and processing.

b) lexicography proper, which became a subject of applied linguistics as one of the most labor-intensive kinds of practical linguistic work. Dictionaries take decades to compile, so the desire of scholars to automate lexicographic work is understandable. Automatic dictionaries have appeared. Their purpose is to increase productivity in working with texts and in collecting, storing, and processing various language units. Dictionaries of this type are used in automatic text processing systems.

Automatic translation.

Automatic, or machine, translation is based on the assumption that typologically different language structures (vocabulary, word order, inflection, syntactic structures) can be brought into correspondence. The linguistic principle of translation is the comparison of linguistic units of two or more languages that are equivalent in meaning.

There are two stages in the development of automatic translation systems. At the first stage, fundamental problems of machine translation were addressed: creating automatic dictionaries, developing an intermediary language, formalizing grammar, resolving homonymy, and processing idiomatic expressions. At the second stage, set-theoretic models of grammar, dependency grammars, immediate-constituent grammars, and generative grammars continued to be developed fruitfully and put into practice. During this period, semantics in the spirit of the “Meaning - Text” model was increasingly drawn into applied linguistics. Centers for applied linguistics at domestic and foreign universities develop machine translation strategies. These include the laboratories of mathematical linguistics at St. Petersburg University and at the Institute of Applied Mathematics of the Russian Academy of Sciences; the All-Union Translation Center; the “Speech Statistics” group at the Leningrad Pedagogical Institute, led by Raymond Genrikhovich Piotrovsky; and the group for the study of “Meaning - Text” syntactic modeling, led by Igor Aleksandrovich Melchuk.

A new stage in improving machine translation is associated with the use of an intermediary language, a knowledge representation language. It is based on an analysis of the meaning of a sentence, obtained by understanding the input sentence and then supplemented and annotated with information from the knowledge base, in the terms of that knowledge base. The translation process is the transformation of the input sentence of language X into the output structure of language Y. In other words, the result of machine translation is not a translation proper but a retelling of the source text (X). The quality of translation depends on the effectiveness of the knowledge representation language. High-quality machine translation can be achieved only by creating reliable linguistic foundations and software tools for building powerful semantic networks based on automated lexicons.

IV. Ethnolinguistics.

Ethnolinguistics (ethnosemantics, anthropolinguistics) is a field of linguistics that studies language in its relationship with the culture of a particular ethnic group. The foundations of ethnolinguistics were laid in the work of Franz Boas and Edward Sapir in the first quarter of the 20th century. In the second half of the 20th century, ethnolinguistics became an independent branch of linguistics. Ethnolinguistic studies of this period are characterized by the use of methods from experimental psychology; comparison of the semantic models of different languages; the study of folk taxonomy; paralinguistic research; reconstruction of spiritual ethnic culture from language data; and renewed attention to folklore.

Central to ethnolinguistics are two closely interrelated problems, which can be called “cognitive” and “communicative”:

1. How, by what means and in what form are the cultural (everyday, religious, social, etc.) ideas of the people speaking this language about the world around them and about the place of man in this world reflected in the language?

2. What forms and means of communication - primarily linguistic communication - are specific to a given ethnic or social group?

In accordance with these problems, two directions have emerged in ethnolinguistics: cognitively oriented ethnolinguistics and communicatively oriented ethnolinguistics.

a) Cognitively oriented ethnolinguistics.

Cognitively oriented ethnolinguistics is characteristic of American linguistics, where it is called anthropological linguistics. Initially, anthropological linguistics focused on the cultures of peoples that differed sharply from European ones, primarily the American Indians. Establishing genetic relationships between their languages and describing the current state of those languages were subordinated to the task of comprehensively describing the culture of these peoples and reconstructing their history, including migration routes. The recording and interpretation of everyday and folklore texts was an integral component of anthropological description.

Following Franz Boas, anthropological linguistics holds that more fine-grained classifications of reality in a language correspond to more important aspects of the given culture. As the American linguist and anthropologist Harry Hoijer notes, “Hunter-gatherer peoples, such as the Apache tribes of the American Southwest, have an extensive vocabulary of names for animals, plants, and environmental phenomena. The peoples whose main source of existence is fishing (in particular, the Indians of the northern Pacific coast) have in their vocabulary a detailed set of fish names, as well as fishing tools and techniques.”

Ethnolinguists have paid the greatest attention to such taxonomic systems as designations of body parts, kinship terms, so-called ethnobiological classifications, that is, the names of plants and animals (the American anthropologist B. Berlin, Anna Wierzbicka), and especially color terms (B. Berlin and P. Kay, A. Wierzbicka).

In modern anthropological ethnolinguistics, we can conditionally distinguish “relativistic” and “universalist” directions: for the first, the priority is the study of cultural and linguistic specifics in the speaker’s picture of the world, for the second, the search for universal properties of the vocabulary and grammar of natural languages.

Examples of research in the relativistic direction in ethnolinguistics are the works of Yuri Derenikovich Apresyan, Nina Davidovna Arutyunova, Anna Wierzbicka, Tatyana Vyacheslavovna Bulygina, Alexey Dmitrievich Shmelev, and E.S. Yakovleva, devoted to the peculiarities of the Russian linguistic picture of the world. These authors analyze the meaning and use of words that either denote unique concepts not typical of the conceptualization of the world in other languages (toska “melancholy” and udal' “daring”, avos' “maybe” and nebos' “I suppose”), or correspond to concepts that exist in other cultures but are especially significant for Russian culture or receive a special interpretation in it (pravda and istina, two words for “truth”; svoboda “freedom” and volya “will”; sud'ba “fate” and dolya “lot”). As an example, here is a fragment of the description of the word avos' (“maybe”) from the book by T.V. Bulygina and A.D. Shmelev “Linguistic Conceptualization of the World”:

“<...> Avos' [“maybe”] does not mean at all the same thing as simply “possibly” or “maybe”. <...> Most often, avos' is used as a kind of justification for carelessness, when what is at stake is not so much the hope that some favorable event will happen as the hope that some extremely undesirable consequence will be avoided. One would not say of a person who buys a lottery ticket that he is acting on avos'. One would rather say it of a person who <...> saves money by not buying health insurance and hopes that nothing bad will happen <...> Therefore, trusting to avos' is not just hoping for luck. If the symbol of fortune is roulette, then trusting to avos' can be symbolized by “Russian roulette.””

An example of research in the universalist direction in ethnolinguistics is the work of the Polish scholar Anna Wierzbicka, devoted to the principles of describing linguistic meanings. The goal of many years of research by A. Wierzbicka and her followers is to establish a set of so-called “semantic primitives”, universal elementary concepts from which each language can build an infinite number of configurations specific to that language and culture. Semantic primitives are lexical universals; in other words, they are elementary concepts for which every language has a word. These concepts are intuitively clear to a speaker of any language, and interpretations of any linguistic units, however complex, can be built on their basis. Studying material from genetically and culturally diverse languages of the world, including the languages of Papua New Guinea, Austronesian languages, African languages, and the languages of Australian Aboriginal peoples, A. Wierzbicka constantly refines the list of semantic primitives. Her work “The Interpretation of Emotional Concepts” gives the following list:

“substantives” – I, you, someone, something, people;
“determiners and quantifiers” – this, the same, other, one, two, many, all;
“mental predicates” – think (about), say, know, feel, want;
“actions and events” – do, happen;
“evaluators” – good, bad;
“descriptors” – big, small;
“time and place” – when, where, after/before, under/above;
“metapredicates” – not/negation, because, if, can;
“intensifier” – very;
“taxonomy and partonomy” – kind (of), part (of);
“hedge/prototype” – like.

From semantic primitives, as from “building blocks”, A. Wierzbicka assembles interpretations of even such subtle concepts as emotions. For example, she manages to demonstrate the subtle difference between the American-culture concept denoted by the English word “happy” and the concept denoted by the Russian word “schastlivyj” (and by the corresponding Polish, French, and German adjectives). The word “schastlivyj”, A. Wierzbicka writes, although usually considered the dictionary equivalent of the English word “happy”, has a narrower meaning in Russian culture: “usually it is used to denote rare states of complete bliss or complete satisfaction received from such serious things as love, family, the meaning of life, etc.” This is how the difference is formulated in the language of semantic primitives (components of interpretation B that are absent from interpretation A are given in capital letters).

Interpretation A: X feels happy
X feels something
something good happened to me
I wanted this
I don't want anything else
X feels something similar

Interpretation B: X is “schastlivyj” (Russian)
X feels something
sometimes a person thinks something like this:
something VERY good happened to me
I wanted this
EVERYTHING IS FINE
I CAN'T WANT anything else
so this person feels something good
X feels something similar

It is fundamental to A. Wierzbicka's research program that the search for universal semantic primitives is carried out empirically, using the methods of field linguistics, that is, work with informants: first, for each individual language, the role that a given concept plays in the interpretation of other concepts is clarified, and second, for each concept, the set of languages is identified in which that concept is lexicalized, that is, in which there is a special word expressing it.

b) Communicatively oriented ethnolinguistics.

The most significant results in communicatively oriented ethnolinguistics are associated with the direction called “ethnography of speech” or “ethnography of communication.” Ethnography of speech, as a theory and method for analyzing language use in a sociocultural context, was proposed in the early 1960s in the works of D. Hymes and John J. Gumperz and developed in the works of the American scholar Aaron Cicourel, J. Bauman, and A.W. Corsaro. An utterance is studied only in connection with the speech or communicative event within which it is produced. The cultural conditioning of any speech event (a sermon, a court hearing, a telephone conversation, etc.) is emphasized. The rules of language use are established through participant observation (taking part in a speech event), analysis of spontaneous data, and interviews with native speakers of the language.

Within this direction, researchers study the models of speech behavior accepted in a particular culture or in a particular ethnic or social group. For example, in the culture of the “Central European standard”, an informal conversation between several people assumes, according to the rules of good manners accepted in this community, that the participants will not interrupt each other, that everyone is given the opportunity to speak in turn, and that a person who wants to speak usually signals this with the words “let me note”, “let me ask”, etc. Anyone who wants to leave the conversation announces this intention with the words “unfortunately, I have to go”, “I have to leave for a while”, and so on. Completely different norms of public speech behavior are accepted, for example, in a number of Australian Aboriginal cultures. Respecting the individual rights of each participant in the conversation is not a mandatory rule in these communities: several interlocutors can speak at the same time, it is not necessary to react to what another person says, the speaker speaks without addressing anyone in particular, the interlocutors may not look at each other, etc. This model of speech behavior rests on the initial premise that all utterances are somehow accumulated in the surrounding world, and therefore the “reception” of a message does not necessarily have to follow its “transmission” directly.

Another relevant topic in the ethnography of communication is the linguistic expression of the relative social status of the interlocutors: the rules for addressing the interlocutor, including the use of titles, address by first name, by surname, or by first name and patronymic, professional forms of address (for example, “doctor”, “comrade major”, “professor”), the appropriateness of the informal and formal forms of “you” (Russian “ty” and “vy”), etc. Particular attention is paid to languages in which the relationship between the social positions of the speaker and the listener is fixed not only in vocabulary but also in grammar. An example is Japanese, where the choice of the grammatical form of a verb depends on whether the listener is higher or lower than the speaker in the social hierarchy, and also on whether the speaker and the listener belong to the same social unit. In addition, the relationship between the speaker and the person being spoken about is taken into account. As a result of the combined effect of these restrictions, the same person uses different forms of the verb when addressing a subordinate and a superior, a co-worker and a stranger, his wife and his neighbor's wife.

The grammar also reflects another feature of Japanese speech etiquette: the desire to avoid intruding into the thoughts and feelings of the interlocutor. Japanese has a special grammatical form of the verb, the so-called desiderative mood. Using the desiderative suffix -tai, the speaker expresses a desire to perform the action denoted by the original verb: “read” + tai = “want to read”, “leave” + tai = “want to leave”. However, desiderative forms are possible only when the speaker describes his own desire. The desire of the interlocutor or of a third party is expressed with a special construction that means approximately “from external signs one can conclude that person X wants to perform action Y”. Thus, following the requirements of the grammar, a speaker of Japanese can make judgments only about his own intentions. The language simply does not allow direct statements about the internal state of another person, for example about his desires. One can say “I want...”, but not “You want...” or “He wants...”; one can only say “It seems to me (I have the impression) that you want...” or “It seems to me (I have the impression) that he wants...”.

In addition to the norms of speech etiquette, the ethnography of communication also studies speech situations ritualized in certain cultures, such as a court hearing, the defense of a dissertation, a trade transaction, and the like; rules for choosing a language in interlingual communication; linguistic conventions and clichés that signal that a text belongs to a certain genre (“once upon a time” - in fairy tales, “they listened and decided” - in the minutes of a meeting).

Modern ethnolinguistics is closely related to sociology, psychology, and semiotics. In Russian ethnolinguistics, a special place is occupied by research at the intersection of ethnolinguistics, folkloristics and comparative historical linguistics. First of all, this is a research program dedicated to the ethnolinguistic and ethnocultural history of the Slavic peoples (Nikita Ilyich Tolstoy, Svetlana Mikhailovna Tolstaya, Vladimir Nikolaevich Toporov). Within the framework of this program, ethnolinguistic atlases are compiled, rituals, beliefs, and folklore are mapped; the structure of codified Slavic texts of certain genres is studied, including spell texts, riddles, funeral and construction rituals, etc., in correlation with data from comparative historical and archaeological research.


    Computational linguistics (also: mathematical or computer linguistics; English: computational linguistics) is a scientific field concerned with the mathematical and computer modeling of intellectual processes in humans and animals for the creation of artificial intelligence systems. Its aim is to use mathematical models to describe natural languages.

    Computational linguistics overlaps with natural language processing. However, in the latter the emphasis is not on abstract models, but on applied methods of describing and processing language for computer systems.

    The field of activity of computer linguists is the development of algorithms and application programs for processing linguistic information.

    Origins

    Mathematical linguistics is a branch of the science of artificial intelligence. Its history began in the United States of America in the 1950s. With the invention of the transistor and the advent of a new generation of computers, as well as the first programming languages, experiments began with machine translation, especially of Russian scientific journals. In the 1960s, similar studies were carried out in the USSR (for example, an article on translation from Russian into Armenian in the collection “Problems of Cybernetics” for 1964). However, the quality of machine translation is still much inferior to the quality of human translation.

    From May 15 to May 21, 1958, the first All-Union Conference on Machine Translation was held at the First Moscow State Pedagogical Institute of Foreign Languages. The Organizing Committee was headed by V. Yu. Rosenzweig, with G. V. Chernov as its executive secretary. The full conference program was published in the collection “Machine Translation and Applied Linguistics,” vol. 1, 1959 (also known as “Machine Translation Association Bulletin No. 8”). As V. Yu. Rosenzweig recalls, the published collection of conference abstracts reached the USA and made a great impression there.

    In April 1959, the First All-Union Meeting on Mathematical Linguistics, convened by Leningrad University and the Committee of Applied Linguistics, took place in Leningrad. The main organizer of the Meeting was N. D. Andreev. A number of prominent mathematicians took part, in particular S. L. Sobolev, L. V. Kantorovich (later a Nobel laureate), and A. A. Markov (the last two spoke in the debate). On the opening day of the Meeting, V. Yu. Rosenzweig delivered a keynote speech, “General linguistic theory of translation and mathematical linguistics.”

    Areas of Computational Linguistics

    • Natural language processing (NLP; syntactic, morphological, and semantic text analysis). This also includes:
    1. Corpus linguistics: the creation and use of electronic text corpora.
    2. Creation of electronic dictionaries, thesauri, and ontologies, for example Lingvo. Dictionaries are used, for example, for automatic translation and spell checking.
    3. Automatic translation of texts. Promt is popular among Russian translators; Google Translate is among the free tools.
    4. Automatic extraction of facts from text (information extraction, fact extraction, text mining).
    5. Automatic text summarization. This feature is included, for example, in Microsoft Word.
    6. Building knowledge management systems. See Expert Systems.
    7. Creation of question answering systems.
    • Optical character recognition (OCR), for example the FineReader program.
    • Automatic speech recognition (ASR). Both paid and free software is available.
    • Automatic speech synthesis.

    Plan:

    1. What is computational linguistics?

    2. Object and subject of computational linguistics

    4. Problems of computational linguistics

    5. Research methods for computational linguistics

    6. History and reasons for the emergence of computational linguistics

    7. Basic terms of computational linguistics

    8. Scientists working on the problem of computational linguistics

    9. Associations and conferences on computational linguistics

    10. Literature used.


    Computational linguistics is an independent field within applied linguistics, focused on the use of computers to solve problems related to the use of natural language. (Shilikhina K.M.)


    Computational linguistics, as one of the areas of applied linguistics, studies the linguistic foundations of computer science and all aspects of the connection between language and thinking, modeling language and thinking in a computer environment by means of computer programs. Its interests lie in: 1) optimization of communication based on linguistic knowledge; 2) creation of natural language interfaces and typologies of language understanding for human-machine communication; 3) creation and modeling of computer information systems. (Sosnina E.P.)


    Object of computational linguistics: the analysis of language in its natural state, as it is used by people in different communication situations, and of how the features of language can be formulated.


    Tasks of computational linguistics:


    Computational linguistics research methods:

    1. The modeling method is used for a special kind of object of study: one that is not available to direct observation. According to the definition of the mathematician C. Shannon, a model is a representation of an object in some form that differs from the form of its real existence.

    2. The method of knowledge representation theory relies on ways of representing knowledge that are oriented towards automatic processing by modern computers.

    3. The method of programming language theory. Programming language theory is a field of computer science concerned with the design, analysis, characterization, and classification of programming languages and with the study of their individual features.


    Reasons for the emergence of computational linguistics

    1. The emergence of computers

    2. The problem of communicating with computers of untrained users


    1. Dictionary search system developed at Birkbeck College in London in 1948.

    2. Warren Weaver Memorandum

    3. The beginning of the introduction of the first computers in the field of machine translation

    4. Georgetown Project in 1954


    1960s-70s of the twentieth century

    1. ALPAC (Automatic Language Processing Advisory Committee)

    2. A new stage in the development of computer technologies and their active use in linguistic tasks

    3. The creation of a new generation of computers and programming languages

    4. Increasing interest in machine translation


    Late 1980s - early 1990s of the twentieth century

    • The emergence and active development of the Internet

    • Rapid growth in the volume of text information in electronic form

    • The need for automatic processing of texts in natural language


    Modern commercial systems

    1. Products of PROMT and ABBYY (Lingvo)

    2. Machine translation technologies

    3. Translation Memory technologies

    • Bringing text to life

    • Communication models

    • Computer lexicography

    • Machine translation

    • Corpora of texts


    Natural language text analysis

    3 levels of text structure:
    • Surface syntactic structure

    • Deep syntactic structure

    • Semantic level


    The problem of synthesis is the reverse of that in analysis

    Bringing text to life

    1. Exchange of texts through visual images on the display screen

    2. 2 modalities of human thinking: symbolic and visual.


    Communication models

    1. Imitation of the communication process

    2. Creation of an effective dialogue model


    Hypertext is a special way of organizing and presenting text in which several texts or text fragments can be interconnected by links of various types.
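
    The definition above can be sketched as a small data structure. This is only a minimal illustration, not a description of any particular hypertext system: the Node class, the link types (see_also, example), and the three sample fragments are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One text fragment plus its typed outgoing links (hypothetical model)."""
    text: str
    # Maps a link type to the ids of the fragments it points to.
    links: dict[str, list[str]] = field(default_factory=dict)


# Hypothetical three-fragment hypertext.
hypertext = {
    "intro": Node(
        "Computational linguistics uses computers to study language.",
        {"see_also": ["corpus"], "example": ["mt"]},
    ),
    "corpus": Node(
        "A corpus is a structured collection of texts.",
        {"see_also": ["intro"]},
    ),
    "mt": Node("Machine translation converts text between languages."),
}


def reachable(start: str) -> list[str]:
    """Return the ids of all fragments reachable from start, breadth-first."""
    seen = [start]
    queue = [start]
    while queue:
        current = queue.pop(0)
        for targets in hypertext[current].links.values():
            for target in targets:
                if target not in seen:
                    seen.append(target)
                    queue.append(target)
    return seen


print(reachable("intro"))
```

    Unlike a linear text, the reading order here is not fixed: it depends on which links the reader chooses to follow.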


    Differences between hypertext and traditional text

    Hypertext

    1. processing of oral speech

    2. processing of written text


    Spoken speech processing

    1. automatic speech synthesis

    A) the development of text-to-speech synthesizers. A synthesizer includes 2 blocks: a linguistic text processing block and an acoustic synthesis block.

    2. automatic speech recognition


    1) text recognition

    2) text analysis

    3) text synthesis


    An information retrieval system (IRS) is a software system for storing, searching for, and delivering information of interest.

    V.P. Zakharov defines an IRS as an ordered set of documents and information technologies designed for storing and retrieving information, whether texts or data.


    3 types of IRS

    • Manual: a search in a library.

    • Mechanized: technical means that ensure the selection of the necessary documents.

    • Automatic: searching for information using computers.
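
    An automatic search of this kind is often built on an inverted index. The sketch below is an assumption-laden toy, not the design of any system mentioned in this text: the three sample documents, the whitespace tokenization, and the function names build_index and search are all chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical document collection: id -> text.
documents = {
    1: "machine translation of scientific texts",
    2: "automatic speech recognition",
    3: "automatic translation and speech synthesis",
}


def build_index(docs):
    """Map every word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of the documents that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(documents)
print(search(index, "automatic translation"))
```

    The index is built once when documents are stored; each query then needs only set lookups and intersections rather than a scan of every text.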


    Computer lexicography

    Computer lexicography is one of the important areas of applied linguistics; it deals with the theory and practice of compiling dictionaries.

    There are 2 directions in lexicography:
    • Traditional lexicography compiles traditional dictionaries

    • Machine lexicography deals with automation of dictionary preparation and solves problems of developing electronic dictionaries


    Tasks of computer lexicography

    • Automatically obtaining various dictionaries from text

    • Creation of dictionaries that are electronic versions of traditional dictionaries or complex electronic linguistic dictionaries for traditional dictionary work, for example LINGVO

    • Development of theoretical and practical aspects of compiling special computer dictionaries, for example for information retrieval, machine translation


    Machine translation

    Machine translation is the conversion of a text in one natural language into another natural language by a computer.

    Types of machine translation
    • FAMT (Fully Automated Machine Translation): fully automatic translation

    • HAMT (Human Aided Machine Translation): machine translation with human participation

    • MAHT (Machine Aided Human Translation): translation carried out by a person using auxiliary software and linguistic tools.


    • 2) professional MT: higher-quality translation followed by human editing

    • 3) interactive MT: translation carried out in special support systems, in dialogue with a computer system. The quality of MT depends on customization options, resources, and the type of texts.

    Corpus of texts

    A corpus of texts is a collection of texts based on a logical concept, a logical idea that unites these texts.

    A language corpus is a large, electronically stored, unified, structured, annotated, and philologically competent array of language data designed to solve specific linguistic problems.


    Representativeness is the most important property of a corpus


    The purpose of a language corpus is to show the functioning of linguistic units in their natural contextual environment



    Based on the corpus, you can obtain the following data:

    1. about the frequency of grammatical categories

    2. about frequency changes

    3. about changes in contexts in different periods of time

    5. about the co-occurrence of lexical units

    6. about the features of their compatibility
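
    Two of the kinds of data listed above, frequency and co-occurrence, can be illustrated with a short sketch. It assumes a toy three-sentence corpus invented for this example and counts word forms rather than grammatical categories; real corpora are annotated and far larger.

```python
from collections import Counter
from itertools import combinations

# Hypothetical miniature corpus: one sentence per string.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]


def word_frequencies(sentences):
    """Count how often each word form occurs in the corpus."""
    return Counter(word for s in sentences for word in s.split())


def cooccurrences(sentences):
    """Count unordered pairs of distinct words that share a sentence."""
    pairs = Counter()
    for s in sentences:
        # Sorting makes each pair appear in one canonical order.
        words = sorted(set(s.split()))
        pairs.update(combinations(words, 2))
    return pairs


freq = word_frequencies(corpus)
pairs = cooccurrences(corpus)
print(freq.most_common(3))
print(pairs[("cat", "sat")])
```

    Running the same counts on sub-corpora from different periods would give the frequency changes mentioned in the list.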


    Brown Corpus


    The logical idea that unites the texts of a corpus is embodied in the rules for organizing texts into the corpus, the algorithms and programs for analyzing it, and the associated ideology and methodology. A national corpus represents a given language at a certain stage (or stages) of its existence and in all the diversity of its genres, styles, territorial and social variants, etc.

    Basic terms of computational linguistics

    • Programming languages (PL) are a class of artificial languages designed for processing information using a computer. Any programming language is a strict (formal) sign system in which computer programs are written. According to various estimates, there are currently between one thousand and ten thousand different programming languages.

    • Computer science is the science of the patterns of recording, storing, processing, transmitting, and using information by technical means.



    Information retrieval is the process of finding documents (texts, records, etc.) that correspond to a received query.

    An information retrieval system (IRS) is an ordered set of documents (arrays of documents) and information technologies designed for storing and retrieving information: texts (documents) or data (facts).

    Machine lexicography (computational lexicography) deals with the automation of dictionary preparation and solves the problems of developing electronic dictionaries.

    Machine translation is the transformation by a computer of a text in one natural language into a content-equivalent text in another natural language.

    Hypertext is a technology for organizing information: specially structured text, divided into separate blocks and presented non-linearly, for the effective presentation of information in computer environments.


    • Frame: a structure for representing declarative knowledge about a typified, thematically unified situation, i.e. a data structure describing a stereotypical situation.

    • Scenario: a sequence of several episodes in time. It is also a representation of a stereotypical situation or stereotypical behavior, but the elements of a scenario are the steps of an algorithm or instructions.
    • Plan: a representation of knowledge about the possible actions needed to achieve a certain goal.
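
    The three terms can be contrasted in a small sketch. Everything in it is an illustrative assumption rather than a standard formalism: the Frame and Scenario classes, the restaurant example, and the reading of a plan as the list of steps still needed to reach the goal.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Declarative knowledge: named slots describing a stereotypical situation."""
    name: str
    slots: dict[str, str] = field(default_factory=dict)  # slot name -> expected filler type


@dataclass
class Scenario:
    """Procedural knowledge: the ordered steps of a stereotypical situation."""
    name: str
    steps: list[str]


# Hypothetical example situation.
restaurant_frame = Frame(
    "restaurant_visit",
    {"customer": "person", "waiter": "person", "food": "dish", "payment": "money"},
)
restaurant_scenario = Scenario(
    "restaurant_visit", ["enter", "order", "eat", "pay", "leave"]
)


def plan(scenario, done):
    """Toy plan: the steps that remain before the scenario's goal is reached."""
    return [step for step in scenario.steps if step not in done]


print(sorted(restaurant_frame.slots))
print(plan(restaurant_scenario, {"enter", "order"}))
```

    The frame says what participates in the situation, the scenario says in what order things happen, and the plan selects the actions that lead from the current state to the goal.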



    Scientists in the field of computational linguistics:

    • Soviet and Russian scientists: Alexey Lyapunov, Igor Melchuk, Olga Kulagina, Yu.D. Apresyan, N.N. Leontyeva, Yu.S. Martemyanov, Z.M. Shalyapina, Igor Boguslavsky, A.S. Narinyani, A.E. Kibrik, A.N. Baranov.

    • Western scientists: Yorick Wilks, Gregory Grefenstette, Greville Corbett, John Carroll, Diana McCarthy, Lluís Màrquez, Dan Moldovan, Joakim Nivre, Victor Raskin, Eduard Hovy.


    Associations and conferences in computational linguistics:
    • “Dialogue” is the main Russian conference on computational linguistics with international participation.

    The priority of Dialogue is computer modeling of the Russian language. The working languages of the conference are Russian and English. To attract foreign reviewers, the bulk of applied work is submitted in English.

    Main directions of the conference:
    • Linguistic semantics and semantic analysis

    • Formal language models and their applications

    • Theoretical and computer lexicography

    • Methods for evaluation of text analysis and machine translation systems

    • Corpus linguistics. Creation, application, evaluation of corpora

    • Internet as a linguistic resource. Linguistic technologies on the Internet

    • Ontologies. Knowledge extraction from texts

    • Computer analysis of documents: abstracting, classification, search

    • Automatic sentiment analysis of texts

    • Machine translation

    • Models of communication. Communication, dialogue and speech act

    • Speech analysis and synthesis



    • The Association for Computational Linguistics (ACL) is an international scientific and professional society of people working on problems involving natural language and computing. Its annual meeting is held every summer in locations where significant computational linguistics research is being carried out. It was founded in 1962 as the Association for Machine Translation and Computational Linguistics (AMTCL) and became the ACL in 1968.
    • The ACL has European (EACL) and North American (NAACL) chapters.

    • The ACL journal, Computational Linguistics, is the premier forum for research in computational linguistics and natural language processing. Since 1988 the journal has been published for the ACL by MIT Press.
    • The ACL book series, Studies in Natural Language Processing, is published by Cambridge University Press.

    • Every year the ACL and its chapters organize international conferences in different countries.

    ACL 2014 was held in Baltimore, USA.

    References:

    1. Marchuk Yu.N. Computer linguistics: textbook. - M.: AST: East-West, 2007. - 317 p.

    2. Shilikhina K.M. Fundamentals of applied linguistics: textbook for specialty 021800 (031301), Theoretical and applied linguistics. - Voronezh, 2006.

    3. Boyarsky K.K. Introduction to computational linguistics: textbook. - St. Petersburg: NRU ITMO, 2013. - 72 p.

    4. Shchipitsina L.Yu. Information technologies in linguistics: textbook. - M.: FLINTA: Nauka, 2013. - 128 p.

    5. Sosnina E.P. Introduction to applied linguistics: textbook. - 2nd ed., revised and expanded. - Ulyanovsk: Ulyanovsk State Technical University, 2012. - 110 p.

    6. Baranov A.N. Introduction to applied linguistics: textbook. - M.: Editorial URSS, 2001. - 360 p.

    7. Applied linguistics: textbook / L.V. Bondarko, L.A. Verbitskaya, G.Ya. Martynenko and others; ed. by A.S. Gerd. - St. Petersburg: St. Petersburg University Publishing House, 1996. - 528 p.

    8. Shemyakin Yu.I. Beginnings of computer linguistics: textbook. - M.: Publishing house MGOU, JSC "Rosvuznauka", 1992.