Methods of mathematical statistics

    Introduction.

    Basic concepts of mathematical statistics.

    Statistical processing of the results of psychological and pedagogical research.

    References.

1. Introduction

The application of mathematics to other sciences makes sense only in conjunction with a deep theory of a specific phenomenon. It is important to remember this in order not to get lost in a simple game of formulas, behind which there is no real content.

Academician Yu. A. Mitropolsky

Theoretical research methods in psychology and pedagogy make it possible to reveal the qualitative characteristics of the studied phenomena. These characteristics will be fuller and deeper if the accumulated empirical material is subjected to quantitative processing. However, the problem of quantitative measurements in the framework of psychological and pedagogical research is very complex. This complexity lies primarily in the subjective-causal variety of pedagogical activity and its results, in the very object of measurement, which is in a state of continuous movement and change. At the same time, the introduction of quantitative indicators into the study today is a necessary and obligatory component of obtaining objective data on the results of pedagogical work. As a rule, these data can be obtained both by direct or indirect measurement of various components of the pedagogical process, and by quantitative assessment of the corresponding parameters of its adequately constructed mathematical model. For this purpose, in the study of the problems of psychology and pedagogy, the methods of mathematical statistics are used. With their help, various tasks are solved: processing factual material, obtaining new, additional data, substantiating the scientific organization of the research, and others.

2. Basic concepts of mathematical statistics

An extremely important role in the analysis of many psychological and pedagogical phenomena is played by average values, which are a generalized characteristic of a qualitatively homogeneous population with respect to a certain quantitative attribute. It is impossible, for example, to calculate the average specialty or the average nationality of university students, since these are qualitatively heterogeneous phenomena. But it is possible and necessary to determine the average numerical characteristic of their academic performance (average score), of the effectiveness of methodological systems and techniques, etc.

In psychological and pedagogical research, various types of averages are used: the arithmetic mean, geometric mean, median, mode, and others. The most common are the arithmetic mean, median, and mode.

The arithmetic mean is used in cases where there is a directly proportional relationship between the defining property and the given attribute (for example, with the improvement of the performance indicators of the study group, the performance indicators of each of its members improve).

The arithmetic mean is the sum of the observed values divided by their number and is calculated by the formula:

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad (1)$$

where $\bar{X}$ is the arithmetic mean; $X_1, X_2, X_3, \dots, X_n$ are the results of individual observations (techniques, actions);

$n$ is the number of observations (techniques, actions);

$\sum_{i=1}^{n} X_i$ is the sum of the results of all observations (techniques, actions).
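The definition above can be sketched in a few lines of Python. The scores are hypothetical and serve only to illustrate the calculation; they are not taken from any study.

```python
def arithmetic_mean(values):
    """Return the sum of the observations divided by their number."""
    if not values:
        raise ValueError("at least one observation is required")
    return sum(values) / len(values)


# Hypothetical marks of seven students (illustrative data only).
scores = [3, 4, 4, 5, 3, 5, 4]
mean_score = arithmetic_mean(scores)
print(mean_score)
```

The function rejects an empty list explicitly, since an average of zero observations is undefined.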

Median (Me) is a measure of the average position that characterizes the value of a feature on an ordered (built on the basis of increasing or decreasing) scale, which corresponds to the middle of the studied population. The median can be determined for ordinal and quantitative characteristics. The location of this value is determined by the formula: Location of the median = (n + 1) / 2

For example, according to the results of the study, it was found that:

- 5 of the participants in the experiment study with "excellent" marks;

- 18 people study with "good" marks;

- 22 people study with "satisfactory" marks;

- 6 people study with "unsatisfactory" marks.

Since N = 54 people in total took part in the experiment, the middle of the sample is at position (54 + 1) / 2 = 27.5. Only 5 + 18 = 23 students have marks of "good" or better, so more than half of the students study below the mark "good"; that is, the median is above "satisfactory" but below "good" (see figure).
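A minimal sketch of locating the median on an ordinal scale from grouped counts is given below. It uses the four counts exactly as listed in the example; note that these sum to 51 rather than the stated 54, so the computed position differs from the one in the text, although the class containing the median is the same in both cases. The function name is an illustrative choice, and returning the class that contains the median position is a simplification.

```python
def median_category(ordered_counts):
    """Find the class containing the median of grouped ordinal data.

    ordered_counts: list of (category, frequency) pairs in scale order.
    Returns the median position (n + 1) / 2 and the category that contains it.
    """
    n = sum(count for _, count in ordered_counts)
    if n == 0:
        raise ValueError("empty distribution")
    position = (n + 1) / 2
    cumulative = 0
    for category, count in ordered_counts:
        cumulative += count
        # The first class whose cumulative frequency reaches the position
        # is the one that contains the median.
        if cumulative >= position:
            return position, category


# Counts as listed in the example above, ordered from the best mark down.
grades = [
    ("excellent", 5),
    ("good", 18),
    ("satisfactory", 22),
    ("unsatisfactory", 6),
]
position, category = median_category(grades)
print(position, category)
```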

Mode (Mo) is the most frequently occurring, most typical value of a feature. It corresponds to the class with the maximum frequency. This class is called the modal value.

For example.

If to the question of the questionnaire: “indicate the degree of proficiency in a foreign language”, the answers were distributed:

1 - speak fluently - 25

2 - I know enough to communicate - 54

3 - I know how, but I have difficulty communicating - 253

4 - I hardly understand - 173

5 - don't speak - 28

Obviously, the most typical value here is "I know how, but I have difficulty communicating", which is the modal value. So the mode is answer 3, and 253 is its frequency.
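The questionnaire example can be checked with a short sketch. The frequencies are those listed above; the dictionary layout and the function name are illustrative choices.

```python
def mode_category(frequencies):
    """Return the category with the highest frequency and that frequency."""
    category = max(frequencies, key=frequencies.get)
    return category, frequencies[category]


# Answer frequencies from the questionnaire example above.
answers = {
    "speak fluently": 25,
    "I know enough to communicate": 54,
    "I know how, but I have difficulty communicating": 253,
    "I hardly understand": 173,
    "don't speak": 28,
}
modal_answer, modal_frequency = mode_category(answers)
print(modal_answer, modal_frequency)
```

The mode is the category itself; 253 is only the frequency with which it occurs.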

When using mathematical methods in psychological and pedagogical research, great importance is attached to the calculation of variance and root-mean-square (standard) deviations.

The variance is the mean of the squared deviations of the individual values from the mean. It is one of the characteristics of the scatter of the individual values of the studied variable (for example, students' marks) around the mean. The variance is calculated by determining, in turn: the deviation of each value from the mean; the square of each deviation; the sum of the squared deviations; and the mean of the squared deviations (see Table 6.1).

The variance value is used in various statistical calculations, but is not directly observable. The quantity directly related to the content of the observed variable is the standard deviation.

Table 6.1

Variance calculation example

Columns: value of the indicator; deviation from the mean; squared deviation. Sample entry in the deviation column: 2 - 3 = -1.

The standard deviation shows how typical and representative the arithmetic mean is; it reflects the measure of fluctuation of the numerical values of the feature from which the mean is derived. It is equal to the square root of the variance and is determined by the formula:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}} \qquad (2)$$

where $\sigma$ is the standard (root-mean-square) deviation, $X_i$ are the individual values, $\bar{X}$ is their arithmetic mean, and $N$ is the number of observations. With a small number of observations (actions), fewer than 100, "N - 1" should be used in the formula instead of "N".
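The steps for the variance and the standard deviation can be sketched as follows. The marks are hypothetical and chosen so that the mean is 3 and the first deviation is 2 - 3 = -1; the `sample` flag is an illustrative name for the switch between dividing by N and by N - 1.

```python
import math


def variance(values, sample=False):
    """Mean of squared deviations; divides by N - 1 when sample=True."""
    n = len(values)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough observations")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor


def standard_deviation(values, sample=False):
    """Square root of the variance."""
    return math.sqrt(variance(values, sample))


# Hypothetical marks (illustrative only); their mean is 3.
marks = [2, 3, 3, 4, 3]
print(variance(marks), variance(marks, sample=True))
print(standard_deviation(marks, sample=True))
```

For small groups such as this one, the `sample=True` variant corresponds to the "N - 1" recommendation in the text.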

The arithmetic mean and the standard deviation are the main characteristics of the results obtained during a study. They make it possible to generalize data, compare them, and establish the advantages of one psychological and pedagogical system (program) over another.

The root mean square (standard) deviation is widely used as a measure of dispersion for various characteristics.

When evaluating the results of the study, it is important to determine the dispersion of a random variable around the mean. This scattering is described using Gauss's law (the law of the normal distribution of the probability of a random variable). The essence of the law is that when measuring a certain feature in a given set of elements, there are always deviations in both directions from the norm due to a variety of uncontrollable reasons, while the larger the deviations, the less often they occur.

Further processing of the data may reveal: the coefficient of variation (stability) of the investigated phenomenon, which is the standard deviation expressed as a percentage of the arithmetic mean; the measure of skewness, which shows in which direction the majority of deviations lie; the measure of kurtosis (peakedness), which shows the degree to which the values of a random variable are concentrated around the mean; and others. All these statistics help to characterize the phenomena under study more fully.
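A minimal sketch of these three indicators is shown below. It assumes the usual moment-based definitions (skewness as the third central moment divided by the cubed standard deviation, kurtosis reported as excess over the normal value of 3); the text does not specify which definitions it uses, so these are assumptions, and the data are hypothetical.

```python
import math


def shape_measures(values):
    """Coefficient of variation (%), skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4 (population form, divisor n).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    sigma = math.sqrt(m2)
    return {
        "coefficient_of_variation_percent": 100 * sigma / mean,
        "skewness": m3 / sigma ** 3,
        "excess_kurtosis": m4 / sigma ** 4 - 3,
    }


# Hypothetical symmetric data, so the skewness is zero.
measures = shape_measures([1, 2, 3, 4, 5])
print(measures)
```

A positive skewness would mean that the larger deviations lie above the mean, and a negative excess kurtosis means the values are less concentrated around the mean than in a normal distribution.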

Measures of association between variables. Relationships (dependencies) between two or more variables are called correlation in statistics. Correlation is assessed using the correlation coefficient, which measures the degree and magnitude of the relationship.

There are many correlation coefficients. Let's consider only a part of them, which take into account the presence of a linear relationship between variables. Their choice depends on the scales of measurement of the variables, the relationship between which needs to be assessed. The most often used in psychology and pedagogy are the Pearson and Spearman coefficients.

Let us consider the calculation of correlation coefficients using specific examples.

Example 1. Let two variables being compared, X (marital status) and Y (expulsion from the university), be measured on a dichotomous scale (a special case of the nominal scale). To determine the relationship, we use the Pearson coefficient.

When there is no need to calculate the frequency of occurrence of the individual values of X and Y, it is convenient to calculate the correlation coefficient from a contingency table (see Tables 6.2, 6.3, 6.4), which shows the number of joint occurrences of pairs of values of the two variables (features): A is the number of cases in which X equals zero and, at the same time, Y equals one; B is the number of cases in which X and Y both equal one; C is the number of cases in which X and Y both equal zero; D is the number of cases in which X equals one and, at the same time, Y equals zero.

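Table 6.2

General contingency table

| Feature Y | Feature X = 0 | Feature X = 1 | Total |
|---|---|---|---|
| 1 | A | B | A + B |
| 0 | C | D | C + D |
| Total | A + C | B + D | N |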

In general terms, the formula for the Pearson correlation coefficient for dichotomous data is

$$r = \frac{BC - AD}{\sqrt{(A + B)(C + D)(A + C)(B + D)}}$$

Table 6.3

Dichotomous scale data example

Substitute the data from the contingency table (see Table 6.4), corresponding to the example under consideration, into the formula:

Thus, the Pearson correlation coefficient for this example is 0.32; that is, the relationship between students' marital status and expulsion from the university is weak.
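The calculation for dichotomous data can be sketched as below. It assumes the standard form of the coefficient for a 2x2 table, r = (BC - AD) / sqrt((A + B)(C + D)(A + C)(B + D)), with A, B, C, D defined as in the text. The counts are hypothetical, because the data of the original tables did not survive, so the result is not meant to reproduce the value 0.32.

```python
import math


def pearson_dichotomous(a, b, c, d):
    """Pearson (phi) coefficient for a 2x2 contingency table.

    a: X = 0 and Y = 1    b: X = 1 and Y = 1
    c: X = 0 and Y = 0    d: X = 1 and Y = 0
    """
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    return (b * c - a * d) / denominator


# Hypothetical counts (not the data of the example in the text).
r = pearson_dichotomous(a=2, b=6, c=8, d=4)
print(round(r, 3))
```

The coefficient reaches +1 when only the cells B and C are filled (X and Y always coincide) and -1 when only A and D are filled.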

Example 2. If both variables are measured on ordinal scales, Spearman's rank correlation coefficient (Rs) is used as the measure of the relationship. It is calculated by the formula

$$R_s = 1 - \frac{6 \sum_{i=1}^{N} D_i^2}{N (N^2 - 1)}$$

where $R_s$ is Spearman's rank correlation coefficient; $D_i$ is the difference between the ranks of the compared objects; $N$ is the number of compared objects.

The value of the Spearman coefficient varies from -1 to +1. In the first case, there is an unambiguous but oppositely directed relationship between the analyzed variables (as the value of one increases, the value of the other decreases). In the second case, as the values of one variable grow, the values of the other grow as well. If Rs is equal or close to zero, there is no significant relationship between the variables.

As an example of calculating the Spearman coefficient, we use the data from table 6.5.

Table 6.5

Data and intermediate results of calculating the rank correlation coefficient Rs

Columns: qualities; expert ranks; difference of ranks ($D_i$); squared difference of ranks ($D_i^2$).

The sum of the squared rank differences is $\sum D_i^2 = 22$.

Let's substitute the example data into the formula for the Spearman coefficient:

The calculation results allow us to assert that there is a sufficiently pronounced relationship between the variables under consideration.
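A minimal sketch of the rank correlation calculation is given below, assuming the usual formula Rs = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), which holds when there are no tied ranks. The two rank lists are hypothetical (two experts ranking five qualities); they are not the data of Table 6.5.

```python
def spearman_rs(ranks_a, ranks_b):
    """Spearman's rank correlation for two rankings without ties."""
    if len(ranks_a) != len(ranks_b) or len(ranks_a) < 2:
        raise ValueError("need two rankings of equal length (at least 2)")
    n = len(ranks_a)
    # Sum of squared differences between the paired ranks.
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical ranks given by two experts to five qualities.
expert_1 = [1, 2, 3, 4, 5]
expert_2 = [2, 1, 4, 3, 5]
rs = spearman_rs(expert_1, expert_2)
print(rs)
```

Identical rankings give +1 and fully reversed rankings give -1, which matches the range described in the text.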

Statistical test of a scientific hypothesis. Proof of the statistical reliability of an experimental effect differs significantly from proof in mathematics and formal logic, where conclusions are more universal in nature. Statistical proofs are not as strict and final: they always carry a risk of error. Statistical methods therefore do not definitively prove the legitimacy of a conclusion; they show the measure of likelihood with which a particular hypothesis can be accepted.

A pedagogical hypothesis (a scientific assumption about the advantage of a particular method, etc.) is translated, in the process of statistical analysis, into the language of statistics and reformulated as at least two statistical hypotheses. The first (main) one is called the null hypothesis (H0), in which the researcher states his starting position. He declares a priori that the new method (proposed by him, his colleagues or his opponents) has no advantages; from the very beginning, the researcher is thus psychologically ready to take an honest scientific position: the differences between the new and the old methods are declared equal to zero. The other, alternative hypothesis (H1) assumes the advantage of the new method. Sometimes several alternative hypotheses are put forward, with appropriate designations, for example, a hypothesis about the advantage of the old method (H2).

Alternative hypotheses are accepted if and only if the null hypothesis is rejected. This happens when the differences, say, in the arithmetic means of the experimental and control groups are so large (statistically significant) that the risk of error in rejecting the null hypothesis and accepting the alternative does not exceed one of the three accepted significance levels of statistical inference:

- the first level is 5% (in scientific texts this is sometimes written as p = 5% or, expressed as a fraction, α ≤ 0.05), where the risk of an erroneous conclusion is allowed in five cases out of a hundred theoretically possible similar experiments with strictly random selection of subjects for each experiment;

- the second level is 1%, that is, the risk of an error is allowed in only one case out of a hundred (α ≤ 0.01, with the same requirements);

- the third level is 0.1%, that is, the risk of an error is allowed in only one case out of a thousand (α ≤ 0.001). This last level of significance places very high demands on the reliability of experimental results and is therefore rarely used.

When comparing the arithmetic mean of the experimental and control groups, it is important not only to determine which mean is greater, but also how much greater. The smaller the difference between them, the more acceptable the null hypothesis about the absence of statistically significant (reliable) differences will be. Unlike thinking at the level of everyday consciousness, which is inclined to perceive the difference in means obtained as a result of experience as a fact and a basis for inference, a teacher-researcher familiar with the logic of statistical inference will not rush in such cases. He will most likely make an assumption about the randomness of the differences, put forward a null hypothesis about the absence of significant differences in the results of the experimental and control groups, and only after refuting the null hypothesis will he accept the alternative.

Thus, within scientific thinking the question of differences is shifted to another plane. The point is not simply that differences exist (they almost always do), but how large they are, and hence where the boundary lies beyond which one can say: yes, the differences are not accidental, they are statistically significant. This means that after the experiment the subjects of the two groups no longer belong to one general population (as before) but to two different ones, and that the level of preparedness of students potentially belonging to these populations will differ significantly. To show the boundaries of these differences, so-called estimates of general parameters are used.

Let us look at a specific example (see Table 6.6) of how mathematical statistics can be used to refute or confirm the null hypothesis.

Suppose it is necessary to determine whether the effectiveness of students' group activity depends on the level of development of interpersonal relations in the study group. The null hypothesis is that there is no such dependence; the alternative is that the dependence exists. To this end, the performance results of two groups are compared, one acting as the experimental group and the other as the control group. To determine whether the difference between the mean performance indicators of the first and second groups is significant, the statistical significance of this difference must be calculated. Student's t-test can be used for this. It is calculated by the formula:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{M_1^2 + M_2^2}}$$

where $\bar{X}_1$ and $\bar{X}_2$ are the arithmetic means of the variable in groups 1 and 2; $M_1$ and $M_2$ are the mean errors, which are calculated by the formula:

$$M = \frac{\sigma}{\sqrt{N}}$$

where $\sigma$ is the standard (root-mean-square) deviation, calculated by formula (2), and $N$ is the number of measurements in the group.

Let us determine the errors for the first row (experimental group) and the second row (control group):

We find the value of t - criterion by the formula:

Having calculated the value of the t - criterion, it is required to determine the level of statistical significance of the differences between the average performance indicators in the experimental and control groups using a special table. The higher the value of the t-criterion, the higher the significance of the differences.

For this, the calculated t is compared with the tabular t. The tabular value is chosen according to the selected significance level (p = 0.05 or p = 0.01) and the number of degrees of freedom, which is found by the formula:

$$U = N_1 + N_2 - 2$$

where $U$ is the number of degrees of freedom; $N_1$ and $N_2$ are the numbers of measurements in the first and second rows. In our example, U = 7 + 7 - 2 = 12.
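The whole procedure can be sketched as follows. The sketch assumes the common form of the test in which t is the difference of the means divided by the square root of the sum of the squared mean errors, each mean error being the standard deviation (computed with N - 1) divided by the square root of N. The two groups of seven values are hypothetical, since the data of Table 6.6 did not survive; only the tabular value 3.055 for 12 degrees of freedom at the one-percent level is taken from the text.

```python
import math


def squared_mean_error(values):
    """Squared mean error: variance (with N - 1) divided by N."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return variance / n


def student_t(group_1, group_2):
    """Return the absolute t value and the degrees of freedom."""
    mean_1 = sum(group_1) / len(group_1)
    mean_2 = sum(group_2) / len(group_2)
    pooled_error = math.sqrt(
        squared_mean_error(group_1) + squared_mean_error(group_2)
    )
    # The sign of t only reflects the order of the groups, so the
    # absolute value is compared with the tabular one.
    t = abs(mean_1 - mean_2) / pooled_error
    degrees_of_freedom = len(group_1) + len(group_2) - 2
    return t, degrees_of_freedom


# Hypothetical performance scores, seven measurements per group.
experimental = [14, 15, 13, 16, 15, 14, 18]
control = [12, 11, 13, 12, 10, 13, 13]
t_value, df = student_t(experimental, control)

T_TABLE_1_PERCENT_DF12 = 3.055  # tabular value quoted in the text
reject_null = t_value > T_TABLE_1_PERCENT_DF12
print(round(t_value, 3), df, reject_null)
```

With these illustrative data the calculated t exceeds the tabular value, so the null hypothesis of no difference would be rejected at the one-percent level.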

Table 6.6

Data and intermediate results of calculating the significance of statistical differences in mean values

Columns: experimental group; control group. Rows: values of the efficiency of activity.

From the table of the t-criterion we find that t table = 3.055 for the one-percent level (p < 0.01).

However, the teacher-researcher should remember that a statistically significant difference in mean values is an important, but not the only, argument for the presence or absence of a relationship (dependence) between phenomena or variables. Other arguments must therefore be brought in to substantiate a possible connection quantitatively or substantively.

Multivariate data analysis methods. The analysis of the relationship between a large number of variables is carried out using multivariate methods of statistical processing. The purpose of using such methods is to make the hidden patterns visible, to highlight the most significant relationships between variables. Examples of such multivariate statistical methods are:

    - factor analysis;

    - cluster analysis;

    - analysis of variance;

    - regression analysis;

    - latent structural analysis;

    - multidimensional scaling and others.

Factor analysis consists in identifying and interpreting factors. A factor is a generalized variable that makes it possible to condense part of the information, that is, to present it in a convenient form. For example, the factor theory of personality identifies a number of generalized characteristics of behavior, which in this case are called personality traits.

Cluster Analysis allows you to highlight the leading feature and the hierarchy of interrelationships of features.

Analysis of variance (ANOVA) is a statistical method used to study the influence of one or more simultaneously acting independent variables on the variability of an observed trait. Its peculiarity is that the observed trait can only be quantitative, while the explanatory variables can be either quantitative or qualitative.

Regression analysis makes it possible to identify the quantitative (numerical) dependence of the mean value of the resulting (explained) variable on changes in one or more explanatory variables. As a rule, this type of analysis is used when it is necessary to find out by how much the mean value of one characteristic changes when another characteristic changes by one unit.
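As an illustration of the last point, the sketch below fits a straight line by ordinary least squares with one explanatory variable; the slope is the average change in the explained variable when the explanatory variable changes by one unit. The choice of simple linear regression and the data (hours of preparation and marks) are assumptions made for the example only.

```python
def simple_linear_regression(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if s_xx == 0:
        raise ValueError("the explanatory variable does not vary")
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours of preparation and the mark obtained.
hours = [1, 2, 3, 4, 5]
marks = [2.0, 2.5, 3.5, 4.0, 5.0]
slope, intercept = simple_linear_regression(hours, marks)
print(slope, intercept)
```

Here the slope is read as the average gain in the mark per additional hour of preparation.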

Latent Structural Analysis represents a set of analytical and statistical procedures for identifying hidden variables (features), as well as the internal structure of relationships between them. It makes it possible to study the manifestations of complex relationships of directly unobservable characteristics of socio-psychological and pedagogical phenomena. Latent analysis can be the basis for modeling these relationships.

Multidimensional scaling provides a visual assessment of the similarity or difference between some objects described by a large number of different variables. These differences are presented as the distance between the evaluated objects in multidimensional space.

3. Statistical processing of the results of psychological and pedagogical research

In any study, it is important to ensure that the objects of study are sufficiently numerous and representative. To this end, researchers usually resort to mathematical methods for calculating the minimum number of objects (groups of respondents) to be studied, so that objective conclusions can be drawn on this basis.

According to how completely the primary units are covered, statistics divides studies into complete ones, in which all units of the phenomenon are studied, and sample ones, in which only a part of the population of interest, taken according to some criterion, is studied. The researcher does not always have the opportunity to study the entire set of phenomena, although this should always be the aim (there may not be enough time, funds, necessary conditions, etc.); on the other hand, a complete study is often simply not required, since the conclusions will be quite accurate after a certain part of the primary units has been studied.

The theoretical basis of the sampling method is probability theory and the law of large numbers. To make sure that a study has a sufficient number of facts and observations, a table of sufficiently large numbers is used. The researcher has to set the probability and the permissible error. Suppose, for example, that the permissible error in the conclusions drawn from observations, compared with theoretical assumptions, must not exceed 0.05 in either the positive or the negative direction. Then, from the table of sufficiently large numbers (see Table 6.7), we find that a correct conclusion can be made in 9 cases out of 10 when the number of observations is at least 270, in 99 cases out of 100 with at least 663 observations, and so on. Thus, the greater the accuracy and probability with which we intend to draw conclusions, the greater the number of observations required. In psychological and pedagogical research, however, this number should not be excessively large: 300-500 observations are often quite enough for sound conclusions.
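The two values quoted above (270 and 663 observations for a permissible error of 0.05) agree with the widely used worst-case formula n = z^2 / (4 * e^2), where z is the two-sided normal quantile for the chosen probability and e is the permissible error. Whether Table 6.7 was actually built this way is an assumption; the sketch below simply shows that the formula reproduces the quoted numbers when the result is truncated to a whole number.

```python
from statistics import NormalDist


def required_observations(probability, permissible_error):
    """Worst-case number of observations, n = z^2 / (4 * e^2).

    probability: chance that the conclusion is correct (for example 0.9)
    permissible_error: allowed deviation in either direction (for example 0.05)
    """
    if not 0 < probability < 1 or permissible_error <= 0:
        raise ValueError("probability must be in (0, 1) and error positive")
    # Two-sided quantile of the standard normal distribution.
    z = NormalDist().inv_cdf((1 + probability) / 2)
    return z ** 2 / (4 * permissible_error ** 2)


for p in (0.9, 0.99):
    print(p, int(required_observations(p, 0.05)))
```

In practice the result is often rounded up rather than truncated, which would give 271 and 664.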

This method of determining the sample size is the simplest. Mathematical statistics also has more complex methods for calculating the required sample sets, which are covered in detail in the special literature.

However, compliance with the requirements of mass character does not yet ensure the reliability of the conclusions. They will be reliable when the units selected for observation (conversations, experiment, etc.) are sufficiently representative for the studied class of phenomena.

Table 6.7

A short table of sufficiently large numbers

Table entries: the required number of observations, by value of the probability and permissible error.

The representativeness of observation units is ensured primarily by selecting them at random using tables of random numbers. Suppose that 20 training groups must be chosen for a mass experiment out of the 200 available. To do this, a numbered list of all groups is drawn up. Then 20 numbers are written out from the table of random numbers, starting at any position and proceeding at a fixed interval. The groups whose list numbers match these 20 random numbers are the ones the researcher needs. Random selection of objects from the general population gives grounds to assert that the results obtained from the sample will not differ sharply from those that would have been obtained from a study of the entire population.
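Today the table of random numbers is usually replaced by a pseudo-random number generator. A minimal sketch of selecting 20 groups out of 200 without repetition is shown below; the function name and the optional seed (used only to make the selection reproducible) are illustrative choices.

```python
import random


def select_groups(total_groups, sample_size, seed=None):
    """Pick sample_size distinct group numbers from 1..total_groups."""
    if sample_size > total_groups:
        raise ValueError("sample cannot be larger than the population")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no group repeats.
    chosen = rng.sample(range(1, total_groups + 1), sample_size)
    return sorted(chosen)


selected = select_groups(total_groups=200, sample_size=20, seed=1)
print(selected)
```

Recording the seed makes it possible to document exactly how the groups were chosen.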

In the practice of psychological and pedagogical research, not only simple random selections are used, but also more complex selection methods: stratified random selection, multi-stage selection, etc.

Mathematical and statistical research methods are also means of obtaining new factual material. For this purpose, templating techniques are used that increase the informative capacity of the questionnaire and scaling, which makes it possible to more accurately assess the actions of both the researcher and the subjects.

The scales arose because of the need to objectively and accurately diagnose and measure the intensity of certain psychological and pedagogical phenomena. Scaling makes it possible to order the phenomena, to quantitatively evaluate each of them, to determine the lower and higher stages of the studied phenomenon.

So, when studying the cognitive interests of listeners, you can set their boundaries: very high interest - very weak interest. Introduce a number of steps between these boundaries that create a scale of cognitive interests: very great interest (1); great interest (2); medium (3); weak (4); very weak (5).

Scales of different types are used in psychological and pedagogical research, for example,

a) Three-dimensional scale

Very active: 10

Active: 5

Passive: 0

b) Multidimensional scale

Very active: 8

Intermediate: 6

Not too active: 4

Passive: 2

Completely passive: 0

c) Two-sided scale

Very interested: 10

Interested enough: 5

Indifferent: 0

Not interested: 5

No interest at all: 10

Numerical rating scales give each item a specific numerical designation. So, when analyzing the attitude of students to learning, their perseverance in work, willingness to cooperate, etc. you can draw up a numerical scale based on the following indicators: 1 - unsatisfactory; 2 - weak; 3 - medium; 4 is above average, 5 is much above average. In this case, the scale takes the following form (see Table 6.8):

Table 6.8

If the numeric scale is bipolar, the bipolar ordering is used with a zero value in the center:

Discipline Indiscipline

Pronounced 5 4 3 2 1 0 1 2 3 4 5 Not pronounced

Evaluation scales can be plotted graphically. In this case, they express categories in a visual form. Moreover, each division (step) of the scale is characterized verbally.

The methods considered play an important role in analyzing and generalizing the data obtained. They make it possible to establish various relationships and correlations between facts and to identify trends in the development of psychological and pedagogical phenomena. Thus, the theory of groupings in mathematical statistics helps to determine which facts from the collected empirical material are comparable, on what basis they should be grouped, and how reliable they are. All this makes it possible to avoid arbitrary manipulation of facts and to define a program for processing them. Depending on the goals and objectives, three types of grouping are usually used: typological, variational and analytical.

Typological grouping is used when the factual material must be broken down into qualitatively homogeneous units (distribution of the number of disciplinary violations among different categories of students, breakdown of their physical exercise results by year of study, etc.).

Variational grouping is applied when the material must be grouped according to the value of some varying attribute (for example, dividing groups of students by level of academic performance, percentage of completed assignments, similar violations of the established order, etc.); it makes it possible to judge consistently the structure of the phenomenon under study.

Analytical grouping helps to establish the relationship between the studied phenomena (the dependence of students' level of preparation on various teaching methods, of the quality of completed tasks on temperament, abilities, etc.), that is, their interdependence, in exact numerical terms.

The importance of the researcher's work in grouping the collected data is evidenced by the fact that errors in this work devalue the most comprehensive and meaningful information.

Currently, the mathematical foundations of grouping, typology, classification have received the most profound development in sociology. Modern approaches and methods of typology and classification in sociological research can be successfully applied in psychology and pedagogy.

In the course of the study, the techniques of the final generalization of data are used. One of them is the technique of drawing up and studying tables.

When compiling a summary of data with respect to one statistical quantity, a distribution series (variation series) of the value of this quantity is formed. An example of such a series (see Table 6.9) is a summary of data on the chest circumference of 500 persons.

Table 6.9

Summarizing data for two or more statistical quantities simultaneously involves compiling a distribution table, which shows how the values of one statistical quantity are distributed in accordance with the values taken by the other quantities.

As an illustration, table 6.10 is given, compiled on the basis of statistics on chest circumference and weight of these people.

Table 6.10

Chest circumference in cm

The distribution table gives an idea of the relationship between the two quantities: at low weight, the frequencies are located in the upper left quarter of the table, which indicates a predominance of persons with a small chest circumference. As weight increases toward the average, the frequency distribution moves to the center of the table. This indicates that people whose weight is close to the average also have a chest circumference close to the average. With a further increase in weight, the frequencies begin to occupy the lower right quarter of the table. This indicates that people weighing more than average also have an above-average chest circumference.

It follows from the table that the established relationship is not strict (functional) but probabilistic: as the values of one quantity change, the other changes as a trend, without a rigid one-to-one relationship. Such connections and dependencies are often found in psychology and pedagogy. At present, they are usually expressed using correlation and regression analysis.
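How such a distribution table is compiled can be sketched as follows: each pair of measurements is assigned to an interval of the first quantity and an interval of the second, and the joint frequencies are counted. The measurements and interval widths are hypothetical, since the data of the original tables did not survive.

```python
from collections import Counter


def distribution_table(xs, ys, x_step, y_step):
    """Count joint occurrences of two quantities grouped into intervals.

    Keys are (lower bound of the x interval, lower bound of the y interval).
    """
    table = Counter()
    for x, y in zip(xs, ys):
        x_bin = int(x // x_step) * x_step
        y_bin = int(y // y_step) * y_step
        table[(x_bin, y_bin)] += 1
    return table


# Hypothetical measurements: weight in kg and chest circumference in cm.
weights = [52, 58, 61, 67, 73, 79]
chests = [82, 84, 88, 91, 96, 99]
table = distribution_table(weights, chests, x_step=10, y_step=5)
for cell, frequency in sorted(table.items()):
    print(cell, frequency)
```

In this illustrative data the filled cells run from low weight and small chest circumference to high weight and large chest circumference, which is the diagonal pattern described above.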

Variation series and tables give an idea of the statics of a phenomenon, while its dynamics can be shown by time series, in which the first row contains successive stages or time intervals and the second row contains the values of the studied statistical quantity obtained at these stages. This reveals the increase, decrease or periodic changes of the phenomenon, its tendencies and patterns.

Tables can be filled with absolute values or with summary figures (averages, relative values). In addition to tables, the results of statistical work are often depicted graphically in the form of diagrams, figures, etc. The main ways of displaying statistical quantities graphically are the method of points, the method of straight lines and the method of rectangles. They are simple and accessible to every researcher. The technique consists of drawing coordinate axes, establishing a scale, and marking off the segments (points) on the horizontal and vertical axes.

Diagrams depicting the series of distributions of values ​​of one statistical quantity allow plotting distribution curves.

The graphical representation of two (or more) statistical quantities makes it possible to form a curved surface, called a distribution surface. Time series (series of development), when plotted, form development curves.

The graphic representation of statistical material allows you to penetrate deeper into the meaning of digital values, to grasp their interdependencies and features of the phenomenon under study, which are difficult to notice in the table. The researcher is freed from the work that he would have to do in order to deal with the abundance of numbers.

Tables and graphs are important, but only the first steps in the study of statistical quantities. The main method is analytical, operating with mathematical formulas, with the help of which the so-called “generalizing indicators” are derived, that is, the absolute values ​​given in a comparable form (relative and average values, balances and indices). So, with the help of relative values ​​(percent), the qualitative features of the analyzed aggregates are determined (for example, the ratio of excellent students to the total number of students; the number of errors when working on complex equipment, caused by the mental instability of students, to the total number of errors, etc.). That is, the relations are revealed: part to the whole (specific weight), terms to the sum (structure of the aggregate), one part of the aggregate to its other part; characterizing the dynamics of any changes over time, etc.

As you can see, even the most general understanding of the methods of statistical calculus suggests that these methods have great capabilities in the analysis and processing of empirical material. Of course, the mathematical apparatus can dispassionately process everything that the researcher puts into it, both reliable data and subjective conjectures. That is why perfect mastery of the mathematical apparatus for processing the accumulated empirical material in unity with a thorough knowledge of the qualitative characteristics of the phenomenon under study is necessary for every researcher. Only in this case is it possible to select high-quality, objective factual material, its qualified processing and obtain reliable final data.

This is a brief description of the most frequently used methods of studying the problems of psychology and pedagogy. It should be emphasized that none of the methods considered, taken by itself, can claim universality, a complete guarantee of the objectivity of the data obtained. Thus, the elements of subjectivity in the answers obtained by interviewing respondents are obvious. Observation results, as a rule, are not free from the subjective assessments of the researcher himself. Data taken from various documents require at the same time verification of the accuracy of this documentation (especially personal documents, second-hand documents, etc.).

Therefore, each researcher should strive, on the one hand, to improve the technique of applying any particular method, and on the other, to a complex, mutually controlling use of different methods to study the same problem. Possession of the entire system of methods makes it possible to develop a rational research methodology, clearly organize and conduct it, and obtain significant theoretical and practical results.


Odessa National Medical University Department of Biophysics, Informatics and Medical Equipment Methodical instructions for 1st year students on the topic "Fundamentals of Mathematical Statistics" Odessa 2009

1. Topic: “Fundamentals of Mathematical Statistics”.

2. Relevance of the topic.

Mathematical statistics is a branch of mathematics that studies methods of collecting, systematizing and processing the results of observations of mass random events in order to identify existing patterns and apply them in practice. Methods of mathematical statistics are widely used in clinical medicine and healthcare. They are used, in particular, in the development of mathematical methods of medical diagnostics, in the theory of epidemics, in planning and processing the results of a medical experiment, and in the organization of health care. Statistical concepts, knowingly or unknowingly, are used in decision-making in matters such as clinical diagnosis, predicting the course of an individual patient's disease, predicting the likely outcomes of a given program in a given population, and choosing the appropriate program in a given setting. Familiarity with the ideas and methods of mathematical statistics is a necessary element of the professional education of every health worker.

3. Lesson objectives. The general objective of the lesson is to teach students to consciously use mathematical statistics in solving problems of a biomedical profile. Specific objectives:
  1. to acquaint students with the basic ideas, concepts and methods of mathematical statistics, paying attention mainly to issues related to processing the results of observations of mass random events in order to identify existing patterns and apply them in practice;
  2. to teach students to consciously apply the basic concepts of mathematical statistics in solving the simplest problems that arise in the professional activity of a doctor.
The student must know (level 2):
  1. definition of class frequency (absolute and relative)
  2. definition of the general population and the sample, and of the sample size
  3. point and interval estimation
  4. confidence interval and confidence level
  5. definition of the mode, median and sample mean
  6. definition of the range, interquartile range and quartile deviation
  7. definition of the mean absolute deviation
  8. definition of the sample covariance and variance
  9. definition of the sample standard deviation and coefficient of variation
  10. definition of the sample regression coefficients
  11. empirical linear regression equations
  12. definition of the sample correlation coefficient.
The student must acquire basic skills in calculating (level 3):
  1. the mode, median and sample mean
  2. the range, interquartile range and quartile deviation
  3. the mean absolute deviation
  4. the sample covariance and variance
  5. the sample standard deviation and coefficient of variation
  6. the confidence interval for the mathematical expectation and variance
  7. the sample regression coefficients
  8. the sample correlation coefficient.
4. Ways to implement the objectives of the lesson: To implement the objectives of the lesson, you need the following initial knowledge:
  1. Definition of the distribution, distribution series and distribution polygon of a discrete random variable
  2. Definition of the functional dependence between random variables
  3. Definition of the correlation dependence between random variables
You also need to be able to calculate the probabilities of incompatible and joint events using the appropriate rules.
5. The task for students to test their initial level of knowledge. Control questions
  1. Definition of a random event, its relative frequency and probability.
  2. The addition theorem for the probabilities of incompatible events
  3. The addition theorem for the probabilities of joint events
  4. The multiplication theorem for the probabilities of independent events
  5. The multiplication theorem for the probabilities of dependent events
  6. Total probability theorem
  7. Bayes' theorem
  8. Definition of random variables: discrete and continuous
  9. Definition of the distribution, distribution series and distribution polygon of a discrete random variable
  10. Definition of the distribution function
  11. Definition of measures of the position of the distribution center
  12. Definition of measures of variability of the values of a random variable
  13. Definition of the distribution and distribution curve of a continuous random variable
  14. Definition of the functional relationship between random variables
  15. Definition of the correlation between random variables
  16. Definition of regression, the regression equation and regression lines
  17. Definition of the covariance and the correlation coefficient
  18. Definition of the linear regression equation.
6. Information for reinforcing the initial knowledge and skills can be found in the manuals:
  1. Zhumatiy P.G. Lecture "Probability Theory". Odessa, 2009.
  2. Zhumatiy P.G. "Foundations of the theory of probability". Odessa, 2009.
  3. Zhumatiy P.G., Senitska Y.R. Elements of the theory of probability. Methodical instructions for students of a medical institute. Odessa, 1981.
  4. Chaly O.V., Agapov B.T., Tsekhmister Y.V. Medical and biological physics. Kiev, 2004.
7. The content of educational material from this topic, highlighting the main key issues.

Mathematical statistics is a branch of mathematics that studies methods of collecting, organizing, processing, displaying, analyzing and interpreting observation results in order to identify existing patterns.

The application of statistics in health care is necessary both at the community level and at the individual patient level. Medicine deals with individuals who differ from each other in many ways, and the values of the indicators on the basis of which a person can be considered healthy vary from one individual to another. No two patients or groups of patients are absolutely identical; therefore, decisions that relate to individual patients or population groups must be made on the basis of experience gained from other patients or population groups with similar biological characteristics. It is necessary to realize that, given the existing differences, these decisions cannot be absolutely accurate: they are always associated with some uncertainty. This is precisely the intrinsic nature of medicine.

Some examples of the application of statistical methods in medicine:

interpretation of variation (the variability of an organism's characteristics makes it necessary to use appropriate statistical methods when deciding what value of a particular characteristic is ideal, normal, average, etc.).

diagnosis of diseases in individual patients and assessment of the health status of a population group.

predicting the outcome of a disease in individual patients or the possible outcome of a control program for a particular disease in a population group.

selection of a suitable intervention for a patient or a population group.

planning and conducting medical research, analysis and publication of results, their reading and critical assessment.

health planning and management.

Useful medical information is usually hidden in a mass of raw data. It is necessary to concentrate the information contained in them and present the data so that the structure of the variation is clearly visible, and then choose specific methods of analysis.

Presenting data requires familiarity with the following concepts and terms:

variation series (ordered arrangement) - a simple ordering of individual observations of a quantity.

class - one of the intervals into which the entire range of values ​​of a random variable is divided.

class endpoints - the values that bound a class; for example, 2.5 and 3.0 are the lower and upper boundaries of the class 2.5 - 3.0.

(absolute) class frequency - the number of observations in the class.

relative frequency of the class - the absolute frequency of the class, expressed as a fraction of the total number of observations.

cumulative (accumulated) frequency of a class - the number of observations, which is equal to the sum of the frequencies of all previous classes and this class.

bar chart - a graphical representation of data frequencies for nominal classes using bars whose heights are directly proportional to the class frequencies.

pie chart - graphical representation of data frequencies for nominal classes using sectors of a circle, the areas of which are directly proportional to the frequencies of the classes.

histogram is a graphical representation of the frequency distribution of quantitative data by the areas of rectangles directly proportional to the frequencies of the classes.

frequency polygon - graph of the frequency distribution of quantitative data; the point corresponding to the frequency of the class is located above the middle of the interval, each two adjacent points are connected by a straight line segment.

ogive (cumulative curve) is a graph of the distribution of cumulative relative frequencies.
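The frequency terms defined above can be illustrated with a minimal Python sketch. The function name `class_frequencies`, the equal-width classes and the eight data values are assumptions made for illustration only; only the class boundaries 2.5 and 3.0 echo the example in the text.

```python
def class_frequencies(values, lower, width, n_classes):
    """Absolute, relative and cumulative frequencies for equal-width classes."""
    absolute = [0] * n_classes
    for v in values:
        # Index of the class [lower + k*width, lower + (k+1)*width) containing v
        k = int((v - lower) // width)
        if not 0 <= k < n_classes:
            raise ValueError(f"value {v} lies outside the class range")
        absolute[k] += 1

    n = len(values)
    relative = [a / n for a in absolute]      # fraction of all observations

    cumulative, running = [], 0
    for a in absolute:
        running += a                          # this class plus all previous ones
        cumulative.append(running)

    bounds = [(lower + k * width, lower + (k + 1) * width)
              for k in range(n_classes)]
    return bounds, absolute, relative, cumulative


# Hypothetical measurements, chosen so that none falls on a class boundary.
data = [2.6, 2.7, 3.1, 3.4, 3.6, 3.8, 3.9, 4.2]
bounds, absolute, relative, cumulative = class_frequencies(data, 2.5, 0.5, 4)
for (lo, hi), a, r, c in zip(bounds, absolute, relative, cumulative):
    print(f"{lo:.1f} - {hi:.1f}: absolute={a}, relative={r:.3f}, cumulative={c}")
```

The absolute frequencies are what a histogram or frequency polygon would plot, while the cumulative frequencies divided by n give the points of the ogive.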

All medical data have inherent variability; therefore, the analysis of measurement results is based on studying information about the values taken by the random variable under investigation.

The collection of all possible values of a random variable is called the general population.

The part of the general population registered as a result of the tests is called the sample.

The number of observations included in the sample is called the sample size (usually denoted n).

The task of the sampling method is to make a correct estimate of the random variable under study using the information obtained from the sample. Therefore, the main requirement placed on the sample is that it reflect all the features of the general population as fully as possible. A sample that satisfies this requirement is called representative. The validity of an estimate, that is, the degree to which the estimate corresponds to the parameter it characterizes, depends on the representativeness of the sample.

When estimating the parameters of the general population from the sample (parametric estimation), the following concepts are used:

point estimation - estimation of the parameter of the general population in the form of a single value, which it can take with the highest probability.

interval estimation - estimation of the parameter of the general population in the form of an interval of values, which has a given probability of covering its true value.

For interval estimation, the following concepts are used:

confidence interval (reliable interval) - an interval of values that has a given probability of covering the true value of the parameter of the general population.

confidence level (confidence probability, reliability) - the probability with which the confidence interval covers the true value of the parameter of the general population.

confidence bounds - the lower and upper bounds of the confidence interval.

The conclusions obtained by the methods of mathematical statistics are always based on a limited, sampled number of observations; it is therefore natural that the results for another sample may differ. This circumstance determines the relative nature of the conclusions of mathematical statistics and, as a consequence, the widespread use of the theory of probability in the practice of statistical research.

The typical path of statistical research is as follows:

having estimated the values ​​or dependences between them according to observational data, put forward the assumption that the phenomenon being studied can be described by one or another stochastic model

using statistical methods, this assumption can be confirmed or rejected; upon confirmation, the goal is achieved - a model has been found that describes the studied patterns; otherwise, they continue to work, putting forward and testing a new hypothesis.

Definitions of sample statistical estimates:

mode - the value that occurs most often in the sample,

median - the central (median) value of the variation series

range R - the difference between the largest and smallest values ​​in a series of observations

percentile - one of the values in the variation series that divide the distribution into 100 equal parts (so the median is the 50th percentile)

first quartile - 25th percentile

third quartile - 75th percentile

interquartile range - the difference between the third and first quartiles (covers the central 50% of observations)

quartile deviation - half of the interquartile range

sample mean - the arithmetic mean of all sample values (sample estimate of the mathematical expectation)

mean absolute deviation - the sum of the absolute values of the deviations from a chosen reference value (that is, ignoring their sign), divided by the sample size

the mean absolute deviation from the sample mean is calculated by the formula

sample variance (X) - (sample variance estimate) is determined by the formula

sample covariance - (sample estimate of the covariance K (X, Y)) equals

the sample Y-on-X regression coefficient (sample estimate of the Y-on-X regression coefficient) is

the empirical linear regression equation of Y on X has the form

the sample X-on-Y regression coefficient (sample estimate of the X-on-Y regression coefficient) is

the empirical linear regression equation of X on Y has the form

sample standard deviation s (X) - (sample standard deviation estimate) equals the square root of the sample variance

sample correlation coefficient - (sample estimate of the correlation coefficient) is equal to

sample coefficient of variation  - (sample estimate of the coefficient of variation CV) equals

.
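The formulas for the estimates listed above are not reproduced in this text. A sketch of their standard forms is given here, under the assumption that the variance and covariance use the divisor n - 1 (the choice that reproduces the variance quoted in Example 5) and that the mean absolute deviation uses the divisor n, as its definition above states. The symbols \bar{x}, \hat{K}, b_{Y/X}, b_{X/Y} and r are notation assumed for this sketch.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i
  && \text{sample mean} \\
\mathrm{MAD} &= \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
  && \text{mean absolute deviation from the sample mean} \\
s^2(X) &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
  && \text{sample variance} \\
\hat{K}(X,Y) &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  && \text{sample covariance} \\
b_{Y/X} &= \frac{\hat{K}(X,Y)}{s^2(X)}, \qquad y - \bar{y} = b_{Y/X}\,(x - \bar{x})
  && \text{regression of } Y \text{ on } X \\
b_{X/Y} &= \frac{\hat{K}(X,Y)}{s^2(Y)}, \qquad x - \bar{x} = b_{X/Y}\,(y - \bar{y})
  && \text{regression of } X \text{ on } Y \\
s(X) &= \sqrt{s^2(X)}
  && \text{sample standard deviation} \\
r &= \frac{\hat{K}(X,Y)}{s(X)\,s(Y)}
  && \text{sample correlation coefficient} \\
CV &= \frac{s(X)}{\bar{x}} \cdot 100\%
  && \text{sample coefficient of variation}
\end{align*}
```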

8. The task for self-preparation of students. 8.1 The task for self-study of material from the topic.

8.1.1 Practical calculation of sample point estimates

Example 1.

The duration of the disease (in days) in 20 cases of pneumonia was:

10, 11, 6, 16, 7, 13, 15, 8, 9, 10, 11, 13, 7, 8, 13, 15, 16, 13, 14, 15

Determine the mode, median, range, interquartile range, sample mean, mean absolute deviation from the sample mean, sample variance, sample coefficient of variation.

Solution.

The variation series for the sample has the form

6, 7, 7, 8, 8, 9, 10, 10, 11, 11, 13, 13, 13, 13, 14, 15, 15, 15, 16, 16

Mode

The number 13 occurs most often in the sample. Therefore, the mode of the sample is 13.

Median

When a variation series contains an even number of observations, the median is the average of the two central members of the series, in this case 11 and 13, so the median is 12.

Range

The minimum value in the sample is 6 and the maximum value is 16, so R = 10.

Interquartile range, quartile deviation

In the variation series, a quarter of all data are less than or equal to 8, so the first quartile is 8, and 75% of all data are less than or equal to 14, so the third quartile is 14. So, the interquartile range is 6, and the quartile deviation is 3.

Sample mean

The arithmetic mean of all sample values ​​is

.

Mean absolute deviation from the sample mean

.

Sample variance

Sample standard deviation

.

Sample coefficient of variation

.
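The calculations of Example 1 can be checked with a short Python sketch. The quartile rule used here (the value at position ceil(p·n) of the ordered series) is an assumption chosen because it reproduces the quartiles 8 and 14 stated above; other conventions give slightly different values. Direct calculation gives a sample mean of 11.5 and a sample variance of about 10.47; the values 12 and 10.7 quoted later in Example 5 appear to correspond to rounding the mean to 12 before computing the variance.

```python
import math
import statistics

# Duration of the disease (days) in 20 cases of pneumonia, from Example 1
durations = [10, 11, 6, 16, 7, 13, 15, 8, 9, 10,
             11, 13, 7, 8, 13, 15, 16, 13, 14, 15]


def nearest_rank_quantile(sorted_values, p):
    """Value at 1-based position ceil(p * n) of the ordered series (assumed rule)."""
    rank = math.ceil(p * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


series = sorted(durations)                      # variation series
n = len(series)

mode = statistics.mode(series)                  # most frequent value
median = statistics.median(series)              # mean of the two central members
value_range = series[-1] - series[0]            # R = max - min
q1 = nearest_rank_quantile(series, 0.25)        # first quartile
q3 = nearest_rank_quantile(series, 0.75)        # third quartile
iqr = q3 - q1                                   # interquartile range
quartile_deviation = iqr / 2
mean = statistics.mean(series)                  # sample mean
mad = sum(abs(x - mean) for x in series) / n    # mean absolute deviation
variance = statistics.variance(series)          # sample variance, divisor n - 1
std_dev = math.sqrt(variance)                   # sample standard deviation
cv_percent = 100 * std_dev / mean               # coefficient of variation, %

print(f"mode={mode}, median={median}, range={value_range}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, quartile deviation={quartile_deviation}")
print(f"mean={mean}, MAD={mad:.2f}, variance={variance:.2f}, "
      f"s={std_dev:.2f}, CV={cv_percent:.1f}%")
```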

In the next example, we will consider the simplest means of studying the stochastic dependence between two random variables.

Example 2.

When examining a group of patients, data on height H (cm) and circulating blood volume V (l) were obtained:

Find empirical linear regression equations.

Solution.

The first thing to calculate is:

sample mean

sample mean

.

The second thing to calculate is:

sample variance (H)

sample variance (V)

sample covariance

Third, this is the calculation of sample regression coefficients:

sample regression coefficient V on H

sample regression coefficient H on V

.

Fourth, write down the sought equations:

the empirical linear regression equation V on H has the form

the empirical linear regression equation H on V has the form

.
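The four steps of Example 2 can be sketched in Python as follows. The patient data of the example are not reproduced in this text, so the heights and volumes below are hypothetical values used only to make the sketch runnable; the function names and the divisor n - 1 are likewise assumptions (the divisor cancels in the regression coefficients).

```python
def regression_summary(xs, ys):
    """Sample means, variances, covariance and both regression coefficients."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal size with at least 2 points")
    # Step 1: sample means
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Step 2: sample variances and covariance (divisor n - 1 is an assumption)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Step 3: sample regression coefficients
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "var_x": var_x, "var_y": var_y, "cov": cov,
        "b_y_on_x": cov / var_x,   # slope of the regression of Y on X
        "b_x_on_y": cov / var_y,   # slope of the regression of X on Y
    }


def predict_y(summary, x):
    """Step 4: empirical regression of Y on X, y = mean_y + b_y_on_x * (x - mean_x)."""
    return summary["mean_y"] + summary["b_y_on_x"] * (x - summary["mean_x"])


# Hypothetical heights (cm) and blood volumes (l), NOT the data of Example 2.
heights_cm = [160, 165, 170, 175, 180]
volumes_l = [4.2, 4.6, 4.9, 5.3, 5.5]

summary = regression_summary(heights_cm, volumes_l)
print("b(V on H) =", summary["b_y_on_x"])
print("b(H on V) =", summary["b_x_on_y"])
print("predicted V at H = 172 cm:", predict_y(summary, 172))
```

The regression of H on V is obtained the same way by swapping the roles of the two samples.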

Example 3.

Using the conditions and results of example 2, calculate the correlation coefficient and check the reliability of the existence of a correlation between a person's height and the volume of circulating blood with a 95% confidence probability.

Solution.

The correlation coefficient is related to the regression coefficients by a practically useful formula

.

For a sample estimate of the correlation coefficient, this formula has the form

.

Using the values of the sample regression coefficients calculated in example 2, we get

.

The reliability of the correlation between random variables (assuming a normal distribution for each of them) is verified as follows:

  • calculate the value of T

  • find the coefficient in the Student's distribution table

  • the existence of a correlation between random variables is confirmed when the value of T exceeds this coefficient.

Since 3.5 > 2.26, the existence of a correlation between the patient's height and the volume of circulating blood can be considered established with a 95% confidence probability.
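The procedure of Example 3 can be sketched in Python. The formulas are not reproduced in this text, so the sketch assumes the standard relations r = ±√(b_{Y/X}·b_{X/Y}) and T = |r|·√(n − 2)/√(1 − r²); the function names and the input values r = 0.76 and n = 11 are hypothetical. The critical value 2.26 is the table coefficient quoted in the text and is passed in rather than computed.

```python
import math


def correlation_from_regression(b_y_on_x, b_x_on_y):
    """r = sign * sqrt(b_y_on_x * b_x_on_y); both slopes share the covariance's sign."""
    product = b_y_on_x * b_x_on_y
    if product < 0:
        raise ValueError("regression coefficients must have the same sign")
    return math.copysign(math.sqrt(product), b_y_on_x)


def correlation_t_statistic(r, n):
    """T = |r| * sqrt(n - 2) / sqrt(1 - r^2) (standard test, assumed here)."""
    return abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def correlation_is_significant(r, n, t_critical):
    """The correlation is considered established when T exceeds the table value."""
    return correlation_t_statistic(r, n) > t_critical


# Hypothetical inputs; 2.26 is the Student's coefficient quoted in the text.
r, n = 0.76, 11
t_value = correlation_t_statistic(r, n)
print(f"T = {t_value:.2f}, significant: {correlation_is_significant(r, n, 2.26)}")
```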

Interval estimates for mathematical expectation and variance

If the random variable has a normal distribution, then the interval estimates for the mathematical expectation and variance are calculated in the following sequence:

1. find the sample mean;

2. calculate the sample variance and the sample standard deviation s;

3. in the table of the Student's distribution, find the Student's coefficient for the given confidence probability and sample size n;

4. write the confidence interval for the mathematical expectation;

5. in the table of the chi-square distribution, find the coefficients for the given confidence probability and sample size n;

6. write the confidence interval for the variance.

The width of the confidence interval, the confidence probability and the sample size n depend on each other. Indeed, the ratio of the Student's coefficient to the square root of n decreases with increasing n, so, for a fixed width of the confidence interval, the confidence probability increases with increasing n. For a fixed confidence probability, the width of the confidence interval decreases as the sample size increases. When planning medical research, this relationship is used to determine the minimum sample size that will provide the confidence interval width and confidence probability required by the problem being solved.
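The interval formulas themselves are not reproduced in this text. A sketch of the standard forms for a normally distributed variable follows; the symbols μ, σ², γ, t and χ² are notation assumed here, and these forms reproduce the numerical results of Example 5.

```latex
\begin{align*}
d &= t_{\gamma,\,n-1}\,\frac{s}{\sqrt{n}}
  && \text{half-width of the interval for the expectation} \\
\bar{x} - d &< \mu < \bar{x} + d
  && \text{confidence interval for the mathematical expectation} \\
\frac{(n-1)\,s^2}{\chi^2_{(1+\gamma)/2,\;n-1}} &< \sigma^2 <
\frac{(n-1)\,s^2}{\chi^2_{(1-\gamma)/2,\;n-1}}
  && \text{confidence interval for the variance}
\end{align*}
```

Here t_{γ, n−1} is the two-sided Student's coefficient for confidence probability γ with n − 1 degrees of freedom, and χ²_{p, n−1} is the p-quantile of the chi-square distribution with n − 1 degrees of freedom.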

Example 5.

Using the conditions and results of Example 1, find the interval estimates for the mean and variance for a 95% confidence probability.

Solution.

In example 1, point estimates of the mathematical expectation (sample mean = 12), variance (sample variance = 10.7) and standard deviation (sample standard deviation) were calculated. The sample size is n = 20.

From the Student's distribution table, we find the value of the coefficient

further, we calculate the half-width d of the confidence interval

and write down the interval estimate of the mathematical expectation μ

10.5 < μ < 13.5 at a confidence probability of 95%

From the table of Pearson's chi-square distribution we find the coefficients

calculate the lower and upper confidence bounds

and write down the interval estimate for the variance σ² in the form

6.2 < σ² < 23 at a confidence probability of 95%.
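The arithmetic of Example 5 can be reproduced with a minimal Python sketch. The point estimates (12, 10.7, n = 20) are those quoted in the example; the table coefficients are standard values for 19 degrees of freedom at 95% and are not shown in this text, and the function names are assumptions.

```python
import math


def mean_interval(sample_mean, sample_variance, n, t_coeff):
    """Interval for the expectation: mean -/+ t * s / sqrt(n)."""
    half_width = t_coeff * math.sqrt(sample_variance / n)
    return sample_mean - half_width, sample_mean + half_width


def variance_interval(sample_variance, n, chi2_lower, chi2_upper):
    """Interval for the variance: (n-1)s^2/chi2_upper .. (n-1)s^2/chi2_lower."""
    scaled = (n - 1) * sample_variance
    return scaled / chi2_upper, scaled / chi2_lower


# Point estimates as quoted in Example 5
n, sample_mean, sample_variance = 20, 12, 10.7

# Standard table values for 19 degrees of freedom (assumed, not from the text)
T_95 = 2.093          # two-sided Student's coefficient, 95%
CHI2_LOWER = 8.907    # chi-square quantile of order 0.025
CHI2_UPPER = 32.852   # chi-square quantile of order 0.975

mean_low, mean_high = mean_interval(sample_mean, sample_variance, n, T_95)
var_low, var_high = variance_interval(sample_variance, n, CHI2_LOWER, CHI2_UPPER)
print(f"{mean_low:.1f} < expectation < {mean_high:.1f}")
print(f"{var_low:.1f} < variance < {var_high:.0f}")
```

Rounded to the precision used in the example, the sketch gives the intervals 10.5 to 13.5 for the expectation and 6.2 to 23 for the variance.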

8.1.2. Tasks for independent solution

For an independent solution, problems are proposed 5.4 С 1 - 8 (P.G. Zhumatiy. "Mathematical processing of biomedical data. Problems and examples." Odessa, 2009, pp. 24-25)

8.1.3. Control questions
  1. Class frequency (absolute and relative).
  2. General population and sample, sample size.
  3. Point and interval estimation.
  4. Confidence interval and confidence level.
  5. Mode, median and sample mean.
  6. Range, interquartile range, quartile deviation.
  7. Mean absolute deviation.
  8. Sample covariance and variance.
  9. Sample standard deviation and coefficient of variation.
  10. Sample regression coefficients.
  11. Empirical regression equations.
  12. Calculation of the correlation coefficient and the reliability of the correlation.
  13. Construction of interval estimates of normally distributed random variables.
8.2 Main literature
  1. Zhumatiy P.G. “Mathematical processing of biomedical data. Tasks and examples ”. Odessa, 2009.
  2. Zhumatiy P.G. Lecture "Mathematical statistics". Odessa, 2009.
  3. Zhumatiy P.G. "Fundamentals of Mathematical Statistics". Odessa, 2009.
  4. Zhumatiy P.G., Senitska Y.R. Elements of the theory of probability. Methodical instructions for students of a medical institute. Odessa, 1981.
  5. Chaly O.V., Agapov B.T., Tsekhmister Y.V. Medical and biological physics. Kiev, 2004.
8.3 Further reading
  1. Remizov O.M. Medical and biological physics. M., "High School", 1999.
  2. Remizov O.M., Isakova N.Kh., Maksina O.G. Collection of problems in medical and biological physics. M., "High School", 1987.
Methodical instructions compiled by P. G. Zhumatiy.


Posted on http://www.allbest.ru/

Introduction

Mathematical statistics is the science of mathematical methods of systematizing and using statistical data for scientific and practical conclusions. In many of its sections, mathematical statistics is based on the theory of probability, which makes it possible to assess the reliability and accuracy of conclusions drawn on the basis of limited statistical material (for example, to estimate the required sample size to obtain the results of the required accuracy in a sample survey).

Probability theory considers random variables with a given distribution or random experiments whose properties are fully known. The subject of probability theory is the properties and relationships of these quantities (distributions).

But often an experiment is a black box that gives out only some results for which it is required to draw a conclusion about the properties of the experiment itself. The observer has a set of numerical (or they can be made numerical) results obtained by repeating the same random experiment under the same conditions.

In this case, the following question arises, for example: if we observe one random variable, how can we draw the most accurate conclusion about its distribution from the set of its values in several experiments?

An example of such a series of experiments can be a sociological survey, a set of economic indicators, or, finally, a sequence of heads and tails in a thousand tosses of a coin. All of the above factors determine the relevance and significance of the topic of this work at the present stage, which is aimed at a deep and comprehensive study of the basic concepts of mathematical statistics.

1. Subject and method of mathematical statistics

Depending on the mathematical nature of the specific observation results, mathematical statistics is divided into statistics of numbers, multivariate statistical analysis, analysis of functions (processes) and time series, and statistics of objects of non-numeric nature. A substantial part of mathematical statistics is based on probabilistic models. The general tasks of data description, estimation and hypothesis testing are distinguished. More specific tasks are also considered, associated with conducting sample surveys, restoring dependencies, constructing and using classifications (typologies), etc.

To describe data, tables, charts and other visual representations are built, for example, correlation fields. Probabilistic models are usually not used here. Several methods for describing data rely on advanced theory and the capabilities of modern computers. These include, in particular, cluster analysis, aimed at identifying groups of objects that are similar to each other, and multidimensional scaling, which makes it possible to represent objects visually on a plane while distorting the distances between them as little as possible.

Estimation and hypothesis testing methods rely on probabilistic models of data generation. These models are divided into parametric and nonparametric. In parametric models, it is assumed that the objects under study are described by distribution functions depending on a small number (1-4) of numerical parameters. In nonparametric models, distribution functions are assumed to be arbitrary continuous functions. In mathematical statistics, one estimates the parameters and characteristics of a distribution (the mathematical expectation, median, variance, quantiles, etc.), the density and the distribution function, the dependence between variables (based on linear and nonparametric correlation coefficients, as well as parametric or nonparametric estimates of the functions expressing the dependence), etc. Both point and interval (giving bounds for the true values) estimates are used.

In mathematical statistics there is a general theory of hypothesis testing and a large number of methods dedicated to testing specific hypotheses. Hypotheses are considered about the values of parameters and characteristics, about homogeneity (that is, about the coincidence of characteristics or distribution functions in two samples), about the agreement of the empirical distribution function with a given distribution function or with a parametric family of such functions, about the symmetry of the distribution, etc.

Of great importance is the section of mathematical statistics associated with conducting sample surveys, with the properties of various sampling schemes and with constructing adequate methods for estimation and hypothesis testing.

Dependency recovery problems have been actively studied for more than 200 years, since the development of the least squares method by K. Gauss in 1794. Currently, the most relevant are methods for finding an informative subset of variables and nonparametric methods.

The development of methods for approximating data and reducing the dimension of description was started more than 100 years ago when K. Pearson created the method of principal components. Later, factor analysis and numerous non-linear generalizations were developed.

Various methods of constructing (cluster analysis), analysing and using (discriminant analysis) classifications (typologies) are also called pattern recognition methods (with and without a teacher), automatic classification, etc.

Mathematical methods in statistics are based either on the use of sums (relying on the Central Limit Theorem of probability theory) or on indicators of difference (distances, metrics), as in the statistics of objects of non-numeric nature. Usually only asymptotic results are rigorously substantiated. Nowadays computers play a big role in mathematical statistics. They are used both for calculations and for simulation modeling (in particular, in resampling methods and when studying the suitability of asymptotic results).

1.1 Basic concepts of mathematical statistics

An extremely important role in the analysis of many psychological and pedagogical phenomena is played by average values, which are a generalized characteristic of a qualitatively homogeneous population according to a certain quantitative criterion. It is impossible, for example, to calculate the average specialty or the average nationality of university students, since these are qualitatively heterogeneous phenomena. On the other hand, it is possible and necessary to determine the average numerical characteristics of their academic performance (grade point average), of the effectiveness of methodological systems and techniques, etc.

In psychological and pedagogical research, various types of averages are used: the arithmetic mean, geometric mean, median, mode, and others. The most common are the arithmetic mean, median, and mode.

The arithmetic mean is used in cases where there is a directly proportional relationship between the defining property and the given attribute (for example, with an improvement in the performance of a study group, the performance of each of its members improves).

The arithmetic mean is the quotient of the sum of the quantities divided by their number and is calculated by the formula:

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum X}{n} \qquad (1)$$

where $\bar{X}$ is the arithmetic mean;

$X_1, X_2, X_3, \dots, X_n$ are the results of individual observations (techniques, actions);

$n$ is the number of observations (techniques, actions);

$\sum X$ is the sum of the results of all observations (techniques, actions).

The median (Me) is a measure of central position: it is the value of a feature on an ordered scale (arranged in increasing or decreasing order) that corresponds to the middle of the studied population. The median can be determined for ordinal and quantitative characteristics. The position of this value is determined by the formula:

Median place = (n + 1) / 2

For example, suppose the results of a study show that:

5 of the participants in the experiment study with "excellent" marks;

18 people study with "good" marks;

22 people study with "satisfactory" marks;

6 people study with "unsatisfactory" marks.

Since a total of N = 54 people took part in the experiment, the middle of the sample is at position (54 + 1) / 2 = 27.5. Hence it is concluded that more than half of the students study below the mark "good", that is, the median is at least "satisfactory" but less than "good".

The mode (Mo) is the most frequently occurring, typical value of a feature. It corresponds to the class with the maximum frequency. This class is called the modal class, and its value the modal value.

For example, suppose the answers to the questionnaire item "indicate the degree of proficiency in a foreign language" were distributed as follows:

1 - speak fluently - 25

2 - know enough to communicate - 54

3 - know the language, but have difficulty communicating - 253

4 - hardly understand - 173

5 - do not speak - 28

Obviously, the most typical value here is "know the language, but have difficulty communicating", which is the modal one. So the mode is answer 3, chosen by 253 respondents; the number 253 is the frequency of the mode, not the mode itself.
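The questionnaire example can be checked with a few lines of Python. This is only an illustrative sketch: the dictionary name `answers` and the shortened answer labels in the comments are assumptions made for the example, while the counts are the ones listed above.

```python
# Answer counts from the questionnaire example (answer number -> respondents).
answers = {
    1: 25,   # speak fluently
    2: 54,   # know enough to communicate
    3: 253,  # know the language, but have difficulty communicating
    4: 173,  # hardly understand
    5: 28,   # do not speak
}

# The mode is the answer with the largest frequency, not the frequency itself.
modal_answer = max(answers, key=answers.get)
modal_frequency = answers[modal_answer]

print(modal_answer, modal_frequency)
```

Keeping the answer and its frequency in separate variables makes the distinction between the modal value and the modal frequency explicit.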

When using mathematical methods in psychological and pedagogical research, great importance is attached to the calculation of variance and root-mean-square (standard) deviations.

The variance is equal to the mean square of the deviations of the individual values (variants) from the mean. It serves as one of the characteristics of the scatter of the individual values of the studied variable (for example, students' grades) around the mean. The variance is calculated by determining, in turn: the deviation of each value from the mean; the square of each deviation; the sum of the squared deviations; and the mean of the squared deviations.

The variance value is used in various statistical calculations, but is not directly observable. The quantity directly related to the content of the observed variable is the standard deviation.

The standard deviation confirms how typical and representative the arithmetic mean is, and reflects the degree of fluctuation of the numerical values of the features from which the mean is derived. It is equal to the square root of the variance and is determined by the formula:

$$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} \qquad (2)$$

where $\sigma$ is the root-mean-square (standard) deviation, $\bar{X}$ is the arithmetic mean and $N$ is the number of observations. With a small number of observations (actions), fewer than 100, the formula should use "N - 1" instead of "N".
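The calculation steps for the variance and the standard deviation can be sketched in Python as follows. The function names and the list `grades` are hypothetical and serve only as an illustration; the `sample=True` switch corresponds to dividing by N - 1 instead of N.

```python
import math


def variance(values, sample=False):
    """Mean squared deviation of the values from their arithmetic mean.

    With sample=True the sum of squares is divided by N - 1 instead of N.
    """
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor


def standard_deviation(values, sample=False):
    """Square root of the variance."""
    return math.sqrt(variance(values, sample=sample))


grades = [3, 4, 4, 5, 4]  # hypothetical marks of five students
mean_grade = sum(grades) / len(grades)

print(mean_grade)
print(variance(grades), variance(grades, sample=True))
print(standard_deviation(grades), standard_deviation(grades, sample=True))
```

For such a small data set the two divisors give noticeably different results, which is why the choice between N and N - 1 matters mostly for small samples.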

The arithmetic mean and the standard deviation are the main characteristics of the results obtained in a study. They make it possible to generalize data, compare them, and establish the advantages of one psychological and pedagogical system (program) over another.

The root-mean-square (standard) deviation is widely used as a measure of dispersion for various characteristics.

When evaluating the results of the study, it is important to determine the dispersion of a random variable around the mean. This scattering is described using Gauss's law (the law of the normal distribution of the probability of a random variable). The essence of the law is that when measuring a certain feature in a given set of elements, there are always deviations in both directions from the norm due to a variety of uncontrollable reasons, while the larger the deviations, the less often they occur.

Further processing of the data can reveal: the coefficient of variation (stability) of the studied phenomenon, which is the standard deviation expressed as a percentage of the arithmetic mean; a measure of skewness, showing in which direction the majority of deviations lie; a measure of kurtosis (peakedness), which shows the degree to which the values of the random variable are concentrated around the mean; etc. All these statistics help to characterize the phenomena under study more fully.
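These additional statistics can be sketched as follows. The text does not give formulas for skewness and kurtosis, so the sketch assumes the usual moment-based definitions (third and fourth central moments scaled by powers of the variance); the function names and the list `scores` are hypothetical.

```python
import math


def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the arithmetic mean."""
    mean = sum(values) / len(values)
    return 100.0 * math.sqrt(central_moment(values, 2)) / mean


def skewness(values):
    """Assumed definition: third central moment / variance ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Assumed definition: fourth central moment / variance ** 2, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


scores = [2, 3, 3, 4, 4, 4, 5, 5, 9]  # hypothetical data with a long right tail

print(coefficient_of_variation(scores))
print(skewness(scores))
print(excess_kurtosis(scores))
```

A positive skewness indicates that the larger deviations lie above the mean, and a symmetric data set gives a skewness of zero.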

Measures of association between variables. Relationships (dependencies) between two or more variables are called correlation in statistics. Correlation is assessed using the correlation coefficient, which is a measure of the degree and magnitude of this relationship.

There are many correlation coefficients. We consider only some of them, those that account for a linear relationship between variables. Their choice depends on the measurement scales of the variables whose relationship is to be assessed. The Pearson and Spearman coefficients are used most often in psychology and pedagogy.
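The text names the two coefficients without giving formulas, so the following is a minimal sketch under the standard definitions: Pearson's coefficient as the normalized covariance, and Spearman's coefficient as Pearson's coefficient applied to ranks (tied values receive the average of their ranks). The data in `hours` and `scores` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


hours = [1, 2, 3, 4, 5]       # hypothetical: hours of preparation
scores = [1, 4, 9, 16, 25]    # hypothetical: test scores, monotone but not linear

print(pearson(hours, scores))
print(spearman(hours, scores))
```

Because Spearman's coefficient works on ranks, it suits ordinal scales, whereas Pearson's coefficient assumes interval or ratio measurements.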

1.2 Basic concepts of sampling

Let $\xi$ be a random variable observed in a random experiment. It is assumed that the probability space is given (and it will not interest us).

We will assume that, having carried out this experiment $n$ times under the same conditions, we obtained the numbers $X_1, X_2, \dots, X_n$, the values of this random variable in the first, second, etc. experiments. The random variable $\xi$ has a certain distribution $F$ that is partially or completely unknown to us.

Let's take a closer look at the set $X = (X_1, \dots, X_n)$, called a sample.

In a series of experiments already carried out, a sample is a set of numbers. But if this series of experiments is repeated, then instead of this set we will get a new set of numbers. Instead of the number $X_1$, another number appears, one of the values of the random variable $\xi$. That is, $X_1$ (and $X_2$, and $X_3$, etc.) is a variable that can take the same values as the random variable $\xi$, and just as often (with the same probabilities). Therefore, before the experiment $X_1$ is a random variable with the same distribution as $\xi$, and after the experiment it is the number that we observe in this first experiment, i.e. one of the possible values of the random variable $X_1$.

A sample $X = (X_1, \dots, X_n)$ of size $n$ is a set of $n$ independent and identically distributed random variables ("copies of $\xi$") that have the same distribution $F$ as $\xi$.

What does it mean "to draw a conclusion about the distribution from the sample"? A distribution is characterized by a distribution function, a density or a table, or a set of numerical characteristics such as $\mathsf{E}\xi$, $\mathsf{D}\xi$, $\mathsf{E}\xi^k$, etc. Based on the sample, one must be able to build approximations for all these characteristics.

1.3 Sample distribution

Consider the realization of a sample on one elementary outcome $\omega_0$: the set of numbers $X_1 = X_1(\omega_0), \dots, X_n = X_n(\omega_0)$. On a suitable probability space, we introduce a random variable $\xi^*$ taking the values $X_1, \dots, X_n$ with probabilities $1/n$ each (if some of the values coincide, we add the probabilities the corresponding number of times).

The distribution of $\xi^*$ is called the empirical or sample distribution. Let us calculate the mathematical expectation and variance of $\xi^*$ and introduce notation for these quantities:

$$\mathsf{E}\xi^* = \sum_{i=1}^{n} \frac{1}{n} X_i = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}, \qquad \mathsf{D}\xi^* = \sum_{i=1}^{n} \frac{1}{n}\,(X_i - \mathsf{E}\xi^*)^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2 = S^2.$$

In the same way, we calculate the moment of order $k$:

$$\mathsf{E}(\xi^*)^k = \sum_{i=1}^{n} \frac{1}{n} X_i^k = \frac{1}{n}\sum_{i=1}^{n} X_i^k = \overline{X^k}.$$

In the general case, we denote by $\overline{g(X)}$ the quantity

$$\mathsf{E}\,g(\xi^*) = \frac{1}{n}\sum_{i=1}^{n} g(X_i) = \overline{g(X)}.$$

If, when constructing all the characteristics introduced above, we consider the sample $X_1, \dots, X_n$ to be a set of random variables, then these characteristics themselves, $\bar{X}$, $S^2$, $\overline{X^k}$, $\overline{g(X)}$, become random variables. These characteristics of the sample distribution are used to estimate (approximate) the corresponding unknown characteristics of the true distribution.

The reason for using the characteristics of the distribution of $\xi^*$ to estimate the characteristics of the true distribution of the observed random variable $\xi$ (or $X_1$) is the closeness of these distributions for large $n$.

Consider, for example, $n$ tosses of a fair die. Let $X_i \in \{1, \dots, 6\}$ be the number of points that came up on the $i$-th toss, $i = 1, \dots, n$. Suppose that a one occurs in the sample $n_1$ times, a two $n_2$ times, etc. Then the random variable $\xi^*$ takes the values $1, \dots, 6$ with probabilities $n_1/n, \dots, n_6/n$ respectively. But these proportions approach $1/6$ as $n$ grows, by the law of large numbers. That is, the distribution of $\xi^*$ in a certain sense approaches the true distribution of the number of points that come up when a fair die is tossed.
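The die example can be illustrated with a small simulation. This is only a sketch: the function name, the fixed seed and the chosen sample sizes are assumptions made for the illustration, and the printed gaps depend on the pseudo-random generator.

```python
import random
from collections import Counter


def empirical_proportions(n, seed=0):
    """Toss a fair die n times; return the share of each face in the sample."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    counts = Counter(rng.randint(1, 6) for _ in range(n))
    return {face: counts[face] / n for face in range(1, 7)}


for n in (60, 6000, 60000):
    proportions = empirical_proportions(n)
    # Largest distance between an observed share and the true probability 1/6.
    worst_gap = max(abs(p - 1 / 6) for p in proportions.values())
    print(n, round(worst_gap, 4))
```

As the number of tosses grows, the largest gap between an observed share and 1/6 tends to shrink, which is the closeness of the sample and true distributions described above.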

1.4 Empirical distribution function, histogram

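Since the unknown distribution $F$ can be described, for example, by its distribution function $F(y) = \mathsf{P}(X_1 < y)$, we construct an "estimate" of this function from the sample.

Definition 1. The empirical distribution function constructed from a sample $X = (X_1, \dots, X_n)$ of size $n$ is the random function $F_n^*$ that for each $y \in \mathbb{R}$ is equal to

$$F_n^*(y) = \frac{\text{number of } X_i \in (-\infty, y)}{n} = \frac{1}{n}\sum_{i=1}^{n} I(X_i < y).$$

Reminder: the random function

$$I(X_i < y) = \begin{cases} 1, & \text{if } X_i < y, \\ 0, & \text{otherwise} \end{cases}$$

is called the indicator of the event $\{X_i < y\}$. For each $y$, it is a random variable with the Bernoulli distribution with parameter $p = \mathsf{P}(X_i < y) = F(y)$.

In other words, for any $y$ the value $F(y)$, equal to the true probability that the random variable $X_1$ is less than $y$, is estimated by the proportion of sample elements that are less than $y$.

If the elements of the sample $X_1, \dots, X_n$ are sorted in ascending order (at each elementary outcome), we get a new set of random variables called the variational series:

$$X_{(1)} \le X_{(2)} \le \dots \le X_{(n-1)} \le X_{(n)}.$$

The element $X_{(k)}$, $k = 1, \dots, n$, is called the $k$-th member of the variational series or the $k$-th order statistic.

The empirical distribution function has jumps at the sample points; the size of the jump at the point $X_i$ is equal to $m/n$, where $m$ is the number of sample elements that coincide with $X_i$.

The empirical distribution function can be built from the variational series:

$$F_n^*(y) = \begin{cases} 0, & y \le X_{(1)}, \\ k/n, & X_{(k)} < y \le X_{(k+1)}, \\ 1, & y > X_{(n)}. \end{cases}$$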
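A minimal Python sketch of the empirical distribution function is given below. It assumes the convention that the function at a point is the share of sample elements strictly less than that point; the function name `empirical_cdf` and the sample are hypothetical.

```python
from bisect import bisect_left


def empirical_cdf(sample):
    """Return a function y -> (number of sample elements strictly below y) / n."""
    ordered = sorted(sample)  # the variational series
    n = len(ordered)

    def cdf(y):
        # bisect_left gives the count of elements strictly smaller than y.
        return bisect_left(ordered, y) / n

    return cdf


sample = [3, 1, 4, 1, 5]  # hypothetical observations
F = empirical_cdf(sample)

for y in (1, 1.5, 4, 6):
    print(y, F(y))
```

The value 1 occurs twice among the five observations, so the function jumps by 2/5 immediately to the right of that point, in line with the rule for jump sizes.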

Another characteristic of a distribution is a table (for discrete distributions) or a density (for absolutely continuous ones). The empirical, or sample, analogue of a table or density is the so-called histogram. The histogram is built from grouped data. The assumed range of values of the random variable (or the range of the sample data) is divided, independently of the sample, into a number of intervals (not necessarily of equal length). Let $A_1, \dots, A_k$ be intervals on the line, called grouping intervals. For $j = 1, \dots, k$, denote by $\nu_j$ the number of sample elements that fall into the interval $A_j$:

$$\nu_j = \{\text{number of } X_i \in A_j\} = \sum_{i=1}^{n} I(X_i \in A_j), \qquad \sum_{j=1}^{k} \nu_j = n.$$

On each of the intervals $A_j$ a rectangle is built whose area is proportional to $\nu_j$. The total area of all rectangles must be equal to one. Let $l_j$ be the length of the interval $A_j$. The height $f_j$ of the rectangle above $A_j$ is

$$f_j = \frac{\nu_j}{n\, l_j}.$$

The resulting figure is called a histogram.

For example, divide the range of the sample into 4 equal segments. Suppose 4 sample elements fall into the first segment, 6 into the second, 3 into the third, and 2 into the fourth. We build a histogram (Fig. 2). Fig. 3 shows a histogram for the same sample, but with the range divided into 5 equal segments.
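The construction of the rectangles can be sketched in Python. The counts 4, 6, 3 and 2 are those of the example; the segment length of 2.5 is a hypothetical value chosen only to make the sketch runnable, and the function name is an assumption.

```python
def histogram_heights(counts, lengths):
    """Height of each rectangle: count / (sample size * interval length)."""
    n = sum(counts)
    return [count / (n * length) for count, length in zip(counts, lengths)]


counts = [4, 6, 3, 2]   # sample elements per segment, as in the example
lengths = [2.5] * 4     # hypothetical: four equal segments of length 2.5

heights = histogram_heights(counts, lengths)
total_area = sum(h * l for h, l in zip(heights, lengths))

print(heights)
print(total_area)
```

Whatever segment lengths are chosen, dividing each count by the sample size and the segment length makes the total area of the rectangles equal to one.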

The Econometrics course states that the best number of grouping intervals ("Sturges' formula") is

$$k = k(n) = 1 + [3.322 \lg n].$$

Here $\lg n$ is the decimal logarithm, so, up to rounding,

$$k = 1 + \log_2 10 \cdot \log_{10} n = 1 + \log_2 n,$$

i.e. when the sample size is doubled, the number of grouping intervals increases by 1. Note that the more grouping intervals, the better. But if we take the number of intervals to be, say, of order $n$, then as $n$ grows the histogram will not approach the density.

The following statement is true:

If the distribution density of the sample elements is a continuous function, then as $k(n) \to \infty$ in such a way that $k(n)/n \to 0$, the histogram converges pointwise in probability to the density.

So the choice of the logarithm is reasonable, but not the only possible one.
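The effect of doubling the sample can be checked numerically. The sketch below assumes the commonly quoted form of Sturges' formula, k = 1 + [3.322 lg n], with the square brackets read as the integer part; the function name is hypothetical.

```python
import math


def sturges_intervals(n):
    """Number of grouping intervals: 1 + integer part of 3.322 * log10(n)."""
    return 1 + math.floor(3.322 * math.log10(n))


for n in (100, 200, 400, 800):
    print(n, sturges_intervals(n))
```

Each doubling of the sample size in this run adds one grouping interval, which reflects the fact that 3.322 is approximately the base-2 logarithm of 10.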

RANDOM VARIABLES AND THEIR DISTRIBUTION LAWS.

A random variable is a quantity that takes on values depending on a combination of random circumstances. A distinction is made between discrete and continuous random variables.

A quantity is called discrete if it takes a countable set of values. (Example: the number of patients at a doctor's appointment, the number of letters on a page, the number of molecules in a given volume.)

A quantity is called continuous if it can take any value within a certain interval. (Example: air temperature, body weight, human height, etc.)

The distribution law of a random variable is the set of possible values of this quantity together with the probabilities (or frequencies of occurrence) corresponding to these values.

Example:

x    x_1   x_2   x_3   x_4   ...   x_n
p    p_1   p_2   p_3   p_4   ...   p_n

x    x_1   x_2   x_3   x_4   ...   x_n
m    m_1   m_2   m_3   m_4   ...   m_n

NUMERICAL CHARACTERISTICS OF RANDOM VARIABLES.

In many cases, along with the distribution of a random variable or instead of it, information about the variable can be provided by numerical parameters called the numerical characteristics of the random variable. The most common ones are:

1. Expected value (mean value) of a random variable is the sum of the products of all its possible values and the probabilities of these values:

$$M(X) = \sum_{i=1}^{n} x_i p_i$$

2. Variance (dispersion) of a random variable:

$$D(X) = M\left[(X - M(X))^2\right] = \sum_{i=1}^{n} (x_i - M(X))^2 p_i$$

3. Standard (mean square) deviation:

$$\sigma = \sqrt{D(X)}$$
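The three characteristics can be computed directly from a distribution table. The sketch below uses the standard definitions for a discrete random variable; the fair die is a hypothetical example, and the function names are assumptions.

```python
import math


def expectation(values, probabilities):
    """Sum of the products of the possible values and their probabilities."""
    return sum(x * p for x, p in zip(values, probabilities))


def variance(values, probabilities):
    """Expected squared deviation from the expected value."""
    m = expectation(values, probabilities)
    return sum((x - m) ** 2 * p for x, p in zip(values, probabilities))


faces = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6  # hypothetical example: a fair die

m = expectation(faces, probs)
d = variance(faces, probs)
sigma = math.sqrt(d)

print(m, d, sigma)
```

The standard deviation is expressed in the same units as the variable itself, which makes it easier to interpret than the variance.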

The "three sigma" rule: if a random variable is distributed according to the normal law, then the absolute deviation of this variable from its mean value practically never exceeds three standard deviations (this holds with a probability of about 0.997).
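The rule can be checked numerically with Python's standard library. This is a minimal sketch: the function name `probability_within` is hypothetical, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist


def probability_within(k, mean=0.0, sigma=1.0):
    """Probability that a normal variable lies within k sigma of its mean."""
    dist = NormalDist(mu=mean, sigma=sigma)
    return dist.cdf(mean + k * sigma) - dist.cdf(mean - k * sigma)


for k in (1, 2, 3):
    print(k, round(probability_within(k), 4))
```

The probability does not depend on the particular mean and standard deviation, only on the number of standard deviations taken.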

GAUSS LAW - NORMAL DISTRIBUTION LAW

Quantities distributed according to the normal law (Gauss's law) are encountered often. Its main feature is that it is a limiting law, which other distribution laws approach.

A random variable is distributed according to the normal law if its probability density has the form:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - M(X))^2}{2\sigma^2}}$$

where $M(X)$ is the mathematical expectation of the random variable;

$\sigma$ is the standard deviation.

The probability density (density function) shows how the probability of a random variable falling into an interval $dx$ changes depending on the value of the variable itself:

$$f(x) = \frac{dP}{dx}$$

BASIC CONCEPTS OF MATHEMATICAL STATISTICS

Mathematical statistics is a branch of applied mathematics directly adjacent to probability theory. The main difference between mathematical statistics and probability theory is that mathematical statistics considers not operations on distribution laws and numerical characteristics of random variables, but approximate methods for finding these laws and numerical characteristics from the results of experiments.

The basic concepts of mathematical statistics are:

1. general population;

2. sample;

3. variational series;

4. mode;

5. median;

6. percentile;

7. frequency polygon;

8. histogram.

General population: a large statistical population from which some of the objects are selected for research.

(Example: the entire population of a region, the university students of a given city, etc.)

Sample (sample population): a set of objects selected from the general population.

Variational series: a statistical distribution consisting of variants (values of a random variable) and their corresponding frequencies.

Example:

X, kg
m

x is the value of the random variable (body mass of girls aged 10 years);

m is the frequency of occurrence.

Mode: the value of a random variable that corresponds to the highest frequency of occurrence. (In the example above, the mode is 24 kg; this value occurs more often than the others: m = 20.)

Median: the value of a random variable that divides the distribution in half: half of the values are located to the right of the median, and half (no more) to the left.

Example:

1, 1, 1, 1, 1, 1, 2, 2, 2, [3], 3, 4, 4, 5, 5, 5, 5, 6, 6, [7], 7, 7, 7, 7, 7, 8, 8, 8, 8, [8], 8, 9, 9, 9, 10, 10, 10, 10, 10, 10

In the example, we observe 40 values of a random variable, arranged in ascending order. The 10th, 20th and 30th values are marked with square brackets. It can be seen that 20 of the 40 values (half) are located to the right of the marked value 7. Therefore, 7 is the median.

To characterize the scatter, we find the values below which 25% and 75% of the measurement results lie. These values are called the 25th and 75th percentiles. If the median divides the distribution in half, then the 25th and 75th percentiles each cut off a quarter of it. (The median itself, by the way, can be considered the 50th percentile.) As can be seen from the example, the 25th and 75th percentiles are 3 and 8, respectively.
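The median and the percentiles of the 40 values in the example can be checked with Python's `statistics` module. This is only a sketch: `statistics.quantiles` (Python 3.8 or later) interpolates between neighbouring values, which is one of several conventions for percentiles and is an assumption here, although for these data it agrees with the values read off above.

```python
import statistics

# The 40 ordered values from the example.
values = [
    1, 1, 1, 1, 1, 1, 2, 2, 2, 3,
    3, 4, 4, 5, 5, 5, 5, 6, 6, 7,
    7, 7, 7, 7, 7, 8, 8, 8, 8, 8,
    8, 9, 9, 9, 10, 10, 10, 10, 10, 10,
]

median = statistics.median(values)

# n=4 splits the data into quarters: 25th, 50th and 75th percentiles.
q25, q50, q75 = statistics.quantiles(values, n=4)

print(median, q25, q50, q75)
```

The middle cut point returned by `quantiles` coincides with the median, in line with the remark that the median is the 50th percentile.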

Both discrete (point) and continuous (interval) statistical distributions are used.

For clarity, statistical distributions are depicted graphically as a frequency polygon or as a histogram.

Frequency polygon: a polyline whose segments connect the points with coordinates $(x_1, m_1), (x_2, m_2), \dots$, or, for the polygon of relative frequencies, the points with coordinates $(x_1, p^*_1), (x_2, p^*_2), \dots$ (Fig. 1).


Fig. 1. Frequency polygon (vertical axis: $m$ or $m_i/n$). Fig. 2. Histogram (vertical axis: $f(x)$).

Frequency histogram: a set of adjacent rectangles built on one straight line (Fig. 2). The bases of the rectangles are all equal to $dx$, and the heights are equal to the ratio of the frequency $m$ to $dx$, or of the relative frequency $p^*$ to $dx$ (in the latter case the height is the probability density).

Example:

x, kg 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4
m

Frequency polygon

The ratio of the relative frequency to the width of the interval is called the probability density: $f(x) = \dfrac{m_i}{n\,dx} = \dfrac{p^*_i}{dx}$.

An example of plotting a histogram.

Let's use the data from the previous example.

1. Calculate the number of class intervals from the number of observations $n$. In our case $n = 100$.

2. Calculate the interval width $dx$; in the interval series below, $dx = 0.2$ kg.

3. Draw up the interval series, where each density value is $f(x) = m / (n\,dx)$:

x, kg   2.7-2.9  2.9-3.1  3.1-3.3  3.3-3.5  3.5-3.7  3.7-3.9  3.9-4.1  4.1-4.3  4.3-4.5
m       6        15       25       17       11       12       8        5        1
f(x)    0.3      0.75     1.25     0.85     0.55     0.6      0.4      0.25     0.05

Histogram
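The interval series can be cross-checked numerically. The sketch below assumes n = 100 observations and an interval width of 0.2 kg, as in the example; it inverts the relation f(x) = m / (n dx) to obtain the class frequencies from the tabulated density values, then confirms that the frequencies add up to n and that the area of the histogram is 1. The variable names are hypothetical.

```python
n = 100    # number of observations
dx = 0.2   # interval width, kg

# Density values f(x) for the nine intervals from 2.7-2.9 to 4.3-4.5 kg.
density = [0.3, 0.75, 1.25, 0.85, 0.55, 0.6, 0.4, 0.25, 0.05]

# Invert f(x) = m / (n * dx) to obtain the class frequencies m.
frequencies = [round(f * n * dx) for f in density]

# Recompute the density from the frequencies as a consistency check.
recomputed = [m / (n * dx) for m in frequencies]

# Total area of the rectangles: sum of height * width.
area = sum(f * dx for f in density)

print(frequencies)
print(sum(frequencies), area)
```

The rounding step only removes floating-point noise; the products of density, sample size and interval width are whole numbers for these data.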