January 2024 (Original ≽)



Question: What does the word "stochastic" mean?


Answer: A stochastic phenomenon has a random distribution of probabilities and or a pattern that can be analyzed statistically, but not precisely predicted.

Stochastically, a random process is a collection of random variables indexed by a particular set. Each of the random variables of such a process is uniquely bound by the elements of that index set.

The statistical probability is the number P(ω) = M/N, i.e. of favorable outcomes divided by the number of all attempts to repeat the same event.

For example, I roll a fair die N = 100 times (in simulation) and get M = 15 "sixes". It has statistical probability P = 15/100. Repeating the same series, I got P = 17/100 and again repeating for the third time P = 11/100. The next seven tries of the same gave me statistic probabilities of 26/100, 15/100, 18/100, 17/100, 13/100, 13/100 and 15/100. The theory determines the conditions when Pp, and practice confirms this without exception.

Mathematical probability, or shortly the probability, is a number p ∈ [0, 1] around which the statistical probabilities are grouped. In other words, Pp, when the number of repetitions of (the same) experiment increases indefinitely. By rolling a fair die, the probability of a "six" would be p = 1/6. By throwing an unfair die, each of the six numbers would have some probabilities p1, ..., p6 sum to one. All such outcomes ω1, ω2, ..., of which exactly one occurs, form a discrete set Ω of sample space which we call the probability distribution.

There are also distributions of continuous events, ω ∈ Ω, probability density ρ(ω) ≥ 0, whose integral over the set Ω is one, if exactly one outcome of that set occurs. You will find interesting details about probabilities in my elementary script (Mathematics IV, 6. Probability). There is also the next task with a more detailed solution (Example 6.2.10).

1. Example. The two agree to meet between 1 and 2 o'clock and to wait at the agreed place for 20 minutes. What is the probability of an encounter?

Solution: In the above image on the left, the arrival times of the two are x and y, in min after the start of the arrival times. The total arrival period is 60 × 60 minutes, and the period of the meeting is |x - y| ≤ 20 min, that is x - 20 ≤ yx + 20. These are the two lines that separate the shaded part of the surface 3600 - 40 × 40. The quotient of them, the shaded part and the whole square, is 2000/3600 = 0.56. This is the probability. □

Conditional probability is the probability of outcomes occurring under conditions that limit the options. The next example is a variation of coin toss.

2. Example. The father has two children. One is female. What is the probability that the second child is also female?

Solution: Out of four {FF, FM, MF, MM} equal possibilities, only three have "F", and only one of them is also "F". Therefore, the conditional probability of 1/3 is required. □

The probability that ω1 will happen when ω2 is known to have happened is defined by

P21) = P2ω1)/P1).


P1)P21) = P2)P12).

In the given example, ω1 is the event "at least one is F" and ω1ω2 is "both are F" , probability 3/4 and 1/4 respectively, so P21) = (1/4) /(3/4) = 1/3.

Analogously, with two-coin tosses, we have four equal outcomes {TT, TH, HT, HH}, which is easy to check by repeatedly tossing two coins many times. Namely, if TH (tail-head) and HT were the same outcome, then the given set has only three equals, so for example TT (tail-tail) would be in about a third of all throws. But it happens in about a quarter of such attempts.

Information is like a plot master in combinatorial examples. When we have n = 1, 2, 3, ... the position for the digits of the base b = 2, 3, 4, ... of the number writing system, then we can write N = bn of various values. With two positions (n = 2) in the binary system (b = 2) it is possible to write a total of N = 2² = 4 different numbers. In three positions (n = 2) we place a total of N = 2³ = 8 numbers. For one of the N equal possibilities to occur, the probability is p = 1/N, and the information I = log2 N = -log2 p bits. I wrote about the logarithm separately here.


Question: Explain that "conditional probability" to me?


Answer: The Conditional probability for the event A relative to the event B is the probability that A can happen if event B is met. This is written P(A|B) and is worth

\[ P(A|B) = \frac{P(A\cap B)}{P(B)}. \]

Otherwise, the intersection of sets (AB) can be written as a product (AB), a union (AB) as a sum (A + B), and the probability below is denoted by Pr.

1. In the picture on the right is a drum (Ω) with four types of balls A, B1, B 2 and B3. All 30 balls A are white, as well as all 10 balls B1 and 12 balls B2, while 4 of B2 are black. All 10 balls B3 are also black, or the remaining 34 balls, say type C. Each of the balls has an equal chance of being drawn from the drum. The probability that the drawn ball is white or black is:

Pr(white) = Pr(A) + Pr(B1) + Pr(B2|white) = 0.30 + 0.10 + 0.12 = 0.52
Pr(black) = Pr(B3|black) + Pr(B3) + Pr(C) = 0.04 + 0.10 + 0.34 = 0.48

Being a white or a black ball of this drum are disjoint (have no common) events and form a complete set (sample space Ω), so that

Pr(white + black) = Pr(white) + Pr(black) = 1.

In the same picture we also find the following conditional probabilities:

Pr(A|B1) = 1,   Pr(A|B2) = 0.12/(0.12 + 0.04) = 0.75,   Pr(A|B3) = 0.

The way to understand conditional probability is, in addition to the links provided, to carefully analyze the examples I provide. Notice how from BA follows AB = A and AB = B.

2. Let B1, B2, ..., Bn disjoint sets (without common elements), so we have:

  1. A = B1 + B2 + ... + Bn ,
  2. Pr(A) = Pr(B1)Pr(A|B1) + Pr(B2)Pr(A|B2) + ... + Pr(Bn)Pr(A|Bn),
  3. Pr(Bk|A) = Pr(Bk)Pr(A|Bk)/Pr(A),   k = 1, 2, ..., n.

Then (a) the sets Bk are called decomposition of the set A, the following (b) is the formula of the total probability, the third expression (c) is the Bayes formula. Here are some typical examples with these expressions.

3. Example. n = 2, 3, 4, ... samples are exposed for review. Among which m = 0, 1, ..., n are defective. Two are chosen at random without return. If B1 and B2 are the events that the first and second selected products are defective, find the probabilities, and probability of B2 provided that B1 happened. Calculate those numbers for n = 10 and m = 4.

Solution: Clearly Pr(B1) = Pr(B2) = m/n, because a no-return choice will not change the probability of a defective product. Namely, the first one will be incorrect B1 or correct B'1, so the probability of choosing the second incorrect one is:

\[ \Pr(B_2) = \Pr(B'_1)\Pr(B_2|B'_1) + \Pr(B_1)\Pr(B_2|B_1) = \] \[ = \frac{n-m}{n}\frac{m}{n-1} + \frac{m}{n}\frac{m-1}{n-1} = \frac{m(n-1)}{n(n-1)} = \frac{m}{n}. \]

The outcome numbers of the given events are |B1B2| = m(m - 1) and |B1| = m(n - 1), so it is:

\[ \Pr(B_2 | B_1) = \frac{|B_1 B_2|}{|B_1|} = \frac{m(m - 1)}{m(n - 1)} = \frac{m - 1}{n - 1} \ne \Pr(B_1). \]

For n = 10 and m = 4, it will be Pr(B1) = Pr(B 2) = 4/10 and Pr(B2|B1) = 1/3. □

4. Example. The box with equal balls has two compartments, so it has three chambers A, B and C. A quarter of the balls are each in the first and third of the chambers, and the rest are in the second. It is known that 20% of the balls of the first chamber are white and the rest are black, that 30% of the balls of the second chamber are white and the rest are black, and that 40% of the balls of the third chamber are white and the rest are black. What is the chance that a ball drawn at random will be white (X), be black (Y)?

Solution: According to the total probability formula (2.b) we find:

Pr(X) = Pr(A)Pr(X|A) + Pr(B)Pr(X|B) + Pr(C)Pr(X|C) =
= 0.25⋅0.20 + 0.50⋅0.30 + 0.25⋅0.40 = 0.30

Pr(Y) = Pr(A)Pr(Y|A) + Pr(B)Pr(Y|B) + Pr(C)Pr(Y|C) =
= 0.25⋅0.80 + 0.50⋅0.70 + 0.25⋅0.60 = 0.70

The chance is 3:7 that the drawn ball will be white and not black. □

5. Example. When the question in the previous, 4th example is reversed, a task typical for solving with the Bayesian formula is obtained. Say, when a white (X) ball is randomly drawn from the box, what is the probability that it was in the first (A), the second (B) and the third (C) chamber?

Answer: According to formula (2.c) of Bayes, we find:

Pr(A|X) = Pr(A)Pr(X|A)/Pr(X) = 0.25⋅0.20/0.30 = 1/6

Pr(B|X) = Pr(B)Pr(X|B)/Pr(X) = 0.50⋅0.30/0.30 = 1/2

Pr(C|X) = Pr(C)Pr(X|C)/Pr(X) = 0.25⋅0.40/0.30 = 1/3

The sum of these three is 1, because the pellet must be from one of the three chambers. □

Without falling and repetition there is no progress (without uncertainty there is no vitality), stick to it by reading these examples, and if you understood them — that's great, let's move on. Not all conditional probability tasks are intuitively easy to understand. You can already feel that in the 2nd example, and here is an even stronger one.

6. Example (Monty Hall). Let's imagine that you are in the game, and you are given a choice of three doors: Behind one door is a car; behind others, goats. You choose a door, let's say No. 1, and the host, who knows what's behind the door, opens another door, let's say No. 3; behind which there is a goat. Then he tells you: "Do you want to choose door No. 2?” Is it to your advantage to change the choice?

Explanation: In the first attempt, the contestant chooses the door that reveals the goat, not the car, so that the host opens the door that the contestant did not choose and it, too, reveals the goat. Next, the host offers the contestant the option to choose between the originally selected door and another closed door.

On the first try, the chance of winning A is 1 : 3, on the second it is 1 : 2. The uncertainty of the first try has decreased, partly wasted by the first choice not being a hit. Missing A' in the first and second picks has probabilities 2/3 and 1/2, from which it is clear that the contestant should pick the door again. □

The answer sounds paradoxical, and to a smart person it is instructive, because intuition can be wrong in more complex mathematical problems. That's why mathematicians consider the expression "obviously" one of the most slippery, while the statements of many theorems turn out to be so unintuitive. Be that as it may, during the millennium (close to three thousand years) we have no trace of the inaccuracy of mathematical statements (physics and other natural sciences have existed for less than four centuries) even though we constantly question them.

A similar example to the previous one (1975) was the paradox "Three Prisoners" (1959). An even older one would be formulated for high school math on conditional probabilities, as follows. In the first and second attempts, the gain is A1 and A2 and the loss is A'1 and A'2. So:

\[ \begin{matrix} \Pr(A_1) = \frac13 & \Pr(A_2|A'_1) = \frac12 \\ \Pr(A'_1) = \frac23 & \Pr(A'_2|A'_1) = \frac12 \end{matrix} \]

In the first, one of three options is chosen, then in the second, one of two. In the simulation, when the selection in the second step is turned off, fewer hits will be obtained, than with re-selection in the second step, then one of the two options.

Today, it would be easy for anyone to check it on computers, and it was done half a century ago with the same result. However, this exact solution was at one time disputed by the vast majority of viewers and a worrying majority of "experts" in probability theory. I mention this only as an example of the unreliability of pure intuition in mathematics.


Question: Can you send me the "Assessment" module?


Answer: No way! Those parts of "Terminator" (program package) due to invincibility in the competition against today's known strategies, I do not give to anyone. But, I can explain to you an also successful algorithm.

By the way, a good advice would be to provide your artificial intelligence with various solutions to (almost) the same tasks, with many of them, and with access to large live databases. The first is for the case without which there is no creativity, and the second is for "machine learning". The method I am going to describe is not exactly what you asked me to do, but you will see, with some knowledge of mathematics, you will get more than expected.

So, we imagine that we work in some big hospital, or have its data, with a huge indexed list of symptoms B1, B2, ..., Bm with also edited huge list of diseases A1, A2, ..., An. In addition, we have estimates of conditional probabilities Pr(Ai|Bj) = pij which may (but may not) be constantly updated as an external database to the program.

The idea is to have a list of all possible m ∈ ℕ symptoms and every possible n ∈ ℕ disease with those symptoms, but thanks to the Borel–Cantelli lemma it is not necessary. In the case of large lists of possibilities, only a small part of them will be relevant (depending on the need).

Thus, when we have a patient with a small list of symptoms strung into a vector x = (x1, x 2, ..., xm), all others are filled with zeros. Scale those numbers (xj) to be proportional to the severity of symptoms in an individual patient. We form the matrix equation y = Mx, that is

\[ \begin{pmatrix} y_1 \\ y_2 \\ ... \\ y_n \end{pmatrix} = \begin{pmatrix} p_{11} & p_{12} & ... & p_{1m} \\ p_{21} & p_{22} & ... & p_{2m} \\ ... & ... & ... & ... \\ p_{n1} & p_{n2} & ... & p_{nm} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ ... \\ x_m \end{pmatrix} \]

a system of linear equations that computers can easily cope with even in the case of very long sequences. The module should be made so that the numbers m and n can be changed in the database, so that the database can be updated without changes to the module itself.

For each j-th column of this matrix:

p1j + p2j + ... + pnj = Pr(A1|Bj) + Pr(A2|Bj) + ... + Pr(An|Bj) = 1,

when we have listed (almost) all diseases (Ai) for all types of symptoms (Bj). Each of the columns of the medical matrix M is (approximately) some probability distribution, if it is (approximately) complete, total.

Output vector, array y = (y1, y2, .. ., yn), then is sorted into a sequence of decreasing values z = (z1, z2, ..., zn) among which the first terms are the best disease assessment that the patient might have based on the input symptoms (x) and medical knowledge (M) of the given institution. Anything further is pure algebra.

Let's note that this program for diagnosing diseases can be easily (by changing the databases) redirected to meteorological forecasts, or to evaluations of the positions of games to win, in general in many things — just as calculus finds its applications everywhere. The power of mathematics is that great accuracy leads to incredibly great usability. And that would be the farthest place I want to reveal to you in response to my above question.


Question: That "medical" matrix was not square?


Answer: No, that matrix (M) is not square, if the number of symptoms (m) is not equal to the number of diseases (n). When its row and column numbers are equal (m = n), and the sum of the coefficients of each column is one, it is "stochastic".

1. Stochastic matrix is a square matrix made up of columns of some probability distribution vectors. The coefficients of these probability vectors are real numbers between 0 and 1 sum to 1, and the stochastic matrix copies them into vectors of the same type, coefficients of real numbers from [0, 1] sum to one.

Indeed, if A = (aij) is a stochastic matrix of order n, then

a1j + a2j + ... + anj = 1,

in each column j = 1, 2, ..., n, so if we have x = (x1 , x2, ..., xn), also a probability vector, then in the vector y = Ax coefficient of the i-th row:

yi = ai1x1 + ai2x2 + ... + ainxn

for each i = 1, 2, ..., n. These are non-negative sum numbers:

\[ \sum_{i=1}^n y_i = \sum_{i=1}^n \sum_{j=1}^n a_{ij}x_j = \sum_{j=1}^n\left(\sum_{i=1}^n a_{ij}\right)x_j = \sum_{j=1}^n x_j = 1, \]

so y is also a vector of some probability distribution.

7. Example. Matrix with equation y = Ax

\[ \begin{pmatrix} 0.62 \\ 0.30 \\ 0.08 \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 & 0.2 \\ 0.2 & 0.7 & 0.2 \\ 0.0 & 0.1 & 0.6 \end{pmatrix} \begin{pmatrix} 0.70 \\ 0.20 \\ 0.10 \end{pmatrix} \]

translates the distribution vector x into the distribution vector y, which is easy to check by directly multiplying and adding the given numbers. □

2. An important property of square matrices of the same order is that they can be multiplied with each other. That the products of stochastic matrix A and B of the same order (n) will again be stochastic matrix (C = AB) we see from the previous one. Each column of the second matrix (B) is some probability vector, like x, which multiplied by the first (A) becomes also a probability vector, like y, so the columns of multiplication results are some probability vectors, i.e. C is a stochastic matrix.

We see the same in more detail by multiplying general numbers:

\[ \sum_{i=1}^n c_{ik} = \sum_{i=1}^n \sum_{j=1}^n a_{ij}b_{jk} = \sum_{j=1}^n \left(\sum_{i=1}^n a_{ij}\right) b_{jk} = \sum_{j=1}^n b_{jk} = 1, \]

which is true for every k = 1, 2, ..., n, so the matrix C = (cik) stochastic.

8. Example. Matrix multiplication C = AB of stochastic matrices

\[ \begin{pmatrix} 0.50 & 0.32 & 0.26 \\ 0.35 & 0.50 & 0.30 \\ 0.15 & 0.18 & 0.44 \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 & 0.2 \\ 0.2 & 0.7 & 0.2 \\ 0.0 & 0.1 & 0.6 \end{pmatrix} \begin{pmatrix} 0.5 & 0.2 & 0.1 \\ 0.3 & 0.6 & 0.2 \\ 0.2 & 0.2 & 0.7 \end{pmatrix} \]

gives the stochastic matrix. Check it by direct calculation. □

3. Therefore there is a composition x' = Ax, x'' = Ax', x''' = Ax'', ..., x(r+1) = Ax(r), ... mappings with stochastic matrices. This x(r) = Arx. Thus, the array of matrices A, A2, ..., Ar, ... becomes Markov Chain. That "chain" is a process generated by the stochastic matrix A. It maps the probability distribution vector (x) step by step into the sequence x', x'', x' '', ... vector probability distribution.

9. Example. A series of A, A2, A3, ... stochastic matrices:

\[ \begin{pmatrix} 0.8 & 0.2 & 0.2 \\ 0.2 & 0.7 & 0.2 \\ 0.0 & 0.1 & 0.6 \end{pmatrix}, \begin{pmatrix} 0.68 & 0.32 & 0.32 \\ 0.30 & 0.55 & 0.30 \\ 0.02 & 0.13 & 0.38 \end{pmatrix}, \begin{pmatrix} 0.608 & 0.392 & 0.392 \\ 0.350 & 0.475 & 0.350 \\ 0.042 & 0.133 & 0.258 \end{pmatrix}, ... \]

is the Markov chain generated by matrix A. It forms the sequence of vectors step by step x, x', x'', x''', ...:

\[ \begin{pmatrix} 0.7 \\ 0.2 \\ 0.1 \end{pmatrix}, \ \begin{pmatrix} 0.62 \\ 0.30 \\ 0.08 \end{pmatrix}, \ \begin{pmatrix} 0.572 \\ 0.350 \\ 0.078 \end{pmatrix}, \ \begin{pmatrix} 0.5432 \\ 0.3750 \\ 0.0818 \end{pmatrix}, \ ... \]

that are some probability distributions. □

Notice that the range of coefficients of the mapped vector decreases, and there are simple general rules when this happens.


Question: When does the range of probability vector coefficients decrease and what are the consequences?


Answer: Let them be:

mi = min {ai1, ai2, ..., ain}
Mi = max {ai1, ai2, ..., ain}

minima and maxima of the i-th row in order of all i = 1, 2, ..., n, of the stochastic matrix A = (aij).

Then each miyiMi of the distribution vector y = Ax which is the image of the distribution vector x. Namely:

yi = ai1x1 + ... + ainxn,   however   miai1x1 + ... + ainxnMi,

because the x1 + ... + xn = 1.

In the previous one, 9. example, m1 = 0.2 and M1 = 8.0 so that the leading the coefficient 0.7 of the vector x changes to 0.62, this one to 0.572, then to 0.5432 and so on, it decreases, because it is greater than 0.5 and it is too large. However, x2 = 0.2 is less than 0.5 and is too small, so it grows: 0.2 → 0.30 → 0.350 → 0.375 → ..., and that is the procedure of equalizing the probabilities and increasing the information (Extremes).

As a consequence of this narrowing of the probability distribution range xx' → x'' ... and changes (increase) of its information we see as interference, nuisance, that is channel noise. If it is created by adding information, misinformation increases the uncertainty of the input message.

The measure of "channel noise", which is not specifically determined by the information theory, can be defined here by the intensity of the increase in the aforementioned narrowing, which are the probabilities, for example by the formula (Nu):

Ν = -∑i (Mi - mi)⋅log (Mi - mi) =
= -(M1 - m1)⋅log (M1 - m1) - (M2 - m2)⋅log (M2 - m2) - ... - (Mn - mn)⋅log (Mn - mn).

The formula resembles Shannon's information, but it is not such an average, because the differences (Mi - mi) which are probabilities, do not constitute a distribution probability. That is precisely why it is a well-defined disturbance, because the increasing probability difference Δp of 1/e ≈ 0.37, as well as anything smaller than it, makes smaller the summation item -Δp⋅log Δ p.

For example, the previous matrix (A) has "channel noise"

Ν = -0.6⋅log2 0.6 - 0.5⋅log2 0.5 - 0.6⋅log2 0.6 ≈ 1.38 bit.

Another example, the stochastic matrix (Transformations) defined by

\[ C = \begin{pmatrix} \alpha & \alpha & \alpha \\ \beta & \beta & \beta \\ \gamma & \gamma & \gamma \end{pmatrix} \]

has no "channel noise", because -Δp⋅log Δp → 0, when Δp → 0. This is consistent, because C² = C, that is, this matrix is not changed by exponentiation. But, it will each probability vector x = (x1, x2, x3) transform into a vector x' = (α, β, γ). Therefore, we need to distinguish between "channel noise" and "message noise".

The third example, the unit matrix

\[ I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \]

has this difference probability Δpi = Mi - mi = 1 - 0 = 1, but all sums of "channel noise" are -1⋅log 1 = -1⋅0 = 0, so it is absent. Any matrix obtained by permutations of the unit columns is a stochastic matrix and also has no noise.

There is also a summary of the answer to the question posed to me, the range of probability vector coefficients decreases with the number Ν, here determined "channel noise", or more precisely the "Narrowing" of the channel.

Black Box II

Question: Isn't the mentioned matrix C a "black box"?

Black Box II

Answer: Yes, C is one of them (Black Box). When "narrowing" the channel Ν = 0, then the channel is either a black box (like C), or lossless (like I) while implementing probability distributions.

In all other cases, when Ν > 0, the composition Ar, when the exponent r grows indefinitely, converges to a black box. Namely, when Ν > 0, then there is at least one row A = (aij) of the stochastic matrix, say i-th, whose the difference of the maximum and minimum of the conditional probabilities is not zero.

10. Proposition. Differences Δpi = Mi - mi > 0, of the same row, become a strictly decreasing sequence of the composition Ar, when r increases.

Proof: By mathematical induction. In case r = 2, with the matrix A2 we have coefficients a'ik for which:

\[ a'_{ik} = \sum_{j=1}^n a_{ij}a_{jk} \lt M_i \sum_{j=1}^n a_{jk} = M_i \]

because by assumption at least one member aij is smaller than the largest one, and the sum of all k-th columns is 1. Analogously a' ik > mi. Hence

0 < M'i - m'i < Mi - mi

which means that the difference of the largest and smallest coefficient of the ith row of the matrix decreases.

In the second step of the induction, suppose that a'ik are the coefficients of the ith row of the matrix Ar and that M'i and m'i are distinct largest and smallest elements of its ith line. Then for the matrix ArA is:

\[ a''_{ik} = \sum_{j=1}^n a'_{ij}a_{jk} \lt M'_i \sum_{j=1}^n a_{jk} = M'_i \]

explained in the same way as in the preface, and similarly a'ik > m'i, which draws

0 < M''i - m''i < M'i - m'i

which completes the proof by induction. The difference of the largest and smallest of all the coefficients of the same ith row of the matrix Ar decreases when r = 1, 2, 3, ... grows. ∎

As the series of these differences Δpi = Mi - mi > 0 bounded from the lower side but also strictly decreasing, it converges to some accumulation point. However, it cannot be anything but zero, because again we would have a step like the previous induction, all the way to zero.

Therefore, the limit value of each member of the ith row of the matrix Ar, when r → ∞, let named the constant α (alpha). The row then resembles the row constant arrays of the aforementioned C matrix. However, if 0 < α < 1, then each column has some more non-zero elements (up to the sum of 1), so there are still some line differences Δp > 0 which will converge to zero, or has all other rows of some constant β, γ, ... (like C) sum α + β + γ + ... = 1. With this, we also proved the following statement.

11. Proposition. when Ν > 0, the composition Ar, when the exponent r grows indefinitely, converges to a black box.

By the way, the black box we are talking about, like the matrix C (Transformations) is a stochastic matrix that every vector of probability distribution copy to one and the same vector, so based on the output (copy) we cannot know what the input (original) was.

Message Noise

Question: Do you have anything else to say about "channel noise"?

Message Noise

Answer: Of course, you'd be surprised how much there is. But let's first consider the difference between "message noise" and "channel noise" mentioned recently (Narrowing).

1. We observe the original, then the copy, the vector of the above labels:

x = (x1, ..., xn),     x' = (x'1, ..., x'n),     x' = Ax,

Let's calculate the differences (Δxk = x'k - xk) copies and originals, vector:

Δx = x' - x = (A - I)x,

\[ \Delta x = \begin{pmatrix} a_{11} - 1 & a_{12} & ... & a_{1n} \\ a_{21} & a_{22} - 1 & ... & a_{2n} \\ ... & ... & ... & ... \\ a_{n1} & a_{n2} & ... & a_{nn} - 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ ... \\ x_n \end{pmatrix}. \]

When this zero-vector (Δx = 0), then the transmitter (A) has no noise, at least as far as the given message (x) is concerned. This is certainly achieved, when the determinant det(A - I) = 0. The "message noise" problem set in this way becomes a problem of characteristic values (eigenvalues) of matrices with their characteristic vectors (eigenvectors).

12. Example. Matrix (A) 7. example, has characteristic (eigen) values λ and their corresponding (Ax = λx) characteristic or eigenvectors:

λ1 = 1     x1 = (5, 4, 1),
λ2 = 0,6     x2 = (√2, 0, -√2),
λ3 = 0,5     x3 = ( 0, -√2, √2),

so it is Axk = λkxk. □

The example indicates the main obstacle of stochastic matrices in applications of linear algebra theorems. The vectors to which we apply these matrices are probability distributions, and such are only part of the vector space. The matter is similar to solving other algebraic equations, when we can get solutions that are not realistic and we say that the equation "has no solution".

Second, the eigenvectors are (usually) unit norms ∥x∥ = 1, in the way the norm is defined in the given vector space, that is, how we can define it. Standard norm of vector v = (v1, v2, ... , vn) is the root of the sum of the squares of the vector coefficients

\[ \|v\| = \sqrt{v_1^2 + v_2^2 + ... + v_n^2}. \]

When the vector is divided by this numberi, ts norm, the unit vector x = v/∥v∥ norm 1, if ∥v∥ ≠ 0. In the 12th example n = 3, and the squares are the coefficients (±0.7071...)² = 0.5. Also ∥x1∥ = 1, so we write:

λ1 = 1     x1 = (-0.7715, -0.6172, -0.1543),
λ2 = 0.6     x2 = (0.7071, 0.0000, -0.7071),
λ3 = 0.5     x3 = ( 0.0000, -0.7071, 0.7071).

However, one way or another, such eigenvectors are not probability distribution vectors, with non-negative coefficients summing to one.

2. With matrices in general and the linear operators they represent, as we know, there are characteristic (eigen) equations Ax = λx. Number of characteristic values (λ1, λ2, ..., λn) is equal to the rows of the matrix. This is equal to the number of members of the sequence, the components of the vector, and they are the number of dimensions of the vector space. Each individual λk has an eigenvector xk, so that Ax k = λkxk, for k = 1, 2, ..., n.

When the columns of the matrix A span a vector space of the same dimension (n) we say that the matrix is not degenerated. Then all the eigenvalues of the matrix are distinct and there is an inverse matrix such that AA-1 = A-1A = I. So, when the matrix is not degenerate, its eigenvectors (xk goes with λk) are linearly independent. They are then the columns of one auxiliary matrix P = P[x1, x 2, ..., xn] such that AP = PD, where D diagonal matrix.

13. Example. The same previous matrix A, its auxiliary P and diagonal matrix D = P-1AP are respectively:

\[ \begin{pmatrix} 0.8 & 0.2 & 0.2 \\ 0.2 & 0.7 & 0.2 \\ 0 & 0.1 & 0.6 \end{pmatrix}, \quad \begin{pmatrix} \frac{5}{\sqrt{42}} & \frac{1}{\sqrt{2}} & 0 \\ \frac{4}{\sqrt{42}} & 0 & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{42}} & -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}, \quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0.6 & 0 \\ 0 & 0 & 0.5 \end{pmatrix}. \]

Since A = PDP-1, that is A2 = PD2P-1, ..., Ar = PDrP-1 =

\[ P \begin{pmatrix} 1^r & 0 & 0 \\ 0 & 0,6^r & 0 \\ 0 & 0 & 0,5^r \end{pmatrix} P^{-1} \to P \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} P^{-1} = \begin{pmatrix} 0.5 & 0.5 & 0.5 \\ 0.4 & 0.4 & 0.4 \\ 0.1 & 0.1 & 0.1 \end{pmatrix}. \]

We obtained the matrix C from the beginning of these considerations (Narrowing), which is also stochastic, but is the "black box" of the data transmission channel. It maps each vector to the vector (0.5, 0.4, 0.1). □

3. The column information of matrix A is (approximately):

-0.8⋅log20.8 - 0.2⋅log20.2 - 0⋅log20 = 0.722
-0.2⋅log20.2 - 0.7⋅log20.7 - 0,1⋅log20.1 = 1.157
-0.2⋅log20.2 - 0.2⋅log20.2 - 0.6⋅log20.6 = 1.371

a total of 3.250 bits. It converges to the matrix ArC column information

-0.5⋅log20.5 - 0.4⋅log20.4 - 0.1⋅log20.1 = 1.361

or a total of three times 4.083 bits. As previously noted, as chain length increases, this information grows. However, Cx is

\[ \begin{pmatrix} 0.5 & 0.5 & 0.5 \\ 0.4 & 0.4 & 0.4 \\ 0.1 & 0.1 & 0.1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 0.5 \\ 0.4 \\ 0.1 \end{pmatrix} \]

for each probability vector x = (x1, x2, x3), because x1 + x2 + x3 = 1, so the change of input and output information is variable value. This also falls under the distinction between "message noise" and "channel noise".

4. So, when the Markov chain generator is a non-degenerate channel matrix, like A, its columns are probability distribution vectors and span the vector space of the row dimension of the matrix. Its determinant is not zero and such has an inverse matrix, different eigenvalues and corresponding eigenvectors that span the space of the same dimension. This matrix can be diagonalized. However, as we have seen, by scaling it can degenerate into a "black box".

What is important to notice about such non-degenerate matrices is that by exponentiation, composition of transformations, the determinant of an increasingly long chain tends to zero, if |det(A)| < 1, as in the above case, det(A) = 0.3. Since the determinant of the product is equal to the product of the determinants, it is:

det(Ar) = (det A)r = (0.3)r → 0,   r → ∞.

The limit value of a series of transformations is the zero-determinant matrix. The question of the convergence of the Markov chain to the "black box" thus becomes a matter of the degeneracy of the generator of the chain, with a unique criterion. And degeneration occurs if and only if it is |det(A)| < 1.

5. There is also a general rule. The zero determinant matrix is not regular, it has no inverse matrix. It copies so that it is not possible to determine the original based on the copy, and it is a "black box". Any matrix A whose determinant |det(A)| < 1 will have "channel noise". However, there are messages that will still be transmitted accurately. In the 13th example, it is (0.5, 0.4, 0.1), the vector to which each transition.

The upper triangular matrix below the main diagonal has all zero coefficients, and its eigenvalues are on the diagonal. When it is stochastic, the first element of the diagonal is 1, because the sum of that column must be 1. However, it therefore has an eigenvalue λ1 = 1 and a corresponding eigenvector x1 = (1, 0, ..., 0) which is a probability distribution.

If the upper triangular matrix T has no other units on the diagonal except the first one, by scaling it will have smaller diagonal elements and larger ones above, so that in the limiting case it could be a matrix with units in the first row and all other elements zeros.

14. Example. Upper triangular stochastic matrix

\[ T = \begin{pmatrix} 1 & 0.2 & 0.1 \\ 0 & 0.8 & 0.2 \\ 0 & 0 & 0.7 \end{pmatrix} \]

has eigenvalues λ1 = 1, λ2 = 0.8 and λ3 = 0.7 of the corresponding eigenvectors, respectively:

\[ x_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad x_2 = \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix}, \quad x_3 = \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix} \]

The first of them x1 is also a probability distribution vector, so from Tx = λ1x 1 follows Trx1 = T r-1x1 = ... = x 1. It is the eigenvector of the Markov chain. □

The Determinant of a matrix is the product of all eigenvalues of the matrix, and the trace is their sum. Therefore, the upper triangular stochastic matrix with all positive diagonal numbers has a positive determinant, this det(T) = 0.56. Such a number becomes progressively smaller and tends to zero, and the Markov chain generated by the T matrix degenerates into a "black box". But each fragment of that chain has an unchanged eigenpair λ1 and x1 so that the "message noise" will also be beyond the " canal forest".


Question: Some information does not change during "channel degeneration"?


Answer: That's right, it is visible in the previous answer (14. Example). The triangular stochastic matrix degenerates, while running the links

TT2 → ... → Tk → ... C

steps of the Markov chain that converges to the black box C. In addition, the vector x1 = (1, 0, 0) remains characteristic of each length Tk, with the eigenvalue λ1 = 1.

15. Example. Consider a third order stochastic upper triangular matrix in general form

\[ T = \begin{pmatrix} 1 & λ' & α \\ 0 & λ & β \\ 0 & 0 & γ \end{pmatrix} \]

whose coefficients are all non-negative, and λ' + λ = 1 and α + β + γ = 1. Let's find the general form for Tk for each k = 1, 2, 3, ... and limit TkC, when k → ∞.

Solution: Clearly it is

\[ T^k = \begin{pmatrix} 1 & λ'_k & α_k \\ 0 & λ^k & β_k \\ 0 & 0 & γ^k \end{pmatrix}, \]

where λ'k = 1 - λk and αk + βk + γk = 1. Multiplying the first matrix by the second (TTk), we find:

βk+1 = λβk + βγk = ... = β(λk + λk-1γ + ... + γk),
βk = β(λk - γk)/(λ - γ),   λ - γ ≠ 0,
λk, γk → 0,   γ, λ < 1.

We get the same by multiplying the first matrix by the second (TTk) and the second by the first (TkT), then by equating:

βk+1 = λβk + βγk = λkβ + βkγ,
(λ - γ)βk = (λk - γk)β → 0,   γ, λ < 1.

Multiplying the first by the second and the second by the first also give:

αk+1 = αk + λ'βk + αγk = α + λ'kβ + αkγ,
(1 - γ)αk = α + λ'kβ - λ'βk - αγk.

When λ < 1, then (1 - γ)αkα + β. If λ = 1, then (1 - γ)αkα. □

We check the general solutions from this example by including the values of the constants (TkC, k → ∞). Which are written are not zero:

\[ 1. \quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}^k \to \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \] \[ 2. \quad \begin{pmatrix} 1 & \lambda' & 0 \\ 0 & \lambda & 0 \\ 0 & 0 & 1 \end{pmatrix}^k \to \begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \] \[ 3. \quad \begin{pmatrix} 1 & 0 & \alpha \\ 0 & 1 & 0 \\ 0 & 0 & \gamma \end{pmatrix}^k \to \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \] \[ 4. \quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & \beta \\ 0 & 0 & \gamma \end{pmatrix}^k \to \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{pmatrix}, \] \[ 5. \quad \begin{pmatrix} 1 & 0 & \alpha \\ 0 & 1 & \beta \\ 0 & 0 & \gamma \end{pmatrix}^k \to \begin{pmatrix} 1 & 0 & \frac{\alpha}{\alpha + \beta} \\ 0 & 1 & \frac{\beta}{\alpha + \beta} \\ 0 & 0 & 0 \end{pmatrix}, \] \[ 6. \quad \begin{pmatrix} 1 & \lambda' & \alpha \\ 0 & \lambda & \beta \\ 0 & 0 & \gamma \end{pmatrix}^k \to \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}. \]

By direct multiplication, TC = C, we confirm the accuracy of these.

Another way to check such formulas is with specific values and calculations, say for k = 16 to three decimal places.

\[ 1. \quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}^{16} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \] \[ 2. \quad \begin{pmatrix} 1 & 0.2 & 0 \\ 0 & 0.8 & 0 \\ 0 & 0 & 1 \end{pmatrix}^{16} \approx \begin{pmatrix} 1 & 0.832 & 0 \\ 0 & 0.168 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \] \[ 3. \quad \begin{pmatrix} 1 & 0 & \alpha \\ 0 & 1 & 0 \\ 0 & 0 & \gamma \end{pmatrix}^{16} \approx \begin{pmatrix} 1 & 0 & 0.997 \\ 0 & 1 & 0 \\ 0 & 0 & 0.003 \end{pmatrix}, \] \[ 4. \quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & \beta \\ 0 & 0 & \gamma \end{pmatrix}^{16} \approx \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0.997 \\ 0 & 0 & 0.003 \end{pmatrix}, \] \[ 5. \quad \begin{pmatrix} 1 & 0 & 0,1 \\ 0 & 1 & 0.2 \\ 0 & 0 & 0.7 \end{pmatrix}^{16} \approx \begin{pmatrix} 1 & 0 & 0.332 \\ 0 & 1 & 0.665 \\ 0 & 0 & 0.003 \end{pmatrix}, \] \[ 6. \quad \begin{pmatrix} 1 & 0.2 & 0.1 \\ 0 & 0.8 & 0.2 \\ 0 & 0 & 0.7 \end{pmatrix}^{16} \approx \begin{pmatrix} 1 & 0.972 & 0.947 \\ 0 & 0.028 & 0.050 \\ 0 & 0 & 0.003 \end{pmatrix}. \]

The meaning of these numbers is usual. For example, the second case (2): the second signal (x2) will with each step of the Markov chain (T) with probability 0.2 be passed to the first signal (x1), with probability 0.8 to the second (the same one sent), and never to the third (x3). After 16 steps (with matrix T16) the second signal will pass to the first with probability 0.832 and with probability 0.168 to the second, and also never to the third. Although the message x = (x1, x2, x3) independent distribution of probabilities, the channel adapts them in steps.

When we observe the characteristic equation Tkx = x, which does not change the vector x, we find that in the first case (1) every vector is eigenvector. The second case (2) has x = (1, 0, 0) and (0, 0, 1) eigenvectors. In the third (3), fourth (4) and fifth (5) case, each vector (p, q, 0) of negative coefficients sums to one. The eigenvector of the sixth (6) case is (1, 0, 0), otherwise common to all six.

As we see, linear algebra allows the same thing that this information theory in its own way predicts (Far Before). Some information does not change during "channel degeneration" and this opens up the possibility of transmission to the present and ancient information from the "time before time". In addition to "contentless" and undoubted uncertainties like (p, q, 0), there are regularity certainties like (1, 0, 0). Of course, the considered matrix T is not the only type of Markov chain generator.


Question: Can a noiseless channel have "message noise"?


Answer: Maybe, let's say circular misinformation. When the first signal is definitely mapped to the second, the second to the third and so on in order, and the last to the first.

The matrix A of such deception has a unit in each column, so its determinant is det(A) = 1. This means that the sequence of n = 1, 2, 3, ... channel steps has the same determinant, because det(An) = (det A)...(det A) = 1. Furthermore, this means that the Markov chain does not degenerate into a "black box". He permutes lies into new lies, but occasionally (in no longer than n steps) he also expresses the correct initial message.

Using the new definition of "channel noise" (Narrowing) we get the same, since each summation item has a form of -Δpk⋅log Δpk = 0, whether the given probability is one or zero, and the mentioned matrix A have only those. The sum of all those "Shannon sums" is zero, so the "channel noise" is Ν = 0. Distance ∥x' - x∥ of the output from the input message (x' = Ax) can be drastic, so the "message noise" of an individual step of the channel can be seen as such. However, every period (number of steps), the message becomes initial and starts its cycle again.

16. Example. The stochastic matrix

\[ Ax = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} x_2 \\ x_3 \\ x_1 \end{pmatrix} = x' \]

it does the permutation, disinformation just described. However, A²x = x'' would be another cyclic exchange into new misinformation, but the next, third step of the chain, A³x = x, given the initial value. □

This was an example of a matrix of the third order (n = 3) and simple enough to easily see the cyclicity created by its exponentiation. It turns out that the same is true for every stochastic matrix (n ∈ ℕ) which is a permutation of the columns of the unit matrix. Moreover, it turns out that cyclical phenomena are all around us and too wide an area (Rotations) for one short answer.


Question: What is a stochastic and what is a regular matrix?


Answer: A matrix K is called stochastic when each of its elements is non-negative and the sum of each of its columns is unity. The product of such matrices, if it exists, will again be a stochastic matrix of a channel, so the degrees of such matrices will give stochastic matrices Kr. The determinant of a regular matrix is one for which the determinant is not zero. When det(K) = 0, the matrix is singular.

1. A stochastic matrix K is often mistakenly said to be "regular" if all elements are of at least some degree Kr positive. This is taken by its "usual" definition, and therefore it has an inverse matrix K-1 for which KK-1 = K-1K = I , where I is the identity matrix. But with that we are in contradiction with the matrix in the previous picture whose determinant is zero. The statement that matrices with positive elements have non-zero determinants is not true, but for the matrix in that picture on the right.

Therefore, the determinant of the regular matrix K is not zero (det K ≠ 0), it has the inverse matrix K-1 with which the multiplication gives the unit, all its columns are linearly independent vectors, the regular matrix is diagonalized, and so on ..., it is not singular.

2. In the description of channel matrices (Channel, Theorem), for the sake of faster recognition, I stated that a matrix is regular (has a non-zero determinant) when each of its diagonal elements is greater than sum of others of that line. Namely, if the determinant is zero, the equation Kx = 0, viewed as a homogeneous system of linear equations of arbitrary unknowns x = (p1, p2, ..., pn), will have a nontrivial solution (x ≠ 0).

If |pi| the largest of those variables, we look at the i-th line:

ki1p1 + ... + kiipi + ... + kinpn = 0,
|kiipi| = |∑j:ji kijpj| ≤ ∑j:ji |kijpj|,
|kii| = ≤ ∑j:ji |kij pj/pi| ≤ ∑j:ji |kij|,
|kii| = ≤ ∑j:ji |kij|

and this would be a contradiction with the assumption, because the diagonal element is not greater than the sum of the rest of its row. Wider sets of variables are used in the proof, but the statement of the theorem remains true for probability distributions as well.

3. However, we can see from the example of the following matrix that the determinant is -1/16 = -0.0625.

\[ K = \begin{pmatrix} 0.50 & 0.25 & 0.25 \\ 0.25 & 0.50 & 0.25 \\ 0.25 & 0.25 & 0.50 \end{pmatrix} \]

The eigenvalues of this stochastic matrix are λ1 = 1 and twice λ2 = λ3 = 0.25. Only the first of them has an eigenvector x = (1/3, 1/3, 1/3) which is also a probability distribution vector, so Kr converge to matrix whose coefficients are all 1/3. All initial distributions through the Markov chain generated by this matrix thus, during the chain, converge to the specified x.

The same example shows us something else, that a "degenerate" matrix (with repeated eigenvalues) is not always the same as a "singular" matrix (zero determinants).


Question: How do I use Markov matrices?


Answer: Engineering work is about turning theory into practice, and looking a little more broadly, many from scientists to craftsmen are on similar tasks. We will not get into their affairs when we look at a few examples.

1. The example of diagnosis, which I mentioned recently, is instructive and quite universal. With a little knowledge of elementary physics, it easily translates into "the work of a force on the path", ΔE = F⋅Δx. In the matrix equation (E = Fx), the force work takes the form

\[ \begin{pmatrix} E_1 \\ E_2 \end{pmatrix} = \begin{pmatrix} F_{11} & F_{12} \\ F_{21} & F_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \]

where Eij = Fijxj is the energy consumed ( Ei) for the action of the force (Fij) during its action on the path of length (xj ). These forces are constant, otherwise the paths are small distances while consuming correspondingly smaller portions of work. Individual forces are the matrix of (formally) conditional probabilities, or conditional-consequential exchanges of corresponding energies.

As we can see, in the application of "diagnostics" some knowledge, a little imagination and engineering precision of the concepts used here will be enough to find ourselves in pure physics in no time. The following example testifies that "only the sky is the limit" for those who would use Markov chains.

2. In the picture above left is the "Fredkin gate", a universal computer circuit, or electrical switch for on-off current flow, i.e. correct -false algebra of logic. It is a very small switch with three inputs (a, b, c) and three outputs (a', b', c'). Usually b = ā and b' = ā' and c' = c.

When at the input c = 0 (no current, or false ⊥), and a = 1 and b = 0, it will be at the output conversely a' = 0 and b' = 1, and c = 0. Then a = 0 on input will become a' = 1 on output. If at the input c = 1 (there is current, or true ⊤), and a = 1 and b = 0, it will be at the output a' = 1 and b' = 0, and c = 1. Then a = 0 at the input will remain a' = 0 on output.

We can describe this with a matrix equation (x' = Fx)

\[ \begin{pmatrix} a' \\ b' \end{pmatrix} = \begin{pmatrix} \bar{c} & c \\ c & \bar{c} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} \]

where multiplication means the conjunction operation of the algebra of logic, and addition means the disjunction. Hence it is just a step to the probability, \( c \in [0, 1] \) and \( \bar{c} = 1 - c \), with the matrix F becoming the generator Chain Markov.

3. We can decompose the unit matrix I into factors, which is otherwise not possible with numbers, let's say into the squares of Pauli matrices:

\[ \sigma_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \sigma_2 = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad \sigma_3 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}. \]

For each of these, σ² = I, so each vector can be written at least in the following three formally equal ways x = Ix = σ²x, and of different interpretations. Each of these cases is a separate story of the application of Markov chains, so extensive that it will have to wait for a better opportunity.

The real question is not whether there are applications of Markov chains, but why there are so many of them. Common reasons are the reason for taking the ubiquity of information in natural phenomena, and the ease of tying interpretations to probabilities is the reason for seeking the essence of information in uncertainty. How to tie it all together will be an interesting and problematic area.


Question: The message adapts to the transmitter, but vice versa?


Answer: That's right. State and process adaptation is at the core of this information theory (A Blur), and the message behavior observed in the Markov chain is a limited type of formal confirmation.

The constancy of the chain generator, the matrix A, is one type of such constraint. Another is that more or less such matrices can be changed during the process. But as a simple model, in the first stage of observing situations and information changes, the Markov chain is instructive and approximately sufficient.

1. Not everything communicates with everyone, so the environment does not take over and transmit all its phenomena equally. Each of the "messages" that communicates with the environment thus becomes a "transmitter", so in reality both adaptations, the information transmitted to the transmission channel and the information channel, are inevitable. The first are faster, easier to see, and adjusting the overall environment to its factors is a more complex process of constant averaging.

Due to the principle of striving for less information, it can be expected that the value of the process determinant (det A) increases on average. That contrary to the degeneration of individual messages, the process of regeneration of the channel itself runs, but considering the greater weight of the channel, that degeneration prevails. Analogous to the alignment of the orbits of the planets of the solar system, which becomes more visible, from the revolution of the massive Sun in a similar way.

2. Like the periodic rotation of the planets around the sun (revolution), various periodic phenomena are possible. Then one "link" of the Markov chain itself becomes a queue transformation A = A1A2...Am, with time (t) and m-membered arrays slowly changing A = A(t). The present is moving at the speed of light c, the fastest of physical speeds, leaving a trace of the past behind so that its information density is getting smaller and its inheritance is getting thicker.

Observing the "movement" of the present in 6-dim space-time, considering the law of conservation (present) + (past) = const, the question of the speed of reaching us by past information also arises. I am of the opinion that there are no stupid questions, except possibly stupid answers, so one such suspect could be the idea that the speed of light, or the high speed of whatever information, the present slows down compared to the past. Faster messages thus arrive further, and light is therefore special because its speed does not depend on the speed of the observer.

3. However, the past is a pseudo-reality and its "movements" are not widely acceptable. Then, a more realistic possibility of transmitting messages from the past to the present would be the longer preservation of certain pieces, more resistant to wear and tear. A reduced degeneration of individual details than the average is possible due to the principal diversity, as well as their conservation for emissions to some future present.

The first "crazy" (hypo)thesis, about a higher speed of light then than now, otherwise agrees with the growing universe, with the increasing (perceived) speed of its expansion and the reduction of the energy density or the mass of the universe. With the geometry of 6-dim space-time that so "gravitationally" pulls forward our 4-dim world of events. Also, with the decreasing uncertainty of the present, its guidance from the past is getting better.

4. Looking at the array of information x = (x1, x2, ..., xn), x' = Ax, x'' = Ax', ..., or in general x(r) = Arx, where r = 0, 1, 2, .. ., we notice that longer and longer transmissions, fragments of compositions, arrive from further and further away from the present and closer and closer to the eventual "black box". From the point of view of the current, then observed transfer step, the further the former, the greater the amount of uncertainty, information. This is in favor of "minimalism".

In this sense, the formulas for measuring the narrowing of the channel, otherwise usable for defining the speed of degeneration (per transformation step), can also be a measure of the "attractive force" towards the future. It is a kind of "probability force", and the repulsive "uncertainty forces", and on a longer stick and the aforementioned slowing down of the speed of light in stronger gravity.


Question: How and why do interactions occur at all, if they are not in the temper of nature?


Answer: The key to everything happening, even the "impossible", is uncertainty. Proceeding from that, and separating parts of the interactions, the certainty has an advantage, but without exclusivity. Then we will notice that there are guidelines inside of the lack of regularity.

1. We have seen how Markov processes degenerate into "black boxes", and especially such that each state (probability vector) produces into the same one. Then we are left without the possibility to find out the sent messages based on the received ones. When one such process (A) acts on the state, then a similar one (B), the following happens:

\[ BA = \begin{pmatrix} b_1 & b_1 & ... & b_1 \\ b_2 & b_2 & ... & b_2 \\ ... & ... & ... & ... \\ b_n & b_n & ... & b_n \end{pmatrix} \begin{pmatrix} a_1 & a_1 & ... & a_1 \\ a_2 & a_2 & ... & a_2 \\ ... & ... & ... & ... \\ a_n & a_n & ... & a_n \end{pmatrix} = B, \]

because a1 + a2 + ... + an = 1. The new process absorbs the first one. It does this by mapping each of its columns, like probability distributions. The download of the vector flows in a cascade BAx = Ba = b without memory.

States are messages, the vectors of probability distributions:

x = (x1, x2, ..., xn),
a = (a1, a2, ..., an)
b = (b1, b2, ..., bn),

of the coefficients xk, ak, or bk non-negative (real) numbers sum to one.

2. Memory is given to the elementary particles by its environment. Many of them do not have it in themselves and acquire it as the directions of their own movement, or in general the trajectory of the least action, relatively in relation to the given observers. Interactions with the environment define the subjects of physical action, so those who do not participate in it do not exist (for the overall system). Only the moment of measurement of the electron will determine its previous path, the founders of quantum mechanics noticed, and we explain further: because by acting, it declares itself, loses some uncertainty and gains certainty.

In this sense, previous (1) processes without memory "exist" only in themselves, not as we might see them. As in the case of a particle (message, state), from the observer's point of view these processes are also directed by the past of the given, its environment. The past and the direction through the present to the future of the mass (multitude) are defined by the laws of large numbers from the theory of probability and eventual, perhaps as yet undiscovered.

3. Continuity of phenomena exists as much as we have realized it with our presence. Hence the principle of the smallest effect (least action) of the physical everything around us, or the tendency of the system towards less information. There is no nature without communication, but it will do everything to keep them and interactions to a minimum. The most serious of the obstacles in this is the law on conservation of information. That is why there is no giving without receiving and there is no escape from the vicious circles of exchanges.

The more substances are combined to make more of the opportunities for giving, the greater the limitations of the individual in relation to the guidelines of the collective. By drowning in larger tributary streams, they lose some of their freedom, so in short, playing around like that, the wider nature seems to be getting nowhere. Except that it manages to adhere to the principle of unrepeatability.

4. Development is necessary because of the laws of conservation and uniqueness, and hence interaction. However, such needs are underlined in different ways under the principle of least action, so connecting smaller units to a larger number becomes a matter of Chebyshev's inequality, as well as general gravity, or Bernoulli's equations (Fluid). Approaching a larger (multitude, body, flux) local reduction of information is achieved.

You can see the explanation by Einstein's theory of relativity of the production of a magnetic field by the movement of an electric charge in the unusual attachment here, along with the repulsive mutual force due to the movement of solitary electrons (Current). When we see such opposites in the examples mentioned, up to the enclosure of the living cell and its tendency to join together, we do not recognize their connection because we still do not know the formalism like the mathematical one that gives them structure.

5. Behind all these phenomena is the principle of parsimony of information, or at the elementary level, the tendency for more likely outcomes to occur more often. As with the Compton effect (Collision) when a photon colliding with an electron loses energy by increasing its wavelength, moving to a more smeared path with a lower probability density, this is how interaction occurs in general due to local aspirations of the subject for a smaller effect (least action).

There are no exceptions to these aspirations in today's known theoretical physics, nor is there any lack of communication in these interpretations of it that I add to it. So, to the question of how and why we have interactions, the answer lies in striving for less and the impossibility of actually achieving it. All the necessary details of that story cannot fit into one short description.

6. More complex systems dominated by approximations of mean (Shannon) information towards probability, in transfers like Markov chains, and in some cases of memorization (0 < |det A|, |det B | < 1), by composition they build processes of smaller determinant values than each of the factors:

|det(AB)| = |det(A)| ⋅ |det(B)| < min{|det(A)|, |det(B)|},

which means inevitable degeneration. In particular, the form AB = C1 will be

\[ \begin{pmatrix} 0.9 & 0.2 \\ 0.1 & 0.8 \end{pmatrix} \begin{pmatrix} 0.6 & 0.3 \\ 0.4 & 0.7 \end{pmatrix} = \begin{pmatrix} 0.62 & 0.41 \\ 0.38 & 0.59 \end{pmatrix}, \]

with determinants 0.7 × 0.3 = 0.21. Multiplying inverse BA = C2 gives

\[ \begin{pmatrix} 0.6 & 0.3 \\ 0.4 & 0.7 \end{pmatrix} \begin{pmatrix} 0.9 & 0.2 \\ 0.1 & 0.8 \end{pmatrix} = \begin{pmatrix} 0.57 & 0.36 \\ 0.43 & 0.64 \end{pmatrix} \]

therefore, a different matrix with the same determinant as the previous one. We see in the first multiplication (C1) a more certain first column of probability distribution than the same column of the second multiplication (C2), and vice versa for other columns. Since vector matrix multiplication is calculated from right to left, this means that the newer multiplication covers the older ones in the first coefficients of the vector (distribution).

7. Recent events are more important, more visible, for leading probabilities. If we have long series of outcomes and highlight the more likely ones, so that an infinite number of irrelevant, more certain final processes will yield more certain current, final vectors. Such a stable development is exactly what is happening to us. He flows into a more certain present while it drags an ever-longer trail of the past behind itself.


Question: What is a double, two times stochastic matrix?


Answer: A left stochastic matrix is square consisting of non-negative real numbers whose sum in each column is 1. A right stochastic matrix is also a square of non-negative real numbers but each row sums to 1. A doubly stochastic matrix is a square of non-negative entries with each row and column summing to one.

By transposing the left (replacing columns and rows), the right stochastic matrix is obtained and vice versa. The product of right will be right, just as the product of left is always a left stochastic matrix (Chain, 2). Therefore, the product of doubly stochastic matrices is always a doubly stochastic matrix.

1. The following are two examples of a double stochastic matrix which is (A) and which is not (B) symmetric:

\[ \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.2 & 0.5 & 0.3 \\ 0.1 & 0.3 & 0.6 \end{pmatrix}, \quad \begin{pmatrix} 0.7 & 0.1 & 0.2 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.3 & 0.6 \end{pmatrix}, \]

because the first is equal to itself transposed (A = A), and the second is not (BB). Those symmetric stochastic matrices are Hermitian, complex and non-essential sum of coefficients, but invariant (unchangeable) after transposition with conjugation, so-called adjoints (H = H).

The eigenvalues of the self-adjoint operator are real, and so are the eigenvalues of the symmetric stochastic matrix. In the first case, det(A) = 0.13 with eigenvalues λ1 = 1, λ2 ≈ 0.57 and λ3 ≈ 0.23. The second case has a determinant det(B) = 0.2 and also positive eigenvalues λ1 = 1, λ2 = 0, 5 and λ3 = 0.4.

Each double stochastic matrix M of order n ∈ ℕ, whether symmetric or not, has an eigenvalue λ = 1 corresponding to the eigenvector of the probability distribution x = (1/n, 1/n, ..., 1/n ), so that Mx = x. The rest of the eigenvectors need not be probability distributions.

2. The standard interpretation of the matrix M = (pij) of the Markov chain generator is that the coefficient pij means the conditional probability that the sent j-th signal will be forwarded as i-th. If the j-th column certainly contains one of the possible signals, in order of indexed rows i = 1, 2, ..., n, then the matrix M is left stochastic, or stochastic for short. Additionally, if every i-th rows can certainly recognize some of the j = 1, 2, ..., n possible sent signals, then the matrix is right stochastic, and it is double stochastic.

Note that the above implies the independence and completeness of the conditional probabilities, the coefficients of the M matrix. Then both its columns and rows are probability distributions, which is the default situation, and such a matrix is doubly stochastic. However, errors can occur, say the jth signal is interpreted multiple times or not at all, when a sum-one overshoot or undershoot can occur. A similar thing can happen with the i-th rows and the same one that recognizes multiple or no signal sent at all. Then the matrix M is approximately double stochastic.

The case of the then world champion Grandmaster Garry Kasparov's chess match against a supercomputer (1997) is well known, when the computer defeated him with a score of 3.5 : 2.5. In one, the computer made an unusual move that the Grandmaster and advisors spent hours and all night trying to figure out. And it was only long after the machine's victory that the engineers found out that this strange move was the result of an error in the electric circuit — which rarely but still happens.

3. Even when the channel matrix works flawlessly, be it of type A or B from the previous example (1), it will certainly transmit the mentioned eigenvector x of the eigenvalue λ = 1, otherwise maximum uncertainty. It thus goes through a long Markov chain, regardless of its degeneracy.

If the matrix M itself makes the described errors (2), then in that slightly different way the uncertainty will again travel through the chain. Transmission of errors and uncertainty to real processes, one way or another, is inevitable.


Question: Is there a name for a pair eigen vector-value?


Answer: We call the characteristic pair (eigenvalue/eigenvector) the most important pair, leading, main, predominant, dominant.

1. Let's look at it on the directed graph, pictured on the left (digraph), which represents a three-team competition (1, 2, and 3 circled).

The first beat the second twice and the third once. The second only once the third. The third team has one win against the first and one against the second. Let's add a matrix to them.

\[ A = \begin{pmatrix} 0 & 2 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \end{pmatrix}, \]

with the coefficients aij, where the row and column numbers i, j = 1, 2 and 3 are team labels. So a12 = 2 and means two wins of the first team against the second team, until a32 = 1 which means one win for the third team against the first team. The teams' successes ranked from best to worst are: 1, 3, 2. The first had three wins, the third two, and the second only one (Perron-Frobenius Theorem).

2. It is an example with n = 3 teams in competition and vector x = (x1, x2, ..., xn) with three ranked input participants x1 = 3, x2 = 1 and x3 = 2, otherwise the winners where xj represents the strength of the jth participant. We assign a positive score to each team (i = 1, 2, ..., n):

\[ s_i = \frac{1}{n_i} \sum_{j=1}^n a_{ij}x_j \]

with normalization factors by the number of games (here each ni = 2) that the ith team played, because we don't want the team ranking to be higher simply because of more games played. Matrix coefficient otherwise aij ≥ 0, in the example it is the number of wins of team i against team j, but it can be evaluated differently.

We expect the i-th team to have a score si proportional to the power xi of that team and that for all components of the score vector s = (s1, s2, ..., sn). So we assume that s = λx. It is actually a ranking of the power of the vector x as a characteristic (eigen) vector of the characteristic equation Ax = λx.

The given example (1) has the eigenvalue λ ≈ 1.879 and the corresponding vector x ≈ (0.717; 0.381; 0.584), so Ax ≈ λx. An example of a competition and a digraph (directed graph), an adjacency matrix with a score-vector should come up with the name "dominant" for a pair of eigenvalue-vectors (λ, x) of the matrix. See also the paper on it by University of Verona.

3. A matrix is "irreducible" if, via permutations, it is not similar to the upper block triangular matrix, which has more than one block of positive size.

It can be shown that the matrix (A) is irreducible which transforms a positive vector, of all positive coordinates (x > 0), into a positive (Ax > 0). Positive operators (matrices) represent directed processes, which will develop states into positive, therefore irreversible ones. Positive operators are self-adjoint (A = A), and all eigenvalues are non-negative.

An irreducible matrix is also characterized by its associated digraph, which is "strongly connected", which means that every node can be reached by means of links from any node.

4. The Perron-Frobenius theorem states that the square matrix A which has the non-negative entries also has non-negative eigenvector x which corresponding to the positive eigenvalue λ. When the matrix A is irreducible, then the eigenvector x has strictly positive entries, is unique, simple, and has the corresponding eigenvalue of the largest absolute value.

This theorem also applies to stochastic matrices with non-negative entries and all columns that sum to one. Stochastic vector p = (p1, p2, ..., pn) has all entries non-negative, pj ≥ 0, with the sum p1 + p2 + ... + pn = 1. Therefore, they are immediate representations of probability distributions. By transposition, the columns become sum-one types, so:

\[ A^\top 1 = \begin{pmatrix} a_{11} & a_{21} & ... & a_{n1} \\ a_{12} & a_{22} & ... & a_{n2} \\ ... & ... & ... & ... \\ a_{1n} & a_{2n} & ... & a_{nn} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ ... \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ ... \\ 1 \end{pmatrix} = 1, \]

which means, A 1 = 1. Since A and A have the same determinants

det(A - λI) = det(A - λI) = 0,

these are also their characteristic values λ the same. If we divide this equality by n, we will get an eigenvector which is a probability distribution. As the right stochastic matrix A has an eigenvalue λ = 1, so will the stochastic matrix A. Not necessarily the specified eigenvector.

If some eigenvector p of the stochastic matrix A would have the eigenvalue |λ| > 1, then from Akp = |λ|kp it will be that the series of stochastic matrices Ak has exponentially increasing eigenvalues |λ|k → ∞. This means that starting from some large k these matrices can have coefficients greater than one, aij > 1, which is impossible. Multiplication of stochastic matrices of the same order yields stochastic matrices (Chain, 2).

This is also a proof of the Perron-Frobenius theorem for stochastic matrices, that each A has an eigenvalue λ = 1 to which the eigenvector of probability p corresponds, so that Ap = p. This is an interesting property of the "dominant" of stochastic matrices and a contribution to the answer to the question I was asked.


Question: Explain the "irreducibility" of the operator a little more?


Answer: Ok, in the picture on the right, let's say there is a diagram of the use or flow of electricity between the centers circled with the number 1, 2 and 3. On the directed links are the probabilities of those deliveries. When xj quantity is available to j, from all n = 3 centers, then pij chance of sending that part to the ith center.

I am deliberately speaking in general terms so that you can apply the model more easily. Centers 1, 2, ..., n, it is possible that they are banks with links to money flows, or they are market centers between which services, goods and money circulate, or they are societies with acquaintances. So, in case we are talking about subtraction, for example, if pij is the chance of energy separation from the j-th center to i volume.

1. Whatever, let's take the given numbers, form a matrix M. If we multiply the matrix by itself (we exponentiate it) s = 1, 2, 3, ... times, the exponent tends towards the limit matrix M', approximately:

\[ M^s = \begin{pmatrix} 0.8 & 0 & 0 \\ 0.2 & 0.7 & 0.4 \\ 0 & 0.3 & 0.6 \end{pmatrix}^s \to \begin{pmatrix} 0.028 & 0 & 0 \\ 0.560 & 0.571 & 0.571 \\ 0.412 & 0.429 & 0.429 \end{pmatrix} = M', \]

The stochastic matrix of the third order (n = 3) is the result of the given graph.

The directed graph (digraph) in the picture clearly shows that there is no path from position "1" to positions "2" and "3", but reverse flows are possible, directly as well as indirectly "3" → "2" → "1". That's why it remains impossible to get anything but zero in the first (upper) row by exponentiation in the positions of the second and third columns. Hence, we say that the matrix M is "reducible", because it can be reduced to strictly separated blocks. It is similar to triangular matrices (Degeneration) which cannot be freed from zeros by exponentiation.

In particular, the specified matrix (M) is a non-negative operator, because vectors with non-negative coefficients, of the form x = (0, x2, x3) it transforms into non-negative y = (0, y2, y3), where y = Mx. Each of its degrees does the same. And if some degree s of some square matrix P of non-negative entries would be with all positive coefficients, then the matrix would be "irreducible". A given degree (s) would yield a positive operator, a matrix.

A directed graph is called strongly connected if every vertex is reachable from every other vertex. When we consider P = (pij) as a matrix belonging to a directed graph, where we say that node i directs according to j if and only if pij ≠ 0. Such a definition confirms that the graph itself is strongly connected. Such cannot be broken into smaller strongly connected blocks and is therefore called irreducible.

The following matrix is irreducible (NsN', when s → ∞):

\[ N^s = \begin{pmatrix} 0.8 & 0.1 & 0 \\ 0.1 & 0.7 & 0.4 \\ 0.1 & 0.2 & 0.6 \end{pmatrix}^s \to \begin{pmatrix} 0.235 & 0.235 & 0.235 \\ 0.471 & 0.471 & 0.471 \\ 0.294 & 0.294 & 0.294 \end{pmatrix} = N', \]

because it already has no zero coefficients from the second degree (s ≥ 2). When we form a graph from it, like the one on the top right, we get the following.

Irreducible I

It is a connected graph. With the arrows, connecting nodes 1, 2 and 3, you can reach every node from any node. I repeat, it cannot be broken into smaller strongly connected blocks and we call it irreducible.

For example, if we were to divide society into groups of poor, middle class and rich, and wanted the first to be unable to reach the others, we would use the graph method M with very reliable knowledge of algebra, and at the same time conveniently incomprehensible to many, that we would treat such. But, if we would like to give all participants a chance, we will use graph methods N and appropriate mathematical knowledge.

Computers will cope surprisingly easily with fairly large matrices of the order n ∈ ℕ. What the user needs is an understanding of this method, and in the Python programming language (pip install quantecon) the code is:

import numpy as np
import quantecon as qe
M = np.array([[0.8, 0.0, 0.0],
              [0.2, 0.7, 0.4],
              [0.0, 0.3, 0.6]])
N = np.array([[0.8, 0.1, 0.0],
              [0.1, 0.7, 0.4],
              [0.1, 0.2, 0.6]])
print('\nMatrica M:')
M = M.transpose()
mc = qe.MarkovChain(M, ('1', '2', '3'))
print('M:', mc.is_irreducible)
print('\nMatrica N:')
N = N.transpose()
nc = qe.MarkovChain(N, ('1', '2', '3'))
print('N:', nc.is_irreducible)

Executing this program will:

Matrica M:
[[0.8 0.  0. ]
 [0.2 0.7 0.4]
 [0.  0.3 0.6]]
M: False

Matrica N:
[[0.8 0.1 0. ]
 [0.1 0.7 0.4]
 [0.1 0.2 0.6]]
N: True

This means that the matrix M is not irreducible (False), while the matrix N is irreducible (True). Of course, in practice you will be dealing with a huge number of energy consumer centers, banks, markets, communities and whatnot, but this same code (with matrices of much higher order n) will recognize (ir)reducibility just as effectively.


Question: Is every galaxy the "center" of the universe?


Answer: It can be said that way, from the point of view of each of them. The light is late for each one, traveling from the distant past of those further away, so each galaxy is unique in its own right. She (the galaxy) is a leader in time, she is in the future compared to everyone she sees.

The universe is expanding, and if we imagine it as a 2-dim like the surface of a growing balloon, the galaxies as well as the points on it will each "run away" from the others, and each will see itself as the receding center.

So, in the picture on the right, we are (1) the galaxy from which the other three (2, 3 and 4) are moving away, the only ones drawn here out of billions of them. Further away, they communicate less and less with the central one and have a smaller and smaller link coefficient (0.3, 0.2 and 0.1), at the expense of more dealing with themselves (0.7, 0.8 and 0.9), so that the scheme could be represented by a stochastic matrix:

\[ S = \begin{pmatrix} 0.4 & 0.3 & 0.2 & 0.1 \\ 0.3 & 0.7 & 0 & 0 \\ 0.2 & 0 & 0.8 & 0 \\ 0.1 & 0 & 0 & 0.9 \end{pmatrix}. \]

We easily check that it is irreducible, and already its square (S²) is a positive matrix. Let's also check that the degree of such a matrix will converge to a "black box" (StS', when t → ∞):

\[ S' = \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix}. \]

If we had taken a larger number of galaxies, of these n = 4, or other graph link coefficients (a = 0.4, b = 0.3, c = 0.2 and d = 0.1), the resulting "universe" matrix as a chain of Markov irreducible generator matrices would converge to the "black box", then the matrix of all coefficients 1/n, as long as a, b, c, ... > 0 with a sum of 1. As long as there is at least some communication of said galaxies.

If we consider this exponent t as a measure of the increasing period of time from the present to the past, then the matrix S is currently in the process of developing the galaxies of the universe, and S' what we see today filtered through a long series of such transformations and burdened with interference, channel noise. What we can see as the distant past of the universe, considering each galaxy to be the "center" of the universe, is an impersonal uniform state.

Therefore, the description of the past of the universe that fits into the theory of information, and the way I treat it (Far Before), is also consistent with the algebra of processes. The universe described here is relatively selfish, self-centered and compact, highly connected. However, it does allow separate blocks of events, perhaps described in my dimensions of time contributions.


January 2024 (Original ≽)