Thursday 7 June 2018

The "Survivor Bias" That Internet Professionals Must Understand

Whether or not you have heard the term before, this article will give you a deeper understanding of "survivor bias". Let's look at a few jokes first:

A school organized an outing. The teacher asked: "Anyone who hasn't arrived, raise your hand. Nobody? Good, everyone is here. Let's go!"
A CCTV reporter asked passengers on a high-speed train: "Did you manage to buy a ticket?" "Yes!" "Did you manage to buy a ticket?" "Yes!"
Why isn't mom a picky eater? Because she already did the picking when she bought the groceries!
Why do online parachute shops have such good reviews? Because people who died from a faulty parachute can't leave a bad review!
These are all examples of survivor bias. We laugh because the bias is so easy to spot. In the following cases, however, we may not reach the correct conclusion so easily:

In the 1936 United States presidential election, the magazine "Literary Digest" ran a telephone survey of 1.4 million people that showed Landon would win. How credible was this study?
Papyrus has been found in ancient Egyptian ruins, but none from the same period has been found in other Mediterranean civilizations such as Phoenicia, ancient Greece, and ancient Rome. Does this show that papyrus was widely used in Egypt but not used in the other Mediterranean civilizations?
One month after a new game launches, the game designer randomly picks highly active players to interview and uses the results to decide the core plan for the next iteration. Does this approach have a fatal flaw?
A reporter found some "primary school essays from the Republic of China era" online, written with excellent literary flair, and concluded that today's primary school language education cannot compare with that of the Republic of China era!
In fact, each of the cases above is very likely to lead to the wrong conclusion:

The 1936 election survey was conducted by telephone, and in the 1930s the telephone was still a privilege of the wealthy in the United States. These wealthy people were not a random sample of American voters. In the end Roosevelt was elected, not Landon as the magazine had predicted.
Papyrus was found in ancient Egypt and not elsewhere because the other three places (Phoenicia, ancient Greece, and ancient Rome) had a wetter climate than Egypt, which is relatively dry, and papyrus does not survive in a humid environment.
A game that has been online for a month has both retained players and churned players. Paying attention to the needs of retained players matters, but for a new game it matters even more to understand why the churned players left.
The Republic of China era primary school essays that still circulate today were necessarily among the best of their time. They are survivors and cannot represent the overall level of primary school students back then.
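The telephone-poll case can be reproduced with a small simulation. The sketch below is only an illustration: the share of phone owners and the two preference rates are made-up assumptions, not historical figures, and `share_for_challenger` is a hypothetical helper.

```python
import random

rng = random.Random(1936)  # fixed seed so the run is repeatable

# Assumed, made-up parameters (not historical data).
PHONE_OWNER_SHARE = 0.25         # share of voters who own a telephone
P_CHALLENGER_IF_PHONE = 0.65     # phone owners who prefer the challenger
P_CHALLENGER_IF_NO_PHONE = 0.30  # other voters who prefer the challenger

# Each voter is a pair: (has_phone, prefers_challenger).
voters = []
for _ in range(100_000):
    has_phone = rng.random() < PHONE_OWNER_SHARE
    p = P_CHALLENGER_IF_PHONE if has_phone else P_CHALLENGER_IF_NO_PHONE
    voters.append((has_phone, rng.random() < p))


def share_for_challenger(group):
    """Fraction of a group of voters that prefers the challenger."""
    return sum(prefers for _, prefers in group) / len(group)


phone_sample = [v for v in voters if v[0]]  # whom a phone survey can reach
random_sample = rng.sample(voters, 2_000)   # whom a random survey reaches

print(f"whole electorate: {share_for_challenger(voters):.3f}")
print(f"phone survey:     {share_for_challenger(phone_sample):.3f}")
print(f"random survey:    {share_for_challenger(random_sample):.3f}")
```

Under these assumptions the phone survey puts the challenger well above 50% even though he trails in the whole electorate, while a far smaller random sample lands close to the true share. Sample size does not repair a sample that is not random.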
In our day-to-day decisions at work, survivor bias is so common that it often influences our decisions and judgment without our noticing. So what is the essence of this concept? Where does it tend to occur? How does it work? How can we avoid it? This article discusses these questions:

History of "Survivor Bias"
"Survivor bias" comes from a famous story from World War II:

In 1941, during the Second World War, the air force was one of the most important service branches. Allied aircraft suffered heavy losses in air combat and were frequently shot down by Nazi fire. Allied headquarters secretly invited a group of physicists, mathematicians, and statisticians to form a team to study "how to reduce the probability of aircraft being shot down."

At the time, the military's senior officers tallied the bullet holes on all returning aircraft. They found that the holes were relatively dense on the wings and sparse on the fuselage and tail. So their recommendation was: strengthen the protection of the wings.



However, this suggestion was rejected by Abraham Wald, a professor of statistics at Columbia University. Professor Wald offered the exact opposite view: strengthen the protection of the fuselage and the tail.

So how did the statistician arrive at a conclusion that seems to go against common sense? Professor Wald's reasoning rested on three facts:

The statistical sample consisted only of aircraft that returned safely.
Aircraft hit in the wings many times still appeared able to return safely.
Bullet holes were rarely found on the fuselage and tail not because those parts were never hit, but because an aircraft hit there had very little chance of returning safely. In other words, the returning aircraft were survivors, and judging from survivors alone is unscientific. The overlooked non-survivors are the key: they never came back at all!
The military adopted the professor's proposal and strengthened the protection of the tail and fuselage, and the decision was later confirmed to be correct: the rate at which Allied aircraft were shot down fell significantly. This is the origin of the "survivor bias" story.
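The reasoning in this story can be illustrated with a toy simulation. It is not Professor Wald's actual method, and the loss probabilities below are invented for illustration: every section is hit equally often, but a hit to the fuselage or tail is assumed to be far more likely to bring the aircraft down than a hit to the wing.

```python
import random

SECTIONS = ["wing", "fuselage", "tail"]
# Assumed probability that a single hit in a section downs the aircraft
# (hypothetical values, not taken from any wartime report).
LOSS_PROB = {"wing": 0.05, "fuselage": 0.40, "tail": 0.40}


def simulate(n_planes=20_000, hits_per_plane=3, seed=0):
    """Count bullet holes per section on all aircraft and on returners only."""
    rng = random.Random(seed)
    all_holes = {s: 0 for s in SECTIONS}
    returned_holes = {s: 0 for s in SECTIONS}
    for _ in range(n_planes):
        # Every section is equally likely to be hit.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The aircraft returns only if it survives every one of its hits.
        survived = all(rng.random() >= LOSS_PROB[s] for s in hits)
        for s in hits:
            all_holes[s] += 1
            if survived:
                returned_holes[s] += 1
    return all_holes, returned_holes


def shares(counts):
    """Convert hole counts into each section's share of all holes."""
    total = sum(counts.values())
    return {s: round(c / total, 3) for s, c in counts.items()}


all_holes, returned_holes = simulate()
print("all aircraft:      ", shares(all_holes))
print("returning aircraft:", shares(returned_holes))
```

Across all aircraft the holes are spread evenly, but on the returning aircraft the wings carry a clearly larger share. Someone who sees only the returners would conclude that wings get hit most, when the sparse sections are in fact the dangerous ones.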

The Nature of "Survivor Bias"
In its general form, survivor bias is what statistics calls "selection bias": when compiling statistics, we neglect the randomness and completeness of the sample and use a partial sample in place of a random sample of the whole population, so that our description of the whole is biased.

In simple statistical terms: let the full set be A. We observe that a subset A1 of A has feature X; A1 is the survivors. Another subset A2 of A is not observed or is ignored, so we conclude that all of A has feature X. In fact, A2 has feature Y.



Applying this to the reporter asking about train tickets above: A is everyone who wants to buy a train ticket, A1 is the people already on the train, A2 is the people who wanted a ticket but could not get one, feature X is "bought a ticket", and feature Y is "did not buy a ticket". In other words, survivor bias replaces a random sample with a small number of conspicuous samples, which biases the statistics.
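The framework can also be written as a short formula. This is a sketch under assumed notation that does not appear in the original: w is the fraction of A that falls in A1, and p_1, p_2, and p_A are the shares of A1, A2, and all of A that have feature X.

```latex
p_A = w\,p_1 + (1 - w)\,p_2,
\qquad
p_1 - p_A = (1 - w)\,(p_1 - p_2)
```

The error made by reporting p_1 in place of p_A vanishes only when the survivors are the whole set (w = 1) or when A2 has feature X exactly as often as A1 does (p_1 = p_2). In the train example p_1 = 1, so the size of the error depends entirely on the unobserved p_2.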



With this framework, we can understand specific cases of "survivor bias" from a theoretical perspective:

Asian student case
An American student association found that Asian students far outperform their peers in mathematics. This is "survivor bias": children who are able to go to school in the United States generally come from relatively privileged educational and family backgrounds in China. If these Chinese students' mother tongue were English, their language scores would surely beat those of their American peers as well.

Hospital study case
If Beijing Chang Gung Memorial Hospital studied the eating habits of its inpatients with heart attacks and published a paper on the relationship between heart disease and diet, would the paper be credible? The answer is no! Chang Gung Memorial Hospital is a high-end private hospital in Beijing, so the dietary habits of its patients differ from those of ordinary patients. In addition, hospitalized patients cannot represent all cases (some patients die before reaching a hospital, some cannot afford hospitalization, and so on). Excluding such confounding factors is a basic principle of modern medical research.

Gym Case
I go to the company gym at noon every Monday, Wednesday, and Friday, a habit I have kept up for a long time. For a while, though, I was depressed, because I found that the colleagues I met in the company gym were basically all in better shape than me. This is a typical case of "survivor bias": it is only to be expected that people at the gym are in good shape, because people who are out of shape and do not exercise rarely go to the gym.

Octopus Paul Case
The biggest star of the 2010 World Cup was not a player but Paul the Octopus from the Oberhausen Aquarium in Germany, who miraculously predicted the results of Germany's World Cup matches correctly seven times in a row. Paul became the darling of the world's media that summer. This is in fact a typical case of "survivor bias". Many animals took part in World Cup predictions that summer: a Filipino monkey, a Mexican alpaca, an African elephant, a Bulgarian dairy cow, and even a Chinese panda. Because their predictions failed, they received no media coverage, and Paul the Octopus became the lucky one.
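A little arithmetic shows why a "psychic" animal is not surprising once enough animals are guessing. The sketch below assumes that each prediction is an independent fair coin flip (draws are ignored) and that the animals guess independently; the numbers of animals tried are hypothetical, since the article does not say how many took part.

```python
def prob_some_animal_perfect(n_animals, n_matches=7, p_correct=0.5):
    """Probability that at least one of n_animals gets every match right."""
    p_perfect = p_correct ** n_matches        # one animal, all matches right
    return 1 - (1 - p_perfect) ** n_animals   # complement of "nobody perfect"


for n in (1, 6, 50, 100, 500):
    print(f"{n:>3} animals: {prob_some_animal_perfect(n):.3f}")
```

A single animal has less than a 1% chance of seven straight hits under these assumptions, but with a hundred animals guessing, a perfect record somewhere becomes more likely than not, and that one animal is the only one the media reports.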



In the four cases above, the full set A is, respectively: all children in China, all heart patients, all my colleagues at the company, and all animals that predicted the World Cup;

The survivors A1 are: children with the means to go to school in the United States, heart patients at Chang Gung Memorial Hospital, colleagues who go to the gym, and Paul the Octopus;

Feature X is: good at mathematics, distinctive diet, good physique, accurate predictions;

Feature Y is: mediocre at mathematics, ordinary diet, average physique, inaccurate predictions.

This is the analytical framework for “survivor bias”.

Beware of abuse of "survivor bias"
Many people misunderstand the term "survivor bias", which often leads to its abuse. In the author's opinion, guarding against the abuse of "survivor bias" is just as important as guarding against "survivor bias" itself.

Many people who see entrepreneurial "success stories" in the media immediately turn up their noses: "This is survivor bias. Who knows how many failed cases there are?" They then dismiss the methods and experience of successful people altogether;

Many people who pay bribes see the news that "someone was caught for bribery" and shrug it off as survivor bias: "The media only reports on the people who got caught. There are far more who never get caught!" Then they carry on paying bribes.

How is the concept of "survivor bias" misused? Take again the reporter surveying ticket purchases on the high-speed train. Understanding "survivor bias" tells us only this much: it is unscientific for the reporter to conclude from a survey on the train that everyone managed to buy a ticket.

Note that it does not let us directly infer that the conclusion "everyone bought a ticket" must be wrong, because we have no information on whether the remaining people got tickets. Common sense suggests that during the Spring Festival rush they may well not have, while on ordinary days basically everyone who wants a high-speed rail ticket can get one. Therefore, flatly asserting that "surely some people failed to buy a ticket" is an abuse of "survivor bias". The opposite of a wrong conclusion is not necessarily correct.

From a statistical point of view, this is how survivor bias is abused: we observe that A1 has feature X and realize that survivor bias may be present, so we label A1 as survivors in advance and then directly conclude that the non-survivors A2 must not have feature X. The truth is that we simply do not know whether A2 has feature X. It may or may not.

Guarding against the abuse of "survivor bias" is very important. In fact, the story about Professor Wald told earlier is only a simplified version passed down by later generations. A little thought makes it clear that a scientifically trained professor of statistics would never state a conclusion based on intuition alone.

In fact, Professor Wald submitted eight different reports on aircraft losses. The main one is "A Method of Estimating Plane Vulnerability Based on Damage of Survivors".



The paper runs to more than 80 pages, and even a later summary of his contribution runs to more than 10 pages (reply with the keyword "Professor Wald" to the WeChat official account to get the paper). This authoritative professor, who wrote the classic "Sequential Analysis", clearly carried out a detailed and rigorous analysis of the characteristics of A2 in our framework before reaching his conclusion!

If slapping your forehead and guessing were enough to make a statistician, everyone would be one!

How do Internet professionals avoid "survivor bias"?
"Survivor bias" is a common logical error in data analysis, and data is one of the driving forces of the Internet. So how do Internet professionals avoid "survivor bias" when analyzing data and making decisions? Wei Xi sums it up in three steps:

Judge the randomness of the sample: you must know whether the sample is random.
Judge whether there is a significant difference between the sample and the remaining sample.
Analyze the remaining sample data and verify the conclusion.
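Step 2 asks whether the sample differs significantly from the remaining sample. When some data on the remaining users can be collected, one common way to check this is a two-proportion z-test. The sketch below is one possible choice rather than the author's method; `two_proportion_z` is a hypothetical helper and the counts are invented.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """z statistic for the difference between two proportions (pooled)."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err


# Hypothetical counts: users showing some behavior in the observed sample
# versus the remaining, initially unobserved users.
z = two_proportion_z(hits_a=120, n_a=1000, hits_b=60, n_b=1000)
differs = abs(z) > 1.96  # roughly the 5% two-sided threshold
print(f"z = {z:.2f}, significantly different: {differs}")
```

If the two groups differ this clearly, a conclusion drawn from the observed sample alone should not be extended to everyone. The function assumes both groups are non-empty and that the pooled rate is strictly between 0 and 1.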
Let's work through a few cases for practice:

WeChat official account tipping case
"Wei Xi Liao Advertisement" runs a WeChat official account and also a Weibo account, "Wei Xijun". I noticed that for the same article with similar read counts, tips on WeChat were especially low while tips on Weibo were higher. So I initially concluded that WeChat followers were less inclined to tip than Weibo followers, until I remembered that WeChat's iOS users could not tip because of Apple's policy restrictions, and I realized my earlier guess was biased. So I tried adding an iOS tipping code at the end of my last two articles, and the tips on those articles increased nearly fourfold.

The standard three-step approach to avoiding survivor bias in this case is:

Judge the randomness of the sample, that is, check whether the users who tip on the WeChat official account can represent the whole. The answer is no, because only Android users are covered;

Judge whether there is a significant difference between the sample and the remaining sample, that is, whether Android and iOS users differ in tipping. The answer is: they may;

Analyze the remaining sample data and verify the conclusion, that is, add the iOS tipping code and check the result again.

Video website case
A video site launched a new American TV series in its VIP section. Viewership for each episode had been stable, but when the seventh episode aired there was a fairly significant drop. The operations team began to ask whether the plot had suddenly gone downhill from the seventh episode, or whether the protagonist had suddenly been killed off. However, when they looked carefully at the users who had left, they found that the loss came from a large batch of memberships, given away for free three months earlier, that happened to expire just as the seventh episode aired. Regular members had not left at all.

In this case, the three steps are as follows: 1. Judge the randomness of the sample, that is, analyze whether the lost users are a random sample of all members. The answer is no: the users who left were the free members. 2. Is there a significant difference between the sample and the remaining sample, that is, between free members and regular members? Of course there is. 3. Analyze the remaining sample data and verify the conclusion, that is, check whether regular members were lost as well.
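In code, the check in this case amounts to splitting the drop by membership type instead of looking only at the overall figure. The records below are invented for illustration, and `churn_by_segment` is a hypothetical helper, not the site's actual tooling.

```python
from collections import defaultdict

# Hypothetical viewer records: (membership type, watched ep. 6, watched ep. 7).
viewers = (
    [("regular", True, True)] * 800
    + [("free", True, False)] * 180
    + [("free", True, True)] * 20
)


def churn_by_segment(records):
    """Share of episode-6 viewers in each segment who skipped episode 7."""
    watched = defaultdict(int)
    lost = defaultdict(int)
    for member_type, saw_ep6, saw_ep7 in records:
        if saw_ep6:
            watched[member_type] += 1
            if not saw_ep7:
                lost[member_type] += 1
    return {m: lost[m] / watched[m] for m in watched}


watched_total = sum(1 for _, saw_ep6, _ in viewers if saw_ep6)
lost_total = sum(1 for _, saw_ep6, saw_ep7 in viewers if saw_ep6 and not saw_ep7)
overall_churn = lost_total / watched_total
segment_churn = churn_by_segment(viewers)

print(f"overall churn: {overall_churn:.2f}")
print(f"by segment:    {segment_churn}")
```

The overall figure alone suggests that something went wrong with the episode. The per-segment figures show that, in this made-up data, the entire drop sits in the free memberships.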

Facebook video advertising case
The September 2016 revelation that Facebook had miscounted its video ad metrics became a low point in the company's advertising history. Facebook acknowledged on its official blog that, in the data reports it gave advertisers, the average viewing duration of video ads counted only views that lasted longer than 3 seconds. That is, if a video was played for less than 3 seconds, Facebook simply left it out. Obviously, this inflated the average viewing duration reported to advertisers, because the short views were not counted. The error persisted for as long as two years.

In this case, the analysis still takes three steps: 1. Judge the randomness of the sample. Come on! Everything under 3 seconds was dropped, so of course it is not random! 2. Is there a significant difference between the sample and the remaining sample? Come on, views under 3 seconds and views of 3 seconds or more are obviously different! 3. Analyze the remaining sample data and verify the conclusion. This one... hardly needs verifying!
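The effect of dropping short views is easy to see with a few numbers. The sketch below follows the article's description of the error (only views of at least 3 seconds are averaged); the durations are invented, and `average_duration` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical play durations, in seconds, for one video ad.
durations = [1, 1, 2, 2, 2, 4, 10, 30]


def average_duration(plays, min_seconds=0):
    """Average duration over the plays that last at least min_seconds."""
    counted = [d for d in plays if d >= min_seconds]
    return mean(counted)


true_average = average_duration(durations)         # every play counted
reported_average = average_duration(durations, 3)  # short plays dropped

print(f"all plays:        {true_average:.2f} s")
print(f"plays >= 3 s only: {reported_average:.2f} s")
```

With this made-up data the filtered average is more than twice the true one, because the many short plays are exactly the ones that were removed.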

The analysis above presupposes a deep understanding of our own business. Only if you deeply understand the specific factors that matter in your business can you make correct guesses and judgments.

So far we have covered survivor bias from theory through to practice. At this point some readers will ask Wei Xi: on which Chinese Internet platform is the content most likely to suffer from survivor bias? Ha ha, as if you didn't already know!
