In a skewed distribution with a long tail, a high-frequency population is followed by a low-frequency population that gradually tails off asymptotically.
Rule of thumb: the majority of occurrences (more than half, and 80% when the Pareto principle applies) are accounted for by the first 20% of items in the distribution. In a long-tailed distribution, the least frequently occurring 80% of items weigh more, as a proportion of the total population, than they would in a distribution without a long tail.
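The rule of thumb can be checked on any list of item counts. The sketch below is illustrative only: the function name `top_share` and the sample counts are hypothetical, not taken from a real dataset.

```python
def top_share(counts, fraction=0.2):
    """Share of all occurrences held by the most frequent `fraction` of items."""
    if not counts:
        raise ValueError("counts must not be empty")
    ordered = sorted(counts, reverse=True)
    # Number of items in the "head"; always keep at least one item.
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical item counts: 2 popular items followed by 8 rare ones.
counts = [500, 300, 40, 40, 30, 30, 20, 20, 10, 10]
head = top_share(counts)  # share held by the top 20% of items
tail = 1 - head           # share held by the remaining 80% of items
```

With these made-up counts the top 20% of items hold 80% of occurrences, which is the Pareto case. A heavier tail would push `head` down and `tail` up.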
Example: Natural language
- Given some corpus of natural language, the frequency of any word is approximately inversely proportional to its rank in the frequency table (Zipf's law). The most frequent word occurs about twice as often as the second most frequent, three times as often as the third most frequent, and so on. “The” accounts for about 7% of all word occurrences (roughly 70,000 per 1 million words). “of” accounts for about 3.5%, followed by “and”… Only 135 vocabulary items are needed to account for half of the English corpus.
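The rank-frequency relationship above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and an ideal 1/rank law. The helper names `rank_frequency` and `zipf_expected` are hypothetical, and real corpora only follow the law approximately.

```python
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(text.lower().split())
    return [
        (rank, word, n)
        for rank, (word, n) in enumerate(counts.most_common(), start=1)
    ]


def zipf_expected(top_count, rank):
    """Count predicted for `rank` under an ideal 1/rank law."""
    return top_count / rank


# Figure from the text: "the" occurs about 70000 times per 1 million words.
second = zipf_expected(70000, 2)  # 35000.0, i.e. 3.5% of 1 million words
third = zipf_expected(70000, 3)   # roughly 23333
```

The predicted count for rank 2 matches the 3.5% quoted for “of”. Running `rank_frequency` on a real text and comparing each count with `zipf_expected` shows how closely that text follows the law.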
Other examples: allocation of wealth among individuals (the larger portion of the wealth of any society is controlled by a smaller percentage of the people), file size distribution of Internet traffic, hard disk error rates, values of oil reserves in a field (a few large fields, many small ones), sizes of sand particles, and sizes of meteorites.
In classification and regression problems, this is an issue for models that assume linearity: the data typically needs a monotone transformation (logarithm…) before fitting. Sampling can make the data even more unbalanced, because rare tail values are easily under-represented or missed.
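As a rough illustration of the monotone transformation, the sketch below compares the skewness of synthetic long-tailed data before and after taking logarithms. The data is hypothetical (log-normal draws from the standard library), and `skewness` is a plain moment-based helper written for this example, not a library function.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based skewness: mean cubed deviation over cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical long-tailed data: log-normal values are strictly positive,
# so the logarithm is defined for every point.
data = [random.lognormvariate(0, 1) for _ in range(10_000)]
logged = [math.log(v) for v in data]

raw_skew = skewness(data)    # strongly positive: long right tail
log_skew = skewness(logged)  # close to zero: roughly symmetric
```

The logarithm preserves the ordering of the values while compressing the tail, so the transformed data is much closer to symmetric. For data that contains zeros, a shifted variant such as `math.log1p` is a common substitute.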