Statistics
Branch of mathematics that deals with the collection, organization,
and analysis of numerical data and with such problems as experiment
design and decision making.
History
Simple forms of statistics have been used since the beginning of civilization,
when pictorial representations or other symbols were used to record
numbers of people, animals, and inanimate objects on skins, slabs,
or sticks of wood and the walls of caves. Before 3000 BC
the Babylonians used small clay tablets to record tabulations of agricultural
yields and of commodities bartered or sold. The Egyptians analyzed
the population and material wealth of their country before beginning
to build the pyramids in the 31st century BC.
The biblical books of Numbers and 1 Chronicles are primarily statistical
works, the former containing two separate censuses of the Israelites
and the latter describing the material wealth of various Jewish tribes.
Similar numerical records existed in China before 2000 BC. The ancient Greeks held censuses to
be used as bases for taxation as early as 594 BC.
See Census.
The Roman Empire was the first government to gather
extensive data about the population, area, and wealth of the territories
that it controlled. During the Middle Ages in Europe few comprehensive
censuses were made. The Carolingian kings Pepin the Short and Charlemagne
ordered surveys of ecclesiastical holdings: Pepin in 758 and Charlemagne
in 762. Following the Norman Conquest of England in 1066, William
I, king of England, ordered a census
to be taken; the information gathered in this census, conducted in
1086, was recorded in the Domesday Book. Registration of
deaths and births was begun in England in the early 16th
century, and in 1662 the first noteworthy statistical study of population,
Observations on the London Bills
of Mortality, was written. A similar study of mortality made in
Breslau, Germany, in 1691 was used
by the English astronomer Edmund Halley as a basis for the earliest
mortality table. In the 19th century, with the application of the
scientific method to all phenomena in the natural and social sciences,
investigators recognized the need to reduce information to numerical
values to avoid the ambiguity of verbal description.
At present, statistics is a reliable means of describing accurately the
values of economic, political, social, psychological, biological,
and physical data and serves as a tool to correlate and analyze such
data. The work of the statistician is no longer confined to gathering
and tabulating data, but is chiefly a process of interpreting the
information. The development of the theory of probability increased the scope of statistical applications. Much
data can be approximated accurately by certain probability distributions,
and the results of probability distributions can be used in analyzing
statistical data. Probability can be used to test the reliability
of statistical inferences and to indicate the kind and amount of data
required for a particular problem.
Statistical Methods
The raw materials of statistics are sets of numbers obtained from enumerations
or measurements. In collecting
statistical data, adequate precautions must be taken to secure complete
and accurate information.
The first problem of the statistician is to determine what and how much
data to collect. The problem of the census taker in obtaining an accurate
and complete count of the population, like the problem of the physicist who wishes
to count the number of molecular collisions per second in a given volume
of gas under given conditions, is to decide the precise nature of
the items to be counted.
The statistician faces a complex problem when, for example, he or she wishes
to take a sample poll or straw vote. It is no simple matter to gauge the size and
constitution of the sample that will yield reasonably accurate predictions
concerning the action of the total population.
In protracted studies to establish a physical, biological, or social law,
the statistician may start with one set of data and gradually modify
it in light of experience. For
example, in early studies of the growth of populations, future change
in size of population was predicted by calculating the excess of births
over deaths in any given period.
Population statisticians soon recognized that rate of increase ultimately
depends on the number of births, regardless of the number of deaths,
so they began to calculate future population growth on the basis of
the number of births each year per 1000 population. When
predictions based on this method yielded inaccurate results, statisticians
realized that other limiting factors exist in population growth. Because
the number of births possible depends on the number of women rather
than the total population, and because women bear children during
only part of their total lifetime, the basic datum used to calculate
future population size is now the number of live births per 1000 females
of childbearing age.
The predictive value of this basic datum can be further refined by combining
it with other data on the percentage of women who remain childless
because of choice or circumstance, sterility, contraception, death
before the end of the childbearing period, and other limiting factors.
The excess of births over deaths, therefore, is meaningful only as
an indication of gross population growth over a definite period in
the past; the number of births per 1000 population is meaningful only
as an expression of the proportion of increase during a similar period;
and the number of live births per 1000 women of childbearing age is
meaningful for predicting future size of populations.
Tabulation and Presentation of Data
The collected data must be arranged, tabulated, and presented to permit
ready and meaningful analysis and interpretation. To study and interpret
the examination-grade distribution in a class of 30 pupils, for instance,
the grades are arranged in ascending order: 30, 35, 43, 52, 61, 65,
65, 65, 68, 70, 72, 72, 73, 75, 75, 76, 77, 78, 78, 80, 83, 85, 88,
88, 90, 91, 96, 97, 100, 100. This progression shows at a glance that the maximum is
100, the minimum 30, and the range, or difference, between the maximum
and minimum is 70.
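As a minimal sketch in Python (the variable names are illustrative and not part of the original text), the ordering, maximum, minimum, and range of these 30 grades can be checked as follows:

```python
# The 30 examination grades listed in the text.
grades = [30, 35, 43, 52, 61, 65, 65, 65, 68, 70, 72, 72, 73, 75, 75,
          76, 77, 78, 78, 80, 83, 85, 88, 88, 90, 91, 96, 97, 100, 100]

ordered = sorted(grades)            # ascending order, as in the text
highest = ordered[-1]               # maximum grade
lowest = ordered[0]                 # minimum grade
grade_range = highest - lowest      # difference between maximum and minimum

print(highest, lowest, grade_range)
```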
In a cumulative-frequency graph, such as Fig. 1, the grades are marked on
the horizontal axis and double marked on the vertical axis with the
cumulative number of the grades on the left and the corresponding
percentage of the total number on the right. Each dot represents the
accumulated number of students who have attained a particular grade
or less. For example, the dot A corresponds to the second 72; reading
on the vertical axis, it is evident that there are 12, or 40 percent,
of the grades equal to or less than 72.
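The reading taken from the graph can be reproduced by direct counting. The sketch below assumes the same 30 grades; the helper name cumulative_count is hypothetical.

```python
grades = [30, 35, 43, 52, 61, 65, 65, 65, 68, 70, 72, 72, 73, 75, 75,
          76, 77, 78, 78, 80, 83, 85, 88, 88, 90, 91, 96, 97, 100, 100]

def cumulative_count(values, threshold):
    """Number of values equal to or less than the threshold."""
    return sum(1 for v in values if v <= threshold)

count_72 = cumulative_count(grades, 72)
percent_72 = 100 * count_72 / len(grades)   # share of all grades, in percent

print(count_72, percent_72)
```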
In analyzing the grades received by 10 sections of 30 pupils each on four
examinations, a total of 1200 grades, the amount of data is too large
to be exhibited conveniently as in Fig. 1. The statistician separates
the data into suitably chosen groups, or intervals. For example, ten
intervals might be used to tabulate the 1200 grades, as in column
(a) of the accompanying frequency-distribution table; the actual number
in an interval, called the frequency of the interval, is entered in
column (c).
The numbers that define the interval range are called the interval boundaries.
It is convenient to choose the interval boundaries so that the interval
ranges are equal to each other and so that the interval midpoints, half the sum
of the interval boundaries, are simple numbers, because the midpoints are used
in many calculations. A grade such as 87 will be tallied in the 80-90
interval; a boundary grade such as 90 may be tallied in either the lower or
the upper interval, provided the same rule is applied uniformly to all groups.
The relative frequency, column (d), is the ratio of the frequency of an
interval to the total count; the relative frequency is multiplied
by 100 to obtain the percent relative frequency.
The cumulative frequency, column (e), represents the number of students
receiving grades equal to or less than the upper boundary of each succeeding
interval; thus, the number of students with grades of 30 or less is
obtained by adding the frequencies in column (c) for the first three
intervals, which total 53.
The cumulative relative frequency, column (f), is the ratio of the cumulative
frequency to the total number of grades.
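The columns described above can be sketched in Python. Because the 1200 grades of the table are not reproduced here, the sketch uses the 30 grades of the earlier example instead. It assumes ten intervals of width 10 and tallies a boundary grade in the lower interval; both choices, like the variable names, are illustrative assumptions.

```python
from itertools import accumulate

grades = [30, 35, 43, 52, 61, 65, 65, 65, 68, 70, 72, 72, 73, 75, 75,
          76, 77, 78, 78, 80, 83, 85, 88, 88, 90, 91, 96, 97, 100, 100]

width = 10
boundaries = list(range(0, 101, width))                  # 0, 10, ..., 100
intervals = list(zip(boundaries[:-1], boundaries[1:]))   # ten (lower, upper) pairs

def in_interval(grade, index, lower, upper):
    """Assumed rule: a boundary grade belongs to the lower interval."""
    if index == 0 and grade == lower:    # a grade of 0 goes in the first interval
        return True
    return lower < grade <= upper

# Frequency: the number of grades falling in each interval.
frequencies = [sum(1 for g in grades if in_interval(g, i, lo, hi))
               for i, (lo, hi) in enumerate(intervals)]
total = sum(frequencies)

relative = [f / total for f in frequencies]              # relative frequency
cumulative = list(accumulate(frequencies))               # cumulative frequency
cumulative_relative = [c / total for c in cumulative]    # cumulative relative frequency

for (lo, hi), f, r, c, cr in zip(intervals, frequencies, relative,
                                 cumulative, cumulative_relative):
    print(f"{lo:3d}-{hi:<3d} {f:3d} {r:6.3f} {c:3d} {cr:6.3f}")
```

Multiplying each relative frequency by 100 gives the percent relative frequency.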

The data of a frequency-distribution table can be presented graphically
in a frequency histogram, as in Fig. 2, or a cumulative-frequency
polygon, as in Fig. 3. The histogram is a series of rectangles with
bases equal to the interval ranges and areas proportional to the frequencies.
The polygon in Fig. 3 is drawn by connecting with straight lines the
interval midpoints of a cumulative frequency histogram.

Newspapers and other printed media frequently present statistical data pictorially
by using different lengths or sizes of various symbols to indicate
different values.
Measures of Central Tendency
After data have been collected and tabulated, analysis begins with the calculation
of a single number, which will summarize or represent all the data.
Because data often exhibit a cluster or central point, this number
is called a measure of central tendency.
Let $x_1, x_2, \ldots, x_n$ be the n tabulated
(but ungrouped) numbers of some statistic; the most frequently used
measure is the simple arithmetic average, or mean, written $\bar{x}$, which is the sum of the numbers divided by n:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
If the x's are grouped into k intervals, with midpoints
$m_1, m_2, \ldots, m_k$ and frequencies $f_1, f_2, \ldots, f_k$,
respectively, the simple arithmetic average is given by
\[ \bar{x} = \frac{\sum f_i m_i}{\sum f_i} \]
with i = 1, 2, …, k.
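Both forms of the mean can be sketched in Python using the 30 grades of the earlier example. The grouping is an assumption made only for illustration: ten intervals of width 10, with a boundary grade tallied in the lower interval.

```python
grades = [30, 35, 43, 52, 61, 65, 65, 65, 68, 70, 72, 72, 73, 75, 75,
          76, 77, 78, 78, 80, 83, 85, 88, 88, 90, 91, 96, 97, 100, 100]

# Ungrouped mean: the sum of the numbers divided by n.
mean_ungrouped = sum(grades) / len(grades)

# Assumed grouping into intervals 0-10, 10-20, ..., 90-100.
midpoints = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
frequencies = [0, 0, 1, 1, 1, 1, 6, 10, 5, 5]

# Grouped mean: each midpoint weighted by the frequency of its interval.
mean_grouped = (sum(f * m for f, m in zip(frequencies, midpoints))
                / sum(frequencies))

print(round(mean_ungrouped, 2), round(mean_grouped, 2))
```

The grouped mean (about 73.3) only approximates the ungrouped mean (about 74.3), because every grade is replaced by the midpoint of its interval.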
The median and the mode are two other measures of central tendency. Let
the x's
be arranged in numerical order; if n
is odd, the median is the middle x;
if n is even, the median is the average of
the two middle x's.
The mode is the x that occurs
most frequently. If two or more distinct x's occur with equal frequencies, but none with greater frequency,
the set of x's
may be said not to have a mode or to be bimodal, with modes at the
two most frequent x's,
or trimodal, with modes at the three most
frequent x's.
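The median and mode can be sketched with Python's standard statistics module, again using the 30 grades of the earlier example. The multimode function requires Python 3.8 or later, and the second data set is purely illustrative.

```python
import statistics

grades = [30, 35, 43, 52, 61, 65, 65, 65, 68, 70, 72, 72, 73, 75, 75,
          76, 77, 78, 78, 80, 83, 85, 88, 88, 90, 91, 96, 97, 100, 100]

# n = 30 is even, so the median is the average of the two middle values.
median_grade = statistics.median(grades)

# multimode returns every value tied for the highest frequency.
modes = statistics.multimode(grades)

# Illustrative bimodal set: 1 and 2 each occur twice.
bimodal_example = statistics.multimode([1, 1, 2, 2, 3])

print(median_grade, modes, bimodal_example)
```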
Measures of Variability
The investigator frequently is concerned with the variability of the distribution,
that is, whether the measurements are clustered tightly around the
mean or spread over the range. One measure of this variability is
the difference between two percentiles, usually the 25th and the 75th
percentiles. The pth
percentile is a number such that p
percent of the measurements are less than or equal to it; in particular,
the 25th and the 75th percentiles are called the lower and upper quartiles,
respectively. The pth percentile is readily found from the cumulative-frequency
graph (Fig. 1) by running a horizontal line from the p percent mark on the vertical axis to the graph, then a vertical
line from this point on the graph to the horizontal axis; the abscissa
of the intersection is the value of the pth percentile.
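Without a graph, a percentile can be computed directly from the ordered data. The sketch below follows the definition given above by taking the smallest data value that has at least p percent of the measurements at or below it. This is one convention among several in common use, and the function name is hypothetical.

```python
import math

grades = [30, 35, 43, 52, 61, 65, 65, 65, 68, 70, 72, 72, 73, 75, 75,
          76, 77, 78, 78, 80, 83, 85, 88, 88, 90, 91, 96, 97, 100, 100]

def percentile(values, p):
    """Smallest data value with at least p percent of the values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))   # 1-based position
    return ordered[rank - 1]

lower_quartile = percentile(grades, 25)
upper_quartile = percentile(grades, 75)
quartile_difference = upper_quartile - lower_quartile

print(lower_quartile, upper_quartile, quartile_difference)
```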
The standard deviation is a measure of variability that is more convenient
than percentile differences for further investigation and analysis
of statistical data. The standard deviation of a set of measurements
$x_1, x_2, \ldots, x_n$, with the mean $\bar{x}$,
is defined as the square root of the mean of the squares of the deviations;
it is usually designated by the Greek letter sigma (σ). In symbols
\[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
The square, σ², of the standard deviation is called the variance.
If the standard deviation is small, the measurements are tightly clustered
around the mean; if it is large, they are widely scattered.
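A minimal Python sketch of this definition follows. It divides by n, as the definition above states, rather than by n - 1 as is common for samples. The data set is illustrative and does not come from the text.

```python
import math
import statistics

def population_std(values):
    """Square root of the mean of the squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(variance), variance

measurements = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data; the mean is 5
sigma, variance = population_std(measurements)

# The standard library's pstdev uses the same divide-by-n definition.
print(sigma, variance, statistics.pstdev(measurements))
```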
Correlation
When two social, physical, or biological phenomena increase or decrease
proportionately and simultaneously because of identical external factors,
the phenomena are correlated positively; under the same conditions,
if one increases in the same proportion that the other decreases,
the two phenomena are negatively correlated. Investigators calculate
the degree of correlation by applying a coefficient of correlation
to data concerning the two phenomena. The most common correlation
coefficient is expressed as
\[ r = \frac{\sum xy}{N \sigma_x \sigma_y} \]
in which x is the deviation of one variable from its mean, y is the deviation of the other variable from its mean, N is the total number of cases in the series, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the two variables.
A perfect positive correlation between the two variables results in
a coefficient of +1, a perfect negative correlation in a coefficient
of -1, and a total absence of correlation in a coefficient of 0. Intermediate
values between +1 and 0 or -1 are interpreted by degree of correlation.
Thus, .89 indicates high positive correlation, -.76 high negative
correlation, and .13 low positive correlation.
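The sketch below assumes that the coefficient meant is the Pearson product-moment coefficient: the sum of the products of the paired deviations, divided by N times the two standard deviations. The data sets and function name are illustrative. The function is undefined when either variable is constant, since its standard deviation is then zero.

```python
import math

def correlation(xs, ys):
    """Pearson product-moment coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # deviations from the mean of x
    dy = [y - mean_y for y in ys]          # deviations from the mean of y
    sigma_x = math.sqrt(sum(d * d for d in dx) / n)
    sigma_y = math.sqrt(sum(d * d for d in dy) / n)
    return sum(a * b for a, b in zip(dx, dy)) / (n * sigma_x * sigma_y)

xs = [1, 2, 3, 4, 5]
r_positive = correlation(xs, [2, 4, 6, 8, 10])    # y rises in proportion to x
r_negative = correlation(xs, [10, 8, 6, 4, 2])    # y falls in proportion to x
r_none = correlation(xs, [2, 1, 3, 1, 2])         # no linear relation

print(r_positive, r_negative, r_none)
```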
Mathematical Models
A mathematical model is a mathematical idealization in the form of a system,
proposition, formula, or equation of a physical, biological, or social
phenomenon. Thus, a theoretical, perfectly balanced die that can be
tossed in a purely random fashion is a mathematical model for an actual
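physical die. The probability that in n
throws of a mathematical die a throw of 6 will occur k
times is
\[ P(k) = \binom{n}{k} \left(\frac{1}{6}\right)^{k} \left(\frac{5}{6}\right)^{n-k} \]
in which $\binom{n}{k}$ is the symbol for the binomial coefficient
\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \]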
The statistician confronted with a real physical die will devise an experiment,
such as tossing the die n times and repeating this run N times, for a total of Nn tosses, and then determine from the observed throws the
likelihood that the die is balanced and that it was thrown in a random
way.
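For the idealized die, the probability of exactly k sixes in n throws follows the binomial law with a success probability of 1/6. The sketch below uses exact fractions to avoid rounding; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_sixes(n, k, p=Fraction(1, 6)):
    """Probability of exactly k sixes in n throws of a balanced die."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

two_sixes_in_two = prob_sixes(2, 2)                        # both throws show a 6
no_six_in_one = prob_sixes(1, 0)                           # a single throw, no 6
total_for_ten = sum(prob_sixes(10, k) for k in range(11))  # all outcomes together

print(two_sixes_in_two, no_six_in_one, total_for_ten)
```

Comparing such model probabilities with the observed proportions of runs containing k sixes is one way to judge whether a physical die behaves like the model.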
In a related but more involved example of a mathematical model, many sets
of measurements have been found to have the same type of frequency
distribution. For example, let $x_1, x_2, \ldots, x_N$ be the number of 6's cast
in the N respective runs of n tosses of a die and assume N to be moderately large.
Let $y_1, y_2, \ldots, y_N$ be the weights, correct to the nearest 1/100 g, of N lima beans chosen haphazardly
from a 100-kg bag of lima beans. Let $z_1, z_2, \ldots, z_N$
be the barometric pressures recorded to the nearest 1/1000 cm by N students in succession, reading the same
barometer. It will be observed that the x's, y's, and z's have amazingly similar frequency
patterns.
The statistician adopts a model that is a mathematical prototype or idealization
of all these patterns or distributions. One form of the mathematical
model is an equation for the frequency distribution, in which N is assumed to be infinite:
\[ y = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \]
in which e (approximately 2.718) is the base for natural logarithms (see Logarithm). The graph of this equation (Fig. 4) is the
bell-shaped curve called the normal, or Gaussian, probability curve.
If a variate x is normally
distributed, the probability that its value lies between a and b is given by
\[ P(a \le x \le b) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2} \, dt \]
The mean of the x's
is 0, and the standard deviation is 1. In practice, if N is large, the error is exceedingly small.
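For the normal curve with mean 0 and standard deviation 1, the probability of falling between a and b can be evaluated with the error function in Python's math module. The function names in this sketch are hypothetical.

```python
import math

def standard_normal_cdf(x):
    """Probability that a standard normal variate is less than or equal to x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_between(a, b):
    """Probability that a standard normal variate lies between a and b."""
    return standard_normal_cdf(b) - standard_normal_cdf(a)

within_one_sigma = prob_between(-1.0, 1.0)     # roughly 0.68
within_1_96_sigma = prob_between(-1.96, 1.96)  # roughly 0.95

print(round(within_one_sigma, 4), round(within_1_96_sigma, 4))
```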

Tests of Reliability
The statistician is often called upon to decide whether an assumed hypothesis
for some phenomenon is valid or not. The assumed hypothesis leads
to a mathematical model; the model, in turn, yields certain predicted
or expected values, for example, 10, 15, 25. The corresponding actually
observed values are 12, 16, 21. To determine
whether the hypothesis is to be kept or rejected, these deviations
must be judged as normal fluctuations caused by sampling techniques
or as significant discrepancies. Statisticians have devised several
tests for the significance or reliability of data. One is the chi-square
(χ²) test. The deviations (observed values minus expected
values) are squared, divided by the expected values, and summed:
\[ \chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \]
The value of χ² is then compared with values in a statistical
table to determine the significance of the deviations.
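The calculation can be sketched in Python for the expected values 10, 15, 25 and the observed values 12, 16, 21 quoted above. The function name is hypothetical. The comparison at the end rests on assumptions the text does not state: two degrees of freedom and a 5 percent significance level.

```python
expected = [10, 15, 25]
observed = [12, 16, 21]

def chi_square(observed_values, expected_values):
    """Sum of squared deviations, each divided by its expected value."""
    return sum((o - e) ** 2 / e
               for o, e in zip(observed_values, expected_values))

chi2 = chi_square(observed, expected)   # 4/10 + 1/15 + 16/25

# Assumed table entry: about 5.991 for 2 degrees of freedom at the 5 percent level.
critical_value = 5.991
deviations_significant = chi2 > critical_value

print(round(chi2, 3), deviations_significant)
```

Under these assumptions the statistic of about 1.11 lies well below the tabulated value, so the deviations would be judged ordinary sampling fluctuations.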
Higher Statistics
The statistical methods described above are the simpler, more commonly used
methods in the physical, biological, and social sciences. More advanced
methods, often involving advanced mathematics, are used in further
statistical studies, such as sampling theory, inference and estimation
theory, and design of experiments.