Samples, Statistics and Parameters
The totality of observations with which we are concerned, whether
finite or infinite, constitutes what we call a population. The word
population once referred to observations on people, but today
it applies to measurements of any entities of interest, whether they
be people, animals, plants or objects. The number of entities that
comprise the population is called the size of the population which,
in many circumstances, may be regarded as infinite.
A sample is a subset of the population. A random sample is a sample
taken in such a way that each element of the population has the same
probability of being selected. We take random samples to ensure that
our samples are representative of the population from which they are
taken, so that what we learn from the study of the samples will be more
or less true of the populations themselves. Statistics calculated from
random samples, such as the sample mean, provide unbiased estimates of
the corresponding true values for the population.
Samples should be considered fuzzy snapshots of the populations from
which they are drawn, with the degree of fuzziness diminishing as
the intensity of sampling or the sample size increases.
Were we to examine an entire population, the average that we calculate
would be called the population mean. The population mean is a fixed
figure characteristic of the population. It is not subject to
variation -- no matter who calculates such a mean or how many times
it is calculated, barring mistakes, the same value will be obtained
in each case. As such, the population mean is called a parameter.
Alternatively, time or resources may not permit an examination of
the entire population, and we might choose instead to select
and examine a random sample of, say, 100 items. We could then measure
them and calculate the sample mean. The sample mean differs from the
population mean in that it is subject to natural variation or,
as it is called, sampling error. If we were to repeat our sampling
procedure by selecting another 100 items, we would obtain a second value
for the mean that differed from the first one, and if we repeated
the procedure once more, a third value would most likely result. For
this reason, the sample mean is called a statistic.
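The distinction between a parameter and a statistic can be illustrated with a small simulation. The following is a minimal sketch, not a prescribed method: the population of 10,000 normally distributed values, its mean of 50 and standard deviation of 10, and the seed are all assumptions chosen purely for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical finite population of 10,000 measurements (illustrative assumption)
population = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

# Parameter: calculated from the entire population, so it is a fixed value
population_mean = statistics.fmean(population)

# Statistic: calculated from a random sample of 100 items.
# rng.sample draws without replacement, giving every element the same
# chance of selection. Repeating the procedure gives a new value each time.
sample_means = [statistics.fmean(rng.sample(population, 100)) for _ in range(3)]

print(f"population mean: {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean: {m:.2f} (sampling error {m - population_mean:+.2f})")
```

Recalculating the population mean always returns the same figure, whereas each new sample of 100 items yields a slightly different sample mean.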
The sample mean is said to estimate the population mean.
Statistics, calculated from samples, are estimates of true
population parameters. An estimate generally becomes more precise as
the sample size increases.
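The effect of sample size on the quality of the estimate can be sketched by repeating the sampling procedure many times at several sample sizes and measuring how widely the sample means scatter. The population, the sample sizes (10, 100 and 1000), the 500 repetitions and the helper name spread_of_sample_means are all hypothetical choices made for this illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 measurements (illustrative assumption)
population = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

def spread_of_sample_means(sample_size, repeats=500):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

# Scatter of the sample mean at three sample sizes
spreads = {n: spread_of_sample_means(n) for n in (10, 100, 1000)}

for n, s in spreads.items():
    print(f"sample size {n:>4}: spread of sample means = {s:.2f}")
```

The scatter of the sample means shrinks as the sample size grows, which is the sense in which the snapshot becomes less fuzzy.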
When we study samples, we are seldom directly interested in them per
se. We study them to learn something of the population from which
the samples were drawn. We infer properties of the entire population,
which we have not studied in its entirety, from our detailed knowledge
of a sample of observations. The convenience of studying finite samples
rather than the population as a whole comes at a cost. Samples, because
they are finite and often relatively small, are somewhat akin to a
fuzzy snapshot -- the general impression of the population is evident,
but the sample is an inexact representation. No matter how intensively
we study the sample, some uncertainty will remain whenever we
extrapolate our findings to the entire population.
This uncertainty is often referred to as sampling error.
Sampling error has important practical consequences, namely:
- Sample statistics will typically differ somewhat from the
  corresponding true values for the entire population. Estimating by
  how much they differ is a problem addressed under the heading of
  parameter estimation.
- Any two samples, even if taken from identical populations,
  will typically differ in all of their statistics. Determining whether
  the observed difference in sample statistics is great enough to conclude
  that the true population values differ is a problem addressed under
  the heading of hypothesis testing.
Because they deal with making inferences about population parameters
on the basis of sample statistics, parameter estimation and hypothesis
testing are grouped under the broader heading of statistical inference.
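Both problems can be given a rough, concrete form. The sketch below is one possible illustration under stated assumptions, not the method this text goes on to develop: it uses an approximate 95% interval for the mean (sample mean plus or minus 1.96 standard errors, a normal approximation) for parameter estimation, and a simple permutation test for hypothesis testing. The two samples, their size of 100, the seed and the helper name draw_sample are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

def draw_sample(n):
    """Draw n values from a hypothetical population (mean 50, sd 10)."""
    return [rng.gauss(50.0, 10.0) for _ in range(n)]

# Two samples taken from identical populations
sample_a = draw_sample(100)
sample_b = draw_sample(100)

# Parameter estimation: approximate 95% interval for the population mean,
# assuming the sample mean is roughly normally distributed
mean_a = statistics.fmean(sample_a)
se_a = statistics.stdev(sample_a) / math.sqrt(len(sample_a))
interval = (mean_a - 1.96 * se_a, mean_a + 1.96 * se_a)

# Hypothesis testing: permutation test on the difference in sample means.
# Shuffle the pooled values and count how often a random split gives a
# difference at least as large as the one observed.
observed = abs(mean_a - statistics.fmean(sample_b))
pooled = sample_a + sample_b
n_perm = 2000
extreme = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = abs(statistics.fmean(pooled[:100]) - statistics.fmean(pooled[100:]))
    if diff >= observed:
        extreme += 1
p_value = extreme / n_perm

print(f"sample A mean: {mean_a:.2f}, approx. 95% interval: "
      f"({interval[0]:.2f}, {interval[1]:.2f})")
print(f"observed difference in means: {observed:.2f}, p-value: {p_value:.3f}")
```

A large p-value indicates that a difference of the observed size arises readily from sampling error alone, so it gives no grounds for concluding that the population values differ.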