http://www.gla.ac.uk/Compserv/Doc/General/un11.html#Section9
User Note
Choosing a Statistical Analysis Package
Number: UN 11/1
Author: James Currall
Date: June 1991
Summary: This document suggests strategies
for selecting which statistical software package to use, and summarises the
characteristics of the available software.
There are many items of
software available via the Computing Service at
This document is intended
to help anyone needing to carry out statistical analysis to choose the most
appropriate software for their needs. Careful
consideration of which software to use may slightly delay the start of the
analysis, but experience has shown that in the long term this will save time
and effort and also provide a much better chance of obtaining valid results
from the analysis.
There are from time to
time new releases of packages which add features or remove existing problems.
This document will be updated to take account of this, but between
editions some details given in the document may not reflect the current situation.
2. Methods of Choosing Software
The Classical Approach
A classical approach to
the problem of selecting which statistical software package to use, though
perhaps a rather naive approach, would be to decide what analysis needed to
be carried out and then to scan a list of all the analyses available in each
package to find a match. For an individual
package the answer would be a clear yes or no. There are, however, four weaknesses in this
approach:
·
Not all packages refer
to the same technique by the same name, and so the technique you are looking
for may not appear to be available, when in fact it is.
·
The same name may imply
different things in different packages, or more subtly, a greater or lesser
degree of breadth from one package to another. For example, a range of statistical packages
can reasonably claim to carry out analysis of variance (ANOVA), yet the level
of complexity and the type of problem that can be handled is very different
indeed.
·
More than one package
may offer the analysis that you nominally require. In this case, more information is required in
order to choose between contenders.
·
There may be no package
that appears to offer the analysis you are looking for.
The Common Approach
It might appear on the
surface that you would have less of a problem if you did not know, at the
outset, what analyses you required, since you could then select any package
and use a selection of what is on offer. This
approach, above all others, runs a severe risk of failure, as you may well
end up with a set of statistical tools inappropriate to your analysis.
While it may seem amusing to suggest that statistical analyses be selected
in this way, empirical evidence suggests that this is a common approach.
The Reasonable Approach
Another approach, which
might seem to be more intuitively reasonable, would be to use the same statistical
package as your colleagues. This approach
appears to have two main advantages:
·
You could get help
with getting started and with ongoing problems.
·
Your colleagues are
likely to be from the same nominal discipline as yourself and therefore have
the same analysis needs.
The first advantage should
certainly not be underestimated, but the second relies on the often-unrealistic
assumption that your colleagues made the right decision when they decided
what to use in the first place.
The Recommended Approach
A rather better and more
reasoned approach would be to look beyond the particular analysis that you
have in mind for your data, to your general data analysis philosophy. Broadly speaking certain pertinent questions
should be asked:
·
What class of analysis techniques do you require?
·
Are your data predominantly continuous or categorical?
·
Are your response variables measurements, counts or
proportions?
·
Are your data from experiments or surveys?
·
Do you want to explore your data one step at a time
or process it all at once?
·
Are there repeated measures or multiple responses?
·
Is prediction or description your ultimate goal?
This broader, more philosophical
view of your requirements can then be matched to the philosophy of a statistical
package. This approach produces much
less overlap between packages than does a similar matching process based on
individual statistical techniques. It is also more likely to cushion you from the
discovery, at a future date, that other related analyses are not in your chosen
package.
One important consideration
is how the package sees data and what types of data structure it supports.
Some packages can only see data as a set of cases, for each of which
the same set of measurements or responses are available.
This implies data that has a table or flat file structure similar to
that found in relational databases. Such
packages will allow little if any departure from this condition except by
carefully controlled, and often compartmentalized, exceptions. The selection of such a package, for better
or for worse, will shape the course of your analysis. This may not be detrimental but will depend
on your data and what needs to be done with it.
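As a rough illustration of the "set of cases" view described above, the following Python sketch treats a data set as a list of cases and checks that every case carries the same set of variables. The variable names and the helper is_flat_file are hypothetical; they do not correspond to the commands of any package discussed in this document.

```python
# Hypothetical illustration: a "flat file" is a list of cases that all
# share exactly the same variables (columns).

def is_flat_file(cases):
    """Return True if every case records the same set of variables."""
    if not cases:
        return True
    expected = set(cases[0])
    return all(set(case) == expected for case in cases)

# Three cases, each with the same two measurements: a valid flat file.
survey = [
    {"age": 34, "height_cm": 171.0},
    {"age": 51, "height_cm": 165.5},
    {"age": 27, "height_cm": 180.2},
]

# Records with a varying number of visits per subject, and an extra field
# on one case only, do not fit the flat-file mould without restructuring.
clinic = [
    {"subject": 1, "visits": [5.1, 5.4]},
    {"subject": 2, "visits": [4.8, 5.0, 5.3], "dropout": True},
]

print(is_flat_file(survey))  # True
print(is_flat_file(clinic))  # False
```

Data of the second kind can usually be forced into a flat file, but only by restructuring it first, which is the sort of constraint that shapes the course of an analysis.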
If you select a package
that is philosophically aligned with your type of work, you will not have
to face the difficult choice of continuing to use a sub-optimal package or
changing to another package because you cannot do what you want with the
present one.
Having
discussed the subject in rather general terms, the rest of this document deals
with the strengths and weaknesses of individual packages. Each package
is considered in terms of its philosophical base, the types of data structure
supported, and the types of analysis that are available. At the end of the document, the packages are
compared and such issues as availability and training are considered.
The Philosophical Base
MINITAB was written as
a vehicle for the teaching of statistics.
It was designed to be easy to use, so that the students could concentrate
on the statistics and forget about the complexities of computing. From the beginning the package was designed
to be used interactively as well as in the more traditional batch mode of
operation. This genesis is reflected
in the well-developed help system, which can be interrogated at any stage
to help sort out problems of syntax and usage.
In fact it is possible to learn to use MINITAB from the help system
alone, and MINITAB is one of the few statistical analysis packages which can
be used entirely satisfactorily without the aid of manuals. The MINITAB philosophy can be summarized as simple
techniques, with simple-to-use commands in an interactive environment.
In general, small to medium-sized
data sets are appropriate, if only because even quite simple techniques carried
out on large data sets can be very anti-social in an interactive environment
and may not be very interactive anyway. In recent years MINITAB has acquired some more
advanced techniques, but some of these are not as comprehensive as their counterparts
in other packages and may therefore not meet your requirements.
MINITAB and Data
MINITAB recognizes three
data structures: columns, constants and matrices. The interpretation of the column is very similar
to that in spreadsheet software. You
might refer to columns as variables or variates. MINITAB does not distinguish between categorical,
ordinal and interval data types; they are all columns and can (in theory at
least) be used for input into any technique. This implies that if males are coded 1 and females
are coded 2 then MINITAB will have no problem with calculating the mean (or
average) sex. Most analyses in MINITAB
are carried out on columns but there are also constants, which are used to
store single values such as means or counters. The final data type in MINITAB is the matrix,
which is a two-dimensional data structure rather like a table. Matrices are generally used in MINITAB for matrix
calculations and perhaps have less use in the simple statistical analyses
that are commonly carried out in MINITAB.
MINITAB is able to recognize
missing data using a symbol (*), and cases with missing values are generally
ignored when carrying out analyses.
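The column, constant and missing-value ideas described above can be mimicked in a few lines of Python. This is only a sketch and is not MINITAB syntax: the "*" symbol comes from the text, while the helper names read_column and column_mean are invented for the illustration.

```python
# Illustrative only: mimics the column-and-constant model in plain Python.
# The "*" missing-value symbol is taken from the text; the rest is assumed.

MISSING = "*"

def read_column(tokens):
    """Convert text tokens to floats, mapping "*" to None (missing)."""
    return [None if t == MISSING else float(t) for t in tokens]

def column_mean(column):
    """Mean of the non-missing values; missing cases are simply skipped."""
    present = [v for v in column if v is not None]
    return sum(present) / len(present)

# Sex coded 1 = male, 2 = female, with one missing response.
sex = read_column(["1", "2", "2", "*", "1", "2"])

# A "constant" is just a single stored value.
mean_sex = column_mean(sex)

# The arithmetic succeeds even though an "average sex" is meaningless.
print(mean_sex)  # 1.6
```

Nothing in this model stops the calculation, which is exactly the point made above: the responsibility for deciding whether a statistic makes sense rests with the user.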
Strengths of MINITAB
MINITAB is excellent for
exploring your data. It has a very
good range of descriptive statistics and also has most of Tukey's Exploratory
Data Analysis (EDA) techniques. These
enable you to have a really good look at your data to spot typing mistakes
and outliers and see what the data distribution looks like, before moving
on to hypothesis-testing techniques. In
this context MINITAB has many simple graphical procedures to enable you to
visualize your data as well as the more traditional summary statistics. There are also high-resolution graphics to produce
clearer representations of your data, but in the current versions of MINITAB
these are not of presentation quality and you would probably have to use a
graphics package to produce the graphs for your reports.
Simple hypothesis tests
of both parametric and non-parametric varieties are well covered and easy
to use, and the results are presented in an easily understood fashion. The area of more complicated analysis that MINITAB
is good at is regression, in both the simple and multiple forms. It will do multiple regression
stepwise, with most of the features that you find in other packages.
From Release 8 both the
Macintosh and PC versions of MINITAB include a spreadsheet-style layout and
full-screen editor. This makes the
input and correction of data much easier than using the traditional READ and
LET commands, although these are also available as part of the main package.
MINITAB has a crude macro
facility to enable users to automate repetitive sequences of commands or create
new facilities. An extensive user group
macro library is available containing many useful procedures.
Weaknesses of MINITAB
The major weaknesses of
the package are in the areas of designed experiments and multivariate analyses,
although to be fair, a package which started as a teaching tool could hardly
be expected to have a large number of fully-featured multivariate techniques.
Until release 6, the only
types of designed experiments that could be directly analyzed were one-way
ANOVA and simple two-way ANOVA designs. The
scope has since been enhanced by the inclusion of more flexible ANOVA and
ANCOVA commands. These are not however
the answer to every experimenter's dreams. For example, MINITAB could not handle an experiment
with multiple error strata and give you all the correct F values using all
the appropriate error terms. In addition
these facilities will only cope with fully balanced designs.
At release 7 a new command that could handle unbalanced designs was
introduced, but for correct results the designs should still be orthogonal
and there is no check to see that this is so.
You must therefore understand exactly what you are doing before using
these facilities or you might end up with incorrect or just plain meaningless
results.
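Since the package does not make the check for you, it may be worth making one yourself before running the analysis. The Python sketch below is one possible check for a two-factor layout and is not part of MINITAB. It tests for full balance (equal replication in every cell) and for proportional cell frequencies, the classical condition under which the two main effects of a two-way classification are orthogonal. The function names and the example data are hypothetical.

```python
from collections import Counter

def cell_counts(cases):
    """Count replicates in each (row factor, column factor) cell."""
    return Counter(cases)

def is_balanced(cases):
    """True if every cell of the full two-way layout has the same count."""
    counts = cell_counts(cases)
    rows = {a for a, _ in counts}
    cols = {b for _, b in counts}
    sizes = {counts.get((a, b), 0) for a in rows for b in cols}
    return len(sizes) == 1

def has_proportional_frequencies(cases):
    """True if n_ij * n == n_i. * n_.j holds for every cell."""
    counts = cell_counts(cases)
    rows = Counter(a for a, _ in cases)
    cols = Counter(b for _, b in cases)
    n = len(cases)
    return all(counts.get((a, b), 0) * n == rows[a] * cols[b]
               for a in rows for b in cols)

# Hypothetical designs: each tuple is (level of factor 1, level of factor 2).
balanced = [("A", "x"), ("A", "y"), ("B", "x"), ("B", "y")] * 2
proportional = ([("A", "x")] * 2 + [("A", "y")] * 4
                + [("B", "x")] * 1 + [("B", "y")] * 2)
unbalanced = ([("A", "x")] * 3 + [("A", "y")] * 1
              + [("B", "x")] * 1 + [("B", "y")] * 2)

print(is_balanced(balanced), has_proportional_frequencies(balanced))          # True True
print(is_balanced(proportional), has_proportional_frequencies(proportional))  # False True
print(is_balanced(unbalanced), has_proportional_frequencies(unbalanced))      # False False
```

A design that fails both checks, like the last one, is the kind for which the results should be treated with suspicion.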
The whole area of assumption
checking is one that MINITAB is not good at. In the early days the techniques were simple
and little checking was needed, but as more sophisticated facilities have
been added the need has increased, but not the delivery. MINITAB is used extensively by people with little
statistical background and therefore the need for checking is great, to help
stop the users falling into the many pitfalls along the way. The reason for the lack of checking and warning
is that it requires a lot of code to carry out the checks. Writing code costs money and there are no new
features to be seen at the end of it. Selling
software is largely a question of how many features are provided, not how
well those features are implemented. The multivariate analyses are not the full-bodied,
feature-packed routines that are available in other packages, and so if multivariate
analyses are primarily what you require then you will need to look elsewhere.
Graphics Facilities
MINITAB has a very good
range of character graphics type charts. There
are several different types of scatter diagrams, box and whisker plots, contour
plots and histograms. This type of
presentation is good for model and assumption checking.
There are also high-resolution versions of most of these procedures,
which produce diagrams on the screen, or on a plotter.
A moderate range of plotters is supported (but most of these are ones
that use HPGL to drive them) as are a range of dot-matrix printers but not
laser printers. Although these versions of the procedures provide
higher resolution, they do not give the user much flexibility and therefore
are unlikely to be suitable for inclusion in published papers etc.
At release 9 of MINITAB the graphics will be completely rewritten and
will provide facilities similar to specialist graphics packages (this release
is likely to be available towards the end of 1992).
Graphical User Interfaces
Both the PC and Macintosh
versions have what are referred to as graphical user interfaces at release
8 of MINITAB. These have help systems
and separate windows for commands, output, logs etc. You can choose your commands from pull-down
menus or type them directly at a prompt. These
interfaces make an already user-friendly package even more friendly but are
not yet available on mainframe systems.
Summary
If you have small to medium-sized
data sets and want descriptive statistics, graphical representations of your
data and simple hypothesis tests, then you can't do much better than MINITAB.
You can also have good regression facilities and simple analysis of
variance. It is an excellent teaching package and very
easy to use, with plenty of help available.
It is one of the few general packages that offers control charting
techniques (from release 7), although I have never used them and cannot comment
on how good the facilities are. However,
do not expect a great deal beyond this; there are other packages that do more
complex or specialized analyses better.
The Philosophical Base
SPSS used to stand for
Statistical Package for the Social Sciences, although now the company claims
that it stands for Superior Performing Statistical Software. The former is a more accurate reflection of
reality and gives an indication of where the package came from. Another important point is that the package
dates from the swinging sixties, when all computing was done on mainframe
computers using a batch mode of processing.
Little changed in the mainframe versions until release 3 (available
on the VME system at
Glasgow) when the possibility
of interactive working was introduced.
A more 'user friendly'
form of the package with windows and editing of commands became available
with the PC version, and this trend has continued with OS/2 and Apple Macintosh
versions. However, these advances only
enable you to issue the same style of SPSS commands as the earlier mainframe
releases. Essentially SPSS's strength
is survey data, in sets as large as you care to imagine. Many of the things that you would want to do
to survey data are available in a full-bodied form, although there are some
notable omissions (see later). It is
not really suited to interactive working (with large data sets you wouldn't
want to work interactively anyway) and although it will run on a PC, with
a sizeable data set you might need to leave it to run overnight!
SPSS and Data
SPSS recognizes three data
types, although one of these has to be stored in a file rather than in memory.
SPSS's major data types are the numeric and character variables.
All numeric variables are stored as real numbers even if they are integers. There is no distinction between interval, ordinal
and categorical data; they are all just variables. This arrangement both simplifies things and
creates problems. If you have an ordered
categorical variable with values 1, 2, 3 and 4, and you wish to select the
cases with values less than or equal to 3, you should select values less than
3.5 to be sure of getting the correct cases.
This is because when 3 is stored as a real number it is not stored
as exactly 3 and may be stored as 2.9999999.
The same problem occurs with recoding.
The problem would not arise if there were a data type in SPSS which
used integers for categorical variables. The
moral is to be careful and is especially applicable to survey data where a
large number of variables are categorical.
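The defensive selection described above can be illustrated in Python. This is a sketch and is not SPSS syntax. The small errors in the stored values are injected by hand to mimic the behaviour described. On current IEEE 754 hardware small integers are themselves represented exactly, so errors of this kind typically appear only when a code is the result of arithmetic or conversion.

```python
# Hypothetical stored values for an ordered categorical variable coded 1-4.
# The small errors are injected by hand to mimic the behaviour described
# above; they are not produced by any particular package.
intended = [1, 2, 3, 3, 4]
stored = [1.0, 2.0, 2.9999999, 3.0000001, 4.0]

# Naive selection: "value <= 3" silently drops the case stored as 3.0000001.
naive = [v for v in stored if v <= 3]

# Defensive selection: compare against the midpoint between codes 3 and 4.
safe = [v for v in stored if v < 3.5]

# Alternative: round back to the nearest integer code before comparing.
rounded = [round(v) for v in stored]

print(len(naive))   # 3
print(len(safe))    # 4
print(rounded)      # [1, 2, 3, 3, 4]
```

The same midpoint or rounding approach applies when recoding, where a test for exact equality would be even more fragile than an inequality.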
SPSS is entirely case orientated
and will only see data as being a set of variables measured or recorded on
each of a number of cases. Frequency
distributions and contingency tables can only be handled if they are generated
internally from raw case data, except in a few limited circumstances.
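To make the case-orientated view concrete, the following Python sketch builds a small contingency table from raw case data, which is the direction in which the text says such tables have to be generated. The data and variable names are invented, and the code is not SPSS syntax.

```python
from collections import Counter

# Hypothetical raw case data: one (sex, response) pair per respondent.
cases = [
    ("male", "yes"), ("female", "no"), ("female", "yes"),
    ("male", "no"), ("female", "yes"), ("male", "yes"),
]

# Generate the contingency table internally from the raw cases.
table = Counter(cases)

rows = sorted({sex for sex, _ in cases})
cols = sorted({answer for _, answer in cases})

print("      ", cols)
for sex in rows:
    print(sex, [table[(sex, answer)] for answer in cols])
```

Going the other way, from a published table of counts back to an analysis, requires the table to be expanded into cases or weighted in some fashion, which is why a case-only package is awkward when the counts are all you have.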
SPSS recognizes missing
values as specific values outside the normal range of numbers for a particular
variable, and will ignore cases with such values during most analyses. SPSS will also handle matrices under certain
circumstances, but these are stored in files not as variables, and they are
only available to and from a limited range of procedures.
The area of data structures
is where SPSS is most in need of improvement, as a package which is targeted
at survey data needs to have ordered (ordinal) and non-ordered (nominal) categorical
data types.
Strengths of SPSS
Cross-tabulation in SPSS
is very good indeed, and with the addition of the TABLES product, the written
output can be made to look extremely professional. You can have your data subdivided into categories
in several dimensions and then get a whole range of descriptive statistics
for each cell in the categorization. This
is no more and no less than you would expect from a survey analysis package.
There is also a very good range of simple hypothesis tests of both
parametric and non-parametric types. Most
forms of regression from simple to multiple and linear to non-linear and log-linear
are well implemented, and all the bells and whistles are there. In addition such multivariate techniques as
factor analysis and discriminant analysis are available in a fully featured
form (SPSS uses the term factor analysis in its generic sense to cover a range
of related techniques, including principal components analysis).
The one major disappointment
in the area of multivariate techniques is cluster analysis, where all the
similarity measures offered are designed for interval data and there are none
for mixed or binary data. Much data
in survey analysis is mixed or binary and there is little option but to turn
to other packages for cluster analysis. There is also an additional option called TRENDS,
which does various forms of time series analysis. A macro facility is available, but the SPSS
syntax and data structures do not give sufficient flexibility to make it really
useful.
Weaknesses of SPSS
The major weakness of SPSS
is in its handling of designed experiments.
It either does things badly or in an extremely convoluted manner. There are probably very few people in the world
who fully understand the MANOVA command and how to make it do all the things
that it is supposed to do. It tries
to do too much in one command and ends up doing almost everything in a totally
counter-intuitive way.
Until recently there were
very few techniques in the package designed for data which is at best ordinal,
which is surprising for a package that is targeted at survey data, though
the recent inclusion of multi-dimensional scaling and the optional extra CATEGORIES,
which includes correspondence analysis, has improved matters.
In terms of the philosophy
of statistics, SPSS will lead the unwary astray. The statistical philosophy demands that you
make your assumptions explicit before making a hypothesis test. In most packages you have to make your assumptions
clear in subcommands or options and will therefore be making a specific test.
In SPSS you get a series of answers which have different assumptions
attached to them, and then you choose the answer that you like best.
SPSS does not prevent you from establishing a priori assumptions but it does not encourage you to do it.
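A simple illustration of why the assumptions matter is the two-sample t-test, where the statistic depends on whether equal population variances are assumed. The Python sketch below computes both versions for the same invented data. It is intended only to show that the two answers can differ appreciably, and it makes no claim about the layout or content of SPSS output.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic assuming equal population variances."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

def welch_t(a, b):
    """Two-sample t statistic that does not assume equal variances."""
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / sqrt(variance(a) / na + variance(b) / nb)

# Hypothetical data: a small, tightly clustered group and a larger,
# much more variable one.
group_a = [5.1, 4.9, 5.0, 5.2]
group_b = [6.0, 7.5, 4.0, 8.5, 5.0, 9.0, 3.5, 6.5]

print(pooled_t(group_a, group_b))
print(welch_t(group_a, group_b))
```

Deciding which of the two is appropriate before looking at the results is what is meant by establishing assumptions a priori. Choosing afterwards, on the basis of which value looks more favourable, is not a valid procedure.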
In addition almost all
SPSS commands have defaults for most of the choices between methods, and so
if you do not specify anything you get an analysis which may well be inappropriate
to your type of data and situation. Considerable
care is needed, especially with some of the more sophisticated techniques,
in order to specify an appropriate form of analysis.
In order to do this a set of manuals is essential.
They are well written and they are the only place where you can find
out exactly what the analysis is going to do to your data and what assumptions
are to be made. Many people remark that SPSS is easy to use,
that they understand it and that one doesn't need manuals to use it. It is in reality easy to misuse, many of the
techniques are extremely difficult to understand, and if you use it without
manuals you are in grave danger of seriously undermining your academic credibility.
Under most circumstances
it is very difficult in SPSS to pass the results of one analysis as input
to another, as it does not support the data structures to do this. This reduces the flexibility of the package
quite considerably.
SPSS is very poor at assumption
checking; if it warns you about problems with your data you are in serious
trouble, as such warnings are few and far between. Much the same, in this respect, applies to SPSS
as to MINITAB. The warnings and the
checks are all described in the manuals, but the checks have to be carried
out by you on a preliminary analysis or analyses of the data, and then you
need to modify the options accordingly - none of this will be done for you
automatically.
Graphics Facilities
SPSS offers a range of
character graphics facilities in all its versions. These enable you to plot crude scatter diagrams
etc. High-resolution graphics is provided
in most versions via third party packages. When you issue an SPSS command to produce the
graphics, it writes a file of commands and passes this to the other package
(which is started up automatically by SPSS) to produce the graph. This way of providing graphics has both advantages
and disadvantages. It means that SPSS
Inc do not have to write graphics software, as they can just use someone else's,
which might be better written than anything that they could write themselves.
It also means however that they can only use features that are available
in the third party software (e.g. neither
Harvard Graphics nor Microsoft Chart, which are the packages used by the PC
version, can produce error bars). You
have to buy the graphics package separately (which can be as much as six times
the cost of a site license copy of SPSS).
There is also no guarantee that the package that you have bought will
continue to be the one that SPSS uses in the future.
No high-resolution graphics is implemented on the mainframe version
of SPSS at Glasgow.
Graphical User Interfaces
The versions of SPSS for
the PC, OS/2 and Apple Macintosh have what are referred to as graphical user
interfaces. These have help systems
and separate windows for commands, output, logs etc. You can choose commands from menus and 'paste'
them into the commands window, and thus build up standard SPSS commands from
menus. This is a two-edged sword; it
makes it easier to get the command syntax right, but makes it even more tempting
not to use the manuals, which are the only source of essential information
about methods and assumptions. These
interfaces do make it easier to construct SPSS commands, to correct errors
and tidy up output, and are much 'friendlier' than the mainframe systems,
but the pitfalls are still there and are easier to fall into.
Summary
SPSS can handle very large
data sets because it does not load all the data into memory at once; it just
gets it from disk when it wants it. It
was designed for survey data, and that design philosophy has not changed much
over the years. It is excellent for
sorting out and tabulating survey data, and it can handle a competent range
of univariate and multivariate techniques.
However, if you have designed experiments to analyze then it is not
for you.
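The idea of working through data on disk without holding it all in memory can be sketched in Python as follows. This is an assumed illustration of the general approach, not a description of how SPSS is implemented. It computes a mean and sample variance in a single pass using Welford's updating method, reading one case at a time.

```python
import io

def streaming_mean_variance(lines):
    """One-pass (Welford) mean and sample variance over an iterable of lines."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for line in lines:
        x = float(line)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance

# io.StringIO stands in for a large data file on disk; a real file object
# returned by open() could be iterated in exactly the same way.
data_file = io.StringIO("2.0\n4.0\n4.0\n4.0\n5.0\n5.0\n7.0\n9.0\n")

n, mean, variance = streaming_mean_variance(data_file)
print(n, mean, variance)
```

Only three running totals are kept however many cases are read, so the size of the data set is limited by disk space and patience rather than by memory.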
The Philosophical Base
The original GENSTAT program
was written to analyze agricultural experiments. It was essentially an extremely powerful statistical
analysis programming language. Its
syntax was difficult and inconsistent and the package was used only by enthusiasts.
The other perceived problem with the package was that you needed
to know what you were doing in order to use it.
In the early 1980s the
user interface to the package was completely rewritten, with a more logical
and above all consistent syntax. The
result of this rewrite was a new package called GENSTAT 5. The second 'problem' was also addressed, to
the extent that the manuals were written in a way that made them easy to use
and helpful in explaining the techniques in addition to the command syntax.
The new user interface was written with interactive use in mind, and
a comprehensive hierarchical help system was included as part of this.
You still need to know
what it is that you are doing in order to use GENSTAT 5, but you do not have
to be a statistician. If you find this
idea off-putting and would prefer to use a package that does not make such
demands on you, I would suggest that you reflect on the following.
In order to analyze data effectively, you need to
specify the problem correctly, apply an appropriate technique, and then interpret
the results produced.
You gain nothing by putting
your data through a technique and then failing to fully interpret the results
that the package produces. A package
that allows you to carry out an analysis without understanding what you are
doing only allows you to defer the knowledge gap to the interpretation phase.
It is therefore doing you no favors, as you run the serious risk of
selecting an inappropriate technique if you start off not understanding what
you are doing.
Visualization of data is
an important part of the GENSTAT philosophy, and it is thus supplied with
a very flexible range of high-resolution graphics facilities, which in themselves
are sufficient reason for using it.
GENSTAT 5 and Data
GENSTAT 5 recognizes a
wide range of data structures, which makes it very flexible to use and helps
to avoid ambiguity. The first structure
is that found in all statistical analysis packages, the variable, which in
GENSTAT is called the variate. Variates are observed
variables. Classification variables
have a different structure called the factor, which exists only
as a set of ordered or unordered levels. The
value of this distinction is that functions appropriate to observations cannot
be applied to the classifiers, and vice-versa.
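The value of separating variates from factors can be shown with a small Python sketch. The class and function names below are invented for the illustration and are not GENSTAT directives. The point is only that once the two kinds of structure are distinct, an operation that makes sense for one can be refused for the other.

```python
class Variate:
    """Observed numeric values, e.g. measurements."""
    def __init__(self, values):
        self.values = [float(v) for v in values]

class Factor:
    """Classification variable: each case holds one of a fixed set of levels."""
    def __init__(self, values, levels):
        unknown = set(values) - set(levels)
        if unknown:
            raise ValueError(f"values outside declared levels: {sorted(unknown)}")
        self.values = list(values)
        self.levels = list(levels)

def mean(structure):
    """Arithmetic mean, defined for variates only."""
    if not isinstance(structure, Variate):
        raise TypeError("mean is only meaningful for a Variate")
    return sum(structure.values) / len(structure.values)

def level_counts(structure):
    """Frequency of each level, defined for factors only."""
    if not isinstance(structure, Factor):
        raise TypeError("level_counts is only meaningful for a Factor")
    return {level: structure.values.count(level) for level in structure.levels}

height = Variate([171, 165.5, 180.2])
sex = Factor([1, 2, 2], levels=[1, 2])

print(level_counts(sex))  # {1: 1, 2: 2}
try:
    mean(sex)  # the "average sex" calculation is rejected outright
except TypeError as err:
    print("rejected:", err)
```

Compare this with a package in which everything is a column or a variable, where the same request would quietly return a number.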
Matrices are available
as a number of different types of structure: rectangular, symmetric and diagonal.
In addition, there are table structures, single-value structures (called
scalars in GENSTAT 5), and
specialized structures for time series, latent roots and vectors, and sums
of squares and products. Finally there
are text, expression and formula structures.
Structures of differing types may be combined into pointer structures.
If you think that this
sounds unduly complicated, remember that you do not need to use all the structures,
but the fact that the structures used for different purposes are differentiated
allows appropriate functions to be applied to particular structures.
For example, to mimic MINITAB data structures you would only need to
use scalar for constants, variates
for columns, and matrix for matrices.
GENSTAT will cope with
missing data by using a code that you select, and in some procedures will
not ignore cases with missing values, but will employ iterative interpolation
techniques to estimate the missing value.
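The text does not say which iterative techniques are used, so the following Python sketch should be read as an assumption rather than as GENSTAT's method. It shows one classical approach for a two-way table: a missing cell is given a starting value and is then repeatedly replaced by the value fitted by a simple additive (row plus column) model until the estimate stops changing. The function name and data are hypothetical.

```python
def estimate_missing(table, tol=1e-10, max_iter=1000):
    """Fill None cells of a two-way table by iterating an additive fit.

    Each missing cell is repeatedly replaced by
    row mean + column mean - grand mean until the estimates settle.
    """
    n_rows, n_cols = len(table), len(table[0])
    missing = [(i, j) for i in range(n_rows) for j in range(n_cols)
               if table[i][j] is None]
    observed = [v for row in table for v in row if v is not None]
    start = sum(observed) / len(observed)
    filled = [[start if v is None else float(v) for v in row] for row in table]

    for _ in range(max_iter):
        grand = sum(map(sum, filled)) / (n_rows * n_cols)
        row_means = [sum(row) / n_cols for row in filled]
        col_means = [sum(filled[i][j] for i in range(n_rows)) / n_rows
                     for j in range(n_cols)]
        change = 0.0
        for i, j in missing:
            new = row_means[i] + col_means[j] - grand
            change = max(change, abs(new - filled[i][j]))
            filled[i][j] = new
        if change < tol:
            break
    return filled

# Hypothetical 3 x 3 layout with one lost observation (row 2, column 3).
yields = [
    [10, 11, 13],
    [12, 13, None],
    [14, 15, 17],
]

completed = estimate_missing(yields)
print(round(completed[1][2], 6))  # 15.0
```

The example data are exactly additive, so the iteration recovers the value that the row and column pattern implies. With real data the estimate simply makes the analysis of the completed table behave sensibly, and the degrees of freedom still have to be reduced for each value estimated.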
Strengths of GENSTAT 5
GENSTAT 5 has five main
strengths:
·
It has a large range
of statistical techniques and is particularly strong on designed experiments,
regression, curve fitting, generalized linear modeling, and multivariate techniques.
Many of the techniques and algorithms incorporated into the package
represent the state of the art.
·
GENSTAT provides an
impressive toolkit of arithmetic, statistical and matrix manipulation techniques,
which enable experienced users to modify existing techniques and build new
ones from scratch. Using the tools,
new techniques can very quickly be added to the package and made available
via the procedure library. Users may
also use procedures to add new features of their own to GENSTAT, in the same
way that macros are used in other packages. There is therefore a very short lead-time before
new analyses are added to the facilities available.
·
The data structures
enable the results of one technique to be directly imported into another.
Unlike many other packages, any result that can be printed can be used
for further analysis.
·
There are very good
high-resolution graphics to facilitate data visualization and reporting, which
unlike some other packages are of publication quality and are included in
the package as standard.
·
GENSTAT 5 is probably
the best package available for statistical assumption checking. This in fact makes it more suitable for inexperienced
users than some of the other packages as it helps them to avoid some of the
more fundamental errors.
Weaknesses of GENSTAT 5
GENSTAT 5 does not have
good facilities for the cross-tabulation of surveys and questionnaires, which
is not altogether surprising considering its origins.
GENSTAT 5 is not as easy
to use for simple techniques as other packages such as MINITAB. In fact some of the simpler data exploration
techniques are not included in the base package but are only provided through
the procedure library. Although the
command syntax has been greatly improved it is perhaps less immediately intuitive
than some other packages.
GENSTAT 5 does not have
an infinite workspace and each data structure that you use will take some
space. Data is not read directly from
disk for each procedure, and so there is, as with most other packages except
SPSS, a limit to the size of problem that can be handled.
The actual size of the
workspace is related to the type of machine on which the package is running.
In the early releases of the package there were some efficiency problems
which have now been addressed.
The 'problem' of needing
to know what you are doing has already been discussed. It is perhaps a matter of opinion as to whether
or not this is a weakness.
Graphics Facilities
GENSTAT 5 offers a range
of character based graphics including the ability to plot fitted lines on
scatter diagrams (albeit rather crude ones involving commas and semicolons). There is a considerable amount of control that
the user can exert over the layout and style of scatter diagrams, histograms,
box and whisker plots, contour plots and dendrograms.
Unlike many packages, high-resolution
graphics are built into GENSTAT 5 and are available in all versions (except
the current implementation on CMS). A moderate amount of control over layout, color
and fill patterns is available for scatter and line graphs, biplots, box and
whisker plots, contour plots, histograms and barcharts, piecharts, dendrograms
and shade diagrams. The devices that
are supported depend on the version; the PC version supports plotters using
HPGL and Epson dot matrix printers as well as the usual graphics screen drivers.
Release 2 of GENSTAT 5 (late 1991) will support graphical input and
so it will be possible to indicate points on graphs with a mouse and thus
identify an outlier in a graph. It will also be possible to write procedures
to perform brushing scatterplot techniques.
Summary
GENSTAT 5 is a very well
featured statistics package with excellent high-resolution graphics facilities.
It requires a reasonable understanding of what you are doing but at
the same time it is much better at assumption checking than many other packages.
There is slightly more to learn initially than for many other packages,
which makes it less suitable for one-off or occasional use. It is likely to be of particular use to those
who need to do a moderate
amount of statistics or
use a wide range of techniques. It
will however be invaluable to those who wish to develop novel techniques and
provides a wide range of graph plotting facilities in a fairly easy to use
form.
The Philosophical Base
GLIM stands for Generalized
Linear Interactive Modeling system. It
was written by the Royal Statistical Society, and continues to be developed
by them. It was written to enable statisticians
to fit generalized linear models (GLMs) to their data, in a way that was not
possible in other packages. Many common
statistical problems, such as regression, analysis of variance and covariance
of designed experiments, log-linear models of counts, and logit and probit
models of proportions, are just special cases of a GLM.
Even very simple problems such as t-tests can be specified as GLMs,
and so in the hands of a statistician GLIM is a very powerful tool.
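The claim that a t-test is a special case of a linear model can be checked with a little arithmetic. The sketch below is not GLIM code; it is a plain Python illustration, with made-up data and hypothetical function names, showing that the classical pooled two-sample t statistic equals the t statistic for the slope when the response is regressed on a 0/1 group indicator.

```python
import math

def pooled_t(a, b):
    """Classical two-sample t statistic using a pooled variance estimate."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    s2 = ss / (na + nb - 2)
    return (mb - ma) / math.sqrt(s2 * (1 / na + 1 / nb))

def regression_t(a, b):
    """t statistic for the slope of y regressed on a 0/1 group indicator."""
    y = list(a) + list(b)
    x = [0.0] * len(a) + [1.0] * len(b)
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # equals the difference in group means
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope / std_err

# Made-up measurements for two groups (illustration only).
group_a = [5.1, 4.9, 5.6, 5.8, 5.0]
group_b = [6.2, 5.9, 6.8, 6.1, 6.5]

print(pooled_t(group_a, group_b), regression_t(group_a, group_b))
```

The two printed values agree to rounding error, which is the sense in which the t-test is "just" a model with one two-level factor, an identity link and normal errors.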
The recipe is very simple;
there are facilities for reading in the data, specifying the model, fitting
the model, and producing output as simple graphs, tables and reports. For a package of such power the GLIM program
is very small and will fit on a PC with only 256Kb of memory. If you do not have a firm grasp of statistical
modeling and do not know what link functions and error distributions are,
it is definitely not for you, unless a statistician is guiding you.
GLIM and Data
GLIM recognizes only four
types of data structure, two of which are column structures, like one-dimensional
arrays in a programming language:
·
The variate
is used to hold continuous variables, such as measurements. This structure is assumed unless otherwise specified.
·
The factor
is used to store categorical variables, either nominal or ordinal. Factors may be redefined as variates and vice
versa as required.
·
The scalar
is used to hold single values. Scalars
are very important in GLIM because they are used to hold all the parameters
pertaining to how well the current model fits and how much better it fits
than the last one. Users may also define
scalars of their own if required.
·
The pointer
is used to hold the name of another structure, such as the name of the current
Y variable.
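For readers more used to programming languages, the four structures might be pictured roughly as follows. This is only a Python analogy under our own assumptions: the class names, the workspace dictionary and the helper functions are hypothetical and are not GLIM syntax.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

@dataclass
class Variate:
    values: List[float]        # continuous measurements, one per unit

@dataclass
class Factor:
    levels: int                # number of categories
    codes: List[int]           # category of each unit, coded 1..levels

@dataclass
class Scalar:
    value: float               # a single number, e.g. a fit statistic

@dataclass
class Pointer:
    target: str                # the NAME of another structure

Structure = Union[Variate, Factor, Scalar, Pointer]

# A hypothetical workspace holding one structure of each kind.
workspace: Dict[str, Structure] = {
    "WEIGHT": Variate([61.2, 70.5, 58.9, 80.1]),
    "SEX": Factor(levels=2, codes=[1, 2, 1, 2]),
    "D": Scalar(0.0),
    "YVAR": Pointer("WEIGHT"),
}

def resolve(name: str) -> Structure:
    """Follow pointers until a non-pointer structure is reached."""
    obj = workspace[name]
    while isinstance(obj, Pointer):
        obj = workspace[obj.target]
    return obj

def factor_to_variate(factor: Factor) -> Variate:
    """Reinterpret factor codes as numbers, as when a factor is redefined."""
    return Variate([float(code) for code in factor.codes])
```

Here `resolve("YVAR")` returns the WEIGHT variate, which mirrors the way a pointer names the current Y variable rather than holding data itself.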
Strengths of GLIM
GLIM is excellent for exploring
data and trying a series of models with different assumptions. It is one of the few packages that allow you
to assume error distributions other than normal ones; Binomial, Gamma and
Poisson error distributions are also permitted.
The package also allows a range of link functions between the dependent
and independent variables (exponent, logit, probit, log, complementary log-log,
reciprocal and square root), as well as the straightforward identity relationship.
In fact the whole package is built around fitting models with appropriate
link and error function combinations.
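As an illustration of what a link function is, the sketch below writes out most of the links named above as pairs of Python functions: the link itself, which maps the mean of the response onto the scale of the linear predictor, and its inverse. This follows the usual textbook formulation of a GLM rather than anything taken from the GLIM manual; the dictionary name and keys are our own, and the exponent (power) link is omitted because it needs an extra parameter.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used by the probit link

# Each entry is (link, inverse_link): eta = link(mu), mu = inverse_link(eta).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
    "cloglog": (lambda mu: math.log(-math.log(1 - mu)),
                lambda eta: 1 - math.exp(-math.exp(eta))),
    "reciprocal": (lambda mu: 1 / mu, lambda eta: 1 / eta),
    "sqrt": (math.sqrt, lambda eta: eta ** 2),
}

# Round trip: applying a link and then its inverse recovers the mean.
for name, (link, inverse) in LINKS.items():
    mu = 0.3
    print(name, link(mu), inverse(link(mu)))
```

The logit, probit and complementary log-log links only make sense for means strictly between 0 and 1 (proportions), which is why they are normally paired with a Binomial error distribution; the log link is the usual partner of Poisson errors.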
As a result of the compact
nature of the package, it runs fast and is virtually identical on all the
machines on which it is available (which is almost any machine with a Fortran
compiler). There is a good range of
mathematical functions for transformations of variates, and simple line-printer
type histograms and scatter diagrams to examine how well a specific model
fits to your data. There is also the
facility to write macros (rather like programs of GLIM commands) which can
be used to store frequently used analyses.
The package is supplied with a library of macros written by other people
to carry out particular types of model fitting.
Weaknesses of GLIM
GLIM has few bells and
whistles and is written by statisticians for statisticians. There is very little in the way of a help facility,
except that if you make a mistake in specifying the arguments to a particular
command, GLIM will give you an error message and a summary of the command
syntax, so that you can try again.
GLIM is emphatically not
a general-purpose statistics package and it cannot therefore really be criticized
on the basis of the types of analysis that it cannot perform. It does not handle missing values at all, which
can be something of a problem, but that is to do with the algorithms that
handle the actual model fitting and the problem of deciding, for a general
case, what to do about missing data. There
is no facility in GLIM for direct access to the operating system, for example
to examine a directory listing in order to find out the name of a file.
Finally, although variate and factor names may be longer than four
characters, only the first four are used.
User-defined scalars may only be one letter long and you can therefore
only have 26 of them. These features restrict the sort of names that
you can use for data structures.
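The practical effect of the four-character rule is that two names which look different to you may be the same name to GLIM. A small Python check along the following lines (the function name is hypothetical, and names are compared exactly as typed) can be run over a list of proposed names before the data are set up.

```python
from collections import defaultdict

def clashes(names, significant=4):
    """Group names that share the same first `significant` characters."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:significant]].append(name)
    # Keep only the prefixes claimed by more than one name.
    return {prefix: found for prefix, found in groups.items() if len(found) > 1}

proposed = ["weight_before", "weight_after", "height", "age"]
print(clashes(proposed))
# {'weig': ['weight_before', 'weight_after']}
```

Names such as `wtbefore` and `wtafter` would clash in the same way, whereas `befw` and `aftw` would not.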
Graphics Facilities
GLIM offers only character-based
graphics, but they are satisfactory for model and assumption checking that
is essential to the way that GLIM is used. Release 4 of GLIM (early 1992) will have high-resolution
graphics built into it, and these facilities should be available in all versions.
Summary
If generalized linear modeling
is what you know about and what you want to do then GLIM will undoubtedly
be attractive to you. If you want neatly
packaged statistical analyses then it will definitely not provide what you
are looking for. Release 4 of GLIM
will provide extensions to the type of models that can be fitted, new data
structures such as matrices, and high-resolution graphics, but is unlikely
to have many features to improve its user friendliness except for some on-line
documentation.
The Philosophical Base
BMDP started out in 1961
as BioMeDical computer programs (BMD). It
was then, as it still is today, a collection of programs to carry out specific
statistical analyses in the bio-medical sciences. The range of analyses available has been greatly
enhanced over the years, and there are a number of types of analysis which
are not available in other packages. The
programs are well written and the developers have never been accused of using
poor algorithms, as have the developers of some other packages.
There are interactive versions
of BMDP available, but with the possible exception of some of the facilities
of the PC90 version, the interactive versions make very few concessions to
modern interactive computing. However,
once you have learned the commands associated with the two or three programs
that you need, BMDP is very easy to use. The basic idea is that you set up your job,
submit it for processing, and collect your output when it has finished. As long as you have a problem that fits one
of the BMDP programs, then everything is straightforward, but if not there
is no flexibility to tailor the analysis to suit your needs.
Although the package consists
of about 50 individual programs, they share the same commands for data input,
transformation, printing, storage and output.
This means that having prepared a job for one program you have only
to change the program-specific commands to make the job suitable for another
program.
BMDP and Data
There is not a great deal
to be said here. BMDP reads variables,
which can be either numeric or character strings of four or fewer characters.
Whether or not variables are measurements or categories is determined
by whether or not they are declared as grouping variables as part of the analysis
specification (continuous variables can be used for grouping if you specify
cut points). Various other data structures, such as correlation
matrices, can be stored in BMDP system files for use in other programs if
required.
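The idea of grouping a continuous variable by cut points can be sketched in a few lines of Python. This is an illustration of the idea only, not BMDP syntax; in particular the convention that a value equal to a cut point falls in the lower group is an assumption made here, and the function name is hypothetical.

```python
from bisect import bisect_left

def group_by_cutpoints(value, cutpoints):
    """Return the group number (1, 2, ...) of `value`.

    `cutpoints` must be sorted in ascending order. Group 1 holds values up
    to and including the first cut point, and so on (assumed convention).
    """
    return bisect_left(cutpoints, value) + 1

ages = [18, 20, 35, 40, 62]
print([group_by_cutpoints(age, [20, 40]) for age in ages])
# [1, 1, 2, 2, 3]
```

Two cut points give three groups, so a continuous variable such as age can stand in for a three-level grouping variable without being recoded in the data file.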
BMDP is, like SPSS, almost
exclusively case oriented, and data is specified as a set of variables, with
one value for each variable for every case.
Missing values are handled by specifying a specific value, such as
999, to indicate that there is no data for a specific case. One useful feature of data input in BMDP, which
should be available more widely in statistics packages, is the facility to
specify the maximum and minimum values that a particular variable can take.
This means that many typographic errors, such as a decimal point in
the wrong place, are picked up automatically.
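The combination of a missing-value code and declared limits can be pictured with the Python sketch below. It is not BMDP syntax: the function name and data are made up, and treating an out-of-range value as missing while also reporting it is an assumption about how such a check might reasonably behave.

```python
def screen(values, missing_code=999, minimum=None, maximum=None):
    """Replace missing codes and out-of-range values with None.

    Returns the cleaned values and a list of (case number, value) pairs
    for every value that fell outside the declared limits.
    """
    cleaned, problems = [], []
    for case, value in enumerate(values, start=1):
        if value == missing_code:
            cleaned.append(None)               # declared as missing
        elif (minimum is not None and value < minimum) or \
             (maximum is not None and value > maximum):
            cleaned.append(None)               # implausible: flag it
            problems.append((case, value))
        else:
            cleaned.append(value)
    return cleaned, problems

# Heights in metres; case 3 is missing, case 4 has a misplaced decimal point.
heights = [1.72, 1.65, 999, 17.8, 1.80]
cleaned, problems = screen(heights, missing_code=999, minimum=0.5, maximum=2.5)
print(cleaned)    # [1.72, 1.65, None, None, 1.8]
print(problems)   # [(4, 17.8)]
```

Note that the missing-value code must lie outside the range of genuine data, otherwise real observations will silently be discarded.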
Strengths of BMDP
The quality and diversity
of the BMDP programs are its strength. Some
of the programs provide facilities that are not easily available in other
packages. These include survival analysis,
boolean factor analysis, the analysis of preference pairs, and stepwise logistic
regression.
The output from BMDP analyses
is better laid out and easier to understand than that of some other packages,
although high-resolution graphical output is only available in the PC version.
If BMDP has the analysis that you require then it will do a good job
for you with little fuss. BMDP has a very useful program called the data
manager that enables you to transform and sort variables, merge files and
aggregate the data over cases, which helps to reduce its dependence on a case-oriented
view of data.
Weaknesses of BMDP
The major weaknesses are
lack of flexibility and the fact that the package consists of a series of
separate programs, with one for each type of analysis. The former means that if your problem is in
any way non-standard, you may not be able to carry out an optimal analysis
using a BMDP program. Problems concerned
with the latter result from the fact that although each program has a series
of parameters that are set out in what are called paragraphs, there is only
a loose commonality between programs. The
paragraphs concerned with input of data, transformation, printing and storage
are all the same across all programs, but the ones that specify the particular
analysis are specific to the individual program. There is no guarantee that if you wished to
carry out, say, a repeated-measures analysis of variance, then you could specify
the analysis in the same way in the three different programs that could perform
it for you. It is also very difficult
to carry out data exploration with BMDP, because you might need to run several
programs to collect all the information that you require.
Graphics Facilities
BMDP offers a reasonable
range of character based graphics in some of its programs. Only on the PC is there the possibility of high-resolution
versions of the graphs. The PC90 version
allows you to change certain features of the graphs from the defaults (colors,
axis limits, fonts and line styles) but the range of types of presentation
is quite limited.
Graphical User Interfaces
The PC90 version of BMDP
has a menu-driven interface to the BMDP programs. This allows you to create a file of commands
and/or data for input into a BMDP program and then select an appropriate program
from a menu. There is a help system
to aid your choice of program and help you with the editor. Finally, if you have the high-resolution graphics
option, you can construct and modify your graphs from further menus.
This menu system takes some of the effort out of using BMDP but it
does not transform it into anything like an interactive package; it still
remains a batch orientated package that is good for problems with a structure
that fits a particular BMDP program.
Summary
If you have a well-specified
problem in the bio-medical sciences that is common to others in your field,
BMDP may well have the analysis that you are looking for. The analysis will be well implemented and will
give you clear and concise output. In
short, BMDP will provide you with a well-packaged product. If however you want to explore your data and
try out various approaches to its analysis, there are easier ways of doing
it. For certain types of analysis there
is not really much choice and you have to accept the program that will do
it.
The level of interactivity
across the machines on which BMDP is available is rather low at present, but
there are indications in the PC90 implementation that this situation may well
improve in the medium term.
The Philosophical Base
The SAS system is a very
large integrated collection of products which is capable of carrying out a
wide range of tasks in the data storage, analysis and display field. It might be thought of as an environment for
data handling rather than simply as a package.
Although it has its origins in statistical analysis it has spread into
graphics, database management, econometrics, operations research and quality
control.
Most versions of SAS offer
multiple screen windows in interactive working so that all your work can be
carried out in the SAS environment. The
package is probably of most interest to those who need a complete working
environment, from planning their work through to the finished reports. The package is very popular in the pharmaceuticals
industry, since it is obligatory in food and drug trials for the FDA.
It is also popular in commerce as a computer performance-monitoring
tool. Until fairly recently its major
userbase was organizations with IBM mainframes, but over the past few years
it has become available on a much wider selection of hardware.
Although the range of facilities
is impressive overall, in some of the areas that the package covers it is
less good than competing more specialized packages, but this disadvantage
must be weighed against the benefits of the integration of diverse elements
into one environment.
SAS and Data
Data in SAS is organized
into structures called SAS datasets. Ordinary SAS datasets contain from one to many
variables, where a variable is one measurement or observation made on a set
of cases. The term variable in this
context accords with its usage in other packages. Variables may be numeric, character or date.
Nominal, ordinal and interval variables are not distinguished and are
all numeric (although the character variable type may be used for nominal
variables). A dataset may be constructed from raw data read
into the package from a file or from keyboard input. Unless instructed otherwise, all SAS procedures
are carried out on the latest SAS dataset established by a SAS DATA step.
Special SAS datasets are
used to contain data structures which do not fit the 'variable by cases' model
such as matrices of correlation coefficients.
These are primarily provided to enable the results of one procedure
to be fed in as input to another one. It
is a pity that the provision of special SAS datasets is not wider so that
all results that can be printed were available as data for further manipulation.
They are currently provided primarily for inter-procedure communication.
Multiple SAS datasets may
be in use at one time, which partially removes the strict case orientation
that is found in some other packages. SAS
datasets may be either temporary, in which case they will be lost at the end
of a session, or they may be permanent, in which case they will be saved at
the end of the session.
Strengths of SAS
The major strength of
SAS is the integration of a number of products to produce a complete working
environment. It offers a wide range
of statistical procedures although the range could not be described as fully
comprehensive. Most of the common types
of analysis are to be found and there are a number of facilities that are found
in few other packages.
Most notable amongst these
rare features are the facilities for the planning and execution of industrial
type experiments such as those advocated by Taguchi and Deming and their
followers. The industrial quality control
facilities are of course backed up, like all the SAS facilities, by good graphical
presentations, which are essential in modern statistical analysis.
The integration of the
SAS products is seamless, so that you do not need to know whether you are
using a Base SAS, a SAS Stat or a SAS Graph procedure, so long as you know
the appropriate syntax. SAS has a full
set of macro facilities, which can be used to add new features to the system
by programming them in the SAS command language.
SAS will link via its Access product to most major database packages.
The SAS system is a very
large piece of software backed up by a very impressive array of manuals. It is probably the most heavily documented statistical
package available, with dozens of manuals from very introductory guides through
to extremely technical reports. In
addition to the manuals there are a number of computer-based training modules
available.
Weaknesses of SAS
The major problems with
SAS are that there is a lot to learn to use it effectively, and the syntax
is non-trivial. This would be a major
problem if you were the sort of person who only uses a statistics package
from time to time. You will only learn
how to use SAS by a lot of hard work and experience.
This probably accounts for why SAS is more popular in commerce than
in academic circles, as academics rarely do enough data analysis to make all
the learning worthwhile. SAS has what
amounts to a programming language, which provides the 'glue' to hold all the
bits together. SAS gurus can manipulate
this language to do many things and so in theory SAS can do almost anything.
Unfortunately, although this is true, there are as a result many quite
simple tasks that can only be achieved by rather convoluted routes.
A quick glance through the SAS user magazine (SAS Communications) will
show you that many SAS users ask very simple questions and are given 10-20
line SAS programs to type in order to provide facilities that arguably ought
to be provided as commands or procedures in their own right. SAS has for instance no function to calculate
a factorial, but you can calculate one in a few lines of SAS code. To an extent SAS suffers from being a jack-of-all-trades.
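To show how little is involved in the factorial example, here are two ways of doing the calculation, written in Python rather than SAS so as not to misquote SAS syntax. The first is the obvious loop; the second uses the identity n! = Gamma(n + 1), which is the usual route in any language that offers a gamma function but no factorial. Both function names are our own.

```python
import math

def factorial_loop(n):
    """n! by repeated multiplication; exact for any non-negative integer."""
    if n < 0:
        raise ValueError("factorial is not defined for negative integers")
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

def factorial_gamma(n):
    """n! via the gamma function; relies on floating point, so only
    trustworthy for small n."""
    return round(math.gamma(n + 1))

print(factorial_loop(5), factorial_gamma(5))   # 120 120
```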
Graphics
SAS has the usual character
graphics procedures as well as some that other packages do not have, such
as three-dimensional barcharts. Where
SAS really comes into its own is in relation to high-resolution graphics.
There is a fairly wide range of chart types and each of these has many
options to alter the color, fonts, layout etc.
In addition to graphs, SAS has a procedure to draw maps and a large
number of map data sets for various parts of the world at various scales.
In addition to the individual graphical procedures, graphs can be overlaid
on each other to make quite sophisticated finished results. A very large number of graphics device drivers
are supplied with the package and so you can certainly output your results
on most common types of hard-copy device. There are also the facilities to write your
own driver if you have a particularly obscure plotter and the inclination
to write graphics device drivers.
The biggest drawback of
the SAS graphics procedures is that they are entirely command driven. This would come as something of a shock to anyone
used to presentation graphics packages on a PC or who is used to the mouse-driven
world of the Apple Macintosh.
Graphical User Interfaces
For a long time SAS has
had separate windows on the screen for the program editor, the log and the
program output, so all versions of SAS have a graphical user interface of
sorts. The PC and SUN versions have
pop-up windows for help and other useful things. As part of the SAS product there is a menuing
system that enables tailor-made applications to be set up for non-SAS users,
and there is a Full Screen product which contains the tools to handle data
on the screen as though it were on a form or series of forms. These items make SAS a tailorable full-screen
environment so that appropriate graphical interfaces can be set up to meet
your needs (assuming that you have the time to learn how to use them properly).
Summary
The SAS system is a comprehensive
piece of data handling software, which achieves a reasonable standard across
its many areas. It offers a very high
level of integration of functions usually only available in separate packages.
There is a great deal to learn to use the package effectively and so
it is likely to be of most interest to those who spend a lot of their time
handling data, and of less interest to casual users of statistical packages. It offers the best set of graphics procedures
currently available within a multi-platform statistical package and is capable
of linking directly to the major database packages.
Experimental versus Observational (or Survey) Data
Each package handles one
type of situation better than the other. The
reason for this is that the packages were written to tackle the specific types
of problem encountered by a group of researchers who were involved in either
experimental or observational science.
There is considerable overlap
in the range of techniques used by the two types of researcher, but there
are also techniques that are largely confined to one group or the other. The consequence of this is that these techniques
are better implemented and have more facilities in the package that is oriented
towards that type of data analysis. The
diagram below attempts to indicate the orientation of the packages.
Experimental Observational
<-----------------GENSTAT 5---------------->
<-------------BMDP------------>
<------------MINITAB------------->
<-------------GLIM------------->
<--------------SPSS-------------->
<----------------------SAS---------------------->
Note that packages in the
middle tend to be general in their applicability, not specializing in either
experimental or observational data.
Interactivity
Interactivity refers to
the ease with which you can let the data itself suggest the appropriate form
of analysis and so, by working with the data, arrive at conclusions. This approach is the antithesis of pushing the
data in at one end of the package and expecting to see pearls of wisdom emerge
from the other. Many important aspects
of the data may be missed by a failure to examine the data properly.
The degree of interactivity in the different packages is summarized
below.
Interactive                                        Non-Interactive
MINITAB     GLIM     SPSS     GENSTAT     BMDP     SAS
Ease of Use
This would appear at first
glance to be an easy attribute to determine, but unfortunately it is rather
multi-dimensional in nature. The following
characteristics all need to be considered:
·
The complexity of the
command syntax.
·
The flexibility of
the command syntax.
·
The ease with which
a particular standard analysis may be specified.
·
The ease with which
important elements can be identified in the results.
·
The readability and
helpfulness of the manuals.
·
The amount and type
of information available in the help system.
There is also the statistical
error checking which ensures that what you are trying to do does not infringe
major assumptions, and guidance in the choice of appropriate techniques.
Balancing all these elements
together is not an easy task, but MINITAB would probably come out best and
GLIM is definitely only for those with a thorough statistical background.
SPSS will allow you to
do almost anything, fairly easily, right or wrong (never mind the quality,
just look at the volume of output). GENSTAT
has a more difficult learning curve but carries out many more checks on your
behalf. SAS is a very wide-ranging
package but has quite a steep learning curve and a syntax that takes a bit
of getting used to. BMDP programs make
short work of specific well-defined problems, so long as your problem matches
what the program does, but the programs are not very flexible.
Availability
The six packages discussed
in detail above are the major packages available on machines within the University.
They are not the only packages in use, but they have one thing in common:
they are all available on a range of machine types.
It is possible to have each of them except SAS on VME, CMS, VMS, SUNs
and PCs, although the University does not have licences for every combination
of package and machine. This means that whatever machines provide the
computing power in the University in the future, it is likely that the same
packages will be available to you. The
current availability of statistics packages is summarized in the table below.
+-----------+--------------------------------------------+
|           |                   System                   |
+-----------+------+------+------+------+------+---------+
| Package   | CMS  | VME  | VMS  | SUN  |  PC  |Macintosh|
+-----------+------+------+------+------+------+---------+
|MINITAB    |  A   |  A   |  P   |  P   |  S   |    S    |
|SPSS       |  P   |  A   |  P   |  P   |  S   |    S    |
|GENSTAT 5  |  A   |  A   |  P   |  P   |  S   |    N    |
|GLIM       |  A   |  A   |  P   |  P   |  L   |    N    |
|BMDP       |  A   |  A   |  P   |  P   |  L   |    N    |
|SAS        |  P   |  N   |  P   |  S   |  S   |    N    |
+-----------+------+------+------+------+------+---------+
Key: A = Available, P = Possible, S = Site Licence,
     L = Limited Licence, N = Not Available
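For anyone who wants to query the availability table rather than read it, the same information can be held in a small lookup structure. The Python sketch below simply transcribes the table as it stands at the date of this document; the variable and function names are our own.

```python
SYSTEMS = ("CMS", "VME", "VMS", "SUN", "PC", "Macintosh")

# One code per system, in the order of SYSTEMS, copied from the table.
# A = Available, P = Possible, S = Site Licence,
# L = Limited Licence, N = Not Available
AVAILABILITY = {
    "MINITAB":   "AAPPSS",
    "SPSS":      "PAPPSS",
    "GENSTAT 5": "AAPPSN",
    "GLIM":      "AAPPLN",
    "BMDP":      "AAPPLN",
    "SAS":       "PNPSSN",
}

def status(package, system):
    """Return the availability code for one package on one system."""
    return AVAILABILITY[package][SYSTEMS.index(system)]

def packages_with(system, codes):
    """List the packages whose code on `system` is one of `codes`."""
    return [p for p in AVAILABILITY if status(p, system) in codes]

print(packages_with("PC", "S"))    # packages with a site licence on the PC
print(packages_with("CMS", "A"))   # packages available to all CMS users
```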
There are many good packages
that are restricted to one particular type of machine, such as Statgraphics
on the PC, SYSTAT on a number of machines, or S on the SUN and other Unix
machines. These packages do not however
possess the hardware independence of the packages discussed in detail in this
document.
Where a site licence is
available, the package may be obtained for a nominal charge from Computing
Service Reception. Where there is a
limited licence, the package may be available by negotiation with the department
which paid for the licence (contact the author for further details).
Packages designated as 'Available' are available to all registered users
of the particular system (all members of the University may apply for registration
on any system). Packages designated as 'Possible' could become
available if there were a sufficiently broad demand from across the University,
or if any departments were willing to pay the licence fee concerned.
Training
Currently the Computing
Service runs courses to introduce new users to MINITAB, SPSS and GENSTAT 5.
These courses are designed to get users started in the use of the package,
to outline the major concepts underlying the package and to introduce the
user interface and help facilities. The courses are not designed to teach statistics,
but obviously a discussion of some statistical questions is inevitable. Attendees will gain more from the courses if
they have a clear idea of the types of statistical procedures that they need
to employ in their work and how to interpret the results that the procedures
produce.
Experience with the courses
would confirm other indications, from around the University, that there is
a need, especially amongst postgraduate students, for courses to teach particular
types of statistical analysis. If departments
perceive this as an unsatisfied need then they should pursue it through the
appropriate channels, as the Computing Service has currently no remit or resources
for such an activity.
With regard to GLIM, it
is assumed that people who wish to use GLIM know enough about what they
are doing to manage without introductory courses, although such courses could
be arranged if there were a demand. In
the case of BMDP, as the package is a collection of essentially separate programs,
it is difficult to teach an introductory course except in the use of a specific
program. Such a course would not be
of general interest, but would have to be targeted towards a small group of
individuals with similar data analysis requirements.
Introductory SAS courses
could be made available if there was sufficient demand for them.
The packages discussed
in detail above by no means provide the only way to do statistical analysis
on a computer. There are many packages
that have been written for one particular type of machine and make very good
use of the features of that machine. The
trouble with such packages is that they are only available to a minority of
people and are therefore very much more difficult to support in a multiple
machine environment such as a university.
In addition to statistical
packages, there are two other ways of getting statistical results: from subroutine
libraries called from user-written programs and from other packages such as
spreadsheets that have some statistical functions even though their primary
purpose is not statistical analysis.
10.1 NAG Library
The NAG Library, which
is available across a wide range of machines at Glasgow, is a series of subroutines
for numerical analysis, statistical analysis and graphics. The routines are called from Fortran programs
written by the user. The statistical
routines cover a wide range of techniques, from simple univariate statistical
summaries to complex multivariate techniques.
In addition the NAG Library also contains the numerical tools to construct
statistical procedures from scratch, if you understand this sort of thing.
The routines for statistical analysis are all contained in chapter
'G' of the NAG Library. The sections in this chapter are as follows:
G01 Simple calculations on statistical data
G02 Correlation and regression analysis
G03 Multivariate methods
G04 Analysis of variance
G05 Random number generators
G07 Univariate estimation
G08 Nonparametric statistics
G11 Contingency table analysis
G13 Time series analysis
Full details of these routines
are given in the NAG Library documentation, a copy of which is available for
consultation in the Computing Service Advisory room. The statistical routines within the NAG library
are of most use to people who want to build statistical procedures into their
own programs which carry out other functions as well. If you only want to do statistical analysis
you are probably better off with a statistics package.
10.2 EXCEL
EXCEL is a spreadsheet
package available on both Macintoshes and PCs, which, in addition to having
built-in graphics, is also capable of carrying out some statistical functions.
These functions enable a number of simple statistical procedures to
be carried out, including the calculation of means, variances and standard
deviations, linear and exponential regression and correlation.
As the basis for spreadsheets is that a cell can be calculated as a
formula relating that cell to other cells, statistical procedures not explicitly
available in EXCEL can be carried out fairly easily, so long as you know the
formula required. If you already use
EXCEL and if it can carry out the statistical analyses that you require then
there is no need to use a dedicated statistics package.
This can save you the effort of having to learn to use another package.
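The point about building statistics from formulas applies to any tool that can do arithmetic on columns of numbers. The sketch below uses Python rather than spreadsheet formulas, with hypothetical function names, to show the textbook definitions behind the simple procedures listed above: the mean, the sample variance and standard deviation, the correlation coefficient and the least-squares straight line.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Sample variance (divisor n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def std_dev(xs):
    return math.sqrt(variance(xs))

def covariance(xs, ys):
    """Sample covariance (divisor n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    slope = covariance(xs, ys) / variance(xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept

# Made-up data lying close to a straight line (illustration only).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(linear_fit(x, y), correlation(x, y))
```

Each function corresponds to a formula that could equally be typed into a spreadsheet cell, which is why a spreadsheet is adequate for simple analyses so long as you know the formula required.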
11. Software Support Categories
The Computing Service currently
has a number of levels of support for software, referred to as support categories.
Category A is the highest; the Computing Service provides courses,
help, documents and telephone support for users of packages in this category.
Category B is similar to A except that the help is provided by another
department. The packages described
in this document are all in category A, except for BMDP and SAS, which are
in category B.
You are recommended to
choose a package in category A or B unless there are good reasons not to. If you choose to use a package in category C
or D then there may not be any help available if you have problems. If the needs of your work indicate a category
C or D package then you will probably have to make your own arrangements for
help and training.
The packages that have
been placed in category A are all well-established and available across a
wide range of machines and operating environments.
The Computing Service is therefore confident that they will continue
to be available whatever equipment is in use in the future. These packages will not necessarily provide
the best facilities on any particular machine, but because of their consistency
across environments it is possible to provide a high level of support for
them.
Packages in category B
have been chosen by certain departments either for historical reasons or else
because they are particularly suited to the needs of that department, and
so there is considerable expertise in the use of the package within the department
concerned.
Packages in category C
have particular appeal to some people because of the facilities they provide
or in relation to the environment in which they operate. In most cases the expertise in their use is
somewhat limited.
If you require more specific
advice on the choice of a package, you are welcome to contact the author on
ext. 4821 or enrol for a course by
contacting the Advisory Service on ext. 4831. If
you want to know more about the facilities available in a particular package
you should consult the relevant manuals, which are available in the Advisory
room in the Computing Service.
Information about how to
use the packages on a particular computer system is provided in a series of
short Computing Service user notes, which are freely available from the Advisory
room. At present the following user
notes are available:
UN 21 MINITAB Reference Summary
UN 502 SPSS/pc+
UN 28 GENSTAT Procedure Library
UN 508 GENSTAT on 8086 or 80286 PCs
UN 509 GENSTAT on 80386 or 80486 PCs
Note that the mainframe
version of the SPSS package is known as SPSSx.
A detailed introduction to the SPSS package itself is provided in a
Computing Service user guide:
UG 20 SPSSX Introductory Guide
This document is over sixty
pages long, and is available for a small charge (currently £1) from Computing
Service Reception. A copy of the Minitab
Reference Manual is also available for purchase from Reception.
A user guide produced by
the University of Liverpool that gives an introduction to the use of SAS on
a PC may be obtained on request to the Advisory Service. Users of the CMS system can access or print
this by typing DOC SAS while logged on to CMS.
If you obtain any site-licence
software from the Computing Service you will also receive an installation
note which provides assistance with installing the software on your own machine.
All the above documents
include details of the manuals available for the relevant package. A copy of each of these manuals is available
for reference or overnight loan in the Advisory room.
The Computing Service welcomes
feedback on its user documentation. Please
send your comments on this particular document to James Currall in the Computing
Service (J.Currall@compservgla.ac.uk).