http://www.gla.ac.uk/Compserv/Doc/General/un11.html#Section9

 

 

University of Glasgow Computing Service

 

User Note

Choosing a Statistical Analysis Package

Number: UN 11/1

Author: James Currall

Date: June 1991

Summary: This document suggests strategies for selecting which statistical software package to use, and summarises the characteristics of the available software.

 

Contents

 

1. Introduction

2.  Methods of Choosing Software

3.  MINITAB

4.  SPSS

5.  GENSTAT 5

6.  GLIM

7.  BMDP

8.  SAS

9.  Comparison of Packages

10. Other Packages

11.  Software Support Categories

12.  Further Information

 

 

1.  Introduction

 

There are many items of software available via the Computing Service at Glasgow which are capable of carrying out statistical analyses.  It is not possible to recommend a single piece of software as being the best to use for all applications, as the choice depends a great deal on exactly what you want to do and how you want to do it.

 

This document is intended to help anyone needing to carry out statistical analysis to choose the most appropriate software for their needs.  Careful consideration of which software to use may slightly delay the start of the analysis, but experience has shown that in the long term this will save time and effort and also provide a much better chance of obtaining valid results from the analysis.

 

There are from time to time new releases of packages which add features or remove existing problems.  This document will be updated to take account of this, but between editions some details given in the document may not reflect the current situation.

 

 

 

 

 

2.  Methods of Choosing Software

 

The Classical Approach

 

A classical approach to the problem of selecting which statistical software package to use, though perhaps a rather naive approach, would be to decide what analysis needed to be carried out and then to scan a list of all the analyses available in each package to find a match.  For an individual package the answer would be a clear yes or no.  There are, however, four weaknesses in this approach:

 

·         Not all packages refer to the same technique by the same name, and so the technique you are looking for may not appear to be available, when in fact it is. 

·         The same name may imply different things in different packages, or more subtly, a greater or lesser degree of breadth from one package to another.  For example, a range of statistical packages can reasonably claim to carry out analysis of variance (ANOVA), yet the level of complexity and the type of problem that can be handled is very different indeed. 

·         More than one package may offer the analysis that you nominally require.  In this case, more information is required in order to choose between contenders. 

·         There may be no package that appears to offer the analysis you are looking for. 

 

 

The Common Approach

 

It might appear on the surface that you would have less of a problem if you did not know, at the outset, what analyses you required, since you could then select any package and use a selection of what is on offer.  This approach, above all others, runs a severe risk of failure, as you may well end up with a set of statistical tools inappropriate to your analysis.  While it may seem amusing to suggest that statistical analyses be selected in this way, empirical evidence suggests that this is a common approach.

 

 

The Reasonable Approach

 

Another approach, which might seem to be more intuitively reasonable, would be to use the same statistical package as your colleagues.  This approach appears to have two main advantages:

 

·         You could get help with getting started and with ongoing problems. 

·         Your colleagues are likely to be from the same nominal discipline as yourself and therefore have the same analysis needs. 

 

The first advantage should certainly not be underestimated, but the second relies on the often-unrealistic assumption that your colleagues made the right decision when they decided what to use in the first place.


The Recommended Approach

 

A rather better and more reasoned approach would be to look beyond the particular analysis that you have in mind for your data, to your general data analysis philosophy.  Broadly speaking, the following questions should be asked:

 

·         What class of analysis techniques do you require?

·         Are your data predominantly continuous or categorical?

·         Are your response variables measurements, counts or proportions?

·         Are your data from experiments or surveys?

·         Do you want to explore your data one step at a time or process it all at once?

·         Are there repeated measures or multiple responses?

·         Is prediction or description your ultimate goal?

 

This broader, more philosophical view of your requirements can then be matched to the philosophy of a statistical package.  This approach produces much less overlap between packages than does a similar matching process based on individual statistical techniques.  It is also more likely to cushion you from the discovery, at a future date, that other related analyses are not in your chosen package.

 

One important consideration is how the package sees data and what types of data structure it supports.  Some packages can only see data as a set of cases, for each of which the same set of measurements or responses are available.  This implies data that has a table or flat file structure similar to that found in relational databases.  Such packages will allow little if any departure from this condition except by carefully controlled, and often compartmentalized, exceptions.  The selection of such a package, for better or for worse, will shape the course of your analysis.  This may not be detrimental but will depend on your data and what needs to be done with it.
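As a rough illustration of the flat file structure described above, the following Python sketch holds a small data set as a list of cases, each carrying the same set of variables.  The variable names and values are invented for illustration, and the helper functions are hypothetical rather than part of any of the packages discussed.

```python
# Hypothetical flat-file data: every case records the same variables.
cases = [
    {"id": 1, "age": 34, "height_cm": 172.0},
    {"id": 2, "age": 51, "height_cm": 165.5},
    {"id": 3, "age": 27, "height_cm": 180.2},
]


def is_flat(rows):
    """Return True if every case has exactly the same set of variables."""
    if not rows:
        return True
    expected = set(rows[0])
    return all(set(row) == expected for row in rows)


def column(rows, name):
    """Extract one variable (a column) across all cases."""
    return [row[name] for row in rows]


print(is_flat(cases))
print(column(cases, "age"))
```

Data that cannot be forced into this shape, for example a different number of measurements per case, is where a case-oriented package is likely to become awkward.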

 

If you select a package that is philosophically aligned with your type of work, you will not have to face the difficult choice between continuing to use a sub-optimal package and changing to another package because you cannot do what you want with the present one.

 

Having discussed the subject in rather general terms, the rest of this document deals with the strengths and weaknesses of individual packages.  Each package is considered in terms of its philosophical base, the types of data structure supported, and the types of analysis that are available.  At the end of the document, the packages are compared and such issues as availability and training are considered.


 

3.  MINITAB

 

The Philosophical Base

 

MINITAB was written as a vehicle for the teaching of statistics.  It was designed to be easy to use, so that the students could concentrate on the statistics and forget about the complexities of computing.  From the beginning the package was designed to be used interactively as well as in the more traditional batch mode of operation.  This genesis is reflected in the well-developed help system, which can be interrogated at any stage to help sort out problems of syntax and usage.  In fact it is possible to learn to use MINITAB from the help system alone, and MINITAB is one of the few statistical analysis packages which can be used entirely satisfactorily without the aid of manuals.  The MINITAB philosophy can be summarized as simple techniques, with simple-to-use commands in an interactive environment.

 

In general, small to medium-sized data sets are appropriate, if only because even quite simple techniques carried out on large data sets can be very anti-social in an interactive environment and may not be very interactive anyway.  In recent years MINITAB has acquired some more advanced techniques, but some of these are not as comprehensive as their counterparts in other packages and may therefore not meet your requirements.

 

MINITAB and Data

 

MINITAB recognizes three data structures: columns, constants and matrices.  The interpretation of the column is very similar to that in spreadsheet software.  You might refer to columns as variables or variates.  MINITAB does not distinguish between categorical, ordinal and interval data types; they are all columns and can (in theory at least) be used for input into any technique.  This implies that if males are coded 1 and females are coded 2 then MINITAB will have no problem with calculating the mean (or average) sex.  Most analyses in MINITAB are carried out on columns but there are also constants, which are used to store single values such as means or counters.  The final data type in MINITAB is the matrix, which is a two-dimensional data structure rather like a table.  Matrices are generally used in MINITAB for matrix calculations and perhaps have less use in the simple statistical analyses which are commonly carried out in MINITAB.
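The "mean sex" problem is easy to reproduce in any system that treats codes as plain numbers.  The short Python sketch below is not MINITAB syntax; the codes and data are invented for illustration.

```python
from statistics import mean

# Hypothetical coding: 1 = male, 2 = female.
sex = [1, 2, 2, 1, 2, 2]

# Nothing stops the arithmetic, but the result has no sensible interpretation.
mean_sex = mean(sex)
print(mean_sex)

# A frequency count is the appropriate summary for a categorical variable.
labels = {1: "male", 2: "female"}
counts = {labels[code]: sex.count(code) for code in labels}
print(counts)
```

The software cannot tell which of the two summaries is meaningful; that judgment is left to the user.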

 

MINITAB is able to recognize missing data using a symbol (*), and cases with missing values are generally ignored when carrying out analyses.
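The effect of a missing-value symbol can be sketched as follows.  This is an illustrative Python fragment, not MINITAB code; the function names and the data are assumptions made for the example.

```python
from statistics import mean


def parse_column(text):
    """Convert whitespace-separated values to floats, treating '*' as missing."""
    return [None if token == "*" else float(token) for token in text.split()]


def mean_ignoring_missing(values):
    """Mean of the observed values only; cases with missing values are skipped."""
    observed = [v for v in values if v is not None]
    return mean(observed)


weights = parse_column("61.5 * 70.0 66.5 *")
print(weights)
print(mean_ignoring_missing(weights))
```

Note that the mean is based on three cases rather than five, so it is worth checking how many cases an analysis actually used.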

 

 

Strengths of MINITAB

 

MINITAB is excellent for exploring your data.  It has a very good range of descriptive statistics and also has most of Tukey's Exploratory Data Analysis (EDA) techniques.  These enable you to have a really good look at your data to spot typing mistakes and outliers and see what the data distribution looks like, before moving on to hypothesis-testing techniques.  In this context MINITAB has many simple graphical procedures to enable you to visualize your data as well as the more traditional summary statistics.  There are also high-resolution graphics to produce clearer representations of your data, but in the current versions of MINITAB these are not of presentation quality and you would probably have to use a graphics package to produce the graphs for your reports.
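One of the simplest EDA ideas, flagging values that lie far outside the quartiles, can be sketched in a few lines of Python.  This is an illustration only and not MINITAB's own implementation: the 1.5 times interquartile range rule is the conventional boxplot fence, the quartile method is whatever the Python standard library uses by default, and the data are invented (with 410 standing in for a mistyped 41).

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k interquartile ranges beyond the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def flag_outliers(values, k=1.5):
    """List the values lying outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


data = [38, 41, 39, 42, 40, 43, 410, 37, 44, 40]
print(tukey_fences(data))
print(flag_outliers(data))
```

A flagged value is a prompt to go back to the original records, not an instruction to delete the case.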

 

Simple hypothesis tests of both parametric and non-parametric varieties are well covered and easy to use, and the results are presented in an easily understood fashion.  The area of more complicated analysis that MINITAB is good at is regression, in both the simple and multiple forms.  It will do multiple regression stepwise, with most of the features that you find in other packages.

 

From Release 8 both the Macintosh and PC versions of MINITAB include a spreadsheet-style layout and full-screen editor.  This makes the input and correction of data much easier than using the traditional READ and LET commands, although these are also available as part of the main package.

 

MINITAB has a crude macro facility to enable users to automate repetitive sequences of commands or create new facilities.  An extensive user group macro library is available containing many useful procedures.

 

 

Weaknesses of MINITAB

 

The major weaknesses of the package are in the areas of designed experiments and multivariate analyses, although to be fair, a package which started as a teaching tool could hardly be expected to have a large number of fully-featured multivariate techniques.

Until release 6, the only types of designed experiments that could be directly analyzed were one-way ANOVA and simple two-way ANOVA designs.  The scope has since been enhanced by the inclusion of more flexible ANOVA and ANCOVA commands.  These are not however the answer to every experimenter's dreams.  For example, MINITAB could not handle an experiment with multiple error strata and give you all the correct F values using all the appropriate error terms.  In addition these facilities will only cope with fully balanced designs.  At release 7 a new command that could handle unbalanced designs was introduced, but for correct results the designs should still be orthogonal and there is no check to see that this is so.  You must therefore understand exactly what you are doing before using these facilities or you might end up with incorrect or just plain meaningless results.
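Because no check is made by the package, it may be worth checking the design yourself before running the analysis.  The Python sketch below is a hypothetical helper, not a MINITAB facility.  It covers only the two-factor case, and it assumes the commonly quoted condition that the two main effects of a two-way classification are orthogonal when the cell frequencies are proportional to the marginal frequencies.

```python
from collections import Counter


def is_balanced(rows):
    """True if every combination of factor levels occurs equally often."""
    counts = Counter(rows)
    a_levels = {a for a, _ in rows}
    b_levels = {b for _, b in rows}
    sizes = {counts[(a, b)] for a in a_levels for b in b_levels}
    return len(sizes) == 1 and 0 not in sizes


def has_proportional_frequencies(rows):
    """True if n_ij * n == n_i. * n_.j for every cell of the two-way layout."""
    counts = Counter(rows)
    a_totals = Counter(a for a, _ in rows)
    b_totals = Counter(b for _, b in rows)
    n = len(rows)
    return all(counts[(a, b)] * n == a_totals[a] * b_totals[b]
               for a in a_totals for b in b_totals)


# Each tuple is the (factor A level, factor B level) of one observation.
balanced = [("low", "x"), ("low", "y"), ("high", "x"), ("high", "y")] * 2
proportional = ([("low", "x")] * 1 + [("low", "y")] * 2
                + [("high", "x")] * 2 + [("high", "y")] * 4)
awkward = ([("low", "x")] * 1 + [("low", "y")] * 2
           + [("high", "x")] * 2 + [("high", "y")] * 1)

for name, design in [("balanced", balanced), ("proportional", proportional),
                     ("awkward", awkward)]:
    print(name, is_balanced(design), has_proportional_frequencies(design))
```

A design that fails the second check needs a method intended for non-orthogonal data, whatever the package will let you type.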

 

The whole area of assumption checking is one that MINITAB is not good at.  In the early days the techniques were simple and little checking was needed, but as more sophisticated facilities have been added the need has increased, but not the delivery.  MINITAB is used extensively by people with little statistical background and therefore the need for checking is great, to help stop the users falling into the many pitfalls along the way.  The reason for the lack of checking and warning is that it requires a lot of code to carry out the checks.  Writing code costs money and there are no new features to be seen at the end of it.  Selling software is largely a question of how many features are provided, not how well those features are implemented.  The multivariate analyses are not the full-bodied, feature-packed routines that are available in other packages, and so if multivariate analyses are primarily what you require then you will need to look elsewhere.

 

 

Graphics Facilities

 

MINITAB has a very good range of character graphics type charts.  There are several different types of scatter diagrams, box and whisker plots, contour plots and histograms.  This type of presentation is good for model and assumption checking.  There are also high-resolution versions of most of these procedures, which produce diagrams on the screen, or on a plotter.  A moderate range of plotters is supported (but most of these are ones that use HPGL to drive them) as are a range of dot-matrix printers but not laser printers.  Although these versions of the procedures provide higher resolution, they do not give the user much flexibility and therefore are unlikely to be suitable for inclusion in published papers etc.  At release 9 of MINITAB the graphics will be completely rewritten and will provide facilities similar to specialist graphics packages (this release is likely to be available towards the end of 1992).

 

 

Graphical User Interfaces

 

Both the PC and Macintosh versions have what are referred to as graphical user interfaces at release 8 of MINITAB.  These have help systems and separate windows for commands, output, logs etc.  You can choose your commands from pull-down menus or type them directly at a prompt.  These interfaces make an already user-friendly package even more friendly but are not yet available on mainframe systems.

 

Summary

 

If you have small to medium-sized data sets and want descriptive statistics, graphical representations of your data and simple hypothesis tests, then you can't do much better than MINITAB.  You can also have good regression facilities and simple analysis of variance.  It is an excellent teaching package and very easy to use, with plenty of help available.  It is one of the few general packages that offers control charting techniques (from release 7), although I have never used them and cannot comment on how good the facilities are.  However, do not expect a great deal beyond this; there are other packages that do more complex or specialized analyses better.

 

 

4.  SPSS

 

The Philosophical Base

 

SPSS used to stand for Statistical Package for the Social Sciences, although now the company claims that it stands for Superior Performing Statistical Software.  The former is a more accurate reflection of reality and gives an indication of where the package came from.  Another important point is that the package dates from the swinging sixties, when all computing was done on mainframe computers using a batch mode of processing.  Little changed in the mainframe versions until release 3 (available on the VME system at Glasgow) when the possibility of interactive working was introduced.

 

A more 'user friendly' form of the package with windows and editing of commands became available with the PC version, and this trend has continued with OS/2 and Apple Macintosh versions.  However, these advances only enable you to issue the same style of SPSS commands as in the earlier mainframe releases.  Essentially SPSS's strength is survey data, in sets as large as you care to imagine.  Many of the things that you would want to do to survey data are available in a full-bodied form, although there are some notable omissions (see later).  It is not really suited to interactive working (with large data sets you wouldn't want to work interactively anyway) and although it will run on a PC, with a sizeable data set you might need to leave it to run overnight!

 

 

 

SPSS and Data

 

SPSS recognizes three data types, although one of these has to be stored in a file rather than in memory.  SPSS's major data types are the numeric and character variables.  All numeric variables are stored as real numbers even if they are integers.  There is no distinction between interval, ordinal and categorical data; they are all just variables.  This arrangement both simplifies things and creates problems.  If you have an ordered categorical variable with values 1, 2, 3 and 4, and you wish to select the cases with values less than or equal to 3, you should select values less than 3.5 to be sure of getting the correct cases.  This is because a value held as a real number, particularly one that results from a calculation or conversion, may not be stored as exactly 3 and could instead be held as, say, 2.9999999 or 3.0000001.  The same problem occurs with recoding.  The problem would not arise if there was a data type in SPSS which used integers for categorical variables.  The moral is to be careful and is especially applicable to survey data where a large number of variables are categorical.
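The hazard of comparing real numbers with whole-number category codes can be demonstrated in any language that uses binary floating point.  The Python sketch below is not SPSS syntax and the helper name is hypothetical.  In present-day floating point a small whole number entered directly is normally held exactly, so the example uses a value that arrives at 3 through arithmetic, which is where the difficulty tends to show up.

```python
# Conceptually category 3, but produced by arithmetic rather than typed in.
computed_three = (0.1 + 0.2) * 10

categories = [1.0, 2.0, computed_three, 4.0]

# Naive selection: compare directly with the whole number.
naive = [v for v in categories if v <= 3]


def select_at_most(values, limit, margin=0.5):
    """Select codes up to `limit`, comparing against a point between categories."""
    return [v for v in values if v < limit + margin]


safe = select_at_most(categories, 3)

print(computed_three == 3)
print(len(naive), len(safe))
```

The naive selection silently drops a case, while the comparison against 3.5 keeps it, which is the precaution recommended above.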

 

SPSS is entirely case orientated and will only see data as being a set of variables measured or recorded on each of a number of cases.  Frequency distributions and contingency tables can only be handled if they are generated internally from raw case data, except in a few limited circumstances.
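The case-oriented view can be pictured as follows: the contingency table is never entered directly but is built by counting raw cases.  The Python sketch below is an illustration with invented data and a hypothetical function name; it is not SPSS syntax.

```python
from collections import Counter

# Hypothetical raw survey data: one record per respondent (case).
cases = [
    {"sex": "F", "answer": "yes"},
    {"sex": "F", "answer": "yes"},
    {"sex": "F", "answer": "no"},
    {"sex": "M", "answer": "yes"},
    {"sex": "M", "answer": "no"},
    {"sex": "M", "answer": "no"},
]


def crosstab(rows, row_var, col_var):
    """Count cases in each (row level, column level) cell."""
    return Counter((r[row_var], r[col_var]) for r in rows)


table = crosstab(cases, "sex", "answer")
row_levels = sorted({key[0] for key in table})
col_levels = sorted({key[1] for key in table})
for r in row_levels:
    print(r, {c: table[(r, c)] for c in col_levels})
```

If all you have is the finished table of counts, a strictly case-oriented package would need the cases to be reconstructed or weighted before it could analyze them.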

 

SPSS recognizes missing values as specific values outside the normal range of numbers for a particular variable, and will ignore cases with such values during most analyses.  SPSS will also handle matrices under certain circumstances, but these are stored in files not as variables, and they are only available to and from a limited range of procedures.
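Missing-value codes of this kind can be sketched in Python as follows.  The codes -9 and 999, the variable name and the helper function are all assumptions made for the example and are not taken from SPSS.

```python
from statistics import mean

# Hypothetical user-declared codes that lie outside the normal range of ages.
MISSING_CODES = {"age": {-9, 999}}

cases = [{"age": 34}, {"age": -9}, {"age": 51}, {"age": 999}, {"age": 27}]


def valid_values(rows, variable, missing_codes):
    """Return the values of `variable`, dropping any declared missing codes."""
    codes = missing_codes.get(variable, set())
    return [row[variable] for row in rows if row[variable] not in codes]


ages = valid_values(cases, "age", MISSING_CODES)
print(ages)
print(mean(ages))
# If the codes are not declared, they are treated as genuine ages.
print(mean(row["age"] for row in cases))
```

The last line shows why the codes must be declared: an undeclared 999 is simply averaged in with the real data.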

 

The area of data structures is where SPSS is most in need of improvement, as a package which is targeted at survey data needs to have ordered (ordinal) and non-ordered (nominal) categorical data types.

 

 

Strengths of SPSS

 

Cross-tabulation in SPSS is very good indeed, and with the addition of the TABLES product, the written output can be made to look extremely professional.  You can have your data subdivided into categories in several dimensions and then get a whole range of descriptive statistics for each cell in the categorization.  This is no more and no less than you would expect from a survey analysis package.  There is also a very good range of simple hypothesis tests of both parametric and non-parametric types.  Most forms of regression from simple to multiple and linear to non-linear and log-linear are well implemented, and all the bells and whistles are there.  In addition such multivariate techniques as factor analysis and discriminant analysis are available in a fully featured form (SPSS uses the term factor analysis in its generic sense to cover a range of related techniques, including principal components analysis).

 

The one major disappointment in the area of multivariate techniques is cluster analysis, where all the similarity measures offered are designed for interval data and there are none for mixed or binary data.  Much data in survey analysis is mixed or binary and there is little option but to turn to other packages for cluster analysis.  There is also an additional option called TRENDS, which does various forms of time series analysis.  A macro facility is available, but the SPSS syntax and data structures do not give sufficient flexibility to make it really useful.

 

 

Weaknesses of SPSS

 

The major weakness of SPSS is in its handling of designed experiments.  It either does things badly or in an extremely convoluted manner.  There are probably very few people in the world who fully understand the MANOVA command and how to make it do all the things that it is supposed to do.  It tries to do too much in one command and ends up doing almost everything in a totally counter-intuitive way.

 

Until recently there were very few techniques in the package designed for data which is at best ordinal, which is surprising for a package that is targeted at survey data, though the recent inclusion of multi-dimensional scaling and the optional extra CATEGORIES, which includes correspondence analysis, has improved matters.

 

In terms of the philosophy of statistics, SPSS will lead the unwary astray.  The statistical philosophy demands that you make your assumptions explicit before making a hypothesis test.  In most packages you have to make your assumptions clear in subcommands or options and will therefore be making a specific test.  In SPSS you get a series of answers which have different assumptions attached to them, and then you choose the answer that you like best.  SPSS does not prevent you from establishing a priori assumptions but it does not encourage you to do it.

 

In addition almost all SPSS commands have defaults for most of the choices between methods, and so if you do not specify anything you get an analysis which may well be inappropriate to your type of data and situation.  Considerable care is needed, especially with some of the more sophisticated techniques, in order to specify an appropriate form of analysis.  In order to do this a set of manuals is essential.  They are well written and they are the only place where you can find out exactly what the analysis is going to do to your data and what assumptions are to be made.  Many people remark that SPSS is easy to use, that they understand it and that one doesn't need manuals to use it.  It is in reality easy to misuse, many of the techniques are extremely difficult to understand, and if you use it without manuals you are in grave danger of seriously undermining your academic credibility.

 

Under most circumstances it is very difficult in SPSS to pass the results of one analysis as input to another, as it does not support the data structures to do this.  This reduces the flexibility of the package quite considerably.

 

SPSS is very poor at assumption checking; if it warns you about problems with your data you are in serious trouble, as such warnings are few and far between.  Much the same, in this respect, applies to SPSS as to MINITAB.  The warnings and the checks are all described in the manuals, but the checks have to be carried out by you on a preliminary analysis or analyses of the data, and then you need to modify the options accordingly - none of this will be done for you automatically.

 

 

Graphics Facilities

 

SPSS offers a range of character graphics facilities in all its versions.  These enable you to plot crude scatter diagrams etc.  High-resolution graphics is provided in most versions via third party packages.  When you issue an SPSS command to produce the graphics, it writes a file of commands and passes this to the other package (which is started up automatically by SPSS) to produce the graph.  This way of providing graphics has both advantages and disadvantages.  It means that SPSS Inc do not have to write graphics software, as they can just use someone else's, which might be better written than anything that they could write themselves.  It also means however that they can only use features that are available in the third party software (e.g.  neither Harvard Graphics nor Microsoft Chart, which are the packages used by the PC version, can produce error bars).  You have to buy the graphics package separately (which can be as much as six times the cost of a site license copy of SPSS).  There is also no guarantee that the package that you have bought will continue to be the one that SPSS uses in the future.  No high-resolution graphics is implemented on the mainframe version of SPSS at Glasgow.

 

 

Graphical User Interfaces

 

The versions of SPSS for the PC, OS/2 and Apple Macintosh have what are referred to as graphical user interfaces.  These have help systems and separate windows for commands, output, logs etc.  You can choose commands from menus and 'paste' them into the commands window, and thus build up standard SPSS commands from menus.  This is a two-edged sword; it makes it easier to get the command syntax right, but makes it even more tempting not to use the manuals, which are the only source of essential information about methods and assumptions.  These interfaces do make it easier to construct SPSS commands, to correct errors and tidy up output, and are much 'friendlier' than the mainframe systems, but the pitfalls are still there and are easier to fall into.

 

 

Summary

 

SPSS can handle very large data sets because it does not load all the data into memory at once; it just gets it from disk when it wants it.  It was designed for survey data, and that design philosophy has not changed much over the years.  It is excellent for sorting out and tabulating survey data, and it can handle a competent range of univariate and multivariate techniques.  However, if you have designed experiments to analyze then it is not for you.
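The idea of fetching data from disk as it is needed, rather than loading it all into memory, can be sketched as a single pass over a file that keeps only a running summary.  This Python fragment illustrates the general principle and makes no claim about how SPSS is implemented; the function name is hypothetical and an in-memory text stream stands in for a large data file.

```python
import io


def streaming_mean(lines):
    """Mean of one numeric value per line, reading the lines one at a time."""
    count = 0
    running = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        count += 1
        # Update the running mean without storing earlier values.
        running += (float(line) - running) / count
    return running, count


# Stand-in for a large file on disk; a real file object works the same way.
data_file = io.StringIO("2\n4\n6\n\n8\n")
result, n_cases = streaming_mean(data_file)
print(result, n_cases)
```

The memory needed does not grow with the number of cases, which is why the size of the data set is limited by disk space and patience rather than by workspace.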

 

 

5.  GENSTAT 5

 

The Philosophical Base

 

The original GENSTAT program was written to analyze agricultural experiments.  It was essentially an extremely powerful statistical analysis programming language.  Its syntax was difficult and inconsistent and the package was used only by enthusiasts.  The other perceived problem with the package was that you needed to know what you were doing in order to use it.

 

In the early 1980's the user interface to the package was completely rewritten, with a more logical and above all consistent syntax.  The result of this rewrite was a new package called GENSTAT 5.  The second 'problem' was also addressed, to the extent that the manuals were written in a way that made them easy to use and helpful in explaining the techniques in addition to the command syntax.  The new user interface was written with interactive use in mind, and a comprehensive hierarchical help system was included as part of this.

 

You still need to know what it is that you are doing in order to use GENSTAT 5, but you do not have to be a statistician.  If you find this idea off-putting and would prefer to use a package which does not make such demands on you, I would suggest that you reflect on the following.

 

In order to analyze data effectively, you need to specify the problem correctly, apply an appropriate technique, and then interpret the results produced. 

 

You gain nothing by putting your data through a technique and then failing to fully interpret the results that the package produces.  A package which allows you to carry out an analysis without understanding what you are doing only allows you to defer the knowledge gap to the interpretation phase.  It is therefore doing you no favors, as you run the serious risk of selecting an inappropriate technique if you start off not understanding what you are doing.

 

Visualization of data is an important part of the GENSTAT philosophy, and it is thus supplied with a very flexible range of high-resolution graphics facilities, which in themselves are sufficient reason for using it.

 

 

GENSTAT 5 and Data

 

GENSTAT 5 recognizes a wide range of data structures, which makes it very flexible to use and helps to avoid ambiguity.  The first structure is that found in all statistical analysis packages, the variable, which in GENSTAT is called the variate.  Variates are observed variables.  Classification variables have a different structure called the factor, which exists only as a set of ordered or unordered levels.  The value of this distinction is that functions appropriate to observations cannot be applied to the classifiers, and vice-versa.
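The value of separating observations from classifiers can be illustrated with two small Python classes.  These are a loose analogy only: the class and method names are invented for the example and do not reproduce GENSTAT syntax or behaviour.

```python
from statistics import mean


class Variate:
    """Observed numeric values; arithmetic summaries are meaningful."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def mean(self):
        return mean(self.values)


class Factor:
    """A classification with declared levels; only counting makes sense."""

    def __init__(self, values, levels):
        unknown = set(values) - set(levels)
        if unknown:
            raise ValueError(f"values not among declared levels: {sorted(unknown)}")
        self.values = list(values)
        self.levels = list(levels)

    def counts(self):
        return {level: self.values.count(level) for level in self.levels}


yield_t = Variate([4.2, 5.1, 3.9])
soil = Factor(["clay", "sand", "clay"], levels=["clay", "sand", "loam"])

print(yield_t.mean())
print(soil.counts())
print(hasattr(soil, "mean"))  # a Factor deliberately offers no mean
```

Because the classifier has no arithmetic operations, the "mean sex" type of mistake cannot be made by accident.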

 

Matrices are available as a number of different types of structure: rectangular, symmetric and diagonal.  In addition, there are table structures, single-value structures (called scalars in GENSTAT 5), and specialized structures for time series, latent roots and vectors, and sums of squares and products.  Finally there are text, expression and formula structures.  Structures of differing types may be combined into pointer structures.

 

If you think that this sounds unduly complicated, remember that you do not need to use all the structures, but the fact that the structures used for different purposes are differentiated allows appropriate functions to be applied to particular structures.  For example, to mimic MINITAB data structures you would only need to use scalars for constants, variates for columns, and matrices for matrices.

 

GENSTAT will cope with missing data by using a code which you select.  In some procedures it will not ignore cases with missing values, but will instead employ iterative interpolation techniques to estimate the missing values.
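The details of GENSTAT's algorithms are not given here, but the general idea of iterative estimation can be sketched for the simplest case: a two-way table with no replication and an additive (row plus column) model.  The Python function below is an assumed illustration, not GENSTAT's method.  Each missing cell is repeatedly replaced by its fitted value, row mean plus column mean minus grand mean, until the estimates stop changing.  The function name and data are invented, and convergence has only been considered for a single missing cell.

```python
def estimate_missing(table, tol=1e-10, max_iter=500):
    """Fill None cells of a two-way table by iterating an additive fit."""
    n_rows, n_cols = len(table), len(table[0])
    observed = [v for row in table for v in row if v is not None]
    start = sum(observed) / len(observed)  # initial guess: mean of observed cells
    filled = [[start if v is None else float(v) for v in row] for row in table]
    missing = [(i, j) for i in range(n_rows) for j in range(n_cols)
               if table[i][j] is None]
    for _ in range(max_iter):
        grand = sum(sum(row) for row in filled) / (n_rows * n_cols)
        row_means = [sum(row) / n_cols for row in filled]
        col_means = [sum(filled[i][j] for i in range(n_rows)) / n_rows
                     for j in range(n_cols)]
        change = 0.0
        for i, j in missing:
            fitted = row_means[i] + col_means[j] - grand
            change = max(change, abs(fitted - filled[i][j]))
            filled[i][j] = fitted
        if change < tol:
            break
    return filled


# Rows are blocks, columns are treatments; one plot was lost.
plots = [
    [10, 12, 14],
    [11, None, 15],
    [13, 15, 17],
]
completed = estimate_missing(plots)
print(round(completed[1][1], 6))
```

For a single missing cell the iteration settles on the same value as the classical closed-form estimate for a randomized block design, which for these data is 13.  The estimated value is a computational device and carries no extra information, so the residual degrees of freedom should be reduced accordingly.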

 

Strengths of GENSTAT 5

 

GENSTAT 5 has five main strengths:

 

·         It has a large range of statistical techniques and is particularly strong on designed experiments, regression, curve fitting, generalized linear modeling, and multivariate techniques.  Many of the techniques and algorithms incorporated into the package represent the state of the art. 

·         GENSTAT provides an impressive toolkit of arithmetic, statistical and matrix manipulation techniques, which enable experienced users to modify existing techniques and build new ones from scratch.  Using the tools, new techniques can very quickly be added to the package and made available via the procedure library.  Users may also use procedures to add new features of their own to GENSTAT, in the same way that macros are used in other packages.  There is therefore a very short lead-time before new analyses are added to the facilities available. 

·         The data structures enable the results of one technique to be directly imported into another.  Unlike many other packages, any result that can be printed can be used for further analysis. 

·         There are very good high-resolution graphics to facilitate data visualization and reporting, which unlike some other packages are of publication quality and are included in the package as standard. 

·         GENSTAT 5 is probably the best package available for statistical assumption checking.  This in fact makes it more suitable for inexperienced users than some of the other packages as it helps them to avoid some of the more fundamental errors. 

 

 

Weaknesses of GENSTAT 5

 

GENSTAT 5 does not have good facilities for the cross-tabulation of surveys and questionnaires, which is not altogether surprising considering its origins.

 

GENSTAT 5 is not as easy to use for simple techniques as other packages such as MINITAB.  In fact some of the simpler data exploration techniques are not included in the base package but are only provided through the procedure library.  Although the command syntax has been greatly improved it is perhaps less immediately intuitive than some other packages.

 

GENSTAT 5 does not have an infinite workspace and each data structure that you use will take some space.  Data is not read directly from disk for each procedure, and so there is, as with most other packages except SPSS, a limit to the size of problem that can be handled. 

The actual size of the workspace is related to the type of machine on which the package is running.  In the early releases of the package there were some efficiency problems which have now been addressed.

 

The 'problem' of needing to know what you are doing has already been addressed.  It is perhaps a matter of opinion as to whether or not this is a weakness.

 

 

Graphics Facilities

 

GENSTAT 5 offers a range of character based graphics including the ability to plot fitted lines on scatter diagrams (albeit rather crude ones involving commas and semicolons).  There is a considerable amount of control that the user can exert over the layout and style of scatter diagrams, histograms, box and whisker plots, contour plots and dendrograms. 

 

Unlike many packages, high-resolution graphics are built into GENSTAT 5 and are available in all versions (except the current implementation on CMS).  A moderate amount of control over layout, color and fill patterns is available for scatter and line graphs, biplots, box and whisker plots, contour plots, histograms and barcharts, piecharts, dendrograms and shade diagrams.  The devices that are supported depend on the version; the PC version supports plotters using HPGL and Epson dot matrix printers as well as the usual graphics screen drivers.  Release 2 of GENSTAT 5 (late 1991) will support graphical input and so it will be possible to indicate points on graphs with a mouse and thus identify an outlier in a graph.  It will also be possible to write procedures to perform scatterplot brushing techniques.

 

Summary

 

GENSTAT 5 is a very well featured statistics package with excellent high-resolution graphics facilities.  It requires a reasonable understanding of what you are doing but at the same time it is much better at assumption checking than many other packages.  There is slightly more to learn initially than for many other packages, which makes it less suitable for one-off or occasional use.  It is likely to be of particular use to those who need to do a moderate amount of statistics or use a wide range of techniques.  It will also be invaluable to those who wish to develop novel techniques, and it provides a wide range of graph plotting facilities in a fairly easy to use form.

 

 

6.  GLIM

 

The Philosophical Base

 

GLIM stands for Generalized Linear Interactive Modeling system.  It was written by the Royal Statistical Society, and continues to be developed by them.  It was written to enable statisticians to fit generalized linear models (GLMs) to their data, in a way that was not possible in other packages.  Many common statistical problems, such as regression, analysis of variance and covariance of designed experiments, log-linear models of counts, and logit and probit models of proportions, are just special cases of a GLM.  Even very simple problems such as t-tests can be specified as GLMs, and so in the hands of a statistician GLIM is a very powerful tool.
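The claim that a t-test is a special case of a linear model can be checked numerically.  The sketch below is written in Python rather than GLIM, and the data values are made up for illustration.  It computes the classical pooled two-sample t statistic and, separately, the t statistic for the slope when the response is regressed on a 0/1 group indicator; the two should agree.

```python
import math

def pooled_t(a, b):
    """Classical two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    s2 = (ssa + ssb) / (na + nb - 2)
    return (mb - ma) / math.sqrt(s2 * (1 / na + 1 / nb))

def slope_t(a, b):
    """t statistic for the slope when y is regressed on a 0/1 group indicator."""
    y = list(a) + list(b)
    x = [0.0] * len(a) + [1.0] * len(b)
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se

# Made-up illustrative measurements for two groups.
group_a = [4.1, 5.0, 4.7, 5.3]
group_b = [5.9, 6.4, 5.6, 6.8, 6.1]

t_classical = pooled_t(group_a, group_b)
t_regression = slope_t(group_a, group_b)
print(t_classical, t_regression)
```

The agreement follows because the fitted slope is the difference between the two group means and the residual sum of squares is the pooled within-group sum of squares.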

 

The recipe is very simple; there are facilities for reading in the data, specifying the model, fitting the model, and producing output as simple graphs, tables and reports.  For a package of such power the GLIM program is very small and will fit on a PC with only 256Kb of memory.  If you do not have a firm grasp of statistical modeling and do not know what link functions and error distributions are, it is definitely not for you, unless a statistician is guiding you.

 

 

GLIM and Data

 

GLIM recognizes only four types of data structure, two of which are column structures, like one-dimensional arrays in a programming language:

 

·         The variate is used to hold continuous variables, such as measurements.  This structure is assumed unless otherwise specified. 

·         The factor is used to store categorical variables, either nominal or ordinal.  Factors may be redefined as variates and vice versa as required. 

·         The scalar is used to hold single values.  Scalars are very important in GLIM because they are used to hold all the parameters pertaining to how well the current model fits and how much better it fits than the last one.  Users may also define scalars of their own if required. 

·         The pointer is used to hold the name of another structure, such as the name of the current Y variable. 

 


Strengths of GLIM

 

GLIM is excellent for exploring data and trying a series of models with different assumptions.  It is one of the few packages that allow you to assume error distributions other than normal ones; Binomial, Gamma and Poisson error distributions are also permitted.  The package also allows a range of link functions between the dependent and independent variables (exponent, logit, probit, log, complementary log-log, reciprocal and square root), as well as the straightforward identity relationship.  In fact the whole package is built around fitting models with appropriate link and error function combinations.
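For readers unfamiliar with link functions, the following Python sketch writes out most of the links named above together with their inverses, using the usual textbook definitions.  It is not GLIM code, the dictionary and its key names are purely illustrative, and the exponent link is left out because its parameterization is not described here.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # standard normal distribution, used by the probit link

# Each entry maps an illustrative link name to (link, inverse link).
# The link turns a fitted mean mu into the linear predictor eta.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "probit": (_std_normal.inv_cdf, _std_normal.cdf),
    "cloglog": (lambda mu: math.log(-math.log(1 - mu)),
                lambda eta: 1 - math.exp(-math.exp(eta))),
    "reciprocal": (lambda mu: 1 / mu, lambda eta: 1 / eta),
    "sqrt": (math.sqrt, lambda eta: eta ** 2),
}

# Applying a link and then its inverse should return the original mean.
mu = 0.3
round_trip = {name: inverse(link(mu)) for name, (link, inverse) in LINKS.items()}
print(round_trip)
```

The logit, probit and complementary log-log links are only defined for means strictly between 0 and 1, which is why they are normally paired with a Binomial error distribution.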

 

As a result of the compact nature of the package, it runs fast and is virtually identical on all the machines on which it is available (which is almost any machine with a Fortran compiler).  There is a good range of mathematical functions for transformations of variates, and simple line-printer type histograms and scatter diagrams to examine how well a specific model fits your data.  There is also the facility to write macros (rather like programs of GLIM commands) which can be used to store frequently used analyses.  The package is supplied with a library of macros written by other people to carry out particular types of model fitting.

 

 

Weaknesses of GLIM

 

GLIM has few bells and whistles and is written by statisticians for statisticians.  There is very little in the way of a help facility, except that if you make a mistake in specifying the arguments to a particular command, GLIM will give you an error message and a summary of the command syntax, so that you can try again.

 

GLIM is emphatically not a general-purpose statistics package and it cannot therefore really be criticized on the basis of the types of analysis that it cannot perform.  It does not handle missing values at all, which can be something of a problem, but that is to do with the algorithms that handle the actual model fitting and the problem of deciding, for a general case, what to do about missing data.  There is no facility in GLIM for direct access to the operating system, for example to examine a directory listing in order to find out the name of a file.  Finally, although variate and factor names may be longer than four characters, only the first four are used.  The names of user-defined scalars may only be one letter long and you can therefore only have 26 of them.  These features restrict the sort of names that you can use for data structures.
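The four-character rule means that two different-looking names can refer to the same structure.  The Python sketch below is a hypothetical helper, not part of GLIM, that lists such clashes before you commit to a set of names.  It assumes that letter case is not significant, which is an assumption made for illustration.

```python
from collections import defaultdict

def find_name_clashes(names, significant=4):
    """Group names that become identical when only the first few characters count."""
    groups = defaultdict(list)
    for name in names:
        # Assumption: case is ignored, so names are compared in upper case.
        groups[name[:significant].upper()].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}

# Illustrative variable names; HEIGHT and HEIGHT2 share the first four characters.
clashes = find_name_clashes(["HEIGHT", "HEIGHT2", "WEIGHT", "AGE", "AGEGROUP"])
print(clashes)
```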

 

 

Graphics Facilities

 

GLIM offers only character-based graphics, but they are satisfactory for model and assumption checking that is essential to the way that GLIM is used.  Release 4 of GLIM (early 1992) will have high-resolution graphics built into it, and these facilities should be available in all versions.

 

Summary

 

If generalized linear modeling is what you know about and what you want to do then GLIM will undoubtedly be attractive to you.  If you want neatly packaged statistical analyses then it will definitely not provide what you are looking for.  Release 4 of GLIM will provide extensions to the type of models that can be fitted, new data structures such as matrices, and high-resolution graphics, but is unlikely to have many features to improve its user friendliness except for some on-line documentation.

 

 

7.  BMDP

 

The Philosophical Base

 

BMDP started out in 1961 as BioMeDical computer programs (BMD).  It was then, as it still is today, a collection of programs to carry out specific statistical analyses in the bio-medical sciences.  The range of analyses available has been greatly enhanced over the years, and there are a number of types of analysis which are not available in other packages.  The programs are well written and the developers have never been accused of using poor algorithms, as have the developers of some other packages.

 

There are interactive versions of BMDP available, but with the possible exception of some of the facilities of the PC90 version, the interactive versions make very few concessions to modern interactive computing.  However, once you have learned the commands associated with the two or three programs that you need, BMDP is very easy to use.  The basic idea is that you set up your job, submit it for processing, and collect your output when it has finished.  As long as you have a problem that fits one of the BMDP programs, everything is straightforward, but if not there is no flexibility to tailor the analysis to suit your needs.

 

Although the package consists of about 50 individual programs, they share the same commands for data input, transformation, printing, storage and output.  This means that having prepared a job for one program you have only to change the program-specific commands to make the job suitable for another program.

 

 

BMDP and Data

 

There is not a great deal to be said here.  BMDP reads variables, which can be either numeric or character strings of four or fewer characters.  Whether variables are measurements or categories is determined by whether or not they are declared as grouping variables as part of the analysis specification (continuous variables can be used for grouping if you specify cut points).  Various other data structures, such as correlation matrices, can be stored in BMDP system files for use in other programs if required.
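The idea of grouping a continuous variable by cut points can be sketched in a few lines of Python.  This is not BMDP syntax; the function name and the example ages are hypothetical, and the rule that a value equal to a cut point falls into the upper group is an assumption that may differ from what BMDP does.

```python
from bisect import bisect_right

def group_by_cutpoints(values, cutpoints):
    """Return a 1-based group number for each value, given ascending cut points.

    Assumption: a value equal to a cut point is placed in the upper group.
    """
    return [bisect_right(cutpoints, value) + 1 for value in values]

# Three cut points split the ages into four groups.
ages = [17, 25, 40, 64, 65, 80]
groups = group_by_cutpoints(ages, [18, 40, 65])
print(groups)
```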

 

BMDP, like SPSS, is almost exclusively case oriented, and data is specified as a set of variables, with one value for each variable for every case.  Missing values are handled by specifying a specific value, such as 999, to indicate that there is no data for a specific case.  One useful feature of data input in BMDP, which should be available more widely in statistics packages, is the facility to specify the maximum and minimum values that a particular variable can take.  This means that many typographic errors, such as a decimal point in the wrong place, are picked up automatically.
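The combination of a missing-value code and declared limits can be illustrated with a short Python sketch.  It is not BMDP code: the function name, the limits and the temperature readings are hypothetical, and only the missing-value code 999 is taken from the text above.

```python
MISSING = 999  # sentinel meaning "no data for this case", as in the example above

def screen_values(values, minimum, maximum, missing=MISSING):
    """Split values into accepted, missing and out-of-range (with case number)."""
    accepted, n_missing, rejected = [], 0, []
    for case, value in enumerate(values, start=1):
        if value == missing:
            n_missing += 1
        elif minimum <= value <= maximum:
            accepted.append(value)
        else:
            rejected.append((case, value))
    return accepted, n_missing, rejected

# Hypothetical body temperatures in degrees Celsius.
# 371 is 37.1 typed with the decimal point in the wrong place.
accepted, n_missing, rejected = screen_values([36.8, 371, 999, 37.4], 30, 45)
print(accepted, n_missing, rejected)
```

Note that the missing-value code must itself lie outside the declared limits, otherwise a genuine reading could be mistaken for a missing one.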

 

 

Strengths of BMDP

 

The quality and diversity of the BMDP programs are its strength.  Some of the programs provide facilities that are not easily available in other packages.  These include survival analysis, boolean factor analysis, the analysis of preference pairs, and stepwise logistic regression.

 

The output from BMDP analyses is better laid out and easier to understand than that of some other packages, although high-resolution graphical output is only available in the PC version.  If BMDP has the analysis that you require then it will do a good job for you with little fuss.  BMDP has a very useful program called the data manager that enables you to transform and sort variables, merge files and aggregate the data over cases, which helps to reduce its dependence on a case-oriented view of data.

 

 

Weaknesses of BMDP

 

The major weaknesses are lack of flexibility and the fact that the package consists of a series of separate programs, with one for each type of analysis.  The former means that if your problem is in any way non-standard, you may not be able to carry out an optimal analysis using a BMDP program.  Problems concerned with the latter result from the fact that although each program has a series of parameters that are set out in what are called paragraphs, there is only a loose commonality between programs.  The paragraphs concerned with input of data, transformation, printing and storage are all the same across all programs, but the ones that specify the particular analysis are specific to the individual program.  There is no guarantee that if you wished to carry out, say, a repeated-measures analysis of variance, then you could specify the analysis in the same way in the three different programs that could perform it for you.  It is also very difficult to carry out data exploration with BMDP, because you might need to run several programs to collect all the information that you require.

 

 

Graphics Facilities

 

BMDP offers a reasonable range of character based graphics in some of its programs.  Only on the PC is there the possibility of high-resolution versions of the graphs.  The PC90 version allows you to change certain features of the graphs from the defaults (colors, axis limits, fonts and line styles) but the range of types of presentation is quite limited.

 

 

Graphical User Interfaces

 

The PC90 version of BMDP has a menu-driven interface to the BMDP programs.  This allows you to create a file of commands and/or data for input into a BMDP program and then select an appropriate program from a menu.  There is a help system to aid your choice of program and help you with the editor.  Finally, if you have the high-resolution graphics option, you can construct and modify your graphs from further menus.  This menu system takes some of the effort out of using BMDP but it does not transform it into anything like an interactive package; it still remains a batch orientated package that is good for problems with a structure that fits a particular BMDP program.

 

 

Summary

 

If you have a well-specified problem in the bio-medical sciences that is common to others in your field, BMDP may well have the analysis that you are looking for.  The analysis will be well implemented and will give you clear and concise output.  In short, BMDP will provide you with a well-packaged product.  If however you want to explore your data and try out various approaches to its analysis, there are easier ways of doing it.  For certain types of analysis there is not really much choice and you have to accept the program that will do it.

 

The level of interactivity across the machines on which BMDP is available is rather low at present, but there are indications in the PC90 implementation that this situation may well improve in the medium term.

 

 

8.  SAS

 

The Philosophical Base

 

The SAS system is a very large integrated collection of products which is capable of carrying out a wide range of tasks in the data storage, analysis and display field.  It might be thought of as an environment for data handling rather than simply as a package.  Although it has its origins in statistical analysis it has spread into graphics, database management, econometrics, operations research and quality control.

 

Most versions of SAS offer multiple screen windows in interactive working so that all your work can be carried out in the SAS environment.  The package is probably of most interest to those who need a complete working environment, from planning their work through to the finished reports.  The package is very popular in the pharmaceuticals industry, since it is obligatory in food and drug trials for the FDA.  It is also popular in commerce as a computer performance-monitoring tool.  Until fairly recently its major userbase was organizations with IBM mainframes, but over the past few years it has become available on a much wider selection of hardware.

 

Although the range of facilities is impressive overall, in some of the areas that the package covers it is less good than competing more specialized packages, but this disadvantage must be weighed against the benefits of the integration of diverse elements into one environment.

 

 

SAS and Data

 

Data in SAS is organized into structures called SAS datasets.  Ordinary SAS datasets contain from one to many variables, where a variable is one measurement or observation made on a set of cases.  The term variable in this context accords with its usage in other packages.  Variables may be numeric, character or date.  Nominal, ordinal and interval variables are not distinguished and are all numeric (although the character variable type may be used for nominal variables).  A dataset may be constructed from raw data read into the package from a file or from keyboard input.  Unless instructed otherwise, all SAS procedures are carried out on the latest SAS dataset established by a SAS DATA step.

 

Special SAS datasets are used to contain data structures which do not fit the 'variable by cases' model such as matrices of correlation coefficients.  These are primarily provided to enable the results of one procedure to be fed in as input to another one.  It is a pity that the provision of special SAS datasets is not wider so that all results that can be printed were available as data for further manipulation.  They are currently provided primarily for inter-procedure communication.

 

Multiple SAS datasets may be in use at one time, which partially removes the strict case orientation that is found in some other packages.  SAS datasets may be either temporary, in which case they will be lost at the end of a session, or they may be permanent, in which case they will be saved at the end of the session.

 

 

Strengths of SAS

 

The major strength of SAS is the integration of a number of products to produce a complete working environment.  It offers a wide range of statistical procedures although the range could not be described as fully comprehensive.  Most of the common types of analysis are to be found and there are a number of facilities that are found in few other packages.

 

Most notable amongst these rare features are the facilities for the planning and execution of industrial type experiments such as those advocated by Taguchi and Deming and their followers.  The industrial quality control facilities are of course backed up, like all the SAS facilities, by good graphical presentations, which are essential in modern statistical analysis.

 

The integration of the SAS products is seamless, so that you do not need to know whether you are using a Base SAS, a SAS Stat or a SAS Graph procedure, so long as you know the appropriate syntax.  SAS has a full set of macro facilities, which can be used to add new features to the system by programming them in the SAS command language.  SAS will link via its Access product to most major database packages.

 

The SAS system is a very large piece of software backed up by a very impressive array of manuals.  It is probably the most heavily documented statistical package available, with dozens of manuals from very introductory guides through to extremely technical reports.  In addition to the manuals there are a number of computer-based training modules available.

 

 

Weaknesses of SAS

 

The major problems with SAS are that there is a lot to learn to use it effectively, and the syntax is non-trivial.  This would be a major problem if you were the sort of person who only uses a statistics package from time to time.  You will only learn how to use SAS by a lot of hard work and experience.  This probably accounts for why SAS is more popular in commerce than in academic circles, as academics rarely do enough data analysis to make all the learning worthwhile.  SAS has what amounts to a programming language, which provides the 'glue' to hold all the bits together.  SAS gurus can manipulate this language to do many things and so in theory SAS can do almost anything.  Unfortunately, although this is true, there are as a result many quite simple tasks that can only be achieved by rather convoluted routes.  A quick glance through the SAS user magazine (SAS Communications) will show you that many SAS users ask very simple questions and are given 10-20 line SAS programs to type in order to provide facilities that arguably ought to be provided as commands or procedures in their own right.  SAS has for instance no function to calculate a factorial, but you can calculate one in a few lines of SAS code.  To an extent SAS suffers from being a jack-of-all-trades.
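To give a feel for the kind of short program involved, here is the factorial written as a simple loop.  Python stands in for the SAS language here, so this is an illustration of the logic only and not SAS code.

```python
def factorial(n):
    """Compute n! by repeated multiplication."""
    if n < 0:
        raise ValueError("factorial is undefined for negative integers")
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

print(factorial(5))
```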

 

 

Graphics

 

SAS has the usual character graphics procedures as well as some that other packages do not have, such as three-dimensional barcharts.  Where SAS really comes into its own is in relation to high-resolution graphics.  There is a fairly wide range of chart types and each of these has many options to alter the color, fonts, layout etc.  In addition to graphs, SAS has a procedure to draw maps and a large number of map data sets for various parts of the world at various scales.  In addition to the individual graphical procedures, graphs can be overlaid on each other to make quite sophisticated finished results.  A very large number of graphics device drivers are supplied with the package and so you can certainly output your results on most common types of hard-copy device.  There are also the facilities to write your own driver if you have a particularly obscure plotter and the inclination to write graphics device drivers.

 

The biggest drawback of the SAS graphics procedures is that they are entirely command driven.  This would come as something of a shock to anyone used to presentation graphics packages on a PC or who is used to the mouse-driven world of the Apple Macintosh.

 

 

Graphical User Interfaces

 

For a long time SAS has had separate windows on the screen for the program editor, the log and the program output, so all versions of SAS have a graphical user interface of sorts.  The PC and SUN versions have pop-up windows for help and other useful things.  As part of the SAS product there is a menuing system that enables tailor-made applications to be set up for non-SAS users, and there is a Full Screen product which contains the tools to handle data on the screen as though it were on a form or series of forms.  These items make SAS a tailorable full-screen environment so that appropriate graphical interfaces can be set up to meet your needs (assuming that you have the time to learn how to use them properly).

 

 

Summary

 

The SAS system is a comprehensive piece of data handling software, which achieves a reasonable standard across its many areas.  It offers a very high level of integration of functions usually only available in separate packages.  There is a great deal to learn to use the package effectively and so it is likely to be of most interest to those who spend a lot of their time handling data, and of less interest to casual users of statistical packages.  It offers the best set of graphics procedures currently available within a multi-platform statistical package and is capable of linking directly to the major database packages.

 

 

9.  Comparison of Packages

 

Experimental versus Observational (or Survey) Data

 

Each package handles one type of situation better than the other.  The reason for this is that the packages were written to tackle the specific types of problem encountered by a group of researchers who were involved in either experimental or observational science. 

 

There is considerable overlap in the range of techniques used by the two types of researcher, but there are also techniques that are largely confined to one group or the other.  The consequence of this is that these techniques are better implemented and have more facilities in the package that is oriented towards that type of data analysis.  The diagram below attempts to indicate the orientation of the packages.

 

 

Experimental                                     Observational

 

<-----------------GENSTAT 5---------------->

     <-------------BMDP------------>

             <------------MINITAB------------->

             <-------------GLIM------------->

                          <--------------SPSS-------------->

       <----------------------SAS---------------------->

 

 

Note that packages in the middle tend to be general in their applicability, not specializing in either experimental or observational data.

 

 

Interactivity

 

Interactivity refers to the ease with which you can let the data itself suggest the appropriate form of analysis and so, by working with the data, arrive at conclusions.  This approach is the antithesis of pushing the data in at one end of the package and expecting to see pearls of wisdom emerge from the other.  Many important aspects of the data may be missed by a failure to examine the data properly.  The degree of interactivity in the different packages is summarized below.

 

     Interactive                          Non-Interactive

 

     MINITAB

      GLIM                         SPSS

       GENSTAT                               BMDP

      SAS

 

 

Ease of Use

 

This would appear at first glance to be an easy attribute to determine, but unfortunately it is rather multi-dimensional in nature.  The following characteristics all need to be considered:

 

·         The complexity of the command syntax. 

·         The flexibility of the command syntax. 

·         The ease with which a particular standard analysis may be specified. 

·         The ease with which important elements can be identified in the results. 

·         The readability and helpfulness of the manuals. 

·         The amount and type of information available in the help system. 

 

There is also the statistical error checking which ensures that what you are trying to do does not infringe major assumptions, and guidance in the choice of appropriate techniques.

 

Balancing all these elements together is not an easy task, but MINITAB would probably come out best and GLIM is definitely only for those with a thorough statistical background. 

SPSS will allow you to do almost anything, fairly easily, right or wrong (never mind the quality, just look at the volume of output).  GENSTAT has a more difficult learning curve but carries out many more checks on your behalf.  SAS is a very wide-ranging package but has quite a steep learning curve and a syntax that takes a bit of getting used to.  BMDP programs make short work of specific well-defined problems, so long as your problem matches what the program does, but the programs are not very flexible.

 

 

Availability

 

The six packages discussed in detail above are the major packages available on machines within the University.  They are not the only packages in use, but they have one thing in common: they are all available on a range of machine types.  It is possible to have each of them except SAS on VME, CMS, VMS, SUNs and PCs, although the University does not have licences for every combination of package and machine.  This means that whatever machines provide the computing power in the University in the future, it is likely that the same packages will be available to you.  The current availability of statistics packages is summarized in the table below.

 

+-----------+--------------------------------------------+
|           |                  System                    |
+-----------+------+------+------+------+------+---------+
| Package   | CMS  | VME  | VMS  | SUN  | PC   |Macintosh|
+-----------+------+------+------+------+------+---------+
|MINITAB    |  A   |  A   |  P   |  P   |  S   |  S      |
|SPSS       |  P   |  A   |  P   |  P   |  S   |  S      |
|GENSTAT 5  |  A   |  A   |  P   |  P   |  S   |  N      |
|GLIM       |  A   |  A   |  P   |  P   |  L   |  N      |
|BMDP       |  A   |  A   |  P   |  P   |  L   |  N      |
|SAS        |  P   |  N   |  P   |  S   |  S   |  N      |
+-----------+------+------+------+------+------+---------+

 Key:   A = Available,   P = Possible,   S = Site Licence,
        L = Limited Licence,   N = Not Available
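If you want to query the table rather than read it, it can be held as a small lookup structure.  The Python sketch below simply transcribes the table above; the variable and function names are hypothetical, and the codes are those given in the key.

```python
SYSTEMS = ["CMS", "VME", "VMS", "SUN", "PC", "Macintosh"]

# One status code per system, in the order of SYSTEMS, copied from the table.
AVAILABILITY = {
    "MINITAB":   "AAPPSS",
    "SPSS":      "PAPPSS",
    "GENSTAT 5": "AAPPSN",
    "GLIM":      "AAPPLN",
    "BMDP":      "AAPPLN",
    "SAS":       "PNPSSN",
}

def packages_with(system, codes):
    """List the packages whose status on the given system is one of the codes."""
    column = SYSTEMS.index(system)
    return [name for name, row in AVAILABILITY.items() if row[column] in codes]

site_licensed_on_pc = packages_with("PC", "S")
usable_on_macintosh = packages_with("Macintosh", "AS")
print(site_licensed_on_pc, usable_on_macintosh)
```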

 

There are many good packages that are restricted to one particular type of machine, such as Statgraphics on the PC, SYSTAT on a number of machines, or S on the SUN and other Unix machines.  These packages do not however possess the hardware independence of the packages discussed in detail in this document.

 

Where a site licence is available, the package may be obtained for a nominal charge from Computing Service Reception.  Where there is a limited licence, the package may be available by negotiation with the department which paid for the licence (contact the author for further details).  Packages designated as 'Available' are available to all registered users of the particular system (all members of the University may apply for registration on any system).  Packages designated as 'Possible' could become available if there was a sufficiently broad demand from across the University, or if any departments were willing to pay the licence fee concerned.

 

 

Training

 

Currently the Computing Service runs courses to introduce new users to MINITAB, SPSS and GENSTAT 5.  These courses are designed to get users started in the use of the package, to outline the major concepts underlying the package and to introduce the user interface and help facilities.  The courses are not designed to teach statistics, but obviously a discussion of some statistical questions is inevitable.  Attendees will gain more from the courses if they have a clear idea of the types of statistical procedures that they need to employ in their work and how to interpret the results that the procedures produce.

 

Experience with the courses would confirm other indications, from around the University, that there is a need, especially amongst postgraduate students, for courses to teach particular types of statistical analysis.  If departments perceive this as an unsatisfied need then they should pursue it through the appropriate channels, as the Computing Service has currently no remit or resources for such an activity.

 

With regard to GLIM, it is assumed that people who wish to use GLIM know enough about what they are doing to manage without introductory courses, although such courses could be arranged if there were a demand.  In the case of BMDP, as the package is a collection of essentially separate programs, it is difficult to teach an introductory course except in the use of a specific program.  Such a course would not be of general interest, but would have to be targeted towards a small group of individuals with similar data analysis requirements.

 

Introductory SAS courses could be made available if there was sufficient demand for them.

 

 

10.  Other Packages

 

The packages discussed in detail above by no means provide the only way to do statistical analysis on a computer.  There are many packages that have been written for one particular type of machine and make very good use of the features of that machine.  The trouble with such packages is that they are only available to a minority of people and are therefore very much more difficult to support in a multiple machine environment such as a university.

 

In addition to statistical packages, there are two other ways of getting statistical results: from subroutine libraries called from user-written programs, and from other packages such as spreadsheets that have some statistical functions even though their primary purpose is not statistical analysis.

 

10.1 NAG Library

 

The NAG Library, which is available across a wide range of machines at Glasgow, is a series of subroutines for numerical analysis, statistical analysis and graphics.  The routines are called from Fortran programs written by the user.  The statistical routines cover a wide range of techniques, from simple univariate statistical summaries to complex multivariate techniques.  In addition the NAG Library also contains the numerical tools to construct statistical procedures from scratch, if you understand this sort of thing.  The routines for statistical analysis are all contained in chapter 'G' of the NAG Library.  The sections in this chapter are as follows:

 

G01 Simple calculations on statistical data

G02 Correlation and regression analysis

G03 Multivariate methods

G04 Analysis of variance

G05 Random number generators

G07 Univariate estimation

G08 Nonparametric statistics

G11 Contingency table analysis

G13 Time series analysis

 

Full details of these routines are given in the NAG Library documentation, a copy of which is available for consultation in the Computing Service Advisory room.  The statistical routines within the NAG Library are of most use to people who want to build statistical procedures into their own programs which carry out other functions as well.  If you only want to do statistical analysis you are probably better off with a statistics package.

 

 

10.2 EXCEL

 

EXCEL is a spreadsheet package available on both Macintoshes and PCs, which, in addition to having built-in graphics, is also capable of carrying out some statistical functions.  These functions enable a number of simple statistical procedures to be carried out, including the calculation of means, variances and standard deviations, linear and exponential regression and correlation.  As the basis for spreadsheets is that a cell can be calculated as a formula relating that cell to other cells, statistical procedures not explicitly available in EXCEL can be carried out fairly easily, so long as you know the formula required.  If you already use EXCEL and if it can carry out the statistical analyses that you require then there is no need to use a dedicated statistics package.  This can save you the effort of having to learn to use another package.
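The formulas involved are short enough to write out.  The Python sketch below is not spreadsheet syntax, and the function name and data are illustrative; it computes the mean, sample variance and standard deviation of a column, and the slope, intercept and correlation for a simple linear regression, from the standard sums of squares and products.  Each line corresponds to the kind of formula you would enter in a cell.

```python
import math

def summary_and_regression(x, y):
    """Summary statistics for y and the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)   # sum of squares of x
    syy = sum((yi - mean_y) ** 2 for yi in y)   # sum of squares of y
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_y": mean_y,
        "variance_y": syy / (n - 1),            # sample variance
        "sd_y": math.sqrt(syy / (n - 1)),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "correlation": sxy / math.sqrt(sxx * syy),
    }

# Exactly linear illustrative data (y = 1 + 2x), so the results are easy to check by hand.
result = summary_and_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(result)
```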

 

 

11.  Software Support Categories

 

The Computing Service currently has a number of levels of support for software, referred to as support categories.  Category A is the highest; the Computing Service provides courses, help, documents and telephone support for users of packages in this category.  Category B is similar to A except that the help is provided by another department.  The packages described in this document are all in category A, except for BMDP and SAS, which are in category B.

 

You are recommended to choose a package in category A or B unless there are good reasons not to.  If you choose to use a package in category C or D then there may not be any help available if you have problems.  If the needs of your work indicate a category C or D package then you will probably have to make your own arrangements for help and training.

 

The packages that have been placed in category A are all well-established and available across a wide range of machines and operating environments.  The Computing Service is therefore confident that they will continue to be available whatever equipment is in use in the future.  These packages will not necessarily provide the best facilities on any particular machine, but because of their consistency across environments it is possible to provide a high level of support for them.

 

Packages in category B have been chosen by certain departments either for historical reasons or else because they are particularly suited to the needs of that department, and so there is considerable expertise in the use of the package within the department concerned.

Packages in category C have particular appeal to some people because of the facilities they provide or in relation to the environment in which they operate.  In most cases the expertise in their use is somewhat limited.

 

 

12.  Further Information

 

If you require more specific advice on the choice of a package, you are welcome to contact the author on ext. 4821.  To enrol for a course, contact the Advisory Service on ext. 4831.  If you want to know more about the facilities available in a particular package you should consult the relevant manuals, which are available in the Advisory room in the Computing Service.

 

Information about how to use the packages on a particular computer system is provided in a series of short Computing Service user notes, which are freely available from the Advisory room.  At present the following user notes are available:

 

UN 21 MINITAB Reference Summary

UN 502 SPSS/pc+

UN 28 GENSTAT Procedure Library

UN 508 GENSTAT on 8086 or 80286 PCs

UN 509 GENSTAT on 80386 or 80486 PCs

 

Note that the mainframe version of the SPSS package is known as SPSSx.  A detailed introduction to the SPSS package itself is provided in a Computing Service user guide:

UG 20 SPSSX Introductory Guide

 

This document is over sixty pages long, and is available for a small charge (currently £1) from Computing Service Reception.  A copy of the Minitab Reference Manual is also available for purchase from Reception.

 

A user guide produced by the University of Liverpool that gives an introduction to the use of SAS on a PC may be obtained on request to the Advisory Service.  Users of the CMS system can access or print this by typing DOC SAS while logged on to CMS.

 

If you obtain any site-licence software from the Computing Service you will also receive an installation note which provides assistance with installing the software on your own machine.

 

All the above documents include details of the manuals available for the relevant package.  A copy of each of these manuals is available for reference or overnight loan in the Advisory room.

 

The Computing Service welcomes feedback on its user documentation.  Please send your comments on this particular document to James Currall in the Computing Service (J.Currall@compserv.gla.ac.uk).