Table of Contents

Introduction

Why Do We Need to Manage Metadata?

Data from Everywhere

Data as a Potential Resource

Metadata Status in an Organization

Metadata Management: Historical Approaches

Data Administration: Data Dictionaries

The Havoc of Distributed Systems

Early Metadata

Early Data Warehousing

Metadata Management Trends

Growth of Data Warehousing

The Resurgent Repository

An Enterprise View of Metadata Management

Warehousing Efforts

A Repository as a Metadata Integration Platform

Lifecycle Issues

What Needs to Be In an Enterprise Repository to Make the Warehouse Work Better

Non-proprietary Relational Database Management System

Fully Extensible Meta Model

Application Programming Interface (API) Access

Central Point of Metadata Control

Impact Analysis Capability

Naming Standards Flexibility

Versioning Capabilities

Robust Query and Reporting

Data Warehousing Support

Conclusion

Glossary of Terms






Copyright © 2000 by The Applied Technologies Group
Putting Metadata to Work in the Warehouse

-A White Paper

Introduction

As a recent article in the Wall Street Journal pointed out, data is becoming an abundant commodity; we can get it anywhere and everywhere.  However, just as with any other item that becomes a commodity, data by itself is losing value, in part simply because there is so much of it.

Not too long ago, the situation was exactly the opposite: raw information was extremely difficult to acquire and therefore highly prized.  For example, fifty years ago, investors interested in trading in a futures market such as coffee beans would ask the question: “how is the coffee bean harvest doing in South America?” 

In order to get their answer, they hired agents to investigate coffee production as well as other important commodities.  The reason they were willing to go to these lengths is that the answers to these questions were critical, and could make an investor an overnight millionaire in the futures market. 

Today, you can get higher quality raw data, including satellite pictures if you wish, over the Internet, for less than ten dollars a month.  More data can be obtained in 15 minutes, from a broader and richer set of sources than any investor 50 years ago could get with months of effort.

The expanded accessibility of data and the staggering quantities available make it difficult to deal with.  When data was less abundant, the amount available was consumable by users with fairly primitive tools.  Now that data is plentiful, we are faced with the problem of separating the significant facts from the rest.

There are many analogies that can be used to describe the situation.  Some say it is like trying to take a drink from a fire hose; others say it is like trying to find one specific grain of sand on a stretch of beach.  In all cases, the problem being described is singular, pervasive, and compelling, because one thing that hasn’t changed is the fact that the ability to use data to get answers to business questions is still the key to making money and achieving success in business endeavors.

We do know that in order to make good business decisions we need good data.  In fact, in today’s eBusiness economy, data has become one of the few assets that is unique to an organization, and as such can be used as a key differentiator in competitive landscapes.  The tools, processes and services an organization can employ to leverage their data are generally available to anyone; but their data remains unique.  The process that has been generally accepted as good business practice in exploiting data has been described as follows: 

·         First, acquiring quality raw data;

·         Second, combining and integrating the data to make useful information; and

·         Third, analyzing and visualizing the information to provide knowledge for making high quality decisions.

The acute issue is to know which data to use to create useful information.  The end goal is to make better and higher quality decisions than your competitors.  The torrent of raw data has added more choice, and therefore complexity, to the process.

What we need to do is put the data in context, give the data meaning, relevance, and purpose, and make it complete and accurate.  Data that is viewed in this light is called information, because we can use it for deductive and inductive insights that lead us to quality decisions.

This paper discusses the issues confronting management today as they grapple with the floods of raw data and the pressing need to know what they have as data assets and how to achieve the goal of better decision making.

 

Why Do We Need to Manage Metadata?

As mentioned previously, raw data is proliferating at a rapid rate.  Data is flowing into the company from suppliers and customers.  Data is being purchased and even leased (or time-shared) from external sources.  And, the internal systems of the corporation are adding their share.

 

Data from Everywhere

Corporations are interacting on the web, blending suppliers, partners, and customers to form a virtual enterprise that functions as the superset of the physical organizations.  This eBusiness is performed through a myriad of real-time information exchange technologies such as Electronic Data Interchange (EDI), Electronic Funds Transfer (EFT), eXtensible Markup Language (XML) Data Streams, Business To Business (B2B) Exchange Services, email, and a host of other data acquisition and eBusiness Server applications.

At the same time, existing legacy systems within enterprises continue to generate data on orders, sales, revenues, employee information, manufacturing schedules, inventory, fleet status, and every other parameter imaginable.  As computers become more and more affordable, as storage costs continue to plummet, as user sophistication increases in the use of information technology, the proliferation of the technology adds to the exponential growth of the data it generates.

What do we know about the data being generated by these systems? 

First of all, we know that it is by and large dispersed across the enterprise.  Each department, division, group, branch, section, individual employee or any other subdivision is today capable of generating its own unique caches of data.  The information technology advances of the last 20 years have added significantly to the amount and depth of data produced, managed, and stored. 

The waves of management interest in achieving operating efficiencies by centralizing, decentralizing, and reengineering, along with the technology whiplash from mainframes to two-tier client/server architectures to newer three-tier client/server architectures, have created the opportunity for the data in one group to have a different meaning in another group in the same organization.

B2B interactions now demand knowledge about data across the value chain (external companies, suppliers, distributors), further complicating the data architecture of an organization.  The problem of disparate data is exacerbated by rapid application development tools, application and code generators, underutilized data models and definitions, database products, spreadsheets, and other client friendly products, and by a lack of leadership in management.

Secondly, we know that along with the dispersion there exists a view that the data generated by each group belongs only to itself, and is intended only for its own uses.

Finally, we know that because of the first two observations, the potential for integrating these disparate data elements across various departments and organizations is poor without some significant work.

Recently, management has begun to recognize the value of using data as a corporate asset.  There are even proposals floating to make it a line item on the corporate Balance Sheet.  The idea of using all of the organization’s data to get a complete picture of the enterprise is today’s ideal. 

At the very least, management is recognizing the need to view data from multiple departments and across the value chain to get some kind of combined view of operations.  Data warehouses emerged as a technology by which management could get a single comprehensive view of the state of the organization.  Data is extracted at regular intervals from existing systems and placed in the warehouse, summarized to allow management to look at trends, but also available in detail for drill down data access and analysis. 

In more leading edge organizations, the data is presented to neural network technology for Data Mining, including automated correlation analysis, pattern detection, business rules generation and predictive model generation. 

It is the basis for the data needed to effectively do eBusiness, CRM and B2B applications. 

Add to this the unstructured data and directories that are now prolific and we paint a very complicated data picture.

In many companies and across the value chain, the same data element may be used by different entities to mean different things.  Or different data element names could be used to represent the same things, potentially creating hundreds of instances of the same data all inconsistently named. 

For example, Manufacturing may exclude work in process from an inventory analysis, while purchasing does not.  Different groups may have different standards and approaches to defining prospects ready to buy; territories may be different (is Kentucky East or Southeast?), among many possibilities.  This is an everyday reality to most companies working with data warehousing and eBusiness applications.
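The two naming problems described above (one name carrying several meanings, and several names carrying one meaning) can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the departments, element names, and definitions are hypothetical, and matching definitions by exact text is a deliberate simplification of what a real data administration effort would have to do.

```python
from collections import defaultdict

# Hypothetical departmental definitions: (department, element name, definition).
ELEMENTS = [
    ("Manufacturing", "inventory", "on-hand stock excluding work in process"),
    ("Purchasing", "inventory", "on-hand stock including work in process"),
    ("Sales", "cust_no", "unique customer identifier"),
    ("Support", "customer_id", "unique customer identifier"),
]


def find_conflicts(elements):
    """Return (homonyms, synonyms) found in a list of element definitions.

    Homonyms: one element name used with more than one definition.
    Synonyms: one definition recorded under more than one element name.
    Definitions are compared as exact strings (an assumption of this sketch).
    """
    by_name = defaultdict(set)
    by_definition = defaultdict(set)
    for _department, name, definition in elements:
        by_name[name].add(definition)
        by_definition[definition].add(name)
    homonyms = {n: sorted(d) for n, d in by_name.items() if len(d) > 1}
    synonyms = {d: sorted(n) for d, n in by_definition.items() if len(n) > 1}
    return homonyms, synonyms


homonyms, synonyms = find_conflicts(ELEMENTS)
print("Same name, different meanings:", homonyms)
print("Same meaning, different names:", synonyms)
```

Even this toy version shows why the definitions have to be recorded somewhere shared: neither conflict is visible from inside a single department's system.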

 

Data as a Potential Resource

Faced with these dilemmas, management has realized that data is a resource but only if all its important attributes are known and understood.  Data must be set in context, have meaning to its users, be relevant and have its relationships be understood, and have purpose.  It is not enough to know that the inventory levels have ranged between two values over time.  One must also know what the definition is of inventory levels.  It is not enough to know that the value of a certain case of French wine has increased over time.  One must also know what has happened to the relative value of the French Franc to the Dollar, and whether that value has been adjusted for fluctuations in the currency exchange.

Data must also be complete and accurate.  If there are multiple sources for a particular data element, which one is being used in the data warehouse, and why? What are the business rules that impact how we view data?  If we have calculated some data elements, such as profitability, what equations and formulas have been used to derive those results?  Only when these are known, understood, and applied can data be fully utilized, and only then can we begin the reliable building of information from the data that ultimately leads to quality decision making.

The need to understand the data leads to a need for managing the data.  This need is particularly acute in systems such as data warehouses, eBusiness applications (including CRM) and Data Mining whose primary purpose is to provide knowledge to target a Web interaction or supply a fertile ground for exploration and insight.

Having thus established a requirement to understand and manage the properties of the data, the question then becomes what is the best mechanism for achieving this? What we are really talking about then is a store of attributes about the data, or data about data.  Semanticists have termed this concept “metadata,” from the Greek “meta,” which means a later stage, transcending, or situated behind. 

Literally, then, we are talking about data that sits behind the operational data, and that describes its origin, meaning, derivation, etc.  (What is gross sales: dollars or French francs, quarterly or annualized, what system does it come from, when is it extracted, etc.?)  Metadata can range from a conceptual overview of the real world to detailed physical specifications for a particular database management system.  A data resource becomes useless without readily available high quality metadata, whose primary objective is to provide a comprehensive guide to the data resource.
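As a rough illustration of what such "data about data" might look like for the gross sales example, the sketch below models one metadata entry as a Python dataclass.  The field names and the sample values (definition text, source system, schedule) are assumptions made for this example only; they are not taken from any particular repository product.

```python
from dataclasses import dataclass, field


@dataclass
class ElementMetadata:
    """Descriptive attributes for one data element (all fields are illustrative)."""

    name: str
    definition: str            # business meaning of the element
    unit: str                  # e.g. currency in which values are stated
    period: str                # e.g. quarterly or annualized
    source_system: str         # operational system the data comes from
    extraction_schedule: str   # when the data is pulled into the warehouse
    derivation: str = ""       # formula used if the element is calculated
    business_rules: list = field(default_factory=list)

    def describe(self) -> str:
        """Return a one-line, human-readable summary of the element."""
        return (
            f"{self.name}: {self.definition} [{self.unit}, {self.period}] "
            f"from {self.source_system}, extracted {self.extraction_schedule}"
        )


# Hypothetical entry answering the questions posed about gross sales.
gross_sales = ElementMetadata(
    name="gross_sales",
    definition="Invoiced sales before returns and discounts",
    unit="USD",
    period="quarterly",
    source_system="order_entry",
    extraction_schedule="nightly",
)
print(gross_sales.describe())
```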

Metadata Status in an Organization

If organizations today are having problems managing data, what can we say about their ability to manage metadata?  Most companies suffer from the “ready, fire, aim” syndrome in that they are so rushed to implement systems that the planning and documentation aspects of most projects are the first to suffer.

Pressure from management and users to gain the information or functions they need to do their work leads inevitably to a rushed implementation where there is little thought given to coordinating data elements with other groups who may use the same concept.  Few, if any, resources are dedicated to a careful documentation of the properties of the attributes, the business rules used in their derivation, and so on. 

In short, the problem continues.  In most companies, the metadata situation is worse than the data situation.  Along with the disparate data arriving from multiple sources from within and outside the corporation, there are multiple tools creating metadata in a variety of formats. 

Those companies who have strong data administration and metadata management strategies eliminate the possibility of being “data rich, but knowledge poor.”

It is typical that companies rapidly implement data warehousing or eBusiness systems and then discover themselves in a metadata dilemma.  That is, they have a critical need for readily available high quality metadata to leverage their data resource, yet the organization has no system in place for maintaining adequate metadata.

As a result, many data warehousing and eBusiness initiatives slow down as an organization grapples with these issues brought about by disparate data and poor quality metadata.  In any given project, there is a need to include business experts, domain experts, and data experts so that the metadata that is formed is relevant and useful as applied to the project’s purpose.

Metadata Management: Historical Approaches

Managing the data assets of an organization effectively has been a goal of Information Technology since its inception.  As systems have increasingly become more diverse, distributed, and complex, the management of the data assets has become increasingly difficult but nevertheless critical to the corporate entity.



Data Administration: Data Dictionaries


Early in the days of Information Technology, all data was defined and maintained within the computer program itself.  These were total and complete packages of logic and purpose (64K Assembler core programs).  There was no need to share data between systems and programs because one couldn’t, without great effort, transfer data between different physical computer systems.  While the need and demand were in place to get multiple programs to work together sequentially to get a given job done, the state of the art simply did not allow it.

In the late 1960’s and early ’70s, the technology improved to the point where multiple programs could run sequentially against a given data set to solve business problems.  For example, a batch run could use a data set such as a collection of checking account transactions and calculate a new balance.  This effort required some coordination within the set of programs as they used the system holding the transactions and then later as they accessed account balances, deposits, etc. 

In this evolution of the technology, having each program be its own little environment was no longer tenable, and so early versions of data coordinators were developed.  These were simple data “dictionaries” shared by programs looking for data to use in their logic processes.  Each program would load the data definitions and locations it needed for its run from a common data dictionary.

These dictionaries were most likely managed by an IS organization that was centralized and tightly controlled.  As developers became more sophisticated over time, data dictionaries evolved to provide more than just data attribute descriptions.  They were also able to track which applications accessed which pieces of the database.  This meant that managers who took advantage of the capabilities of the data dictionary and did a good job of designing and populating the data dictionary found themselves in the enviable position of being able to maintain their systems more easily than their counterparts who did not. 

For example, a user wants to change a SKU number definition from five digits to seven.  How many programs need to be changed to effect this enhancement?  If a manager has done the job properly, this question can be answered by a simple query into the data dictionary.  Such a centrally designed and maintained system, which holds the data definitions as well as CASE information about which applications use which pieces of the database, is sometimes called a repository.  This concept will be expanded in later sections of the paper.
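The SKU question can be made concrete with a small sketch.  It assumes a data dictionary held in two relational tables, one for element definitions and one recording which program uses which element; the table layout, program names, and element names are hypothetical, and an in-memory SQLite database stands in for whatever store a real dictionary would use.

```python
import sqlite3

# Build a toy data dictionary in memory (schema and contents are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE element (name TEXT PRIMARY KEY, data_type TEXT, length INTEGER);
    CREATE TABLE program_usage (program TEXT, element TEXT REFERENCES element(name));
    INSERT INTO element VALUES
        ('SKU_NUMBER', 'NUMERIC', 5),
        ('ACCOUNT_BALANCE', 'NUMERIC', 12);
    INSERT INTO program_usage VALUES
        ('ORDER_ENTRY', 'SKU_NUMBER'),
        ('INVENTORY_REPORT', 'SKU_NUMBER'),
        ('BATCH_POSTING', 'ACCOUNT_BALANCE');
""")


def programs_affected_by(conn, element_name):
    """List every program recorded as using the given data element."""
    rows = conn.execute(
        "SELECT program FROM program_usage WHERE element = ? ORDER BY program",
        (element_name,),
    )
    return [program for (program,) in rows]


# Which programs must change if SKU_NUMBER grows from five digits to seven?
affected = programs_affected_by(conn, "SKU_NUMBER")
print(affected)
```

The answer is only as good as the usage records: a program that reads the SKU without being registered in the dictionary would be missed, which is why the text stresses designing and populating the dictionary properly.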

 

The Havoc of Distributed Systems

As demand for more IS technology blossomed and the technology advanced with lower cost midrange and distributed systems, these systems found themselves as islands of automation ensconced in individual departments and working on specific business problems which only related to the specific department.  Data was defined in a decentralized manner, by the business unit, with no central arbiter, if it was defined at all. 

Worse yet, as new and different CASE tools came on the market, and as new and different architectures came into vogue, such as object oriented databases and client/server architectures, different tools were used to define the data for different applications.  In some cases, the same element was defined multiple times with slight variations, as new systems were used to create applications to help users solve various business problems.

Exchanging data between systems became risky, highly structured, and infrequent.  Importing data from external systems and environments became a labor-intensive endeavor and was avoided if possible.  However, in many cases it became necessary.

For example, a call management system for customer support needed to get all of the customer information from the legacy databases.  In order to do this successfully, elaborate programs were written to “scrub” the data clean.  With inadequate metadata, data incompatibilities could cause programs to fail, keeping programmers up nights debugging operational systems and taking down the call management system as well.

 

Early Metadata

Just as any other kind of warehouse needs to keep an inventory of its holdings, early implementers of data warehouses found that they needed to keep track of what data the warehouse was currently holding along with the “pedigree” of that data.  To do this, the idea of a metadata repository was created, similar to a data dictionary, to give users and technicians information about the data, such as where the data came from, what rules were used in creating the data, what the data elements meant, how recently the data was updated, and so on.

In early implementations, and even up to the recent past, many systems divided the business or end-user related information from the technical or development directory, so that technical information about the data, which would be of limited use to an end-user, and could arguably make the end user’s task more difficult, was kept in a separate store from that which was required by the user.

It is now widely accepted that the metadata component must be designed so that everyone understands the data in the warehouse.  Robert Typanski, Data Manager at Bayer, was quoted in Datamation: “Unless people can identify the data that’s in the warehouse, they’re not going to be able to access it any better than if it were buried in some legacy operational system.”

 

Early Data Warehousing

Based on the seminal work by Bill Inmon, early adopters of the concept took Inmon literally in defining a data warehouse as having an enterprise wide scope.  The early incarnations of this concept were gargantuan structures that encompassed summaries of data from all aspects of the business.  In order to tie together all of this data, a tremendous amount of work had to be done to find the legacy data (data archaeology), build a common data model which was appropriate for an enterprise view of the business, and then extract the data from the legacy systems. 

During the extraction, data had to be not only consolidated but normalized and rationalized so that the resultant picture did not contain any duplications, contradictions, or other anomalies which could interfere with the accurate and timely analysis of the consolidated data.  These projects often took longer than expected, and estimated costs were in the millions of dollars, much of which was spent in the data understanding and data preparation phases of the project.  Large systems with broad scopes, such as the ones described above, are sometimes not as responsive as users would like. 

This is not necessarily from the standpoint of system response time, although that could be a problem in an enterprise wide data warehouse, but also from the standpoint of being able to modify data structures and rules to make certain specific analyses which are particular to a single department.

This led to the next evolution of the data warehouse, a special purpose warehouse, or data mart, which is an application specific implementation, with data derived from the warehouse itself or directly from production sources.  The objective of these individual department specific implementations was and continues to be the generation of better support for management decision making by supplying data in a more relevant form and in a more responsive manner.

 

Metadata Management Trends

The awareness of the need to manage metadata has been an offshoot of the growth of data warehousing.  As the number and diversity of data warehousing implementations began to grow, IT managers and end users began to realize that the data warehouse was only as useful as the quality, accuracy, and ease of use of its data.

 

Growth of Data Warehousing

There is no doubt that data warehousing has grown as a technology and that it has firmly established itself as a mainstream tool in competitive businesses today.  However, along with the success of the concept came growth in a number of other areas.

·         Growth across platforms.  Initially, the global data warehouses were the domains of a few platforms and a few vendors.  As their popularity grew, data warehousing technologies such as parallel computing, parallel databases, and OLAP/ROLAP were extended into smaller and more pervasive platforms, such as department UNIX/NT servers and desktop systems.  Now the range of platforms which claim data warehousing and data mart capabilities includes wireless devices, desktop, department servers as well as the traditional big iron mainframes.

·         Growth across tools.  Success has many parents, and failure is an orphan.  This is certainly true of data warehousing.  As the bandwagon got rolling, vendors sprang from all directions to jump on, each touting special features and functions.  From legacy data extraction tools to maintenance and scheduling tools, and yes, even tools that purported to handle metadata, vendors and products proliferated.

·         Growth across departments.  Success also breeds demand, which is exactly what happens when department A gets a new data mart and starts showing colleagues in department B how easily they can now access consistent data.  Soon, department B has its own data mart, and departments C, D, and E are not far behind.

Throughout all of this growth, however, the subject of metadata management remained fractured and dispersed.  Extraction tools, loading tools, cleansing tools and analysis tools, all claimed to have a piece of the metadata problem solved.  In fact, until recently, there has been little progress in terms of a solution to the integrated metadata issue.

However, enterprises today are clamoring for such a solution, and for good reason.  Users need to know what they are looking at if they are to make intelligent decisions and take informed actions based on the data they have received from the data warehouse. 

A system cannot leave it up to the user to assume the business rules embedded in a calculated data element because different assumptions will lead to different courses of action that will inevitably conflict with each other. 

Yet, in most cases today, metadata is spread across different components of the warehouse, from the scheduler to the data extraction/cleansing tools which claim to build metadata as they are extracting and cleansing, to the loading tools, to the OLAP tools, which need to present metadata to users in order to navigate.  Business rules are separated from technical metadata, as they should be, but are kept by different systems in different formats with different user interfaces.

In the case of multiple data marts spread across the enterprise, this situation is multiplied by the number of marts.  And, if there is a need for a user in department A to use a data mart created by department B, then in most cases, that user has to relearn the metadata navigation for that system.  Clearly, users would like to go to one place and be able to see either the business metadata or the technical metadata in a single system with a single user interface and single screen metaphor for any and all data residing in the enterprise.

 

The Resurgent Repository

The scenarios depicted above are the primary drivers behind the resurgence of the metadata repository.  A repository is the vehicle of metadata.  Simply put, a repository is where information (metadata) about an organization’s information systems components (objects, table definitions, fields, business rules and so on) is held.  A repository also contains tools to facilitate the manipulation and query of the metadata.

A repository has a number of potential applications within an enterprise schema that deliver value beyond that exclusively in the domain of a data warehouse.  For example a repository can: 

·         aid in the integration of the views of disparate systems by helping understand how the data used by those systems are related; 

·         support rapid change and assistance in building systems quickly by impact analysis and provision of standardized data; 

·         facilitate reuse by using object concepts and central accessibility;

·         assist in implementation for data warehousing (A central repository can be built in advance of the warehouse, purely for data and application integration purposes, and then be ready to support a warehouse implementation.  Alternatively, if the repository is built in support of the initial warehousing effort, it can be of enormous value in deploying subsequent efforts.);

·         support software development teams.

One of the primary benefits of a repository is that it provides consistency of key data structures and business rules, which makes it easier to tie warehousing efforts together (data marts) across the enterprise.  This has been one of the major criticisms leveled at the proponents of independent data marts—that deploying data marts without a unifying infrastructure simply promulgates the “islands of automation” problems we have with our legacy systems. 

The repository also leverages an organization’s investment in existing legacy systems by documenting program information for future application development.

 

An Enterprise View of Metadata Management

Metadata is therefore a key resource to the warehouse during all phases of its life cycle, from the warehouse construction, through the user access, and into the maintenance and update of the data it holds.

During the past few years there has been a tremendous level of activity in the vendor metadata repository field, largely due to the rapid growth in the data warehouse and data mart markets.

Business has come to clearly see the issues surrounding the disparate data as they attempt to leverage these data assets across their organizations, and vendors are responding by building enterprise level strategies.  As an example of the state of metadata today, below is an excerpt from a Datamation article:

“Syncing metadata between two products—different functions, different metadata stores, different vendors—is a huge challenge, too.  To do it, you’d have to get the right piece of metadata at the right level of detail from one product and map it to the right piece of metadata at the right level of detail in the other product; then straighten out any differences in meaning or in coding between them.  And then do it again for each of the hundreds of other points in metadata space that the two products share in common.  And then figure out what to do when the metadata changes in one of the products.  And if the metadata structure (yes, that would be the meta-metadata) of a product changes, you get to do it all again.

Syncing the metadata between two products is tough.  Syncing metadata among each of the half-a-dozen tools it could take to build, run, and access a data warehouse is an almost unthinkable task.  But for a smooth, robust, efficient data warehouse operation, it’s sync or sink.  What you really need is a single, comprehensive metadata source that is accessible to all of the tools you buy—the tools you buy for the data warehouse, certainly, but also the tools you buy for virtually every other IS function, as well.  One metadata source, no syncing.”

Vendors are beginning to respond to these kinds of pressures and are trying to solve the enterprise metadata puzzle.  However, the answers are not simple.  Metadata is collected and/or generated in a variety of places in the Data Warehousing architecture, from data rationalization, data extraction, from data manipulation and application specifics, and from query engines such as OLAP.

Today, each of these areas has a number of vendors who offer products, and each vendor has a slightly different approach or paradigm to their metadata solution.  There are several possible approaches, and some vendors are propagating the concept of a single enterprise level metadata repository to integrate the enterprise’s disparate metadata. 

The simplest approach is to have all vendors utilize the same semantics, paradigm, etc. and collect all of the metadata in a single format in a single location.  In real life, however, business deployment occurs on multiple business platforms, so the repository must follow business needs rather than implement technology for technology’s sake.

The ideal repository needs to exist in the real world by providing quality metadata to support existing business initiatives, wherever those initiatives reside.  The primary objective should be to produce quality metadata that is consistent with the business model and provide support across the enterprise, and be readily leveraged into eBusiness applications, data warehousing, and other IT deployed business solutions.

The Computer Associates PLATINUM Repository product line highlights a practical approach designed as an entire business metadata solution that meets the needs of today’s competitive businesses.  In the mainframe-dominated business, the PLATINUM Repository exploits the MVS space with proven technology that enables IT organizations to employ their most effective resources in order to provide an excellent metadata solution.

In the distributed business environments, PLATINUM Repository is hosted on NT systems provisioned with an open DBMS and incorporating metadata sources from many other operating systems such as MVS, NT, UNIX and AS/400.  This often is the platform that provides the whole business solution that spans the enterprise.

Finally, PLATINUM Repository also comes with an Open Information Model (OIM) metamodel that is hosted on NT and that supports OIM compliant third party products.  Each deployment option supports distributive business needs across the enterprise, leveraging existing technology investment, yet providing easy web-based intranet inquiry technology for business and IT users.  By providing the appropriate deployment for the business platforms of choice, PLATINUM Repository leverages metadata technology to meet the business needs.

Another alternative is to partner with other vendors in various domains within the data warehouse architecture (extraction, loading, OLAP, etc.), and build translation schema from their metadata to a canonical metadata.  This in effect gives users the option of a suite of products whose metadata all can be translated into a single “metadata language.”  Vendors such as Computer Associates, who have a wide spectrum of products in the eBusiness and Data Warehousing world, can ensure that all of their own products speak this “metadata language” through a common services layer that links in any of the deployed repository technologies across the distributed enterprise.
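
As a minimal sketch of the translation idea described above, the following Python fragment maps metadata records from two tools into one canonical form.  The tool names, field names, and canonical record layout are all assumptions made for illustration; they do not describe any vendor's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalElement:
    """Canonical form of a data element (hypothetical layout)."""
    name: str
    data_type: str
    source_tool: str


# Hypothetical translation schemas: canonical field -> tool-specific field.
TRANSLATION_SCHEMAS = {
    "extract_tool": {"name": "COL_NAME", "data_type": "COL_TYPE"},
    "olap_tool": {"name": "memberName", "data_type": "memberType"},
}


def to_canonical(tool, record):
    """Translate one tool-specific record into the canonical form."""
    schema = TRANSLATION_SCHEMAS[tool]
    return CanonicalElement(
        # Normalize spelling so the same element compares equal across tools.
        name=record[schema["name"]].strip().upper(),
        data_type=record[schema["data_type"]].strip().upper(),
        source_tool=tool,
    )
```

In this sketch each tool keeps its own format, and only the translation schema has to be written once per tool.  A real canonical model would carry far more than a name and a type.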

A single metadata repository then could be built at an enterprise level, and a corporation could attain business consistency and manageability by implementing this concept.  This area will be of intense interest to corporations as they build their corporate architectures and approaches to “enterprise” metadata.  This is a very active area for vendors at this time with many levels of interactivity.  Computer Associates is integrating the products of recent acquisitions into their mainstream tool sets.

Other major vendors are arranging alliances and developing bridging software.  Needless to say, in an area this active, there will be multiple degrees of integration, features, and functions, all dynamic and changing with every release.  In the following sections, we will review important characteristics in an enterprise metadata product.

Managing Metadata Within and Across Warehousing Efforts

Most large organizations today have had some experience with data warehousing implementations.  Today, these typically take the form of data mart style implementations in various departmental focus areas such as financial analysis or customer focused systems assisting business units.  Many organizations have multiple warehousing initiatives underway simultaneously and these systems will most likely be based on products from multiple data warehousing vendors, in the typical decentralized approach of most corporations.  This approach has worked to date in that it has allowed reasonably rapid implementation of these systems and demonstrated to the organization the benefit and potential of data warehousing as a business tool at a fraction of the cost of the enterprise data warehouse model.

However, as pointed out earlier, this is the typical “ready, fire, aim” approach which got us to the redundant, disparate, legacy data we have today.  Some areas of the business are beginning to show signs of stress as a result of this approach to implementing data warehousing.  Data and metadata are spread across multiple data warehousing systems, and system managers are wondering how best to coordinate and manage the dispersed metadata mess they have today.  How do we maintain consistency when business rules change as a result of corporate reorganizations, regulatory changes, or other changes in business practices?

What happens when an application wants to change the technical definition? How many places are impacted for each of these potential changes? These issues among others are forcing businesses to take a larger view—an enterprise view—of metadata management systems.  Coordinating metadata across multiple data warehouses is one significant step in the right direction, and a repository is just the tool to do that.

 

A Repository as a Metadata Integration Platform

Ideally, a corporation should adopt a repository as a metadata integration platform, making metadata available across the organization.  This would serve to manage key metadata across all of the data warehouse and data mart implementations within an organization.  This would allow all of the participants to share common data structures, business rule definitions, and data definitions from system to system across the enterprise. The platform would accept and manage information from multiple sources.  These would include systems from major vendor technology databases (e.g. IBM, Microsoft, Computer Associates, Informix, Oracle, Sybase, etc.) and across a broad spectrum of tools, from extraction tools to analysis tools.  On the output side, the system should provide open access by multiple tools as well as APIs for custom needs.

The metadata repository also facilitates consistency and maintainability.  It provides a common understanding across warehouse efforts promoting sharing and reuse.  If a new data element definition is required for a data mart implementation, the platform should permit versioning to support the need.  With a shared metadata repository the exchange of key information between business decision managers (facilitated by good solid end user access tools) becomes more feasible.  And, when multiple data marts and data warehouses are involved, a central metadata platform will simplify and reduce the effort required to maintain them when viewed as a whole.



Lifecycle Issues

Repository systems need to contribute to and integrate with the existing legacy system environment and play an active role throughout the lifecycle of data warehousing systems to be truly considered enterprise metadata repositories.

Documenting database and legacy information is an important capability in metadata repositories.  Legacy models provide the information sourcing, data inventorying, and design that are key to developing an effective data warehouse.  The metadata surrounding the acquisition, access, and distribution of warehouse data is the key to providing the business user with a complete map of the data warehouse.



What Needs to Be In an Enterprise Repository to Make the Warehouse Work Better

The repository should play an active role in the entire life cycle of the data warehouse and all the output attributes of system and business value.  This includes existing legacy systems as sources, third party tools, etc.  This extends the repository’s role so that it contributes not only in the development phases but also in the downstream support and maintenance phases, which account for the bulk of the cost of all IS systems.  These would include the systems management, database management, business intelligence, and application development tools and components listed below.
 

·         Systems management tools that can be used to manage jobs, improve performance, and automate operations, not only in operational systems but also in data warehouse systems. 

·         Database management tools that can help create and maintain the database management systems for operational systems, data warehouses, and data marts. 

·         Data movement tools that transform and integrate disparate data types and move data reliably to the warehouse. 

·         Business intelligence tools that provide end-user access and analysis for making business decisions. 

·         Neural Network technology solutions that mine the data to create knowledge. 

·         Business applications that provide packaged warehouse solutions for specific markets.

·         Data warehouse consulting that uses a methodology based on the experiences of hundreds of other companies, thereby reducing the risk associated with making uninformed business decisions.

·         Application development solutions that help you build, test, deploy, and manage operational and warehouse applications throughout the enterprise.

·         CASE tool support that provides consistency and maintainability immediately by developing consistent terminology and structures.

·         Repository-to-CASE interfaces that enable an organization to manage multiple CASE workstations from the repository.  These tools are designed to allow an organization to better utilize the data maintained in their CASE workstations by providing a central point of control and storage.

·         Sophisticated version controls, collision management, and bi-directional interfaces, enabling the sharing and reuse of metadata among programmers and analysts working independently.

Some areas to focus on in reviewing repository functionality are discussed in the following sections.

 

Non-proprietary Relational Database Management System

A repository should ideally use an industry standard DBMS that provides significant advantages over vendor-developed DBMSs.  These advantages include advanced tools and utilities for database management (such as backups and performance tuning) as well as dramatically enhanced reporting capabilities.  Furthermore, maintainability and accessibility are enhanced by an “open” system.

Using a standard database also allows the repository vendor to focus on the quality of the repository, not the features of the database management system.  In addition, it allows the vendor to take advantage of new features made available by the DBMS vendor.

 

Fully Extensible Meta Model

A repository should be a completely self-defining, extensible repository based on a common entity/relationship diagram.  By using a model that reflects industry standards, it can provide users with the ability to easily customize the meta model to meet their specific needs.  The repository should support the following meta model extensions:

·         adding or modifying an entity type,

·         adding or modifying a linkage between entity types (associations or relationships),

·         adding user views (with different screen layouts or validations) to entities or relationships,

·         adding, deleting, or modifying attributes of relationships or entities,

·         modifying the list of allowable values for an attribute type,

·         adding or modifying commands or user exits,

·         adding custom command macros, and

·         adding or modifying help and informational messages.

The vendor should support standards such as Open Information Model (OIM) and metadata exchange through commonly accepted formats such as eXtensible Markup Language (XML) data streams, which will allow information to be easily shared across multiple vendors’ products.  Ideally, the vendor will also focus on supporting eBusiness activity through current XML data and its embedded or globally defined metadata, the Document Type Definition (DTD).
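
To make the idea of an extensible meta model more concrete, here is a small Python sketch, offered as an illustration only.  It shows a toy meta model that can be extended with new entity types, attributes, allowable values, and relationships, and then written out as XML for exchange.  The class name, method names, and XML element names are assumptions invented for this example; they are not OIM and do not reflect any product's schema.

```python
import xml.etree.ElementTree as ET


class MetaModel:
    """Toy self-describing meta model: entity types, attributes, relationships."""

    def __init__(self):
        self.entity_types = {}   # type name -> {attribute name: allowed values or None}
        self.relationships = []  # (source type, relationship name, target type)

    def add_entity_type(self, name, attributes=()):
        self.entity_types[name] = {attr: None for attr in attributes}

    def add_attribute(self, entity_type, attribute, allowed_values=None):
        # Also used to modify the list of allowable values for an attribute.
        self.entity_types[entity_type][attribute] = allowed_values

    def add_relationship(self, source, name, target):
        if source not in self.entity_types or target not in self.entity_types:
            raise KeyError("both entity types must be defined first")
        self.relationships.append((source, name, target))

    def to_xml(self):
        """Serialize the meta model to an XML string (hypothetical element names)."""
        root = ET.Element("metamodel")
        for type_name, attrs in sorted(self.entity_types.items()):
            node = ET.SubElement(root, "entityType", name=type_name)
            for attr, allowed in sorted(attrs.items()):
                attr_node = ET.SubElement(node, "attribute", name=attr)
                for value in allowed or ():
                    ET.SubElement(attr_node, "allowedValue").text = value
        for source, name, target in self.relationships:
            ET.SubElement(root, "relationship", name=name, source=source, target=target)
        return ET.tostring(root, encoding="unicode")
```

Because the model describes itself as data, extending it means adding entries rather than changing program code.  This is the property that makes customization practical.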

Application Programming Interface (API) Access

API access to the repository can provide an organization with the flexibility needed to create a metadata management system that suits its unique needs.  An API-based architecture makes the repository more powerful by allowing users to create custom applications and programs.  In addition, the API separates metadata from the tools that access and manipulate it.  Because the tools manipulate metadata only through the API, access to the underlying data is transparent: if the data structures change, the tools do not need to be changed.  This allows for greater efficiency and flexibility in an organization’s application development.
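
The separation described above can be sketched in a few lines of Python.  This is a hypothetical illustration, not any repository's actual API: the class and method names are invented, and the two storage classes stand in for a repository whose internal data structures have changed.

```python
class RepositoryAPI:
    """Hypothetical API layer: tools call these methods, never the storage."""

    def __init__(self, storage):
        self._storage = storage

    def describe(self, element_name):
        return self._storage.fetch(element_name)


class FlatStorage:
    """First storage layout: one dictionary per element."""

    def __init__(self):
        self._rows = {"CUST_ID": {"type": "INTEGER", "owner": "Sales"}}

    def fetch(self, element_name):
        return dict(self._rows[element_name])


class SplitStorage:
    """Revised layout: attributes held in separate structures."""

    def __init__(self):
        self._types = {"CUST_ID": "INTEGER"}
        self._owners = {"CUST_ID": "Sales"}

    def fetch(self, element_name):
        return {"type": self._types[element_name],
                "owner": self._owners[element_name]}


def report_tool(api, element_name):
    """A 'tool' written only against the API."""
    info = api.describe(element_name)
    return f"{element_name}: {info['type']} (owner: {info['owner']})"
```

In this sketch, report_tool produces the same result whether the API is backed by FlatStorage or SplitStorage, so a change in storage layout does not require a change to the tool.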


Central Point of Metadata Control

The repository serves as a central point of control for data, providing a single place of record about information assets across the enterprise.  It documents where the data is located, who created and maintains the data, what application processes it drives, what relationship it has with other data, and how it should be translated and transformed.  This provides users with the ability to locate and utilize data that was previously inaccessible.  Furthermore, a central location for the control of metadata ensures consistency and accuracy of information, providing users with repeatable, reliable results and organizations with a competitive advantage.

 

Impact Analysis Capability

If the repository has an impact analysis facility, it can provide virtually unlimited navigation of the repository definitions to provide the total impact of any change.  Users can easily determine where any entity is used or what it relates to by using impact analysis views.

An impact analysis facility answers the true questions in the analysis phases without forcing a user to sift through large quantities of unfocused information.  Furthermore, sophisticated impact analysis capabilities allow better time estimates for system maintenance tasks.  They also reduce the amount of rework resulting from faulty impact analysis (e.g., a program not being changed as a result of a change to a table that it queries).
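
At its core, impact analysis is a traversal of "is used by" links between repository objects.  The Python sketch below illustrates this with a breadth-first search.  The object names and the link table are made up for the example, and a real repository would hold these links in its database rather than in a dictionary.

```python
from collections import deque

# Hypothetical "is used by" links: object -> objects that depend on it directly.
USED_BY = {
    "table:CUSTOMER": ["program:BILLING", "view:ACTIVE_CUSTOMERS"],
    "view:ACTIVE_CUSTOMERS": ["report:MONTHLY_SALES"],
    "program:BILLING": [],
    "report:MONTHLY_SALES": [],
}


def impact_of(changed, used_by=USED_BY):
    """Return every object reachable from `changed` via 'is used by' links."""
    impacted = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in used_by.get(current, []):
            if dependent not in seen:  # guards against cycles and duplicates
                seen.add(dependent)
                impacted.append(dependent)
                queue.append(dependent)
    return impacted
```

Here a change to table:CUSTOMER reaches report:MONTHLY_SALES even though the report never queries the table directly.  Indirect dependencies of this kind are the ones most easily missed in a manual analysis.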


Naming Standards Flexibility

A repository should provide a detailed map of data definitions and elements, thereby allowing an organization to identify redundant definitions and elements and decide which ones should be eliminated, translated, or converted.  By enforcing naming standards, the repository assists in reducing data redundancies in the future and increasing data sharing, making the application development process more efficient and therefore less costly.  In addition, an easily enforceable standard encourages organizations to define and use consistent data definitions, thereby increasing the reuse of standard definitions across disparate tools.
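
As an illustration of how a naming standard might be checked mechanically, the following Python sketch validates element names against a pattern and groups elements whose definitions are textually identical.  The standard shown here (upper-case words joined by underscores and ending in a class word) and the list of class words are assumptions chosen for the example, not a recommended standard.

```python
import re
from collections import defaultdict

# Hypothetical standard: upper-case words joined by underscores,
# ending in one of a fixed set of class words.
CLASS_WORDS = ("ID", "NAME", "DATE", "AMT", "CODE")
NAME_PATTERN = re.compile(
    r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*_(%s)$" % "|".join(CLASS_WORDS)
)


def violates_standard(name):
    """True if the element name does not follow the naming standard."""
    return NAME_PATTERN.match(name) is None


def find_redundant(definitions):
    """Group element names that share the same normalized definition text."""
    groups = defaultdict(list)
    for name, text in definitions.items():
        # Normalize case and whitespace before comparing definitions.
        groups[" ".join(text.lower().split())].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Matching on identical definition text finds only the most obvious redundancies.  Deciding whether two differently worded definitions describe the same element still requires human review.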

 

Versioning Capabilities

In repository discussions, “versioning” can have many different definitions.  For example, some version control capabilities are:

·         version control as in test vs. production (lifecycle phasing);

·         versions as unique occurrences;

·         versioning by department or business unit; and

·         versioning by aggregate or workstation ID.

The repository’s versioning capabilities facilitate the application lifecycle development process by allowing developers to work with the same object concurrently. Developers should be able to modify or change objects to meet their requirements without affecting other developers.
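
One of the meanings of versioning listed above, test versus production phasing, can be sketched as follows.  This Python fragment is a simplified illustration under assumed names; it keeps a separate history for each object in each lifecycle phase and does not attempt collision management.

```python
class VersionedRepository:
    """Toy store keeping every version of an object, keyed by name and phase."""

    def __init__(self):
        self._versions = {}  # (name, phase) -> list of definitions, oldest first

    def save(self, name, definition, phase="test"):
        """Store a new version and return its version number (starting at 1)."""
        history = self._versions.setdefault((name, phase), [])
        history.append(dict(definition))
        return len(history)

    def latest(self, name, phase="test"):
        """Return a copy of the most recent version in the given phase."""
        return dict(self._versions[(name, phase)][-1])

    def promote(self, name, source="test", target="production"):
        """Copy the latest version from one lifecycle phase to another."""
        return self.save(name, self.latest(name, source), phase=target)
```

In this sketch a developer can save changes in the test phase without affecting the production definition, which changes only when a version is explicitly promoted.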


Robust Query and Reporting

The repository should provide business users with a vehicle for robust query and report generation.  The end user tool should seamlessly pass queries to its own tool or third party products for automatic query generation and execution.  Furthermore, business users should be able to create detailed reports from these tools, increasing the amount of valuable decision support information they are able to receive from the repository.

 

Data Warehousing Support

The repository provides information about the location and nature of operational data that is critical in the construction of a data warehouse.  It acts as a guide to the warehouse data, storing information necessary to define the migration environment, mappings of sources to targets, translation requirements, business rules, and selection criteria to build the warehouse.
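
The kinds of metadata named above (source-to-target mappings, translation requirements, and selection criteria) can be pictured with a short Python sketch.  The column names, the transformations, and the selection rule are all hypothetical; the point is only that the load logic is driven by metadata rather than written by hand for each field.

```python
from dataclasses import dataclass
from typing import Callable


def identity(value):
    """Default translation: pass the value through unchanged."""
    return value


@dataclass
class Mapping:
    source: str                     # column name in the operational system
    target: str                     # column name in the warehouse
    transform: Callable = identity  # translation requirement


# Hypothetical source-to-target mappings held as metadata.
MAPPINGS = [
    Mapping("CUSTNO", "CUST_ID", int),
    Mapping("CUSTNM", "CUST_NAME", lambda v: v.strip().title()),
    Mapping("REGION", "REGION_CODE"),
]


def select(row):
    """Hypothetical selection criterion: keep only active customers."""
    return row.get("STATUS") == "A"


def build_warehouse_rows(source_rows, mappings=MAPPINGS):
    """Apply the selection criterion, then map and translate each column."""
    return [
        {m.target: m.transform(row[m.source]) for m in mappings}
        for row in source_rows
        if select(row)
    ]
```

When a source column is renamed or a translation rule changes, only the mapping entries need to be updated.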


Conclusion

Organizations are becoming increasingly aware of the limitations of their own systems and internal data.  The attempts to liberate and leverage data across the organization’s stovepipes have been replete with frustration and too many examples of failure.  These experiences, coupled with drivers demanding flexibility in business processes, are hastening the day that businesses will implement an enterprise level view of metadata.  Activity to supply this enterprise level capability is being aggressively pursued by all major vendors.  It is critical that corporations understand the issues at hand as they adopt enterprise strategies for Data Warehouse and eBusiness initiatives, and that they be in a position to evaluate what set of vendor products are appropriate to their situation.

 

Glossary of Terms

24x7 Lights Out Operations—The use of Systems Management tools to ensure the reliable movement and update of data from operational systems to analytical systems.

Analytical Data Store—Useful in making strategic decisions, this data storage area maintains summarized or historical data. This stored data is time variant, unlike operational systems that contain real-time data.  Information contained in this data store is determined and collected based on the corporate business rules.

Application Lifecycle—Includes the following three stages:

  1. process and change management,
  2. analysis and design,
  3. construction and testing.

Architecture—A definition and preliminary design which describes the components of a solution and their interactions. Architecture is the blueprint by which implementers construct a solution that meets the users’ needs.

Availability—A measure of the percentage of time that a computer system is capable of supporting a user request.  A system may be considered unavailable as a result of events such as system failures or unplanned application outages.

Business-Driven—An approach to identifying the data needed to support business activities, acquiring or capturing those data, and maintaining them in a data resource that is readily available.

Business-Driven Approach—The process of identifying the data needed to support business activities, acquiring or capturing those data, and maintaining them in the data resource.

Business Information Demand—An organization’s continuously increasing, constantly changing need for current, accurate information, often on short notice, to support its business activities.

Business Rules—The statements and stipulations that a corporation has set as “standard” in order to run the enterprise more consistently and smoothly.

Capacity Planning—The process of considering the effects of a warehouse on other system resources such as response time, DASD requirements, etc.

CASE—Computer Aided Software Engineering.

CASE Management—The management of information between multiple CASE “encyclopedias,” whether the same or different CASE tools.

Centralized Data Warehouse—A Data Warehouse implementation in which a single warehouse serves the need of several business units simultaneously with a single data model that spans the needs of the multiple business divisions.

Change Propagation—The process of generating only the updates from the source databases to the target databases (usually the data warehouse).

Chargeback—The process that data warehouse managers use to ensure appropriate costs are correctly distributed to the corresponding business units and users so that they can meet financial reporting requirements.

Client/Server—A distributed technology approach where the processing is divided by function.  The server performs shared functions—managing communications, providing database services, etc.  The client performs individual user functions—providing customized interfaces, performing screen to screen navigation, offering help functions, etc.

Client/Server Processing—A form of cooperative processing in which the end-user interaction is through a programmable workstation (desktop) that must execute some part of the application logic over and above display formatting and terminal emulation.

Consistent Data Quality—The state of a data resource where the quality of existing data is thoroughly understood and the desired quality of the data resource is known.  It is a state where disparate data quality is known, and the existing data quality is being adjusted to the level desired to meet the current and future business information demand.

Data—Items representing facts, text, graphics, bit-mapped images, sound, and analog or digital live-video segments.  Data is the raw material of a system supplied by data producers and is used by information consumers to create information.

Data Access—The process of entering a database to store or retrieve data.

Data Access Tools—End-user oriented tools that allow users to build SQL queries by pointing and clicking on a list of tables and fields in the data warehouse.

Data Accuracy—The component of data integrity that deals with how well data stored in the data resource represent the real world.  It includes a definition of the current data accuracy and the adjustment in data accuracy to meet the business needs.

Data Administration—The processes and procedures by which the integrity and currency of the data in the warehouse are maintained.

Data Analysis and Presentation Tools—Software that provides a logical view of data in a warehouse.  Some create simple aliases for table and column names; others create data that identify the contents and location of data in the warehouse.

Data Consistency—The result of using a repository to capture and manage data as it changes so that decision support systems can be continually updated.

Data Definition—Enabling the structure and instances of a database to be defined in a human- and machine-readable form.

Data Distribution—The placement and maintenance of replicated data at one or more data sites on a mainframe computer or across a telecommunications network.  This is the part of developing and maintaining an integrated data resource that ensures data are properly managed when distributed across many different data sites.  Data distribution is one type of data deployment, which is the transfer of data to data sites.

Data Mart—A subset of the data resource, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs.  The concept of a data mart can apply to any data whether they are operational data, evaluation data, spatial data, or metadata.

Data Model—A logical map that represents the inherent properties of the data independent of software, hardware, or machine performance considerations.  The model shows data elements grouped into records, as well as the association around those records.

Data Movement—The transportation of data from disparate sources ranging from various mainframes, client/server machines, and network file servers to a central location, the data warehouse, in order to create a reliable source of information, usable for strategic decision making.

Data Quality—Indicates how well data in the data resource meet the business information demand.  Data quality includes data integrity, data accuracy, and data completeness.

Data Store—A place where data is stored; data at rest.  A generic term that includes databases and flat files.

Data Transformation—(1) The formal process of transforming data in the data resource within a common data architecture.  It includes transforming disparate data to an integrated data resource, transforming data within the integrated data resource, and transforming disparate data.  It includes transforming operational, historical, and evaluation data within common data architecture.

(2) Creating “information” from data.  This includes decoding production data and merging of records from multiple DBMS formats.  It is also known as data scrubbing or data cleansing.

Data Warehouse—(1) A subject oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision making process.  A repository of consistent historical data that can be easily accessed and manipulated for decision support.

(2) An implementation of an informational database used to store sharable data sourced from an operational database-of-record. It is typically a subject database that allows users to tap into a company’s vast store of operational data to track and respond to business trends and facilitate forecasting and planning efforts.

Database—A collection of data that are logically related.

DBA—Database Administrator.

Decision Support—A set of software applications intended to allow users to search vast stores of information for specific reports that are critical for making management decisions.

Disparate Data—Data that are essentially not alike, or are distinctly different in kind, quality, or character.  They are unequal and cannot be readily integrated to adequately meet the business information demand.  Disparate data are heterogeneous data.

Distributed Database—A collection of multiple, logically related databases that are provided to data sites.

End User Data—Data formatted for end-user query processing; data created by end users; data provided by a data warehouse.

DTD—Document Type Definition, the embedded or globally defined metadata that maps the data components of an XML data stream.

Enterprise—A complete business consisting of functions, divisions, or other components used to accomplish specific objectives and defined goals.

Enterprise Data Warehouse—An Enterprise Data Warehouse is a Centralized Warehouse that services the entire enterprise.

FTP (File Transfer Protocol)—A client-server protocol which allows a user on one computer to transfer files to and from another computer over a TCP/IP network.

Fragmentation—The process in which a packet is broken into smaller pieces, fragments, to fit the requirements of a physical network over which the packet must pass.

Global Enterprise—A corporate environment not limited by geographic location.

Heterogeneous Data—See disparate data.

Heterogeneous Databases—See disparate databases.

Incremental Refresh—A technique which loads only data which has changed since the last load into a Data Warehouse or Data Mart.

Information—(1) A collection of data that is relevant to one or more recipients at a point in time.  It must be meaningful and useful to the recipient at a specific time for a specific purpose. Information is data in context, data that have meaning, relevance, and purpose. 

(2) Data that has been processed in such a way that it can increase the knowledge of the person who receives it. Information is the output, or “finished goods,” of information systems.  Information is also what individuals start with before it is fed into a Data Capture transaction processing system.

Information Technology Infrastructure—An infrastructure for the information technology discipline that provides the resources necessary for an organization to meet its current and future business information demand.  It consists of the data resource, the platform resource, business activities, and information systems.

Job Management Tools—Tools which include job scheduling, job optimization, charge back, and output management tools that help operations managers and system administrators manage, monitor, and coordinate the execution of enterprise-wide IT jobs.

Legacy Data—Another term for disparate data because they support legacy systems.

Metadata—(1) Traditionally, metadata were data about the data. In the common data architecture, metadata are all data describing the foredata, including meta-praedata and the meta-paradata.  They are data that come after or behind the foredata and support the foredata.

(2) Metadata is data about data.  Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions.  The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services.  Metadata includes things like the name, length, valid values, and description of a data element.  Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.

Methodology—A system of principles, practices, and procedures applied to a specific branch of knowledge.

OIM (Open Information Model)—An open standard for metadata exchange, independent of source and target.

OLTP—On-Line Transaction Processing.

On-Line Transaction Processing—Processing that supports the daily business operations.  Also known as operational processing and OLTP.

Operational Data Store—Contains timely, current, and integrated information.  The data is typically very granular.  These systems are subject oriented, not application oriented, and are optimized for looking up one or two records at a time for decision making.

Operational Systems—Please refer to Legacy Data.

Performance Management Tools—Tools that help warehouse managers monitor, maintain, and manage warehouse performance in distributed, heterogeneous environments.

Problem Resolution Software—Tools that provide automated problem report management for help desks, technical support departments, or customer service operations.  These products can be used by support staff, as they assist customers or end users, or as part of an automated call-in self-help system.

Query—A (usually) complex SELECT statement for decision support.  See Ad-Hoc Query or Ad-Hoc Query Software.

RDBMS—Relational Database Management System.

Reference Data—Business data that has a consistent meaning and definition and is used for reference and validation (Process, Person, Vendor, and Customer, for example).  Reference data is fundamental to the operation of the business.  The data is used for transaction validation by the data capture environment, decision support systems, and for representation of business rules.  Its source for distribution and use is a data warehouse.

Refresh Technology—A process of taking a snapshot from one environment and moving it to another environment overlaying old data with the new data each time.

Relational Database Management System—A Database Management System which uses the concept of two-dimensional “tables” to define “relationships” among the different elements of the database.

Repository—A location, physical or logical, where databases supporting similar classes of applications are stored.

Repository Environment—The Repository environment contains the complete set of a business’s metadata.  It is globally accessible.  As compared to a data dictionary, the repository environment not only contains an expanded set of metadata, but also can be implemented across multiple hardware platforms and database management systems (DBMS).

Reusability—Using code developed for one application program in another application.

Scalability—(1) The ability to scale to support larger or smaller volumes of data and more or fewer users.  The ability to increase or decrease size or capability in cost-effective increments with minimal impact on the unit cost of business and the procurement of additional services. 

(2) The ability of a system to accommodate increases in demand by upgrading and/or expanding existing components, as opposed to meeting those increased demands by implementing a new system.

Securability—The ability to provide differing access to individuals according to the classification of data and the user’s business function, regardless of the variations.

SQL (Structured Query Language)—A structured query language for accessing relational, ODBC, DRDA, or non-relational compliant database systems.

Stovepipe Decision Support Systems—Independent, departmental data marts incapable of making accurate decisions across the enterprise because they have no way to consistently define data.

Target Database—The database in which data will be loaded or inserted.

Warehouse Application Vitality—A solution to enable business needs to drive the technology that reaches the end-user’s desktop by limiting the negative effects of application change.

XML (eXtensible Markup Language)—A markup language derived from SGML.  XML can define web page content and/or eBusiness data streams.

 

Copyright Information

This White paper is the property of The Applied Technologies Group, Inc. and is made available upon these terms and conditions. The Applied Technologies Group, Inc. reserves all rights herein.  Reproduction in whole or in part of this paper is only permitted with the written consent of The Applied Technologies Group, Inc.  This report shall be treated at all times as a proprietary document for internal use only. 

It may not be duplicated in any way, except in the form of brief excerpts or quotations for the purpose of review.  In addition, the information contained herein may not be duplicated in other books, databases or any other medium. Making copies of this report, or any portion for any purpose other than your own, is a violation of United States Copyright Laws. The information contained in this report is believed to be reliable but cannot be guaranteed to be complete or correct.

Copyright © 2000 by The Applied Technologies Group, One Apple Hill, Suite 216, Natick, MA 01760, Tel: (508) 651-1155, Fax: (508) 651-1171, E-mail: info@techguide.com, Web Site: http://www.techguide.com