@cristianvasquez
Last active September 28, 2022 00:08
Descriptive and Technical Study of Languages and Metadata Systems Used on the Web


Cristian Vasquez, 2002

Introduction

Today, most web content is designed by humans to be read by humans, forming a chaotic library from which knowledge is difficult and expensive to extract.

Several communities promote the "Semantic Web," which adds machine-processable representations to the Web. The idea is to establish channels for the feasible processing, integration, and reuse of information, on the bet that the knowledge extracted this way will be more useful to humans. The Semantic Web would simply be an extension of the existing Web in which information acquires meaning through metadata, which provides semantics for the content and enables automatic reasoning about it.

What is Metadata?

Although we could define metadata (etymologically speaking) as something that goes "beyond the data," the current literature offers no consensus on what metadata is. A frequently used definition says that metadata is "data about data": in general, an object that describes or says something about another piece of information. Although the word "metadata" became widespread in the context of digital information, the generation of metadata dates back centuries. Librarians have long created metadata in the form of book catalogs, card catalogs, and, currently, online catalogs. Today the concept has been generalized to cover any type of (standardized) descriptive information about resources, including those that are not digital.

Formally, we could say that metadata is data responsible for keeping a record of the meaning, context, or purpose of an information object so that the object can be discovered, understood, extracted, and managed. In general, these records are smaller than the objects they describe and are written in a short, concise format so that they can be exchanged.

Metadata can describe collections of objects, the processes in which they are involved, events, their components, and each of the restrictions that apply to them.

Metadata define the relationships between objects, such as tuples in a database or classes in object orientation, and thereby generate structures. Throughout the document, I use the following technical terminology about languages:

  • Semantics: the meaning that a community associates with a metadata element or with the values of that element, organized in a commonly understood vocabulary.
  • Structure: the part of the language that imposes an order on semantic expressions and makes the encoding of the semantics coherent and unambiguous. It allows those who communicate to interpret the expressions consistently.
  • Syntax: the part of the language that provides the means to represent structures and delivers mechanisms to encode, exchange, visualize, and process metadata. An example is XML.
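To make the distinction between structure and syntax concrete, the following sketch keeps one structure (a record with a title and a creator) and encodes it in two syntaxes, XML and JSON. The element names and sample values are illustrative assumptions and are not taken from any standard discussed in this study.

```python
import json
import xml.etree.ElementTree as ET

# Structure: an ordered list of (element, value) pairs. The semantics of
# "title" and "creator" would come from a vocabulary shared by a community.
record = [("title", "Example Report"), ("creator", "Jane Doe")]

def to_xml(pairs):
    """Encode the structure with XML syntax."""
    root = ET.Element("record")
    for name, value in pairs:
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")

def to_json(pairs):
    """Encode the same structure with JSON syntax."""
    return json.dumps(dict(pairs))

def from_xml(text):
    """Recover the structure from its XML encoding."""
    return [(child.tag, child.text) for child in ET.fromstring(text)]

print(to_xml(record))
print(to_json(record))
```

Both encodings carry the same elements and values; only the syntax changes, which is why a receiver that knows the structure can read either one.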

Evolution of Metadata on the Web

Metadata has its roots in the catalog, probably invented by the Sumerians shortly after the beginning of recorded history. Over the centuries, the clay tablets they used evolved into handwritten lists and later, after the invention of the printing press, into book records. These first book catalogs were printed, alphabetically arranged lists without sophisticated classification criteria. An essential advance in classification schemes came around 1900, when library catalogs were entirely replaced by cards so that they could be updated. In the 1960s, mass production methods (such as computers) made it necessary to have multiple copies of existing catalogs and massively distributed collections of books, and card catalogs failed to meet the new requirements. It then became necessary to develop coding standards, known today as metadata.

The first digital metadata systems date from the end of the 20th century and updated many of the coding standards, languages, and protocols already used in physical catalogs.

MAchine Readable Cataloging (MARC)

MARC was designed to transmit data from one system to another and was revolutionary for incorporating variable-length fields. A record contains a "directory" of fixed-length alphanumeric codes that determine the name, the length, and the starting position of each description field, as well as control fields, used to classify the information in terms of time and place. The variable description fields are those that contain traditional cataloging data. Each is preceded by a defined code that ranges from 001 to 999; for example, code 650 holds the topical subject of the resource. More than twenty national standards have emerged since the creation of MARC (DenMARC, AZMARC, CHMARC, UKMARC, CAN/MARC, etc.), and they tend to "harmonize." The best known is USMARC (United States MARC), also called LC-MARC, which was developed in 1968 by the United States Library of Congress and derived from MARC. Another renowned standard is MARC21, the product of merging CAN/MARC (Canada) and USMARC in 1999. Also, since 1977 there has been an interlingua between the different MARC standards, created through the effort of each national bibliographic agency: libraries translate their standards to the UNIMARC (UNiversal MARC) scheme and vice versa.
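The role of the directory can be sketched in a few lines. The example below assumes 12-character directory entries (a 3-character tag, a 4-digit field length, and a 5-digit starting position), which is the layout used by MARC 21. It is a simplification: the record leader, indicators, subfield delimiters, and field terminators are all omitted, and the sample record is invented.

```python
from typing import NamedTuple

class DirectoryEntry(NamedTuple):
    tag: str      # field name, e.g. "650"
    length: int   # number of characters in the field
    start: int    # offset of the field inside the data area

ENTRY_WIDTH = 12  # 3 (tag) + 4 (length) + 5 (start)

def parse_directory(directory):
    """Split a directory string into fixed-width entries."""
    entries = []
    for i in range(0, len(directory), ENTRY_WIDTH):
        chunk = directory[i:i + ENTRY_WIDTH]
        entries.append(DirectoryEntry(chunk[:3], int(chunk[3:7]), int(chunk[7:12])))
    return entries

def read_fields(directory, data):
    """Use the directory to cut the variable-length fields out of the data area."""
    return [(e.tag, data[e.start:e.start + e.length]) for e in parse_directory(directory)]

# Invented record: a control number (001) and a topical subject (650).
directory = "001000800000" + "650000800008"
data = "ocm12345" + "Metadata"
print(read_fields(directory, data))
```

Because every entry has the same width, a reader can locate any field without scanning the variable-length data, which is what makes variable-length fields practical.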

ISO 23950 (Z39.50)

This is a protocol for querying multiple online catalogs. Of American origin, it dates from 1988, when it was approved by NISO (National Information Standards Organization), and it allows the user of one system to search and retrieve information without knowing the syntax used by the other systems.

It has an XML encoding called XER and is portable to SQL. Traditional libraries will probably keep using both MARC and Z39.50 widely for a while, because of the high cost these institutions would incur in changing formats and the little funding they have for that purpose.

Standard Generalized Markup Language (SGML), the Document Type Definition (DTD), and the Warwick Framework played a fundamental role in the development of the languages used for metadata markup.

Standard Generalized Markup Language (SGML)

SGML is a document markup language. Its roots date back to 1969, when IBM laboratories developed the Generalized Markup Language (GML), a language that evolved until 1974, when it came to be called SGML. The International Organization for Standardization (ISO) approved and published this language in 1986 under the official name ISO 8879. This international standard consists of a set of rules to describe the structure of a document and to promote exchange across computer platforms.

SGML is extremely flexible and is the basis of the most commonly used markup languages today. In SGML, a document is defined by the structure of the entities that compose it. These entities are organized hierarchically in a logical structure, which determines the structure of the document's elements. Different documents can share entities. Markup is carried out using delimiters and tags of the form <tag>element</tag>. Tags are written in the basic character set defined by the ISO 8879 standard and can be nested. In the historical context of metadata, the introduction of SGML played a fundamental role, since it established a new paradigm in which data is no longer just data: SGML documents keep content, structure, and format separate (in the logical sense).

Document Type Definition (DTD)

DTDs are SGML applications used to define the structure of a particular type of document. Their roots date back to 1978, when IBM laboratories published the first DTDs as part of the development of SGML. A DTD describes the structure of many documents or of one document in particular. The structure is defined through rules that express the names and content of each type of element and the order in which the parts may appear. One of the best known is the HTML (HyperText Markup Language) DTD, which defines the rules that gave birth to this widely used language for Web page markup. Libraries use various DTDs, such as EAD (Encoded Archival Description) for bibliographic description and TEI (Text Encoding Initiative) for marking up electronic versions of cultural texts.
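The idea of rules that fix element names, content, and order can be illustrated with a small checker. This is not a DTD parser: real DTDs allow optional, repeated, and alternative content, while the sketch below only accepts one exact sequence of children per element. The "memo" document type and its rules are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for a "memo" document type: each element name maps to
# the exact ordered sequence of child elements it must contain.
RULES = {
    "memo": ["to", "from", "body"],
    "to": [],
    "from": [],
    "body": [],
}

def conforms(element, rules=RULES):
    """Return True if the element and all of its descendants follow the rules."""
    if element.tag not in rules:
        return False
    children = [child.tag for child in element]
    if children != rules[element.tag]:
        return False
    return all(conforms(child, rules) for child in element)

valid = ET.fromstring("<memo><to>Ana</to><from>Luis</from><body>Hi</body></memo>")
invalid = ET.fromstring("<memo><from>Luis</from><to>Ana</to><body>Hi</body></memo>")
print(conforms(valid), conforms(invalid))
```

The second document uses the right element names but in the wrong order, so it is rejected; this is the kind of constraint a document type definition expresses declaratively.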

The Warwick Framework

This initiative has its origins in a workshop held at the University of Warwick in April 1996. The meeting gathered more than fifty representatives from libraries, Internet standards, text markup, and digital library projects. The concept behind the initiative is to keep multiple metadata sets, independent of each other, in one place, so that access through distinct metadata sets is supported. The framework allows each metadata set to use a different syntax according to its semantic requirements, promoting interoperability and extensibility when the agents or systems that use these packages handle them (selectively).

The framework has two types of objects: containers and packages. A package is a metadata set that contains elements, such as a Dublin Core description, and a container is the place where packages or other containers are stored. A container can be transient or persistent: a transient container is a transport object between repositories, clients, and agents, while persistent containers last over time and are accessible through a universal identifier. This framework resulted from an analysis of Dublin Core and greatly influenced the creation of the Resource Description Framework (RDF).
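The container and package model described above can be sketched as two small data types. The class names, the `scheme` label, and the sample identifier are assumptions made for illustration; the Warwick Framework itself does not prescribe this API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Package:
    scheme: str     # which metadata set this is, e.g. "dublin-core"
    elements: dict  # the description itself

@dataclass
class Container:
    items: list = field(default_factory=list)  # packages or nested containers
    identifier: Optional[str] = None           # set only for persistent containers

    @property
    def persistent(self):
        return self.identifier is not None

    def packages(self, scheme=None):
        """Yield packages, recursing into nested containers; optionally filter by scheme."""
        for item in self.items:
            if isinstance(item, Container):
                yield from item.packages(scheme)
            elif scheme is None or item.scheme == scheme:
                yield item

dc = Package("dublin-core", {"title": "Example Report"})
marc = Package("marc", {"650": "Metadata"})
transient = Container([marc])
stored = Container([dc, transient], identifier="urn:example:container-1")
print([p.scheme for p in stored.packages()])
```

An agent that only understands one metadata set can call `packages("marc")` and ignore the rest, which is the selective handling the framework aims for.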

Extrinsic and Intrinsic Metadata

Depending on the context, two large groups of metadata of very different natures can be distinguished: extrinsic metadata and intrinsic metadata. Extrinsic metadata keeps a persistent link between the metadata record and the object it describes, a relationship that exists separately from the object. It allows information to be exchanged without exchanging the resources themselves, reducing the costs of distributing and administering the objects. For this reason, operations are cleaner. A practical case is requesting authentication before the data can be accessed.

Extrinsic metadata is also flexible and scalable when it comes to changes, since modifications and additions are made to the metadata without re-encoding the primary data. Intrinsic metadata is synchronized with the object it describes; because its use is sensitive to the context of that object, it is difficult for it to reside outside the object. The most common cases are metadata that describes dynamically generated objects and metadata that describes objects whose characteristics vary frequently over time, such as the spatial distribution of objects or the availability of an item in a commercial store. The main problem with intrinsic metadata is that its administration is closely tied to the object it describes, which generates technical issues when either of the two is altered. Some communities recommend the joint use of extrinsic and intrinsic metadata in their descriptions (while distinguishing between them).
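A minimal sketch of the two arrangements follows. The identifiers, field names, and byte strings are invented for illustration; the point is only where the description lives relative to the object.

```python
# Intrinsic: the description travels inside the object itself.
photo = {"pixels": b"\x00\x01", "meta": {"width": 1, "height": 2}}

# Extrinsic: the description lives in a separate store and points to the
# object through an identifier (a hypothetical URN here).
objects = {"urn:example:photo-1": b"\x00\x01"}
catalog = {"urn:example:photo-1": {"title": "Test image", "rights": "public"}}

def describe(identifier):
    """Answer a query from the catalog alone, without fetching the object."""
    return catalog[identifier]

def annotate(identifier, key, value):
    """Add to the extrinsic record; the primary data is not re-encoded."""
    catalog[identifier][key] = value

annotate("urn:example:photo-1", "subject", "calibration")
print(describe("urn:example:photo-1"))
```

Changing the intrinsic `meta` entry, by contrast, means producing a new version of `photo` itself, which is the coupling described above.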

Identifiers

An identifier points to an entity uniquely and persistently, identifying that entity throughout its existence. Identifiers are the basis of description with metadata since, in this area, objects only exist when they are identifiable and distinguishable.

Uniform Resource Identifier (URI)

A URI is an object that identifies some entity uniquely and persistently. It is a code expressed as a sequence of characters from a limited alphabet. URIs are independent of how they are rendered, whether as printed letters or as symbols on a screen; the essential thing is that they identify an object uniquely and persistently. A URI always identifies an entity, which can be anything: a document, a physical device, a movie, an abstract concept (e.g. "author"), or even a citizen of a country. An interesting analogy is a person's identity card number, which is unique to that person and persists throughout their existence, regardless of where they are or how they have changed over time.

One of the most important cases is the URL (Uniform Resource Locator), which refers to the subset of URIs that identify resources through a representation of their primary access mechanism (their location), rather than by a name or some other attribute of the resource. The term URN (Uniform Resource Name) refers to the subset of URIs that are required to identify a resource globally and persistently, even if the resource ceases to exist.

There are different URI schemes, each with its own representation rules. These schemes differ in the components of a resource that they identify or in the alphabet with which they represent resources. For example, in the host name of a URL, D and d are treated as the same character, so http://www.dcc.uchile.cl is equivalent to http://www.DCC.UChile.CL, unlike Unix file identifiers, which are case-sensitive. There are many identifier schemes, with varying degrees of acceptance and use.
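The case rule for host names can be checked with Python's standard `urllib.parse` module, which lowercases the scheme and the host name when it splits a URL. The comparison function below is a sketch and is not a full URI normalization procedure; the URN in the last line is a made-up example.

```python
from urllib.parse import urlsplit

def same_location(a, b):
    """Compare two URLs, ignoring case in the scheme and host name only."""
    pa, pb = urlsplit(a), urlsplit(b)
    key_a = (pa.scheme, pa.hostname, pa.port, pa.path or "/", pa.query)
    key_b = (pb.scheme, pb.hostname, pb.port, pb.path or "/", pb.query)
    return key_a == key_b

print(same_location("http://www.dcc.uchile.cl", "http://www.DCC.UChile.CL"))
print(same_location("http://www.dcc.uchile.cl/Docs", "http://www.dcc.uchile.cl/docs"))

# A URN uses a different scheme and names a resource instead of locating it.
print(urlsplit("urn:example:report-2002").scheme)
```

The path is compared as written, because the part of a URL after the host may be case-sensitive, much like the Unix file names mentioned above.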

An identifier always refers to the same resource regardless of the context, hence the term "uniform." Different types of resource identifiers must also be usable in the same context, for example, in a description with metadata. It is important to note that an object identified with a URI can legitimately be identified by more than one identifier. The trend indicates that new types of resource identifiers are introduced over time; these should not interfere with existing ones, so that a conventional semantic interpretation remains possible for all the different identifiers. URIs are determined and administered in a decentralized way by various organizations. Since it is challenging to maintain exhaustive control over them, URIs are not always assigned unequivocally over time, and mechanisms are then necessary to resolve the resulting conflicts. This is where the concept of a "degree of trust" toward organizations emerges.

Information Retrieval with Metadata

Classical information retrieval techniques assume that the text alone describes the content of a resource in sufficient detail. These techniques are handy, but more sophisticated searches tend to be fruitless, even when users go to the trouble of structuring the search space (by type of resource, locality, and language, to name a few). Metadata structures make it possible to build tools for extracting resources that take the semantic aspects of the queries into account. With the current mass-use information retrieval systems, questions such as "tell me all the resources that contain 10-year-old Chilean girls writing a story" or "Venezuelan songs that include the sound of a flute" do not yield appropriate results. Searches become more effective in an environment where metadata exists: metadata allows multiple interrelations to be defined between information objects, providing the semantic structure needed to give users a retrieval that is useful in terms of the quality and relevance of the results. The defined structures also allow inferences to be drawn from them and therefore allow more complex queries.
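The flute query above becomes straightforward once resources carry structured descriptions. The following sketch filters a small, entirely invented collection of metadata records; the field names (`type`, `country`, `instruments`) are assumptions chosen for the example and do not belong to any standard.

```python
# Invented metadata records; none of these resources exist.
records = [
    {"id": "urn:example:song-1", "type": "song", "country": "Venezuela",
     "instruments": {"flute", "cuatro"}},
    {"id": "urn:example:song-2", "type": "song", "country": "Venezuela",
     "instruments": {"harp"}},
    {"id": "urn:example:song-3", "type": "song", "country": "Chile",
     "instruments": {"flute"}},
    {"id": "urn:example:text-1", "type": "story", "country": "Chile",
     "instruments": set()},
]

def search(records, **criteria):
    """Return ids of records matching every criterion.

    Set-valued fields match when they contain the wanted value.
    """
    hits = []
    for record in records:
        for key, wanted in criteria.items():
            value = record.get(key)
            if isinstance(value, (set, frozenset)):
                if wanted not in value:
                    break
            elif value != wanted:
                break
        else:
            hits.append(record["id"])
    return hits

# "Venezuelan songs that include the sound of a flute"
print(search(records, type="song", country="Venezuela", instruments="flute"))
```

A full-text search over lyrics or file names would have no reliable way to answer this question, since none of the three criteria needs to appear in the text of the song.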

It is currently challenging to extract metadata descriptions automatically from the content of lexical resources, so the manual or assisted generation of metadata must not be ruled out. The production of metadata should be simple and straight to the point, bearing in mind that the purpose of metadata systems is to support the operations to be carried out on a particular resource, not to create complete descriptions of it. People who carry out data markup develop specialized skills over time, giving rise to a culture of metadata markup.

Benefits when using Metadata

The benefits of using metadata are diverse and depend on the area. In general terms:

  • Metadata attaches content, context, and structure to information objects, thus assisting the process of recovering knowledge from collections of objects.
  • Metadata allows different conceptual points of view to be generated for its users or systems, freeing them from needing advanced knowledge about the existence or characteristics of the objects described. These conceptual views can depend on the system or user; for example, a child can be shown the information of a particular page differently (using the metadata of the page and the profile of the child).
  • Metadata allows information to be exchanged without transferring the resources themselves. This facilitates, among other things, searches over distributed collections. Metadata also allows a precise and discrete description of resources, making it possible to create virtual collections that group information objects to meet specific requirements. An example could be an educational institution that collects course materials from institutions around the globe and groups them by subject, regardless of the format of the content received.
  • In each productive process, or at each stage of the life cycle of an information object, metadata adds value to the system of metadata and the objects it describes, generating data from which knowledge relevant to the system itself and its processes can be extracted.
  • Metadata allows resources to be accessed in a controlled manner, since the object described is precisely known. It then becomes possible to establish filtering systems, which in turn provide a basis for authentication and for mechanisms that define degrees of confidence in the sources of information.
  • Metadata facilitates the preservation of information objects, allowing them to be migrated successively (thanks to structural information) for possible use by future generations. The semantic information of the objects is maintained, thus reducing the loss of knowledge.
  • Metadata is essential to sustain the growth of the Web on a larger scale, allowing searches and knowledge integration across a greater number of heterogeneous sources.

Metadata Standards

A metadata standard is a collection of keywords or structures that describe different concepts, generally covering some field of knowledge that is stable and not very large. These concepts help to explain statements within this field of expertise, capturing the most critical aspects of their meaning.

The generation of metadata standards is an investment in future interoperability, as it expands the possibilities for different parties to work together effectively in the long run, regardless of changes in technology. Standardizing metadata on the Web is generally very difficult, and metadata can be inconsistent even for objects of the same domain. Communities do not reach consensus on criteria and standards for metadata, which is logical since there are innumerable ways of organizing objects. To date, no standard has achieved global acceptance, which, from certain points of view, is an advantage.

Classification of Metadata Standards

Classifying metadata facilitates its overall understanding. The classification I propose groups standards into categories according to the general purpose of each metadata framework.

Administrative

Administrative metadata refers to information provided to facilitate the administration of resources: data about how an object was created, who is responsible for controlling access to it or registering its content, what processing activities were carried out on the content, and what access or use restrictions apply. An example is metadata used for preservation, which supports the long-term retention of digital objects and their reconstruction in case of loss.

Examples:

  • Acquisition records
  • Legal access requirements
  • Spatial information of the resources to be managed
  • Version control
  • User tracking
  • Content reuse
  • The physical condition of a resource

Descriptive and Discovery

Descriptive and discovery metadata refers to the information provided to find, describe, and distinguish each information object. Dublin Core is the clearest example of this type of metadata. This category also includes metadata that describes resources from specific domains of knowledge. Examples from the field of science are Darwin Core, which provides a representation for the search and retrieval of natural history collections, and the Data Documentation Initiative (DDI), the standard used to describe data sets in the social sciences.

Examples:

  • Support metadata for information retrieval
  • Specialized indexes
  • Information on the metadata scheme used
  • Taxonomies

Technical Models

Technical metadata corresponds to the metadata standards whose elements describe how to interpret the workings of a system. An example is the metadata that describes the format of a digital image. Model metadata relates to the pieces of a composite information object, in terms of how its components are interrelated. For example, metadata can describe that, in the context of a book, we arrive at the desired topic by following the page number indicated in the index.

Examples:

  • Metadata that describe the operation of hardware and software
  • Structure of digital formats
  • Models of metadata exchange

It is important to note that the boundaries between these categories tend to be diffuse, and a metadata standard might not fit in just one of them: the same metadata scheme includes components with different purposes and scope. A formal classification that places each standard in only one of these three categories does not adequately represent reality, so I use a triangular diagram to visualize the classification.

In this diagram, four (of innumerable) standards are classified. A standard located near one of the categories has a higher number of components that are intended to fulfill that general purpose. In the diagram, Dublin Core lies (almost) entirely in the descriptive and discovery category, and MPEG7 is close to the center of the diagram because it has elements that fulfill all three general purposes.

Metadata Languages

We express metadata using languages that specify the syntax in which structures are defined and that provide the means for the necessary semantic specifications (to tell us what syntactic expressions mean in terms of a model). These models and syntaxes are what allow us to represent expressions, facts, rules, and queries about descriptions. Each of these languages is a derivation or instance of the languages (or schemas) that precede it, as can be seen in the following diagram.

Important Points

An essential element in the description of information objects on the Web is the imperative need to identify them and to have a method for accessing them, or descriptions of them, as necessary. Throughout the study, no metadata system was found that did not use URIs (except for core metadata that is self-supporting, such as DOI); URIs are what allow the use of controlled vocabularies and determine the existence of objects.

Identifiers need to be unique, stable and secure, publicly accessible, and persistent. These four conditions are tough to fulfill, since their implementation requires a social foundation.

Descriptions use metadata systems: The way in which metadata systems are generated on the Web differs greatly from the way they are generated in traditional library science. Metadata used on the Web does not aim to describe resources exhaustively (unlike markup schemes such as MARC) but to create systems that use the different frameworks together. In this sense metadata must be granular, a characteristic necessary for its survival. This is why, for example, it is common to find descriptions made using Dublin Core fields and identified by DOI. The generation of these metadata systems requires languages that allow the different components to be grouped. RDF is the main candidate because it is a basic-level language that allows these systems to be built and, through the use of DAML (currently OWL), allows inference over ontologies. Among the components of a metadata description, Dublin Core has proven to be the most stable; it is the candidate for the broadest diffusion and can be considered a general-purpose standard. The degree of standardization of some general-purpose frameworks allows a certain degree of reuse. However, at the level of particular domains, communities invest significant effort in developing specialized structures to meet their own requirements, and these commonly diverge from the systems used by other communities in the same domain.
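The combination mentioned above, Dublin Core fields attached to a resource identified by DOI, can be sketched as RDF-style triples using plain Python tuples. The namespace is the published Dublin Core element namespace; the DOI, title, and creator are hypothetical, and the output only approximates N-Triples because literal escaping is omitted.

```python
DC = "http://purl.org/dc/elements/1.1/"

def describe(subject, **fields):
    """Build (subject, predicate, object) triples from Dublin Core element names."""
    return [(subject, DC + name, value) for name, value in fields.items()]

def to_ntriples(triples):
    """Render triples as N-Triples-style lines (literal escaping is omitted)."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)

resource = "https://doi.org/10.9999/example-2002"  # hypothetical DOI
triples = describe(resource, title="Example Report", creator="Jane Doe", date="2002")
print(to_ntriples(triples))
```

Each framework contributes only its own piece: the DOI supplies the identifier, Dublin Core supplies the element names, and the triple model is what groups them into one description.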

The generation of metadata systems currently faces technical problems, such as insufficient documentation of the models behind the different standards, which makes interoperability between systems difficult (for example, translating elements from one standard to another is hard). It is also not yet common for the models (ontologies) to be represented in a standard language that would allow us to understand the metadata.

The metadata is closely linked to the object it describes: In both intrinsic and extrinsic metadata, descriptions are strictly tied to the relevant characteristics of the objects, which generates new technical difficulties. It is necessary to deal with the changes that objects experience over time. On the Web, objects are usually created dynamically, posing challenges that have not been resolved to this day (although the technology to do so exists).

The Barrier

The main obstacle to the use of metadata is not technological, since the elements needed to materialize the three points mentioned above are available.

The real barrier to its use is social: it is necessary to create a culture of metadata. From certain points of view this is possible, on the bet that the evolution of metadata will follow a path similar to that of catalogs.

The context: On the Web, no central authority (such as a national library) governs metadata; instead there are multiple groups, organizations, and people working independently. While this hinders the development of universal standards, it also provides enough freedom of creation and use for an explosion of metadata systems. Languages such as RDF will be used to allow this freedom of creation. Metadata can be an evolution in knowledge acquisition systems, and it is very likely that its use will become massive in the future, not only because of its additional benefits but because it is essential to sustain the growth of the Web on a larger scale. The incorporation of metadata will generate new problems and challenges for the future.

