Copyright is held by the World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2006, May 23-26, 2006, Edinburgh, Scotland.
Ontologies are at the heart of the semantic web. They define the concepts and relationships that make global interoperability possible. However, as these ontologies grow in size they become more and more difficult to create, use, understand, maintain, transform and classify. We present and evaluate several algorithms for extracting relevant segments out of large description logic ontologies for the purposes of increasing tractability for both humans and computers. The segments are not mere fragments, but stand alone as ontologies in their own right. This technique takes advantage of the detailed semantics captured within an OWL ontology to produce highly relevant segments. The research was evaluated using the GALEN ontology of medical terms and procedures.
I.2.4 [Knowledge Representation Formalisms and Methods]: Semantic networks; I.2.8 [Problem Solving, Control Methods, and Search]: Graph and tree search strategies
Algorithms, Experimentation, Performance
Ontology, OWL, Segmentation, Scalability, Semantic Web
The ultimate vision for a semantic web is to create an internet that computers can understand and navigate. Making this vision a reality will either require an extremely large ontology that describes every term of interest on the Internet, or, more realistically, numerous domain-specific ontologies, which, when aligned with one another, form a web of semantic inter-ontology relations. Either way, the result is a very large knowledge corpus.
Examples of such enormous ontologies are already starting to appear. For example, the biomedical domain has numerous very large ontologies such as SNOMED-CT , GALEN , FMA  and NCI-Thesaurus . However, these ontologies have grown too large to be effectively used and maintained, often requiring large teams of highly trained experts .
If a truly massive semantic web is going to be of use to anyone, users and applications will have to find a way to limit their scope. The knowledge web, as a whole, will be too big and mostly irrelevant for any single task. Search engines have nevertheless shown that the conventional web of documents can be indexed and navigated at a massive scale [5]. A similarly sophisticated distributed system may be feasible for use with the ontologies of the semantic web. However, ontologies, such as those represented in the Web Ontology Language (OWL) , are significantly more complex data structures than mere web pages. OWL builds several levels of complexity on top of the XML of conventional web data . It is likely that large and complex ontologies will require a novel solution.
This paper suggests such a solution: instead of attempting to capture the entire semantic web in a gigantic index, each web application extracts and uses a custom ontology segment specific to its particular needs. Segments are big enough to be useful, but not so big that scaling becomes a problem.
The ontology segmentation techniques shown in this paper exploit the semantic connections between ontology terms and thereby enable web-application developers to quickly (or even automatically) create the custom ontologies they require. This is a first step towards a working application layer on top of a large-scale semantic web.
GALEN serves as an excellent test case because it is both large and complex. It also utilizes much of the expressive power of modern description logics, whereas other large ontologies more closely resemble simple controlled vocabularies. If an effective segmentation algorithm can be demonstrated for something as complex as GALEN, we can therefore expect it to also work well for the complex large ontologies of the future.
Another pre-requisite for our segmentation methodology is that the ontology be normalised . Primitive classes in a normalised ontology have no more than one primitive superclass: multiple parents are modeled implicitly and left to be explicitly inferred later. Normalisation greatly simplifies ontology maintenance.
GALEN in OWL uses the SHIF subset (without negation or disjunction) of the full SHOIN(D) expressivity of OWL-DL, so the segmentation is currently constrained to that. The methodology presented here is not meant to offer a complete solution with rigorous logical proofs. Instead, we present empirical evidence as to the effectiveness of our approach.
Most ontologies' properties are structured as flat lists. GALEN, however, employs a rich property hierarchy with over 500 distinct properties. This is especially useful for producing extracts constrained to specific user and/or application requirements. Ontologies with simple property structures, such as, for example, the Gene Ontology , will not be able to take advantage of this aspect of the segmentation algorithm presented herein.

Description logic reasoners such as FaCT++ [38], RACER , or Pellet  can be used to infer new information that is implicit in an ontology . This process is very important, especially for an ontology like GALEN, which was built with normalisation principles in mind. It is therefore critical that GALEN in OWL can be classified.
However, none of the above mentioned description logic reasoners based on the tableaux algorithm are currently able to classify the complete GALEN ontology. GALEN is too large and complex for these reasoning systems. A primary aim of this research was therefore to produce classifiable segments. The ideal segment is as small and focused as possible, while still containing enough information to enable the reasoner to infer relevant new subsumption relationships.
Restrictions, from one point of view, are anonymous classes and can be asserted as superclasses of another (named) class. For example, the class MalePerson might have the restriction in Figure 1 asserted as its superclass. This means that all individuals of the MalePerson class must have one or more relations along the hasGender property to individuals in the class MaleGender. Figure 1 illustrates this relationship.
However, seen from another point-of-view, restrictions represent cross-links between different classes as shown in Figure 2, so that an ontology can be seen as a large hierarchy of classes linked by restrictions.
In reality, of course, the anonymous qualified restriction superclasses actually restrict individuals' relations to other individuals, but it is useful to think of them simply as links. This paper will, from here on, assume this model of ontological topology.
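This "restrictions as links" model can be sketched in a few lines of code. The sketch below is illustrative only: it models an ontology as named classes carrying a parent and a list of (property, filler) restriction superclasses, and reads the restrictions off as a flat list of links. All names besides the MalePerson example are invented.

```python
# Hypothetical toy model: each named class maps to its named superclass and a
# list of (property, filler) restrictions asserted as anonymous superclasses.
ontology = {
    "MalePerson": {"parent": "Person",
                   "restrictions": [("hasGender", "MaleGender")]},
    "Person": {"parent": "Thing", "restrictions": []},
    "MaleGender": {"parent": "Gender", "restrictions": []},
    "Gender": {"parent": "Thing", "restrictions": []},
}

def links(onto):
    """Read every qualified restriction as a cross-link between classes."""
    return [(cls, prop, filler)
            for cls, entry in onto.items()
            for prop, filler in entry["restrictions"]]

print(links(ontology))  # [('MalePerson', 'hasGender', 'MaleGender')]
```

Under this view, the hierarchy (the "parent" entries) and the cross-links (the "restrictions" entries) can be traversed separately, which is what the segmentation algorithms below rely on.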
Even though ``isPartOf'' and ``hasPart'' are inverses of each other, the reciprocal statements in Figure 3 are not equivalent; in fact, neither implies the other.
GALEN is unusual in that it commonly represents anatomy using reciprocal pairs of restrictions. This representation is inherently cyclical and connects every piece of anatomy with every related piece in both directions. Tableaux classifiers intrinsically scale exponentially when faced with such constructs. None of the current tableaux-based reasoners listed in Section 1.6 can classify even a small extract of the GALEN ontology containing both types of reciprocal links present in the original. (Note: the original classifier used in GALEN used different principles and did not suffer from this particular limitation .)
The algorithm presented herein therefore takes the approach of taking all reciprocals into account, but producing actual segments with only one-way links, using, for example, only ``isPartOf'' relations. Some of the new relations may be virtual: i.e. have only been found by first adding the reciprocals.
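A minimal sketch of this one-way normalisation follows; the property names come from the paper's example, but the link data is invented. Each ``hasPart'' link is inverted into its ``isPartOf'' reciprocal, so asserted pairs collapse into a single link and unpaired ``hasPart'' links yield virtual ``isPartOf'' links:

```python
# Map each property to the preferred inverse used in the segment.
INVERSES = {"hasPart": "isPartOf"}

links = [
    ("Heart", "isPartOf", "CardiovascularSystem"),
    ("CardiovascularSystem", "hasPart", "Heart"),   # reciprocal of the above
    ("CardiovascularSystem", "hasPart", "Aorta"),   # no asserted reciprocal
]

def one_way(links):
    """Keep only one link direction: invert every 'hasPart' link into
    'isPartOf'; reciprocals never asserted directly become *virtual* links."""
    result = set()
    for source, prop, target in links:
        if prop in INVERSES:
            result.add((target, INVERSES[prop], source))
        else:
            result.add((source, prop, target))
    return sorted(result)

print(one_way(links))
# [('Aorta', 'isPartOf', 'CardiovascularSystem'),
#  ('Heart', 'isPartOf', 'CardiovascularSystem')]
```

The cyclical Heart/CardiovascularSystem pair collapses into one link, and a virtual link for the Aorta appears, which is the behaviour the segmentation algorithm requires.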
It is important to note that these reciprocal links differ from symmetric links. That statement in Figure 4 is a symmetric link, which has a very different meaning to the example of a reciprocal link given above. Symmetric links do not adversely affect classification.
The property hierarchy is, however, never traversed downwards. Properties are not of interest unless they are used in the class hierarchy. If a property is used, then it and its superproperties, but no other properties, are included.
Additionally, the superproperties and superclasses of these newly included properties and classes also need to be recursively included, otherwise these concepts would just float in OWL hyperspace. That is, without being attached to the hierarchy, concepts are assumed to simply be subsumed by the top concept, leading to a very messy, confusing and often semantically incorrect view of the unclassified ontology.
Figure 5 gives an illustration of this segmentation algorithm. Starting at the target of the extract, the algorithm traverses the hierarchy upwards all the way to the root class. It also traverses it downwards all the way to the leaf classes. Additionally, any links across the hierarchy from any of the previously traversed classes are followed. The hierarchy is traversed upwards (but not downwards) from any of these classes that the cross-links point to. Links pointing at other classes from these newly traversed classes are also included. This continues until there are no more links left to follow.
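The traversal just described can be sketched as follows. This is a simplified illustration on an invented toy hierarchy, not the actual implementation: ancestors and descendants of the target are collected, cross-links from every included class are followed, and classes reached via cross-links contribute only their ancestors.

```python
from collections import deque

# Invented toy data: is-A parents and restriction cross-links.
parents = {"Heart": "Organ", "Organ": "Thing", "Lung": "Organ",
           "MitralValve": "Valve", "Valve": "Thing",
           "Pericardium": "Membrane", "Membrane": "Thing"}
children = {}
for child, parent in parents.items():
    children.setdefault(parent, []).append(child)
cross_links = {"Heart": ["MitralValve"], "MitralValve": ["Pericardium"]}

def segment(target):
    include, queue = set(), deque()

    def up(cls):                      # ancestors, all the way to the root
        while cls and cls not in include:
            include.add(cls)
            queue.append(cls)         # cross-links of every included class
            cls = parents.get(cls)    # are followed later

    def down(cls):                    # descendants, only below the target
        for sub in children.get(cls, []):
            if sub not in include:
                include.add(sub)
                queue.append(sub)
                down(sub)

    up(target)
    down(target)
    while queue:                      # follow cross-links; upwards-only
        for linked in cross_links.get(queue.popleft(), []):
            up(linked)
    return include

# 'Lung', a sibling of the target, is not pulled into the segment.
print(sorted(segment("Heart")))
```

Note that siblings of the target (here ``Lung'') are excluded: the downward traversal starts only at the target, while cross-link targets contribute ancestors but not descendants.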
For example, if a user is not interested in the diseases modeled in GALEN, he or she can specify to exclude all locative properties. These are properties that specifically link diseases to the locations in the body where they might occur: e.g. ``IschaemicCoronaryHeartDisease hasLocation Heart''.
The upper-level meta-properties which it may be logical to include and/or exclude will differ for each ontology to be segmented. These meta-properties are, in this case, actual properties, since GALEN groups related properties together under super-properties. The following meta-properties and their inverses were selected for coarse-grained property filtering:
Note: the various properties could be broken down much more elaborately. However, the point is that organizing properties under any sensible upper-level property structure will enable some degree of useful property filtering. A more detailed analysis of the GALEN property hierarchy may be found in .

Trivially equivalent definitions (see [14]) are transformed into primitive classes by the segmentation algorithm. These still occupy the correct place in the hierarchy and are easy for editors to display.
As shown in the progression in Figure 6, if the filtering process removes the restriction on a class and this results in a trivial equivalence, then the definition is converted into a primitive class.
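The filtering step and the trivial-equivalence conversion can be sketched together. In this illustrative fragment (the property hierarchy and class data are invented, echoing the hasLocation example above), a restriction is dropped if its property, or any of its superproperties, falls under an excluded meta-property; a defined class left with no restrictions is demoted to a primitive class:

```python
# Invented fragment of a property hierarchy: property -> super-property.
super_properties = {"hasLocation": "locativeAttribute"}

def property_ancestors(prop):
    """A property together with all of its superproperties."""
    chain = [prop]
    while prop in super_properties:
        prop = super_properties[prop]
        chain.append(prop)
    return chain

def filter_class(cls, excluded_meta):
    """Drop restrictions whose property falls under an excluded meta-property;
    a definition left with no restrictions becomes a primitive class."""
    kept = [(p, f) for p, f in cls["restrictions"]
            if not set(property_ancestors(p)) & excluded_meta]
    defined = cls["defined"] and bool(kept)  # trivial equivalence -> primitive
    return {**cls, "restrictions": kept, "defined": defined}

heart_disease = {"parent": "Disease", "defined": True,
                 "restrictions": [("hasLocation", "Heart")]}
result = filter_class(heart_disease, {"locativeAttribute"})
print(result)  # the restriction is removed and the class becomes primitive
```

With no meta-properties excluded, the same class would keep its restriction and remain a definition; the demotion happens only when filtering empties the definition.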
A chain of links is followed to create a list of classes to include in an extract. In doing so, each class's restrictions' filler classes should be included to produce a semantically correct extract (see Sections 2.1 and 3.4). However, if, upon reaching a certain recursion depth, calculated from the extract's target concept, all the links on a class are removed, this class becomes a boundary class.
For example, one might remove the axiom in Figure 7 stating that the Pericardium (the membrane that surrounds the heart) is a component of the CardiovascularSystem (line three of the Figure), since one may not be interested in including the CardiovascularSystem and everything related to it in a segment of the Heart. This creates a boundary class that is still defined in the hierarchy (under SerousMembrane) and therefore still makes sense within the ontology, but has an incomplete definition.
The named superclasses of a boundary class (line two of Figure 7) must be included in the extract in order to place classes in their proper position in the hierarchy. These classes would otherwise all be subsumed under the top concept. These superclasses are, however, also boundary classes, unless they are linked to by way of a shorter recursion path along another concept, as shown in Figure 8.
The main hierarchy of ``is-A'' superclass relationships between classes should not be counted when calculating the traversal depth, since they need to be included in any case and do not substantially increase the complexity of the segment. Subclass relations can be ignored completely, since they are not included in the extract in the first place (see Section 3.5). Figure 8 illustrates the entire boundary extraction procedure.
This methodology effectively limits the size of the ontology, since the presence of a boundary class will cause a link traversal algorithm to terminate. Otherwise, in the case of a densely interlinked ontology such as GALEN, practically the entire ontology could be ``linked-in''.
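The depth-bounded traversal can be sketched as follows. This is an illustrative simplification on invented link data: only cross-link hops count toward the depth (is-A links, which are always included, are omitted here for brevity), and classes reached exactly at the depth limit have their links removed, becoming boundary classes that terminate the traversal.

```python
# Invented cross-links: class -> list of (property, filler) links.
links = {"Heart": [("hasStructuralComponent", "Pericardium")],
         "Pericardium": [("isComponentOf", "CardiovascularSystem")],
         "CardiovascularSystem": [("hasPart", "Aorta")]}

def bounded_segment(target, max_depth):
    """Follow cross-links up to max_depth hops from the target; the classes
    reached at the depth limit keep their place in the hierarchy but lose
    their own links, becoming boundary classes that stop the traversal."""
    include = {target}
    frontier = [target]
    for _ in range(max_depth):
        next_frontier = []
        for cls in frontier:
            for _, filler in links.get(cls, []):
                if filler not in include:
                    include.add(filler)
                    next_frontier.append(filler)
        frontier = next_frontier
    boundary = set(frontier)  # links on these classes are removed
    return include, boundary

include, boundary = bounded_segment("Heart", 2)
print(include, boundary)  # the Aorta, two links beyond the target, is cut off
```

With a depth of two, the CardiovascularSystem becomes a boundary class and the Aorta is never linked in, which is exactly the size-limiting effect described above.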
Noy and Musen's ontology extraction research , also uses the boundary class term, but defines it as any class which is in the range of any property that is used in each restriction on each of the classes which are targeted by the extract. The resulting list of boundary classes is to function as a reminder to the user of other classes they might want to include in the view they are constructing. This approach relies on property ranges being specified in the ontology, which is often not the case and on a graphical user interface to ``prompt''  the user. The approach presented here takes a more automated approach, aiming to produce a heuristic algorithm that creates a useful segment without much user intervention.
(Tests were carried out on a 2.8 GHz Pentium 4 with 2.5 GB of RAM running Racer 1.8.0 on Windows XP Service Pack 2.)

Figure 9 gives a breakdown of how long various aspects of the segmentation process take. The first step is loading the target ontology. The next involves an initial pass over the ontology, scanning for and marking the classes to include in the eventual segment extraction process. Extraction constructs a new, self-contained ontology segment, which is then saved to disk.
As can be seen from the figure, the complete segmentation process takes an average of one minute to complete. However, most of this time is spent loading the ontology. Once the target ontology is in memory, the segmentation itself takes only around six seconds to complete. It can be observed that segments from large ontologies can be created with good computational efficiency, though this is, of course, dependent on the specific implementation of the extraction algorithm.
Performance is currently not fast enough for real-time user queries. However, the results show good potential for future optimisations, especially if loading times can be reduced by streaming segmentation techniques and/or caching. Furthermore, segmentation is not meant to replace querying. Instead, it enables efficient querying of otherwise intractable ontologies.

Table 1 gives size statistics for the basic segmentation of GALEN. As can be seen from the table, the segment is roughly a quarter the size of the original ontology, with the number of properties being reduced the least and the number of primitive classes being reduced the most. A similar pattern can be observed when segmenting using different target classes.
This reduction in size is not enough to enable classification given current memory and reasoner applications. All current tableaux algorithm-based description logic reasoner systems stack-overflow when attempting to classify the basic extract of GALEN. The filtering and boundary extraction algorithms do, however, create classifiable ontology segments (see Section 5.3.2).

Segments were produced using each of the property categories described in Section 4.1. Segments using combinations of property categories were also produced. It was found that the combination of Partitive, Functional and Modifier properties produced the largest ontology that could still be classified successfully. Statistics for this combination segment are therefore also included in the tables below.
Table 2 gives an overview of the size of various property filtered segments. As can be seen from the results, segments could be reduced in size by an average factor of 20 over the original ontology and by a factor of five over the basic extraction methodology.
The test query (probe class) in Figure 10 was introduced into every segmented ontology to test its classification performance. An approximate measure of the degree of knowledge present in a segment may be obtained by counting the number of new classes inferred as subclasses of the probe. The probe effectively answers the query ``everything related to the Heart'' by using the ``attribute'' property, which is the top-level property in GALEN.

Table 3 shows several classification statistics:
Note: the ``new inferences'' column only lists new inferences under the probe class. Many other new subclass relationships are inferred in each segment, but these are not necessarily relevant to the extract's target concept and were therefore not counted as part of this evaluation.
This result indicates that the link structure of the GALEN ontology is very interwoven and unpredictable. There seems to be no tight group of boundary classes that limits a particular extract and therefore also no way to cleanly divide the ontology into modules. That is, the complex ontological relationships cannot be cleanly divided into fixed categories. We should therefore expect traditional partitioning methodologies, such as those discussed in Section 6, to be of limited use in this domain.
Table 4 shows the results of the boundary classification testing. Only boundary depth ``one'' could be successfully classified.
Boundary extraction by itself provides a very good means of controlling the size of an extract, but does not seem to provide much optimization for classification. A combination of boundary extraction and filtering segmentation allows one to control both the classifiability and size of a segment. This combination represents the optimal segmentation strategy.
The research presented in this paper falls into category three.

SparQL [32] defines a simple query mechanism for RDF. Multiple queries are required in order to extract complex knowledge such as, for example, a class and its transitive closure (all classes related to it). SparQL might be a good low-level tool for implementing ontology segmentation, but is not a solution in and of itself.

Another approach builds a view system for RDF ontologies [39]. Its authors highlight RQL  as the only RDF query language that takes the semantics of RDF Schema into account. Their view system has the ability to place each concept in its corresponding place in the complete RDF hierarchy. This practice, similar to the algorithm presented in Section 3, gives a more complete picture of the structure of a query answer than merely returning the relevant concepts in isolation. They do not, however, provide a means of materializing a view, i.e. views are transient: they are discarded as soon as they have served their purpose.

A further system generates views using an ontology query language [18]. This allows views to be customized on-the-fly for specific applications' requirements. It, however, also side-steps the ontology updating problem by only creating virtual views: the views are merely collections of pointers to the actual concepts, and are discarded after they have served their purpose.
By contrast, the methods presented herein create self-standing, persistent, multi-use ontology segments. That is, the segments have a life of their own: they can be transformed, updated, shared, annotated, plugged into applications and otherwise manipulated in a myriad of ways.

Almost every complex system is nearly decomposable [33]. That is, we can always find clusters of objects that are more related to each other than to the other objects around them. How complete a decomposition is possible depends on the nature of the system in question.
Researchers in networking use algorithms to organize the nodes on a network into inter-related islands . Some ontology researchers propose applying a similar methodology to segmenting ontologies.
An ontology can, from this point of view, be viewed as a network of nodes connected by links. The class hierarchy can be interpreted as a directed acyclic graph (DAG) and any relations between classes can be represented as links between the nodes (a simplified model of the paradigm presented in Section 2.1).

Stuckenschmidt and Klein take this approach to partitioning ontologies [36]. They exploit the structure of the hierarchy and constraints on properties' domains and ranges (for example: the ``hasGender'' property might have a domain of ``Animal'' and a range of ``Male or Female'') to iteratively break the ontology up into dynamically sized modules. This method does not take OWL restrictions, which can act as additional links between concepts, into account. Instead it relies on the globally asserted domain & range constraints. However, domains and ranges are optional and may therefore not be asserted in a given ontology.
Structure-based partitioning is primarily meant for breaking an ontology into broad packages or modules so that it can be more easily maintained, published and/or validated. However, this process destroys the original ontology, leaving it decomposed into whatever modules the partitioning algorithm deemed appropriate. Moreover, ontologies, particularly those modeled in OWL, tend to be more semantically rich than a simple network abstraction will capture.

Grau et al. [8] present a method for modularizing OWL ontologies similar to Stuckenschmidt and Klein's approach . However, they address the issue of the original ontology being destroyed by using E-connections  to keep the individual modules somewhat interconnected. The modularized ontology fragments produced by their algorithm are formally proven to contain the minimal set of atomic axioms necessary in order to maintain crucial entailments.
While Grau's approach is formally sound, it does not seem to scale up to complex, large ontologies. In particular, tests using a 3000-class fragment of GALEN failed to produce a useful segmentation. The methodology does not seem to be able to modularize ontologies that make use of an upper ontology . Since many large ontologies rely on such upper ontologies to maintain a high-level organizational structure , Grau's approach is only of limited real-world use.

Another partition-based approach [17] decomposes the knowledge base into self-contained mini-prover partitions, which then communicate with each other using message-passing techniques. The researchers thereby successfully improve the efficiency of their reasoning algorithm when answering queries over large knowledge bases.

Noy and Musen have developed the PROMPT suite [20] of ontology maintenance tools, which are themselves plug-ins to the Protégé ontology editor . Their extraction methodology  focuses on traversal directives, which define how the ontology links should be traversed. Collections of directives completely and unambiguously define an ontology view and can themselves be stored as an ontology. They also introduce the concept of boundary classes around the edges of an extract. However, their view of boundary classes differs from the perspective given in this paper (see Section 4.2).
Noy's research establishes the mechanics of ontology view extraction, but does not address how her system might be used to construct relevant, useful and computationally tractable segments.

Bhatt and Wouters' MOVE system [3] is a generic system that can theoretically be adapted to work with any ontology format. The system extracts a sub-ontology based on a user's labelling of which ontology terms to include and which to exclude. It also has the ability to optimise an extract based upon a set of user-selectable optimisation schemes. These schemes can produce either the smallest possible extract, a medium-sized one, or include as much detail as possible. These extracts can be further restricted by enforcing a set of additional constraints. Their system can, for example, enforce the semantic completeness and well-formedness of an extract .
However, the primary focus of Bhatt and Wouters' architecture is parallel processing. While their extraction system performs very poorly when run on a single machine (17 minutes to produce an extract from a 5000-concept ontology), it achieves optimum performance using around five separate processors.
We argue that speed is not a critical factor in the extraction process. Performance is too poor to facilitate an instant, on-demand extraction web service, but not poor enough to become a serious problem. For example, extraction tests on the GALEN ontology by these authors took on the order of two to five minutes to complete.
Both MOVE and PROMPT produce a materialized view, i.e. a self-standing ontology that has no direct connection with its origin. Both also have the notion of the transitive closure of a concept (Wouters et al. call this semantic completeness ). However, neither methodology addresses the important issue of how to update the main ontology if the view is modified, nor how to transform the ontology on-the-fly while extracting, nor do they discuss the ability to classify an extract using description-logic reasoning systems. Finally, neither system makes use of meta-information about an ontology's semantics in determining the best extract. The user must make a great deal of manual selections and choices for each new extract he or she wishes to produce.
By contrast, the segmentation algorithms presented herein automate the extraction process as much as possible by taking advantage of meta-information. Additionally, these methods have the ability to transform the extracted ontology segments (see Section 4.1.1) and are optimized for enabling classification.
The key difference between the approach presented here and the other approaches is that we do not aim to create views of one type or another. Instead, we aim to produce independently useful and usable ontologies.

Future work will focus on generalising the segmentation algorithm so that it can be applied to ontologies other than GALEN (see Section 1.5).
Such a generic algorithm will then be integrated with ontology alignment methodologies, thereby making it possible to produce extracts that cut across ontology borders. Such a segmentation algorithm, offered as a web service, will make it possible to create ontology segments from a web of ontologies: an otherwise unmanageable semantic web will become tractable for computers and comprehendible for humans.
Additionally, other applications and evaluations of the segmentation methodology (as outlined in Section 1.3) will be explored in the future.
The methods presented take advantage of many ontology maintenance principles: normalisation , upper-ontologies  and rich property hierarchies  are all taken into account in order to produce more relevant segments. Other approaches to segmentation do not take advantage of many of the semantics captured within an OWL ontology and are therefore only of limited use.
Evaluation has shown that segmenting ontologies can decrease their size considerably and significantly improve their performance. The size of the GALEN ontology was reduced by a factor of 20. Moreover, segments of this ontology, which was previously impossible to classify, were classified within seconds. Additionally, useful insights into the ontology meta-structure were gained from the analysis of various segments.
The complete GALEN in OWL, along with a web application that can generate custom GALEN segments, is available online.