Reasoning with Data Flows and Policy Propagation Rules

Data-oriented systems and applications are at the centre of current developments of the World Wide Web. In these scenarios, assessing which policies propagate from the licenses of data sources to the output of a given data-intensive system is an important problem. Both policies and data flows can be described with Semantic Web languages. Although it is possible to define Policy Propagation Rules (PPR) by associating policies to data flow steps, this activity results in a huge number of rules to be stored and managed. In a recent paper, we introduced strategies for reducing the size of a PPR knowledge base by using an ontology of the possible relations between data objects, the Datanode ontology, and applying the (A)AAAA methodology, a knowledge engineering approach that exploits Formal Concept Analysis (FCA). In this article, we investigate whether this reasoning is feasible and how it can be performed. For this purpose, we study the impact of compressing a rule base associated with an inference mechanism on the performance of the reasoning process. Moreover, we report on an extension of the (A)AAAA methodology that includes a coherency check algorithm, which makes this reasoning possible. We show how this compression, in addition to being beneficial to the management of the knowledge base, also has a positive impact on the performance and resource requirements of the reasoning process for policy propagation.


Introduction
Data-oriented systems and applications are at the centre of current developments of the World Wide Web (WWW). Emerging enterprises focus their business model on providing value from data collection, integration, processing, and redistribution. These kinds of systems are not new, as the Web has long enabled tools such as news aggregators, which collect articles from various providers and republish them as collections of short readings, often focusing on specific topics (politics, sport, etc.) 1 . Nowadays, the extraction, publication, and reuse of data on the Web is an established practice, and a large number of APIs provide access to JSON documents, data tables, or Linked Data for a variety of use cases, spanning from content and media linkage [22] to science and education [20].
* Corresponding author, e-mail: enrico.daga@open.ac.uk
The key aspect on which we focus here is the publication of licenses and Terms and Conditions documents associated with those APIs and data artifacts, which declare the rights and policies that should guide their use. Data Hubs collect a large variety of data sources and process them in order to implement the workflow that connects data in their original sources to applications that might want to exploit these data [12]. These systems create new challenges in terms of the volume of data to be stored and require novel processing techniques (for example stream-based analysis [26]), but more importantly they demand more sophisticated approaches to data governance [10]. In the Web of (open) data, developers can access a large variety of information, and often publish the results of their processing. Hence, they need to be aware of any usage constraints attached to the data sources they want to exploit, and they need support in publishing the appropriate policies alongside the data they distribute.
In this complex scenario, assessing what policies propagate from the licenses associated with the data sources to the output of a given data-intensive process is an important problem. Both policies and data flows can be described within the Semantic Web, relying on standards like the W3C PROV model 2 to describe process executions in a provenance chain and the Open Digital Rights Language 3 , whose purpose is to formalize and validate policies. In particular, it is possible to specify Policy Propagation Rules (PPR) [7] by associating policies with data flow steps, although this activity results in a large number of rules to be stored and managed. In [7], we studied how a PPR knowledge base can be compressed by using an ontology of the possible relations between data objects, the Datanode ontology 4 , and by applying the (A)AAAA methodology, a knowledge engineering approach that exploits Formal Concept Analysis (FCA).
In this article we illustrate how reasoning on policy propagation can be practically performed. Building upon [7], we report on an extension of the (A)AAAA methodology that includes a coherency check between the hierarchy of the FCA lattice and the Datanode ontology. This extension was necessary in order to exploit the compressed rule base during reasoning and avoid incorrect results. While the compression of the rule base reduces the number of rules to be managed, it requires the reasoner to compute more inferences. Therefore, we study the impact of rule base compression on the performance of the reasoning process. In other words, this article focuses on two contributions that relate to the aspect of reasoning with (compressed) PPRs, which was missing in [7]: 1-the extension of the (A)AAAA methodology by adding an additional coherency check step to the Assessment phase, and 2-the evaluation of the effect of compression on reasoning performance.
The article is structured as follows. Section 2 reviews the relevant literature. Section 3 presents an exemplary use case, and introduces the elements for reasoning on policy propagation, going through the description of the data flow, the representation of policies, and the concept of Policy Propagation Rule (PPR). Section 4 provides a summary of the (A)AAAA methodology, integrated with a novel Assessment phase that includes a coherency check algorithm that allows effective reasoning with a compressed rule base. We also evaluate the impact of this evolved methodology on the compression factor of the knowledge base of PPRs. In Section 5, we report on experimental results about the impact of a compressed rule base on reasoning. For this purpose, we compare the performance of reasoning with an uncompressed rule base against reasoning with a compressed one. We perform this comparison using two different reasoners, the first computing the inferences at query time, the second materializing them at load time. Finally, we discuss our observations before closing the article with some conclusions and perspectives on future work.

Related Work
In recent years, data repositories and registries have been growing, spanning from data cataloguing services (Datahub 5 ) and data collections (Wikidata 6 , Europeana 7 ) to platforms that manage the collection and redistribution of data (Socrata 8 ). An emerging category of such systems are City Data Hubs, which need to support developers not only in obtaining data, but also in assessing the policies associated with data resulting from complex pipelines [12,3,2]. It is therefore important for these systems to implement technologies that allow the policies associated with derived datasets to be computed. In this article we concentrate on the problem of reasoning on the propagation of policies.
Policies can be represented on the Web in a machine-readable format. The W3C ODRL Community Group 9 works on the development of a set of specifications to enable interoperability and transparent communication of policies associated with software, services, and data. The Open Digital Rights Language (ODRL) 10 is an emerging language to support the definition, exchange and validation of policies [18]. Although ODRL is also available as an ontology, it only defines the semantics of policies in terms of natural language descriptions. An extension of the ODRL semantics has been proposed in [31] by considering dependencies between actions, and discussing the impact of explicit and implicit dependencies on the evaluation of policy expressions. The idea of establishing dependencies between ODRL actions in order to enhance the evaluation of ODRL expressions is related to our work, where we abstract relations in data flows. The Datanode ontology [6], which we use here, is however designed to express a wider range of relations between data artifacts, not only the ones derivable from actions. For instance, partitive relations influence the attached policies but are not derived from any action on the data. Nevertheless, a PPR reasoner can surely benefit from a well-defined semantics of ODRL actions. Recently, the W3C Permissions & Obligations Expression Working Group 11 followed up on ODRL to develop an official W3C standard for defining permissions and obligations.
The RDF Licenses Dataset [28] is an attempt to establish a knowledge base of license descriptions based on RDF and the ontology provided by ODRL. It also uses other vocabularies aimed at extending the list of possible actions, for instance the Linked Data Rights 12 vocabulary. Process executions can be described in the Semantic Web using the Provenance Ontology (PROV-O) [24]. PROV-O describes workflow executions in terms of the agents, actions and assets involved. The Datanode ontology has been designed to describe Semantic Web applications by means of the relations between the data involved in their processes [6]. The ontology is a taxonomy of possible relations that may occur between data objects, which might be part of a process execution, such as the ones described with PROV-O. It can therefore be used to further qualify the implications of the actions performed in such a process. Datanode can describe process implications in a data-oriented way, namely as a network of data objects. While policies and process executions can thus be represented, in the present paper we aim at studying the process of reasoning upon the propagation of policies across a data flow.
Rule-based representation and reasoning over policies is required in order to enable secure data access and usage in distributed environments, particularly in the Semantic Web [25,13,4]. Defeasible logic is used to reason with deontic statements, for example to check the compatibility of licenses or to validate constraints attached to components in multi-agent systems [29]. The problem of license compatibility has been extensively studied in the literature [16,15], and tools that can perform such an assessment do exist [23]. Our previous work introduced a form of policy reasoning, namely policy propagation [7]. A Policy Propagation Rule (PPR) is a Horn clause defined by associating a Datanode relation with an ODRL policy. Reasoning with Horn rules is an effective way of dealing with policies, particularly because Horn rules allow tractable defeasible reasoning [1]. While in this article we only focus on policy propagation, PPRs can in principle be integrated with rule-based reasoners for policy validation.
Formal Concept Analysis (FCA) [33] has the capability of classifying collections of objects depending on their features. We apply FCA in conjunction with the Datanode ontology to detect a common behaviour of relations in terms of policy propagation, with the purpose of compressing a PPR knowledge base. We refer the reader to [9] for a description of the Contento tool, that implements FCA as well as other functionalities for evolving concept lattices in Semantic Web ontologies, also part of the approach we present here.
The approach described in this paper clearly relates to principles and methods of knowledge engineering [32]. In [27], knowledge acquisition is considered as an iterative process of model refinement. More recently, problem solving methods have been studied in relation to the task of understanding process executions [14]. These contributions form the background of the approach we are following in the present work. The problem of compressing propositional knowledge bases has been extensively studied in the past, focusing on the optimization of a Horn minimization process to improve the performance of rule execution [17,5]. In contrast, we deal with compression as a means to reduce the number of minimal rules to be managed (each PPR being already an atomic rule), by means of an additional knowledge base (the Datanode ontology).
It is worth noting that our problem is not one of policy enforcement, but of providing the right information about the policies that might affect the terms of use of a given asset produced by a complex data flow. This problem is also different from that of minimizing access control policies (e.g., the abstraction by subsumption proposed in [19]), as the abstraction required is on the propagation of the policy, not the policy itself. Reasoning on policy propagation does not require the policies to be validated per se. On the contrary, we claim that validating the policies of a data artifact, which is the result of some manipulation, should consider the policies inherited from the input data, according to the particular actions performed.
To our knowledge, the problem of propagation of usage policies in data flows has not been tackled before the contribution in [7]. In [10] we proposed an approach for integrating policy propagation in the data governance activity of Data Hubs, where policies and data flows are managed by Data Hub managers. However, in [7] as well as in the present work, we do not focus on the quality of the data flow representations, and assume a machine-readable description of the policies of the input asset, as well as the existence of an accurate data flow.

Reasoning on policy propagation
In this section, we describe our approach for reasoning on policy propagation, and we present a use case as an example.

Approach
We define the problem of policy propagation as identifying the set of policies associated with the output of a process, implied by the policies associated with the input data source. In order to perform reasoning on policy propagation, we need: a) descriptions of the policies attached to data sources; b) a description of the data flow (the actions performed on the data); and c) policy propagation rules (which actions propagate a given policy).
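Taken together, these three ingredients allow the propagated policies to be computed as a fixed point over the data flow graph. The following Python sketch illustrates the idea; the node, relation, and policy names are illustrative placeholders, and the code is a minimal sketch rather than the implementation used in our experiments.

```python
# Illustrative sketch of policy propagation reasoning (not the authors'
# implementation; relation, policy, and node names are placeholders).

def propagate_policies(data_flow, pprs, input_policies):
    """Compute the set of policies holding at every node of a data flow.

    data_flow      -- iterable of (x, relation, y) triples
    pprs           -- set of (relation, policy) pairs: relation propagates policy
    input_policies -- dict: node -> set of policies asserted on that node
    """
    policies = {node: set(ps) for node, ps in input_policies.items()}
    changed = True
    while changed:  # iterate until a fixed point is reached
        changed = False
        for x, rel, y in data_flow:
            for p in list(policies.get(x, ())):
                if (rel, p) in pprs and p not in policies.setdefault(y, set()):
                    policies[y].add(p)
                    changed = True
    return policies


# A two-step flow: a copy of the Flickr data, then a derivation of it.
flow = [("ex:flickrData", "dn:hasCopy", "ex:localCopy"),
        ("ex:localCopy", "dn:hasDerivation", "ex:output")]
pprs = {("dn:hasCopy", "cc:Attribution"),
        ("dn:hasDerivation", "cc:Attribution")}
result = propagate_policies(flow, pprs, {"ex:flickrData": {"cc:Attribution"}})
```

Here the attribution requirement attached to the input reaches ex:output because both traversed relations have a matching PPR; removing either pair from pprs would stop the propagation at that step.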
Description of policies. We assume the policies of data sources are described as licenses or "terms and conditions" documents, and that they are expressed in RDF according to the ODRL ontology 13 . An ODRL odrl:Policy captures the statements of the policy, specifying a set of odrl:Rules, each including a deontic aspect (odrl:permission or odrl:prohibition) and defined for a set of odrl:Actions and an odrl:target odrl:Asset. Permissions, in turn, can comprise one or more odrl:duty elements. The RDF Licenses Dataset [28], for example, is a source of such descriptions. In our work, we also developed ad-hoc RDF documents to satisfy this requirement, when necessary.
Description of the data flow. We describe data flows as networks of data objects connected by the relations of the Datanode ontology [6]. The ontology groups the relations under five main dimensions 14 , summarized in Figure 1:
Adjacency. dn:adjacentTo represents proximity between two datanodes in a data container. For example, proximity may result from being parts of the same dataset - dn:disjointPartWith, being an annotation of the dataset - dn:hasAnnotation, or an attachment - dn:attachedTo.
Derivation. This branch specializes dn:hasDerivation in a number of different forms. Examples cover activities like mining - dn:hasExtraction, selection - dn:isSelectionOf, reasoning - dn:hasInference, remodelling - dn:remodelledFrom, or the activity of making snapshots of data or caches - dn:hasSnapshot, dn:hasCache, to mention a few.
Metalevels. This dimension covers the relations between a data object and its metadata. The property dn:metadata is used to designate a relation with information that applies to the datanode as a whole. This relation specializes as dn:describes, dn:hasAnnotation and dn:hasStatistics.
Interpretation. This is designed to capture the possibility that a datanode might contribute to inferences that can be made in another one. Two datanodes might be "understood" together, i.e. their content can be compared, or the interpretation (inferences) of one may affect the interpretation (inferences) of another. The most intuitive examples are dn:consistentWith and dn:inconsistentWith. However, this is also the area of the ontology that covers partitive relations: dn:isPartOf and the two specializations dn:isPortionOf and dn:isSectionOf. In Datanode, portion refers to a part of the population of a dataset (such as the rows of a spreadsheet), while section refers to a set of values for a certain dimension in a dataset (for example, a column of a spreadsheet).
14 In this section we only summarize the basic features of the ontology, and we omit inverse relations (for example dn:isDerivationOf), for clarity. The interested reader is referred to [6] and to the online documentation: http://purl.org/datanode/docs
Capabilities. Capability is intended as the power or ability to generate an outcome 15 . Capability is covered by two separate branches starting from dn:overlappingCapabilityWith and dn:differentCapabilityFrom, respectively. Two datanodes may have similar (or different) potential. For example, dn:overlappingVocabularyWith and dn:overlappingPopulationWith express the similarity between two data objects in terms of the vocabulary or population of a dataset. Under this scope, we also positioned dn:optimizedInto (also a kind of derivation), to state the empowerment of an existing capability.
It is worth noting that Datanode relations often have multiple ancestors. For example, dn:hasStatistics is both a dn:hasComputation and a dn:describedBy kind of relation, which in turn are subsumed by dn:hasDerivation and dn:metadata respectively. Similarly, dn:hasAnnotation relates a datanode to some attached metadata, therefore it is subsumed by dn:attachedTo and dn:metadata. We refer to [6] for a discussion on the development of Datanode.
In this work, we use the representations of data flows extracted from the descriptions of several Semantic Web applications prepared in [6].

Policy Propagation Rules. A Policy Propagation Rule (PPR) establishes a binding between a Datanode relation r and a policy p. A PPR is a Horn clause of the following form:

p(X) ∧ r(X, Y) → p(Y)

where X and Y are data objects, p is a policy and r a Datanode relation between X and Y. When the policy p holds for a data object X, related to another data object Y by the relation r, then the policy p also holds for the data object Y. For example, a PPR could be used to represent the fact that downloading a file F distributed with an attribution requirement will result in a local copy D, which also needs to be used according to the attribution requirement. Therefore, the above abstract rule could be instantiated as follows:

attribution(F) ∧ hasCopy(F, D) → attribution(D)

In fact, we can reduce a PPR to a more compact form, i.e. a binary association (r, p) between a policy p and a relation r, as the other components of the rule can be automatically derived for any possible X and Y. With these elements established, we can trace the policies propagated within the data flow connecting input and output.
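Operationally, the compact pairs (r, p) can be expanded on demand by walking the relation hierarchy: a pair asserted on a relation also licenses every sub-relation. The following Python sketch illustrates this with a hypothetical two-level fragment of the Datanode hierarchy; it is an assumption-laden sketch, not the authors' implementation.

```python
# Sketch of using relation subsumption to expand compact PPRs
# (hypothetical hierarchy fragment; not the authors' implementation).

PARENTS = {  # relation -> its direct super-relations (multiple parents allowed)
    "dn:hasCopy": {"dn:hasDerivation"},
    "dn:hasSnapshot": {"dn:hasCopy"},
}

def ancestors(rel):
    """All super-relations of rel, following every parent link."""
    seen, todo = set(), [rel]
    while todo:
        for parent in PARENTS.get(todo.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen

def propagates(rel, policy, compact_rules):
    """True if (rel, policy) is asserted, or inherited from a super-relation."""
    return (rel, policy) in compact_rules or any(
        (a, policy) in compact_rules for a in ancestors(rel))

# A single stored rule on dn:hasDerivation covers its whole branch.
compact = {("dn:hasDerivation", "odrl:attribute")}
```

This is precisely the trade-off studied in this article: the rule base shrinks, but each lookup may now have to climb the hierarchy.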

Example use case
We have described the components required to reason upon policy propagation in data flows. We now introduce a motivating scenario. The following namespaces will be referred to in this example:

rdfs: <http://www.w3.org/2000/01/rdf-schema#>
odrl: <http://www.w3.org/ns/odrl/2/>
cc: <http://creativecommons.org/ns#>
dn: <http://purl.com/datanode/ns/>
ppr: <http://purl.com/datanode/ppr/ns/>
ex: <http://purl.org/datanode/ex/>

We selected EventMedia [21] as an exemplary data-oriented system. EventMedia exploits real-time connections to enrich content describing events and associates it with media objects 16 . The application reuses data exposed by third parties, aggregating data about events and exposing them alongside multimedia objects retrieved on the Web. Aggregated data are internally represented using the LODE ontology [30]. In order to associate the right policies to these data, a description of the policies of the input data, a description of the data flow, and a knowledge base of PPRs are needed. Table 1 lists the licenses or terms of use documents associated with the input data objects 17 . Listing 1 shows the restrictions applied to the Flickr data. Figure 2 illustrates the EventMedia data flow and Listing 2 the equivalent RDF description.
16 See http://eventmedia.eurecom.fr/.
17 The Upcoming service is not available at the time of writing; however, a snapshot of the documentation can be consulted from the Web Archive, reporting a non-commercial use clause: https://web.archive.org/web/20130131064223/http://upcoming.yahoo.com/services/api/. The application was first produced in 2014, when the EventMedia dataset description article was first submitted to the Semantic Web Journal. The description produced refers to the submitted version, which may have changed in the published version.
Data are processed from event directories and enriched with additional information and media from sources like DBpedia 19 , Flickr 20 or Foursquare 21 . In the figure, circles are data objects and arcs are Datanode relations. We will follow the path that connects the ex:output data object to two of the input data objects, namely ex:Flickr - which represents the Flickr API 22 (this path is highlighted in the figure) - and Eventful 23 , a portal to search for upcoming events and related tickets. Apart from using the LODE ontology, ex:output is remodelled from an aggregation of various sources, named ex:collection. The population (entities) of ex:collection includes ex:events, a dn:combinationFrom ex:Eventful with other sources (central path in the figure). Moreover, ex:collection includes descriptions of media from ex:Flickr, expressed by the path dn:hasPortion / dn:isCopyOf / dn:isSelectionOf. The data selected from ex:Flickr also refer to (some of) the entities aggregated in ex:events. This is expressed by the path ex:descriptionsFromFlickr dn:samePopulation / dn:isPortionOf ex:events. Therefore, the data flow is a backtrace of the abstract process of the EventMedia system, from the ex:output data object towards the input data sources.
Fig. 2. The data flow of EventMedia. Input sources are the top nodes. The node at the bottom depicts the output data, which is a remodelling of the data collected from various sources according to a specific schema.
The data flow described so far can be leveraged by a reasoner, in conjunction with the ODRL policies of the inputs and the PPRs, to infer the policies associated with ex:output. Listing 3 shows the policies propagated from the inputs to the output of the EventMedia data flow, some of them deriving from the restrictions applied to Flickr data, shown previously in Listing 1.
Listing 3: Example of policy associated with the output of EventMedia.
In [7] we considered the set of relations defined by Datanode and the policies defined in the RDF Licenses Dataset to generate a knowledge base of 3865 propagation rules. With the goal of improving the management of the rules, we studied to what extent it is possible to reduce the number of rules to be stored. This reduction needs to be complemented by inferences produced by a reasoner, relying on the axioms of the Datanode ontology. In the present work, we study whether this reasoning is practically feasible, and hypothesize that compressing the rule base will not negatively impact the efficiency of the reasoner in computing the propagated policies.

(A)AAAA Methodology: overview and coherency check
First introduced in [7], the (A)AAAA methodology covers all the phases necessary to set up a compact knowledge base of PPRs 28 . The methodology is based on two assumptions: 1) Policy Propagation Rules are associations between policies and data flow steps, and 2) an ontology is available to organize data flow steps in a semantic hierarchy, e.g., for expressing the fact that the relation is a copy of is a sub-relation of is a derivation of 29 . The inferences that can be derived from the ontology allow us to remove rules from the knowledge base. The methodology makes it possible to measure the impact of applying the ontology, and supports its evolution with the purpose of maximizing the compression of the knowledge base of PPRs [7]. With respect to the methodology already presented, the novel contribution of this article is the introduction of a coherency check method in the Assessment phase.
In what follows we summarize the methodology, focusing on the coherency check element, and refer the reader to [7] for a general overview.
The methodology is composed of the following phases:

A1 -Acquisition.
The initial task is to set up a knowledge base of PPRs. We used the Datanode ontology to extract a list of 115 possible relations between data objects, and combined them with 113 policies derived from the ones defined in the RDF Licenses Dataset. The combination of relations and policies led to a matrix of 12995 cells. This phase required manual supervision of all associations between policies and relations in order to establish the initial set of propagation rules. This was performed with the support of the Contento tool [9].

A2 -Analysis.
The objective of the second phase is to detect common behaviours of relations with respect to policy propagation. We achieve this by applying FCA, providing as input the binary matrix representation of the knowledge base R consisting of PPRs. The output of the FCA algorithm is an ordered set of concepts C. In FCA terms, each concept groups a set of objects (the concept's extent) and maps it to a set of attributes (the concept's intent). In our case, each concept represents a set of relations propagating the same set of policies. These concepts are organized hierarchically in a lattice, ordered from the top concept T , which includes all the objects and potentially no attributes, to the bottom concept B, which includes all the attributes with a potentially empty extent (set of objects). For example, a first layer of concepts right below T would usually include large groups of objects having only a few attributes in common; layers below would have more attributes and fewer objects, until the bottom B is reached. In our case, the top concept T includes all relations and no policy, while the bottom concept B includes all the policies but no relation. The concepts identified by FCA group relations that have a common behaviour, as they propagate the same policies. The output of the process is an ordered lattice of concepts: clusters of policies that are propagated by the same set of relations.
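For small contexts, the formal concepts can be derived with a few lines of code. The sketch below (Python, naive algorithm; not the FCA implementation used in Contento) closes the set of object intents under intersection, which yields exactly the intents of the formal concepts; the relation and policy names are toy placeholders.

```python
# Naive FCA sketch: objects are Datanode relations, attributes are the
# policies they propagate. Illustrative only; a real rule base needs a
# proper FCA algorithm such as NextClosure.

def formal_concepts(context):
    """context: dict object -> set of attributes.
    Returns a list of (extent, intent) pairs of frozensets."""
    all_attrs = frozenset(a for attrs in context.values() for a in attrs)
    intents = {all_attrs}  # intent of the bottom concept
    changed = True
    while changed:  # close under intersection with every object intent
        changed = False
        for obj_attrs in [frozenset(v) for v in context.values()]:
            for known in list(intents):
                inter = known & obj_attrs
                if inter not in intents:
                    intents.add(inter)
                    changed = True
    return [(frozenset(o for o, attrs in context.items() if intent <= attrs),
             intent) for intent in intents]


# Toy context: two relations propagate {p1, p2}, one propagates only {p1}.
context = {"dn:hasCopy": {"p1", "p2"},
           "dn:hasSnapshot": {"p1", "p2"},
           "dn:isSelectionOf": {"p1"}}
concepts = formal_concepts(context)
```

On this toy context the algorithm finds two concepts: the relations propagating both policies, and the full set of relations propagating p1, mirroring the layered lattice described above.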

A3 -Abstraction.
In this phase, we apply a method for subtracting rules in order to reduce the size of the knowledge base. The abstraction process is based on applying an ontology that organizes the relations in a hierarchy (the Datanode ontology). For instance, the relation dn:hasCopy is a sub-relation of dn:hasDerivation. Intuitively, a number of policies propagated by dn:hasDerivation should also be propagated by dn:hasCopy and by all the other sub-relations in that branch of the hierarchy. By grouping all the relations below dn:hasDerivation in a transitive closure, we obtain a group of relations similar to the ones in the FCA concepts, which we call the dn:hasDerivation branch. We compute the branch of each of the relations in the ontology hierarchy. Since we expect branches of the ontology to be reflected in the clusters of relations obtained by FCA, we search for matches between the branches and the concepts of the lattice. When a match occurs, we subtract the rules that can be inferred from the PPR knowledge base.
A general estimation of the effectiveness of the approach is given by the compression factor (CF). We calculate the CF as the number of abstracted rules divided by the total number of rules:

CF = |A| / |R|

with R the set of rules, and A the set of rules that can be subtracted. Concrete examples of the application of this phase can be found in [7].
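Assuming branches and concept extents are available as plain sets of relation names (a simplifying assumption, with hypothetical helper names), CF and the branch/concept match measures used in this phase can be sketched as follows:

```python
# Sketch of the Abstraction-phase measures (illustrative helper names,
# not the authors' code).

def branch_metrics(branch, extent):
    """Precision/recall/F1 of an ontology branch against a concept extent.

    branch -- set of relations subsumed by some relation r (its branch)
    extent -- set of relations grouped by an FCA concept
    """
    overlap = len(branch & extent)
    pre = overlap / len(branch) if branch else 0.0
    rec = overlap / len(extent) if extent else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1

def compression_factor(rules, abstracted):
    """CF = |A| / |R|: the share of rules that can be subtracted."""
    return len(abstracted) / len(rules)
```

When the precision of a branch is 1, the branch is safe to abstract; the subtracted rules then enter A and raise the CF.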

A4 -Assessment.
The objective of this phase is to assess to what extent the ontology and the FCA lattice are coherent. In particular, we want to: 1. detect mismatches (coherency check) to be resolved before using the compressed rule base with a reasoner, and 2. identify quasi matches that could become full matches by performing changes in the rule base or the ontology.
Coherency check. The abstraction process is based on the assumption that it is possible to replace asserted rules with inferences implied by subsumed relations in the ontology. This requires that all policies propagated by a given relation are also propagated by all its sub-relations in the original (uncompressed) rule base. A coherency check process is necessary to identify whether this assumption holds for all the relations in each of the concepts of the lattice. In case it does not, we want to collect and report all the mismatches in order to be able to fix them at a later stage of the methodology. Listing 4 shows the algorithm used to detect such problems on a given concept in the lattice. We know from the definition of a FCA lattice that super-concepts include a larger set of relations propagating a smaller number of policies. Given a concept c, the algorithm extracts the relations (extent) of each super-concept (S denotes the set of all super-concepts s of c). When these relations are not present in (the extent of) c, they must not be sub-relations of any relation in the extent of c. If they are, a sub-relation is not inheriting all the policies of its parent, thus invalidating our assumption. Mismatches M are identified and reported. Listing 5 shows the results obtained by applying the algorithm to Concept 71. In this example, a number of sub-relations of dn:isVocabularyOf do not propagate some of the policies of Concept 71. Table 2 shows an example of the measures obtained.
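The check just described can be sketched compactly in Python (illustrative names and a hypothetical hierarchy fragment; this is a sketch of the idea, not a reproduction of Listing 4):

```python
# Sketch of the coherency check (illustrative names; the hierarchy
# fragment below is hypothetical).

def coherency_check(extent, super_extents, sub_relations_of):
    """Collect mismatches for a concept c.

    extent           -- set of relations in the extent of c
    super_extents    -- extents of the super-concepts of c
    sub_relations_of -- function: relation -> set of its sub-relations
    """
    mismatches = set()
    for s_extent in super_extents:
        for outside in s_extent - extent:
            for r in extent:
                # a relation outside c that is subsumed by a relation in c
                # fails to inherit all of its parent's policies
                if outside in sub_relations_of(r):
                    mismatches.add((r, outside))
    return mismatches


# Hypothetical fragment: dn:isSchemaOf is assumed to be a sub-relation of
# dn:isVocabularyOf, but only the parent is in the concept's extent.
subs = {"dn:isVocabularyOf": {"dn:isSchemaOf"}}
found = coherency_check({"dn:isVocabularyOf"},
                        [{"dn:isVocabularyOf", "dn:isSchemaOf"}],
                        lambda r: subs.get(r, set()))
```

Each reported pair (parent, sub-relation) marks a rule that cannot be safely abstracted away until the mismatch is repaired in the Adjustment phase.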
The measures defined in the Abstraction phase are now considered to quantify and qualify the way the ontology aligns with the propagation rules: precision (Pre) and recall (Rec) indicate how close a relation is to being a suitable abstraction for policy propagation. For example, Concept 67 matches two branches of the ontology hierarchy: hasPart and isPartOf. The Pre is 1 for the hasPart branch, meaning that all the relations subsumed by hasPart (hasSection, hasPortion, etc.) also propagate the policies in Concept 67. Conversely, the Pre with respect to isPartOf is 0.86, meaning that some of the relations in this branch apparently do not propagate the policies in Concept 67. Concept 36 covers the branches hasCopy and isCopyOf, meaning that the related policies are transferred between copies of a given data artifact, regardless of the direction of the relation (which specifies which of the two objects was the original). Some general considerations can be made by inspecting these measures. When Rec = 1, the whole extent of the concept is in the branch. The branch might also include other relations, which do not propagate the policies included in the concept. When Pre = 1, we can perform the subtraction of rules. A perfect match between a concept and a branch of the ontology would give F1 = 1. A low recall indicates that a high number of exceptions still need to be kept in the rule set. It also reflects a high ES, from which we can deduce a low number of policies in the concept. As a consequence, inspecting a partial match with high precision and low recall highlights a problem that might be easy to fix, as the number of relations and policies to compare will be low. For example, row 2 of Table 2 illustrates such a case.
At this stage we can make the following considerations: -The presence of mismatches between the lattice and the ontology will cause the reasoner to return wrong results. They must, therefore, be eliminated.
-The size of the matrix that was manually prepared in the Acquisition phase is large (13k cells), and even with the support of the Contento tool it is still possible that errors or misjudgments are made at that stage of the process.
-The Datanode ontology was not designed for the purpose of representing a common behavior of relations in terms of propagation of policies. It should be possible to refine the ontology in order to make it cover the current use case in a better way (and to further reduce the number of rules).
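To make the two measures concrete, the following is a minimal Python sketch of how Pre, Rec and F1 can be computed for a concept/branch pair; the function and variable names are ours, not part of the (A)AAAA implementation:

```python
def branch_match(concept_relations, branch_relations):
    """Compare a concept's extent with an ontology branch.

    concept_relations: relations whose rules propagate the concept's policies.
    branch_relations: relations subsumed by the branch's top property.
    Pre = fraction of the branch that propagates the concept's policies;
    Rec = fraction of the concept's extent contained in the branch.
    """
    concept, branch = set(concept_relations), set(branch_relations)
    overlap = concept & branch
    pre = len(overlap) / len(branch)
    rec = len(overlap) / len(concept)
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1

# A full match, where the branch and the extent coincide:
print(branch_match({"hasPart", "hasSection", "hasPortion"},
                   {"hasPart", "hasSection", "hasPortion"}))  # (1.0, 1.0, 1.0)
```

A quasi match with Pre = 1 and Rec < 1 is the situation where the subtraction of rules discussed above applies.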

A5 - Adjustment
In this phase we perform operations that change the ontology (or the PPR knowledge base) in order to repair mismatches, correct inaccuracies, refine the hierarchy of relations and, as a consequence, improve the compression factor. Six operations can be performed: Fill, Wedge, Merge, Group, Remove, Add. The Fill operation modifies the PPR knowledge base by adding all the rules necessary to make an ontology branch fully covered by a concept, therefore evolving a quasi match into a full match. All the other operations add, remove or reposition relations in the ontology hierarchy (further details about each operation can be found in [7]). The Assessment phase of the methodology reported possible mismatches between the FCA output and the ontology hierarchy. These errors must be repaired if we want the compressed rule base to be used by a reasoner. For example, Listing 5 shows the set of mismatches detected for Concept 71. In this list, the dn:isVocabularyOf branch contains a number of relations that do not propagate the related policies, breaking the assumption that all the policies of dn:isVocabularyOf are also propagated by all the other relations in its branch. With the Fill operation, we can add all the necessary rules to remove this mismatch.
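As an illustration, the Fill operation can be read as a set completion over (relation, policy) pairs. The sketch below is a simplified Python rendering of that reading, with hypothetical relation and policy identifiers:

```python
def fill(ppr_rules, branch_relations, concept_policies):
    """Fill: add every missing (relation, policy) pair so that all relations
    in the branch propagate all policies of the concept, turning a quasi
    match into a full match. ppr_rules is a set of (relation, policy) pairs."""
    needed = {(rel, pol) for rel in branch_relations for pol in concept_policies}
    added = needed - ppr_rules
    return ppr_rules | added, added

# Hypothetical example: dn:isSchemaOf does not yet propagate the duty.
rules = {("dn:isVocabularyOf", "duty:attribution")}
rules, added = fill(rules, {"dn:isVocabularyOf", "dn:isSchemaOf"},
                    {"duty:attribution"})
print(added)  # {('dn:isSchemaOf', 'duty:attribution')}
```

After the Fill, every relation in the branch propagates the concept's policies, so the mismatch disappears.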
After each operation, we run our process again from the Analysis phase to the Assessment phase, in order to evaluate whether the change fixed the mismatch and how much it affected the compression factor. The process is repeated until all mismatches have been fixed and no other quasi matches can be adjusted to become full matches. Moreover, when new policies are defined in the dataset of licenses, the process has to be repeated in order to insert the new propagation rules. However, this is only required after changes in the licenses: changes in the associations between policies and data objects, e.g., changing the license of a data source or adding new data flows, do not affect the PPRs.
As reported in Table 3, we performed the process 27 times with the objective of improving the compression and removing errors from the PPR knowledge base, identified by the coherency check algorithm. Figure 4 shows how the compression factor CF increases with the number of adjustments performed, while Figure 5 illustrates the progressive reduction of mismatches. Details about the changes performed are provided in Table 3 (identified by the symbol +), which also includes statistics about the number of mismatches (≠), the impact on the number of rules (R), the number of concepts generated by FCA (C), the number of rules abstracted (A), the remaining rules (R+), and the compression factor (CF). Moreover, Table 3 highlights the improvements obtained before (published in [7]) and the further compression obtained after the introduction of the coherency check method in the Assessment phase (after change 15). The first column identifies the change performed (starting from the initial state). Thanks to this methodology we have been able to fix many errors in the initial data and to refine Datanode, clarifying the semantics of many properties and adding new useful ones. The inclusion of a coherency check phase is required for a safe use of the compressed rule base with a reasoner; moreover, its introduction allowed us to reduce the size even further. As a final result we obtained 4225 rules in total, 34 concepts, 3451 rules abstracted and 774 rules remaining, boosting the CF up to 0.817.
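The final compression factor can be reproduced from these figures if CF is read as the fraction of rules abstracted away; this is a reading consistent with the reported numbers, not necessarily the exact definition used in the methodology:

```python
def compression_factor(total_rules, remaining_rules):
    """CF as the fraction of rules removed from the stored rule set."""
    return (total_rules - remaining_rules) / total_rules

# Final figures reported above: 4225 rules in total, 774 remaining,
# hence 3451 abstracted.
cf = compression_factor(4225, 774)
print(round(cf, 3))  # 0.817
```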
The version of the ontology prior to performing such changes can be found at http://purl.org/datanode/0.3/ns/ and the modified version can be found at http://purl.org/datanode/0.5/ns/. As previously mentioned, the Acquisition phase has been performed with the Contento tool [9,8]. The tools used in the other phases of the methodology, from Abstraction to Adjustment, can be found at https://github.com/enridaga/ppr-a-five.

Experiments
The methodology described in the previous section makes it possible to reduce the number of rules that need to be stored and managed. The results of applying this methodology to the PPR knowledge base derived from the RDF Licenses Dataset show how the compression factor can be dramatically increased after several iterations. Our assumption in this work is that this compression might also positively affect the performance of reasoning on policy propagation. Here, we therefore assess through realistic cases the performance of reasoners when dealing with a compressed knowledge base of PPRs, as compared to when dealing with the uncompressed set.
We took 15 data flow descriptions from previous work [6], referring to 5 applications that rely on data obtained from the Web. Each data flow represents a data manipulation process, consuming a data source (sometimes multiple sources) and returning an output data object. Given a set of policies P_i associated with the input data, the objective of a reasoner is to find the policies P_o associated with the output of the data flow. The experiments have the objective of comparing the performance of a reasoner when using an uncompressed or a compressed rule base, respectively. Therefore, each reasoning task is performed twice: the first time providing the full knowledge base of PPRs; the second time providing the compressed knowledge base in conjunction with the hierarchy of relations of the Datanode ontology (required to produce the inferences).
Reasoners infer logical consequences from a set of asserted facts and inference rules (the knowledge base). A reasoner can compute the possible inferences from the rules and the facts each time it is queried, thus exploring only the inferences required to provide a complete answer. Alternatively, a reasoner can compute all possible inferences at the time the knowledge base is loaded, and only explore the materialized facts at query time. In order to address both of these reasoning strategies, we run the experiments with two different reasoners. The first performs the inference at query time using a backward chaining approach; it is implemented as a Prolog program and we will refer to it as the Prolog reasoner. The second computes all the inferences at loading time (materialization); it is implemented as an RDFS reasoner in conjunction with SPIN rules, and we will refer to it as the SPIN reasoner. Both reasoners are implemented in Java within the PPR Reasoner project 30 . Both have the capability of executing PPRs and expanding the results according to the ontology hierarchy.
The Prolog implementation is a program relying on JLog, a Prolog interpreter written in Java 31 . The program incorporates a meta rule that traverses the set of PPRs, encoded as facts, while also supporting the subsumption between relations. Listing 6 shows an excerpt of the program. The SPIN reasoner is built upon the RDFS reasoner of Apache Jena 32 in combination with SPIN 33 , a rule engine that allows rules to be defined using SPARQL. The core part of the reasoner executes PPRs as a SPARQL meta query (Listing 7).
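The behaviour of the Prolog meta rule can be approximated in a few lines of Python: a relation propagates a policy if a PPR states so directly, or if one of its super-properties does. This is a sketch of the idea, assuming a single super-property per relation (the actual hierarchy is richer), with hypothetical identifiers:

```python
def propagates(policy, relation, ppr_facts, super_property):
    """Backward-chaining check of a single propagation.

    ppr_facts: set of (relation, policy) pairs (the compressed rule base).
    super_property: maps a relation to its direct super-property, if any.
    """
    seen = set()
    while relation is not None and relation not in seen:
        if (relation, policy) in ppr_facts:
            return True
        seen.add(relation)
        relation = super_property.get(relation)  # climb the hierarchy
    return False

rules = {("dn:hasPart", "duty:attribution")}
hierarchy = {"dn:hasSection": "dn:hasPart"}
# The compressed rule is stated on dn:hasPart, but covers dn:hasSection:
print(propagates("duty:attribution", "dn:hasSection", rules, hierarchy))  # True
```

This is exactly why the compressed rule base needs the Datanode hierarchy at reasoning time: rules stated on an abstract relation are recovered for its sub-properties by climbing the hierarchy.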
Listing 7: Construct meta-query of the SPIN reasoner.
The data flows used in the experiments are listed in Table 4. Each data flow describes a process executed within one of the 5 systems selected as exemplary data-oriented applications. These data flows were formalized before the present work (in [6]), and were reused for the experiments without changes; however, information about the policies of the input was added. Table 4 illustrates the properties of these data flows and compares them along several dimensions. The has policy column reports the number of statements about policies, from a minimum of 5 to a maximum of 37. The size of the data flow is reported in the has relation column of the table, as it is measured in the number of Datanode relations used, spanning from 2 to a maximum of 25. The relations column reports the number of distinct relations, and the same applies to data objects, policies, sources and the propagated output policies. Highlighted are the maximum and minimum values for each of the dimensions. In one case (DISCOU-11), none of the policies attached to the source are propagated to the output.
31 http://jlogic.sourceforge.net/
32 http://jena.apache.org/
33 http://spinrdf.org/
Each experiment takes the following arguments:
- Input: a data flow description
- Compression: True/False
- Output: the output resource to be queried for policies
When compression is False, we provide the complete knowledge base of PPRs as input to the reasoning process, without including information on subsumption between the relations described in the data flow. Conversely, when compression is set to True, the compressed PPR knowledge base is used in conjunction with the Datanode ontology. It is worth noting that the (A)AAAA methodology is also an ontology evolution method, as most of the operations targeted at improving the compression of the rule base are performed on the ontology by adding, removing and replacing relations in the hierarchy. In these experiments, we consider the evolved rule base (and ontology), which has been harmonized by fixing mismatches between the rule set and the ontology. The experiments were executed on a MacBook Pro with an Intel Core i7/3 GHz Dual Core processor and 16 GB of RAM. If a process did not complete within five minutes, it was interrupted. Each process was monitored, and CPU usage and RAM (RSS memory) were recorded at intervals of half a second. On termination, the experiment output includes: total time (t), resources load time (l), setup time (s), and query time (q). The size of the input for each experiment is reported in the diagrams in Figure 6.
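The per-run timings can be collected with a simple harness such as the following sketch, where the three phase callables are placeholders standing in for the actual reasoner interface:

```python
import time

def run_experiment(load, setup, query):
    """Return load (l), setup (s, which includes l), query (q) and total
    (t = s + q) time for one execution; load/setup/query are callables
    standing in for the corresponding reasoner phases."""
    t0 = time.perf_counter()
    load()
    l = time.perf_counter() - t0
    setup()
    s = time.perf_counter() - t0   # setup time includes load time
    t1 = time.perf_counter()
    query()
    q = time.perf_counter() - t1
    return {"l": l, "s": s, "q": q, "t": s + q}
```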
We consider performance on two main dimensions: time and space.
Time performance is measured under the following dimensions:
- L: Resources load time.
- S: Setup time. It includes L, in addition to any other operation performed before the reasoner is ready to receive queries (e.g., materialization).
- Q: Query time.
- T: Total duration: T = S + Q.
Space is measured as follows:
- Pa: Average CPU usage.
- M: Maximum memory required by the process.
Each experiment was executed 20 times. We compared the results of the experiments with and without compression, and verified that they included the same policies. In the present report, we show the average of the measures obtained in the different executions. In order to evaluate the accuracy of the averages computed from the twenty executions of the same experiment, we calculated the related Coefficient of Variation (CV) 34 . CV is a measure of spread that indicates the amount of variability relative to the mean: a high CV indicates a large difference in the observed measures, thus reducing the significance of the computed mean. Diagrams 7a and 7b display the CV of all the measures for the Prolog and SPIN reasoner, respectively. In almost all cases the CV for the Prolog reasoner was below 0.1, with the exception of memory usage M, which in many cases fluctuated between 0.2 and 0.4. Experiments with the SPIN reasoner reported a much more stable behaviour in terms of consumed resources, the CV being below 0.1 in almost all cases, except the query time of some experiments (the peak is on DBREC-4). However, Q with the SPIN reasoner fluctuated around an average of 10 ms, making the observed variation irrelevant. We therefore consider the computed mean of the observed measures in these experiments to be significant.
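The stability check described above amounts to computing, for each measure, the ratio between the standard deviation and the mean of the twenty runs:

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    """CV (a.k.a. RSD): spread of repeated measurements relative to their
    mean; values below 0.1 indicate the averaged measure is stable."""
    return stdev(samples) / mean(samples)

# Illustrative query times (ms) from five hypothetical runs:
print(coefficient_of_variation([9.8, 10.0, 10.2, 10.1, 9.9]))
```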
Before discussing the results, it is worth reminding the reader that this evaluation is not aimed at comparing the two implementations of a PPR reasoner, but at observing the impact of our compression strategy on the approaches of the Prolog and SPIN implementations, assuming that any other implementation is likely to use a combination of the two reasoning strategies they respectively implement. Figures 8 and 9 illustrate the results of the experiments performed with the Prolog and the SPIN reasoner, respectively. For each data flow, the bar on the left displays the time with an uncompressed input, and the one on the right the time with a compressed input; we will follow this convention in the other diagrams as well. Figure 8c displays a comparison of the total time between an uncompressed and compressed input with the Prolog reasoner. In all cases, there has been a significant increase in performance with the compressed rule base: in three cases (DBREC-5, DISCOU-1, REXPLORE-4) the uncompressed version of the experiment could not complete within the five minutes, while the compressed version returned results in less than a minute. The total time of the experiments with the SPIN reasoner (Figure 9c) is much smaller (fractions of a second), with a maximum total time of approximately 2 seconds (EventMedia-1). However, in this case too, we report a performance increase for all the data flows, with some cases performing much better than others (DBREC-3, DBREC-4). The total time T of the experiment can be broken up into setup time S (including load time L) and query time Q. This breakdown is depicted in Figures 8a and 9a, and in both cases the impact of the rule reduction process is evident. An interesting difference between the two implementations can be seen by comparing Figures 8b and 9b.
34 Coefficient of Variation, also known as Relative Standard Deviation (RSD). https://en.wikipedia.org/wiki/Coefficient_of_variation
The cost of query time in the Prolog reasoner is very large compared to the related setup time S. The SPIN reasoner, conversely, showed a larger setup time S with a very low cost for query time Q. The reason is that the latter materializes all the inferences at setup time, before query execution. This accounts for the lack of difference in query time between the uncompressed and compressed versions of the experiments with the SPIN reasoner.
We did not observe changes in Pa for the Prolog reasoner (Figure 8d), while the differences in memory consumption M are significant (Figure 8e), demonstrating a performance improvement caused by the compressed input. A decrease in space consumption was also observed with the SPIN reasoner (Figures 9d and 9e), albeit smaller, and negative in only two cases with regard to memory consumption M (DBREC-1 and DBREC-6).
A summary of the impact of the compression on the different measures is depicted in Figures 10 and 11. The first bar on the left of both diagrams illustrates the reduction of the size of the input, while the others show how much each measure is reduced. A substantial improvement has been achieved in the case of the Prolog reasoner, which implements a backward chaining algorithm executed at query time. A PPR reasoner could also be implemented to perform inferencing at loading time

(materialization). The experiments with the SPIN implementation are therefore used to show that the effect on reasoning performance exists in both cases, even if in different ways depending on the approach to inferencing. The main conclusion from our experiments is therefore that the methodology presented in [7] and extended with the coherency check leads to a compressed PPR knowledge base that is not only more manageable for the knowledge engineers maintaining it, but also improves our ability to apply reasoning for the purpose of policy propagation. In addition, it appears clearly that, when dealing with a compressed PPR knowledge base, an approach based on materialization of inferences at load time is preferable to one based on computing the inferences at query time.
[Figure caption: (a) Prolog reasoner: input size computed as number of Prolog facts with the original (dark orange) and compressed (light yellow) input for each data flow.]

Conclusions
In this article, we presented an approach for reasoning on the propagation of policies in a data flow. This method is grounded on a rule base of Policy Propagation Rules (PPRs). Such rules can easily grow in number, depending on the number of possible policies and of possible operations performed in a data flow. The (A)AAAA methodology can be used to reduce this size significantly, as demonstrated in [7], by relying on the inference properties of the Datanode ontology, applied to describe the possible relations between data objects. We presented an evolved version of the methodology, required to ensure that the inferred policies are correct when using the compressed rule base. However, while this activity reduces the size of the input of the reasoner, it requires more inferences to be computed. Therefore, we performed experiments to assess the impact of the compression on reasoning performance. The present article provides two major contributions: an extension of the (A)AAAA methodology that includes a coherency check algorithm, and experimental results demonstrating that a compressed knowledge base makes reasoning on policy propagation more efficient.
This is a preliminary step towards studying compression in knowledge management, and its impact on reasoning, from a more general point of view. Reasoning on policy propagation requires a formalisation of the data flow, and producing such a representation can be time consuming. Recent work by the authors investigates how to support users in the formalisation of data flows derived from scientific workflows [11]. It would be of interest to explore methods for supporting and automating the generation of such data flows from other pre-existing artefacts (e.g., code bases and their documentation). Future work includes defining new measures to describe the complexity of a data flow and how it affects reasoning on policy propagation, as well as studying the validation of data flows with respect to policies, particularly when multiple sources are used. Finally, we are currently setting up an experimental evaluation (including a user study) to assess the quality of the knowledge base of PPRs produced with this approach, the correctness of the reasoning results with respect to users' expectations, and the effectiveness of the associated methodology in the environment of the MK:Smart Data Hub [12].