Abstract
Schema mappings have been extensively studied in the context of data exchange and data integration, where they have turned out to be the right level of abstraction for formalizing data inter-operability tasks. Up to now and for the most part, schema mappings have been studied as static objects, in the sense that each time the focus has been on a single schema mapping of interest or, in the case of composition, on a pair of schema mappings of interest. In this paper, we adopt a dynamic viewpoint and embark on a study of sequences of schema mappings and of the limiting behavior of such sequences. To this effect, we first introduce a natural notion of distance on sets of finite target instances that expresses how “close” two sets of target instances are as regards the certain answers of conjunctive queries on these sets. Using this notion of distance, we investigate pointwise limits and uniform limits of sequences of schema mappings, as well as the companion notions of pointwise Cauchy and uniformly Cauchy sequences of schema mappings. We obtain a number of results about the limits of sequences of GAV schema mappings and the limits of sequences of LAV schema mappings that reveal striking differences between these two classes of schema mappings. We also consider the completion of the metric space of sets of target instances and obtain concrete representations of limits of sequences of schema mappings in terms of generalized schema mappings, that is, schema mappings with infinite target instances as solutions to (finite) source instances.
Keywords: Schema mappings, Limits, Pointwise convergence, Uniform convergence
Introduction
Schema mappings have been extensively studied in the context of data exchange and data integration, where they have turned out to be the right level of abstraction for formalizing data inter-operability tasks (see the surveys [11, 12] and the monograph [1]). Up to now and for the most part, schema mappings have been studied as static objects, in the sense that each time the focus has been on a single schema mapping or on a finite and, typically, small number of schema mappings. In the case of data exchange [6], a single schema mapping is used to specify the relationship between a source schema and a target schema. In the case of operators on schema mappings [3], such as the composition operator [8, 14], a fixed number of schema mappings is used as input (e.g., two schema mappings in the case of composition) and another schema mapping is returned as output. Even the case of schema-mapping evolution [9] entails a finite (but potentially large) number of schema mappings.
In this paper, we adopt a dynamic viewpoint and embark on a systematic investigation of sequences of schema mappings and of the limiting behavior of such sequences. The original motivation came from the earlier work [2, 5, 7, 10, 14] on schema-mapping optimization and the study of various notions of equivalence between schema mappings that, intuitively, stipulate that two schema mappings cannot be distinguished using conjunctive queries (CQ-equivalence) or conjunctive queries with at most n variables ($\mathrm{CQ}_n$-equivalence), for some fixed n ≥ 1. In particular, in [5] and, implicitly, in [14], it was shown that, given an SO tgd (second-order tuple-generating dependency) σ and a positive integer n, one can construct a GLAV schema mapping that is $\mathrm{CQ}_n$-equivalent to σ. Informally, this means that a given SO tgd can be “approximated” by GLAV schema mappings up to any fixed level of precision, even though an SO tgd is a formula of second-order logic that may not be logically equivalent to any formula of first-order logic and, in particular, to any GLAV schema mapping. A more dynamic interpretation is that, given an SO tgd σ, one can obtain a sequence $(\mathcal{M}_n)_{n \geq 1}$ of GLAV schema mappings whose “limit” is σ.
Summary of Results
Our contributions are both conceptual and technical. At the conceptual level, we develop a framework for studying sequences of schema mappings by first introducing a natural notion of distance on the powerset $\mathcal{P}(\mathrm{Inst}(T))$ of the set Inst(T) of finite instances over a schema T. Intuitively, this notion of distance expresses how “close” two sets of finite T-instances are as regards the certain answers of conjunctive queries on these sets. The pair $(\mathcal{P}(\mathrm{Inst}(T)), \mathit{dist})$ is a pseudometric space, which means that the distance function dist(⋅,⋅) is symmetric and obeys the triangle inequality, but different sets of finite target instances may have distance zero; however, two such sets have distance zero if and only if they are CQ-equivalent, i.e., every conjunctive query has the same certain answers on these two sets. Thus, we will also work with the metric space obtained by considering the CQ-equivalence classes of members of $\mathcal{P}(\mathrm{Inst}(T))$, and will use the same notation for it.
Sequences of functions from some set to a metric space occupy a central place in the study of metric spaces (see, e.g., [18]). In particular, there are natural notions of a pointwise limit and of a uniform limit of a sequence $(f_n)_{n \geq 1}$ of functions from some set to a metric space; moreover, there are companion notions of a pointwise Cauchy and of a uniformly Cauchy sequence of such functions. We now describe briefly how these notions can be applied to sequences of schema mappings. In its most general formulation, a schema mapping $\mathcal{M}$ over a source schema S and a target schema T is a set of pairs (I, J), where I is a finite S-instance and J is a finite T-instance. It follows that a schema mapping can also be viewed as a function f from the set Inst(S) of all finite S-instances to the powerset $\mathcal{P}(\mathrm{Inst}(T))$ of the set of all finite T-instances, where $f(I) = \{J : (I, J) \in \mathcal{M}\}$. This way, a sequence of schema mappings over a source schema S and a target schema T can be viewed as a sequence of functions from Inst(S) to the (pseudo)metric space $(\mathcal{P}(\mathrm{Inst}(T)), \mathit{dist})$.
After the conceptual framework has been laid out, we study in depth the limiting behavior of sequences of GAV mappings and the convergence of sequences of LAV mappings. We establish a number of technical results that reveal rather dramatic and perhaps unanticipated differences between GAV schema mappings and LAV schema mappings.
For sequences of GAV mappings, we point out that every uniformly Cauchy sequence of GAV mappings is eventually constant, hence it has a GAV mapping as uniform limit. We also show that every pointwise Cauchy sequence of GAV mappings has a pointwise limit, but it need not have a uniform limit; moreover, there are pointwise Cauchy sequences of GAV mappings such that no GAV mapping is their pointwise limit. This raises the question as to when a sequence of GAV mappings has a GAV mapping as a pointwise limit. We prove that a sequence of GAV mappings has a GAV mapping as a pointwise limit if and only if it has a pointwise limit that allows for CQ-rewriting.
For sequences of LAV mappings, we show that the notions of uniform limit and pointwise limit coincide; moreover, the same holds true for the notions of uniformly Cauchy and pointwise Cauchy sequences. However, there are uniformly Cauchy sequences of LAV mappings that have no uniform limit. We also establish that a uniformly Cauchy sequence of LAV mappings has a LAV mapping as a uniform limit if and only if it has a uniform limit that admits universal solutions. The aforementioned results lift to premise-bounded sequences of GLAV mappings, i.e., sequences of GLAV mappings for which there is a k ≥ 1 such that, for every mapping in the sequence, the left-hand side of every GLAV constraint has at most k source atoms (LAV mappings have k = 1).
In terms of techniques, we use systematically the structural characterizations of schema-mapping languages established in [19], thus creating a link with a different line of research.
The metric space $(\mathcal{P}(\mathrm{Inst}(T)), \mathit{dist})$ is incomplete, i.e., there are Cauchy sequences of elements of $\mathcal{P}(\mathrm{Inst}(T))$ that have no limit in $\mathcal{P}(\mathrm{Inst}(T))$. It is well known that every incomplete metric space (X, d) has a completion, which means that it can be embedded into a complete metric space (X∗, d∗) so that X is a dense subset of X∗. Moreover, pointwise (respectively, uniformly) Cauchy sequences of functions on X have pointwise (respectively, uniform) limits that take values in X∗. The construction of X∗ from X involves equivalence classes of Cauchy sequences of elements of X; thus, in general, the members of X∗ do not have a concrete representation. In the last part of the paper, we show that the members of the completion $\mathcal{P}(\mathrm{Inst}(T))^*$ can be represented by suitably constructed infinite T-instances. As a consequence of this, the pointwise (respectively, uniform) limits of Cauchy sequences of schema mappings can be represented by generalized schema mappings, i.e., schema mappings that allow for infinite target instances as solutions to finite source instances.
Preliminaries
This section contains a minimum amount of necessary background material.
Schemas, Instances, and Conjunctive Queries
A schema R is a finite sequence $\langle R_1, \ldots, R_k \rangle$ of relation symbols, where each $R_i$ has a fixed arity. An instance I over R, or an R-instance, is a sequence $(R_1^I, \ldots, R_k^I)$, where each $R_i^I$ is a finite relation of the same arity as $R_i$. We will often use $R_i$ to denote both the relation symbol and the relation that interprets it. The active domain adom(I) of an instance I is the set of all values occurring in the relations of I. A fact of an instance I (over R) is an expression $R_i^I(t_1, \ldots, t_m)$ (or simply $R_i(t_1, \ldots, t_m)$), where $R_i$ is a relation symbol of R and $(t_1, \ldots, t_m) \in R_i^I$.
A conjunctive query is a first-order formula of the form ∃z 𝜃(x, z), where 𝜃(x, z) is a conjunction of atomic formulas $R_i(v_1, \ldots, v_m)$ and each $v_j$ is one of the variables in x and z. A boolean conjunctive query is a conjunctive query with no free variables. We write CQ for the class of all conjunctive queries over some schema. For every n ≥ 1, we let $\mathrm{CQ}_n$ denote the class of all conjunctive queries with at most n variables. We also let $\mathrm{CQ}_0$ denote the singleton consisting of a trivially true query. If I is an instance and q is a conjunctive query, then we write q(I) for the result of evaluating q on I; in particular, for boolean conjunctive queries q we have that q(I) = true if and only if I satisfies q.
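The evaluation q(I) can be illustrated by a small brute-force sketch. The encoding below is an assumption made only for this illustration: an instance is a dictionary from relation names to sets of tuples, a query is a tuple of free variables together with a list of atoms, and the function name `evaluate_cq` is hypothetical. A boolean query returns `{()}` for true and the empty set for false.

```python
from itertools import product

def evaluate_cq(free_vars, atoms, instance):
    """Naively evaluate a conjunctive query on a finite instance.

    free_vars: tuple of variable names (empty for a boolean query)
    atoms:     list of (relation_name, tuple_of_variable_names)
    instance:  dict mapping relation names to sets of tuples of values
    """
    variables = sorted({v for _, args in atoms for v in args})
    # Active domain: all values occurring in some fact of the instance.
    adom = sorted({val for facts in instance.values() for t in facts for val in t},
                  key=repr)
    answers = set()
    # Try every assignment of active-domain values to the query variables.
    for values in product(adom, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(tuple(assignment[v] for v in args) in instance.get(rel, set())
               for rel, args in atoms):
            answers.add(tuple(assignment[v] for v in free_vars))
    return answers

# q(x, y) = ∃z (E(x, z) ∧ E(z, y)) on a two-edge path.
path = {"E": {(1, 2), (2, 3)}}
two_step = evaluate_cq(("x", "y"), [("E", ("x", "z")), ("E", ("z", "y"))], path)
# Boolean query ∃x E(x, x): is there a self-loop?
self_loop = evaluate_cq((), [("E", ("x", "x"))], path)
```

The enumeration is exponential in the number of variables, which is acceptable for the small queries in $\mathrm{CQ}_n$ considered in examples but is not meant as an efficient evaluation strategy.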
Schema Mappings, Universal Solutions, Certain Answers
Motivated by the terminology in data exchange [6], we typically work with two schemas, a source schema S and a target schema T with no relation symbols in common. We refer to S-instances as source instances, and to T-instances as target instances. We assume that the values occurring in the active domains of instances come from two fixed countably infinite disjoint sets, the set Const of all constants and the set Null of (labeled) nulls. We also assume that the active domains of source instances consist entirely of constants; the active domains of target instances may contain both constants and nulls.
In its most general form, a schema mapping between a source schema S and a target schema T is a set $\mathcal{M}$ of pairs (I, J), where I is a source instance and J a target instance. To avoid anomalies that arise from such a relaxed notion, we will assume that a schema mapping must also possess a mild closure property, namely, that $\mathcal{M}$ is closed under isomorphisms that rename nulls by other nulls. This is a natural “genericity” condition that is akin to the condition that database queries are closed under arbitrary isomorphisms. The precise definitions are as follows.
Definition 1
Let S be a source schema and T a target schema.
- An isomorphism that renames nulls between two target instances J and J′ is a one-to-one and onto function h : adom(J) → adom(J′) such that:
  (i) If c is a constant in adom(J), then h(c) = c.
  (ii) If w is a null in adom(J), then h(w) is also a null.
  (iii) For every relation symbol R of T of arity m and for every tuple $(a_1, \ldots, a_m)$ of constants and nulls, we have that $R^J(a_1, \ldots, a_m)$ is a fact of J if and only if $R^{J'}(h(a_1), \ldots, h(a_m))$ is a fact of J′.
  In this case, we write J′ = h(J) and say that J′ is an isomorphic copy of J via an isomorphism that renames nulls.
- A schema mapping between S and T is a set $\mathcal{M}$ of pairs (I, J), where I is a source instance and J a target instance, such that the following holds: if a pair (I, J) is in $\mathcal{M}$ and if J′ = h(J) is an isomorphic copy of J via an isomorphism h that renames nulls, then also (I, J′) is in $\mathcal{M}$.
A schema mapping is often (but not always) given as a triple $\mathcal{M} = (S, T, \Sigma)$, where Σ is a set of formulas in some logical formalism such that $(I, J) \in \mathcal{M}$ if and only if I ∪ J ⊧ Σ. Clearly, if Σ is a set of first-order formulas or a set of second-order formulas, then $\mathcal{M}$ is indeed closed under isomorphisms that rename nulls.
Let $\mathcal{M}$ be a fixed schema mapping. In data exchange, the main problem is, given a source instance I, to find a solution for I w.r.t. $\mathcal{M}$, that is, a target instance J such that $(I, J) \in \mathcal{M}$ (or determine that no solution exists). We use the notation $\mathrm{Sol}(I, \mathcal{M})$ to denote the set of all solutions for I w.r.t. $\mathcal{M}$. In data integration, the main problem is to compute the certain answers of queries [12]. Specifically, given a query q over the target schema and a source instance I, the certain answers of q on I w.r.t. $\mathcal{M}$ is the set
$$\mathrm{cert}(q, I, \mathcal{M}) = \bigcap \{\, q(J) : J \in \mathrm{Sol}(I, \mathcal{M}) \,\}.$$
If q is a boolean conjunctive query, then $\mathrm{cert}(q, I, \mathcal{M}) = \mathit{true}$ if q(J) = true for every solution J for I w.r.t. $\mathcal{M}$; otherwise, $\mathrm{cert}(q, I, \mathcal{M}) = \mathit{false}$. Note also that if q is a non-boolean conjunctive query, then either $\mathrm{cert}(q, I, \mathcal{M}) = \emptyset$ or every tuple $t \in \mathrm{cert}(q, I, \mathcal{M})$ is null-free, that is, it consists entirely of constants. This is a consequence of the closure of $\mathcal{M}$ under isomorphisms that rename nulls. Indeed, assume that $t \in \mathrm{cert}(q, I, \mathcal{M})$. Let J be a solution for I w.r.t. $\mathcal{M}$ and let J′ = h(J) be a target instance that is an isomorphic copy of J via an isomorphism h that renames nulls from the active domain of J to nulls outside the active domain of J (such a target instance J′ and such an isomorphism h exist because J is a finite set of facts, hence its active domain is a finite set). By the closure property of $\mathcal{M}$, the target instance J′ is also a solution for I w.r.t. $\mathcal{M}$, hence t ∈ q(J) and t ∈ q(J′). It follows that t must consist of values in the intersection adom(J) ∩ adom(J′) of the active domains of J and J′, hence t must consist entirely of constants. Note that the only property of conjunctive queries used in this argument is that they are safe, that is, they return tuples from the active domain of the instance on which they are evaluated.
On the face of it, the definition of certain answers may entail computing an intersection of infinitely many sets. One of the main findings in [6] is that there is a notion of a “good” solution in data exchange, called universal solution, that can also be used to compute the certain answers of conjunctive queries in a much more direct way.
Let $J_1$ and $J_2$ be two target instances. A function h is a homomorphism from $J_1$ to $J_2$ if the following hold: (i) for every constant c, we have that h(c) = c; and (ii) for every relation symbol R in the schema and every tuple $(a_1, \ldots, a_m) \in R^{J_1}$, we have that $(h(a_1), \ldots, h(a_m)) \in R^{J_2}$. We write $J_1 \rightarrow J_2$ to denote that there is a homomorphism from $J_1$ to $J_2$. We say that $J_1$ is homomorphically equivalent to $J_2$, written $J_1 \leftrightarrow J_2$, if $J_1 \rightarrow J_2$ and $J_2 \rightarrow J_1$.
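As an illustration of the definition, a homomorphism between two small target instances can be found by backtracking over the facts of the first instance. This is only a sketch under stated assumptions: nulls are encoded as strings starting with an underscore, every other value is a constant, and the names `is_null`, `find_homomorphism`, and `null_cycle` are hypothetical helpers introduced here.

```python
def is_null(value):
    """Assumed encoding: labeled nulls are strings that start with '_'."""
    return isinstance(value, str) and value.startswith("_")

def find_homomorphism(j1, j2):
    """Return a homomorphism from j1 to j2 as a dict on nulls, or None.

    Constants are implicitly mapped to themselves, as condition (i) requires.
    """
    facts = [(rel, t) for rel, ts in j1.items() for t in ts]

    def extend(i, h):
        if i == len(facts):
            return h
        rel, t = facts[i]
        # Try to send the i-th fact of j1 to each fact of j2 over the same relation.
        for image in j2.get(rel, set()):
            candidate = dict(h)
            if all(
                (a == b) if not is_null(a) else (candidate.setdefault(a, b) == b)
                for a, b in zip(t, image)
            ):
                found = extend(i + 1, candidate)
                if found is not None:
                    return found
        return None

    return extend(0, {})

def null_cycle(m):
    """Undirected cycle of length m whose vertices are pairwise distinct nulls."""
    nodes = [f"_n{i}" for i in range(m)]
    edges = set()
    for i in range(m):
        a, b = nodes[i], nodes[(i + 1) % m]
        edges |= {(a, b), (b, a)}
    return {"E": edges}
```

Since the empty mapping is a legitimate homomorphism (for an instance without nulls), callers should compare the result with `None` rather than test its truth value. Homomorphic equivalence amounts to two successful calls, one in each direction.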
Let I be a source instance. A universal solution for I w.r.t. $\mathcal{M}$ is a solution J such that for every solution J′ for I w.r.t. $\mathcal{M}$, we have that J → J′. Intuitively, a universal solution for I is a “most general” solution for I. We write $\mathrm{USol}(I, \mathcal{M})$ to denote the set of all universal solutions for I w.r.t. $\mathcal{M}$ (note that universal solutions need not always exist, so it is possible that $\mathrm{USol}(I, \mathcal{M}) = \emptyset$). The following useful property of universal solutions was first identified in [6].
Proposition 1
Assume that $\mathcal{M}$ is a schema mapping, I is a source instance, and J is a universal solution for I w.r.t. $\mathcal{M}$. If q is a conjunctive query, then $\mathrm{cert}(q, I, \mathcal{M}) = q(J){\downarrow}$, where q(J)↓ is the set of all null-free tuples in q(J).
Proof 1
First, assume that $t \in \mathrm{cert}(q, I, \mathcal{M})$. Then, as discussed earlier, t must be a null-free tuple. Since J is a solution for I w.r.t. $\mathcal{M}$, we have that t ∈ q(J), hence we have that t ∈ q(J)↓. Next, assume that t is a null-free tuple in q(J). If J′ is an arbitrary solution for I w.r.t. $\mathcal{M}$, then, since J is a universal solution for I w.r.t. $\mathcal{M}$, there is a homomorphism h from J to J′. Since conjunctive queries are preserved under homomorphisms and h is the identity on constants, it follows that h(t) = t ∈ q(J′). Thus, $t \in \mathrm{cert}(q, I, \mathcal{M})$. □
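Proposition 1 suggests a direct way to compute certain answers once a universal solution is at hand: evaluate the query on it and discard every tuple that contains a null. The following sketch assumes the same illustrative encoding as before (instances as dictionaries of fact sets, nulls as strings starting with an underscore); the function names are hypothetical.

```python
from itertools import product

def is_null(value):
    """Assumed encoding: labeled nulls are strings that start with '_'."""
    return isinstance(value, str) and value.startswith("_")

def evaluate_cq(free_vars, atoms, instance):
    """Brute-force evaluation of a conjunctive query on a finite instance."""
    variables = sorted({v for _, args in atoms for v in args})
    adom = sorted({x for facts in instance.values() for t in facts for x in t},
                  key=repr)
    answers = set()
    for values in product(adom, repeat=len(variables)):
        a = dict(zip(variables, values))
        if all(tuple(a[v] for v in args) in instance.get(rel, set())
               for rel, args in atoms):
            answers.add(tuple(a[v] for v in free_vars))
    return answers

def null_free_answers(free_vars, atoms, universal_solution):
    """Compute q(J)↓: the null-free tuples of q evaluated on J."""
    return {t for t in evaluate_cq(free_vars, atoms, universal_solution)
            if not any(is_null(x) for x in t)}

# A universal solution containing one null between the constants 1 and 2.
J = {"F": {(1, "_N1"), ("_N1", 2)}}
reach = null_free_answers(("x", "y"), [("F", ("x", "z")), ("F", ("z", "y"))], J)
direct = null_free_answers(("x", "z"), [("F", ("x", "z"))], J)
```

In this example the two-step query returns the pair (1, 2), whereas every answer to the one-step query mentions the null and is therefore dropped.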
Structural Properties of Schema Mappings
We now present a number of structural properties that a schema mapping may or may not possess. These properties were investigated in their own right in [19], where they were used to obtain characterizations of schema-mapping languages that will be of great interest to us in this paper.
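Let $\mathcal{M}$ be a schema mapping.
- $\mathcal{M}$ allows for CQ-rewriting if for every target conjunctive query q, there exists a union q′ of source conjunctive queries such that $\mathrm{cert}(q, I, \mathcal{M}) = q'(I)$, for every source instance I.
- $\mathcal{M}$ admits universal solutions if for every source instance I, there is a universal solution for I w.r.t. $\mathcal{M}$.
- $\mathcal{M}$ is closed under target homomorphisms if $(I, J) \in \mathcal{M}$ and J → J′ imply that $(I, J') \in \mathcal{M}$.
- $\mathcal{M}$ is closed under unions if $(I, J) \in \mathcal{M}$ and $(I', J') \in \mathcal{M}$ imply that $(I \cup I', J \cup J') \in \mathcal{M}$.
- $\mathcal{M}$ is closed under target intersections if $(I, J) \in \mathcal{M}$ and $(I, J') \in \mathcal{M}$ imply that $(I, J \cap J') \in \mathcal{M}$.
- $\mathcal{M}$ is n-modular if whenever $(I, J) \notin \mathcal{M}$, there is a subinstance I′ ⊆ I with at most n elements in its active domain such that $(I', J) \notin \mathcal{M}$ (“small counterexample”).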
Schema Mapping Languages
A GLAV (global-and-local-as-view) constraint is a first-order formula of the form ∀x(φ(x) →∃y ψ(x, y)), where φ(x) is a conjunction of atoms over the source schema S, each variable in x occurs in at least one atom in φ(x), and ψ(x, y) is a conjunction of atoms over the target schema T with variables in x and y. We refer to φ(x) as the left-hand side, or premise, and ∃y ψ(x, y) as the right-hand side, or conclusion of the constraint. Another name for GLAV constraints is source-to-target tuple-generating dependencies or, in short, s-t tgds.
A LAV (local-as-view) constraint is a GLAV constraint whose left-hand side is a single atom over the source, while a GAV (global-as-view) constraint is a GLAV constraint whose right-hand side contains no existential quantifiers and consists of a single atom over the target. For example, ∀x, y(E(x, y) →∃z(F(x, z) ∧ F(z, y))) is a LAV constraint, and ∀x, y, z(E(x, z) ∧ E(z, y) → F(x, y)) is a GAV constraint.
A GLAV (global-and-local-as-view) mapping is a schema mapping $\mathcal{M} = (S, T, \Sigma)$ such that Σ is a finite set of GLAV constraints. The notions of a LAV mapping and of a GAV mapping are defined analogously.
Every GLAV mapping admits universal solutions [6]; furthermore, given a source instance I, a canonical universal solution for I can be produced via the oblivious chase procedure as follows: whenever the antecedent of an s-t tgd in Σ becomes true, fresh null values are introduced and facts involving these nulls are added to the target instance, so that the conclusion of the s-t tgd becomes true. Every GLAV mapping is also known to allow for CQ-rewriting and to be n-modular, for some n ≥ 1. Moreover, every LAV mapping is closed under unions, while every GAV mapping is closed under target intersections.
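The oblivious chase described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: a GLAV constraint is encoded as a pair (premise atoms, conclusion atoms), variables that occur only in the conclusion are treated as existentially quantified, and fresh nulls are generated as strings `_N1`, `_N2`, and so on. All of these encoding choices, and the name `oblivious_chase`, are assumptions made for the example.

```python
from itertools import count, product

def oblivious_chase(source, constraints):
    """Build a canonical target instance for a finite set of GLAV constraints.

    source:      dict mapping source relation names to sets of tuples of constants
    constraints: list of (premise_atoms, conclusion_atoms), where each atom is
                 (relation_name, tuple_of_variable_names)
    """
    fresh = count(1)
    target = {}
    adom = sorted({x for facts in source.values() for t in facts for x in t},
                  key=repr)
    for premise, conclusion in constraints:
        premise_vars = sorted({v for _, args in premise for v in args})
        exist_vars = sorted({v for _, args in conclusion for v in args}
                            - set(premise_vars))
        # Fire the constraint once for every assignment satisfying the premise.
        for values in product(adom, repeat=len(premise_vars)):
            a = dict(zip(premise_vars, values))
            if all(tuple(a[v] for v in args) in source.get(rel, set())
                   for rel, args in premise):
                for y in exist_vars:
                    a[y] = f"_N{next(fresh)}"  # fresh labeled null
                for rel, args in conclusion:
                    target.setdefault(rel, set()).add(tuple(a[v] for v in args))
    return target

# The LAV constraint E(x, y) → ∃z (F(x, z) ∧ F(z, y)) from the text.
lav = [([("E", ("x", "y"))], [("F", ("x", "z")), ("F", ("z", "y"))])]
# The GAV constraint E(x, z) ∧ E(z, y) → F(x, y) from the text.
gav = [([("E", ("x", "z")), ("E", ("z", "y"))], [("F", ("x", "y"))])]

lav_result = oblivious_chase({"E": {(1, 2)}}, lav)
gav_result = oblivious_chase({"E": {(1, 2), (2, 3)}}, gav)
```

For the LAV constraint, the single source fact E(1, 2) produces two target facts sharing one fresh null; for the GAV constraint, no nulls are needed because the conclusion has no existential quantifiers.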
Second-Order tgds, or SO tgds, were introduced in [8] and were shown to be exactly the constraints needed to express the composition of a finite number of GLAV mappings. Instead of giving the precise definition of an SO tgd, we illustrate this notion with an example from [8]. The formula
$$\exists f \big( \forall e\, (\mathrm{Emp}(e) \rightarrow \mathrm{Mgr}(e, f(e))) \;\wedge\; \forall e\, (\mathrm{Emp}(e) \wedge (e = f(e)) \rightarrow \mathrm{SelfMgr}(e)) \big)$$
expresses the property that every employee has a manager, and if an employee is the manager of himself/herself, then this employee is a self-manager. Clearly, SO tgds are existential second-order formulas with existentially quantified function symbols, which can be thought of as acting like Skolem functions. The use of these function symbols, however, is limited by the syntax of SO tgds: they can only appear in equations between terms in the antecedent of an implication or as arguments of atoms in the conclusion of an implication. As regards expressive power, SO tgds are, in general, strictly more expressive than GLAV constraints, but less expressive than arbitrary existential second-order formulas. In particular, the above formula is an SO tgd that is not logically equivalent to any (finite or infinite) set of GLAV constraints [8].
Every SO tgd allows for C Q-rewriting and admits universal solutions; however, an SO tgd may not be closed under target homomorphisms and there may not exist any n ≥ 1 such that the SO tgd is n-modular (see [8, 19]).
Pseudometric Spaces and Metric Spaces
A pseudometric space is a pair (X, d), where X is a set and d is a function from X × X to the set $\mathbb{R}^+$ of non-negative real numbers with the following properties: (i) d(x, x) = 0, for every x in X; (ii) d(x, y) = d(y, x), for every x and y in X; (iii) d(x, y) ≤ d(x, z) + d(y, z), for every x, y, z in X (triangle inequality). A metric space is a pseudometric space (X, d) such that if d(x, y) = 0, then x = y. It is easy to see that if (X, d) is a pseudometric space, then the relation $R_d = \{(x, y) \in X \times X \mid d(x, y) = 0\}$ is an equivalence relation on X. From this, it follows that every pseudometric space (X, d) gives rise to a metric space $(X/R_d, \hat{d})$, where $X/R_d$ is the set of equivalence classes of elements of X modulo the equivalence relation $R_d$ and $\hat{d}([x], [y]) = d(x, y)$.
Let (X, d) be a pseudometric space. A sequence of elements $x_1, x_2, \ldots$ of X converges to an element x of X, denoted by $\lim_{n \to \infty} x_n = x$, if for every 𝜖 > 0, there is an integer $n_0$ such that $d(x_n, x) < \epsilon$, for every $n \geq n_0$. We say that x is a limit of this sequence. The limit is unique if (X, d) is a metric space. A sequence $x_1, x_2, \ldots$ of elements of X is Cauchy if for every 𝜖 > 0, there is an integer $n_0$ such that $d(x_n, x_{n'}) < \epsilon$, for every $n, n' \geq n_0$.
Using the triangle inequality, it is easy to see that if a sequence of elements in a (pseudo)metric space has a limit, then the sequence is Cauchy. The converse, however, does not hold for arbitrary (pseudo)metric spaces. A (pseudo)metric space (X, d) is complete if every Cauchy sequence of elements of X has a limit in X; otherwise, it is incomplete.
It is well known that every incomplete (pseudo)metric space (X, d) can be embedded into a complete (pseudo)metric space (X∗, d∗), called the completion of (X, d), in such a way that X is a dense subset of X∗, i.e., every member of X∗ is the limit of a sequence of members of X. The members of X∗ are equivalence classes of Cauchy sequences of X, where two Cauchy sequences $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$ of elements of X are equivalent if $\lim_{n \to \infty} d(x_n, y_n) = 0$, while the distance function d∗ is defined as $d^*([(x_n)_{n \geq 1}], [(y_n)_{n \geq 1}]) = \lim_{n \to \infty} d(x_n, y_n)$. The proof of correctness of this construction can be found in [18] or any other book on metric spaces.
As a concrete example, the metric space of the real numbers is the completion of the metric space of the rational numbers (both with the standard distance).
Metric Space of Target Instances
To study the limits of sequences of schema mappings, we first introduce a pseudometric space of sets of target instances. By considering schema mappings as functions that map each source instance to the set of its solutions, we can view sequences of schema mappings as sequences of functions. The (pointwise or uniform) limit of a sequence of schema mappings is then simply defined in the standard way as the limit of a sequence of functions taking values in a pseudometric space. Moreover, by passing to the associated metric space of equivalence classes of sets of target instances, we ensure the uniqueness of the limit. If T is a schema, we write Inst(T) for the set of all finite instances of T. We also write $\mathcal{P}(\mathrm{Inst}(T))$ for the power set of Inst(T). The notion of distance on $\mathcal{P}(\mathrm{Inst}(T))$ that we are about to introduce is heavily based on the notion of the certain answers to conjunctive queries and on the idea that two members $\mathcal{J}_1$ and $\mathcal{J}_2$ of $\mathcal{P}(\mathrm{Inst}(T))$ are “close” to each other if only “big” conjunctive queries can yield different certain answers on $\mathcal{J}_1$ and $\mathcal{J}_2$.
Definition 2
Let T be a schema.
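- Let q be a query over T and let $\mathcal{J}$ be a member of $\mathcal{P}(\mathrm{Inst}(T))$. The certain answers of q over $\mathcal{J}$ are defined as $\mathrm{cert}(q, \mathcal{J}) = \bigcap \{\, q(J) : J \in \mathcal{J} \,\}$.
- We say that two sets of instances $\mathcal{J}_1$ and $\mathcal{J}_2$ in $\mathcal{P}(\mathrm{Inst}(T))$ are CQ-equivalent, denoted $\mathcal{J}_1 \equiv_{\mathrm{CQ}} \mathcal{J}_2$, if $\mathrm{cert}(q, \mathcal{J}_1) = \mathrm{cert}(q, \mathcal{J}_2)$ holds for all conjunctive queries q.
- We say that $\mathcal{J}_1$ and $\mathcal{J}_2$ are $\mathrm{CQ}_n$-equivalent, denoted $\mathcal{J}_1 \equiv_{\mathrm{CQ}_n} \mathcal{J}_2$, if $\mathrm{cert}(q, \mathcal{J}_1) = \mathrm{cert}(q, \mathcal{J}_2)$ holds for all conjunctive queries q with at most n variables (i.e., for all q in $\mathrm{CQ}_n$).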
Definition 3
Let $\mathcal{J}_1$ and $\mathcal{J}_2$ be two sets of instances in $\mathcal{P}(\mathrm{Inst}(T))$. The similarity and the distance between $\mathcal{J}_1$ and $\mathcal{J}_2$ are defined as follows:
;
.
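It is easy to verify that the pair $(\mathcal{P}(\mathrm{Inst}(T)), \mathit{dist})$ is a pseudometric space; in fact, dist is an ultrametric distance function, that is,
$$\mathit{dist}(\mathcal{J}_1, \mathcal{J}_3) \leq \max\{\mathit{dist}(\mathcal{J}_1, \mathcal{J}_2),\ \mathit{dist}(\mathcal{J}_2, \mathcal{J}_3)\}$$
holds for all $\mathcal{J}_1$, $\mathcal{J}_2$, $\mathcal{J}_3$ in $\mathcal{P}(\mathrm{Inst}(T))$. Moreover, $\mathit{dist}(\mathcal{J}_1, \mathcal{J}_2) = 0$ if and only if $\mathcal{J}_1$ and $\mathcal{J}_2$ are CQ-equivalent.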
Definition 4
Let T be a schema. If J is a T-instance, then we write v(J) to denote the member of $\mathcal{P}(\mathrm{Inst}(T))$ consisting of all isomorphic copies of J via isomorphisms that rename nulls. In other words, v(J) consists of all T-instances J′ such that J′ is isomorphic to J via an isomorphism h that maps each constant to itself and maps each null to a (possibly different) null.
The next lemma will be used repeatedly in the sequel.
Lemma 1
Let T be a schema.
- If J is a T-instance whose active domain consists entirely of nulls and q is a non-boolean conjunctive query, then cert(q, v(J)) = ∅.
- If J is a T-instance whose active domain consists entirely of nulls and q is a boolean conjunctive query, then cert(q, v(J)) = q(J).
- If J and J′ are T-instances whose active domains consist entirely of nulls, then, for every k ≥ 1, the following statements are equivalent:
  - $v(J) \equiv_{\mathrm{CQ}_k} v(J')$.
  - J and J′ satisfy the same boolean conjunctive queries in $\mathrm{CQ}_k$.
Proof 2
For the first two parts of the lemma, let J be a T-instance whose active domain consists entirely of nulls. For every non-boolean conjunctive query q, we have that cert(q, v(J)) = ∅, because v(J) contains instances with disjoint active domains. For every boolean conjunctive query q, we have cert(q, v(J)) = q(J) for the following reason: first, J is a member of v(J), so if cert(q, v(J)) = true, then q(J) = true as well; second, since every member of v(J) is isomorphic to J and since boolean conjunctive queries are preserved under isomorphisms, we have that if q(J) = true, then cert(q, v(J)) = true.
For the third part of the lemma, let J and J′ be T-instances whose active domains consist entirely of nulls and let k be a positive integer. If $v(J) \equiv_{\mathrm{CQ}_k} v(J')$, then, by the second part of the lemma, J and J′ must satisfy the same boolean conjunctive queries in $\mathrm{CQ}_k$. For the converse, assume that J and J′ satisfy the same boolean conjunctive queries in $\mathrm{CQ}_k$. We have to show that cert(q, v(J)) = cert(q, v(J′)), for every conjunctive query q in $\mathrm{CQ}_k$. If q is a non-boolean conjunctive query in $\mathrm{CQ}_k$, then, by the first part of the lemma, we have that cert(q, v(J)) = ∅ = cert(q, v(J′)). If q is a boolean query in $\mathrm{CQ}_k$, then, by the second part of the lemma and the hypothesis about J and J′, we have that cert(q, v(J)) = q(J) = q(J′) = cert(q, v(J′)). □
The preceding lemma will be used in the next example, which presents a sequence from $\mathcal{P}(\mathrm{Inst}(T))$ that has a limit in $\mathcal{P}(\mathrm{Inst}(T))$.
Example 1
Let T be a schema consisting of a single binary relation E and let $C_m$ be the undirected cycle of length m, m ≥ 1, where the vertices of the cycle are pairwise distinct labeled nulls. Consider the sequence $(v(C_{2n+1}))_{n \geq 1}$ arising from the cycles of odd size. Then, for every m ≥ 1, we have that $\lim_{n \to \infty} v(C_{2n+1}) = v(C_{2m})$. In particular, $\lim_{n \to \infty} v(C_{2n+1}) = v(C_2)$.
We first show that $v(C_{2m}) \equiv_{\mathrm{CQ}} v(C_2)$, for every m ≥ 1. By Lemma 1, it suffices to show that $C_{2m}$ and $C_2$ satisfy the same boolean conjunctive queries. This is true because $C_{2m}$ and $C_2$ are homomorphically equivalent (and boolean conjunctive queries are preserved under homomorphisms). Indeed, there is a homomorphism from $C_2$ to $C_{2m}$ because $C_2$ is a subgraph of $C_{2m}$, and there is a homomorphism from $C_{2m}$ to $C_2$ because $C_{2m}$ is 2-colorable.
We will show that $\lim_{n \to \infty} v(C_{2n+1}) = v(C_2)$ by showing that for every k, there exists $n_0$ such that for all $n \geq n_0$, we have that $v(C_{2n+1}) \equiv_{\mathrm{CQ}_k} v(C_2)$. For this, we take $n_0 = k$ and show that if n ≥ k, then $v(C_{2n+1}) \equiv_{\mathrm{CQ}_k} v(C_2)$. By the third part of Lemma 1, it suffices to show that if q is a boolean conjunctive query in $\mathrm{CQ}_k$, then $q(C_{2n+1}) = q(C_2)$. Since $C_2$ is a subgraph of $C_{2n+1}$, we have that if $q(C_2)$ = true, then also $q(C_{2n+1})$ = true. Assume that $q(C_{2n+1})$ = true. Since q ∈ $\mathrm{CQ}_k$, there is a subgraph H of $C_{2n+1}$ with at most k distinct nodes such that q(H) = true. Since 2n + 1 > n ≥ k, we have that H is a proper subgraph of $C_{2n+1}$. Consequently, H is 2-colorable and so there is a homomorphism from H to $C_2$, which, in turn, implies that $q(C_2)$ = true.
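The key step of the example, namely that every subgraph of an odd cycle on fewer nodes than the cycle itself is 2-colorable while the whole cycle is not, can be checked mechanically for small sizes. The sketch below is only a sanity check under an assumed encoding (nodes are the integers 0, ..., m − 1 and each undirected edge is stored in both directions); the function names are hypothetical. It suffices to test induced subgraphs, because every subgraph of a 2-colorable graph is again 2-colorable.

```python
from collections import deque
from itertools import combinations

def undirected_cycle(m):
    """Edge set of the undirected cycle C_m on nodes 0..m-1 (both directions)."""
    edges = set()
    for i in range(m):
        j = (i + 1) % m
        edges |= {(i, j), (j, i)}
    return edges

def is_two_colorable(nodes, edges):
    """Breadth-first 2-coloring; equivalent to having a homomorphism to C_2."""
    color = {}
    for start in nodes:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for a, b in edges:
                if a != u:
                    continue
                if b not in color:
                    color[b] = 1 - color[u]
                    queue.append(b)
                elif color[b] == color[u]:
                    return False  # an odd cycle (or a self-loop) was found
    return True

def small_subgraphs_two_colorable(m, k):
    """Check that every induced subgraph of C_m on at most k nodes is 2-colorable."""
    edges = undirected_cycle(m)
    for size in range(1, k + 1):
        for nodes in combinations(range(m), size):
            induced = {(a, b) for a, b in edges if a in nodes and b in nodes}
            if not is_two_colorable(nodes, induced):
                return False
    return True
```

For the cycle of length 7 (that is, n = 3), the check succeeds for every k up to 6 and fails only when the whole cycle is allowed, which matches the observation that a conjunctive query needs at least 2n + 1 variables to tell $C_{2n+1}$ apart from $C_2$.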
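In contrast to what we have just seen, there are Cauchy sequences of elements of $\mathcal{P}(\mathrm{Inst}(T))$ that have no limit in $\mathcal{P}(\mathrm{Inst}(T))$. Thus, the pseudometric space $(\mathcal{P}(\mathrm{Inst}(T)), \mathit{dist})$ is incomplete.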
Proposition 2
Let T be a schema consisting of a single binary relation E and let $K_n$ be the clique of size n, for n ≥ 1, where the vertices are pairwise distinct labeled nulls. The sequence $(v(K_n))_{n \geq 1}$ is Cauchy, but has no limit in $\mathcal{P}(\mathrm{Inst}(T))$.
Proof 3
The sequence $(v(K_n))_{n \geq 1}$ is Cauchy, because if m ≥ n, then $v(K_m)$ and $v(K_n)$ have the same certain answers for all conjunctive queries in $\mathrm{CQ}_n$. To show this, by the third part of Lemma 1, it suffices to show that if m ≥ n, then $K_m$ and $K_n$ satisfy the same boolean conjunctive queries in $\mathrm{CQ}_n$. Let q be a boolean conjunctive query in $\mathrm{CQ}_n$. Since $K_n$ is a subgraph of $K_m$, if $q(K_n)$ = true, then $q(K_m)$ = true. Conversely, if $q(K_m)$ = true, then there is a subgraph H of $K_m$ with at most n distinct nodes such that q(H) = true. But then H is isomorphic to a subgraph of $K_n$, hence $q(K_n)$ = true.
It remains to show that the sequence $(v(K_n))_{n \geq 1}$ has no limit in $\mathcal{P}(\mathrm{Inst}(T))$. Assume to the contrary that there does exist a set $\mathcal{J}$ of finite instances over T such that $\lim_{n \to \infty} v(K_n) = \mathcal{J}$. We distinguish three cases.
First, if $\mathcal{J} = \emptyset$, then $\mathrm{cert}(q, \mathcal{J}) = \mathit{true}$, for every boolean conjunctive query q. In particular, this holds for the query q = ∃x E(x, x), which asserts the existence of a self-loop. In contrast, for this conjunctive query, we have that $\mathrm{cert}(q, v(K_n)) = \mathit{false}$, for every n ≥ 1, since $K_n \in v(K_n)$ and none of the graphs $K_n$, n ≥ 1, contains a self-loop.
Second, if $\mathcal{J} \neq \emptyset$ and if every member J of $\mathcal{J}$ contains a self-loop, then we again consider the query q = ∃x E(x, x). We thus have $\mathrm{cert}(q, \mathcal{J}) = \mathit{true}$, whereas $\mathrm{cert}(q, v(K_n)) = \mathit{false}$, for every n ≥ 1.
It remains to consider the case that $\mathcal{J} \neq \emptyset$ and at least one member $J \in \mathcal{J}$ does not contain a self-loop. Let m be the biggest integer such that J contains a clique of size m. We define the query q as
$$q = \exists x_1 \cdots \exists x_{m+1} \bigwedge_{1 \leq i \neq j \leq m+1} E(x_i, x_j).$$
For graphs without self-loops, q asserts the existence of a clique of size m + 1. We now have that q evaluates to false over J. Hence, $\mathrm{cert}(q, \mathcal{J}) = \mathit{false}$ holds, while $\mathrm{cert}(q, v(K_n)) = \mathit{true}$, for every n ≥ m + 1. □
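The behavior of the clique query used in the last case of the proof can be illustrated on small graphs. The sketch below assumes that a graph is given as a set of directed edge pairs over integer nodes and uses hypothetical helper names; it evaluates the boolean query that requires an E-edge between every ordered pair of distinct variables.

```python
from itertools import product

def clique(n):
    """Edge set of the clique K_n on nodes 0..n-1, without self-loops."""
    return {(i, j) for i in range(n) for j in range(n) if i != j}

def clique_query_holds(size, edges):
    """Evaluate the boolean query ∃x_1 ... ∃x_size ⋀_{i≠j} E(x_i, x_j)."""
    nodes = sorted({v for e in edges for v in e})
    return any(
        all((a[i], a[j]) in edges
            for i in range(size) for j in range(size) if i != j)
        for a in product(nodes, repeat=size)
    )

# The query with three variables separates K_2 from K_3, K_4, ...
holds_on_k2 = clique_query_holds(3, clique(2))
holds_on_k3 = clique_query_holds(3, clique(3))
# On a graph with a self-loop the query is satisfied by a single node,
# which is why the proof treats self-loops in separate cases.
holds_on_loop = clique_query_holds(3, {(0, 0)})
```

On graphs without self-loops every satisfying assignment must be injective, so the query with m + 1 variables holds on $K_n$ exactly when n ≥ m + 1.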
Since $(v(K_n))_{n \geq 1}$ is a Cauchy sequence, it has a limit in the completion of $\mathcal{P}(\mathrm{Inst}(T))$. As we will see in Section 6, a concrete representation of this limit is the set consisting of all disjoint unions of cliques of all finite sizes in which every node is a null.
The following definitions are perfectly meaningful for every pseudometric space (X, d) and for every sequence of functions taking values in X. For concreteness, we give the definitions for sequences of functions taking values in $\mathcal{P}(\mathrm{Inst}(T))$.
Definition 5
Let A be a set, let $(f_n)_{n \geq 1}$ be a sequence of functions from A to $\mathcal{P}(\mathrm{Inst}(T))$, and let f be a function from A to $\mathcal{P}(\mathrm{Inst}(T))$.
- We say that $(f_n)_{n \geq 1}$ converges pointwise to f if for every element x ∈ A, we have that $\lim_{n \to \infty} f_n(x) = f(x)$.
- We say that $(f_n)_{n \geq 1}$ converges uniformly to f if for every 𝜖 > 0, there exists an integer $n_0 \geq 1$ such that for every integer $n \geq n_0$ and for every element x ∈ A, we have $\mathit{dist}(f_n(x), f(x)) < \epsilon$.
- We say that $(f_n)_{n \geq 1}$ is pointwise Cauchy if for every element x ∈ A, the sequence $(f_n(x))_{n \geq 1}$ is Cauchy.
- We say that $(f_n)_{n \geq 1}$ is uniformly Cauchy if for every 𝜖 > 0, there exists an integer $n_0 \geq 1$ such that for all integers $n, n' \geq n_0$ and for every element x ∈ A, we have $\mathit{dist}(f_n(x), f_{n'}(x)) < \epsilon$.
Clearly, if $(f_n)_{n \geq 1}$ converges pointwise (resp., uniformly), then $(f_n)_{n \geq 1}$ is pointwise (resp., uniformly) Cauchy. The converse is not in general true for arbitrary (pseudo)metric spaces; in particular, it is not true for the pseudometric space $(\mathcal{P}(\mathrm{Inst}(T)), \mathit{dist})$, as we shall see later on.
We now bring schema mappings into the picture. Every schema mapping over a source schema S and a target schema T can be identified with a function , where (recall that is the set of all solutions of I w.r.t. , i.e., the set of all finite T instances J such that ). Thus, a sequence of schema mappings over a source schema S and target schema T can be viewed as a sequence of functions from Inst(S) to . Therefore, we can talk about a sequence of schema mappings being pointwise Cauchy and uniformly Cauchy if the sequence of the associated functions has these properties. Similarly, we say that a sequence of schema mappings has a pointwise limit (resp., a uniform limit) if the sequence of the associated functions converges pointwise (resp., converges uniformly) to a schema mapping.
The preceding notion of convergence of a sequence of schema mappings allows us to draw a connection to earlier work on schema mapping optimization [5, 7]. Here, we are considering CQ-equivalence and $CQ_n$-equivalence of sets of instances. In previous works, these notions of equivalence have been mainly applied to schema mappings (see, e.g., [5, 7, 14]). Specifically, two schema mappings $\mathcal{M}$ and $\mathcal{M}'$ are CQ-equivalent (resp., $CQ_n$-equivalent) if for every target conjunctive query q (resp., every target conjunctive query q in $CQ_n$) and every source instance I, we have that $cert(q, I, \mathcal{M}) = cert(q, I, \mathcal{M}')$. In this case, we write $\mathcal{M} \equiv_{CQ} \mathcal{M}'$ (resp., $\mathcal{M} \equiv_{CQ_n} \mathcal{M}'$). The notion of $CQ_n$-equivalence has been studied in the context of schema mapping optimization [5, 7]. Below we discuss its relationship to the convergence of schema mappings.
Proposition 3
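Consider a sequence $(\mathcal{M}_n)_{n \geq 1}$ of schema mappings and a schema mapping $\mathcal{M}$. Then $(\mathcal{M}_n)_{n \geq 1}$ converges uniformly to $\mathcal{M}$ if and only if for every integer k ≥ 1, there is an integer $n_0 \geq 1$ such that for all integers $n \geq n_0$, we have that $\mathcal{M}_n \equiv_{CQ_k} \mathcal{M}$.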
Proof 4
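The result follows by unfolding and comparing the definitions. Specifically, uniform convergence of $(\mathcal{M}_n)_{n \geq 1}$ to $\mathcal{M}$ means that for every 𝜖 > 0, there is an integer $n_0 \geq 1$ such that for every integer $n \geq n_0$ and for every source instance I, the distance between the set of solutions of I w.r.t. $\mathcal{M}_n$ and the set of solutions of I w.r.t. $\mathcal{M}$ is less than 𝜖. In turn, this means that for every integer k ≥ 1, there is an integer $n_0 \geq 1$ such that for every integer $n \geq n_0$ and for every source instance I, these two sets of solutions are $CQ_k$-equivalent. Thus, for every integer k ≥ 1, there is an integer $n_0 \geq 1$ such that for every integer $n \geq n_0$, we have that $\mathcal{M}_n \equiv_{CQ_k} \mathcal{M}$. □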
Intuitively, the preceding proposition states that it takes bigger and bigger conjunctive queries to distinguish the members of a sequence from its uniform limit.
Although never explicitly introduced, the notion of uniform convergence was implicit in [5], where it was shown that for every SO tgd σ and for every n ≥ 1, there is a GLAV mapping $\mathcal{M}_n$ such that $\sigma \equiv_{CQ_n} \mathcal{M}_n$. From this, it is easy to see that the sequence $(\mathcal{M}_n)_{n \geq 1}$ converges uniformly to σ. Thus, we have the following result.
Theorem 1
(implicit in [5]) Every SO tgd is a uniform limit of a sequence of GLAV mappings.
There are SO tgds that are not C Q-equivalent to any GLAV mapping. Indeed, from Example 4.6 and Theorem 4.10 in [7], it follows that the SO-tgd
is not CQ-equivalent to any GLAV mapping. Thus, the point of Theorem 1 is that SO tgds can be “approximated” up to any level of $CQ_k$-equivalence by GLAV mappings, which are both syntactically simpler and generally better behaved.
As stated earlier, is a pseudometric space since it cannot distinguish C Q-equivalent sets of instances. Consequently, the limit of a sequence of sets of instances and the (uniform or pointwise) limit of a sequence of mappings need not be unique. However, the limit is unique up to C Q-equivalence and, as described in Section 2, there is an associated metric space obtained by considering the equivalence classes of modulo the equivalence relation R dist, where if and only if (i.e., if and only if ).
In subsequent sections, we will work with the metric space . Moreover, we will be interested in schema mappings modulo C Q-equivalence, which means that from now on we will view schema mappings as functions from source instances to equivalence classes of sets of target instances modulo C Q-equivalence. However, for notational simplicity, we will work each time with representatives of the equivalence classes. By a slight abuse of notation, we will write , instead of . Likewise, we will not explicitly distinguish between a schema mapping and the equivalence class of the schema mappings that are C Q-equivalent to .
Limits of Sequences of GAV Mappings
Our goal in this section is to analyze sequences of GAV mappings. To this effect, we first investigate the existence of limits of such sequences and then examine the definability of limits. As discussed in Section 3, if a sequence of schema mappings has a pointwise (resp., uniform) limit, then the sequence is pointwise (resp., uniformly) Cauchy. The next result asserts that the converse holds for sequences of GAV mappings.
Theorem 2
Let $(\mathcal{M}_n)_{n \geq 1}$ be a sequence of GAV mappings.
If $(\mathcal{M}_n)_{n \geq 1}$ is pointwise Cauchy, then it has a pointwise limit.
If $(\mathcal{M}_n)_{n \geq 1}$ is uniformly Cauchy, then it is eventually constant and thus has a GAV schema mapping as a uniform limit.
Proof 5
We consider GAV mappings over a source schema S and a target schema T. Let r denote the maximum arity of the relation symbols in T. For showing the first claim, assume that is a pointwise Cauchy sequence of schema mappings and let I be a source instance. For each n ≥ 1, consider the universal solution for I w.r.t. obtained by using the oblivious chase procedure. Since each is a GAV schema mapping, we have that contains constants from the active domain of I and no nulls. We claim that there exists some n 0 such that for all n ≥ n 0, we have that . In other words, we claim that the sequence is eventually constant (does not oscillate). Since every instance in the sequence has no nulls, it can be identified by evaluating on that instance the atomic queries R(x 1,…, x k), where R ranges over the relation symbols of T and k (with k ≤ r) denotes the arity of R. The assumption that the sequence is pointwise Cauchy implies that there exists a positive integer n 0 (that depends on I and r) such that for every integer n ≥ n 0 and every conjunctive query q ∈C Q r, we have that . This implies that and, consequently, for every n ≥ n 0, we have that .
We have just shown that if is a pointwise Cauchy sequence of GAV mappings, then for every I, there exists a positive integer m I such that , for all n ≥ m I. It follows that the schema mapping is a pointwise limit of the sequence . Note that is indeed a schema mapping because contains no nulls.
For showing the second claim, assume that is a uniformly Cauchy sequence of GAV mappings. We claim that is eventually constant, i.e., there is some n 0 such that for all n ≥ n 0, holds. For this, we repeat the previous argument, but also note that, since the sequence is uniformly Cauchy, there exists a positive integer n 0 that depends only on r such that for every source instance I, for every integer n ≥ n 0 and every conjunctive query q ∈C Q r, we have that . This implies that for every source instance I and every n ≥ n 0, we have that ; consequently, for every source instance I and every n ≥ n 0, we have that . □
Next, we point out that, for sequences of GAV mappings, the notions of pointwise convergence and uniform convergence are genuinely different.
Proposition 4
There exists a sequence of GAV mappings that has a GAV mapping as a pointwise limit, but has no uniform limit.
Proof 6
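For every n ≥ 2, let $q_n(x_1, \ldots, x_n) = \bigwedge_{1 \leq i \neq j \leq n} E(x_i, x_j)$. Intuitively, if E is interpreted as an edge relation, then $q_n$ yields a non-empty answer over any graph that contains a self-loop or a clique of size n. Let S be a source schema consisting of a binary relation symbol E and a unary relation symbol P, and let T be a target schema consisting of a unary relation symbol P′. Let $(\mathcal{M}_n)_{n \geq 1}$ be the sequence of GAV mappings, where $\mathcal{M}_n$ is specified by the constraint $\forall x \forall x_1, \ldots, x_{n+1} (P(x) \wedge q_{n+1} \rightarrow P'(x))$. Intuitively, $\mathcal{M}_n$ is a “copy” schema mapping, but the copying action is triggered only if the source instance contains a self-loop or a clique of size n + 1. We will show that the GAV schema mapping $\mathcal{M}$ specified by the constraint $\forall x \forall y (P(x) \wedge E(y, y) \rightarrow P'(x))$ is a pointwise limit of $(\mathcal{M}_n)_{n \geq 1}$, but that this pointwise limit is not a uniform limit of $(\mathcal{M}_n)_{n \geq 1}$ and thus no uniform limit of $(\mathcal{M}_n)_{n \geq 1}$ exists.
We first show that the GAV mapping $\mathcal{M}$ is a pointwise limit of $(\mathcal{M}_n)_{n \geq 1}$. Given a source instance I, we consider two cases.
If I contains a self-loop, then J = {P′(x)∣P(x) ∈ I} is a universal solution for I w.r.t. $\mathcal{M}$ and w.r.t. $\mathcal{M}_n$, for all n. Thus, $\mathcal{M}$ and $\mathcal{M}_n$ yield the same certain answers on I, for all n.
If I is self-loop free, let $n_0$ be such that no clique larger than $n_0$ exists in I. Then, J = ∅ is a universal solution for I w.r.t. $\mathcal{M}$ and w.r.t. $\mathcal{M}_n$, for all $n \geq n_0$. Thus, $\mathcal{M}$ and $\mathcal{M}_n$ yield the same certain answers on I, for all $n \geq n_0$.
Next, we show that $(\mathcal{M}_n)_{n \geq 1}$ has no uniform limit. Towards a contradiction, suppose that such a uniform limit exists. Every uniform limit is also a pointwise limit; moreover, pointwise and uniform limits are unique up to CQ-equivalence. Hence, since the schema mapping $\mathcal{M}$ defined above is a pointwise limit of $(\mathcal{M}_n)_{n \geq 1}$, it follows that $\mathcal{M}$ is also a uniform limit of $(\mathcal{M}_n)_{n \geq 1}$. Let m = 1. Then there exists an $n_0$ such that for all $n \geq n_0$ we have that $\mathcal{M}_n \equiv_{CQ_1} \mathcal{M}$. Take $n = n_0$. Let I be the source instance $K_{n+1} \cup \{P(c)\}$, where $K_{n+1}$ is a clique of size n + 1 without self-loops, and let q be the target conjunctive query ∃x P′(x). We now claim that $cert(q, I, \mathcal{M}_n) \neq cert(q, I, \mathcal{M})$, which contradicts the previously derived fact that $\mathcal{M}_n \equiv_{CQ_1} \mathcal{M}$. Indeed, since I contains a clique of size n + 1, we have that {P′(c)} is a universal solution for I w.r.t. $\mathcal{M}_n$, hence $cert(q, I, \mathcal{M}_n)$ is true. However, since I contains no self-loop, we have that ∅ is a universal solution for I w.r.t. $\mathcal{M}$, hence $cert(q, I, \mathcal{M})$ is false. □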
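The behavior described in the proof of Proposition 4 can be checked on small graphs. The following is a minimal sketch, not part of the original development: it assumes that $q_n$ requires E-edges between all pairs of distinct variables (so that a self-loop or an n-clique satisfies it), and the helper names `triggers`, `chase_Mn`, `chase_limit`, and `clique` are hypothetical.

```python
from itertools import product

def triggers(edges, n):
    """Assumed reading of q_n: some assignment of x_1..x_n to nodes
    satisfies E(x_i, x_j) for all i != j (variables may coincide)."""
    nodes = {v for e in edges for v in e}
    return any(
        all((t[i], t[j]) in edges for i in range(n) for j in range(n) if i != j)
        for t in product(nodes, repeat=n)
    )

def chase_Mn(P, edges, n):
    """P'-facts produced by M_n: copy P only if q_{n+1} holds in E."""
    return set(P) if triggers(edges, n + 1) else set()

def chase_limit(P, edges):
    """P'-facts produced by the candidate limit: copy P only if E has a self-loop."""
    return set(P) if any(a == b for a, b in edges) else set()

def clique(size):
    """Edge set of a clique on `size` nodes without self-loops."""
    return {(a, b) for a in range(size) for b in range(size) if a != b}

# Pointwise behavior on a fixed instance: K_4 plus the fact P(c).
P, E = {"c"}, clique(4)
print([chase_Mn(P, E, n) == chase_limit(P, E) for n in range(1, 6)])

# For every n there is an instance (K_{n+1} plus P(c)) on which M_n and the limit differ.
print([chase_Mn(P, clique(n + 1), n) != chase_limit(P, clique(n + 1)) for n in range(1, 4)])
```

On the fixed instance the two mappings agree from n = 4 onwards, while the second check exhibits, for each n tried, an instance on which they disagree; this is the gap between pointwise and uniform convergence that the proof exploits.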
Proposition 4 and Theorem 2 imply that the sequence of GAV mappings in the proof of Proposition 4 is an example of a pointwise Cauchy sequence that is not uniformly Cauchy. Theorem 2 also implies that if a sequence of GAV mappings has a uniform limit, then it must have a GAV mapping as such a limit. In turn, this gives rise to the following natural question concerning the definability of pointwise limits: if a sequence of GAV mappings has a pointwise limit, does it have a GAV mapping as such a limit? We answer this question in the negative by showing that even the much richer language of SO tgds cannot express pointwise limits of sequences of GAV mappings.
Proposition 5
There is a pointwise Cauchy sequence of GAV schema mappings such that no SO tgd is a pointwise limit of that sequence.
Proof 7
Consider a source schema S consisting of a binary relation symbol E, and a target schema T consisting of a binary relation F. For every n ≥ 1, let P n(x, y) be the conjunctive query expressing the property “there is an E-path of length n from x to y”, and let be the GAV mapping specified by the set {∀x, y(P i(x, y) → F(x, y))∣1 ≤ i ≤ n}. Consider the schema mapping
It is easy to see that is a pointwise limit of the sequence ; the reason for this is that, for every source instance I and for every n ≥|a d o m(I)|2, we have that . However, is not C Q-equivalent to any schema mapping that allows for C Q-rewriting: if it were, then there would exist a union q of conjunctive queries over the source such that, for every source instance I,
Consequently, the transitive closure of I would be first-order definable over the source, which is not the case. Since every SO tgd allows for C Q-rewriting, no SO tgd is a pointwise limit of the sequence .□
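The stabilization argument in the proof of Proposition 5 can be illustrated concretely. The sketch below is an illustration under assumptions: a "path of length n" is read as a sequence of n consecutive E-edges, and the helper names `compose` and `chase_Mn` are hypothetical.

```python
def compose(r, s):
    """Relational composition: pairs (a, d) with (a, b) in r and (b, d) in s."""
    return {(a, d) for a, b in r for c, d in s if b == c}

def chase_Mn(edges, n):
    """F-facts produced by the rules P_i(x, y) -> F(x, y) for 1 <= i <= n."""
    power = set(edges)      # pairs connected by a path of exactly i edges
    result = set(edges)
    for _ in range(n - 1):
        power = compose(power, edges)
        result |= power
    return result

chain = {(0, 1), (1, 2), (2, 3)}
for n in range(1, 6):
    print(n, sorted(chase_Mn(chain, n)))
```

For this chain the output stops changing at n = 3, where it equals the transitive closure of E. The point of the proposition is that the index at which stabilization occurs grows with the instance, so no single union of conjunctive queries captures the limit.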
We have just seen that there are sequences of GAV mappings that have a pointwise limit, but no such limit is definable by a GAV mapping. This raises the question of finding necessary and sufficient conditions guaranteeing that a sequence of GAV mappings has a GAV mapping as a pointwise limit. The next result provides an answer to this question.
Theorem 3
Let $(\mathcal{M}_n)_{n \geq 1}$ be a pointwise Cauchy sequence of GAV mappings. The following statements are equivalent:
$(\mathcal{M}_n)_{n \geq 1}$ has a GAV mapping as a pointwise limit.
$(\mathcal{M}_n)_{n \geq 1}$ has a pointwise limit that allows for CQ-rewriting.
Proof 8
Let be a pointwise Cauchy sequence of schema mappings. As shown in the proof of Theorem 2, for every source instance I, there is a positive integer m I, such that for all n ≥ m I the equality holds for the respective elements and of . Moreover, the schema mapping
is a pointwise limit of . Consider the following schema mapping :
It is clear that is also a pointwise limit of . The result we seek is an immediate consequence of the fact that the following four statements are equivalent:
has a GAV mapping as a pointwise limit.
has a pointwise limit that allows for C Q-rewriting.
allows for C Q-rewriting.
is logically equivalent to a GAV mapping.
We now show that these four conditions are equivalent.
(a) ⇒ (b) This is true because every GAV mapping allows for C Q-rewriting.
(b) ⇒ (c) This is true because if is a pointwise limit of that allows for C Q-rewriting, then so does since .
(c) ⇒ (d) This is the most involved part of the proof. Let us examine the structural properties that the schema mapping possesses. By hypothesis, allows for C Q-rewriting. By construction, admits universal solutions, since is a universal solution for I w.r.t. , for every source instance I. Moreover, it is clear from its definition that is closed under target homomorphisms. Finally, we claim that is closed under target intersections. Indeed, assume that both (I, J 1) and (I, J 2) are in . Then is contained in both J 1 and J 2, hence is contained in J 1 ∩ J 2, hence J 1 ∩ J 2 is a solution for I w.r.t. .
Thus, allows for C Q-rewriting, admits universal solutions, and is closed under both target homomorphisms and target intersections. Theorem 3.2 in [19] asserts that a schema mapping is logically equivalent to a GAV schema mapping if and only if it allows for C Q-rewriting, admits universal solutions, and is closed under both target homomorphisms and target intersections. It follows that is logically equivalent to a GAV mapping.
(d) ⇒ (a) This is obvious since is a pointwise limit of .
□
Observe that Theorem 3 (and its proof) provide necessary and sufficient conditions for a pointwise Cauchy sequence of GAV mappings to have a GAV mapping as a pointwise limit, but these conditions are on the pointwise limit and not on the sequence itself. By analyzing the proof of Theorem 3, however, it is possible to extract a necessary and sufficient condition on the sequence itself. For this, we need to introduce the following concept.
Definition 6
Let $(\mathcal{M}_n)_{n \geq 1}$ be a sequence of schema mappings. We say that $(\mathcal{M}_n)_{n \geq 1}$ allows for CQ-rewriting if for every target conjunctive query q, there is a union q′ of source conjunctive queries having the following property: for every source instance I, there is a positive integer $n_I$ such that $cert(q, I, \mathcal{M}_n) = q'(I)$, for every $n \geq n_I$.
Let be a pointwise limit of a sequence of schema mappings. It is easy to show that allows for C Q-rewriting if and only if allows for C Q-rewriting. Indeed, assume first that allows for C Q-rewriting. To show that allows for C Q-rewriting, let q be a conjunctive query and let q ′ be a union of conjunctive queries such that , for every source instance I. Since is a pointwise limit of , for every instance I, there is a positive integer such that , for every . It follows that , for every , which shows that allows for C Q-rewriting. In the other direction, assume that allows for C Q-rewriting. To show that allows for C Q-rewriting, let q be a conjunctive query and let q ′ be a union of conjunctive queries such that for every source instance I, there is a positive integer n I such that , for every n ≥ n I. By the pointwise convergence of to , for every source instance I, there is a positive integer such that , for every . Let I be a source instance. By taking any , we have that and c e r t(q, I, M n) = q ′(I), hence , which shows that allows for C Q-rewriting.
By combining the preceding remarks with Theorems 2 and 3, we obtain the following result.
Corollary 1
Let $(\mathcal{M}_n)_{n \geq 1}$ be a pointwise Cauchy sequence of GAV mappings. The following statements are equivalent:
$(\mathcal{M}_n)_{n \geq 1}$ has a GAV mapping as a pointwise limit.
$(\mathcal{M}_n)_{n \geq 1}$ allows for CQ-rewriting.
Since every schema mapping specified by an SO tgd allows for C Q-rewriting, Theorem 3 also implies the following result.
Corollary 2
Let $(\mathcal{M}_n)_{n \geq 1}$ be a pointwise Cauchy sequence of GAV mappings. The following statements are equivalent:
$(\mathcal{M}_n)_{n \geq 1}$ has a GAV mapping as a pointwise limit.
$(\mathcal{M}_n)_{n \geq 1}$ has an SO tgd as a pointwise limit.
Finally, we note that Proposition and Theorem 3 yield a fairly complete picture of the definability of pointwise limits of GAV mappings. Specifically, there are two mutually exclusive possibilities:
No pointwise limit allows for C Q-rewriting and no GAV mapping is a pointwise limit.
Every pointwise limit admits C Q-rewriting and there is a GAV mapping that is a pointwise limit. Moreover, this happens precisely when the schema mapping in the proof of Theorem 3 allows for C Q-rewriting or, equivalently, when is logically equivalent to a GAV mapping.
Limits of Sequences of LAV Mappings
In this section, we investigate the existence and definability of limits of sequences of LAV mappings. In fact, we will consider a much broader class of GLAV mappings, namely k-premise-bounded GLAV mappings for arbitrary k ≥ 1. LAV mappings correspond to the special case of k = 1.
Definition 7
Let $\mathcal{M}$ be a GLAV mapping and k a positive integer. We call $\mathcal{M}$ a k-premise-bounded GLAV mapping if the premise of every constraint in $\mathcal{M}$ has at most k atoms.
Let $(\mathcal{M}_n)_{n \geq 1}$ be a sequence of GLAV mappings. We say that $(\mathcal{M}_n)_{n \geq 1}$ is premise-bounded if there exists an integer k such that every element of $(\mathcal{M}_n)_{n \geq 1}$ is k-premise-bounded.
Unlike the case of GAV mappings, the notions of pointwise Cauchy and uniformly Cauchy sequences of premise-bounded GLAV mappings coincide. Moreover, the same holds true for the notions of pointwise limit and uniform limit of sequences of such schema mappings.
Theorem 4
Let $(\mathcal{M}_n)_{n \geq 1}$ be a sequence of premise-bounded GLAV mappings.
The sequence is pointwise Cauchy if and only if it is uniformly Cauchy.
The sequence has a pointwise limit if and only if it has a uniform limit.
Proof 9
We prove the first part and then use it to prove the second part.
Part 1. It is obvious that every uniformly Cauchy sequence of mappings is also pointwise Cauchy. We focus on the reverse direction. Let $(\mathcal{M}_n)_{n \geq 1}$ be a pointwise Cauchy sequence of premise-bounded GLAV mappings. We have to show that for every m, there is an $N_0$ such that for all $n, n' \geq N_0$, we have that $\mathcal{M}_n \equiv_{CQ_m} \mathcal{M}_{n'}$.
Fix an integer m. Since is pointwise Cauchy, for every source instance I, there is an integer n 0(I) such that for all n, n ′≥ n 0(I) and for every conjunctive query q in C Q m, we have that . Let p be the number of relation symbols in the target schema, let r be their maximum arity, and let k be the bound on the number of atoms in the premises of the members of the sequence . We write to denote the class of all source instances with at most k ⋅ p ⋅ m r atoms. Clearly, up to isomorphism, there are only finitely many instances . Moreover, if I ′≅I ″, then n 0(I ′) = n 0(I ″). Consequently, the quantity is a positive integer. We claim that for all n, n ′≥ N 0, we have that .
Let I be an arbitrary source instance and let q be an arbitrary conjunctive query in C Q m. We have to show that , for all n, n ′≥ N 0. Let a be a tuple of constants such that , hence . Since the query q has at most m variables, it must consist of at most p ⋅ m r atoms. Let be a homomorphism establishing that . It follows that there are at most p ⋅ m r facts in witnessing that . Each of these facts must be produced in a single step while chasing the source instance I with , which implies that each of these facts is produced using at most k facts from I. Let I ∗ be the subinstance of I consisting of all the aforementioned facts of I used to produce the facts in witnessing that . We then have that |I ∗|≤ k ⋅ p ⋅ m r and . Since n, n ′≥ N 0, we have that , hence . By the monotonicity of the chase procedure, we have that . It follows that . A symmetric argument establishes the containment , hence , which, in turn, implies that .
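The counting step in Part 1 rests on two bounds: a conjunctive query with at most m variables has at most p ⋅ m^r distinct atoms, and the source facts needed to witness one answer number at most k ⋅ p ⋅ m^r. A small sketch of this arithmetic, with hypothetical function names, is:

```python
def max_query_atoms(p: int, r: int, m: int) -> int:
    """Upper bound on distinct atoms over p relation symbols of arity <= r and m variables."""
    return p * m ** r

def witness_size_bound(k: int, p: int, r: int, m: int) -> int:
    """Upper bound on |I*|: each witnessing target fact uses at most k source facts."""
    return k * max_query_atoms(p, r, m)

# Example: two binary target relations, queries with at most 3 variables.
print(max_query_atoms(p=2, r=2, m=3))          # 18
print(witness_size_bound(k=1, p=2, r=2, m=3))  # 18, the LAV case
print(witness_size_bound(k=3, p=2, r=2, m=3))  # 54
```

The bound depends only on k, p, r, and m, not on the source instance I, which is why finitely many small instances suffice to determine the threshold $N_0$.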
Part 2. It is obvious that if a sequence of schema mappings has a uniform limit, then it has a pointwise limit. We focus on the reverse direction. Let $(\mathcal{M}_n)_{n \geq 1}$ be a sequence of premise-bounded GLAV mappings that has a pointwise limit $\mathcal{M}$. We claim that $\mathcal{M}$ is also a uniform limit of $(\mathcal{M}_n)_{n \geq 1}$.
Since has a pointwise limit, we have that is pointwise Cauchy. The previous part implies that is uniformly Cauchy as well. Fix an integer m. Since is uniformly Cauchy, there exists an n 0 such that for all n, n ′≥ n 0, we have that . We claim that also holds, for every n ≥ n 0. To show this, fix some n ≥ n 0 and let I be a source instance and q a conjunctive query in C Q m. We have to show that . Since is a pointwise limit of , there is an such that for all , we have that . Take an integer n ′ such that . Since n ′≥ n 0, we have that . Since , we have that . Thus, . □
Note that the preceding proof of Part 2 used only the hypothesis that the sequence has a pointwise limit and the fact that the sequence is uniformly Cauchy, which was established in Part 1. As a matter of fact, this is an instance of a general result about pseudometric spaces, namely, that if a uniformly Cauchy sequence of functions converges pointwise, then it also converges uniformly.
The following two propositions further demarcate the differences between GAV and premise-bounded GLAV mappings. In fact, these differences are already witnessed by sequences of LAV mappings. The first difference concerns the existence of limits of uniformly Cauchy sequences. In contrast to the GAV case, uniformly Cauchy sequences of LAV mappings may have no uniform limit; in fact, they may not even have a pointwise limit.
Proposition 6
There exists a uniformly Cauchy sequence of LAV mappings that has no pointwise limit; in particular, it has no uniform limit either.
Proof 10
Let S be a source schema consisting of a binary relation symbol E and let T be a target schema consisting of a binary relation F. For every n ≥ 1, let be the LAV mapping specified by the constraint
where is the boolean conjunctive query which is satisfied by the graphs containing a self-loop or a clique of size n (now considering F as the edge relation).
We first show that the sequence is uniformly Cauchy. Let k ≥ 1. We claim that if we take n 0 = k, then for every source instance I, for every n, m ≥ n 0, and every q ∈C Q k, we have that . To see this, note that for every source instance I and for every t ≥ 1, the universal solutions of I w.r.t. have active domains consisting entirely of labeled nulls. Hence, only boolean queries may return a non-empty result. Moreover, observe that these universal solutions have no self-loops, i.e., they contain no atoms of the form F(v, v) for some labeled null v.
We now distinguish two cases: First, suppose that q ∈C Q k is a boolean conjunctive query which contains a “self-loop”, i.e., an atom of the form F(z, z) for some variable z. Then we clearly have . It remains to consider the case that q ∈ C Q k is a boolean C Q containing no self-loop. Then we clearly have , since we are assuming that m, n ≥ k holds.
Using an argument similar to the one in the proof of Proposition 2, we now show that the sequence has no pointwise limit. Towards a contradiction, assume that does have a pointwise limit . Let I be a non-empty source instance. We consider three cases.
First, assume that is empty. Then, for every boolean conjunctive query q, it holds trivially that . This is, in particular, the case for the query q = ∃z F(z, z), which asks for the existence of a self-loop. However, for this query q, we have that for every n ≥ 1.
Second, assume that is non-empty and that all solutions contain a self-loop. For the query q = ∃z F(z, z) as above, we again have , whereas , for every n ≥ 1.
Finally, assume that is non-empty and that at least one solution does not contain a self-loop. Let m be the biggest integer such that J contains a clique of size m. Consider the conjunctive query
Then q evaluates to false over J and we have . On the other hand, for all n ≥ m + 1 we have . Again, this contradicts our assumption that is the pointwise limit of . □
The next difference is the definability of uniform limits. In Section 4, we saw that if a sequence of GAV mappings has a uniform limit, then it is eventually constant, hence it has a GAV mapping as a uniform limit. This property need not hold for sequences of LAV mappings (hence, it need not hold for sequences of premise-bounded schema mappings).
Proposition 7
There exists a sequence $(\mathcal{M}_n)_{n \geq 1}$ of LAV mappings that has a uniform limit, but no uniform limit of $(\mathcal{M}_n)_{n \geq 1}$ admits universal solutions. In particular, no SO tgd is a uniform limit of the sequence $(\mathcal{M}_n)_{n \geq 1}$.
Proof 11
For every n ≥ 1, let be the LAV mapping specified by the constraint
where ∃P n is the conjunctive query ∃z 1…∃z n(F(z 1, z 2) ∧… ∧ F(z n−1, z n)) asserting that there is a “path” (possibly with repeated vertices) of length n in the target instance. We now show that the sequence has a uniform limit, but no uniform limit of this sequence admits universal solutions.
Part 1. For the first part of the claim, consider the schema mapping
where C k is a target instance consisting of a simple cycle of nulls of size k and v(C k) is the set of all isomorphic copies of C k via isomorphisms that rename nulls. We will show that is a uniform limit of the sequence . Specifically, we will show for every m, there exists n 0 such that for all n ≥ n 0, we have that .
Let n 0 = m. Since each has solutions consisting entirely of nulls, it suffices to consider boolean C Q s only. Let q be a boolean C Q with m variables and assume that , where n ≥ m. This implies that there is a homomorphism from the body of q into P n, where P n is the simple path with n nodes. In turn, this implies that C k⊧q, for every k. Thus, as well. In the other direction, assume that . Note that q cannot contain a directed cycle, since no directed cycle can be mapped homomorphically in every cycle of length greater than one. Let h be a homomorphism from the body of q into C m+1. Since q ∈C Q m, the variables of q have at most m distinct images among the nodes of C m+1. This means that , where is obtained from C m+1 by removing the facts that contain at least one element that is not the image of one of the variables of q under h. Note that has at least one fact less than C m+1, and so it is a collection of simple paths of length at most m; therefore, there is a homomorphism from to P n, hence P n⊧q.
Part 2. For the second part of the claim and towards a contradiction, assume that is a uniform limit of such that there exists a non-empty source instance I and a finite universal solution J for I w.r.t. . Note that for every i, we have that , because is a (uniform and, hence also pointwise) limit of the sequence . Then we also have that J⊧∃P i, since J is universal. Since J is finite, this is possible only if J contains a directed cycle.
We can now derive a contradiction as follows. For each positive integer l, let ∃C l be the boolean conjunctive query asserting the existence of a cycle of length l. Then there is no n such that . Thus, must hold for every l, since is a limit of . Hence, J cannot contain cycles.
Since every SO tgd admits universal solutions, it follows that no SO tgd is a (uniform or pointwise) limit of . □
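The two homomorphism facts used in the proof of Proposition 7 can be checked by brute force on small instances: a path query maps into every directed cycle, whereas a cycle query maps into no simple path and not into every cycle. The sketch below is illustrative only; the function names `holds`, `path`, and `cycle` are hypothetical, and queries are encoded as lists of F-atoms over variable names.

```python
from itertools import product

def holds(query_atoms, edges):
    """True if the Boolean CQ given by F(u, v) atoms maps homomorphically into edges."""
    variables = sorted({v for atom in query_atoms for v in atom})
    nodes = sorted({v for e in edges for v in e})
    for image in product(nodes, repeat=len(variables)):
        h = dict(zip(variables, image))
        if all((h[u], h[v]) in edges for u, v in query_atoms):
            return True
    return False

def path(n):
    """Simple directed path with n nodes."""
    return {(i, i + 1) for i in range(n - 1)}

def cycle(k):
    """Simple directed cycle with k nodes (a self-loop for k = 1)."""
    return {(i, (i + 1) % k) for i in range(k)}

path_query = [("z1", "z2"), ("z2", "z3")]   # path on three variables
cycle_query = [("a", "b"), ("b", "a")]      # directed cycle of length 2

print(all(holds(path_query, cycle(k)) for k in range(1, 7)))   # True
print(any(holds(cycle_query, path(n)) for n in range(2, 7)))   # False
print(holds(cycle_query, cycle(3)))                            # False
```

This mirrors the tension in Part 2 of the proof: a finite universal solution would have to satisfy every path query, hence contain a cycle, while no cycle query is a certain answer.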
By Theorem 1, every SO tgd is the uniform limit of a sequence of GLAV mappings. Proposition 7 implies that the converse is false, even for sequences of LAV mappings.
In the previous section, we showed that a sequence of GAV mappings has a GAV mapping as a pointwise limit if and only if it has a pointwise limit that allows for C Q-rewriting. Is there some structural property that characterizes when a sequence of premise-bounded GLAV mappings has a GLAV mapping as a pointwise limit (which, for premise-bounded mappings, is the same as a uniform limit)? We will show that the property of admitting universal solutions is the key to this question. Specifically, we have the following result.
Theorem 5
Let $(\mathcal{M}_n)_{n \geq 1}$ be a premise-bounded sequence of GLAV mappings. The following statements are equivalent.
$(\mathcal{M}_n)_{n \geq 1}$ has a GLAV mapping as a uniform limit.
$(\mathcal{M}_n)_{n \geq 1}$ has a uniform limit that admits universal solutions.
Moreover, if $(\mathcal{M}_n)_{n \geq 1}$ is a sequence of LAV mappings, then $(\mathcal{M}_n)_{n \geq 1}$ has a LAV mapping as a uniform limit if and only if it has a uniform limit that admits universal solutions.
We now give two lemmas which will be used in the proof of Theorem 5, but are also of interest in their own right.
Lemma 2
If $\mathcal{M}$ is the uniform limit of a sequence $(\mathcal{M}_n)_{n \geq 1}$ of schema mappings each of which allows for CQ-rewriting, then $\mathcal{M}$ also allows for CQ-rewriting.
Proof 12
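Let q be a target conjunctive query with m variables. Since $\mathcal{M}$ is a uniform limit of $(\mathcal{M}_n)_{n \geq 1}$, there exists an integer $n_0$ such that for every $n \geq n_0$ and every source instance I, we have that $cert(q, I, \mathcal{M}_n) = cert(q, I, \mathcal{M})$. In particular, $cert(q, I, \mathcal{M}_{n_0}) = cert(q, I, \mathcal{M})$. Since $\mathcal{M}_{n_0}$ allows for CQ-rewriting, there is a source conjunctive query q′ such that $cert(q, I, \mathcal{M}_{n_0}) = q'(I)$, for every source instance I. Hence, $cert(q, I, \mathcal{M}) = q'(I)$ holds, for every source instance I. □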
It should be noted that the conclusion of Lemma 2 does not hold, in general, if is a pointwise limit of a sequence of schema mappings each of which allows for C Q-rewriting. Indeed, if is the sequence of GAV mappings in the proof of Proposition 5, then Theorem 3 and Proposition 5 imply that no pointwise limit of allows for C Q-rewriting.
Lemma 3
Let $\mathcal{M}$ be a uniform limit of a sequence $(\mathcal{M}_n)_{n \geq 1}$ of LAV mappings. If $\mathcal{M}$ admits universal solutions, then it is closed under unions.
Proof 13
The proof proceeds through several stages and involves four claims, each of which builds on preceding ones. We first state the claims without proof and then use the last claim to show the desired conclusion. After this, we complete the proof of the lemma by proving each claim.
We first modify the notion of C Q-equivalence by limiting the number of atoms of C Q s, rather than the number of variables. This yields an equivalent notion of uniform limit.
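For ℓ ≥ 1, we define $CQ'_\ell = \{q \in CQ \mid length(q) \leq \ell\}$, where length(q) denotes the number of atoms in q.
We say that two schema mappings $\mathcal{M}$ and $\mathcal{M}'$ are $CQ'_\ell$-equivalent, denoted by $\mathcal{M} \equiv_{CQ'_\ell} \mathcal{M}'$, if for every source instance I and for every $q \in CQ'_\ell$, we have that $cert(q, I, \mathcal{M}) = cert(q, I, \mathcal{M}')$.
We say that $\mathcal{M}$ is the u′-limit of a sequence $(\mathcal{M}_n)_{n \geq 1}$, denoted by $\mathcal{M}_n \xrightarrow{u'} \mathcal{M}$, if for every ℓ, there exists $n_0$ such that for all $n \geq n_0$, it holds that $\mathcal{M}_n \equiv_{CQ'_\ell} \mathcal{M}$.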
Claim A.
The notions of u ′-limit and uniform limit coincide. Formally, for every sequence of schema mappings and every schema mapping , we have that if and only if .
Next, we use the given sequence to construct another sequence of LAV mappings that possesses some desirable properties. To define the sequence , we need another claim.
Claim B.
Assume that . Then, there exists a strictly increasing sequence (n i)i ≥ 1 of positive integers, such that for every ℓ ≥ 1 and for every n ≥ n ℓ, we have that .
Let (n i)i ≥ 1 be the strictly increasing sequence of positive integers according to Claim B. We define the sequence of LAV mappings as follows:
Here, T(τ, ℓ) contains all LAV constraints obtained from τ by restricting the conclusion to at most ℓ atoms. Formally, let τ = A(x) →∃y A 1(x, y) ∧… ∧ A r(x, y) and let {j 1,…, j p}⊆{1,…r} for p ≥ 1. Define τ[j 1,…, j p]:= . We define
Claim C.
Let (n i)i ≥ 1 be the strictly increasing sequence of positive integers according to Claim B and let be the sequence of LAV mappings constructed above. Then, for every ℓ ≥ 1, the following properties hold: (i) for every n ≥ n ℓ, we have that ; (ii) the conclusion of every LAV constraint in is of length at most ℓ.
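The restriction operation T(τ, ℓ) used in this construction keeps, for a LAV constraint τ, every non-empty selection of at most ℓ conclusion atoms. A minimal sketch follows; it assumes constraints are represented as a premise string paired with a list of conclusion atoms, leaves the existential quantifier prefix implicit, and uses the hypothetical function name `restrictions`.

```python
from itertools import combinations

def restrictions(premise, conclusion_atoms, ell):
    """All constraints obtained from (premise -> conclusion) by keeping
    between 1 and ell atoms of the conclusion (assumed reading of T(tau, ell))."""
    out = []
    for p in range(1, min(ell, len(conclusion_atoms)) + 1):
        for subset in combinations(conclusion_atoms, p):
            out.append((premise, subset))
    return out

tau = ("A(x)", ["A1(x,y)", "A2(x,y)", "A3(x,y)"])
for premise, atoms in restrictions(*tau, ell=2):
    print(premise, "->", " & ".join(atoms))
```

With three conclusion atoms and ℓ = 2 this yields six restricted constraints (three with one atom, three with two), each with a conclusion of length at most ℓ, which is property (ii) of Claim C.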
We now make the following claim about the sequence .
Claim D.
For every source instance I, there exists an integer n 0 ≥ 1 such that for every I ′⊆ I, we have that .
Next, we use Claim D to show that is closed under unions, i.e., given and , we must show that with I = I 1 ∪ I 2 and J = J 1 ∪ J 2. From Claim D, we know that there exists n 0 such that , for every I ′⊆ I. In particular, I 1, I 2 ⊆ I. Hence, for each i ∈{1,2}, we have , that is, and . Since is a LAV mapping, it is closed under unions. Hence, , and, since , we conclude that , i.e., .
To complete the proof of the lemma, it remains to prove Claims A-D.
Claim A.
The notions of u ′-limit and uniform limit coincide. Formally, for every sequence of schema mappings and every schema mapping , we have that if and only if . (⇒) Assume . We have to show that also holds. Consider an arbitrary ℓ ≥ 1 and let r be the maximal arity of the target schema of . Any conjunctive query with at most ℓ atoms can have at most m = ℓ ⋅ r variables. Hence, the inclusion C Q ′ ℓ ⊆C Q m holds.
We are assuming . Hence, there exists n 0(m) such that for all n ≥ n 0(m), we have that . That is, for all q ∈C Q m and for all I, it holds that . Since C Q ′ ℓ ⊆C Q m, we may conclude that for all q ∈C Q ′ ℓ and for all I, it holds that . Hence, indeed holds.
(⇐) Assume . We have to show that also holds. Consider an arbitrary m ≥ 1. As above, let r be the maximal arity of the target schema of . Moreover, let p be the number of target relation symbols. Any conjunctive query with at most m variables can have at most ℓ = p ⋅ m r atoms. Hence, the inclusion C Q m ⊆C Q ′ ℓ holds.
We are assuming . Hence, there exists n 0(ℓ) such that for all n ≥ n 0(ℓ), we have that . That is, for all and for all I, it holds that . Since C Q m ⊆C Q ′ ℓ, we may conclude that for all q ∈C Q m and for all I, it holds that . Hence, indeed holds.
Claim B.
Assume that . Then, there exists a strictly increasing sequence (n i)i ≥ 1 of positive integers, such that for every ℓ ≥ 1 and for every n ≥ n ℓ, we have that . Since , for each ℓ ≥ 1 there exists an integer such that for all , we have that . We may choose n ℓ as follows to ensure strict monotonicity:
…Then the sequence (n i)i ≥ 1 is strictly increasing and for all ℓ ≥ 1 and for all n ≥ n ℓ, we have that .
Claim C.
Let (n i)i ≥ 1 be the strictly increasing sequence of positive integers according to Claim B and let be the sequence of LAV mappings constructed above. Then, for every ℓ ≥ 1, the following properties hold: (i) for every n ≥ n ℓ, we have that ; (ii) the conclusion of every LAV constraint in is of length at most ℓ. Consider an arbitrary ℓ ≥ 1. By the construction of the sequence , every LAV constraint in has a conclusion of length at most ℓ. Hence, property (ii) clearly holds.
To prove property (i), consider an arbitrary n ≥ n ℓ. We have to show that , i.e., for arbitrary source instance I and arbitrary conjunctive query q ∈C Q ′ ℓ, we have to show that . By Claim B, we have . Hence, it suffices to show that holds. We prove the two inclusions separately.
By the construction of , we clearly have . From this, it follows immediately that .
For the reverse inclusion, consider an arbitrary tuple . Then, there exists a homomorphism with h(z) = a, where z denotes the free variables of q. Let with k ≤ ℓ. By construction, is obtained by restricting the conclusions of the LAV constraints in all possible ways to at most ℓ atoms. Hence, since k ≤ ℓ, we have that also contains the set {A 1,…A k} of atoms (up to renaming of labeled nulls). Thus, there exists a homomorphism and h(h n(⋅)) is a homomorphism with h(z) = a. Therefore, holds.
Before presenting the proof of Claim D, we need to bring the notion of fact block size into the picture; this notion was introduced in [7].
Fact Blocks. Let J be an instance. The Gaifman graph of facts G J of J is the graph whose nodes are the facts of J and in which two facts are joined by an edge if they have a null in common. The fact blocks (or f-blocks) of J are the sets of nodes of the connected components of G J. The block size of an undirected graph G is the size of a largest connected component of G, where the size of a component is its number of nodes. The fact block size (f-block size) of an instance J is the block size of the Gaifman graph of facts of J.
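To make the definition concrete, here is a minimal Python sketch that computes the f-blocks and the f-block size of a small instance. The encoding is an assumption of this sketch, not the paper's: a fact is a pair (relation name, tuple of terms), and labeled nulls are strings starting with an underscore.

```python
from collections import defaultdict

def is_null(term):
    # Assumption of this sketch: labeled nulls are strings starting with "_".
    return isinstance(term, str) and term.startswith("_")

def fact_blocks(instance):
    """Return the f-blocks of an instance, i.e., the connected components
    of its Gaifman graph of facts. A fact is a pair (relation, args)."""
    facts = list(instance)
    # Index: null -> positions of the facts in which it occurs.
    facts_with_null = defaultdict(list)
    for idx, (_, args) in enumerate(facts):
        for term in set(args):
            if is_null(term):
                facts_with_null[term].append(idx)

    seen, blocks = set(), []
    for start in range(len(facts)):
        if start in seen:
            continue
        seen.add(start)
        stack, block = [start], set()
        while stack:  # depth-first search along shared nulls
            i = stack.pop()
            block.add(facts[i])
            for term in facts[i][1]:
                if is_null(term):
                    for j in facts_with_null[term]:
                        if j not in seen:
                            seen.add(j)
                            stack.append(j)
        blocks.append(block)
    return blocks

def f_block_size(instance):
    """Number of facts in a largest f-block (0 for the empty instance)."""
    return max((len(block) for block in fact_blocks(instance)), default=0)

# Example instance: E(a,_u1), E(_u1,_u2), E(a,b), P(_u3).
J = {("E", ("a", "_u1")), ("E", ("_u1", "_u2")), ("E", ("a", "b")), ("P", ("_u3",))}
blocks = fact_blocks(J)
```

In this example the two facts sharing the null `_u1` form one f-block, while the null-free fact E(a, b) and the fact P(_u3) each form a singleton f-block.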
Claim D.
For every source instance I, there exists an integer n 0 ≥ 1 such that for every I ′⊆ I, we have that . Consider an arbitrary I ′⊆ I. Let J denote a universal solution for I ′ w.r.t. and let . We set ℓ = size(J), where size (J) denotes the number of atoms in J. Moreover, we set n 0 = n ℓ from the construction of . We claim that n 0 has the desired property. The proof proceeds in three steps, namely, we will show (i) J → J ′, (ii) J ′→ J, and, finally, (iii) .
(i) Let u = (u 1,…, u i) in J denote the labeled nulls in J and let y = (y 1,…, y i) denote a vector of pairwise distinct variables. Consider the boolean conjunctive query ∃y q J whose atoms are the atoms in J where we instantiate the labeled nulls u = (u 1,…, u i) with y = (y 1,…, y i). Clearly q J → J holds and, therefore, also .
Since and ∃y q J ∈C Q ′ ℓ, also holds. Hence, there exists a homomorphism h ′: q J → J ′, which can be easily transformed into a homomorphism h: J → J ′ by setting h(u α) = h ′(y α) for every α ∈{1,…, i}.
(ii) For every f-block F ′ of J ′, we consider the boolean conjunctive query whose atoms are the atoms in F ′ and z = (z 1,…, z i) instantiates the labeled nulls v = (v 1,…, v i) in F ′ with pairwise distinct variables. Clearly, for every F ′, we have and, therefore, also .
Since all LAV-constraints in have conclusion size bounded by ℓ, the number of atoms in any f-block of J ′ is bounded by ℓ. Hence, for every F ′, the corresponding conjunctive query is in . Since , we have that . Hence, for every f-block F ′ of J ′, there exists a homomorphism , which can easily be transformed into a homomorphism by setting for every α ∈{1,…, i}. These homomorphisms from the f-blocks of J ′ to J can be combined to the desired homomorphism with h ′: J ′→ J.
(iii) Finally, we show that holds.
“ ⊆”: Let . Since J ′ is a universal solution for I ′ w.r.t. , there exists a homomorphism g ′ : J ′→ K. By composing g ′ with the homomorphism h : J → J ′, we obtain a homomorphism from J to K. By the closure under target homomorphisms, we conclude that
“ ⊇”: Now let . Since J is a universal solution for I ′ w.r.t. , there exists a homomorphism g : J → K. By composing g with the homomorphism h ′ : J ′→ J, we obtain a homomorphism from J ′ to K. Since the LAV mapping is closed under target homomorphisms, we conclude that .
The proof of Lemma 3 is now complete.
□
We now have all the tools needed to present the proof of Theorem 5. Before doing so and for the sake of readability, we reproduce its statement.
Let be a premise-bounded sequence of GLAV mappings. The following statements are equivalent.
(1) has a GLAV mapping as a uniform limit.
(2) has a uniform limit that admits universal solutions.
Moreover, if is a sequence of LAV mappings, then has a LAV mapping as a uniform limit if and only if has a uniform limit that admits universal solutions.
Proof 14 (Proof of Theorem 5)
The direction (1) ⇒ (2) is obvious. For the direction (2) ⇒ (1), we start with the case when is a sequence of LAV mappings.
Assume that is a uniform limit of a sequence of LAV mappings and that admits universal solutions. Without loss of generality, we may also assume that is closed under target homomorphisms. Indeed, if we let be the schema mapping obtained by closing under target homomorphisms, then is also a uniform limit of and it admits universal solutions; this is so because the notion of uniform limit is based on CQ-equivalence and conjunctive queries are preserved under homomorphisms. Then the schema mapping has the following properties:
allows for C Q-rewriting (by Lemma 2);
admits universal solutions (by hypothesis);
is closed under target homomorphisms (by hypothesis);
is closed under unions (by Lemma 3).
Theorem 3.1 in [19] asserts that if a schema mapping admits universal solutions, allows for query rewriting, and is closed under both target homomorphisms and unions, then it is logically equivalent to a LAV mapping. Consequently, we have that is logically equivalent to a LAV mapping.
For the case when is a sequence of premise-bounded GLAV mappings (but not necessarily LAV mappings), we apply yet another structural characterization of GLAV mappings from [19], namely, Theorem 3.9, which asserts that if a schema mapping allows for CQ-rewriting, admits universal solutions, is closed under target homomorphisms, and is n-modular, for some fixed n, then it is logically equivalent to a GLAV mapping.
Let k be the constant bounding the length of premises in . As in the proof of Lemma 3, we construct a sequence , in which the premises of the tgds are the same as those of the tgds in ; hence each tgd in has at most k atoms in its premise. The same argument as in the proof of Lemma 3 establishes the following analog of Claim D.
Claim D (in the proof of Lemma 3). For every source instance I, there exists an integer n 0 ≥ 1 such that for every I ′⊆ I, we have that .
Now, since each tgd in every element of has at most k atoms in its premise, it follows that there is a positive integer N k so that each mapping in is N k-modular. It is easy to see that N k ≤ k ⋅ r holds, where r is the maximum relation arity in the source schema.
We now prove that is N k-modular. Assume that J is not a solution for I w.r.t. . Take an integer n 0 as in Claim D and consider . It follows that J is not a solution for I w.r.t. . Since is N k-modular, there is a subinstance I ′ of I such that J is not a solution for I ′ w.r.t. and |dom(I ′)| ≤ N k. Again by Claim D, we have that J is not a solution for I ′ w.r.t. , hence M is N k-modular.
Thus, has the following properties: it admits C Q-rewriting (since it is the uniform limit of GLAV mappings that admit C Q-rewriting), it admits universal solutions, is closed under target homomorphisms (if it is not, we take its closure before we begin the construction), and, as just shown, it is N k-modular. Consequently, by Theorem 3.9 in [19], we have that is logically equivalent to a GLAV schema mapping, which completes the proof. □
We conclude this section with a conjecture concerning uniform limits of arbitrary sequences of GLAV mappings.
Conjecture 1
The following statements are equivalent for a sequence of GLAV mappings.
has an SO tgd as a uniform limit.
has a uniform limit that admits universal solutions.
It is not hard to show that the preceding conjecture is implied by a conjecture in [2] to the effect that the language of plain SO-tgds can be characterized by the following three properties: allowing for CQ-rewriting, admitting universal solutions, and closure under target homomorphisms.
Metric Space Completion and Generalized Schema Mappings
Let T be a schema containing a binary relation symbol. By Proposition 2, the metric space is not complete, i.e., there are Cauchy sequences of elements of that have no limit in . Let be the completion of . As described in Section 2, the elements of are the equivalence classes of Cauchy sequences of elements of , where two Cauchy sequences and are equivalent if . Clearly, this is a rather abstract description of . In this section we show that, in many cases, the elements of can be represented by suitably constructed infinite T-instances. In turn, this result and basic results about complete metric spaces imply that the (pointwise or uniform) limits of a Cauchy sequence of schema mappings can be represented by a generalized schema mapping, that is, a schema mapping in which infinite solutions are allowed. We also establish a tight connection between these results and the representation of structural limits in the monograph by Nešetřil and Ossona de Mendez [15].
Representing Limits of Cauchy Sequences in the Metric Completion
Let T be a schema. Recall that, by definition, a T-instance is a finite set of facts. In what follows, we will also consider infinite T-instances, where, by definition, an infinite T-instance is an infinite set I of facts R i(t 1,…, t m). The term T-instance will continue to denote a finite T-instance, but, at times and for emphasis or disambiguation, we will also use the term finite T-instance, especially in contexts in which infinite T-instances are also considered. According to Definitions 2 and 3, the notion of the distance between two sets of finite instances has been defined using the notion of C Q n-equivalence, where two sets and of finite T-instances are C Q n-equivalent, denoted , if it holds that , for all q ∈C Q n. The notion of C Q n-equivalence naturally extends to arbitrary (i.e., finite or infinite) T-instances. Hence, also the notions of similarity and distance, both of which were defined via C Q n-equivalence, immediately carry over to sets of arbitrary T-instances. Furthermore, the set of sets of arbitrary T-instances forms a pseudometric space, in which we can speak about Cauchy sequences and limits.
Definition 8
Let T be a schema.
- Let and be two sets of finite T-instances. We say that is an isomorphic copy of with nulls named apart if
- For every member J of , there is a member J ′ of that is an isomorphic copy of J via an isomorphism that renames nulls.
- Every member J ′ of is an isomorphic copy of some member J of via an isomorphism that renames nulls.
- No two members of have nulls in common.
- If is a set of finite T-instances, then denotes the union of all members of (where each member of is viewed as a set of facts).
- If is a set of finite T-instances, then denotes the set consisting of the unions of isomorphic copies of with nulls named apart, i.e.,
Several remarks are in order now.
Let be a set of finite T-instances. Clearly, if is finite, then is a finite T-instance, while if is infinite, then is an infinite T-instance. Note also that if is a set of finite T-instances such that at least one instance in contains nulls, then is infinite (even if is a finite set).
- According to Definition 4, if J is a T-instance whose active domain contains nulls only, then v(J) is the set of all T-instances that are isomorphic copies of J via an isomorphism that renames nulls. This notation makes sense also for infinite T-instances J whose active domains contain nulls only. With this in mind, observe that if is a finite set of T-instances and if is an isomorphic copy of with nulls named apart, then
As a concrete example, if , where K n is a clique of size n in which every node is a null, then the members of are precisely the disjoint unions of cliques of all finite sizes in which every node is a null.
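For a finite set of finite instances, the operation of Definition 8 (take isomorphic copies with nulls named apart, then take the union of the copies) can be sketched in a few lines of Python. The encoding is an assumption of this sketch: facts are pairs (relation name, tuple of terms), nulls are strings starting with an underscore, and the fresh null names `_n0`, `_n1`, … are hypothetical. The infinite case discussed in the text is, of course, not covered.

```python
from itertools import count

def is_null(term):
    # Assumption of this sketch: labeled nulls are strings starting with "_".
    return isinstance(term, str) and term.startswith("_")

def rename_nulls_apart(instances):
    """Return isomorphic copies of the given instances (constants are kept,
    nulls are renamed injectively) such that no two copies share a null."""
    fresh = count()
    copies = []
    for inst in instances:
        renaming = {}  # null of this instance -> fresh null
        copy = set()
        for rel, args in inst:
            new_args = []
            for term in args:
                if is_null(term):
                    if term not in renaming:
                        renaming[term] = f"_n{next(fresh)}"
                    new_args.append(renaming[term])
                else:
                    new_args.append(term)
            copy.add((rel, tuple(new_args)))
        copies.append(copy)
    return copies

def union_named_apart(instances):
    """Union of copies of the instances with nulls named apart."""
    return set().union(*rename_nulls_apart(instances))

def clique(n):
    """K_n over the nulls _v0, ..., _v(n-1), as a set of E-facts."""
    return {("E", (f"_v{i}", f"_v{j}")) for i in range(n) for j in range(n) if i != j}

# K_2 and K_3 reuse the null names _v0 and _v1; naming apart makes them disjoint.
U = union_named_apart([clique(2), clique(3)])
nulls_of_U = {t for _, args in U for t in args if is_null(t)}
```

Without the renaming step, the plain union of K 2 and K 3 over the same null names would collapse into a single copy of K 3; after naming nulls apart, the union is the disjoint union of the two cliques.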
Definition 9
Let q be a conjunctive query over the schema T with k free variables, k ≥ 0, and let a be a k-tuple of constants (if k = 0, then a = (), i.e., a is the empty tuple).
We write q(a) to denote the T-instance J obtained from q and a by (i) substituting the free variables of q by the respective elements of a; (ii) replacing the existential variables of q by fresh distinct labeled nulls; and (iii) treating the resulting body atoms of q as facts of the T-instance J.
Note that if q is a boolean query (in which case a = ()), then q(()) is the canonical database of q, i.e., the T-instance whose active domain is the set of variables of q viewed as distinct nulls and whose facts are the atoms of q. Conversely, every T-instance J whose active domain consists entirely of nulls is the canonical database of a boolean conjunctive query.
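The construction of q(a) in Definition 9 is easy to mechanize. The Python sketch below assumes a simple, hypothetical encoding: a conjunctive query is given by its list of free variables and its list of atoms (relation name, tuple of variable names), and fresh labeled nulls are named `_y0`, `_y1`, …; atoms are assumed to contain variables only.

```python
from itertools import count

def canonical_instance(free_vars, atoms, a):
    """Build the instance q(a): free variables are replaced by the constants
    in a, and each existential variable by a fresh, distinct labeled null."""
    if len(free_vars) != len(a):
        raise ValueError("a must have one constant per free variable")
    fresh = count()
    assignment = dict(zip(free_vars, a))
    facts = set()
    for rel, args in atoms:
        new_args = []
        for var in args:
            if var not in assignment:  # existential variable seen for the first time
                assignment[var] = f"_y{next(fresh)}"
            new_args.append(assignment[var])
        facts.add((rel, tuple(new_args)))
    return facts

# q(x) = ∃y ∃z E(x, y) ∧ E(y, z), instantiated with a = (c).
q_of_c = canonical_instance(["x"], [("E", ("x", "y")), ("E", ("y", "z"))], ("c",))

# Boolean query ∃x ∃y E(x, y): q(()) is its canonical database.
canonical_db = canonical_instance([], [("E", ("x", "y"))], ())
```

For a boolean query the tuple a is empty, so every variable becomes a null and the result is the canonical database described above.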
Before stating the main result of this section, we need to introduce one more concept. Let be a set of finite or infinite T-instances. We say that is closed under isomorphisms that rename nulls if for every (finite or infinite) T-instance J in and for every (finite or infinite) T-instance J ′ that is an isomorphic copy of J via an isomorphism that renames nulls, we have that J ′ is also in . Note that if is a set of finite T-instances, then is closed under isomorphisms that rename nulls. Moreover, if is a schema mapping between S and T, then, for every source instance I, the set of the solutions of I w.r.t. is closed under isomorphisms that rename nulls (see Definition 1).
Theorem 6
Let be a Cauchy sequence of elements of such that each is closed under isomorphisms that rename nulls. Then the limit of the sequence is the set , where
Proof 15
We have to show that, for every m ≥ 1, there is some n 0 such that for every q ∈C Q m and every n ≥ n 0, we have that . This will be done in two steps, as follows.
Step 1: We will show that, for every m ≥ 1, there is some n 1 such that , for every q ∈C Q m and every n ≥ n 1.
Step 2: We will show that, for every m ≥ 1, there is some n 2 such that , for every q ∈C Q m and every n ≥ n 2.
Then, given m ≥ 1, we can take n 0 = max{n 1, n 2}.
We start by pointing out that for every n ≥ 1 and every q ∈C Q, the certain answers consist entirely of null-free tuples. This follows from the assumption that is closed under isomorphisms that rename nulls (the proof is essentially the same as the proof of Proposition 1 in Section 2). Moreover, for every q ∈C Q, the certain answers also consist entirely of null-free tuples. This is so because contains isomorphic copies of having no nulls in common (e.g., if v 1,…, v n,… is a list of all nulls, then contains an isomorphic copy of in which all nulls have even index and an isomorphic copy of in which all nulls have odd index). Thus, we only need to focus on tuples of constants as possible certain answers.
To prove Step 1, since the sequence is Cauchy, for every m ≥ 1, there is some n 1 such that if s ≥ n 1 and t ≥ n 1, then . We now claim that , for every q ∈C Q m and every n ≥ n 1. Indeed, assume that q ∈C Q m and let a be a (possibly empty) tuple of constants in , where n ≥ n 1. It follows that , for every j ≥ n 1, hence the finite T-instance q(a) is in the set . Consequently, , for every isomorphic copy of with nulls named apart, which implies that .
To prove Step 2, we will first show that the set D of constants occurring in is finite (note that D is also the set of constants occurring in ). As a stepping stone, we will show the finiteness of a set D ′ that is defined next.
A single-atom conjunctive query is a query of the form ∃y R(x, y), where R is a relation symbol in the schema T. Let D ′ be the set of all constants b for which there is a single-atom query q and an index p, such that b occurs in , for all i ≥ p. We claim that the set D ′ is finite. To see this, observe first that every single-atom query has at most r variables, where r is the maximum arity of the relation symbols in T. Since the sequence is Cauchy, there exists an integer p r such that , for all i ≥ p r. This implies that the certain answers to single-atom conjunctive queries become fixed in starting from the index p r, which depends only on the schema T. By definition, the certain answers hold in every instance in . Since consists entirely of finite instances, the set D ′ must be finite as well.
To complete the proof of the finiteness of D, we will show that D ⊆ D ′. Let a be a tuple of constants for which there is a conjunctive query q and an index p, such that , for all i ≥ p. Let s be the number of atoms of q and consider the single-atom queries that cover q in the following sense: for every j with 1 ≤ j ≤ s, the atom of is the j-th atom of q, and y j contains exactly the free variables of q that occur in this atom. Let a j be the tuple of elements from a assigned to the variables y j. Clearly, every element of a is an element of some a j, 1 ≤ j ≤ s. Observe that implies that , hence we have that , for every i ≥ p. Thus, each element of a j, 1 ≤ j ≤ s, is an element of D ′. This shows that D ⊆ D ′ holds, hence D is a finite set.
We now return to the proof of Step 2. We will show that for every m ≥ 1, there is some n 2 such that , for every q ∈C Q m and every n ≥ n 2. Assume that q ∈C Q m and let a be a tuple of constants such that . Then, for every instance , we have that a ∈ q(J), hence there is a homomorphism h from the variables of q to the active domain of J such that the tuple of the free variables of q is mapped to a and the atoms of q are mapped to facts of J. Let s be the number of atoms of q and let f 1,…, f s be the facts of J that are the images of the atoms of q under the homomorphism h. Up to renaming nulls, each fact f j is a fact of some finite T-instance of the form q j(b j), where q j is a conjunctive query and b j is a tuple of constants such that , for all sufficiently large i. Let n q(a) be an index such that for every i ≥ n q(a), we have that holds, for 1 ≤ j ≤ s. Furthermore, let n 2 be the maximum such index n q(a), for all q in C Q m and for all tuples a in D. Such an index exists (i.e., it is a finite number) because both the set C Q m and the set of tuples of elements of D of length at most m are finite.
Observe that n 2 has been chosen so that for every tuple a and for every q ∈C Q m with a homomorphism h mapping q(a) to some instance in (and thus to every instance in , by renaming the nulls in the co-domain of h accordingly), every fact f j in h(q(a)) can be mapped further to every instance , n ≥ n 2, via a homomorphism h i defined on the entire f-block of f j. (Recall that, by the definition of , each fact f j instantiates an atom of some conjunctive query q j whose certain answers persist in the sequence ; the bodies of these queries are mapped into instances of after renaming apart the nulls in them, thus ensuring that no two distinct queries end up in the same f-block of an instance of .)
The union of two homomorphisms h 1, h 2 defined on two distinct f-blocks B 1, B 2 is unambiguously defined, and it is a homomorphism on the instance B 1 ∪ B 2, since homomorphisms are the identity on constants and f-blocks do not share nulls. Thus, for an instance and for the image {f 1,…, f s} of q(a) under some homomorphism h, we also have a homomorphism from q(a) to J n, n ≥ n 2, obtained by composing h with a union h 1 ∪⋯ ∪ h s of homomorphisms from the f-blocks of the atoms f 1,…, f s to J n. It follows that , for every n ≥ n 2. This establishes the inclusion , for n ≥ n 2, and completes the proof of the theorem. □
Recall the sequence (v(K n))n ≥ 1 in Proposition 2, where K n is the clique of size n whose vertices are pairwise distinct labeled nulls. By Proposition 2, this sequence is Cauchy, but has no limit in . Theorem 6 tells us how to find the limit in the complete metric space via the conjunctive queries with non-empty certain answers over all but finitely many members of the sequence. Since the instances K n, n ≥ 1, have active domains consisting entirely of nulls, Lemma 1 tells us that we only need to consider boolean conjunctive queries and, moreover, it suffices to evaluate them on each K n. These queries can only use the edge relation E, thus they can be considered as graphs, with the variables representing the vertices. If a query contains a self-loop (i.e., an atom of the form E(z, z) for some variable z), then the query evaluates to false over every K n. On the other hand, if a query contains no self-loop, then it evaluates to true over all but finitely many instances K n. Indeed, let q be a conjunctive query without self-loops and suppose that q contains m variables. It is easy to verify that q evaluates to true over all instances K n with n ≥ m: it suffices to map the variables of q injectively to vertices of K n. Hence, by Theorem 6, the limit of (v(K n))n ≥ 1 is , where is a set of graphs with the following properties: (i) every member of is a graph with no self-loops and with labeled nulls as vertices; (ii) every graph with no self-loops is isomorphic to a graph in . Clearly, is also the limit of (v(K n))n ≥ 1, where is a set of graphs with the following properties: (i) every member of is a clique with labeled nulls as vertices; (ii) every clique is isomorphic to a graph in . Thus, the limit of (v(K n))n ≥ 1 is the set consisting of all disjoint unions of cliques of all finite sizes in which every node is a null. At any rate, it is clear that infinite instances have to be used to represent the limit of (v(K n))n ≥ 1.
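The behavior of boolean conjunctive queries on the cliques K n can be checked by brute force. The Python sketch below is only an illustration under simplifying assumptions: a boolean query over E is encoded as a list of edges between variable names, the vertices of K n are the integers 0, …, n − 1 (standing in for pairwise distinct nulls), and the function names are hypothetical.

```python
from itertools import product

def clique(n):
    """Edge set of K_n on the vertices 0..n-1 (no self-loops)."""
    return {(i, j) for i in range(n) for j in range(n) if i != j}

def holds(query_edges, graph_edges, vertices):
    """Brute-force test of whether the boolean CQ given by query_edges has a
    homomorphism into the graph, i.e., whether it evaluates to true."""
    variables = sorted({v for edge in query_edges for v in edge})
    for image in product(vertices, repeat=len(variables)):
        h = dict(zip(variables, image))
        if all((h[x], h[y]) in graph_edges for x, y in query_edges):
            return True
    return False

# A loop-free query with m = 3 variables: a directed triangle.
triangle = [("x", "y"), ("y", "z"), ("z", "x")]
# A query containing a self-loop.
loop = [("x", "z"), ("z", "z")]

results_triangle = [holds(triangle, clique(n), range(n)) for n in range(1, 6)]
results_loop = [holds(loop, clique(n), range(n)) for n in range(1, 6)]
```

The triangle query fails on K 1 and K 2 but holds on every K n with n ≥ 3, while the query with a self-loop fails on every clique, matching the case distinction in the text.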
Next, we extend our results about limits of Cauchy sequences of instances to limits of Cauchy sequences of mappings. To this end, we first recall two basic results about complete metric spaces.
Proposition 8
Let (Y, d) be a complete metric space and let (f n)n ≥ 1 be a sequence of functions from a set X to Y.
If (f n)n ≥ 1 is a pointwise Cauchy sequence, then (f n)n ≥ 1 has a pointwise limit f : X → Y, where f(x) = lim_{n→∞} f n(x), for every x ∈ X.
If (f n)n ≥ 1 is a uniformly Cauchy sequence, then (f n)n ≥ 1 has a uniform limit. Moreover, the pointwise limit f : X → Y of (f n)n ≥ 1 is also the uniform limit of (f n)n ≥ 1.
The proof of the first part of Proposition 8 is immediate from the definitions; the proof of the second part can be found in any standard book on metric spaces (see, e.g., Proposition 3.6.6 in [18]). In fact, the argument is essentially the same as the one given in the proof of Part 2 of Theorem 4. Note that the second part of Proposition 8 is known as the Cauchy criterion.
We are now ready to obtain concrete representations of the (pointwise or uniform) limits of Cauchy sequences of schema mappings.
Definition 10
Let S, T be two schemas. A generalized schema mapping is a set of pairs (I, J) such that I is a finite S-instance, J is a finite or infinite T-instance, and has the following closure property: if and if J ′ is an isomorphic copy of J via an isomorphism that renames nulls, then .
Corollary 3
Let be a sequence of schema mappings. Consider the generalized schema mapping
If is a pointwise Cauchy sequence, then the schema mapping is the pointwise limit of .
If is a uniformly Cauchy sequence, then the schema mapping is the uniform limit of .
Proof 16
The first part follows from Theorem 6 and the definitions. The second part follows from the first part and Proposition 8. □
Finally, we consider (pointwise or uniformly) Cauchy sequences of schema mappings admitting universal solutions and obtain a different representation of their limits.
Corollary 4
Let be a pointwise Cauchy sequence of schema mappings over a source schema S and a target schema T , each admitting universal solutions.
For every I ∈Inst(S), the sequence is Cauchy, and hence it has a limit in the complete metric space .
- The generalized schema mapping
is a pointwise limit of . Moreover, if is a uniformly Cauchy sequence, then is its uniform limit.
Connections with Representations of Structural Limits
In their recent monograph [15], Nešetřil and Ossona de Mendez considered a notion of distance between instances, as well as sequences of instances and limits of such sequences. In what follows, we describe the main differences between their setting and ours.
The first main difference is that they did not distinguish two classes of domain elements (namely, constants and nulls), as we did here. As a result, in the definition of homomorphism in [15], no special treatment of constants is needed, while, in our setting, constants must always be mapped to themselves. Their notion of homomorphism coincides with ours on instances whose active domains consist of labeled nulls only. Note that this is exactly the scenario we had in Example 1 and Proposition 2, which are both inspired by results in [15].
The second main difference is that the notion of distance in [15] is between a pair of two instances, while our notion of distance is between a pair of two sets of instances. This, of course, raises the question of how the two notions compare if, in our setting, both sets are singletons. We will address this question soon.
The third main difference is that, when cast in terms of the certain answers of conjunctive queries, the notion of distance in [15] involves boolean conjunctive queries only, while ours involves all conjunctive queries (boolean and non-boolean ones).
In what follows, we recall the definition of the similarity measure and the metric from [15] and briefly sketch the approach that Nešetřil and Ossona de Mendez took in representing limits of Cauchy sequences of instances via infinite instances.
Let T be a schema and let J and J ′ be two T-instances. By a slight abuse of notation, we write J → J ′ to denote the existence of a homomorphism from J to J ′ in the sense of Nešetřil and Ossona de Mendez (i.e., not distinguishing two types of domain elements). As mentioned before, if the active domains of J and J ′ contain nulls only, then this notion of homomorphism coincides with the one considered in the context of schema mappings and data exchange (which is the one we used here).
Definition 11
[Left distance in [15]] Let T be a schema and let J, J ′ be two T-instances.
The similarity sim_h(J, J ′) between J and J ′ is the size of the active domain of a smallest instance B such that one of the following two conditions holds: (a) B → J and B ↛ J ′; (b) B ↛ J and B → J ′. If no such finite instance B exists, we let sim_h(J, J ′) = ∞.
The distance dist_h(J, J ′) between J and J ′ is the quantity 2^{−sim_h(J, J ′)}.
Nešetřil and Ossona de Mendez call this distance the “left distance”, because it is defined in terms of homomorphisms from other structures. This is to distinguish the notion from the “right distance” which is defined in terms of homomorphisms to other structures. For our purposes here, only the left distance is relevant. Because of the basic connection between homomorphisms and boolean conjunctive queries, it is easy to see that if J and J ′ are T-instances, then the following statements are equivalent.
sim_h(J, J ′) = m.
m is the largest number such that J and J ′ satisfy the same boolean conjunctive queries with at most m − 1 variables.
How do the notions sim_h of similarity and dist_h of distance compare with our notions sim of similarity and dist of distance? Clearly, this comparison is meaningful only when, in our setting, we consider singletons of instances and, moreover, the active domains of these instances contain nulls only. Recall that, according to the notation introduced in Definition 4, if J is a T-instance whose active domain contains nulls only, then v(J) is the set of all T-instances that are isomorphic copies of J via an isomorphism that renames nulls. The next observation is a direct consequence of Definitions 3 and 11, Lemma 1, and the preceding remarks.
Proposition 9
Let T be a schema and let J and J ′ be two T -instances whose active domains contain nulls only. Then the following statements are true.
sim(v(J), v(J ′)) = sim_h(J, J ′) − 1.
dist(v(J), v(J ′)) = 2 ⋅ dist_h(J, J ′).
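The second identity of Proposition 9 follows from the first by a one-line computation. The derivation below is a sketch under an assumption that is not restated in this section: both distances are taken to be of the form 2 raised to the negated similarity, i.e., dist = 2^{−sim} (Definition 3) and dist_h = 2^{−sim_h} (Definition 11).

```latex
\begin{align*}
\mathit{dist}\bigl(v(J), v(J')\bigr)
  &= 2^{-\mathit{sim}(v(J),\, v(J'))} \\
  &= 2^{-(\mathit{sim}_h(J, J') - 1)} \\
  &= 2 \cdot 2^{-\mathit{sim}_h(J, J')}
   = 2 \cdot \mathit{dist}_h(J, J').
\end{align*}
```

The shift by one in the similarity thus corresponds exactly to the factor 2 between the two distances.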
In what follows, we will write NInst(T) to denote the set of all T-instances whose active domain consists entirely of nulls. The pair (NInst(T), dist_h) is a pseudometric space, so a metric space can be obtained from it by passing to the equivalence classes [J] of target instances J, where [J] consists of all target instances that are homomorphically equivalent to J. As we did for the distance dist and the pseudometric space , we will identify each equivalence class with one of its members.
Cauchy sequences and limits arising from dist_h are called left Cauchy sequences and left limits in [15]. Proposition 9 implies that if (J n)n ≥ 1 is a sequence of elements of NInst(T), then (J n)n ≥ 1 is Cauchy with respect to the distance dist_h if and only if the sequence (v(J n))n ≥ 1 is Cauchy with respect to the distance dist. If (J n)n ≥ 1 is a sequence of elements of NInst(T), then we will write for the limit of the sequence (J n)n ≥ 1 in the metric completion of the space (NInst(T), dist_h). Nešetřil and Ossona de Mendez obtained representations of the left limits of Cauchy sequences of instances by an approach that is based on the homomorphism preorder on instances and on ideals of partial orders.
The existence of homomorphisms between structures gives rise to the preorder ≤h, where L ≤h J if L → J. By passing to the equivalence classes [J] of instances J in NInst(T) modulo homomorphic equivalence, the preorder ≤h becomes a partial order (also denoted by ≤h), where [L] ≤h [J] means that there is a homomorphism from some member of [L] to some member of [J]; this is the same as asserting that, for every pair (L ′, J ′) with L ′∈ [L] and J ′∈ [J], there is a homomorphism from L ′ to J ′. As before, we will not distinguish between equivalence classes and their members. The partial order ≤h extends to a partial order on the metric completion of (NInst(T), dist_h) in the following way.
If (J n)n ≥ 1 and (L n)n ≥ 1 are two Cauchy sequences from NInst(T), then if for every m, there is a positive integer p such that for every i ≥ p, we have that .
As a special case, it is easy to see that if L is an element of NInst(T) and (J n)n ≥ 1 is a Cauchy sequence from NInst(T), then holds if and only if there is a positive integer p such that for every i ≥ p, we have that L → J i (this is the special case of in which L n = L, for all n).
Let (X,≤) be a (finite or infinite) partially ordered set.
A downset is a subset F of X with the property that for all x ∈ F and y ≤ x, also y ∈ F holds.
An ideal is a downset F with the additional property that for all x and y in F, there exists z in F such that both x ≤ z and y ≤ z hold.
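For a finite partially ordered set, the two definitions can be checked directly. The following Python sketch is only an illustration on a toy order (the divisors of 12 under divisibility); the function names and the example are hypothetical and unrelated to the homomorphism order on instances used in the text.

```python
def is_downset(F, X, leq):
    """F is a downset of (X, leq): whenever x is in F and y <= x, y is in F."""
    return all(y in F for x in F for y in X if leq(y, x))

def is_ideal(F, X, leq):
    """F is an ideal: a downset in which any two elements have an upper bound in F."""
    return is_downset(F, X, leq) and all(
        any(leq(x, z) and leq(y, z) for z in F) for x in F for y in F
    )

# Toy partial order: divisors of 12 ordered by divisibility.
X = {1, 2, 3, 4, 6, 12}

def divides(a, b):
    return b % a == 0

checks = {
    "divisors of 6": (is_downset({1, 2, 3, 6}, X, divides), is_ideal({1, 2, 3, 6}, X, divides)),
    "1, 2, 3": (is_downset({1, 2, 3}, X, divides), is_ideal({1, 2, 3}, X, divides)),
    "2, 4": (is_downset({2, 4}, X, divides), is_ideal({2, 4}, X, divides)),
}
```

The set {1, 2, 3} is a downset but not an ideal, because 2 and 3 have no common upper bound inside the set; adding 6 repairs this. The set {2, 4} is not even a downset, since it omits 1.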
In [15], it is shown that there is a correspondence between left limits of Cauchy sequences from NInst(T) and ideals in the partial order (NInst(T),≤h). Before presenting this correspondence, we need to introduce a piece of notation.
If is a set of T-instances, then the disjoint union is the set , where is an isomorphic copy of with nulls named apart. In other words, is the union of copies of all elements of (one copy of each element of ) so that no two members in the union have nulls in common. Clearly, is unique up to isomorphisms that rename nulls.
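Let $(J_n)_{n \ge 1}$ be a Cauchy sequence from NInst(T) and consider its left limit in the metric completion of the space (NInst(T), $\mathit{dist}_h$). Consider the set of all $L$ in NInst(T) that are $\le_h$ this left limit; by the special case discussed above, this is the set
\[ \{\, L \in \mathrm{NInst}(T) : \exists p\, \forall i \ge p\, (L \to J_i) \,\}. \]
It is easy to see that this set is an ideal of (NInst(T), $\le_h$). Indeed, it is a downset because homomorphisms compose. Moreover, if $L_1$ and $L_2$ are in this set, then so is the disjoint union of $L_1$ and $L_2$; furthermore, $L_i$ maps homomorphically into this disjoint union, for $i = 1, 2$. The following is a consequence of Lemma 9.6 and Corollary 9.3 in [15].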
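Proposition 10 ([15])
The following statements are true for the metric completion NInst(T)∗ of (NInst(T), $\mathit{dist}_h$) and the partial order (NInst(T), $\le_h$).
- There is a bijection between NInst(T)∗ and the set of ideals of (NInst(T), $\le_h$), given by mapping each element of NInst(T)∗ to the set of all $L$ in NInst(T) that are $\le_h$ this element.
- If an element of NInst(T)∗ is the left limit of a Cauchy sequence $(J_n)_{n \ge 1}$ from NInst(T), then it can be represented as the disjoint union of its associated ideal.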
We now have all the conceptual and technical apparatus needed to establish a tight connection between the representations of limits given in Theorem 6 and the representation of limits given in Proposition 10.
Let $\{J_n\}_{n \ge 1}$ be a Cauchy sequence (w.r.t. the distance function $\mathit{dist}_h$) such that each $J_n$ is a member of NInst(T), i.e., each $J_n$ is a T-instance whose active domain consists entirely of nulls. Consider its left limit in the metric completion of (NInst(T), $\mathit{dist}_h$). As discussed earlier, the sequence $\{v(J_n)\}_{n \ge 1}$ is Cauchy (w.r.t. the distance function dist), so it has a limit in the metric completion of the space of sets of target instances with respect to dist. The following proposition establishes the close relationship between these two limits.
Proposition 11
Let $\{J_n\}_{n \ge 1}$ be a Cauchy sequence (w.r.t. the distance function $\mathit{dist}_h$) such that each $J_n$ is a member of NInst(T). Then
Proof 17
Theorem 6 tells us that
Since the active domains of the elements of $v(J_n)$ consist entirely of nulls and are pairwise disjoint, we have that only boolean conjunctive queries $q$ contribute to this expression. Moreover, by Lemma 1, the condition $a \in \mathit{cert}(q, \{v(J_i)\})$ means that $\mathit{cert}(q, \{J_i\}) = \mathit{true}$ or, equivalently, that $J_i \models q$. As mentioned earlier, every boolean conjunctive query $q$ can be identified with its canonical database $D_q$. Moreover, $J_i \models q$ if and only if $D_q \to J_i$. Thus, the preceding equation becomes
As explained earlier, the condition $\exists p\, \forall i \ge p\, (L \to J_i)$ is equivalent to the condition that $L$ is $\le_h$ the left limit of $\{J_n\}_{n \ge 1}$, hence the preceding equation becomes
This last equation and the second remark after Definition 8 imply the statement of the proposition.
□
Concluding Remarks
In this paper, we have embarked on a systematic study of the limiting behavior of sequences of schema mappings using concepts and tools from metric spaces. For the important special cases of GAV and LAV mappings, our main results are summarized in Figs. 1 and 2.
Fig. 1. Overall picture for GAV schema mappings
Fig. 2. Overall picture for LAV schema mappings
In words, we have shown that, for GAV mappings, a pointwise Cauchy sequence need not be uniformly Cauchy; moreover, the existence of a pointwise limit does not imply the existence of a uniform limit. This cannot happen for LAV mappings. On the other hand, a uniformly Cauchy sequence of LAV mappings need not even have a pointwise limit, which cannot happen for GAV mappings. We have also shown that structural properties of schema mappings can be used to characterize when the limit of a pointwise Cauchy sequence of GAV (or of LAV) mappings is equivalent to a GAV (or to a LAV) mapping. Finally, we have shown that infinite target instances and generalized mappings (i.e., schema mappings where target instances may be infinite) can be used to represent limits of Cauchy sequences of sets of target instances and limits of Cauchy sequences of arbitrary schema mappings.
We believe that the work reported here has laid the foundation for several interesting lines of subsequent investigation. We have seen that our results about sequences of LAV mappings extend in a natural way to sequences of premise-bounded GLAV mappings; an analogous extension of our results about sequences of GAV mappings to sequences of conclusion-bounded GLAV mappings is left for future work. We have also seen that there are sequences of LAV mappings for which no SO tgd is a uniform limit. Are there structural properties that characterize when a sequence of GLAV mappings has an SO tgd as a pointwise limit? In this vein, we have offered Conjecture 1. A related interesting open problem is whether schema mappings with target constraints are powerful enough to express pointwise limits or uniform limits of sequences of arbitrary GLAV schema mappings. We have some preliminary evidence that this is plausible, but much more work remains to be done.
We believe that the work reported in this paper provides a new perspective on the study of schema mappings by examining them from a dynamic viewpoint. As stated earlier, our original motivation came from schema-mapping optimization and, in particular, from the idea that “complex” schema mappings can be “approximated” by “simpler” ones. It remains to be seen whether the work reported here will lead to applications to schema-mapping optimization. We believe, however, that the study of the limiting behavior of schema mappings via metric spaces is interesting in its own right.
We also note that there are several areas in theoretical computer science where the study of the limiting behavior of objects has produced results that were significant in their own right and also had fruitful consequences. For example, starting with the work of Fagin [4], there has been an extensive investigation of the asymptotic probabilities of logical properties and of 0-1 laws for various logics of interest in computer science. More recently, there has been a study of profinite words, which has found applications to automata theory and to the satisfiability problem for variants of monadic second-order logic (see, e.g., [17, 20]). Note that the profinite words form the completion of a metric space on words in which the distance is based on the size of the largest deterministic finite automaton needed to separate two words. Finally, the connection between graph limits in the monograph [15] by Nešetřil and Ossona de Mendez and the completion of the metric space of sets of target instances, which was discussed in the previous section, may merit further exploration. It should also be pointed out that, motivated by the study of large-scale networks, there has been an extensive body of work on a notion of graph limits arising from converging sequences of homomorphism densities; a detailed account of this work is given in the monograph [13] by Lovász. In addition, Nešetřil and Ossona de Mendez [16] developed a general framework for limits of graphs and relational structures; in that framework, different fragments of first-order logic are used to define different notions of limits arising from converging sequences of the frequencies with which first-order formulas in the fragment at hand are satisfied by an assignment (homomorphism densities correspond to the fragment consisting of all quantifier-free conjunctive queries). Homomorphisms, metric completions, and representations of limits of finite structures play a central role in [13, 16].
Acknowledgements
The research of Reinhard Pichler, Emanuel Sallinger, and Vadim Savenkov was supported by the Austrian Science Fund, projects (FWF):P25207-N23 and (FWF):Y698, and the Vienna Science and Technology Fund, project ICT12-015. The research of Phokion Kolaitis on this paper was partially supported by NSF Grant IIS-1217869. The full version was completed while Kolaitis was visiting the Simons Institute for the Theory of Computing during the fall of 2016. The research of Emanuel Sallinger was supported by the EPSRC programme grant EP/M025268/1.
Footnotes
Allowing for CQ-rewriting means that the certain answers of every conjunctive query over the target schema are definable by a union of conjunctive queries over the source schema; see [19].
A plain SO tgd is an SO tgd that contains no nested terms and no equalities. Every SO tgd is known to be CQ-equivalent to a plain one [2].
This article is part of the Topical Collection on Special Issue on Database Theory
Contributor Information
Phokion G. Kolaitis, Email: kolaitis@cs.ucsc.edu
Reinhard Pichler, Email: pichler@dbai.tuwien.ac.at
Emanuel Sallinger, Email: emanuel.sallinger@cs.ox.ac.uk
Vadim Savenkov, Email: vadim.savenkov@wu.ac.at
References
- 1. Arenas, M., Barceló, P., Libkin, L., Murlak, F.: Foundations of data exchange. Cambridge University Press, Cambridge (2014)
- 2. Arenas, M., Pérez, J., Reutter, J., Riveros, C.: The language of plain SO-tgds: composition, inversion and structural properties. J. Comput. Syst. Sci. 79(6), 763–784 (2013). doi: 10.1016/j.jcss.2013.01.002
- 3. Bernstein, P.A.: Applying model management to classical meta data problems. In: CIDR (2003)
- 4. Fagin, R.: Probabilities on finite models. J. Symb. Log. 41(1), 50–58 (1976). doi: 10.1017/S0022481200051756
- 5. Fagin, R., Kolaitis, P.G.: Local transformations and conjunctive-query equivalence. In: PODS, pp. 179–190 (2012)
- 6. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. Theor. Comput. Sci. 336(1), 89–124 (2005). doi: 10.1016/j.tcs.2004.10.033
- 7. Fagin, R., Kolaitis, P.G., Nash, A., Popa, L.: Towards a theory of schema-mapping optimization. In: PODS, pp. 33–42 (2008)
- 8. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Composing schema mappings: second-order dependencies to the rescue. ACM Trans. Database Syst. 30(4), 994–1055 (2005). doi: 10.1145/1114244.1114249
- 9. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Schema mapping evolution through composition and inversion. In: Schema Matching and Mapping, pp. 191–222. Springer (2011)
- 10. Feinerer, I., Pichler, R., Sallinger, E., Savenkov, V.: On the undecidability of the equivalence of second-order tuple generating dependencies. Inf. Syst. 48, 113–129 (2015). doi: 10.1016/j.is.2014.09.003
- 11. Kolaitis, P.G.: Schema mappings, data exchange, and metadata management. In: PODS, pp. 61–75 (2005)
- 12. Lenzerini, M.: Data integration: a theoretical perspective. In: PODS, pp. 233–246 (2002)
- 13. Lovász, L.: Large networks and graph limits, volume 60 of colloquium publications. American Mathematical Society (2012)
- 14. Madhavan, J., Halevy, A.Y.: Composing mappings among data sources. In: VLDB, pp. 572–583 (2003)
- 15. Nešetřil, J., de Mendez, P.O.: Sparsity - graphs, structures, and algorithms, volume 28 of algorithms and combinatorics. Springer, Berlin (2012)
- 16. Nešetřil, J., de Mendez, P.O.: A unified approach to structural limits, and limits of graphs with bounded tree-depth. arXiv:1303.6471 (2013)
- 17. Pin, J.: Profinite methods in automata theory. In: STACS, pp. 31–50 (2009)
- 18. Shirali, S., Vasudeva, H.: Metric spaces. Springer, Berlin (2006)
- 19. ten Cate, B., Kolaitis, P.G.: Structural characterizations of schema-mapping languages. In: ICDT, pp. 63–72 (2009)
- 20. Torunczyk, S.: Languages of profinite words and the limitedness problem. In: ICALP, pp. 377–389 (2012)


