WatDiv

Benchmark Details

Description of the Dataset

The WatDiv data generator allows users to define their own dataset through a dataset description language (see the tutorial). This way, users can control, among other things, which entities appear in the dataset, which attributes the instances of each entity are associated with, and the probability distributions that govern these associations.

Using these features, the WatDiv test dataset was designed (see the associated dataset description model). By executing the data generator with different scale factors, one can generate test datasets of different sizes. Table 1 lists the properties of the dataset at scale factor=1.

Table 1. Characteristics of the WatDiv test dataset at scale factor=1.

triples                105257
distinct subjects        5597
distinct predicates        85
distinct objects        13258
URIs                     5947
literals                14286
distinct literals        8018

An important characteristic that distinguishes the WatDiv test dataset from existing benchmarks is that instances of the same entity do not necessarily have the same set of attributes. Table 2 lists all the entities used in WatDiv. Take the Product entity, for instance. Product instances may be associated with different Product Categories (e.g., Book, Movie, Classical Music Concert, etc.), but depending on which category a product belongs to, it will have a different set of attributes. For example, products that belong to the category “Classical Music Concert” have the attributes mo:opus, mo:movement, wsdbm:composer and mo:performer (in addition to the attributes that are common to every product), whereas products that belong to the category “Book” have the attributes sorg:isbn, sorg:bookEdition and sorg:numberOfPages.

Furthermore, even within a single product category, not all instances share the same set of attributes. For example, while sorg:isbn is a required attribute for a book, sorg:bookEdition (Pr=0.6) and sorg:numberOfPages (Pr=0.25) are optional attributes, where Pr indicates the probability that an instance will be generated with that attribute. It must also be noted that some attributes are correlated, which means that either all or none of the correlated attributes will be present in an instance (the pgroup construct in the WatDiv dataset description language allows the grouping of such correlated attributes). For a complete list of probabilities, please refer to Tables 3 and 4 in the Appendix.
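The optional-attribute and pgroup mechanisms described above can be sketched as follows. This is a minimal illustration, not WatDiv's actual implementation: the probabilities for sorg:bookEdition and sorg:numberOfPages are the ones quoted in the text, while the pgroup probability and the ex:coverArtist / ex:coverTitle attribute names are made up for the example.

```python
import random

# Optional attributes of the "Book" category, with the probabilities quoted
# in the text: sorg:bookEdition has Pr=0.6, sorg:numberOfPages has Pr=0.25.
OPTIONAL_ATTRS = {"sorg:bookEdition": 0.6, "sorg:numberOfPages": 0.25}

# A pgroup ties correlated attributes together: a single coin flip decides
# whether the whole group is present. Names and probability are illustrative.
PGROUPS = [(0.5, ["ex:coverArtist", "ex:coverTitle"])]

def generate_book(rng: random.Random) -> list[str]:
    attrs = ["sorg:isbn"]                      # required attribute
    for attr, pr in OPTIONAL_ATTRS.items():    # independent optional attributes
        if rng.random() < pr:
            attrs.append(attr)
    for pr, group in PGROUPS:                  # correlated: all or none
        if rng.random() < pr:
            attrs.extend(group)
    return attrs

rng = random.Random(0)
books = [generate_book(rng) for _ in range(10000)]
frac = sum("sorg:bookEdition" in b for b in books) / len(books)
print(f"fraction with sorg:bookEdition ~ {frac:.2f}")
```

Over many instances, the observed attribute frequencies converge to the specified Pr values, while the pgroup attributes always appear (or disappear) together.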

Table 2. Entities generated according to the WatDiv data description model. Entities marked with an asterisk (*) do not scale.

Entity Type               Instance Count [per scale factor, if applicable]
wsdbm:Purchase            1500
wsdbm:User                1000
wsdbm:Offer                900
wsdbm:Topic*               250
wsdbm:Product              250
wsdbm:City*                240
wsdbm:SubGenre*             21
wsdbm:Website               50
wsdbm:Language              25
wsdbm:Country*              25
wsdbm:Genre*                21
wsdbm:ProductCategory*      15
wsdbm:Retailer              12
wsdbm:AgeGroup*              9
wsdbm:Role*                  3
wsdbm:Gender*                2

In short, the WatDiv test dataset is designed such that instances of the same entity may have different sets of attributes, attributes may be optional or correlated, and the associations between entities follow the probability distributions specified in the dataset description model.

Description of the Tests

WatDiv generates test workloads that are as diverse as possible. WatDiv offers three use cases; the two discussed below are basic testing, which uses the query templates shipped with WatDiv, and stress testing, which uses generated query templates.

At this point, you may be wondering how these differentiating aspects of WatDiv affect system evaluation, and why they matter at all. The answer is simple: by relying on such a diverse dataset (which is typical of data on the Web), it is possible to generate test queries that cover a much wider range of query evaluation aspects, which cannot easily be captured by other benchmarks.

Consider the two SPARQL query templates C3 and S7 (cf., basic testing query templates). C3 is a star query that retrieves certain information about users, such as the products they like, their friends, and some demographic information. For convenience, for each triple pattern in the query template, we also display its selectivity (the reported selectivities are estimates based on the probability distributions specified in the WatDiv dataset description model). Note that while the triple patterns in C3 are individually not that selective, the query as a whole is very selective. Now consider S7, which (as a whole) is also very selective, but unlike C3, its selectivity is largely due to a single triple pattern.

It turns out that different systems behave very differently on these queries. Systems like RDF-3x [2], which (i) decompose queries into triple patterns, (ii) find a suitable ordering of the join operations, and then (iii) execute the joins in that order, perform very well on queries like S7 because the first triple pattern they execute is very selective. On the other hand, they do not do as well on queries like C3 because the decomposed evaluation produces many irrelevant intermediate tuples. In contrast, gStore [3] treats the star-shaped query as a whole and can pinpoint the relevant vertices in the RDF graph without performing joins; hence, it is much more efficient at executing C3. For a more detailed discussion of our results, please refer to the technical report [4] and the stress testing paper.
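The contrast between the two selectivity profiles can be made concrete with a back-of-the-envelope calculation. The per-pattern selectivities below are made up for illustration (the real estimates come from the WatDiv dataset description model); assuming the patterns are independent, the selectivity of the whole query is the product of the per-pattern selectivities.

```python
from math import prod

# Illustrative per-triple-pattern selectivities (fraction of candidate
# bindings that survive each pattern). These numbers are invented; the real
# estimates come from WatDiv's dataset description model.
c3 = [0.10, 0.12, 0.15, 0.10]   # C3-style: no single pattern is very selective
s7 = [0.90, 0.0005, 0.85]       # S7-style: one pattern dominates

# Under an independence assumption, overall selectivity is the product.
print(f"C3-like overall selectivity: {prod(c3):.6f}")
print(f"S7-like overall selectivity: {prod(s7):.6f}")

# A join-ordering engine starts from the most selective single pattern:
print(f"best single pattern in C3: {min(c3)}")   # still matches 10% of candidates
print(f"best single pattern in S7: {min(s7)}")   # immediately prunes almost everything
```

Both queries end up highly selective overall, but an engine that evaluates one triple pattern at a time gets a cheap starting point only in the S7-like case; in the C3-like case every starting pattern yields many intermediate tuples, which is exactly the behavior discussed above.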

Installing WatDiv Data, Query and Query Template Generator

Compiling WatDiv (written in C++) is straightforward: the only dependencies are the Boost libraries and the Unix words file (i.e., make sure a wordlist package is installed under /usr/share/dict/). Once you have installed Boost, simply execute the following commands on UNIX:

tar xvf watdiv_v05.tar
cd watdiv
setenv BOOST_HOME <BOOST-INSTALLATION-DIRECTORY>    (in csh/tcsh)
export BOOST_HOME=<BOOST-INSTALLATION-DIRECTORY>    (in bash)
make
cd bin/Release

The last step above is important: the compiled watdiv binary resides under bin/Release, and the commands below assume you execute it from that directory. To run the data generator, issue the following command:

./watdiv -d <model-file> <scale-factor>

You will find a model file in the model sub-directory where WatDiv was installed. Using a scale factor of 1 generates approximately 100K triples; for a more detailed description of the generated dataset, please refer to Table 1. The command prints the generated RDF triples to the standard output while producing a file named saved.txt in the same directory. The following steps depend on this file, so keep it safe.

To run the query generator, issue the following command:

./watdiv -q <model-file> <query-file> <query-count> <recurrence-factor>

Use the same model file in the model sub-directory where WatDiv was installed. You will find the basic testing query templates in the testsuite sub-directory where WatDiv was installed.

To generate more query templates for stress testing (cf., stress testing paper), use the query template generator.

./watdiv -s <model-file> <dataset-file> <max-query-size> <query-count>

In the latest version, you may also specify (i) the number of bound patterns (i.e., constants) in the query (default=1) and (ii) whether join vertices can be constants (default=false). To use these features, execute watdiv with the following signature instead:

./watdiv -s <model-file> <dataset-file> <max-query-size> <query-count> <constant-per-query-count> <constant-join-vertex-allowed?>

References

[1] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and oranges: a comparison of RDF benchmarks and real RDF datasets. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2011, pages 145-156.

[2] T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1): 91-113, 2010.

[3] L. Zou, J. Mo, D. Zhao, L. Chen, and M. T. Özsu. gStore: Answering SPARQL queries via subgraph matching. Proc. VLDB Endow., 4(1): 482-493, 2011.

[4] G. Aluç, M. T. Özsu, K. Daudjee, and O. Hartig. chameleon-db: a workload-aware robust RDF data management system. Technical Report CS-2013-10, University of Waterloo, 2013.