{
	"items" :      [
		{
			"label" :    "A New Suffix Tree Similarity Measure for Document Clustering",
			"type" :     "Paper",
			"author" :   [
				"Hung Chim",
				"Xiaotie Deng"
			],
			"url" :      "http://www2007.org/program/paper.php?id=91",
			"event" :    "Data Mining (3 of 5)",
			"abstract" : "In this paper, we propose a new similarity measure to compute the pairwise similarity of text-based documents based on the suffix tree document model. By applying the new suffix tree similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Our experimental results on two standard document clustering benchmark corpora, OHSUMED and RCV1, indicate that the new clustering algorithm is a very effective document clustering algorithm. Compared with the results of the traditional keyword tfidf similarity measure in the same GAHC algorithm, NSTC achieved an improvement of 51% on the average of F-measure score. Furthermore, we apply the new clustering algorithm in analyzing the Web documents in online forum communities. A topic oriented clustering algorithm is developed to help people in assessing, classifying and searching the Web documents in a large forum community."
		},
		{
			"label" :    "Efficient Search in Large Textual Collection with Redundancy",
			"type" :     "Paper",
			"author" :   [
				"Jiangong Zhang",
				"Torsten Suel"
			],
			"url" :      "http://www2007.org/program/paper.php?id=800",
			"event" :    "Search (1 of 7)",
			"abstract" : "Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges. Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages.<br /><br /> In this paper, we propose a general framework for indexing and query processing of archival collections and, more generally, any collections with a significant amount of redundancy. Our approach results in very significant reductions in index size and query processing costs on such collections, and it is orthogonal to and can be combined with the existing techniques. It also supports highly efficient updates, both locally and over a network. Within this framework, we describe and evaluate several different implementations that trade off index size versus CPU cost and other factors, and discuss applications ranging from archival web search to local search of web sites, email archives, or file systems. We present experimental results based on search engine query log and a large collection consisting of multiple crawls."
		},
		{
			"label" :    "A Framework for Rapid Integration of Presentation Components",
			"type" :     "Paper",
			"author" :   [
				"Jin Yu",
				"Boualem Benatallah",
				"Fabio Casati",
				"Florian Daniel",
				"Maristella Matera",
				"Regis Saint-Paul"
			],
			"url" :      "http://www2007.org/program/paper.php?id=468",
			"event" :    "Web Engineering (1 of 2)",
			"abstract" : "The development of user interfaces (UIs) is one of the most time-consuming aspects in software development. In this context, the lack of proper reuse mechanisms for UIs is increasingly becoming manifest, especially as software development is more and more moving toward composite applications. In this paper we propose a framework for the integration of stand-alone modules or applications, where integration occurs at the presentation layer. Hence, the final goal is to reduce the effort required for UI development by maximizing reuse.<br /><br /> The design of the framework is inspired by lessons learned from application integration, appropriately modified to account for the specificity of the UI integration problem. We provide an abstract component model to specify characteristics and behaviors of presentation components and propose an event-based composition model to specify the composition logic. Components and composition are described by means of a simple XML-based language, which is interpreted by a runtime middleware for the execution of the resulting composite application. A proof-of-concept prototype allows us to show that the proposed component model can also easily be applied to existing presentation components, built with different languages and/or component technologies."
		},
		{
			"label" :    "Effort Estimation: How Valuable is it for a Web company to Use a Cross-company Data Set, Compared to Using Its Own Single-company Data Set?",
			"type" :     "Paper",
			"author" :   [
				"Emilia Mendes",
				"Sergio Di Martino",
				"Filomena Ferrucci",
				"Carmine Gravino"
			],
			"url" :      "http://www2007.org/program/paper.php?id=326",
			"event" :    "Web Engineering (2 of 2)",
			"abstract" : "Previous studies comparing the prediction accuracy of effort models built using Web cross- and single-company data sets have been inconclusive, and as such replicated studies are necessary to determine under what circumstances a company can place reliance on a cross-company effort model. This paper therefore replicates a previous study by investigating how successful a cross-company effort model is: i) to estimate effort for Web projects that belong to a single company and were not used to build the cross-company model; ii) compared to a single-company effort model. Our single-company data set had data on 15 Web projects from a single company and our cross-company data set had data on 68 Web projects from 25 different companies. The effort estimates used in our analysis were obtained by means of two effort estimation techniques, namely forward stepwise regression and case-based reasoning. Our results were similar to those from the replicated study, showing that predictions based on the single-company model were significantly more accurate than those based on the cross-company model."
		},
		{
			"label" :    "Exploring in the Weblog Space by Detecting Informative and Affective Articles",
			"type" :     "Paper",
			"author" :   [
				"Xiaochuan Ni",
				"Gui-Rong Xue",
				"Xiao Ling",
				"Yong Yu",
				"Qiang Yang"
			],
			"url" :      "http://www2007.org/program/paper.php?id=225",
			"event" :    "Industrial Practice and Experience (1 of 1)",
			"abstract" : "Weblogs have become a prevalent source of information for people to express themselves. In general, there are two genres of contents in weblogs. The first kind is about the webloggers' personal feelings, thoughts or emotions. We call this kind of weblogs affective articles. A second kind of weblogs is about technologies and different kinds of informative news. In this paper, we present a machine learning method for classifying informative and affective articles among weblogs. We consider this problem as a binary classification problem. By using machine learning approaches, we achieve 92% on information retrieval performance measures including precision, recall and F1. We set up three studies on the applications of the above classification approach in both research and industrial fields. We use the above classification approach to improve the performance of classification of emotions from weblog articles. We also develop an intent-driven weblog-search engine based on the classification techniques to improve the satisfaction of web users. Finally, we use the above classification approach to search for weblogs with a great deal of informative articles."
		},
		{
			"label" :    "Supervised Rank Aggregation",
			"type" :     "Paper",
			"author" :   [
				"Yuting Liu",
				"Tie-Yan Liu",
				"Tao Qin",
				"Zhiming Ma",
				"Hang Li"
			],
			"url" :      "http://www2007.org/program/paper.php?id=286",
			"event" :    "Search (4 of 7)",
			"abstract" : "This paper is concerned with rank aggregation, the task of combining results of individual ranking functions in meta-search. Previously, rank aggregation was performed mainly by using unsupervised methods. It is hard for the unsupervised approach to improve ranking performances by leveraging the use of labeled data, when such data is available. We propose employing a supervised learning approach to perform the task, which we refer to as 'Supervised Rank Aggregation'. We set up a general framework for conducting rank aggregation with supervised learning, in which learning for rank aggregation is formalized as an optimization issue that minimizes disagreements with the labeled ground truth data. As a case study, we focus on Markov Chain based rank aggregation in this paper. The optimization problem is not a convex optimization problem for Markov Chain based methods, however, and thus is hard to solve. We transform the optimization problem into semi-definite programming and give proofs on the correctness. Experimental results on meta-searches show that Supervised Rank Aggregation can significantly outperform existing unsupervised methods."
		},
		{
			"label" :    "Connecting the bottom of the pyramid: an exploratory case study of India's rural communication environment",
			"type" :     "Paper",
			"author" :   [
				"Sarita Seshagiri",
				"Aman Sagar",
				"Dhaval Joshi"
			],
			"url" :      "http://www2007.org/program/paper.php?id=127",
			"event" :    "Technology for Developing Regions (1 of 2)",
			"abstract" : "This paper is based on our exploratory study of a South Indian village in Chamrajanagar district of Karnataka. The study was to understand the rural communication environment and villagers' communication preferences. We examined people's lifestyle, working conditions and their communication eco-system. Our study revealed that villagers, unlike urban inhabitants, interacted with people outside the village only for specific, rather than casual purposes. Another interesting aspect of rural communication was the marginal use of the postal system and the ubiquitous use of pay phone, apart from word of mouth and face-to-face interactions. In fact, personal (face-to-face) interaction was usually preferred among villages in this region, over other kinds of communication, despite infrastructural constraints like poor transport services.<br /><br /> We also observed that communication frequency increased when status quo changed to one that required immediate attention. During the analysis we identified certain social, economic and cultural communication gaps (or problems). However, these problems were clear opportunities to connect the unconnected rural users, by deploying new communication systems and features. Here, we have highlighted some of our findings and possible design avenues based on these findings."
		},
		{
			"label" :    "GeoTracker: Geospatial and Temporal RSS Navigation",
			"type" :     "Paper",
			"author" :   [
				"Yih-Farn Chen",
				"Giuseppe Di Fabbrizio",
				"David Gibbon",
				"Rittwik Jana",
				"Serban Jora",
				"Bernard Renger",
				"Bin Wei"
			],
			"url" :      "http://www2007.org/program/paper.php?id=530",
			"event" :    "Browsers and User Interfaces (2 of 2)",
			"abstract" : "The Web is rapidly moving towards a platform for mass collaboration in content production and consumption. Fresh content on a variety of topics, people, and places is being created and made available on the Web at breathtaking speed. Navigating the content effectively not only requires techniques such as aggregating various RSS-enabled feeds, but it also demands a new browsing paradigm. In this paper, we present novel geospatial and temporal browsing techniques that provide users with the capability of aggregating and navigating RSS enabled content in a timely, personalized and automatic manner. In particular, we describe a system called GeoTracker that utilizes both a geospatial representation and a temporal (chronological) presentation to help users spot the most relevant updates quickly. Within the context of this work, we provide a middleware engine that supports intelligent aggregation and dissemination of RSS feeds with personalization to desktops and mobile devices. We study the navigation capabilities of this system on two kinds of data sets, namely, 2006 World Cup soccer data collected over two months and breaking news items that occur every day. We also demonstrate that the application of such technologies to the video search results returned by YouTube and Google greatly enhances a user's ability in locating and browsing videos based on his or her geographical interests. Finally, we demonstrate that the location inference performance of GeoTracker compares well against machine learning techniques used in the natural language processing/information retrieval community. Despite its algorithmic simplicity, it preserves high recall percentages."
		},
		{
			"label" :    "XML Design for Relational Storage",
			"type" :     "Paper",
			"author" :   [
				"L Libkin",
				"S Kolahi"
			],
			"url" :      "http://www2007.org/program/paper.php?id=279",
			"event" :    "XML and Web Data (2 of 2)",
			"abstract" : "Design principles for XML schemas that eliminate redundancies and avoid update anomalies have been studied recently. Several normal forms, generalizing those for relational databases, have been proposed. All of them, however, are based on the assumption of a native XML storage, while in practice most XML data is stored in relational databases.<br /><br /> In this paper we study XML design and normalization for relational storage of XML documents. To be able to relate and compare XML and relational designs, we use an information-theoretic framework that measures information content in relations and documents, with higher values corresponding to lower levels of redundancy. We show that most common relational storage schemes preserve the notion of being well-designed (i.e., anomalies- and redundancy-free). Thus, existing XML normal forms guarantee well-designed relational storages as well. We further show that if this perfect option is not achievable, then a slight restriction on XML constraints guarantees a \"second-best\" relational design, according to possible values of the information-theoretic measure. We finally consider an edge-based relational representation of XML documents, and show that while it has similar information-theoretic properties with other relational representations, it can behave significantly worse in terms of enforcing integrity constraints."
		},
		{
			"label" :    "Page-level Template Detection via Isotonic Smoothing",
			"type" :     "Paper",
			"author" :   [
				"Deepayan Chakrabarti",
				"Ravi Kumar",
				"Kunal Punera"
			],
			"url" :      "http://www2007.org/program/paper.php?id=588",
			"event" :    "Data Mining (1 of 5)",
			"abstract" : "We develop a novel framework for the \"page-level\" template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively."
		},
		{
			"label" :    "Open User Profiles for Adaptive News Systems: Help or Harm?",
			"type" :     "Paper",
			"author" :   [
				"Jaewook Ahn",
				"Peter Brusilovsky",
				"Jonathan Grady",
				"Daqing He",
				"Sue Yeon Syn"
			],
			"url" :      "http://www2007.org/program/paper.php?id=602",
			"event" :    "Browsers and User Interfaces (1 of 2)",
			"abstract" : "Over the last five years a range of projects focused on progressively more elaborate techniques for adaptive news delivery. However, the adaptation process in these systems has become more complicated and thus less transparent to the users. In this paper, we concentrate on the application of open user models in adding transparency and controllability to adaptive news systems. We present a personalized news system, YourNews, that allows users to view and edit their interest profiles, and report a study of the system. Contrary to our expectations, the study demonstrated that this ability to edit user profiles can harm the system and user performance and has to be used with caution."
		},
		{
			"label" :    "Supporting End-Users in the Creation of Dependable Web Clips",
			"type" :     "Paper",
			"author" :   [
				"Sandeep Lingam",
				"Sebastian Elbaum"
			],
			"url" :      "http://www2007.org/program/paper.php?id=702",
			"event" :    "Web Engineering (2 of 2)",
			"abstract" : "Web authoring environments enable end-users to create applications that integrate information from other web sources. Users can create web sites that include built-in components to dynamically incorporate, for example, weather information, stock-quotes, or the latest news from different web sources. Recent surveys conducted among end-users have indicated an increasing interest in creating such applications. Unfortunately, web authoring environments do not provide support beyond a limited set of built-in components. This work addresses this limitation by providing end-user support for \"clipping\" information from a web site to incorporate it into the end-user site. The support consists of a mechanism to identify the clipping target with multiple markers to increase robustness, and a dynamic assessment of the retrieved information to quantify its reliability. The clipping approach has been integrated as a feature into a popular web authoring tool on which we present the results of two preliminary studies."
		},
		{
			"label" :    "Bridging the Gap Between OWL and Relational Databases",
			"type" :     "Paper",
			"author" :   [
				"Boris Motik",
				"Ian Horrocks",
				"Ulrike Sattler"
			],
			"url" :      "http://www2007.org/program/paper.php?id=67",
			"event" :    "Semantic Web (4 of 5)",
			"abstract" : "Schema statements in OWL are interpreted in a different way from similar statements in a relational database setting. This can lead to problems in data-centric applications, where OWL's interpretation of the statements intended as constraints may be confusing and/or inappropriate. We propose an extension of OWL that attempts to mimic the intuition behind integrity constraints in relational databases. We discuss the algorithms for checking constraint satisfaction for different types of knowledge bases, and show that, provided the constraints are satisfied, we can disregard them while answering a broad range of positive queries."
		},
		{
			"label" :    "Communication as Information-Seeking: The Case for Mobile Social Software for Developing Regions",
			"type" :     "Paper",
			"author" :   [
				"Beth Kolko",
				"Emma Rose",
				"Erica Johnson"
			],
			"url" :      "http://www2007.org/program/paper.php?id=669",
			"event" :    "Technology for Developing Regions (1 of 2)",
			"abstract" : "In this paper, we describe several findings from a multi-year, multi-method study of how information and communication technologies have been adopted and adapted in Central Asia. We have found that mobile phone usage is outpacing that of Internet adoption, that access to the Internet is primarily through public access sites carrying with it issues regarding privacy and surveillance, that people rely on their social networks as information sources, that public institutions tend to be fairly weak as citizen resources, and that information seeking and communication are conflated in people's usage patterns with different technologies. In addition, in the developed world social networking software has grown rapidly and shown itself to have significant potential for mobilizing a population. Based on the collection of findings from Central Asia and observing patterns of technology usage in other parts of the world, our research leads to the conclusion that exploring mobile social software holds significant potential as an ICT that meshes well with preexisting patterns of communication and information seeking and also leverages the most predominant pattern of technology adoption. Many of the findings from this research echo results from studies in other geographic areas, and so we anticipate that much of this research will be relevant to developing regions generally."
		},
		{
			"label" :    "Reliable QoS Monitoring Based on Client Feedback",
			"type" :     "Paper",
			"author" :   [
				"Radu Jurca",
				"Walter Binder",
				"Boi Faltings"
			],
			"url" :      "http://www2007.org/program/paper.php?id=420",
			"event" :    "Web Services (2 of 2)",
			"abstract" : "Service-level agreements (SLAs) establish a contract between service providers and clients concerning Quality of Service (QoS) parameters. Without proper penalties, service providers have strong incentives to deviate from the advertised QoS, causing losses to the clients. Reliable QoS monitoring (and proper penalties computed on the basis of delivered QoS) are therefore essential for the trustworthiness of a service-oriented environment. In this paper, we present a novel QoS monitoring mechanism based on quality ratings from the clients. A reputation mechanism collects the ratings and computes the actual quality delivered to the clients. The mechanism provides incentives for the clients to report honestly, and pays special attention to minimizing cost and overhead."
		},
		{
			"label" :    "Exposing Private Information by Timing Web Applications",
			"type" :     "Paper",
			"author" :   [
				"Andrew Bortz",
				"Dan Boneh"
			],
			"url" :      "http://www2007.org/program/paper.php?id=555",
			"event" :    "Security, Privacy, Reliability, and Ethics (1 of 3)",
			"abstract" : "We show that the time web sites take to respond to HTTP requests can leak private information, using two different types of attacks. The first directly measures response times from a web site to expose private information such as validity of a username at a secured site or the number of private photos in a publicly viewable gallery. The second, called cross-site timing, enables a malicious web site to obtain information from the user's perspective at another site. For example, a malicious site can learn if the user is currently logged in at a victim site and, in some cases, the number of objects in the user's shopping cart. Our experiments suggest that these timing vulnerabilities are wide-spread. We explain in detail how and why these attacks work, and discuss methods for writing web application code that resists these attacks."
		},
		{
			"label" :    "Combining Classifiers to Identify Online Databases",
			"type" :     "Paper",
			"author" :   [
				"Luciano Barbosa",
				"Juliana Freire"
			],
			"url" :      "http://www2007.org/program/paper.php?id=603",
			"event" :    "Search (2 of 7)",
			"abstract" : "We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, including meta-searchers, database selection tools, hidden-Web crawlers, form-schema matching and merging, and the construction of online database directories. We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases."
		},
		{
			"label" :    "ActiveRDF: Object-Oriented Semantic Web Programming",
			"type" :     "Paper",
			"author" :   [
				"Eyal Oren",
				"Renaud Delbru",
				"Armin Haller",
				"Sebastian Gerke"
			],
			"url" :      "http://www2007.org/program/paper.php?id=272",
			"event" :    "Semantic Web (4 of 5)",
			"abstract" : "Object-oriented programming is the current mainstream programming paradigm but existing RDF APIs are mostly triple-oriented. Traditional techniques for bridging a similar gap between relational databases and object-oriented programs cannot be applied directly, given the different nature of Semantic Web data, as can for example be seen in the semantics of class membership, inheritance relations, and object conformance to schemas.<br /><br /> We present ActiveRDF, an object-oriented API for managing RDF data that offers full manipulation and querying of RDF data, does not rely on a schema and fully conforms to RDF(S) semantics. ActiveRDF can be used with different RDF data stores; adapters have been implemented to generic SPARQL endpoints, Sesame, Jena, Redland and YARS, and new adapters can be added easily. In addition, integration with the popular Ruby on Rails framework enables fast development of Semantic Web applications."
		},
		{
			"label" :    "Do Not Crawl in the DUST: Different URLs with Similar Text",
			"type" :     "Paper",
			"author" :   [
				"Ziv Bar-Yossef",
				"Idit Keidar",
				"Uri Schonfeld"
			],
			"url" :      "http://www2007.org/program/paper.php?id=194",
			"event" :    "Data Mining (2 of 5)",
			"abstract" : "We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank."
		},
		{
			"label" :    "A Scalable Application Placement Controller for Enterprise Data Centers",
			"type" :     "Paper",
			"author" :   [
				"Chunqiang Tang",
				"Malgorzata Steinder",
				"Michael Spreitzer",
				"Giovanni Pacifici"
			],
			"url" :      "http://www2007.org/program/paper.php?id=145",
			"event" :    "Performance and Scalability (2 of 2)",
			"abstract" : "Given a set of machines and a set of Web applications with dynamically changing demands, an application placement controller decides how many instances to run for each application and where to put them, while observing all kinds of resource constraints. This problem is NP-hard. In this paper, we propose an online algorithm that uses heuristics to efficiently solve this problem. It allows multiple applications to share a single machine, and strives to maximize the total satisfied application demand, to minimize the number of application starts and stops, and to balance the load across machines. It can produce within 30 seconds high-quality solutions for hard placement problems with thousands of machines and thousands of applications. This scalability is crucial for dynamic resource provisioning in large-scale enterprise data centers. Our algorithm significantly and consistently outperforms the existing state-of-the-art algorithm under a wide variety of workloads."
		},
		{
			"label" :    "Analyzing Web Access Control Policies",
			"type" :     "Paper",
			"author" :   [
				"Vladimir Kolovski",
				"James Hendler",
				"Bijan Parsia"
			],
			"url" :      "http://www2007.org/program/paper.php?id=656",
			"event" :    "Security, Privacy, Reliability, and Ethics (3 of 3)",
			"abstract" : "XACML has emerged as a popular access control language on the Web, but because of its rich expressiveness, it has proved difficult to analyze in an automated fashion. Previous attempts to analyze XACML policies either use propositional logic or full First-Order logic. In this paper, we present a formalization of XACML using Description Logics (DL). This formalization allows us to extend the subset of XACML supported by propositional logic-based analysis tools; we also provide a new analysis service (policy redundancy). Mapping XACML to description logics allows us to use off-the-shelf DL reasoners for analysis tasks such as policy comparison, policy verification and querying. We provide empirical evaluation of a policy analysis tool that was implemented on top of open source reasoner Pellet."
		},
		{
			"label" :    "Preference-based Selection of Highly Configurable Web Services",
			"type" :     "Paper",
			"author" :   [
				"Steffen Lamparter",
				"Anupriya Ankolekar",
				"Rudi Studer",
				"Stephan Grimm"
			],
			"url" :      "http://www2007.org/program/paper.php?id=469",
			"event" :    "Web Services (2 of 2)",
			"abstract" : "A key challenge for dynamic Web service selection is that Web services are typically highly configurable and service requesters often have dynamic preferences on service configurations. Current approaches, such as WS-Agreement, describe Web services by enumerating the various possible service configurations, an inefficient approach when dealing with numerous service attributes with large value spaces. We model Web service configurations and associated prices and preferences more compactly using utility function policies, which also allows us to draw from multi-attribute decision theory methods to develop an algorithm for optimal service selection. In this paper, we present an OWL ontology for the specification of configurable Web service offers and requests, and a flexible and extensible framework for optimal service selection that combines declarative logic-based matching rules with optimization methods, such as linear programming. Assuming additive price/preference functions, experimental results indicate that our algorithm introduces an overhead of only around 2 sec. compared to a random service selection, while giving optimal results. The overhead, as percentage of total time, decreases as the number of offers and configurations increases."
		},
		{
			"label" :    "Demographic Prediction based on User's Browsing Behavior",
			"type" :     "Paper",
			"author" :   [
				"Jian Hu",
				"Hua-Jun Zeng",
				"Hua Li",
				"Cheng Niu",
				"Zheng Chen"
			],
			"url" :      "http://www2007.org/program/paper.php?id=686",
			"event" :    "Data Mining (4 of 5)",
			"abstract" : "Demographic information plays an important role in personalized web applications. However, it is usually not easy to obtain this kind of personal data such as age and gender. In this paper, we made a first approach to predict users' gender and age from their Web browsing behaviors, in which the webpage view information is treated as a hidden variable to propagate demographic information between different users. There are three main steps in our approach: First, learning from the web-page click-through data, Web pages are associated with users' (known) age and gender tendency through a discriminative model; Second, users' (unknown) age and gender are predicted from the demographic information of the associated Web pages through a Bayesian framework; Third, based on the fact that Web pages visited by similar users may be associated with similar demographic tendency, and users with similar demographic information would visit similar web pages, a smoothing component is employed to overcome the data sparseness of web click-through log. Experiments are conducted on a real web click-through log to demonstrate the effectiveness of the proposed approach. The experimental results show that the proposed algorithm can achieve up to 30.4% improvements on gender prediction and 50.3% on age prediction in terms of macro F1, comparing with baseline algorithms."
		},
		{
			"label" :    "Towards the Theoretical Foundation of Choreography",
			"type" :     "Paper",
			"author" :   [
				"Zongyan Qiu",
				"Xiangpeng Zhao",
				"Chao Cai",
				"Hongli Yang"
			],
			"url" :      "http://www2007.org/program/paper.php?id=200",
            "event" :    "Web Services (1 of 2)",
			"abstract" : "With the growth of interest on the web services, people pay more and more attention to choreography, that is, to describe collaborations of participants from a global viewpoint, in accomplishing a common business goal. In this paper, based on a simple choreography languages and a role-oriented process languages, we study some fundamental issues related to choreography, especially related to implementation, including semantics, projection and natural projection, dominant role in choices and iterations, etc. We develop the concept of \\emph{dominant role} and propose some novel languages structures related to it. The study reveals some clues about the language, semantics, specification and implementation of choreography."
		},
		{
			"label" :    "A Unified Platform for Data Driven Web Applictions with Automatic Client-Server Partitioning",
			"type" :     "Paper",
			"author" :   [
				"Fan Yang",
				"Nitin Gupta",
				"Nicholas Gerner",
				"Xin Qi",
				"Alan Demers",
				"Johannes Gehrke",
				"Jayavel Shanmugasundaram"
			],
			"url" :      "http://www2007.org/program/paper.php?id=485",
            "event" :    "Performance and Scalability (2 of 2)",
			"abstract" : "Data-driven web applications are structured into three tiers with different programming models at each tier. This division forces developers to manually partition application functionality across the tiers, resulting in complex logic, suboptimal partitioning, and expensive re-partitioning of applications.<br /><br /> In this paper, we introduce a unified platform for automatic partitioning of data-driven web applications. Our approach is based on Hilda, a high-level declarative programming language with a unified data and programming model for all the layers of the application. Based on run-time properties of the application, Hilda\'s run time system automatically partitions the application between the tiers to improve response time while adhering to memory or processing constraints at the clients. We evaluate our methodology with traces from a real application and with TPC-W, and our results show that automatic partitioning outperforms manual partitioning without the associated development overhead."
		},
		{
			"label" :    "Investigating Behavioral Variability in Web Search",
			"type" :     "Paper",
			"author" :   [
				"Ryen White",
				"Steven M. Drucker"
			],
			"url" :      "http://www2007.org/program/paper.php?id=535",
            "event" :    "Browsers and User Interfaces (1 of 2)",
			"abstract" : "In this paper we describe a longitudinal log-based study that investigated variability in peoples\' interaction behavior when engaged in search-related activities on the World Wide Web.  We analyze the consistency of interaction patterns for more than two thousand volunteer users over a period of five months.  Findings of our analysis suggest that there are dramatic differences in variability in key aspects of the interaction within and between queries, and within and between users.  Our findings also indicate the existence of at least two distinct classes of user \' navigators and explorers \' who exhibit large differences in their search behavior.  These findings have implications for the design of tools to support more effective search-related interactions on the Web."
		},
		{
			"label" :    "Web Projections: Learning from Contextual Subgraphs of the Web",
			"type" :     "Paper",
			"author" :   [
				"Jure Leskovec",
				"Susan Dumais",
				"Eric Horvitz"
			],
			"url" :      "http://www2007.org/program/paper.php?id=551",
            "event" :    "Search (3 of 7)",
			"abstract" : "Research on web search has demonstrated the value of using information about the graphical structure of the Web in ranking search results. To date, specific graphical properties have been used in these analyses. We introduce a {\\em web projection} method that generalizes prior efforts of graphical relationships of the web in several ways. With the approach, we create subgraphs by projecting sets of pages and domains onto the larger web graph, and then use machine learning to construct predictive models that operate on graphical properties. We describe the method and then present experiments that illustrate the construction of predictive models of search result quality and user query reformulation."
		},
		{
			"label" :    "Why We Search? Visualizing and Predicting User Behavior",
			"type" :     "Paper",
			"author" :   [
				"Eytan Adar",
				"Daniel Weld",
				"Brian Bershad",
				"Steven Gribble"
			],
			"url" :      "http://www2007.org/program/paper.php?id=520",
            "event" :    "Data Mining (4 of 5)",
			"abstract" : "The aggregation and comparison of behavioral patterns on the WWW represent a tremendous opportunity for understanding past behaviors and predicting future behaviors.  In this paper, we take a first step at achieving this goal.  We present a large scale study correlating the behaviors of Internet users on multiple systems ranging in size from 27 million queries to 14 million blog posts to 20,000 news articles.  We formalize a model for events in these time-varying datasets and study their correlation.   We have created an interface for analyzing the datasets, which includes a novel visual artifact, the DTWRadar, for summarizing differences between time series.   Using our tool we identify a number of behavioral properties that allow us to understand the predictive power of patterns of use."
		},
		{
			"label" :    "Answering Relationship Queries on the Web",
			"type" :     "Paper",
			"author" :   [
				"Gang Luo",
				"Chunqiang Tang",
				"Ying-li Tian"
			],
			"url" :      "http://www2007.org/program/paper.php?id=58",
            "event" :    "Search (6 of 7)",
			"abstract" : "Finding relationships between entities on the Web, e.g., the connections between different places or the commonalities of people, is a novel and challenging problem. Existing Web search engines excel in keyword matching and document ranking, but they cannot well handle many relationship queries. This paper proposes a new method for answering relationship queries on two entities. Our method first respectively retrieves the top Web pages for either entity from a Web search engine. It then matches these Web pages and generates an ordered list of Web page pairs. Each Web page pair consists of one Web page for either entity. The top ranked Web page pairs are likely to contain the relationships between the two entities. One main challenge in the ranking process is to effectively filter out the large amount of noise in the Web pages without losing much useful information. To achieve this, our method assigns appropriate weights to terms in Web pages and intelligently identifies the potential connecting terms that capture the relationships between the two entities. Only those top potential connecting terms with large weights are used to rank Web page pairs. Finally, the top ranked Web page pairs are presented to the searcher. For each such pair, the query terms and the top potential connecting terms are properly highlighted so that the relationships between the two entities can be easily identified. We implemented a prototype on top of the Google search engine and evaluated it under a wide variety of query scenarios. The experimental results show that our method is effective at finding important relationships with low overhead."
		},
		{
			"label" :    "Efficient Search Engine Measurements",
			"type" :     "Paper",
			"author" :   [
				"Ziv Bar-Yossef",
				"Maxim Gurevich"
			],
			"url" :      "http://www2007.org/program/paper.php?id=753",
            "event" :    "Search (1 of 7)",
			"abstract" : "We address the problem of measuring relevance neutral search quality metrics, like corpus size, index freshness, and density of duplicates in the index. The recently proposed estimators for such metrics [Bar-Yossef and Gurevich, WWW2006][Broder et al, CIKM 2006] suffer from significant bias and/or poor performance, due to inaccurate approximation of the so called ``document degrees\'\'.<br /><br /> We present two new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias even in situations where document degrees are poorly approximated.<br /><br /> Building on an idea from [Broder et al, CIKM 2006], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in significant performance improvements, while not compromising quality."
		},
		{
			"label" :    "Yago: A Core of Semantic Knowledge - Unifying WordNet and Wikipedia",
			"type" :     "Paper",
			"author" :   [
				"Fabian M. Suchanek",
				"Gjergji Kasneci",
				"Gerhard Weikum"
			],
			"url" :      "http://www2007.org/program/paper.php?id=391",
            "event" :    "Semantic Web (1 of 5)",
			"abstract" : "We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains roughly 900,000 entities and 5,000,000 facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as hasWonPrize). The facts have been automatically extracted from the unification of Wikipedia and WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships -- and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques."
		},
		{
			"label" :    "Compare&Contrast: Using the Web to Discover Comparable Cases for News Stories",
			"type" :     "Paper",
			"author" :   [
				"Jiahui Liu",
				"Earl Wagner",
				"Larry Birnbaum"
			],
			"url" :      "http://www2007.org/program/paper.php?id=759",
            "event" :    "Search (6 of 7)",
			"abstract" : "Comparing and contrasting is an important strategy people adopt to understand new situations and create solutions to for new problems. Similar events can provide hints for problem solving, as well as larger contexts for understanding the specific circumstances of an event. Lessons from past experience can be applied, insights can be gained about the new situation from familiar examples, and trends can be discovered among similar events. As the largest knowledge base for human beings, the Web provides both an opportunity and a challenge to discover comparable cases in order to facilitate situation analysis and problem solving. In this paper, we present Compare&Contrast, a system that uses the Web to discover comparable cases for news stories, documents about similar situations but involving distinct entities. The system analyzes a news story given by the user and builds a model of the situation. With the situation model, the system dynamically discovers entities comparable to the main entity in the original story and uses these comparable entities as seeds to retrieve web pages about comparable cases. The system is domain independent, does not require any knowledge engineering efforts, and deals with the complexity of unstructured text and noise on the web in a robust way. We evaluated the system with both a systematic experiment on a collection of news articles and a user study."
		},
		{
			"label" :    "Optimizing Web Search Using Social Annotation",
			"type" :     "Paper",
			"author" :   [
				"Shenghua Bao",
				"Xiaoyuan Wu",
				"Ben Fei",
				"Gui-Rong Xue",
				"Zhong Su",
				"Yong Yu"
			],
			"url" :      "http://www2007.org/program/paper.php?id=397",
            "event" :    "Search (4 of 7)",
			"abstract" : "This paper explores the use of social annotations to improve web search. Nowadays, many services e.g., del.icio.us have been developed for web users to organize and share their favorite web pages on line by using social annotations. We observed that the social annotations can benefit the web search in two aspects:  1) the annotations are usually good summaries of corresponding web pages; 2) the count of annotations indicates the popularity of web pages. Two novel algorithms are proposed to incorporate these information into page ranking: 1) SocialSimRank (SSR) calculates the similarity between social annotations and web queries; 2) SocialPageRank (SPR) captures the popularity of web pages. Preliminary experimental results show that SSR can find the latent semantic association between queries and annotations, while SPR successfully measures the quality (popularity) of a web page from the web users\' perspective. We further empirically evaluate the proposed methods with 50 manually annotated queries and 3000 auto-generated queries, on a dataset consisting of 690,482 web pages with 2,879,614 different annotations [32].  Experiments show that both SSR and SPR benefit the web search significantly. By incorporating both the SPR and SSR features, the quality of search results can be improved by as much as 14.80% and 25.02% compared with the original performance in MAP on two query sets respectively."
		},
		{
			"label" :    "SPARQ2L: Towards Support For Subgraph Extraction Queries in RDF Databases",
			"type" :     "Paper",
			"author" :   [
				"Kemafor Anyanwu",
				"Angela Maduko",
				"Amit Sheth"
			],
			"url" :      "http://www2007.org/program/paper.php?id=785",
            "event" :    "Semantic Web (4 of 5)",
			"abstract" : "Many applications in analytical domains often have the need to \'connect the dots\' i.e., query about the structure of data. In bioinformatics for example, it is typical to want to query about interactions between proteins. The aim of such queries is to \'extract\' relationships between entities i.e. paths from a data graph. Often, such queries will specify certain constraints that qualifying results must satisfy e.g. paths involving a set of mandatory nodes. Unfortunately, most present day Semantic Web query languages including the current draft of the anticipated recommendation SPARQL, lack the ability to express queries about arbitrary path structures in data. In addition, many systems that support some limited form of path queries rely on main memory graph algorithms limiting their applicability to very large scale graphs. In this paper, we present an approach for supporting Path Extraction queries. Our proposal comprises (i) a query language SPARQ2L which extends SPARQL with path variables and path variable constraint expressions, and (ii) a novel query evaluation framework based on efficient algebraic techniques for solving path problems which allows for path queries to be efficiently evaluated on disk resident RDF graphs. The effectiveness of our proposal is demonstrated by a performance evaluation of our approach on both real world based and synthetic datasets."
		},
		{
			"label" :    "Random Web Crawls",
			"type" :     "Paper",
			"author" :   [
				"Toufik Bennouas",
				"Fabien de Montgolfier"
			],
			"url" :      "http://www2007.org/program/paper.php?id=339",
            "event" :    "Search (3 of 7)",
			"abstract" : "This paper proposes a random Web crawl model. A Web crawl is a (biased and partial) image of the Web. This paper deals with the hyperlink structure, i.e. a Web crawl is a graph, whose vertices are the pages and whose edges are the hypertextual links. Of course a Web crawl has a very particular structure; we recall some known results about it. We then propose a model generating similar structures. Our model simply simulates a crawling, i.e. builds and crawls the graph at the same time. The graphs generated have lot of known properties of Web crawls. Our model is simpler than most random Web graph models, but captures the sames properties. Notice that it modelizes the crawling process instead of the page writting process of Web graph models."
		},
		{
			"label" :    "Identifying and Discriminating Between Web and Peer-to-Peer Traffic in the Network Core",
			"type" :     "Paper",
			"author" :   [
				"Jeffrey Erman",
				"Anirban Mahanti",
				"Martin Arlitt",
				"Carey Williamson"
			],
			"url" :      "http://www2007.org/program/paper.php?id=511",
            "event" :    "Technology for Developing Regions (2 of 2)",
			"abstract" : "Traffic classification is the ability to identify and categorize network traffic by application type. In this paper, we consider the problem of traffic classification in the network core. Classification at the core is challenging because only partial information of the flows and their contributors is available. We address this problem by developing and evaluating a classification framework that can classify a flow using only unidirectional flow information. We validated this approach using recent full-payload packet traces that we collected and pre-classified to establish a ``base truth\'\'. From our evaluation, we find that flow statistics along the server-to-client path of a TCP connection provides higher classification accuracy than flow statistics along the client-to-server path. Because collection of the server-to-client flow statistics may not always be feasible, we developed and verified an algorithm that can estimate the missing statistics from a unidirectional packet trace."
		},
		{
			"label" :    "Predicting Clicks: Estimating the Click-Through Rate for New Ads",
			"type" :     "Paper",
			"author" :   [
				"Matthew Richardson",
				"Ewa Dominowska",
				"Robert Ragno"
			],
			"url" :      "http://www2007.org/program/paper.php?id=784",
            "event" :    "Search (5 of 7)",
			"abstract" : "Search engine advertising has become a significant aspect of the Web browsing experience. The order in which a search engine displays ads greatly affects the probability that a user will see and click on each ad. Consequently, the ranking has a strong impact on the revenue the search engine receives from the ads. Further, showing the user an ad that they prefer to click on also improves user satisfaction. For these reasons, it is crucially important to be able to estimate the click-through rate of ads in the system. For ads that have been repeatedly displayed, this is empirically meas-urable, but when ads initially appear, other means must be used. We show that we can use features of ads, keywords, and advertis-ers to learn a model that accurately predicts the click-though rate for an ad. We also show that using our model improves the con-vergence and performance of an advertising system. As a result, our model would improve both revenues and user satisfaction."
		},
		{
			"label" :    "A Large-scale Evaluation and Analysis of Personalized Search Strategies",
			"type" :     "Paper",
			"author" :   [
				"Zhicheng Dou",
				"Ruihua Song",
				"Ji-Rong Wen"
			],
			"url" :      "http://www2007.org/program/paper.php?id=495",
            "event" :    "Search (7 of 7)",
			"abstract" : "Although personalized search has been proposed for many years and many personalization strategies have been investigated, it is still unclear whether personalization is consistently effective on different queries for different users, and under different search contexts. In this paper, we study the problem and get some preliminary conclusions. We present a large-scale personalized search evaluation framework based on search logs and then evaluate five personalized search strategies (including two click-based and three profile-based ones) using 12-day MSN search logs. By analyzing the results, we reveal that personalized search has significant improvement over common web search on some queries but it also has little effect on other queries (e.g., queries with small click entropy) and even harms search accuracy under some situations. Furthermore, we show that click-based personalization strategies perform consistently and considerablely well while profile-based ones are unstable in our experiments. We also reveal that both long-term and short-term contexts are very important in improving search performance for profile-based personalized search strategies."
		},
		{
			"label" :    "An Adaptive Crawler for Locating Hidden-Web Entry Points",
			"type" :     "Paper",
			"author" :   [
				"Luciano Barbosa",
				"Juliana Freire"
			],
			"url" :      "http://www2007.org/program/paper.php?id=429",
            "event" :    "Search (2 of 7)",
			"abstract" : "In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links, and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates\'the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy."
		},
		{
			"label" :    "A Mobile Application Framework for the Geospatial Web",
			"type" :     "Paper",
			"author" :   [
				"Rainer Simon",
				"Peter Froehlich"
			],
			"url" :      "http://www2007.org/program/paper.php?id=287",
            "event" :    "Pervasive Web and Mobility (1 of 1)",
			"abstract" : "In this paper we present an application framework that leverages geospatial content on the World Wide Web by enabling innovative modes of interaction and novel types of user interfaces on advanced mobile phones and PDAs. We discuss the current development steps involved in building mobile geospatial Web applications and derive three technological pre-requisites for our framework: spatial query operations based on visibility and field of view, a 2.5D environment model, and a presentation-independent data exchange format for geospatial query results. We propose the Local Visibility Model as a suitable XML-based candidate and present a prototype implementation."
		},
		{
			"label" :    "Extraction and classification of dense communities in the Web",
			"type" :     "Paper",
			"author" :   [
				"Yon Dourisboure",
				"Filippo Geraci",
				"Marco Pellegrini"
			],
			"url" :      "http://www2007.org/program/paper.php?id=15",
            "event" :    "Search (3 of 7)",
			"abstract" : "The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information and services, and there is a growing interest in tools for understanding collective behaviors and emerging phenomena in the WWW.  In this paper we focus on the problem of searching and classifying {\\em communities} in the web. Loosely speaking a community is a group of pages related to a common interest.  More formally communities have been associated in the computer science literature with the existence of a locally dense sub-graph of the web-graph (where web pages are nodes and hyper-links are arcs of the web-graph). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs.  We apply our algorithm on web-graphs built on three publicly available large crawls of the web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the web-graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100\\% for  communities of a thirty nodes or more (even at low density). It is  still about 80\\% even for communities of twenty nodes with  density over $50\\%$ of the arcs present. At the lower extremes the algorithm catches 35\\% of dense communities made of ten nodes. We complete our Community Watch system by clustering the communities found in the web-graph into homogeneous groups by topic and labelling each group by representative keywords."
		},
		{
			"label" :    "Is High-Quality VoD Feasible using P2P Swarming?",
			"type" :     "Paper",
			"author" :   [
				"Siddhartha Annapureddy",
				"Saikat Guha",
				"Dinan Gunawardena",
				"Christos Gkantsidis",
				"Pablo Rodriguez"
			],
			"url" :      "http://www2007.org/program/paper.php?id=158",
            "event" :    "Technology for Developing Regions (2 of 2)",
			"abstract" : "Digital media companies have recently started embracing P2P networks as an alternative content distribution channel. However, the drawback of the current {\\em P2P swarming} systems is that users need to download the full video and, hence, wait a long time before they can start watching it. While a lot of effort has gone into optimizing the distribution of large files, little research has been done on how to enable high-quality Video-on-Demand (VoD) functionality with P2P swarming systems. The main challenges reside in ensuring that users can start watching a movie at any point in time, while providing small start-up times, sustainable playback rates and high swarming efficiencies.<br /><br /> In this work, we explore the feasibility of providing high-quality VoD using P2P mesh-based networks. To this extent, we investigate scheduling and pre-fetching techniques, network coding, and mesh topology management. Using both simulations and results from a real implementation, we provide evidence that high-quality VoD is feasible, and give guidelines to enable play-as-you-download P2P swarming systems with high playback rates and minimum start-up delays."
		},
		{
			"label" :    "Exhibit: Light-weight Structured Data Publishing",
			"type" :     "Paper",
			"author" :   [
				"David Huynh",
				"David Karger",
				"Rob Miller"
			],
			"url" :      "http://www2007.org/program/paper.php?id=161",
            "event" :    "Semantic Web (2 of 5)",
			"abstract" : "It is no surprise that Semantic Web researchers and enthusiasts are excited to publish and accumulate semi-structured data on the Web. But looking beyond our community, we recognize that many, many other people also have structured data and want to publish it in rich browsing interfaces. These small-time authors fall into the same category as those early enthusiasts of the Web who were simply excited by the opportunity of using the new medium to share information that they cared about. With this insight, we create a lightweight structured data publishing framework called Exhibit that duplicates many factors we believe have contributed to the original growth of the Web. We argue that appealing to this segment of the Web population--addressing their publishing needs and desires at very low cost in many aspects--lets us leverage their labor to structure-ize existing content on the Web that has previously been authored in HTML by hand and is remaining hard to harvest automatically."
		},
		{
			"label" :    "Measuring Semantic Similarity between Words Using Web Search Engines",
			"type" :     "Paper",
			"author" :   [
				"Danushka Bollegala",
				"Yutaka Matsuo",
				"Mitsuru Ishizuka"
			],
			"url" :      "http://www2007.org/program/paper.php?id=632",
            "event" :    "Semantic Web (3 of 5)",
			"abstract" : "Semantic similarity measures play important roles in information retrieval and Natural Language Processing. In information retrieval, semantic similarity measures are used in automatic query suggestion and expansion. Previous work in semantic web-related applications such as community mining, relation extraction, automatic meta data extraction have used various semantic similarity measures. Despite the usefulness of semantic similarity measures in these applications, robustly measuring semantic similarity between two words (or entities) remains a challenging task. Semantic similarity is a dynamic phenomenon that changes over time and across domains. In this paper, we propose a robust semantic similarity measure that uses the information available on the Web to measure similarity between words or entities. We propose a method that exploits page counts and text snippets returned by a Web search engine to measure semantic similarity between words. We define various similarity scores for two given words \\textit{P} and \\textit{Q}, using the page counts for the queries  \\textit{P}, \\textit{Q} and \\textit{P AND Q}. Moreover, we propose a novel approach to compute semantic similarity using automatically extracted lexico-syntactic patterns from text snippets. These different similarity scores are integrated using support vector machines, to leverage a robust semantic similarity measure. Experimental results on Miller-Charles benchmark dataset show that the proposed measure outperforms all the existing web-based  semantic similarity measures by a wide margin, achieving a correlation coefficient of $0.834$. Moreover, the proposed semantic similarity measure significantly improves the accuracy ($F$-measure of $0.78$) in a community mining task, and improves accuracy in a entity disambiguation task,  thereby verifying the capability of the proposed measure to capture semantic similarity using web content."
		},
		{
			"label" :    "Semi-Austomated Adaptation of Service Interactions",
			"type" :     "Paper",
			"author" :   [
				"Hamid R. Motahari-Nezhad",
				"Boualem Benatallah",
				"Axel Marten",
				"Francisco Curbera",
				"Fabio Casati"
			],
			"url" :      "http://www2007.org/program/paper.php?id=584",
            "event" :    "Web Services (1 of 2)",
			"abstract" : "In today\'s Web, many functionality-wise similar Web services are offered through heterogeneous interfaces (operation definitions) and business protocols (ordering constraints defined on legal operation invocation sequences). The typical approach to enable interoperation in such a heterogeneous setting is through developing adapters. There have been approaches for classifying possible mismatches between service interfaces and business protocols to facilitate adapter development. However, the hard job is that of identifying, given two service specifications, the actual mismatches between their interfaces and business protocols.<br /><br /><br /><br /> In this paper we present novel techniques and a tool that provides semi-automated support for identifying and resolution of mismatches between service interfaces and protocols, and for generating adapter specification. We make the following main contributions: (i) we identify mismatches between service interfaces, which leads to finding mismatches of type of signature, merge/split, and extra/missing messages; (ii) we identify all ordering mismatches between service protocols and generate a tree, called mismatch tree, for mismatches that require developers\' input for their resolution. In addition, we provide semi-automated support in analyzing the mismatch tree to help in resolving such mismatches. We have implemented the approach in a tool inside IBM WID (WebSphere Integration Developer). Our experiments with some real-world case studies show the viability of the proposed approach. The methods and tool are significant in that they considerably simplify the problem of adapting services so that interoperation is possible."
		},
		{
			"label" :    "Towards Effective Browsing of Large Scale Social Annotations",
			"type" :     "Paper",
			"author" :   [
				"Rui Li",
				"Shenghua Bao",
				"Ben Fei",
				"Zhong Su",
				"Yong Yu"
			],
			"url" :      "http://www2007.org/program/paper.php?id=733",
            "event" :    "Web Engineering (2 of 2)",
			"abstract" : "This paper is concerned with the problem of browsing social annotations. Today, a lot of services (e.g., Del.icio.us, Flickr) have been provided based on social annotations. These services help users to manage and share their favorite URLs, photos, etc.  However, due to the exponential growth of social annotations, more and more users are facing the problem of finding desired resources in a large body of annotation data. Existing methods, such as tag clouds and annotation matching, work well only when the annotation scale is relatively small. Thus, an effective approach for browsing large-scale annotations and the associated resources is in great demand among both users and service providers.  In this paper, we propose a novel algorithm, namely Effective Large Scale Annotation Browser (ELSABer), to browse large-scale social annotations. With the help of ELSABer, users can browse large-scale annotations in a semantic, hierarchical and efficient way. More specifically, 1) the semantic relations among annotations are explored for similar resource browsing; 2) the hierarchical relations among annotations are also constructed for top-down browsing; 3) the power-law distribution of social annotations is studied for efficient browsing. By incorporating personal and time information, ELSABer can be further extended for personalized and time-related browsing. A prototype system is also implemented based on ELSABer and shows promising results."
		},
		{
			"label" :    "Navigating the intranet with high precision",
			"type" :     "Paper",
			"author" :   [
				"Alexander Loeser",
				"Sriram Raghavan",
				"Shivakumar Vaithyanathan",
				"Huaiyu Zhu"
			],
			"url" :      "http://www2007.org/program/paper.php?id=793",
            "event" :    "Search (4 of 7)",
			"abstract" : "Despite the success of web search engines, search over large enterprise intranets still suffers from poor result quality. Earlier work that compared intranets and the Internet from the view point of keyword search has pointed to several reasons why the search problem is quite different in these two domains. In this paper, we address the problem of providing high quality answers to navigational queries in the intranet (e.g., queries intended to find product or personal home pages, service pages, etc.). Our approach is based on offline identification of navigational pages, intelligent generation of term-variants to associate with each page, and the construction of separate indices exclusively devoted to answering navigational queries. Using a testbed of 5.5M pages from the IBM intranet, we present evaluation results that demonstrate that for navigational queries, our approach of using custom indices produces results of significantly higher precision than those produced by a general purpose search algorithm."
		},
		{
			"label" :    "NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks",
			"type" :     "Paper",
			"author" :   [
				"Shashank Pandit",
				"Duen Horng Chau",
				"Samuel Wang",
				"Christos Faloutsos"
			],
			"url" :      "http://www2007.org/program/paper.php?id=207",
            "event" :    "Data Mining (5 of 5)",
			"abstract" : "Given a large online network of online auction users and their histories of transactions, how can we spot anomalies, or even auction fraud? We describe the algorithms and system design decisions behind our proposed NetProbe system for uncovering auction fraud. We show that it is possible to do fast and scalable fraud detection in large auction networks. The main idea is to use the machinery of \"Markov Random Fields\" (MRF), and try to guess the hidden state (fraud/honest) of each participant. We describe the algorithms behind our system, which are based on \"belief propagation\"; we provide our own incremental but accurate approximations to it; and we list and justify our design decisions for efficient crawling of real auction networks. We report experiments on synthetic graphs containing as many as 7,000 nodes and 30,000 edges, where NetProbe was able to spot fraudulent nodes with over 90% precision and recall, with execution times in the order of seconds. We also report experiments on a real graph consisting of about 700,000 transactions between more than 66,000 eBay users, where NetProbe was highly effective at unearthing hidden networks of fraudsters, within a realistic response time of about 6 minutes."
		},
		{
			"label" :    "Optimal Audio-Visual Representations for Illiterate Users of Computers",
			"type" :     "Paper",
			"author" :   [
				"Indrani Medhi",
				"Archana Prasad",
				"Kentaro Toyama"
			],
			"url" :      "http://www2007.org/program/paper.php?id=764",
            "event" :    "Technology for Developing Regions (1 of 2)",
			"abstract" : "We present research leading toward an understanding of the optimal audio-visual representation for illustrating concepts for illiterate and semi-literate users of computers. In our user study, which to our knowledge is the first of its kind, we presented each of 13 different health symptoms to 200 illiterate subjects in one representation randomly selected among the following ten: text, static drawings, static photographs, hand-drawn animations, and video, each with and without voice annotation. The goal was to see how comprehensible these representation types were for an illiterate audience. We used a methodology for generating each of the representations tested in a way that fairly stacks one representational type against the others. Our main results are that (1) richer information is not necessarily better understood overall; (2) voice annotation generally helps in speed of comprehension, but bimodal audio-visual information can be confusing for the target population; (3) the relative value of dynamic imagery versus static imagery depends on other factors. Analysis of these statistically significant results and additional detailed results are also provided."
		},
		{
			"label" :    "Ontology Summarization Based on RDF Sentence Graph",
			"type" :     "Paper",
			"author" :   [
				"Xiang Zhang",
				"Gong Cheng",
				"Yuzhong Qu"
			],
			"url" :      "http://www2007.org/program/paper.php?id=565",
            "event" :    "Semantic Web (1 of 5)",
			"abstract" : "Ontology summarization is very important for the quick understanding and selection of ontologies. In this paper, we study extractive summarization of ontologies, which generates an indicative summary of an ontology automatically by extracting a salient part of it. We use the notion of an RDF sentence as the basic unit of an ontology. An RDF Sentence Graph is proposed to characterize the linkage between RDF sentences derived from an ontology from the viewpoint of a surfer. The salience of an RDF sentence is assessed in terms of its \"centrality\" in the graph. We propose to summarize an ontology as a set of salient RDF sentences extracted from the ontology according to a re-ranking strategy. We compare several methods of assessing the salience of RDF sentences and give an overall evaluation of the experimental results. Experiments show that the RDF Sentence Graph approach to ontology summarization is feasible."
		},
		{
			"label" :    "Hierarchical, Perceptron-like Learning for Ontology-Based Information Extraction",
			"type" :     "Paper",
			"author" :   [
				"Yaoyong Li",
				"Kalina Bontcheva"
			],
			"url" :      "http://www2007.org/program/paper.php?id=428",
            "event" :    "Semantic Web (3 of 5)",
			"abstract" : "Recent work on ontology-based Information Extraction (IE) has tried to make an increased use of the knowledge from the target ontology in order to improve the semantic annotation results. However, only very few approaches are able to benefit from the ontology structure and one of them is not a learning system, thus is not easy to adapt to new domains, whereas the other one does not perform semantic annotation of documents, but only ontology population.<br /><br /> This paper introduces a hierarchical learning approach for IE, which uses the target ontology as an essential part of the extraction process. Hierarchical classification takes into account the relations between concepts, thus benefiting directly from the ontology.<br /><br /> We also carry out evaluation experiments on the largest available semantically annotated corpus of 146 classes. The results demonstrate clearly the benefits of using knowledge from the ontology for ontology-based IE. We also demonstrate the advantages of our approach over other state-of-the-art learning systems on a commonly used benchmark dataset."
		},
		{
			"label" :    "Speeding up Adaptation of Web Service Compositions Using Expiration Times",
			"type" :     "Paper",
			"author" :   [
				"John Harney",
				"Prashant Doshi"
			],
			"url" :      "http://www2007.org/program/paper.php?id=605",
            "event" :    "Web Services (2 of 2)",
			"abstract" : "Web processes must often operate in volatile environments where the parameters of the participating service providers change during the life cycle of the process.  Optimally adapting to these changes, therefore, becomes an important challenge.  Adaptation requires the knowledge from each of the service providers as to how much each of their service parameters have changed and using this knowledge to determine whether the Web process should make a different, and subsequently, more optimal decision. Monitoring this information change requires additional computations and may become a time-consuming process. In this paper, we use service expiration times obtained from pre-defined service level agreements to reduce the computational overhead of adaptation. We use the intuition that services whose parameters have not expired need not be considered for querying for revised information. Using two realistic scenarios, we illustrate our approach and demonstrate the associated computational savings."
		},
		{
			"label" :    "The Discoverability of the Web",
			"type" :     "Paper",
			"author" :   [
				"Anirban Dasgupta",
				"Arpita Ghosh",
				"Ravi Kumar",
				"Christopher Olston",
				"Sandeep Pandey",
				"Andrew Tomkins"
			],
			"url" :      "http://www2007.org/program/paper.php?id=592",
            "event" :    "Search (2 of 7)",
			"abstract" : "Previous studies have highlighted the rapidity with which new content arrives on the web.  We study the extent to which this new content can be efficiently discovered in the crawling model.  Our study has two parts.  First, we employ a maximum cover formulation to study the inherent difficulty of the problem in a setting in which we have perfect estimates of likely sources of links to new content.  Second, we relax the requirement of perfect estimates into a more realistic setting in which algorithms must discover new content using historical statistics to estimate which pages are most likely to yield links to new content.<br /><br /> We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page.  We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 50% of all new content with under 3% overhead, and 100% of new content with 28% overhead.  But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: 26% of new content is accessible only by recrawling a constant fraction of the entire web. Of the remaining 74%, 80% of this content may be discovered within one week at discovery cost equal to 1.3X the cost of gathering the new content, in a model with full monthly recrawls."
		},
		{
			"label" :    "Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs",
			"type" :     "Paper",
			"author" :   [
				"Qiaozhu Mei",
				"Xu Ling",
				"Matthew Wondra",
				"Hang Su",
				"ChengXiang Zhai"
			],
			"url" :      "http://www2007.org/program/paper.php?id=680",
            "event" :    "Data Mining (4 of 5)",
			"abstract" : "In this paper, we define the problem of topic-sentiment analysis on Weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. The proposed Topic-Sentiment Mixture (TSM) model can reveal the latent topical facets in a Weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. It could also provide general sentiment models that are applicable to any ad hoc topics. With a specifically designed HMM structure, the sentiment models and topic models estimated with TSM can be utilized to extract topic life cycles and sentiment dynamics. Empirical experiments on different Weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from Weblog collections. The TSM model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction."
		},
		{
			"label" :    "Answering Bounded Continuous Search Queries in the World Wide Web",
			"type" :     "Paper",
			"author" :   [
				"Kukulenz Dirk",
				"Alexandros Ntoulas"
			],
			"url" :      "http://www2007.org/program/paper.php?id=403",
            "event" :    "Search (6 of 7)",
			"abstract" : "Search queries applied to extract relevant information from the World Wide Web over a period of time may be denoted as continuous search queries. The improvement of continuous search queries may concern not only the quality of retrieved results but also the freshness of results, i.e., the time between the availability of a respective data object on the Web and the notification of a user by the search engine. In some cases a user should be notified immediately since the value of the respective information decreases quickly, e.g., news about companies that affects the value of respective stocks, or sales offers for products that may no longer be available after a short period of time. In the document filtering literature the optimization of such queries is usually based on threshold classification. Documents above a quality threshold are returned to a user. The threshold is tuned in order to optimize the quality of retrieved results. The disadvantage of such approaches is that the amount of information returned to a user can hardly be controlled without further user interaction. In this paper we consider the optimization of bounded continuous search queries where only the estimated best k elements are returned to a user. We present a new optimization method for bounded continuous search queries based on optimal stopping theory and compare the new method to methods currently applied by Web search systems. The new method provides results of significantly higher quality for the cases where very fresh results have to be delivered."
		},
		{
			"label" :    "The Complex Dynamics of Collaborative Tagging",
			"type" :     "Paper",
			"author" :   [
				"Harry Halpin",
				"Valentin Robu",
				"Hana Shepherd"
			],
			"url" :      "http://www2007.org/program/paper.php?id=635",
            "event" :    "E*-Applications (1 of 2)",
			"abstract" : "The debate within the Web community over the optimal means by which to organize information often pits formalized classifications against distributed collaborative tagging systems. A number of questions remain unanswered, however, regarding the nature of collaborative tagging systems including whether coherent categorization schemes can emerge from unsupervised tagging by users.  This paper uses data from tagged sites on the social bookmarking site del.icio.us to examine the dynamics of collaborative tagging systems. In particular, we examine whether the distribution of the frequency of use of tags for \"popular\" sites with a long history (many tags and many users) can be described by a power law distribution, often characteristic of what are considered complex systems. We produce a generative model of collaborative tagging in order to understand the basic dynamics behind tagging, including how a power law distribution of tags could arise. We empirically examine the tagging history of sites in order to determine how this distribution arises over time and patterns prior to a stable distribution. Lastly, by focusing on the high-frequency tags of a site where the distribution of tags is a stabilized power law, we show how tag co-occurrence networks for a sample domain of tags can be used to analyze the meaning of particular tags given their relationship to other tags."
		},
		{
			"label" :    "Defeating Script Injection Attacks with Browser-Enforced Embedded Policies",
			"type" :     "Paper",
			"author" :   [
				"Trevor Jim",
				"Nikhil Swamy",
				"Michael Hicks"
			],
			"url" :      "http://www2007.org/program/paper.php?id=595",
            "event" :    "Security, Privacy, Reliability, and Ethics (1 of 3)",
			"abstract" : "Web sites that accept and display content such as wiki articles or comments typically filter the content to prevent injected script code from running in browsers that view the site.  The diversity of browser rendering algorithms and the desire to allow rich content makes filtering quite difficult, however, and attacks such as the Samy and Yamanner worms have exploited filtering weaknesses.  To solve this problem, this paper proposes a simple mechanism called Browser-Enforced Embedded Policies (BEEP).  The idea is that a web site can embed a policy inside its pages that specifies which scripts are allowed to run.  The browser, which knows exactly when it will run a script, can enforce this policy perfectly.  We have added BEEP support to several browsers, and built tools to simplify adding policies to web applications.  We found that supporting BEEP in browsers requires only small and localized modifications, modifying web applications requires minimal effort, and enforcing policies is generally lightweight."
		},
		{
			"label" :    "A Content-Driven Reputation System for the Wikipedia",
			"type" :     "Paper",
			"author" :   [
				"B. Thomas Adler",
				"Luca de Alfaro"
			],
			"url" :      "http://www2007.org/program/paper.php?id=692",
            "event" :    "E*-Applications (2 of 2)",
			"abstract" : "On-line forums for the collaborative creation of bodies of information are a phenomenon of rising importance; the Wikipedia is one of the best-known examples.  The open nature of such forums could benefit from a notion of reputation for its authors.  Author reputation could be used to flag new contributions from low-reputation authors, and it could be used to allow only authors with good reputation to contribute to controversial or critical pages.  A reputation system for the Wikipedia would also provide an incentive to give high-quality contributions.<br /><br /> We present in this paper a novel type of content-driven reputation system for Wikipedia authors.  In our system, authors gain reputation when the edits and text additions they perform to Wikipedia articles are long-lived, and they lose reputation when their changes are undone in short order.  We have implemented the proposed system, and we have used it to analyze the entire Italian and French Wikipedias, consisting of a total of 691,551 pages and 5,587,523 revisions.  Our results show that our notion of reputation has good predictive value: changes performed by low-reputation authors have a significantly larger than average probability of having poor quality, and of being undone."
		},
		{
			"label" :    "Just the Right Amount: Extracting Modules from Ontologies",
			"type" :     "Paper",
			"author" :   [
				"Bernardo Cuenca Grau",
				"Ian Horrocks",
				"Yevgeny Kazakov",
				"Ulrike Sattler"
			],
			"url" :      "http://www2007.org/program/paper.php?id=433",
            "event" :    "Semantic Web (1 of 5)",
			"abstract" : "The ability to extract meaningful fragments from an ontology is key for ontology re-use. We propose a definition of a module that guarantees to completely capture the meaning of a given set of terms, i.e., to include all axioms relevant to the meaning of these terms, and study the problem of extracting minimally sized modules. We show that the problem of deciding if a module is minimal is undecidable even for rather restricted sub-languages of OWL DL. Hence we propose two \"approximations\", i.e., alternative definitions of modules for a vocabulary that still provide the above guarantee, but that are possibly too strict, and that may thus result in larger modules: the first approximation is semantic and can be checked using existing DL reasoners; the second is syntactic, and can be computed in polynomial time.  Finally, we report on an empirical evaluation of our syntactic approximation that demonstrates that the modules we extract are surprisingly small."
		},
		{
			"label" :    "Querying and Maintaining a Compact XML Storage",
			"type" :     "Paper",
			"author" :   "Raymond Wong",
			"url" :      "http://www2007.org/program/paper.php?id=794",
            "event" :    "XML and Web Data (2 of 2)",
			"abstract" : "As XML database sizes grow, the amount of space used for storing the data and auxiliary supporting data structures becomes a major factor in query and update performance.  This paper presents a new storage scheme for XML data that supports all navigational operations in near constant time. In addition to supporting efficient queries, the space requirement of the proposed scheme is within a constant factor of the information theoretic minimum, while insertions and deletions can be performed in near constant time as well.  As a result, the proposed structure features a small memory footprint that increases cache locality, whilst still supporting standard APIs, such as DOM, and necessary database operations, such as queries and updates, efficiently. Analysis and experiments show that the proposed structure is space and time efficient."
		},
		{
			"label" :    "A Large-Scale Study of Web Password Habits",
			"type" :     "Paper",
			"author" :   [
				"Dinei Florencio",
				"Cormac Herley"
			],
			"url" :      "http://www2007.org/program/paper.php?id=620",
            "event" :    "Security, Privacy, Reliability, and Ethics (2 of 3)",
			"abstract" : "We report the results of a large scale study of password use and password re-use habits. The study involved half a million users over a three month period. A client component on users' machines recorded a variety of password strength, usage and frequency metrics. This allows us to measure or estimate such quantities as the average number of passwords and average number of accounts each user has, how many passwords she types per day, how often passwords are shared among sites, and how often they are forgotten. We get extremely detailed data on password strength, the types and lengths of passwords chosen, and how they vary by site. The data is the first large scale study of its kind, and yields numerous other insights into the role the passwords play in users' online experience."
		},
		{
			"label" :    "CSurf: A Context-Driven Non-Visual Web-Browser",
			"type" :     "Paper",
			"author" :   [
				"Jalal Mahmud",
				"Yevgen Borodin",
				"I.V. Ramakrishnan"
			],
			"url" :      "http://www2007.org/program/paper.php?id=649",
            "event" :    "Browsers and User Interfaces (2 of 2)",
			"abstract" : "Web sites are designed for graphical mode of interaction. Sighted users can \"cut to the chase\" and quickly identify relevant information in Web pages. On the contrary, individuals with visual disabilities have to use screen-readers to browse the Web. As screen-readers process pages sequentially and read through everything, Web browsing can become strenuous and time-consuming. Although the use of shortcuts and searching offers some improvements, the problem still remains. In this paper, we address the problem of information overload in non-visual Web access using the notion of context. Our prototype system, CSurf, embodying our approach, provides the usual features of a screen-reader. However, when a user follows a link, CSurf captures the context of the link using a simple topic-boundary detection technique, and uses it to identify relevant information on the next page with the help of a Support Vector Machine, a statistical machine-learning model. Then, CSurf reads the Web page starting from the most relevant section, identified by the model. We conducted a series of experiments to evaluate the performance of CSurf against the state-of-the-art screen-reader, JAWS. Our results show that the use of context can potentially save browsing time and substantially improve browsing experience of visually disabled people."
		},
		{
			"label" :    "Information Flow Modeling based on Diffusion Rate for Prediction and Ranking",
			"type" :     "Paper",
			"author" :   [
				"Xiaodan Song",
				"Yun Chi",
				"Koji Hino",
				"Belle Tseng"
			],
			"url" :      "http://www2007.org/program/paper.php?id=664",
            "event" :    "Data Mining (5 of 5)",
			"abstract" : "Information flows in a network where individuals influence each other.  The diffusion rate captures how efficiently the information can diffuse among the users in the network.  We propose an information flow model that leverages diffusion rates for: (1) prediction: identify where information should flow to, and (2) ranking: identify who will most quickly receive the information.  For prediction, we measure how likely information will propagate from a specific sender to a specific receiver during a certain time period. Accordingly a rate-based recommendation algorithm is proposed that predicts who will most likely receive the information during a limited time period.  For ranking, we estimate the expected time for information diffusion to reach a specific user in a network. Subsequently, a DiffusionRank algorithm is proposed that ranks users based on how quickly information will flow to them.  Experiments on two datasets demonstrate the effectiveness of the proposed algorithms to both improve the recommendation performance and rank users by the efficiency of information flow."
		},
		{
			"label" :    "Robust Web Page Segmentation for Mobile Terminal Using Content-Distances and Page Layout Information",
			"type" :     "Paper",
			"author" :   [
				"Gen Hattori",
				"Keiichiro Hoashi",
				"Kazunori Matsumoto",
				"Fumiaki Sugaya"
			],
			"url" :      "http://www2007.org/program/paper.php?id=752",
            "event" :    "Pervasive Web and Mobility (1 of 1)",
			"abstract" : "Demand for browsing information on general Web pages using a mobile phone is increasing. However, since the majority of Web pages on the Internet are optimized for browsing from PCs, it is difficult for mobile phone users to obtain sufficient information from the Web. Therefore, a method to reconstruct PC-optimized Web pages for mobile phone users is essential. An example approach is to segment the Web page based on its structure, and utilize the hierarchy of the content elements to regenerate a page suitable for mobile phone browsing. In our previous work, we examined a robust automatic Web page segmentation scheme which uses the distance between content elements based on the relative HTML tag hierarchy, i.e., the number and depth of HTML tags in Web pages. However, this scheme has the problem that the content-distance based on the order of HTML tags does not always correspond to the intuitive distance between content elements on the actual layout of a Web page. In this paper, we propose a hybrid segmentation method which segments Web pages based on both the content-distance calculated by the previous scheme, and a novel approach which utilizes Web page layout information. Experiments conducted to evaluate the accuracy of Web page segmentation results prove that the proposed method can segment Web pages more accurately than conventional methods. Furthermore, implementation and evaluation of our system on the mobile phone prove that our method can realize superior usability compared to commercial Web browsers."
		},
		{
			"label" :    "Web Object Retrieval",
			"type" :     "Paper",
			"author" :   [
				"Zaiqing Nie",
				"Yunxiao Ma",
				"Shuming Shi",
				"Ji-Rong Wen",
				"Wei-Ying Ma"
			],
			"url" :      "http://www2007.org/program/paper.php?id=626",
            "event" :    "Data Mining (1 of 5)",
			"abstract" : "The primary function of current Web search engines is essentially relevance ranking at the document level. However, myriad structured information about real-world objects is embedded in static Web pages and online Web databases. Document-level information retrieval can unfortunately lead to highly inaccurate relevance ranking in answering object-oriented queries. In this paper, we propose a paradigm shift to enable searching at the object level. In traditional information retrieval models, documents are taken as the retrieval units and the content of a document is considered reliable. However, this reliability assumption is no longer valid in the object retrieval context when multiple copies of information about the same object typically exist. These copies may be inconsistent because of diversity of Web site qualities and the limited performance of current information extraction techniques. If we simply combine the noisy and inaccurate attribute information extracted from different sources, we may not be able to achieve satisfactory retrieval performance. In this paper, we propose several language models for Web object retrieval, namely an unstructured object retrieval model, a structured object retrieval model, and a hybrid model with both structured and unstructured retrieval features. We test these models on a paper search engine and compare their performances. We conclude that the hybrid model is superior because it takes into account the extraction errors at varying levels."
		},
		{
			"label" :    "Scaling Up All-Pairs Similarity Search",
			"type" :     "Paper",
			"author" :   [
				"Roberto Bayardo",
				"Yiming Ma",
				"Ramakrishnan Srikant"
			],
			"url" :      "http://www2007.org/program/paper.php?id=342",
            "event" :    "Data Mining (3 of 5)",
			"abstract" : "Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine distance) is above a given threshold.  We propose novel optimization and indexing techniques for this problem, resulting in an algorithm that is both faster and simpler than the previous state-of-the-art approaches.  We demonstrate the effectiveness of our algorithm on the public DBLP dataset, and on two real-world web applications: generating recommendations for the Orkut social network, and computing pairs of similar queries from search snippet data among the 5 million most frequently issued Google queries. Our algorithm is between 5 and 20 times faster than previous algorithms on these datasets."
		},
		{
			"label" :    "From SPARQL to Rules (and back)",
			"type" :     "Paper",
			"author" :   "Axel Polleres",
			"url" :      "http://www2007.org/program/paper.php?id=435",
            "event" :    "Semantic Web (4 of 5)",
			"abstract" : "As the data and ontology layers of the Semantic Web stack have achieved a certain level of maturity in standard recommendations such as RDF and OWL, the current focus lies on two related aspects. On the one hand, the definition of a suitable query language for RDF, SPARQL, seems to be close to candidate recommendation status within the W3C. The establishment of the Rules layer on top of the existing stack on the other hand marks the next step to be tackled, where especially languages with their roots in Logic Programming and Deductive Databases are receiving considerable attention. The purpose of this paper is threefold. First, we discuss the formal semantics of SPARQL extending recent results in several ways. Second, we provide translations from SPARQL to Datalog with stratified negation as failure. Third, we propose some useful and easy to implement extensions of SPARQL, based on this translation. As it turns out, the combination serves for direct implementations of SPARQL on top of existing rules engines as well as a basis for more general rules and query languages on top of RDF."
		},
		{
			"label" :    "Privacy-Enhancing Personalized Web Search",
			"type" :     "Paper",
			"author" :   [
				"Yabo Xu",
				"Benyu Zhang",
				"Zheng Chen",
				"Ke Wang"
			],
			"url" :      "http://www2007.org/program/paper.php?id=247",
            "event" :    "Search (7 of 7)",
			"abstract" : "Personalized web search is a promising way to improve search quality by customizing search results for people with individual information goals. However, users are uncomfortable with exposing private preference information to search engines. On the other hand, privacy is not absolute, and often can be compromised if there is a gain in service or profitability to the user. Thus, a balance must be struck between search quality and privacy protection. This paper presents a scalable way for users to automatically build rich user profiles. These profiles summarize a user\'s interests into a hierarchical organization according to specific interests. Two parameters for specifying privacy requirements are proposed to help the user to choose the content and degree of detail of the profile information that is exposed to the search engine. Experiments showed that the user profile improved search quality when compared to standard MSN rankings. More importantly, results verified our hypothesis that a significant improvement on search quality can be achieved by only sharing some higher-level user profile information, which is potentially less sensitive than detailed personal information."
		},
		{
			"label" :    "Multiway SLCA-based Keyword Search in XML Data",
			"type" :     "Paper",
			"author" :   [
				"Chong Sun",
				"Chee-Yong Chan",
				"Amit Goenka"
			],
			"url" :      "http://www2007.org/program/paper.php?id=591",
            "event" :    "XML and Web Data (1 of 2)",
			"abstract" : "Keyword search for smallest lowest common ancestors (SLCAs) in XML data has recently been proposed as a meaningful way to identify interesting data nodes in XML data whose subtrees contain an input set of keywords. In this paper, we generalize this useful search paradigm to support keyword search beyond the traditional AND semantics to include both AND and OR boolean operators as well. We first analyze properties of the LCA computation and propose more efficient algorithms to solve the traditional keyword search problem (with only AND semantics). We then extend our approach to handle general keyword search involving combinations of AND and OR boolean operators. The effectiveness of our new algorithms is demonstrated with a comprehensive experimental performance study."
		},
		{
			"label" :    "Turning Portlets into Services: The Consumer Profile",
			"type" :     "Paper",
			"author" :   [
				"Oscar Diaz",
				"Salvador Trujillo",
				"Sandy P%27rez"
			],
			"url" :      "http://www2007.org/program/paper.php?id=190",
            "event" :    "Web Engineering (1 of 2)",
			"abstract" : "Portlets strive to play at the front end the same role that Web services currently enjoy at the back end, namely, enablers of application assembly through reusable services. However, it is well-known in the component community that, the larger the component, the more reduced the reuse. Hence, the coarse-grained nature of portlets (they encapsulate also the presentation layer) can jeopardize this vision of portlets as reusable services. To avoid this situation, this work proposes a perspective shift in portlet development by introducing the notion of organization profile. While the user profile characterises the end user (e.g. age, name, etc), the organization profile captures the idiosyncrasies of the organization through which the portlet is being delivered (e.g. the portal owner) as far as the portlet functionality is concerned. The user profile is dynamic and hence, requires the portlet to be customised at run time. By contrast, the organization profile is known at registration time, and it is not always appropriate/possible to consider it at run time. Rather, it is better to customize the code at development time, and produce an organization-specific portlet which built-in, custom functionality. In this scenario, we no longer have a portlet but a family of portlets, and the portlet provider becomes the \"assembly line\" of this family. This work promotes this vision by introduces an organization-aware, WSRP-compliant architecture that let portlet consumers registry and handle \"family portlets\" in the same way that \"traditional portlets\". In so doing, portlets are nearer to become truly reusable services."
		},
		{
			"label" :    "Introduction and Evaluation of Martlet, a Scientific Workflow Language for Abstracted Parallelisation",
			"type" :     "Paper",
			"author" :   "Daniel James Goodman",
			"url" :      "http://www2007.org/program/paper.php?id=479",
            "event" :    "Web Services (1 of 2)",
			"abstract" : "The workflow language Martlet described in this paper implements a new programming model that allows users to write parallel programs and analyse distributed data without having to be aware of the details of the parallelisation. Martlet abstracts the parallelisation of the computation and the splitting of the data through the inclusion of constructs inspired by functional programming. These allow programs to be written as an abstract description that can be adjusted automatically at runtime to match the data set and available resources. Using this model it is possible to write programs to perform complex calculations across a distributed data set such as Singular Value Decomposition or Least Squares problems, as well as creating an intuitive way of working with distributed systems<br /><br /> Having described and evaluated Martlet against other functional languages for parallel computation, this paper goes on to look at how Martlet might develop. In doing so it covers both possible additions to the language itself, and the use of JIT compilers to increase the range of platforms it is capable of running on."
		},
		{
			"label" :    "Subspace: Secure Cross-Domain Communication for Web Mashups",
			"type" :     "Paper",
			"author" :   [
				"Collin Jackson",
				"Helen Wang"
			],
			"url" :      "http://www2007.org/program/paper.php?id=801",
            "event" :    "Security, Privacy, Reliability, and Ethics (1 of 3)",
			"abstract" : "Combining data and code from third-party sources has enabled a new wave of web mashups that add creativity and functionality to web applications. However, browsers are poorly designed to pass data between domains, often forcing web developers to abandon security in the name of functionality. To address this deficiency, we developed Subspace, a novel cross-domain communication mechanism that allows efficient communication across domains without sacrificing security. Our prototype requires only a small JavaScript library, and works across all major browsers. We believe Subspace can serve as a new secure communication primitive for web mashups."
		},
		{
			"label" :    "Summarizing Email Conversations with Clue Words",
			"type" :     "Paper",
			"author" :   [
				"Giuseppe Carenini",
				"Raymond Ng",
				"Xiaodong Zhou"
			],
			"url" :      "http://www2007.org/program/paper.php?id=631",
            "event" :    "Data Mining (2 of 5)",
			"abstract" : "With the ever increasing popularity of emails, email overload becomes a major problem for email users. Email summarization is one way not only to solve this problem, but also to make use of one\'s email corpus.  In this paper, we propose a new framework for email summarization. One novelty is to use a fragment quotation graph to try to capture an email conversation. The second novelty is to use clue words to measure the importance of sentences in conversation summarization. Based on clue words and their scores, we propose a method called CWS, which is capable of producing a summary of any length as requested by the user. We provide a comprehensive comparison of CWS with various existing methods on the Enron data set.  Preliminary results suggest that CWS provides better summaries than existing methods."
		},
		{
			"label" :    "Explorations in the use of Semantic Web Technologies for Product Information Management",
			"type" :     "Paper",
			"author" :   [
				"Jean-Sebastien Brunner",
				"Li Ma",
                "Chen Wang",
				"Daniel C. Wolfson",
				"Yue Pan",
                "Kavitha Srivinas"
			],
			"url" :      "http://www2007.org/program/paper.php?id=776",
            "event" :    "Semantic Web (2 of 5)",
			"abstract" : "Master data is core business entities a company uses repeatedly across many business processes and systems (such as lists or hierarchies of customers, suppliers, accounts, products, or organizational units). Product information is the most important kind of master data and product information management (PIM) is becoming critical for modern enterprises because it provides a richer business context for various applications. Existing PIM systems are less flexible and scalable for on-demand business, as well as weak to completely capture and use the semantics of master data. This paper explores how to use semantic web technologies to enhance a collaborative PIM system by simplifying modelling and representation while preserving enough dynamic flexibility. Furthermore, we build a semantic PIM system using one of the state-of-art ontology repositories and summarize encountered challenges based on our experimental results, especially on performance and scalability. We believe that our study and experiences are valuable for both semantic web community and master data management community."
		},
		{
			"label" :    "Expertise Networks in Online Communities: Structure and Algorithms",
			"type" :     "Paper",
			"author" :   [
				"Jun Zhang",
				"Mark Ackerman",
				"Lada Adamic"
			],
			"url" :      "http://www2007.org/program/paper.php?id=516",
            "event" :    "E*-Applications (1 of 2)",
			"abstract" : "Web-based communities have become an important place for people to seek and share expertise. We find that networks in these communities typically differ in their topology from other online networks such as the World Wide Web. Systems targeted to augment web-based communities by automatically identifying users with expertise, for example, need to adapt to the underlying interaction dynamics. In this study, we analyze the Java Forum, a large online help-seeking community, using social network analysis methods.  We test a set of network-based ranking algorithms, including PageRank and HITS, on this large size social network in order to identify users with high expertise. We then use simulations to identify a small number of simple rules governing the question-answer dynamic in the network. These simple rules not only replicate the structural characteristics and algorithm performance on the empirically observed Java Forum, but also allow us to evaluate how other algorithms may perform in communities with different characteristics. We believe this approach will be fruitful for practical algorithm design and implementation for online expertise-sharing communities."
		},
		{
			"label" :    "P-TAG: Large Scale Automatic Generation of Personalized Annotation TAGs for the Web",
			"type" :     "Paper",
			"author" :   [
				"Paul - Alexandru Chirita",
				"Stefania Costache",
				"Siegfried Handschuh",
				"Wolfgang Nejdl"
			],
			"url" :      "http://www2007.org/program/paper.php?id=481",
            "event" :    "Semantic Web (5 of 5)",
			"abstract" : "The success of the Semantic Web depends on the availability of Web pages annotated with metadata. Free form metadata or tags, as used in social bookmarking and folksonomies based systems, have become more and more popular and successful. Such tags are relevant keywords associated with or assigned to a piece of information (e.g., a Web page), thus describing the item and enabling keyword-based classification. In this paper we propose P-TAG, a method which automatically generates personalized tags for Web pages. Keywords are generated based on the content of the Web page but also based on the content of the user\'s Desktop, thus expressing a personalized viewpoint very relevant for personal tags. We implemented and tested several algorithms for this approach and evaluated the relevance of the resulting keywords. These evaluations showed very promising results and we are therefore very confident that such a user oriented automatic tagging approach can provide large scale personalized metadata annotation as an important step towards realizing the Semantic Web."
		},
		{
			"label" :    "Consistency-preserving Caching of Dynamic Database Content",
			"type" :     "Paper",
			"author" :   [
				"Niraj Tolia",
				"M. Satyanarayanan"
			],
			"url" :      "http://www2007.org/program/paper.php?id=120",
            "event" :    "Performance and Scalability (1 of 2)",
			"abstract" : "With the growing use of dynamic web content generated from relational databases, traditional caching solutions for throughput and latency improvements are ineffective.  We describe a middleware layer called Ganesh that reduces the volume of data transmitted without semantic interpretation of queries or results.  It achieves this reduction through the use of cryptographic hashing to detect similarities with previous results.  These benefits do not require any compromise of the strict consistency semantics provided by the back-end database. Further, Ganesh does not require modifications to applications, web servers, or database servers, and works with closed-source applications and databases.  Using two benchmarks representative of dynamic web sites, measurements of our prototype show that it can increase end-to-end throughput by as much as twofold for non-data intensive applications and by as much as tenfold for data intensive ones."
		},
		{
			"label" :    "A High-Performance Interpretive Approach to Schema-Directed Parsing",
			"type" :     "Paper",
			"author" :   [
				"Morris Matsa",
				"Eric Perkins",
				"Abraham Heifets",
				"Margaret Gaitatzes Kostoulas",
				"Daniel Silva",
				"Noah Mendelsohn",
				"Michelle Leger"
			],
			"url" :      "http://www2007.org/program/paper.php?id=507",
            "event" :    "XML and Web Data (2 of 2)",
			"abstract" : "XML delivers key advantages in interoperability due to its flexibility, expressiveness, and platform-neutrality.  As XML has become a performance-critical aspect of the next generation of business computing infrastructure, however, it has become increasingly clear that XML parsing often carries a heavy performance penalty, and that current, widely-used parsing technologies are unable to meet the performance demands of an XML-based computing infrastructure.  Several efforts have been made to address this performance gap through the use of grammar-based parser generation.  While the performance of generated parsers has been significantly improved, adoption of the technology has been hindered by the complexity of compiling and deploying the generated parsers.  Through careful analysis of the operations required for parsing and validation, we have devised a set of specialized bytecodes, designed for the task of XML parsing and validation.  These bytecodes are designed to engender the benefits of fine-grained composition of parsing and validation that make existing compiled parsers fast, while being coarse-grained enough to minimize interpreter overhead.  This technique of using an interpretive, validating parser balances the need for performance against the requirements of simple tooling and robust scalable infrastructure. Our approach is demonstrated with a specialized schema compiler, used to generate bytecodes which in turn drive an interpretive parser. With almost as little tooling and deployment complexity as a traditional interpretive parser, the bytecode-driven parser usually demonstrates performance within 20% of the fastest fully compiled solutions."
		},
		{
			"label" :    "Long Distance Wireless Mesh Network Planning: Problem Formulation and Solution",
			"type" :     "Paper",
			"author" :   [
				"Sayandeep Sen",
				"Bhaskaran Raman"
			],
			"url" :      "http://www2007.org/program/paper.php?id=366",
            "event" :    "Technology for Developing Regions (2 of 2)",
			"abstract" : "Several research efforts as well as deployments have chosen IEEE 802.11 as a low-cost, long-distance access technology to bridge the digital divide.  In this paper, we consider the important issue of planning such networks to the minimize system cost.  This is a non-trivial task since it involves several sets of variables: the network topology, tower heights, antenna types to be used and their orientations, and radio transmit powers.  The task is further complicated due to the presence of network performance constraints, and  the inter-dependence among the variables.  Our first contribution in this paper is the formulation of this problem in terms of the variables, constraints and the optimization criterion.  Our second contribution is in identifying the dependencies among the variables and breaking-down the problem into four tractable sub-parts. In this process, we extensively use domain knowledge to strike a balance between tractability and practicality.<br /><br /> We have evaluated the proposed algorithms using random input sets as well as real-life instances with success.  We have been able to show detailed planning of network topology, required tower heights, antenna types, and transmit powers for the Ashwini project, a long distance WiFi network under deployment in Andhra Pradesh, India,  In this case, we are able to achieve within 2\\% additional cost of a lower bound estimate."
		},
		{
			"label" :    "Using Google Distance to weight approximate ontology matches",
			"type" :     "Paper",
			"author" :   [
				"Risto Risto Gligorov",
				"Zharko Aleksovski",
				"Warner ten Kate",
				"Frank van Harmelen"
			],
			"url" :      "http://www2007.org/program/paper.php?id=464",
            "event" :    "Semantic Web (3 of 5)",
			"abstract" : "Discovering mappings between concept hierarchies is widely regarded as one of the hardest and most urgent problems facing the Semantic Web. The problem is even harder in domains where concepts are inherently vague and ill-defined, and cannot be given a crisp definition. A notion of approximate concept mapping is required in such domains, but until now, no such notion is available.<br /><br /> The first contribution of this paper is a definition for concepts is decomposed into a number of submappings, and a \\emph{sloppiness value} determines the fraction of these submappings that can be ignored when establishing the mapping.<br /><br /> A potential problem of such a definition is that with an increasing sloppiness value, it will gradually allow mappings between any two arbitrary concepts. To improve on this trivial behaviour, we need to design a heuristic weighting which minimises the sloppiness required to conclude desirable matches, but at the same time maximises the sloppiness required to conclude undesirable matches. The second contribution of this paper is to show that a \\emph{Google-based similarity measure} has exactly these desirable properties.<br /><br /> We establish these results by \\emph{experimental validation in the domain of musical genres}. We show that this domain does suffer from ill-defined concepts. We take two real-life genre hierarchies from the Web, we compute approximate mappings between them at varying levels of sloppiness, and we validate our results against a hand-crafted Gold Standard.<br /><br /> Our method makes use of the huge amount of knowledge that is implicit in the current Web, and exploits this knowledge as a heuristic for establishing approximate mappings between ill-defined concepts."
		},
		{
			"label" :    "Detecting Near-Duplicates for Web Crawling",
			"type" :     "Paper",
			"author" :   [
				"Gurmeet Manku",
				"Arvind Jain",
				"Anish Das Sarma"
			],
			"url" :      "http://www2007.org/program/paper.php?id=215",
            "event" :    "Data Mining (3 of 5)",
			"abstract" : "Near-duplicate documents are  commonly found on the web.   A pair of near-duplicate  web pages  differ from  each other  in a  very small portion.   The differences  commonly consist  of  advertisements and timestamps. Such differences are  irrelevant for web search.  During web  crawling, it  is useful  to quickly  ascertain whether  a newly crawled web  page is  a near-duplicate of  a previously  crawled web page or not.<br /><br /> In the  course of developing  a practical system  for near-duplicate detection,   we  make   two  research   contributions.    First,  we demonstrate the effectiveness of Charikar\'s fingerprinting technique for  identifying  near-duplicate web  pages.   We  show  that for  8 billion   web-pages,  a   good  choice   of  parameters   is  64-bit fingerprints  and 3-bit  Hamming distances.   Second, we  present an algorithmic technique for  identifying existing f-bit fingerprints that differ from  a given fingerprint in at  most k bit-positions, for  small k.   Our technique  is useful  for both  online queries (single  fingerprints) and  batch  queries (multiple  fingerprints). Experimental evaluation over real  data confirms the practicality of our design."
		},
		{
			"label" :    "MyXDNS: A Request Routing DNS Server With Decoupled Server Selection",
			"type" :     "Paper",
			"author" :   [
				"Hussein Alzoubi",
				"Michael Rabinovich",
				"Oliver Spatscheck"
			],
			"url" :      "http://www2007.org/program/paper.php?id=748",
            "event" :    "Technology for Developing Regions (2 of 2)",
			"abstract" : "This paper presents the architecture and the preliminary evaluation of a request routing DNS server that decouples server selection from the rest of DNS functionality.  Our DNS server, which we refer to as MyXDNS, exposes well-defined APIs for uploading an externally computed  server selection policy and for interacting with an external network proximity service. With MyXDNS, researchers can explore their own network proximity metrics and request routing algorithms without having to worry about DNS internals. Furthermore, MyXDNS is based on open-source MyDNS and is available to public. Stress-testing of MyXDNS indicated that it achieves its flexibility at an acceptable cost: a single MyXDNS running on a low-level server can process 3000 req/sec with sub-millisecond response even in the presence of continuous updates to server selection policy."
		},
		{
			"label" :    "Spam Double-Funnel: Connecting Web Spammers with Advertisers",
			"type" :     "Paper",
			"author" :   [
				"Yi-Min Wang",
				"Ming Ma",
				"Yuan Niu",
				"Hao Chen"
			],
			"url" :      "http://www2007.org/program/paper.php?id=111",
            "event" :    "Industrial Practice and Experience (1 of 1)",
			"abstract" : "Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent SEO technique, redirection, where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, and identify prominent domains on each layer using two sets of commercial keywords \' one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages."
		},
		{
			"label" :    "On Anonymizing Query Logs via Token-based Hashing",
			"type" :     "Paper",
			"author" :   [
				"Ravi Kumar",
				"Jasmine Novak",
				"Bo Pang",
				"Andrew Tomkins"
			],
			"url" :      "http://www2007.org/program/paper.php?id=582",
            "event" :    "Security, Privacy, Reliability, and Ethics (1 of 3)",
			"abstract" : "In this paper we study the privacy preservation properties of a specific technique for query log anonymization: token-based hashing. In this approach, each query is tokenized, and then a secure hash function is applied to each token.  We show that statistical linguistic techniques may be applied to partially compromise the anonymization.  We then analyze the specific risks that arise from these partial compromises, focused on revelation of identity from unambiguous names, addresses, and so forth, and the revelation of facts associated with an identity that are deemed to be highly sensitive.  Our goal in this work is twofold: to show that token-based hashing is unsuitable for anonymization, and to present a concrete risk analysis framework for evaluating other proposals."
		},
		{
			"label" :    "DIANE - An Integrated Approach to Automated Service Discovery, Matchmaking and Composition",
			"type" :     "Paper",
			"author" :   [
				"Ulrich K%27ster",
				"Birgitta K%27nig-Ries",
				"Mirco Stern",
				"Michael Klein"
			],
			"url" :      "http://www2007.org/program/paper.php?id=57",
            "event" :    "Web Services (2 of 2)",
			"abstract" : "Automated matching of semantic service descriptions is the key to automatic service discovery and binding. But when trying to find a match for a certain request it may often happen, that the request cannot be serviced by a single offer but could be handled by combining existing offers. In this case automatic service composition is needed. Although automatic composition is an active field of research it is mainly viewed as a planning problem and treated separatedly from service discovery. In this paper we argue that an integrated approach to the problem is better than seperating these issues as is usually done. We propose an approach that integrates service composition into service discovery and matchmaking to match service requests that ask for multiple connected effects, discuss general issues involved in describing and matching such services and present an efficient algorithm implementing our ideas."
		},
		{
			"label" :    "Learning to Detect Phishing Emails",
			"type" :     "Paper",
			"author" :   [
				"Ian Fette",
				"Norman Sadeh-Koniecpol",
				"Anthony Tomasic"
			],
			"url" :      "http://www2007.org/program/paper.php?id=550",
            "event" :    "Security, Privacy, Reliability, and Ethics (2 of 3)",
			"abstract" : "Each month, more attacks are launched with the aim of making web users believe that they are communicating with a trusted entity for the purpose of stealing account information, logon credentials, and identity information in general. This attack method, commonly known as \"phishing,\" is most commonly initiated by sending out emails with links to spoofed websites that harvest information. We present a method for detecting these attacks, which in its most general form is an application of machine learning on a feature set designed to highlight user-targeted deception in electronic communication. This method is applicable, with slight modification, to detection of phishing websites, or the emails used to direct victims to these sites. We evaluate this method on a set of approximately 860 such phishing emails, and 6950 non-phishing emails, and correctly identify over 96% of the phishing emails while only mis-classifying on the order of 0.1% of the legitimate emails. We conclude with thoughts on the future for such techniques to specifically identify deception, specifically with respect to the evolutionary nature of the attacks and information available."
		},
		{
			"label" :    "Mapping-Driven XML Transformation",
			"type" :     "Paper",
			"author" :   [
				"Haifeng Jiang",
				"Howard Ho",
				"Lucian Popa",
				"Wook-Shin Han"
			],
			"url" :      "http://www2007.org/program/paper.php?id=504",
            "event" :    "XML and Web Data (1 of 2)",
			"abstract" : "Clio is an existing schema-mapping tool that provides user-friendly means to manage and facilitate the complex task of transformation and integration of heterogeneous data such as XML over the Web or in XML databases. By means of mappings from source to target schemas, Clio can help users conveniently establish the precise semantics of data transformation and integration. In this paper we study the problem of how to efficiently implement such data transformation (i.e., generating target data from the source data based on schema mappings). We present a three-phase framework for high-performance XML-to-XML transformation based on schema mappings, and discuss methodologies and algorithms for implementing these phases. In particular, we elaborate on novel techniques such as streamed extraction of mapped source values and scalable disk-based merging of overlapping data (including duplicate elimination). We compare our transformation framework with alternative methods such as using XQuery or SQL/XML provided by current commercial databases. The results demonstrate that the three-phase framework (although as simple as it is) is highly scalable and outperforms the alternative methods by orders of magnitude."
		},
		{
			"label" :    "Dynamic Personalized Pagerank in Entity-Relation Graphs",
			"type" :     "Paper",
			"author" :   "Soumen Chakrabarti",
			"url" :      "http://www2007.org/program/paper.php?id=324",
            "event" :    "Search (7 of 7)",
			"abstract" : "Extractors and taggers turn unstructured text into entity-relation (ER) graphs where nodes are entities (email, paper, person, conference, company) and edges are relations (wrote, cited, works-for). Typed proximity search of the form type=person NEAR company~\"IBM\", paper~\"XML\" is an increasingly useful search paradigm in ER graphs. Proximity search implementations either perform a Pagerank-like computation at query time, which is slow, or precompute, store and combine per-word Pageranks, which can be very expensive in terms of preprocessing time and space. We present HubRank, a new system for fast, dynamic, space-efficient proximity searches in ER graphs. During preprocessing, HubRank computes and indexes certain \"sketchy\" random walk fingerprints for a small fraction of nodes, carefully chosen using query log statistics. At query time, a small \"active\" subgraph is identified, bordered by nodes with indexed fingerprints. These fingerprints are adaptively loaded to various resolutions to form approximate personalized Pagerank vectors (PPVs). PPVs at remaining active nodes are now computed iteratively. We report on experiments with CiteSeer\'s ER graph and millions of real CiteSeer queries. Some representative numbers follow. On our testbed, HubRank preprocesses and indexes 52 times faster than whole-vocabulary PPV computation. A text index is 56 MB. Whole-vocabulary PPVs would consume 102 GB. If PPVs are truncated to 56 MB, precision compared to true Pagerank drops to 0.55; in contrast, HubRank has precision 0.91 at 63 MB. HubRank\'s average query time is 328 milliseconds; query-time Pagerank computation takes 11 seconds on average."
		},
		{
			"label" :    "Extraction and Search of Chemical Formulae in Text Documents on the Web",
			"type" :     "Paper",
			"author" :   [
				"Bingjun Sun",
				"Qingzhao Tan",
				"Prasenjit Mitra",
				"C. Lee Giles"
			],
			"url" :      "http://www2007.org/program/paper.php?id=100",
            "event" :    "E*-Applications (2 of 2)",
			"abstract" : "Often scientists seek to search for articles on the Web related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets back articles where the exact keyword string expressing the chemical formula is found.  Searching for the exact occurrence of keywords while searching results in  two problems for this domain: a) if the author searches for CH4 and the article has H4C, the article is not returned, and b) ambiguous searches like \"He\" return all documents where Helium is mentioned as well as documents where the pronoun \"he\" occurs. To remedy these deficiencies, we propose a chemical formula search engine. To build a chemical formula search engine, we must solve the following problems: (1) extract chemical formulae from text documents, (2) index chemical formulae, and (3) design a ranking function for articles where the chemical formulae occur. Furthermore, query models are introduced for formula search, and for each a scoring scheme based on features of partial formulae is proposed to measure the relevance of chemical formulae and queries. We evaluate algorithms for identifying chemical formulae in documents using a classification method based on Support Vector Machines (SVM), and a probabilistic model based on conditional random fields (CRF). Different methods for SVM and CRF to tune the trade-off between recall and precision for imbalanced data are proposed to improve the over-all performance. A feature selection method based on frequency and discrimination is used to remove uninformative and redundant features. Experiments show that our approaches of chemical formula extraction work well, especially after trade-off tuning. The results also demonstrate that feature selection can reduce the index size without changing the ranked query results much."
		},
		{
			"label" :    "PRIVE: Anonymous Location-Based Queries in Distributed Mobile Systems",
			"type" :     "Paper",
			"author" :   [
				"Gabriel Ghinita",
				"Panos Kalnis",
				"Spiros Skiadopoulos"
			],
			"url" :      "http://www2007.org/program/paper.php?id=223",
            "event" :    "Pervasive Web and Mobility (1 of 1)",
			"abstract" : "Nowadays, mobile users with positioning devices can access Location Based Services (LBS) and query about points of interest in their proximity. For such applications to succeed, privacy and confidentiality are essential. Encryption alone is not adequate; although it safeguards the system against eavesdroppers, the queries themselves may disclose the location and identity of the user. Recently, there have been proposed centralized architectures based on  k-Anonymity, which utilize an intermediate anonymizer between the mobile users and the LBS. However, the anonymizer must be updated continuously with the current locations of all users. Moreover, the complete knowledge of the entire system poses a security threat, if the anonymizer is compromised.<br /><br /> In this paper we address two issues: (i) We show that existing approaches may fail to provide spatial anonymity for some distributions of user locations and describe a novel technique which solves this problem. (ii) We propose PRIVE, a decentralized architecture for preserving the anonymity of users issuing spatial queries to LBSs. Mobile users self-organize into an overlay network with good fault tolerance and load balancing properties. PRIVE avoids the bottleneck caused by centralized techniques both in terms of anonymization and location updates. Moreover, the status is distributed in numerous users, rendering the system resilient to attacks. Extensive experimental studies suggest that PRIVE is applicable to real-life scenarios with large populations of mobile users."
		},
		{
			"label" :    "Wherefore Art Thou R3579X?  Anonymized Social Networks, Hidden Patterns, and Structural Steganography",
			"type" :     "Paper",
			"author" :   [
				"Lars Backstrom",
				"Cynthia Dwork",
				"Jon Kleinberg"
			],
			"url" :      "http://www2007.org/program/paper.php?id=232",
            "event" :    "Data Mining (5 of 5)",
			"abstract" : "In a social network, nodes correspond to people or other social entities, and edges correspond to social links between them.  In an effort to preserve privacy, the practice of anonymization replaces names with meaningless unique identifiers.  We describe a family of schemes such that even from a single anonymized copy of a social network, it is possible for an adversary to learn whether edges exist or not between specific targeted pairs of nodes."
		},
		{
			"label" :    "Google News Personalization: Scalable Online Collaborative Filtering",
			"type" :     "Paper",
			"author" :   [
				"Abhinandan Das",
				"Mayur Datar",
				"Ashutosh Garg",
				"Shyam Rajaram"
			],
			"url" :      "http://www2007.org/program/paper.php?id=570",
            "event" :    "Industrial Practice and Experience (1 of 1)",
			"abstract" : "Several approaches to collaborative filtering have been studied but seldom have the studies been reported for large (several millions of users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering,  Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptible for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News."
		},
		{
			"label" :    "Homepage Live: Automatic Block Tracing for Web Personalization",
			"type" :     "Paper",
			"author" :   [
				"Jie Han",
				"Hua-Jun Zeng",
				"Dingyi Han",
				"Chenxi Lin",
				"Zheng Chen",
				"Yong Yu"
			],
			"url" :      "http://www2007.org/program/paper.php?id=615",
            "event" :    "Browsers and User Interfaces (1 of 2)",
			"abstract" : "The emergence of personalized homepage providers, e.g. Google Homepage and Microsoft Windows Live, has enabled Web users to select interesting Web contents and to aggregate them in a single Web page. The interesting web contents are usually some specific blocks of Web pages rather than a whole Web page. To satisfy the users\' requirements, personalized homepage providers predefine a lot of candidate content blocks for users to composite their own homepages. However, it requires tremendous manual efforts to define the content blocks and the coverage is still very limited. In this paper, we propose a novel personalized homepage system, called \'Homepage Live\', to allow users to use drag-and-drop action to collect their favorite Web content blocks and organize them in a single page. Moreover, Homepage Live can also automatically trace the changes of blocks with the evolvement of the containing pages by measuring the tree edit distance of the selected blocks. Besides, by exploiting the immutable elements of Web pages, the tracing algorithm performance is significant improved. Experimental results demonstrate the effectiveness and efficiency of our proposed algorithm."
		},
		{
			"label" :    "Towards Expressive Syndication on the Web",
			"type" :     "Paper",
			"author" :   [
				"Christian Halaschek-Wiener",
				"James Hendler"
			],
			"url" :      "http://www2007.org/program/paper.php?id=558",
            "event" :    "Semantic Web (2 of 5)",
			"abstract" : "Syndication systems on the Web have attracted vast amounts of attention in recent years. As technologies have emerged and matured, there has been a transition to more expressive syndication approaches; that is subscribers and publishers are provided with more expressive means of describing their interests and published content, enabling more accurate information filtering. In this paper, we formalize a syndication architecture that utilizes expressive Web ontologies and logic-based reasoning for selective content dissemination. This provides finer grained control for filtering and automated reasoning for discovering implicit subscription matches, both of which are not achievable in less expressive approaches. We then address one of the main limitations with such a syndication approach, namely matching newly published information with subscription requests in an efficient and practical manner. To this end, we investigate continuous query answering for a large subset of the Web Ontology Language (OWL); specifically, we formally define continuous queries (i.e., subscriptions) for OWL knowledge bases and present a novel algorithm for continuous query answering in a large subset of this language. Lastly, an evaluation of the query approach is shown, demonstrating its effectiveness for syndication purposes."
		},
		{
			"label" :    "DETECTIVES: DETEcting Coalition hiT Inflation attacks in adVertising nEtworks Streams",
			"type" :     "Paper",
			"author" :   [
				"Ahmed Metwally",
				"Divyakant Agrawal",
				"Amr El Abbadi"
			],
			"url" :      "http://www2007.org/program/paper.php?id=70",
            "event" :    "E*-Applications (2 of 2)",
			"abstract" : "Click fraud is jeopardizing the industry of Internet advertising. Internet advertising is crucial for the thriving of the entire Internet, since it allows producers to advertise their products, and hence contributes to the well being of e-commerce. Moreover, advertising supports the intellectual value of the Internet by covering the running expenses of the content publishers\' sites. Some publishers are dishonest, and use automation to generate traffic to defraud the advertisers. Similarly, some advertisers automate clicks on the advertisements of their competitors to deplete their competitors\' advertising budgets. This paper describes the advertising network model, and focuses on the most sophisticated type of fraud, which involves coalitions among fraudsters. We build on several published theoretical results to devise the Similarity-Seeker algorithm that discovers coalitions made by pairs of fraudsters. We then generalize the solution to coalitions of arbitrary sizes. Before deploying our system on a real network, we conducted comprehensive experiments on data samples for proof of concept. We detected numerous coalitions that span numerous sites. Interestingly, 93% of the discovered sites were real fraudsters."
		},
		{
			"label" :    "Visibly Pushdown Automata for Streaming XML",
			"type" :     "Paper",
			"author" :   [
				"Viraj Kumar",
				"Madhusudan Parthasarathy",
				"Mahesh Viswanathan"
			],
			"url" :      "http://www2007.org/program/paper.php?id=788",
            "event" :    "XML and Web Data (1 of 2)",
			"abstract" : "We propose the study of visibly pushdown automata (VPA) for processing XML documents.  VPAs are pushdown automata where the input determines the stack operation, and XML documents are naturally visibly pushdown with the VPA pushing onto the stack on open-tags and popping the stack on close-tags.  In this paper we demonstrate the power and ease visibly pushdown automata give in the design of streaming algorithms for XML documents.<br /><br /> We study the problems of type-checking streaming XML documents against SDTD schemas, and the problem of typing tags in a streaming XML document according to an SDTD schema.  For the latter problem, we consider both pre-order typing and post-order typing of a document, which  dynamically determines types at open-tags and close-tags respectively as soon as they are met. We also generalize the problems of pre-order and post-order typing to prefix querying. We show that a deterministic VPA yields an algorithm to the problem of answering in one pass the set of all answers to any query that has the property that a node satisfying the query is determined solely by the prefix leading to the node. All the streaming algorithms we develop in this paper are based on the construction of deterministic VPAs, and hence, for any fixed problem, the algorithms process each element of the input in constant time, and use space O(d), where d is the depth of the document."
		},
		{
			"label" :    "CANTINA: A Content-Based Approach to Detecting Phishing Web Sites",
			"type" :     "Paper",
			"author" :   [
				"Yue Zhang",
				"Jason Hong",
				"Lorrie Cranor"
			],
			"url" :      "http://www2007.org/program/paper.php?id=557",
            "event" :    "Security, Privacy, Reliability, and Ethics (2 of 3)",
			"abstract" : "Phishing is a significant problem involving fraudulent email and web sites that trick unsuspecting users into revealing private information. In this paper, we present the design, implementation, and evaluation of CANTINA, a novel, content-based approach to detecting phishing web sites, based on the well-known TF-IDF algorithm used in information retrieval. We also discuss the design and evaluation of several heuristics we developed to reduce our false positive rates. Our experiments show that CANTINA is good at detecting phishing sites, correctly labeling approximately 95% of phishing sites."
		},
		{
			"label" :    "Learning Information Intent via Observation",
			"type" :     "Paper",
			"author" :   [
				"Anthony Tomasic",
				"Isaac Simmons",
				"John Zimmerman"
			],
			"url" :      "http://www2007.org/program/paper.php?id=586",
            "event" :    "Browsers and User Interfaces (2 of 2)",
			"abstract" : "Users in an organization frequently request help by sending request messages to assistants that express an information intent: an intention to update an information system. Assistants spend a significant amount of time and effort processing these messages. For example, human resource assistants process requests to update personnel records, and executive assistants process requests to schedule conference rooms or to make travel reservations. To process the intent of a message, assistants read the message and then locate, complete, and submit a form that corresponds to the expressed intent . Automatically or semi-automatically processing the intent of a message on behalf of an assistant would ease the mundane and repetitive nature of this kind of work. For a well-understood domain, a straightforward application of natural language processing techniques can be used to build an intelligent form interface to semi-automatically process information intents. However, high performance parsers are based on machine learning algorithms that require a large collection of messages that have been labeled by an expert. The generation of a labeled corpus of messages is a major barrier to the construction of a parser. In this paper, we investigate the construction of a natural language processing system and an intelligent form system that observes an assistant processing a request. The intelligent form system then generates a labeled training corpus by interpreting the observations. This paper reports on the measurement of the performance of the machine learning algorithms based on real data. The combination of observations, machine learning and interaction design produces an effective intelligent form interface based on natural language processing."
		},
		{
			"label" :    "Robust Methodologies for Modeling Web Click Distributions",
			"type" :     "Paper",
			"author" :   [
				"Kamal Ali",
				"Mark Scarr"
			],
			"url" :      "http://www2007.org/program/paper.php?id=56",
            "event" :    "Search (5 of 7)",
			"abstract" : "Metrics such as click counts are vital to online businesses but their measurement has been problematic due to inclusion of high variance robot traffic. We posit that by applying statistical methods more rigorous than have been employed to date that we can build a robust model of the distribution of clicks following which we can set probabilistically sound thresholds to address outliers and robots. Prior research in this domain has used inappropriate statistical methodology to model distributions and current industrial practice eschews this research for conservative ad-hoc click-level thresholds. Prevailing belief is that such distributions are scale-free power law distributions but using more rigorous statistical methods we find the best description of the data is instead provided by a scale-sensitive Zipf-Mandelbrot mixture distribution. Our results are based on ten datasets from various verticals in the Yahoo domain. Since mixture models can overfit the data we take care to use the BIC log-likelihood method which penalizes overly complex models. Using a mixture model in the web activity domain makes sense because there are likely multiple classes of users. In particular, we have noticed that there is a significantly large set of ``users\'\' that visit the Yahoo portal exactly once a day. We surmise these may be robots testing internet connectivity by pinging the Yahoo main website.<br /><br /> Backing up our quantitative analysis is graphical analysis in which empirical distributions are plotted against theoretical distributions in log-log space using robust cumulative distribution plots. This methodology has two advantages: plotting in log-log space allows one to visually differentiate the various exponential distributions and secondly, cumulative plots are much more robust to outliers. We plan to use the results of this work for applications for robot removal from web metrics business intelligence systems."
		},
		{
			"label" :    "Internet-Scale Collection of Human Reviewed Data",
			"type" :     "Paper",
			"author" :   [
				"Qi Su",
				"Dmitry Pavlov",
				"Jyh-Herng Chow",
				"Wendell Baker"
			],
			"url" :      "http://www2007.org/program/paper.php?id=461",
            "event" :    "E*-Applications (1 of 2)",
			"abstract" : "Enterprise data processing and content aggregation systems often require extensive use of human reviewed data (e.g. for training and monitoring machine learning-based applications). Today these needs are often met by in-house efforts or offshore contracting. Emerging applications attempt to provide automation for human reviewed data collection at Internet-scale. We conduct extensive experiments to study the effectiveness of one such application. We also study the feasibility of using Yahoo! Answers, a general question-answering forum, for human review data collection."
		},
		{
			"label" :    "Dynamics of bid optimization in online advertisement auctions",
			"type" :     "Paper",
			"author" :   [
				"Christian Borgs",
				"Jennifer Chayes",
				"Omid Etesami",
				"Nicole Immorlica",
				"Kamal Jain",
				"Mohammad Mahdian"
			],
			"url" :      "http://www2007.org/program/paper.php?id=89",
            "event" :    "Search (5 of 7)",
			"abstract" : "We consider the problem of online keyword advertising auctions among multiple bidders with limited budgets, and study a natural bidding heuristic in which advertisers attempt to optimize their utility by equalizing their return-on-investment across all keywords.   We show that existing auction mechanisms combined with this heuristic can experience cycling (as has been observed in many current systems), and therefore propose a modified class of mechanisms with small random perturbations. This perturbation is reminiscent of the small time-dependent perturbations employed in the dynamical systems literature to convert many types of chaos into attracting motions. We show that the perturbed mechanism provably converges in the case of first-price auctions and experimentally converges in the case of second-price auctions.  Moreover, the point of convergence has a natural economic interpretation as the unique market equilibrium in the case of first-price mechanisms.  In the case of second-price auctions, we conjecture that it converges to the ``supply-aware\'\' market equilibrium. Thus, our results can be alternatively described as a tatonnement process for convergence to market equilibrium in which prices are adjusted on the side of the buyers rather than the sellers. We also observe that perturbation in mechanism design is useful in a broader context: In general, it can allow bidders to ``share\'\' a particular item, leading to stable allocations and pricing for the bidders, and improved revenue for the auctioneer."
		},
		{
			"label" :    "Optimized Query Planning of Continuous Aggregation Queries in Dynamic Data Dissemination Networks",
			"type" :     "Paper",
			"author" :   [
				"Rajeev Gupta",
				"Krithi Ramamritham"
			],
			"url" :      "http://www2007.org/program/paper.php?id=216",
            "event" :    "Performance and Scalability (1 of 2)",
			"abstract" : "Continuous queries are used to monitor changes to time varying data and to provide results useful for online decision making. Typically a user desires to obtain the value of some aggregation function over distributed data items, for example, to know (a) the average of temperatures sensed by a set of sensors (b) the value of index of mid-cap stocks. In these queries a client specifies a coherency or accuracy requirement as part of the query. In this paper we present a low-cost, scalable technique to answer continuous aggregation queries using a content distribution network of dynamic data items. In such a network of data aggregators, each data aggregator serves a set of data items at specific coherencies. Just as various fragments of a dynamic web-page are served by one or more nodes of a CDN, our technique involves decomposing a client query into sub-queries and executing sub-queries on judiciously chosen data aggregators For executing an incoherency bounded continuous query, a query plan is required which includes the set of sub-queries, their individual incoherency bounds and data aggregators which can execute these sub-queries. An optimal query execution plan should satisfy client query\'s coherency requirement with least cost, measured in terms of the number of refresh messages sent from aggregators to the client. For estimating query execution cost, we build a continuous query cost model which can be used to estimate the number of messages required to satisfy the client specified incoherency bound. Performance results using real-world traces show that our cost based query planning leads to queries being executed using less than one third the number of messages required by existing schemes."
		},
		{
			"label" :    "Integrating Value-based Requirement Engineering Models to WebML using VIP Business Modeling Framework",
			"type" :     "Paper",
			"author" :   [
				"Farooque Azam",
				"Zhang Li",
				"Rashid Ahmad"
			],
			"url" :      "http://www2007.org/program/paper.php?id=395",
            "event" :    "Web Engineering (1 of 2)",
			"abstract" : "Requirement engineering is emerging as an increasingly important discipline for supporting Web application development, as these are designed to satisfy diverse stakeholder needs, additional functional, information, multimedia and usability requirements as compared to traditional software applications. Moreover, when considering innovative e-commerce applications, value-based requirements engineering is an extremely relevant methodology which exploits the concept of economic value during the requirements engineering activity. In contrast, most of the methodologies proposed for the development of Web applications, primarily focus on the system design, and paying less attention to the requirements engineering, and specifically to value-based requirement engineering. Focusing this aspect, the paper presents integration of value-based requirement engineering models to WebML models using our recently proposed VIP Business Modeling Framework. The integration process is demonstrated using a well-known e-commerce application example by first presenting example VIP business models and then deriving WebML process, structural and other models from these business models."
		},
		{
			"label" :    "Domain Independent Information Extraction from Web Tables",
			"type" :     "Paper",
			"author" :   [
				"Wolfgang Gatterbauer",
				"Paul Bohunsky",
				"Marcus Herzog",
				"Bernhard Kr%27pl",
				"Bernhard Pollak"
			],
			"url" :      "http://www2007.org/program/paper.php?id=790",
            "event" :    "Data Mining (1 of 5)",
			"abstract" : "Traditionally, information extraction from web tables has focused on small more or less homogeneous corpora, often based on assumptions about the semantics and use of <table> tags. A multitude of implementation forms of tables render these approaches difficult to scale. In this paper, we address the problem of domain-independent information extraction from web tables by shifting the focus from the tree-based representation of web pages to the 2-dimensional representation as intended by human authors for human readers. This additional visual information would allow us to fill the gap between syntax and domain dependent semantic. We show that this approach gives us a new set of features, which allow each of the steps of table location, recognition and interpretation to work without any reliance on domain-specific knowledge or domain-specific table templates."
		},
		{
			"label" :    "GlobeTP: Template-Based Database Replication for Scalable Web Applications",
			"type" :     "Paper",
			"author" :   [
				"Tobias Groothuyse",
				"Swaminathan Sivasubramanian",
				"Guillaume Pierre"
			],
			"url" :      "http://www2007.org/program/paper.php?id=308",
            "event" :    "Performance and Scalability (1 of 2)",
			"abstract" : "Generic database replication algorithms do not scale linearly in throughput as they require to apply all update, deletion and insertion (UDI) queries to every database replica. The throughput is therefore limited to the point where the number of UDI queries alone is sufficient to overload one server. In such scenarios, partial replication of a database can help, as update queries are executed only by a subset of all servers. In this paper we propose GlobeTP, a system that employs partial replication to improve database throughput. GlobeTP exploits the fact that a Web application\'s query workload is composed of a small set of read and write templates. Using knowledge of these templates and their respective execution costs, GlobeTP provides database table placements that produce significant improvements in database throughput. We demonstrate the efficiency of this technique using two different industry standard benchmarks. In our experiments, GlobeTP increases the throughput by 57% to 150% compared to full replication, while using identical hardware configuration. Furthermore, adding a single query cache improves the throughput by another 30% to 60%."
		},
		{
			"label" :    "Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds",
			"type" :     "Paper",
			"author" :   "Marius Pasca",
			"url" :      "http://www2007.org/program/paper.php?id=560",
            "event" :    "Data Mining (2 of 5)",
			"abstract" : "As part of a large effort to acquire large repositories of facts from unstructured text on the Web, a seed-based framework for textual information extraction allows for weakly supervised extraction of class attributes (e.g., \"side effects\" and \"generic equivalent\" for drugs) from anonymized query logs. The extraction is guided by a small set of seed attributes, without any need for handcrafted extraction patterns or further domain-specific knowledge. The attributes of classes pertaining to various domains of interest to Web search users have accuracy levels significantly exceeding current state of the art. Inherently noisy search queries are shown to be a highly valuable, albeit unexplored, resource for Web-based information extraction, for the task of class attribute extraction as well as for named entity discovery."
		},
		{
			"label" :    "Navigation-Aided Retrieval",
			"type" :     "Paper",
			"author" :   [
				"Shashank Pandit",
				"Christopher Olston"
			],
			"url" :      "http://www2007.org/program/paper.php?id=162",
            "event" :    "Search (1 of 7)",
			"abstract" : "Users searching for information in hypermedia environments often perform querying followed by manual navigation. Yet, the conventional text/hypertext retrieval paradigm does not explicity take post-query navigation into account. This paper proposes a new retrieval paradigm, called navigation-aided retrieval (NAR), which treats both querying and navigation as first-class activities. In the NAR paradigm, querying is seen as a means to identify starting points for navigation, and navigation is guided based on information supplied in the query. NAR is a generalization of the conventional probabilistic information retrieval paradigm, which implicitly assumes no navigation takes place.<br /><br /> This paper presents a formal model for navigation-aided retrieval, and reports empirical results that point to the real-world applicability of the model. The experiments were performed over a large Web corpus provided by TREC, using human judgments on a new rating scale developed for navigation-aided retrieval. In the case of ambiguous queries, the new retrieval model identifies good starting points for post-query navigation. For less ambiguous queries that need not be paired with navigation, the output closely matches that of a conventional retrieval system."
		},
		{
			"label" :    "Compiling Cryptographic Protocols for Deployment on the Web",
			"type" :     "Paper",
			"author" :   [
				"Jay McCarthy",
				"Joshua D. Guttman",
				"John D. Ramsdell",
				"Shriram Krishnamurthi"
			],
			"url" :      "http://www2007.org/program/paper.php?id=140",
            "event" :    "Security, Privacy, Reliability, and Ethics (3 of 3)",
			"abstract" : "Cryptographic protocols are useful for trust engineering in Web transactions.  The Cryptographic Protocol Programming Language (CPPL) provides a model wherein trust management annotations are attached to protocol actions, and are used to constrain the behavior of a protocol participant to be compatible with its own trust policy.<br /><br /> The first implementation of CPPL generated stand-alone, single-session servers, making it unsuitable for deploying protocols on the Web. We describe a new compiler that uses a constraint-based analysis to produce multi-session server programs.  The resulting programs run without persistent TCP connections for deployment on traditional Web servers.  Most importantly, the compiler preserves existing proofs about the protocols.  We present an enhanced version of the CPPL language, discuss the generation and use of constraints, show their use in the compiler, formalize the preservation of properties, present subtleties, and outline implementation details."
		},
		{
			"label" :    "Analysis of Topological Characteristics of Huge Online Social Networking Services",
			"type" :     "Paper",
			"author" :   [
				"Yong-Yeol Ahn",
				"Seungyeop Han",
				"Haewoon Kwak",
				"Sue Moon",
				"Hawoong Jeong"
			],
			"url" :      "http://www2007.org/program/paper.php?id=676",
            "event" :    "Semantic Web (5 of 5)",
			"abstract" : "Social networking services are a fast-growing business on the Internet. However, it is unknown if online relationships and their growth patterns are the same as in real-life social networks. In this paper, we compare the structures of three online social networking services: Cyworld, MySpace, and orkut, each with more than 10 million users. We have access to complete data of Cyworld's ilchon (friend) relationships and analyze its degree distribution, clustering property, degree correlation, and evolution over time. We also use Cyworld data to evaluate the validity of the snowball sampling method, which we use to crawl and obtain partial network topologies of MySpace and orkut. Cyworld, the oldest of the three, demonstrates a changing scaling behavior over time in degree distribution. The latest Cyworld data's degree distribution exhibits a multi-scaling behavior, while those of MySpace and orkut have simple scaling behaviors with different exponents. Very interestingly, each of the two exponents corresponds to the different segments in Cyworld's degree distribution. Certain online social networking services encourage online activities that cannot be easily copied in real life; we show that they deviate from close-knit online social networks which show a similar degree correlation pattern to real-life social networks."
		},
		{
			"label" :    "A Fault Model and Mutation Testing of Access Control Policies",
			"type" :     "Paper",
			"author" :   [
				"Evan Martin",
				"Tao Xie"
			],
			"url" :      "http://www2007.org/program/paper.php?id=447",
            "event" :    "Security, Privacy, Reliability, and Ethics (3 of 3)",
			"abstract" : "To increase confidence in the correctness of specified policies, policy developers can conduct policy testing by supplying typical test inputs (requests) and subsequently checking test outputs (responses) against expected ones.  Unfortunately, manual testing is tedious and few tools exist for automated testing of XACML policies.<br /><br /> We present a fault model for access control policies and a framework to explore it. The framework includes mutation operators used to implement the fault model, mutant generation, equivalent-mutant detection, and mutant-killing determination. This framework allows us to investigate our fault model, evaluate coverage criteria for test generation and selection, and determine a relationship between structural coverage and fault-detection effectiveness. We have implemented the framework and applied it to various XACML policies. Our experimental results offer valuable insights into choosing mutation operators in mutation testing and choosing coverage criteria in test generation and selection."
		},
		{
			"label" :    "The Two Cultures: Mashing Up Web 2.0 and the Semantic Web (position paper)",
			"type" :     "Paper",
			"author" :   [
				"Anupriya Ankolekar",
				"Markus Krötzsch",
				"Thanh Tran",
				"Denny Vrandecic"
			],
			"url" :      "http://www2007.org/program/paper.php?id=777",
            "event" :    "Semantic Web (5 of 5)",
			"abstract" : "There is a common perception that there are two competing visions about the future evolution of the Web: the Semantic Web and Web 2.0 visions. We believe that the technologies and core strengths of these visions are complementary, rather than in competition. In fact, both technologies need each other in order to scale beyond their own strongholds. The Semantic Web can learn from Web 2.0's focus on community and interactivity, while Web 2.0 can draw from Semantic Web infrastructure to facilitate things like mashups.<br /><br /> In order to demonstrate the complementarity of the two ideas, we outline a 'semantic' weblogs vision in this paper that enhances a Web 2.0 scenario with semantic technologies. This vision can be realized in the short-term and shows clear advantages of the Semantic Web in the open field. We discuss the challenges and highlight open issues that arise from this vision, such as caching in the presence of dynamically generated pages. As a result, we outline research questions and implementation issues that will be of high relevance in the near future and finish with some thoughts about the future web ecosystem."
		}
	],
	"types" :      {
		"Paper" : {
			"pluralLabel" : "Papers"
		}
	},
	"properties" : {
		"author" :   {
			"label" :       "author",
			"pluralLabel" : "authors",
			"valueType" :   "item"
		},
		"event" : {
			"valueType" :   "item"
		},
		"url" : {
			"valueType" :   "url"
		},
		"abstract" : {
		}
	}
}