Emiran Curtmola - Home Page @ University of California San Diego

About me

Emiran Curtmola is currently a Teradata Database Query Optimizer Engineer. He earned his Ph.D. in Computer Science from UC San Diego, where he was part of the UCSD Database group and was affiliated with the CNS center, collaborating with Alin Deutsch and Yannis Papakonstantinou.

Research Interests
My research lies primarily in foundational aspects of databases at their intersection with information retrieval and distributed information systems. My current focus is on query optimization, unstructured data management, search (XML full-text, algorithms and systems), XML technologies, web-scale data integration and exchange, Semantic Web, distributed and P2P computing, and data privacy.

Professional Service
  · Program committee member: EDBT 2010, DEXA (2009, 2010, 2011), DTA (2009, 2010), ADC (2011)

  · External conference reviewer: VLDB'08, VLDB PhD Workshop 2009

  · Teaching Assistant at UC San Diego:
    Database System Applications (CSE132B) - Fall 2003, Spring 2008, Spring 2009
    Server-side Web Applications (CSE135) - Spring 2009

  · Teaching Assistant at Polytechnic University of Bucharest, Romania:
    Data Structures and Algorithm Analysis, Fundamentals of Computer Graphics, Switching Theory and Logical Design, Numerical Calculus

Internships

  · IBM Almaden Research Center, USA, 2007-2008
    Mentors: Fatma Özcan, Andrey Balmin

  · AT&T Labs Research, USA, 2004-2006
    Mentors: Sihem Amer-Yahia, Divesh Srivastava

  · Infineon Technologies AG, Germany, 2002-2003
    Mentors: Raik Brinkmann, Hermann Ilmberger

Background
  · Ph.D. from University of California San Diego
    Computer Science and Engineering Department

    Thesis: Democratic Community-based Search with XML Full-Text Queries

Publication Abstract

As the web evolves, it is becoming easier to form online communities based on shared interests, and to create and publish data on a wide variety of topics. With this democratization of information creation, it is natural to query, in an ad-hoc and expressive fashion, the global collection that is the union of all local data collections of others within the community. In order to publish and locate documents of interest while fully delivering on the promise of free data exchange, any community-supporting infrastructure needs to enforce the key requirement to preserve privacy of the association of content providers with potentially sensitive information. This privacy-preserving publishing requirement prevents censorship, harassment, or discrimination of users by third parties. It also precludes some obvious approaches that reuse and build on existing centralized technologies, including search engines and hosted online communities.

This dissertation facilitates democratization of data publishing and efficient search with powerful full-text queries over the community global collection by means of a novel distributed framework that disseminates queries in online communities. We address two challenging issues that arise in this context: the design of distributed access methods to publishers and the evaluation of expressive queries (i.e., XML full-text) locally at the publisher thereof.

First, given the virtual nature of the global data collection, we study the problem of efficiently discovering publishers in the community that contain documents matching a user query. We call such peers relevant publishers. We propose a novel distributed infrastructure in which data resides only with the publishers owning it. The infrastructure disseminates user queries to publishers, who answer them at their own discretion, under data-location anonymity constraints. That is, the query-forwarding infrastructure prevents leaking information about which publishers are capable of answering a certain query.

Second, once queries reach relevant publishers, we study how they efficiently process the incoming queries over their local repositories. Given that the commonly used data model for information exchange on the Web is semi-structured (e.g., XML), we propose algorithms for the evaluation and optimization of expressive XML queries that integrate structured and full-text search, including the W3C XQuery Full-Text standard.

  · M.S. from University of California San Diego
    Computer Science and Engineering Department

  · B.S. from Polytechnic University of Bucharest, Romania
    Computer Science and Engineering Department

Selected Talks

XML Distributed Retrieval - In "Ranked XML Querying" at Dagstuhl, Germany, 2008
Check out the report paper at DB & IR Integration: Report on the Dagstuhl Seminar "Ranked XML Querying"
A Logical Framework for XML Full-Text Search - AT&T Labs Research, 2005
GalaTex, an XML Full-Text Search Engine - AT&T Labs Research, 2004

Papers in Conferences and Workshops

WikiAnalytics: Disambiguation of Keyword Search Results on Highly Heterogeneous Structured Data
      In International Workshop on the Web and Databases. WebDB 2010
      Andrey Balmin and Emiran Curtmola

WikiAnalytics
      IBM Research Report RJ10466, May 2010
      Andrey Balmin and Emiran Curtmola

Publication Abstract

Wikipedia infoboxes are an example of a seemingly structured, yet extraordinarily heterogeneous dataset, where any given record has only a tiny fraction of all possible fields. Such data cannot be queried using traditional means without a massive a priori integration effort, since even for a simple request the result values span many record types and fields. On the other hand, solutions based on keyword search are too imprecise to capture the user's intent exactly.

To address these limitations, we propose the WikiAnalytics system, which uses a novel search paradigm to derive tables of precise and complete results from Wikipedia infobox records. The user starts with a keyword search that finds a superset of the result records, and then browses the clusters of records, deciding which are and are not relevant. WikiAnalytics uses three categories of clustering features, based respectively on the record types, fields, and values that matched query keywords. Since the system cannot predict which combination of features will be important to the user, it efficiently generates all possible clusters of records by all sets of features. We utilize a novel data structure, the universal navigational lattice (UNL), that compactly encodes all possible clusters. WikiAnalytics provides a dynamic and intuitive interface that lets the user explore the UNL and construct homogeneous structured tables, which can be further queried and aggregated using conventional tools.
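
As a rough illustration of clustering by "all sets of features", the Python sketch below groups records under every combination of the type, field, and value features where a keyword hit. It is not the WikiAnalytics implementation: the record layout and the names `hit_features` and `build_lattice` are assumptions made for this example, and the naive subset enumeration stands in for the compact UNL encoding, which the abstract does not detail.

```python
from itertools import combinations


def hit_features(record: dict, keywords: set[str]) -> frozenset:
    """Features describing where the (lowercase) keywords hit in one record."""
    feats = set()
    for fld, value in record["fields"].items():
        if any(kw in str(value).lower() for kw in keywords):
            feats.add(("type", record["type"]))   # record-type feature
            feats.add(("field", fld))             # field feature
            feats.add(("value", str(value)))      # value feature
    return frozenset(feats)


def build_lattice(records: list[dict], keywords: set[str]) -> dict:
    """Map each feature set to the ids of the records exhibiting all of it.

    Hypothetical, uncompressed stand-in for a navigational lattice:
    feature sets are ordered by inclusion, and a larger set selects a
    smaller, more homogeneous cluster.
    """
    lattice: dict[frozenset, set] = {}
    for rec in records:
        feats = sorted(hit_features(rec, keywords))
        for size in range(1, len(feats) + 1):
            for combo in combinations(feats, size):
                lattice.setdefault(frozenset(combo), set()).add(rec["id"])
    return lattice
```

Looking up a single feature such as `("field", "name")` returns every record whose `name` field matched, while adding a type feature to the key narrows the cluster to one record type.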

Load-Balanced Query Dissemination in Privacy-Aware Online Communities
      In ACM SIGMOD International Conference on Management of Data. SIGMOD 2010
      Emiran Curtmola, Alin Deutsch, K.K. Ramakrishnan, and Divesh Srivastava

Censorship-resistant Publishing
      Technical Report CS2010-0956, UC San Diego, March 2010
      Emiran Curtmola, Alin Deutsch, K.K. Ramakrishnan, and Divesh Srivastava

Publication Abstract

As the web evolves, it is becoming easier to form communities based on shared interests, and to create and publish data on a wide variety of topics. With this democratization of information creation comes the natural desire to make one's data accessible for querying within the community and also be able to query the global collection that is the union of all local data collections of others within the community. In order to fully deliver on the promise of free data exchange, any community-supporting infrastructure needs to enforce the key requirement to preserve privacy of the association of content providers with potentially sensitive published information. This privacy-preserving publishing requirement prevents censorship, harassment, or discrimination of users by third parties. It also precludes some obvious approaches that reuse and build on existing centralized technologies, e.g., search engines, hosted online communities, etc.

We propose a novel privacy-preserving distributed infrastructure in which data resides only with the publishers owning it. The infrastructure disseminates user queries to publishers, who answer them at their own discretion. The infrastructure enforces a publisher k-anonymity guarantee, which prevents leakage of information about which publishers are capable of answering a certain query.

Given the virtual nature of the global data collection, we study the challenging problem of efficiently locating publishers in the community that contain data items matching a specified query. We propose a distributed index structure, UQDT, that is organized as a union of Query Dissemination Trees (QDTs), and realized on an overlay (i.e., logical) network infrastructure. Each QDT has data publishers as its leaf nodes, and overlay network nodes as its internal nodes; each internal node routes queries to publishers, based on a summary of the data advertised by publishers in its subtrees. We experimentally evaluate design tradeoffs, and demonstrate that UQDT can maximize throughput by preventing any overlay network node from becoming a bottleneck.
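
The routing idea behind a single QDT can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the UQDT implementation: the class names `Publisher` and `OverlayNode`, the use of plain term sets as summaries, and the "summary must cover every query term" rule are all invented for this example, and the sketch ignores the k-anonymity guarantee and load balancing that the paper addresses.

```python
from dataclasses import dataclass, field


@dataclass
class Publisher:
    """Leaf node: a community member that keeps its own documents."""
    name: str
    terms: set[str]  # terms the publisher chooses to advertise (assumed format)

    def summary(self) -> set[str]:
        return self.terms

    def route(self, query: set[str]) -> list[str]:
        # The query has arrived; the publisher may answer at its own discretion.
        return [self.name]


@dataclass
class OverlayNode:
    """Internal node: forwards queries based on summaries of its subtrees."""
    children: list = field(default_factory=list)

    def summary(self) -> set[str]:
        # A subtree's summary is the union of what is advertised below it.
        out: set[str] = set()
        for child in self.children:
            out |= child.summary()
        return out

    def route(self, query: set[str]) -> list[str]:
        reached: list[str] = []
        for child in self.children:
            # Forward only into subtrees whose summary covers every query term.
            if query <= child.summary():
                reached.extend(child.route(query))
        return reached
```

Calling `route` on the root returns the names of the publishers the query reaches. Because an internal summary is a union, a subtree can be entered even when no single publisher below it matches; such false positives are filtered at the last hop in this sketch.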

WikiAnalytics: Ad-hoc Querying of Highly Heterogeneous Structured Data
In International Conference on Data Engineering. ICDE 2010 Demonstration
Andrey Balmin and Emiran Curtmola

Publication Abstract

Searching and extracting meaningful information out of highly heterogeneous datasets is a hot topic that has received a lot of attention. However, existing solutions fall at two extremes. At one extreme, they rely on rigid, complex query languages (e.g., SQL, XQuery/XPath), which are hard to use without full schema knowledge and an expert user, and which require up-front data integration. At the other extreme, they employ keyword search queries over relational databases and semistructured data, which are too imprecise to capture the user's intent exactly.

To address these limitations, we propose an alternative search paradigm in order to derive tables of precise and complete results from a very sparse set of heterogeneous records. Our approach allows users to disambiguate search results by navigation along conceptual dimensions that describe the records. Therefore, we cluster documents based on fields and values that contain the query keywords. We build a universal navigational lattice (UNL) over all such discovered clusters. Conceptually, the UNL encodes all possible ways to group the documents in the data corpus based on where the keywords hit.

We describe WikiAnalytics, a system that facilitates data extraction from the Wikipedia infobox collection. WikiAnalytics provides a dynamic and intuitive interface that lets the average user explore the search results and construct homogeneous structured tables, which can be further queried and mashed up (e.g., filtered and aggregated) using conventional tools.

Search Driven Analysis of Heterogeneous XML Data
In Conference on Innovative Data Systems Research. CIDR 2009
Andrey Balmin, Latha Colby, Emiran Curtmola, Quanzhong Li, and Fatma Özcan

Publication Abstract

Analytical processing on XML repositories is usually enabled by designing complex data transformations that shred the documents into a common data warehousing schema. This can be very time consuming and costly, especially if the underlying XML data has a lot of variety in structure, and only a subset of attributes constitutes meaningful dimensions and facts. Today, there is no tool to explore an XML data set, discover interesting attributes, dimensions and facts, and rapidly prototype an OLAP solution.

In this paper, we propose a system, called SEDA (Search, Explore, Discover and Analyze), that enables users to start with simple keyword-style querying, and interactively refine the query based on result summaries. SEDA then maps query results onto a set of known, or newly created, facts and dimensions, and derives a star schema and its instantiation to be fed into an off-the-shelf OLAP tool, for further analysis.

XTreeNet: Democratic Community Search
In International Conference on Very Large Data Bases. VLDB 2008 Demonstration
Emiran Curtmola, Alin Deutsch, Dionysios Logothetis, K.K. Ramakrishnan, Divesh Srivastava, and Kenneth Yocum

Publication Abstract

We describe XTreeNet, a distributed query dissemination engine that facilitates democratization of publishing and efficient data search among members of online communities with powerful full-text queries. This demonstration shows XTreeNet in full action. XTreeNet serves as a proof of concept for democratic community search by proposing a novel distributed infrastructure in which data resides only with the publishers owning it. Expressive user queries are disseminated to publishers. Given the virtual nature of the global data collection (i.e., the union of all local data published in the community), our infrastructure efficiently locates the publishers that contain documents matching a specified query, processes the complex full-text query at the publisher, and returns all relevant documents to the querier.

SEDA: A System for Search, Exploration, Discovery and Analysis of XML Data
In International Conference on Very Large Data Bases. VLDB 2008 Demonstration
Andrey Balmin, Latha Colby, Emiran Curtmola, Quanzhong Li, Fatma Özcan, Sharath Srinivash, and Zografoula Vagena

Publication Abstract

Keyword search in XML repositories is a powerful tool for interactive data exploration. Much work has recently been done on making XML search aware of relationship information embedded in XML document structure, but without a clear winner in all data and query scenarios. Furthermore, due to its imprecise nature, search results cannot easily be analyzed and summarized to gain more insights into the data. We address these shortcomings with SEDA: a system for Search, Exploration, Discovery, and Analysis of XML Data. SEDA is based on a paradigm of search and user interaction to help users start with simple keyword-style querying and perform rich analysis of XML data by leveraging both the content and structure of the data. SEDA is an interactive system that allows the user to refine her query iteratively to explore the XML data and discover interesting relationships.

SEDA first employs a top-k algorithm to compute the most relevant top-k answers fast, and returns tuples of nodes ranked by relevance. SEDA provides several novel data structures and techniques for efficient top-k computation over graph-structured XML data. SEDA also computes all the contexts in which the query terms are found and all the connection paths that connect the query terms in the XML data. These two summaries enable the user to refine her query by disambiguating the contexts and connections relevant to her query. With the user feedback, the system has enough information to compute all query results, not just the top-k. From the complete results, SEDA automatically deduces a star schema, which is then instantiated with the query results and augmented with additional values required for a well-defined data cube. The tables computed at this step are input into an OLAP engine for further analysis.

A Platform for Search in the Big Web 2.0
In SIGMOD 2007 PhD Workshop on Innovative Database Research. IDAR 2007
Emiran Curtmola

Publication Abstract

The recent explosion in the amount and variety of information being generated in so many different places, under different types of social interaction between users, has made search a hot topic for many research communities. While traditional web search focused on simple keyword search and on references between pages, getting the right information at the right time is becoming increasingly hard, creating a critical need for expressive, efficient, relevant, and flexible search tools.

We study the search in large-scale social systems by capturing logically the natural way people search and discover information: the relevance of keywords relative to the document structure, the importance of references between pages and the associations generated by the online social context. We argue that the key for successful search is to provide a strong theoretical basis to enable the development of theory and practical optimization algorithms. We are the first to show how to transfer the well-established relational world expertise into keyword search. The thesis of this research is to build a prototype based on this formalism and to demonstrate how we can leverage it to address these search challenges.

Flexible and Efficient XML Search with Complex Full-Text Predicates
In ACM SIGMOD International Conference on Management of Data. SIGMOD 2006
Sihem Amer-Yahia, Emiran Curtmola, and Alin Deutsch

Publication Abstract

Recently, there has been extensive research that generated a wealth of new XML full-text query languages, ranging from simple Boolean search to combining sophisticated proximity and order predicates on keywords.

While computing least common ancestors of query terms was proposed for efficient evaluation of conjunctive keyword queries by exploiting the document structure, no such solution was developed to evaluate complex full-text queries. We present efficient evaluation algorithms based on a formalization of full-text XML queries in terms of keyword patterns and an algebra which manipulates pattern matches. Our algebra captures most existing languages and their varying semantics and our algorithms combine relational query evaluation techniques with the exploitation of document structure to process queries with complex full-text predicates.

We show how scoring can be incorporated into our framework without compromising the algorithms' complexity. Our experiments show that considering element nesting dramatically improves the performance of queries with complex full-text predicates.
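
The least-common-ancestor idea that the abstract cites as the earlier approach to conjunctive keyword queries can be illustrated with a short Python sketch. It is not one of the paper's algorithms: it assumes Dewey-style node identifiers (tuples of child positions), a toy inverted index, and a brute-force enumeration, and the names `lca` and `deepest_lcas` are invented for this example.

```python
from itertools import product


def lca(a: tuple, b: tuple) -> tuple:
    """Least common ancestor of two Dewey paths: their longest common prefix."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def deepest_lcas(postings: dict, keywords: list[str]) -> set[tuple]:
    """Deepest elements whose subtree contains one match for every keyword.

    `postings` maps a keyword to the Dewey paths of the nodes containing it
    (assumed index format). Every combination of one match per keyword is
    tried, so this is only suitable for tiny inputs.
    """
    candidates = set()
    for combo in product(*(postings.get(k, []) for k in keywords)):
        node = combo[0]
        for other in combo[1:]:
            node = lca(node, other)
        candidates.add(node)
    if not candidates:
        return set()
    depth = max(len(c) for c in candidates)
    return {c for c in candidates if len(c) == depth}
```

Full-text predicates such as proximity or keyword order cannot be decided from the ancestor alone, which is consistent with the abstract's remark that this technique had not been extended to complex full-text queries.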

Rewriting Nested XML Queries Using Nested Views
In ACM SIGMOD International Conference on Management of Data. SIGMOD 2006
Nicola Onose, Alin Deutsch, Yannis Papakonstantinou, and Emiran Curtmola

Publication Abstract

We present and analyze an algorithm for equivalent rewriting of XQuery queries using XQuery views, which is complete for a large class of XQueries featuring nested FLWR blocks, XML construction, and join equalities by value and identity. These features pose significant challenges, which lead to fundamental extensions of prior work on the problems of rewriting conjunctive and tree pattern queries. Our solution exploits the Nested XML Tableaux (NEXT) notation, which enables a logical foundation for specifying XQuery semantics. We present a tool that takes XQuery queries and views as input and outputs an XQuery rewriting, and is thus usable on top of any existing XQuery processing engine. Our experimental evaluation shows that the tool scales well for large numbers of views and complex queries.

GalaTex: A Conformant Implementation of the XQuery Full-Text Language
In International Workshop on XQuery Implementation, Experience and Perspectives. XIME-P 2005
Emiran Curtmola, Sihem Amer-Yahia, Philip Brown, and Mary Fernández

Publication Abstract

We describe GALATEX, the first complete implementation of XQuery Full-Text, a W3C specification that extends XPath 2.0 and XQuery 1.0 with full-text search capabilities. XQuery Full-Text provides composable full-text search primitives such as simple keyword search, Boolean queries, and keyword-distance predicates. GALATEX is intended to serve as a reference implementation for XQuery Full-Text and as a platform for addressing new research problems such as scoring full-text query results, optimizing XML queries over both structure and text, and evaluating top-k queries on scored results. GALATEX is an all-XQuery implementation initially focused on completeness and conformance rather than on efficiency. We describe its implementation on top of Galax, a complete XQuery implementation and identify some performance challenges, possible solutions, and their interactions with XQuery implementations.

Selected Posters

Querying XML Peers
Center for Networked Systems. In CNS Research Review 2008
Emiran Curtmola, Alin Deutsch, Yannis Papakonstantinou, K.K. Ramakrishnan, and Divesh Srivastava

Publication Abstract

As the web evolves, it is becoming easier to form communities based on shared interests, and to create and publish data on a wide variety of topics. With this "democratization of information creation" comes the natural desire to make one's data accessible for querying within the community and also be able to query the global collection that is the union of all local data collections of others within the community.

In order to fully deliver on the promise of free data exchange, any community-supporting infrastructure needs to enforce the key requirement of being resistant to censorship by third parties, be they of governmental, corporate, or of other special interest nature. Censorship resistance precludes some obvious approaches that reuse and build on existing centralized technologies, e.g., search engines, hosted online communities, etc.

We propose a distributed censorship-resistant enabling infrastructure in which data resides only with the publishers owning it. The infrastructure disseminates user queries to publishers, who answer them at their own discretion. The infrastructure prevents third parties from pinpointing which publisher advertises what data (without extensively colluding with or attacking community members).

Given the virtual nature of the global data collection, we study the challenging problem of efficiently locating publishers in the community that contain data items matching a specified query. We propose a distributed index structure, UQDT, that is organized as a union of Query Dissemination Trees (QDTs), and realized on an overlay (i.e., logical) network infrastructure. Each QDT has data publishers as its leaf nodes, and overlay network nodes as its internal nodes; each internal node routes queries to publishers, based on a summary of the data advertised by publishers in its subtree.

We experimentally evaluate design tradeoffs, and demonstrate that UQDT can maximize throughput by preventing any overlay network node from becoming a bottleneck.

GalaTex: A Conformant Implementation of the XQuery Full-Text Language
In International World Wide Web Conference. WWW 2005
Emiran Curtmola, Sihem Amer-Yahia, Philip Brown, and Mary Fernández

Publication Abstract

We describe GalaTex, the first complete implementation of XQuery Full-Text, a W3C specification that extends XPath 2.0 and XQuery 1.0 with full-text search. XQuery Full-Text provides composable full-text search primitives such as keyword search, Boolean queries, and keyword-distance predicates. GalaTex is intended to serve as a reference implementation for XQuery Full-Text and as a platform for addressing new research problems such as scoring full-text query results, optimizing XML queries over both structure and text, and evaluating top-k queries on scored results. GalaTex is an all-XQuery implementation initially focused on completeness and conformance rather than on efficiency. We describe its implementation on top of Galax, a complete XQuery implementation.

Implementation and Open Research Issues in XML Full-Text Search
In New York Area DB/IR Day 2005
Emiran Curtmola, Sihem Amer-Yahia, and Alin Deutsch

Publication Abstract

The growing number of large XML repositories being made available has created the need to search both the structure and the text content of XML documents. While the current XML query languages, XPath 2.0 and XQuery 1.0, which are the W3C recommended standards for querying XML documents, operate on structured XML data, they are limited in expressing full-text queries. Recently, the W3C has been working on XQuery Full-Text, a language that extends XPath and XQuery with fully composable full-text search primitives such as phrase matching, Boolean connectives, keyword-distance, stemming and thesauri.

In this poster, I will describe the data model and the query semantics as well as different query evaluation strategies for XQuery Full-Text. I will also discuss the architecture of GalaTex, the first conformant implementation of XQuery Full-Text, which uses Galax as a complete XQuery processor. GalaTex is initially focused on completeness and conformance rather than on efficiency. However, its main benefit is to serve as a reference implementation for XQuery Full-Text and as a platform for addressing new research ideas in XML full-text search. I will discuss ideas on optimizing XML queries over both structure and text, providing a logical framework for evaluating top-K answers based on score pruning, and full-text query equivalence.

A demonstration of GalaTex is provided at GALATEX and will also be available along with this poster.

Project Demos

XTreeNet - Democratic Community Search (work in progress)
SEDA: A System for Search, Exploration, Discovery and Analysis of XML Data
GalaTex - XQuery Full-Text extension of XPath and XQuery Languages
REFORM - A System for Rewriting XML Nested Queries Using Nested Views