CIKM 98 Tutorials

Seventh International Conference on Information and Knowledge Management

Nov. 3, 1998, Washington, D.C., USA.

Association for Computing Machinery Special Interest Group on Information Retrieval

Sponsored by ACM SIGIR and SIGMIS.
T1. DATA MINING ON LARGE DATABASES (8:00AM-12:00PM)

Rajeev Rastogi and Kyuseok Shim
Bell Laboratories

T2. MODELS IN INFORMATION RETRIEVAL (8:00AM-12:00PM)

Fredric C. Gey, Ph.D.
University of California

T4. DATA WAREHOUSING DESIGN TECHNIQUES FOR ROLAP (1:00-5:00PM)

Il-Yeol Song
Drexel University

T5. METADATA REPOSITORIES: ENABLING INFORMATION ASSET MANAGEMENT (1:00-5:00PM)

Sandra Heiler and Gail Mitchell
GTE Laboratories, Inc.

Tutorial registration must be done via the CIKM98 registration form.


T1. DATA MINING ON LARGE DATABASES
Tuesday, Nov. 3, 8:00AM-12:00PM

Rajeev Rastogi and Kyuseok Shim
Bell Laboratories
600 Mountain Ave
Murray Hill, NJ 07974
Email: shim@research.bell-labs.com

Tutorial Outline
The following topics will be discussed during the tutorial.

  1. Introduction : (20 minutes)
    Gives a brief overview and discussion of data mining techniques developed for large databases.
  2. Association Rules : (30 minutes)
    Presents association rules, optimized association rules and correlations.
  3. Classification : (30 minutes)
    Describes the state-of-the-art classifiers for large databases. These include the PUBLIC, RainForest, SLIQ and SPRINT algorithms.
  4. Clustering : (30 minutes)
    Illustrates the characteristics of traditional clustering algorithms and presents techniques developed for large databases. We cover the CLARANS, BIRCH, CURE and ROCK algorithms.
  5. Similar Time Sequences : (30 minutes)
    Illustrates the existing techniques developed for finding similar time sequences.
  6. Other Applications and Future Research : (20 minutes)
    Discusses other interesting problems and research issues.

Biography of Instructors :

  1. Rajeev Rastogi

    Rajeev Rastogi received the B.Tech. degree in Computer Science from the Indian Institute of Technology, Bombay, in 1988, and the master's and Ph.D. degrees in Computer Science from the University of Texas, Austin, in 1990 and 1993, respectively. He joined Bell Laboratories in Murray Hill, New Jersey, in 1993 and is currently a member of technical staff (MTS) in the Information Sciences Research Center.

    Rajeev Rastogi is active in the field of databases and has served as a program committee member for several conferences in the area. His writings have appeared in a number of ACM and IEEE publications and other professional conferences and journals. His research interests include database systems, storage systems and knowledge discovery. His most recent research has focused on the areas of high-performance transaction systems, continuous-media storage servers, tertiary storage systems, data mining, and multidatabase transaction management.

  2. Kyuseok Shim

    Kyuseok Shim is currently leading the Serendip Data Mining project at Bell Laboratories. Before that, he worked for Rakesh Agrawal's Quest Data Mining project at IBM Almaden Research Center. He also worked as a summer intern for two summers at Hewlett Packard Laboratories. He received the B.S. degree in Electrical Engineering from Seoul National University, and the M.S. and Ph.D. degrees in Computer Science from the University of Maryland, College Park.

    Kyuseok Shim has been working in the area of databases, focusing on data mining, data warehousing, query processing and query optimization, and constraint-based database systems. He has published several research papers in prestigious database conferences and journals. He has also served as a program committee member for database and knowledge discovery conferences.


T2. MODELS IN INFORMATION RETRIEVAL
Tuesday, Nov. 3, 8:00AM-12:00PM

Fredric C. Gey, Ph.D.
Data Archivist and Assistant Director
UC Data Archive & Technical Assistance
University of California
2538 Channing Way, # 5100
Berkeley, CA 94720-5100
Phone: (510) 642-6571
FAX : (510) 643-8292

COURSE DESCRIPTION:

Information retrieval algorithms have emerged as the key to effective search of large collections of unstructured text such as those found on the Internet. Vector space algorithms are used by Lycos and AltaVista, while Inktomi uses a probabilistic document retrieval algorithm.

The three major theoretical models in information retrieval are Boolean/logic, vector space, and probabilistic. This tutorial will explain the unique characteristics and problems of each model and how each model has evolved along different lines. Modern variants of the basic models are explained.

The attendees of this tutorial will obtain a basic understanding of the major theoretical models upon which modern text retrieval software is based. The tutorial should provide each participant with a starting point for further self-education.

1/2 hour

Background and historical development
Luhn and statistical text characteristics
Statistical weights and the IDF concept

1 hour

Boolean set and logic models
Fuzzy logic (RUBRIC/TOPIC)
Weighted Boolean and P-Norm (INQUERY)
Recent logic models

1 hour

Vector space and geometric models
Basic vector similarity measures
Generalized vector space model
Latent Semantic Indexing
Pivoted normalization similarity

1 hour

Probabilistic models
Probabilistic indexing and querying
2-Poisson and OKAPI
Relevance weights and relevance feedback
Inference nets and neural network approaches
Regression models

1/2 hour

Performance measurement and analysis
Recall, precision, fallout measures
Limitations to performance assessment -- interjudge consistency, completeness
Statistical significance tests

Materials: 110 course overheads and 4 pages of bibliographic references will be provided.

WHO SHOULD ATTEND:
This course is designed to provide a fast-paced yet rigorous introduction to the basic models of Information Retrieval for academic and industrial research and development computer scientists whose background lies outside the Information Retrieval area.

ABOUT THE INSTRUCTOR:
Fredric Gey's research specializes in probabilistic document retrieval using logistic regression techniques. He is Principal Investigator of NSF grant IRI 9630765, "Probabilistic Retrieval of Full-Text Document Collections Using Logistic Regression." He is Co-principal Investigator for the ARPA research contract "Search Support for Unfamiliar Metadata Vocabularies," July 1997-June 2000. He directs the UC Berkeley entries to the TREC conferences, and is designated as General Chairman for SIGIR99, to be held at the University of California, Berkeley, during the summer of 1999. He holds a PhD in Information Science from UC Berkeley.


T4. DATA WAREHOUSING DESIGN TECHNIQUES FOR ROLAP
Tuesday, Nov. 3, 1:00-5:00PM

Il-Yeol Song, Ph.D.
Associate Professor
College of Information Science and Technology
Drexel University
Philadelphia, PA 19104
Phone: (215) 895-2489
Fax: (215) 895-2494
Email: song@drexel.edu

Level : Beginning to Intermediate.

Intended Audience : Professionals who are working on, or considering, data warehousing based on relational database systems.

Tutorial Abstract :

A data warehouse is an integrated data repository containing historical data of a corporation for supporting decision-making processes. Recently, data warehouses have become the focus of corporate information management, drawing on the most advanced database technology. The basic strategy for accessing individual and aggregate data in a data warehouse using relational databases is known as ROLAP (Relational OLAP). This tutorial presents a technology overview for the development of data warehousing. It compares ROLAP and MOLAP (Multidimensional OLAP), then discusses techniques for designing a star schema. We will look at the multiple variations of the star schema that exist and the differences in the properties of these schemas. The tutorial also discusses techniques for optimizing the performance of data warehouse systems based on relational database systems. Specifically, the discussion includes storage, parallel processing technology, indexing technology (including bitmap indexes, join indexes, multi-table join indexes, and indexing strategies), query optimization based on the star schema, and partitioning techniques. It concludes with a survey of commercial markets, tools, trends, research issues and challenges.

Biography of Instructor :

Il-Yeol Song is an associate professor in the College of Information Science and Technology at Drexel University, Philadelphia, PA. He received his M.S. and Ph.D. degrees in Computer Science from Louisiana State University in 1984 and 1988, respectively. His current research areas include database modeling and design, data warehousing, object-oriented database systems, and object-oriented analysis and design. He has published over 60 refereed technical articles in various journals, international conferences, and books. In 1992, he received an exemplary teaching award as well as a research scholar award from Drexel University. He has won eight Sigma Xi research awards from the Drexel Sigma Xi scientific research competition. He has worked as a program committee member for over twenty-five international conferences and workshops. He was the guest editor for a 1995 special issue of Journal of Computer and Software Engineering entitled "Methodologies and Tools for Intelligent Information Systems." He will be the guest editor for a special issue of Journal of Computer Science and Information Management entitled "Applications and Technologies for Next Generation Database Systems," scheduled for early 1999. He is the program co-chair of the First ACM Int'l Workshop on Data Warehousing and OLAP (DOLAP), which will be held with CIKM98 on November 7 in Washington, D.C.


T5. METADATA REPOSITORIES: ENABLING INFORMATION ASSET MANAGEMENT
Tuesday, Nov. 3, 1:00-5:00PM

Sandra Heiler and Gail Mitchell
GTE Laboratories, Inc.
sh04@gte.com, gmitchell@gte.com

Metadata repositories have long been used by software engineering tools to store and manage descriptions of system components, and by data administrators to document information stores. More recently, they are being used to support the integration of various tools, databases, and applications, and their use is being expanded to manage metadata for many more kinds of applications, including data warehousing. In this half-day tutorial, we present an industrial perspective on repository technology and its uses in managing an enterprise's information assets.

The tutorial starts with a description of repository technology. It examines requirements for managing metadata and describes how these are met by the technology. In particular, we discuss repository architectures, integration mechanisms, repository metamodels, and associated tools for populating, accessing, maintaining, and administering the repository. We identify various implementation strategies for repositories, and look at the state of the art in repository products.

The second part of the tutorial examines the use of repositories. We begin with a discussion of issues in populating a repository and in implementing applications using repositories. We then describe a number of applications of repository technology, including software lifecycle support, production planning and management, and decision support systems and data warehousing. Finally, we look at how the repositories supporting these applications combine to provide for enterprise-wide information asset management, and we identify research issues in moving to this broader use.

Instructors

Sandra Heiler is the Principal Investigator of the Data and Database Research project at GTE Laboratories, where her research focuses on the use of metadata repositories to support enterprise-wide management of information and software components. In particular, her work is directed to the use of metadata and repository technology to integrate distributed, heterogeneous systems and databases, and to support data warehousing. She is also involved in the application of this technology to legacy system migration and data archiving in a large SAP rollout. Ms. Heiler's earlier work at GTE Laboratories was with the Distributed Object Management Department, where she did research on object model integration and interoperability frameworks, and on object views and identifiers. She joined GTE from CCA and Xerox Advanced Information Technology, where she developed object models and object management systems for VLSI and software engineering environments, as well as transaction models to support cooperative work in those environments.

Ms. Heiler has more than 35 years of experience in database research and development and in applications of database and metadata technology. Her previous work includes developing data management systems for statistical databases and Decision Support Systems, and for managing other specialized data types, including engineering, statistical, and bibliographic data. She has authored papers and presented tutorials on object models and object views, semantic interoperability, integration frameworks, and object-oriented systems. Her current research interests include the use of metadata to capture data semantics and to support data warehousing.


Gail Mitchell is a Principal Member of the Technical Staff at GTE Laboratories where she works on problems in integrating enterprise information. Current research interests include querying in heterogeneous systems (including over the web), data warehousing, and data integration. Recent activities focus on metadata repository technology to support legacy system migration and information integration.

Dr. Mitchell received her PhD from Brown University for her research on extensible query optimization for object-oriented database systems. She has authored a number of papers in the areas of object query languages, extensible query processing, and distributed object management, and has taught courses, workshops and tutorials on these topics at such venues as OOPSLA, MIT, OGI, DEC, and NATO ASI. She is active on program committees, most recently SIGMOD, VLDB and CIKM, and was editor of a TAPOS issue on Distributed Object Systems and co-editor of the book Persistent Object Bases.

