
= CSE665: Information Retrieval and Web Search =

Program:
PhD (CS)

**ANNOUNCEMENT FOR TURNITIN**

 * 1)  Go to http://www.turnitin.com.
 * 2)  Create your account as a student.
 * 3)  Add the course "Info Retrieval and Web Search". The course code is 4849706.
 * 4)  The password for the course has already been emailed to you.
 * 5)  Go to Homework1 and upload your work.

Semester:
Spring 2012

Instructor:
Dr. Shakeel A. Khoja

Credit Hours:
3 (3 credit hours for theory)

Prerequisite(s):
BS-level course in Web Development / Web Engineering

Course Description:
This course covers the foundations of Information Retrieval (IR) as well as advanced and recent topics in Web Information Retrieval (WIR). Core topics include the material necessary to understand how an IR system is constructed, along with a study of recent research topics in the area. In IR, the course covers IR models (Boolean, vector space, probabilistic, latent semantic indexing, and neural networks), indexing models (storing and accessing), file organization, query processing, and document clustering. Advanced research topics such as Aggregated Search, Digital Advertising, Digital Libraries, Discovery of Spam and Opinions in the Web, Evaluation, Information Retrieval in Context, Multimedia Resource Discovery, Scalability Challenges in Web Search Engines, and Users in Interactive Information Retrieval Evaluation will also be discussed.

Course Objectives:

 * 1)  Develop an in-depth knowledge of Information Retrieval and Web Search issues, new advancements in the area of web search algorithms, document clustering, and multimedia resource utilization.
 * 2)  Read and understand recent research issues in the area.
 * 3)  Understand the utility of the above-mentioned systems in web-based applications such as digital libraries, archives, and semantic web search services.

Books:

 * Ricardo Baeza-Yates and Berthier Ribeiro-Neto, “Modern Information Retrieval”, Addison-Wesley.
 * Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, “Introduction to Information Retrieval”, Cambridge University Press, 2008.
 * Massimo Melucci and Ricardo Baeza-Yates, “Advanced Topics in Information Retrieval”, Springer Heidelberg.
 * Maristella Agosti, “Information Access through Search Engines and Digital Libraries”, Springer Heidelberg.
 * Witten, Moffat and Bell, “Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd Edition”, Morgan Kaufmann Publishers.
 * Jones and Willett, “Readings in Information Retrieval”, Morgan Kaufmann Publishers.
 * Various WWW sources.

Grading Policy:

 * Mid-Term = 15% + 15%
 * Final Exam = 30%
 * Project = 30%
 * Class Discussion and Participation = 10%

|| //Week// || //Topics to be covered (tentative)// || Lecture Notes || Additional Material ||
||= 1 || **Information Retrieval Techniques:** Information Retrieval models, indexing models, query processing, file organizations and document clustering. || [[file:Intro.ppt]] ||  ||
||= 2 || **Search Engine Architecture:** Basic building blocks: text acquisition, text transformation, index creation, user interaction, ranking, evaluation. Putting it all together. || [[file:searchengines_chap2.pptx]] || 1. [[file:searchengines_theory of a large scale web search engines.pdf]] ||
||= 3 || **Stemming Algorithms:** Algorithmic issues in classic information retrieval, string algorithms (e.g. approximate matching), and other algorithmic issues related to the Web: networking & routing, caching, security, e-commerce and business models. || [[file:searchengines_chap4.pptx]] || 1. [[file:stemming_algo_papers.pdf]] ||
||= 4 || **Crawling the Web:** Deciding what to search, retrieving web pages, freshness, focused crawling, document crawling, document feeds. || [[file:searchengines_chap3.pptx]] || 1. [|Google crawlers] ||
||= 5-6 || **Web Search Issues:** Page ranking (rank algorithm), duplicate elimination, search-by-example, measuring search engine index quality. ||  || 1. [|XML retrieval] ||
||= 7 || **Query Languages:** Keyword-based queries, single-word queries, context queries, Boolean queries, natural language queries, pattern matching, structural queries, query protocols, research issues and modern trends in query processing. || [[file:searchengines_chap6.pptx]] || 1. [| Video lectures of workshop on Web Search Click Data 2009, Barcelona Spain] 2. [|Google tutorial on writing effective queries] ||
||= 8 || **Retrieval Evaluation:** Retrieval performance evaluation, recall and precision, alternative measures, examples of reference collections (academic and research collections). || Student Presentations ||  ||
||= 9 || **Text and Multimedia Languages and Properties:** Metadata and its formats, markup languages, multimedia issues, trends and research issues. || [[file:__Modern_Information_Retrieval_chapter6.pdf]] || 1. [[file:multimedia information retrieval.pdf]] 2. [[file:content_based_MM_retrieval.pdf]] ||
||= 10 || **Multimedia Resource Discovery:** Challenges and opportunities of Multimedia Information Retrieval and its applications. Image and video search, automated annotation of visual components, and content-based retrieval. || [[file:deep-CBIRImageRetrieval.ppt]] || 1. [[file:multimedia resource discovery.pdf]] 2. [[file:music_indexing.pdf]] 3. [[file:VisualInformationRetreival_vinay_kumar_columbia.pdf]] ||
||= 11 || Multimedia Resource Discovery (continued) || [[file:multimedia resource discovery.pdf]] || 1. [[file:images_database_indexing_using_jpeg_coefficients.pdf]] 2. [[file:context data digital photos.pdf]] 3. [[file:image_mining.pdf]] ||
||= 12 || **Information Retrieval in Context:** Study of the complex set of variables describing the intentions and personal characteristics of the person searching, the data and systems available, and the physical, social and organizational environments. As a research case study, IR techniques in social websites such as Facebook, LinkedIn, Twitter and MySpace will be explored. || [[file:Information_Retrieval_in_context.pdf]] || 1. [[file:context data digital photos.pdf]] ||
||= 13 || **Search Engine Evaluation:** Evaluation is the key to building effective and efficient search engines. Effectiveness, efficiency, and cost are related parameters on which evaluation tests are conducted. In this session we will discuss the rationale of evaluation, evaluation corpora, and query logging. || [[file:search-engine-book-chap8.pptx]] ||  ||
||= 14 || Presentations and demonstrations of students' research work. ||  ||  ||
||  || //Tentative Topics// ||  ||  ||
||  || **Integrating model of learning cycles:** Learning cycles involving data, information and knowledge, which includes Information Retrieval and Knowledge Discovery in the Web, including spam detection, opinion mining and relation mining. ||  ||  ||
||  || **Digital Libraries:** Historical background of digital libraries, main concepts of present digital library systems, usability, interoperability and evaluation issues. ||  ||  ||