The First International Workshop on
Learning from Limited or Noisy Data
For Information Retrieval
July 12th, 2018, Ann Arbor, Michigan, USA. Co-located with SIGIR 2018.

About the Workshop

In recent years, machine learning approaches, in particular deep neural networks, have yielded significant improvements on several natural language processing and computer vision tasks; however, comparable breakthroughs have not yet been observed in information retrieval. Besides the inherent complexity of IR tasks, such as understanding the user's information need, a main reason is the lack of high-quality and/or large-scale training data for many IR tasks. This necessitates studying how to design and train machine learning algorithms when large-scale or high-quality training data is not available. Given the rapid progress in machine learning, this is therefore an ideal time for a workshop that focuses on learning in such an important and challenging setting for IR tasks.

The goal of this workshop is to bring together researchers from industry, where data is plentiful but noisy, with researchers from academia, where data is sparse but clean, to discuss solutions to these related problems.


Schedule

9:00 - 9:10 Opening
9:10 - 10:00 Keynote by Marc Najork [Link]
10:00 - 10:15 Coffee break
10:15 - 11:30 Paper presentations [Accepted Papers]
11:30 - 12:00 Discussion panel and Closing


"Using biased data for Learning-to-Rank" by Marc Najork

Recent years have seen great advances in using machine-learned ranking functions for relevance prediction. Any learning-to-rank framework requires abundant labeled training examples. In web search, labels may either be assigned explicitly (say, through crowd-sourced assessors) or based on implicit user feedback (say, result clicks). In personal (e.g. email) search, obtaining labels is more difficult: document-query pairs cannot be given to assessors due to privacy constraints, and clicks on query-document pairs are extremely sparse (since each user has a separate corpus), noisy and biased. Over the past several years, we have worked on techniques for training ranking functions on result clicks in an unbiased and scalable fashion. Our techniques are used in many Google products, such as Gmail, Inbox, Drive and Calendar. In this talk, I will present an overview of this line of research.
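One standard approach to the debiasing problem sketched in this abstract is inverse propensity scoring (IPS), where each click is reweighted by the estimated probability that the user examined the result at its displayed position. The sketch below is illustrative only; all names and numbers are hypothetical, and it is not necessarily the exact method used in the talk.

```python
# Illustrative sketch of inverse-propensity-scored (IPS) learning-to-rank,
# a standard technique for debiasing position-biased click data.

def ips_loss(clicks, scores, propensities):
    """Click loss that is unbiased in expectation: each clicked result
    is weighted by 1 / P(it was examined at its displayed position)."""
    loss = 0.0
    for click, score, prop in zip(clicks, scores, propensities):
        if click:  # only clicked (observed-relevant) documents contribute
            loss += (1.0 - score) / prop  # penalty, reweighted to de-bias
    return loss

# Hypothetical position-based examination propensities: results shown
# higher on the page are examined more often.
propensities = [1.0, 0.5, 0.25]   # P(examined | position)
clicks       = [0, 1, 0]          # the user clicked the 2nd result
scores       = [0.9, 0.6, 0.3]    # model's relevance scores in [0, 1]

print(ips_loss(clicks, scores, propensities))  # (1 - 0.6) / 0.5 = 0.8
```

Without the propensity weights, clicks at lower positions would be systematically under-counted, biasing the learned ranker toward whatever the production system already showed at the top.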

Marc Najork is a Research Engineering Director at Google, where he manages a team working on a portfolio of machine learning problems. Before joining Google in 2014, Marc spent 12 years at Microsoft Research Silicon Valley and 8 years at Digital Equipment Corporation's Systems Research Center in Palo Alto. He received a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. Marc has published about 60 papers and holds 26 issued patents. Much of his past research has focused on improving web search and on understanding the evolving nature of the web. He served as ACM TWEB editor-in-chief, CACM news board co-chair, WWW 2004 program co-chair, WSDM 2008 conference chair, and in numerous senior PC member roles.

Accepted Papers

  • Distributed Evaluations: Ending Neural Point Metrics, Daniel Cohen, Scott M. Jordan, and W. Bruce Croft. [Link]

  • Explainable Agreement through Simulation for Tasks with Subjective Labels, John Foley. [Link]

  • Information Retrieval in African Languages, Hussein Suleman. [Link]

  • Highly Relevant Routing Recommendation Systems for Handling Few Data Using MDL Principle, Diyah Puspitaningrum, I.S.W.B. Prasetya, and P.A. Wicaksono. [Link]

  • Learning to Rank from Samples of Variable Quality, Mostafa Dehghani and Jaap Kamps. [Link]

  • Multilingual Sentiment Analysis: An RNN-Based Framework for Limited Data, Ethem Can, Aysu Ezen-Can, and Fazli Can. [Link]

  • Named Entity Recognition with Extremely Limited Data, John Foley, Sheikh Muhammad Sarwar, and James Allan. [Link]

  • Towards Theoretical Understanding of Weak Supervision for Information Retrieval, Hamed Zamani and W. Bruce Croft. [Link]
Organizers

    Hamed Zamani

    University of Massachusetts Amherst

Mostafa Dehghani

    University of Amsterdam

    Hang Li


    Nick Craswell


    Program Committee:

  • Michael Bendersky, Google, USA
  • Daniel Cohen, UMass Amherst, USA
  • W. Bruce Croft, UMass Amherst, USA
  • J. Shane Culpepper, RMIT Univ., Australia
  • Maarten de Rijke, Univ. of Amsterdam, The Netherlands
  • Jiafeng Guo, Chinese Academy of Sciences, China
  • Claudia Hauff, TU Delft, The Netherlands
  • Jaap Kamps, Univ. of Amsterdam, The Netherlands
  • Craig Macdonald, Univ. of Glasgow, UK
  • Bhaskar Mitra, Microsoft and UCL, UK
  • Amirmohammad Rooshenas, UMass Amherst, USA
  • Min Zhang, Tsinghua University, China
  • Yongfeng Zhang, Rutgers University, USA
Call for Papers

    We invite two kinds of contributions: research papers (up to 6 pages) and position papers (up to 2 pages). Submissions must be in English, in PDF format, and must not exceed the appropriate page limit in the current ACM two-column conference format (including references and figures). Suitable LaTeX and Word templates are available from the ACM website. Papers may report original research, preliminary research results, or proposals for new work.

    The review process is single-blind. Papers will be evaluated according to their significance, originality, technical content, style, clarity, relevance to the workshop, and likelihood of generating discussion. Changes to the author list after the submission deadline are not allowed without permission from the PC chairs. At least one author of each accepted paper is required to register for, attend, and present the work at the workshop. All papers are to be submitted via EasyChair.

    Papers presented at the workshop will be required to be uploaded to arXiv, but will be considered non-archival and may be submitted elsewhere (modified or not); the workshop site will maintain links to the arXiv versions. This makes the workshop a forum for the presentation and discussion of current work, without preventing the work from being published elsewhere.

    Relevant topics include, but are not limited to:
    • Learning from noisy data for IR
      • Learning from automatically constructed data
      • Learning from implicit feedback data, e.g., click data
    • Distant or weak supervision and learning from IR heuristics
    • Unsupervised and semi-supervised learning for IR
    • Transfer learning for IR
    • Incorporating expert/domain knowledge to improve learning-based IR models
      • Learning from labeled features
      • Incorporating IR axioms to improve machine learning models

    Important Dates:

    • Submission deadline: May 4, 2018
    • Paper notifications: May 25, 2018
    • Camera-ready deadline: June 8, 2018
    • Workshop Day: July 12, 2018