English (unofficial) translations of posts at kexue.fm

Cool Papers Update: Building a Simple In-Site Search System

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate, please refer to the original post for important stuff.

Since the post "A More Convenient Way to Open Cool Papers: Chrome Redirect Extension", Cool Papers has undergone two major changes. One is the new venue branch, which is gradually adding paper collections from past years of conferences such as ICLR and ICML. This branch is expanded manually over time, and readers are welcome to request specific conferences. The other change is the subject of this article: the in-site search function added the day before yesterday.

This article will briefly introduce the new features and provide a basic summary of the process of building the in-site search system.

Introduction

On the homepage of Cool Papers, we can see the search entry point:

[Figure: Cool Papers (2024.05.07)]

The characteristics of the search function are as follows:

  1. It only searches the title and summary fields; specifying fields is currently not supported.

  2. You can search either the arxiv branch or the venue branch; mixed searching across both branches is not supported.

  3. Special characters (anything other than English letters and digits) in the search query will be removed.

  4. Words in the search query are not automatically stemmed, meaning searching for "images" will not match "image".

  5. On the search results page, it can be used in conjunction with the original in-page search function.
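
Rules 3 and 4 above can be illustrated with a minimal sketch. This is not the actual Cool Papers code: the function names `normalize_query` and `matches` are hypothetical, and two details are assumptions, namely that special characters are replaced by a space (rather than deleted outright) and that matching is case-insensitive.

```python
import re


def normalize_query(query: str) -> list[str]:
    """Keep only English letters and digits, then split into words.

    Assumption: each run of special characters becomes a space, and the
    result is lowercased. The real site may behave differently.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", query)
    return cleaned.lower().split()


def matches(terms: list[str], text: str) -> bool:
    """Return True if every term occurs in the text as a whole word.

    No stemming is applied, so "images" and "image" are different words.
    """
    words = set(normalize_query(text))
    return all(term in words for term in terms)


if __name__ == "__main__":
    terms = normalize_query("images")
    print(matches(terms, "Generating images from text"))
    print(matches(terms, "An image is worth 16x16 words"))
```

Under these assumptions, the first title matches the query "images" and the second does not, because "image" is never reduced to a common stem with "images".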

Overall, it is currently a very simple text search function intended to meet the basic needs of some users. More complex requirements will be addressed gradually in later updates. Features envisioned for the future include specifying fields, searching Kimi FAQ content, sorting by stars, specifying dates/categories (for arXiv), specifying conferences (for venue), and even plus/minus operators as in common search engines (for example, to exclude certain keywords). These will depend on subsequent user feedback; there is no fixed schedule yet.

Summary

In fact, users asked for in-site search as soon as Cool Papers was opened to the public at the beginning of the year. It was delayed mainly because Cool Papers collects papers daily, and initially the number of papers was too small for in-site search to be meaningful. After four months of accumulation, the number of arXiv papers collected by Cool Papers has reached over 80,000, and with the conference papers in the venue branch, the total is nearly 170,000. At this scale, in-site search is worthwhile.

Once it was decided to proceed, the next question was how to do it. A retrieval system based on keyword searching of article content is called "Full-text Search," which is generally built on inverted indices and BM25 similarity. In other words, the underlying algorithms are mature. In terms of implementation, the backend of Cool Papers is BottlePy, so we needed a full-text search library available for Python that could be integrated into Cool Papers easily.
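
To make "inverted index plus BM25" concrete, here is a minimal pure-Python sketch. It is only an illustration of the general technique, not the code behind Cool Papers: the class name `TinyBM25Index` is hypothetical, tokenization is a plain lowercase whitespace split, and the parameters `k1=1.2` and `b=0.75` are common defaults chosen here as an assumption.

```python
import math
from collections import Counter, defaultdict


class TinyBM25Index:
    """A toy inverted index with BM25 scoring (illustrative only)."""

    def __init__(self, k1: float = 1.2, b: float = 0.75):
        self.k1 = k1
        self.b = b
        # Inverted index: term -> {doc_id: term frequency in that document}
        self.postings = defaultdict(dict)
        # doc_id -> document length in tokens
        self.doc_len = {}

    def add(self, doc_id, text):
        tokens = text.lower().split()
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query, top_k=10):
        n_docs = len(self.doc_len)
        if n_docs == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n_docs
        scores = defaultdict(float)
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rarer terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                # Length normalization: long documents are penalized.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_k]


if __name__ == "__main__":
    index = TinyBM25Index()
    index.add("a", "attention is all you need")
    index.add("b", "image generation with diffusion models")
    index.add("c", "diffusion models beat gans on image synthesis")
    for doc_id, score in index.search("image diffusion"):
        print(doc_id, round(score, 3))
```

The index only visits documents that contain at least one query term, which is what keeps lookup cheap as the collection grows. A real library adds persistence, better tokenization, and query parsing on top of this idea.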

There are not many choices for the "Python + Full-text Search" combination. The most classic one is a library called Whoosh. Functionally, it does meet the needs of Cool Papers. However, Whoosh hasn’t been updated since April 2016, which raises concerns about potential hidden issues. Another option is to switch to a database with built-in full-text search capabilities, such as MongoDB. If MongoDB had been used to store data from the beginning, this would undoubtedly be the simplest solution. However, Cool Papers chose Shelve, the key-value store in Python’s standard library. Switching to MongoDB now would involve too much engineering work, and in the simple scenarios of Cool Papers, MongoDB cannot match Shelve’s speed.
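
For readers unfamiliar with Shelve, the following sketch shows the kind of key-value access it offers. The record layout, the paper ID, and the helper names `save_paper` and `load_paper` are hypothetical; how Cool Papers actually organizes its data is not described in this article.

```python
import os
import shelve
import tempfile


def save_paper(db_path: str, paper_id: str, record: dict) -> None:
    """Store one record under its ID (values are pickled by shelve)."""
    with shelve.open(db_path) as db:
        db[paper_id] = record


def load_paper(db_path: str, paper_id: str):
    """Fetch one record by ID, or None if the ID is unknown."""
    with shelve.open(db_path) as db:
        return db.get(paper_id)


if __name__ == "__main__":
    # Hypothetical ID and fields, stored in a throwaway directory.
    path = os.path.join(tempfile.mkdtemp(), "papers")
    save_paper(path, "0000.00000", {"title": "An example title", "summary": "..."})
    print(load_paper(path, "0000.00000"))
```

Lookup by key is direct, but a store like this has no way to ask "which records contain this word", which is why a separate full-text index is needed alongside it.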

After many fruitless searches, I stumbled upon a small but powerful alternative to Whoosh: tantivy. It is a full-text search library written in Rust, but it provides Python bindings, so it can be used as a Python library. Its API is similar to Whoosh’s, and unlike Whoosh it is still actively updated. Rust is known for its efficiency, so tantivy satisfies everything I want from a full-text search library: fast, compact, and concise.

After selecting the search library, the rest was frontend work. In "Happy New Year! Recording the Development Experience of Cool Papers", I already mentioned that I am a frontend novice with absolutely no artistic talent. Designing a UI is extremely difficult for me; I can only keep searching, copy-pasting, and asking GPT-4 and Kimi for help. After a lot of piecing together and patching, I finally managed to create a usable interface. During development, I also optimized the original built-in in-page search; users should now notice that it is significantly faster.

Conclusion

During the May Day holiday, while everyone was looking at KAN (Kolmogorov-Arnold Networks), I took a bit of a break from reading papers to add the in-site search function to Cool Papers. I wouldn’t say it’s "long-awaited," but it is a feature that some users have been requesting for a long time. Here is a brief introduction and a summary of the building experience.

When reposting, please include the original address of this article:
https://kexue.fm/archives/10088

For more detailed reposting matters, please refer to:
"Scientific Space FAQ"