Since the post "A More Convenient Way to Open
Cool Papers: Chrome Redirect Extension", Cool Papers has undergone two major
changes. One is the introduction of the venue branch, which is
gradually collecting the papers of various conferences, such as ICLR
and ICML, from past years. This branch is expanded manually as needed,
and readers are welcome to request specific conferences.
The other change is the subject of this article: the new in-site search
function added the day before yesterday.
This article will briefly introduce the new features and provide a basic summary of the process of building the in-site search system.
Introduction
The search entry point is on the homepage of Cool Papers.
The characteristics of the search function are as follows:
It only searches the title and summary fields; specifying fields is currently not supported.
You can search either the arxiv branch or the venue branch; mixed searching across both branches is not supported.
Special characters (anything other than English letters and digits) in the search query will be removed.
Words in the search query are not automatically stemmed, meaning searching for "images" will not match "image".
The search results page still supports the original in-page search function, so the two can be used together.
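The query rules listed above can be illustrated with a small sketch. This is not the code behind Cool Papers; the function names, the lowercasing step, the choice to replace special characters with spaces, and the "every term must appear" matching rule are all assumptions made for illustration.

```python
import re


def normalize_query(query: str) -> list[str]:
    """Split a query into terms, keeping only English letters and digits.

    Assumption: special characters are replaced by spaces and terms are
    lowercased. No stemming is applied.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", query)
    return cleaned.lower().split()


def matches(query: str, text: str) -> bool:
    """Return True when every query term occurs as a whole word in text.

    Assumption: all terms are required. Because nothing is stemmed,
    "images" and "image" are treated as different words.
    """
    words = set(normalize_query(text))
    return all(term in words for term in normalize_query(query))


terms = normalize_query("large-scale images!")
hit = matches("images", "Generating images with diffusion")
miss = matches("images", "An image is worth 16x16 words")
```

In this sketch `hit` is true and `miss` is false, which mirrors the behaviour described above: a query for "images" does not match a text that only contains "image".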
Overall, it is currently a very simple text search function intended to meet the basic needs of some users. More complex requirements will be addressed gradually in later updates. Features envisioned for the future include specifying fields, searching Kimi FAQ content, sorting by stars, specifying dates/categories (for arXiv), specifying conferences (for venue), and even supporting plus/minus operators as in common search engines (for example, to exclude certain keywords). These will depend on subsequent user feedback, and there is no fixed schedule yet.
Summary
In fact, the demand for in-site search was raised by users as soon as
Cool Papers was opened to the public at the beginning of the year. The
reason it was delayed is mainly that Cool Papers collects papers daily,
and initially, the number of papers was not large, making in-site search
less meaningful. After four months of accumulation, the number of arXiv
papers collected by Cool Papers has reached over 80,000, and with the
addition of conference papers in the venue branch, the
total has reached nearly 170,000 papers. It is now worth searching.
Once it was decided to proceed, the next question was how to do it. A retrieval system based on keyword searching of article content is called "Full-text Search," which is generally built on inverted indices and BM25 similarity. In other words, the underlying algorithms are mature. In terms of implementation, the backend of Cool Papers is BottlePy, so we needed to find a full-text search library available for Python to facilitate integration into Cool Papers.
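To make "inverted index plus BM25" concrete, here is a minimal pure-Python sketch. It is not how Cool Papers or any particular library implements search; the function names, the whitespace tokenizer, the sample documents, and the parameter values k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(docs: dict[str, str]):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    postings = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # assumption: whitespace tokenizer
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings, lengths


def bm25_search(query: str, postings, lengths, k1: float = 1.2, b: float = 0.75):
    """Score documents against the query with BM25, best match first."""
    n_docs = len(lengths)
    if n_docs == 0:
        return []
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        matched = postings.get(term, {})
        if not matched:
            continue
        # Rare terms get a larger weight than common ones.
        idf = math.log(1 + (n_docs - len(matched) + 0.5) / (len(matched) + 0.5))
        for doc_id, tf in matched.items():
            # Length normalization: long documents are penalized slightly.
            norm = k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus, for illustration only.
docs = {
    "a": "attention is all you need",
    "b": "image segmentation with attention",
    "c": "graph neural networks",
}
postings, lengths = build_index(docs)
results = bm25_search("image attention", postings, lengths)
ranked_ids = [doc_id for doc_id, _ in results]
```

Only the documents listed under a query term in the inverted index are touched, which is why this structure scales to large collections. Here document "b" contains both terms and ranks ahead of "a", while "c" is never scored.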
There are not many choices for the "Python + Full-text Search"
combination. The most classic one is a library called Whoosh. From a functional
perspective, it indeed meets the needs of Cool Papers. However, the
problem with Whoosh is that it hasn’t been updated since April 2016,
which raises concerns about potential hidden issues. Another option is
to switch to a database with built-in full-text search capabilities,
such as MongoDB. If MongoDB had been used to store data from the
beginning, this would undoubtedly be the simplest solution. However,
Cool Papers chose Python’s built-in key-value database,
Shelve. Switching to MongoDB now would involve too much
engineering work, and in the simple scenarios of Cool Papers, MongoDB’s
speed cannot match Shelve.
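For readers unfamiliar with it, the following sketch shows what key-value storage with the standard-library shelve module looks like. The helper names, the record layout, and the paper ID are hypothetical; this is not the actual storage code of Cool Papers.

```python
import os
import shelve
import tempfile


def save_papers(path: str, papers: dict[str, dict]) -> None:
    """Write each record under its paper ID (shelve keys must be strings)."""
    with shelve.open(path) as db:
        for paper_id, record in papers.items():
            db[paper_id] = record


def load_paper(path: str, paper_id: str):
    """Read a single record by key, or return None if it is missing."""
    with shelve.open(path, flag="r") as db:
        return db.get(paper_id)


# Hypothetical record, stored in a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "papers")
    save_papers(path, {
        "paper-001": {"title": "An Example Title", "summary": "An example summary."},
    })
    paper = load_paper(path, "paper-001")
    missing = load_paper(path, "paper-999")
```

Lookups by key are simple and fast, but shelve has no notion of searching inside the stored values, which is why a separate full-text search library is needed on top of it.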
After searching fruitlessly many times, I accidentally discovered a
small but powerful alternative to Whoosh—tantivy. This is
a full-text search library written in Rust, but it provides Python
bindings so it can be used as a Python library. The API is similar to
Whoosh, but it is still actively updated. As is well known, Rust is
famous for its efficiency, so it can be said that tantivy
satisfies all my ideals for a full-text search library—fast, compact,
and concise.
After selecting the search library, the rest was frontend work. In "Happy New Year! Recording the Development Experience of Cool Papers", I already mentioned that I am a frontend novice with absolutely no artistic talent. Designing a UI is extremely difficult for me; I can only constantly search, copy-paste, and ask GPT-4 and Kimi for help. Through various bits and pieces and patches, I finally managed to create a usable interface. During the development process, I also optimized the original built-in in-page search; users should now notice a significant increase in speed when using in-page search.
Conclusion
During the May Day holiday, while everyone was looking at KAN (Kolmogorov-Arnold Networks), I took a bit of a break from reading papers to add the in-site search function to Cool Papers. I wouldn’t say it’s "long-awaited," but it is a feature that some users have been requesting for a long time. Here is a brief introduction and a summary of the building experience.
When reposting, please include the
original address of this article:
https://kexue.fm/archives/10088