Philipp Scharpf, Moritz Schubotz, Bela Gipp
Citation-based Information Retrieval (IR) methods for scientific documents have proven effective in academic disciplines that use many references. In science, technology, engineering, and mathematics (STEM), researchers cite less often but employ mathematical concepts to refer to prior knowledge. Our long-term goal is to generalize citation-based IR methods and apply the generalized methods to both classical references and mathematical concepts. In this paper, we restrict ourselves to mathematical formulae and define a Formula Concept Retrieval challenge with two subtasks: Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR). The former aims to define and explore a Formula Concept, a named bundle of equivalent representations of a formula, while the latter is designed to match a given formula to a previously assigned concept ID. Moreover, we present first machine-learning-based approaches to the FCD and FCR tasks, which we apply to a standardized test collection (the NTCIR arXiv dataset). Our FCD approach yields a recall of 68% for retrieving equivalent representations of frequent formulae, and 72% for extracting the formula name from the surrounding text. FCD and FCR will enable citing formulae within mathematical documents and facilitate semantic search as well as similarity computations for plagiarism detection or document recommender systems.
Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019).
July 21-25, 2019
42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
Natural Language Processing, Mathematical Language Processing, Mathematical Information Retrieval, Feature Analysis, Machine Learning