Our project aimed to build a search engine for the X5Learn project, an open educational resources (OER) database that gathers content from different sites in one web application [2]. Some documents are in non-textual forms, such as video lectures and podcasts. This makes search more difficult, as these documents must be transcribed before they can be searched, and the resulting transcripts contain transcription errors known as noise. We considered different techniques and architectures while building this project, and tried to make informed decisions when choosing our architecture, frameworks, and algorithms. For instance, to choose our content retrieval method, we ran different algorithms and compared their accuracy using statistical analysis.
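As an illustration of such a comparison, significance between two retrieval runs can be checked with a nonparametric paired test such as the Wilcoxon signed-rank test [14]. The sketch below is only illustrative: the per-query nDCG scores are hypothetical placeholders, not our measured results.

```python
# Minimal sketch: comparing two retrieval runs with a Wilcoxon
# signed-rank test (a nonparametric paired test, cf. [14]).
# The per-query nDCG scores below are illustrative placeholders.
from scipy.stats import wilcoxon

ndcg_bm25 = [0.61, 0.48, 0.75, 0.33, 0.58, 0.70, 0.42, 0.66]
ndcg_qld  = [0.65, 0.50, 0.71, 0.41, 0.63, 0.74, 0.40, 0.69]

stat, p_value = wilcoxon(ndcg_bm25, ndcg_qld)
print(f"W = {stat:.2f}, p = {p_value:.3f}")
# A small p-value suggests the difference between the two runs
# is unlikely to be due to chance alone.
```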
Related projects
Related works include 100,000 Podcasts: A Spoken English Document Corpus, which uses the same dataset [6]. Its authors implement standard retrieval models (BM25, and Query Likelihood (QL), a language model, with Dirichlet smoothing) together with the RM3 relevance model, using Pyserini [10]. Pyserini is built on the Lucene search library [4], and stemming is added with the Porter stemmer. They then compare the accuracy of the different algorithms using nDCG@10 and nDCG@5. Our project follows a similar methodology to the aforementioned paper, while using newer technologies and building libraries that allow developers to test search algorithms on their own datasets. In particular, we use Elasticsearch to make the implementation easier, and we build on top of that by also trying Learning-to-Rank (LTR) models, i.e. retrieval models that make use of machine learning.
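As a rough sketch of that baseline setup, the snippet below runs BM25 with RM3 pseudo-relevance feedback through Pyserini [10]. The index path and query are placeholders, and the import path depends on the Pyserini version (older releases expose SimpleSearcher under pyserini.search instead).

```python
# Sketch of the retrieval setup described above, using Pyserini [10],
# which wraps the Lucene search library [4].
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher('indexes/podcast-transcripts')  # hypothetical index path
searcher.set_bm25(k1=0.9, b=0.4)           # BM25 ranking
searcher.set_rm3(fb_terms=10, fb_docs=10,  # RM3 relevance feedback
                 original_query_weight=0.5)
# For Query Likelihood with Dirichlet smoothing instead:
# searcher.set_qld(mu=1000)

hits = searcher.search('gradient descent', k=10)
for rank, hit in enumerate(hits, start=1):
    print(f'{rank:2} {hit.docid:25} {hit.score:.4f}')
```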
Technology review
To build the project, we used ReactJS for the front-end, Python Flask for the back-end, and Elasticsearch for the search engine. Our choice was based on our experience and on recommendations from our clients. We could also have implemented the front-end in Python, but we thought JavaScript was better suited to our task and more practical, and that it gave a cleaner separation between the front-end and the back-end. We also considered using Lucene directly or building our search engine from scratch, but Elasticsearch proved to be more user-friendly and modern, and we expected its abundance of built-in features to save us time. It is also the most maintainable and scalable of these options.
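To give an idea of what this looks like in practice, the sketch below creates an Elasticsearch index whose transcript field uses the built-in LMDirichlet similarity and a Porter-stemming analyzer [3]. The index name, field name, and mu value are illustrative, and the body argument follows the 7.x elasticsearch Python client.

```python
# Sketch: an Elasticsearch index for noisy transcripts, using the
# built-in LMDirichlet similarity and Porter stemming [3].
# Index/field names and mu are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local node

settings = {
    "settings": {
        "index": {
            "similarity": {
                "lm_dirichlet": {"type": "LMDirichlet", "mu": 2000}
            }
        },
        "analysis": {
            "analyzer": {
                "english_stem": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "porter_stem"],
                }
            }
        },
    },
    "mappings": {
        "properties": {
            "transcript": {
                "type": "text",
                "analyzer": "english_stem",
                "similarity": "lm_dirichlet",
            }
        }
    },
}

es.indices.create(index="oer-transcripts", body=settings)
results = es.search(
    index="oer-transcripts",
    body={"query": {"match": {"transcript": "neural networks"}}},
)
```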
As it is the standard for such applications, we decided to use the Model-View-Controller architecture [17]. This structure makes the code more scalable while increasing its readability.
The search algorithms we considered were BM25, Query Likelihood with Dirichlet smoothing, and LambdaMART (a machine-learning approach). After comparing these algorithms on another dataset [7] which also includes noise, and after reviewing the literature on the topic [5], we decided to use the language model with Dirichlet smoothing.
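For reference, the query-likelihood score with Dirichlet smoothing, as formulated by Zhai and Lafferty [15], ranks a document d against a query q by

\[
\log p(q \mid d) \;=\; \sum_{w \in q} c(w; q)\,\log \frac{c(w; d) + \mu\, p(w \mid C)}{|d| + \mu},
\]

where c(w; d) is the count of term w in d, |d| is the document length, p(w | C) is the collection language model, and \mu is the smoothing parameter.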
If you want more information on how we built the search engine, you can check the report on how it is implemented and tested.
Link to the report

References
[1] “Google Cloud Speech-to-Text.” [Online]. Available: https://cloud.google.com/speech-to-text.
[2] “x5learn.org.” [Online]. Available: https://x5learn.org/.
[3] “Stemming,” Elasticsearch Reference. [Online]. Available: https://www.elastic.co/guide/en/elasticsearch/reference/current/stemming.html.
[4] Apache, “Lucene.” [Online]. Available: http://lucene.apache.org/.
[5] G. Bennett, F. Scholer, and A. Uitdenbogerd, A Comparative Study of Probabilistic and Language Models for Information Retrieval. 2008.
[6] A. Clifton, M. Eskevich, G. J. F. Jones, B. Carterette, and R. Jones, “100,000 Podcasts: A Spoken English Document Corpus,” pp. 5903–5917, 2020.
[7] A. Clifton et al., “The Spotify Podcast Dataset,” 2020, [Online]. Available: http://arxiv.org/abs/2004.04270.
[8] D. Davies, “How Search Engine Algorithms Work: Everything You Need to Know,” Search Engine J., 2020, [Online]. Available: https://www.searchenginejournal.com/search-engines/algorithms/#whysearc.
[9] X. Han and S. Lei, “Feature Selection and Model Comparison on Microsoft Learning-to-Rank Data Sets,” vol. 231, no. Fall, 2018, [Online]. Available: http://arxiv.org/abs/1803.05127.
[10] lintool, “pyserini,” PyPI. [Online]. Available: https://pypi.org/project/pyserini/.
[11] J. B. Lovins, “Development of a stemming algorithm,” Mech. Transl. Comput. Linguist., vol. 11, pp. 22–31, 1968.
[12] S. E. Robertson and K. Sparck Jones, “Relevance Weighting of Search Terms,” in Document Retrieval Systems, GBR: Taylor Graham Publishing, 1988, pp. 143–160.
[13] D. Turnbull, E. Bernhardson, D. Causse, and D. Worley, “Elasticsearch Learning to Rank Documentation,” 2019. [Online]. Available: https://media.readthedocs.org/pdf/elasticsearch-learning-to-rank/latest/elasticsearch-learning-to-rank.pdf.
[14] W. J. Conover, Practical Nonparametric Statistics, 3rd ed. New York: Wiley, 1999.
[15] C. Zhai and J. Lafferty, “A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval,” SIGIR Forum, vol. 51, no. 2, pp. 268–276, 2017, doi: 10.1145/3130348.3130377.
[17] S. Walther, ASP.NET MVC Framework Unleashed. Sams Publishing, 2009, ISBN 9780768689785.