Deep Dive into MS MARCO Web Search: Unpacking Dataset Characteristics

by datasets... June 29th, 2025

Explore a comprehensive analysis of the MS MARCO Web Search dataset, detailing its multilingual distribution, significant data skew, and rigorous test-train overlap minimization for robust model evaluation.

Table of Links

Abstract and 1 Introduction

2 Background and Related work

2.1 Web Scale Information Retrieval

2.2 Existing Datasets

3 MS MARCO Web Search Dataset and 3.1 Document Preparation

3.2 Query Selection and Labeling

3.3 Dataset Analysis

3.4 New Challenges Raised by MS MARCO Web Search

4 Benchmark Results and 4.1 Environment Setup

4.2 Baseline Methods

4.3 Evaluation Metrics

4.4 Evaluation of Embedding Models and 4.5 Evaluation of ANN Algorithms

4.6 Evaluation of End-to-end Performance

5 Potential Biases and Limitations

6 Future Work and Conclusions, and References

3.3 Dataset Analysis

We have constructed two scales of the dataset: Set-100M and Set-10B. Table 2 gives ...

Copyright of this story solely belongs to hackernoon.com.
