OVERVIEW OF THE TREC 2023 PRODUCT SEARCH TRACK

Daniel Campos (1), Surya Kallumadi (2), Corby Rosset (3), ChengXiang Zhai (4), and Alessandro Magnani (5)

(1) Snowflake, [email protected]
(2) Lowes
(3) Microsoft
(4) University of Illinois Urbana-Champaign
(5) Walmart
ABSTRACT
This is the first year of the TREC Product Search Track. The focus this year was the creation of
a reusable collection and the evaluation of the impact of metadata and multi-modal data on
retrieval accuracy. This year we leverage a new product search corpus, which includes contextual
metadata. Our analysis shows that in the product search domain, traditional retrieval systems are
highly effective and commonly outperform general-purpose pretrained embedding models. Our
analysis also evaluates the impact of using simplified and metadata-enhanced collections, finding no
clear trend in the impact of the expanded collection. We also see some surprising outcomes; despite
their widespread adoption and competitive performance on other tasks, we find that single-stage dense
retrieval runs are often noncompetitive or generate low-quality results in both the zero-shot
and fine-tuned settings.
1 Introduction
At TREC 2023, we hosted the first TREC Product Search Track, aiming to create a reusable general benchmark for
evaluating the performance of retrieval methods in the product search domain. We focus on providing a benchmark
similar in scale and format to NQ Kwiatkowski et al. [2019] or the Deep Learning Track Craswell et al. [2021], but
focused on product search. By providing a simple-to-use dataset, we believe that broad experimentation using popular
retrieval libraries Lin et al. [2021], Gao et al. [2022] can lead to improvements in retrieval performance.
In this first year of the track, we created a novel collection based on the ESCI Product Re-ranking dataset Reddy
et al. [2022], sampled novel queries, created enriched metadata in the form of additional text and images, and
seeded the evaluation with a broad range of baseline runs to aid collection reusability and to allow iteration and
experimentation on the use of additional context.
Unlike previous product search corpora, the Product Search Track is multi-modal and has a large enough scale to
explore the usage of neural retrieval methods. We observe somewhat surprising results using this scaled dataset and
a wide variety of baseline runs. Single-stage retrieval models that leverage vector representations do not consistently
outperform traditional retrieval methods such as BM25. Moreover, in the zero-shot setting, we find that larger vector-
based models do not always beat their smaller variants, which is at odds with other evaluation corpora such as
MTEB Muennighoff et al. [2023]. Finally, while additional metadata can improve retrieval performance at a macro
level, the extra information does not guarantee gains. In evaluating per-query performance, we find that vector-based
systems can lose performance when given the additional metadata. Please see the participant papers for more insights about what we
learned this year.
2 Task description
The product search track has one task: product ranking. Within this task, various enriched datasets are made available to
participants to allow them to enrich the collection as they see fit. Participants were allowed to submit up to three
official runs. When submitting each run, participants indicated which external data, pretrained models, and other
resources were used, as well as what style of model was used.

Figure 1: Prompt used in our synthetic query generation on a randomly selected product. The sampled product is
inserted into the {product} placeholder.
In the ranking task, given a query, the participants were expected to retrieve a ranked list of products from the full
collection based on the estimated likelihood that the product would meet the user’s need. Participants could submit up
to 100 products per query for this end-to-end ranking task.
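Runs follow the standard six-column TREC run format (query id, Q0, product id, rank, score, run tag). As a minimal sketch, assuming a hypothetical `retrieve` function that returns scored product ids, a run file capped at 100 products per query could be written as follows:

```python
# Minimal sketch of producing a TREC-format run file for the ranking task.
# `retrieve` is a hypothetical function mapping a query string to a list of
# (product_id, score) pairs; `queries` maps query id -> query text.

def write_run(queries, retrieve, path="run.trec", run_tag="my_run", k=100):
    with open(path, "w") as out:
        for qid, query in queries.items():
            ranked = sorted(retrieve(query), key=lambda pair: pair[1], reverse=True)[:k]
            for rank, (product_id, score) in enumerate(ranked, start=1):
                out.write(f"{qid} Q0 {product_id} {rank} {score:.4f} {run_tag}\n")
```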
We first selected a subset of 200 queries for the pooling and judging process. NIST then began judging
these queries, removing queries with high disagreement or those deemed un-judgable. A query is deemed un-judgable
if at least 50% of the judged products are relevant or if there are no relevant products. This led to a judged test set of
186 queries, on which we compare the quality of runs. The track received 62 submissions to the product ranking task, 39
of which were baseline runs that the track coordinators submitted.
Judgments were collected for each query product pair on a four-point scale:
[3] Perfectly relevant: The product is exactly what the user wants.
[2] Highly relevant: The product could match the user query, but it may be a substitute for the original query
intent. It may have a slightly different style, brand, or type, but a user would be satisfied if they received this
product.
[1] Related: The product seems related to the query but not the item the user seeks. Products in this category
could complement the user’s intended product.
[0] Irrelevant: The product has nothing to do with the query.
For binary metrics, we map judgment levels 3 and 2 to relevant and levels 1 and 0 to irrelevant, as sketched below.
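As an illustration, a minimal sketch of this binarization, assuming the graded judgments have been loaded into a nested dictionary (the variable names are ours, not part of the official tooling):

```python
# Collapse the 4-point judgments (0-3) into binary relevance for binary metrics.
# `graded_qrels` is an assumed structure: {query_id: {product_id: grade}}.

def binarize(graded_qrels, threshold=2):
    """Grades 2 and 3 map to relevant (1); grades 0 and 1 map to irrelevant (0)."""
    return {
        qid: {pid: int(grade >= threshold) for pid, grade in judged.items()}
        for qid, judged in graded_qrels.items()
    }
```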
The collection is based on the ESCI Shopping queries dataset Reddy et al. [2022]. While this dataset is focused on
improving product search, it lacks a clear end-to-end retrieval benchmark. Instead, the dataset includes a re-ranking
task in which the top 40 results retrieved from the Amazon product corpus must be re-ranked for improved relevance.
While this re-ranking task is quite important to the end-to-end performance of a product search engine, it does not
allow for ample understanding of what impacts the performance of end-to-end retrieval in the product domain.
Since there is no public source of shopping queries, nor does the ESCI dataset have a publicly accessible
test set, we created a new set of 998 evaluation queries leveraging GPT-4 and heuristic-based sampling. For
query generation, we used GPT-4 with a prompt that builds on the work of InPars Bonifacio et al. [2022],
Jeronymo et al. [2023], creating 500 queries using the prompt shown in figure 1.
Figure 2: Each product in the collection contains basic information such as a title and product description along with
contextual metadata, which includes attributes such as reviews, dimensions, etc.
Figure 3: Some examples of product images. Some items have multiple images, while others have none.
To cover cases where this approach fails and to study how models perform on more typical product search queries
with high keyword overlap, we also generate queries by selecting sub-spans of product titles or descriptions. In generating
queries with GPT-4, we aimed to create a reliable way of producing new and interesting queries for the collection, as
we do not otherwise have a method to reliably sample novel queries.
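The two generation strategies could look roughly like the sketch below; the prompt wording, the `generate` callable, and the span-length bounds are illustrative assumptions rather than the exact setup used for the track.

```python
import random

# Sketch of the two query-generation strategies: LLM-based (InPars-style) and
# heuristic sub-span sampling. The prompt text and `generate` callable are
# illustrative assumptions, not the exact prompt shown in figure 1.
PROMPT_TEMPLATE = (
    "Given the following product, write a realistic shopping query a user "
    "might issue to find it.\nProduct: {product}\nQuery:"
)

def llm_query(product_text, generate):
    """`generate` is an assumed LLM completion function (e.g., a GPT-4 wrapper)."""
    return generate(PROMPT_TEMPLATE.format(product=product_text)).strip()

def span_query(product_title, min_words=3, max_words=8):
    """Sample a contiguous sub-span of the product title as a keyword-style query."""
    words = product_title.split()
    length = min(len(words), random.randint(min_words, max_words))
    start = random.randint(0, len(words) - length)
    return " ".join(words[start:start + length])
```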
3 Datasets
This year, we leverage an enriched and filtered product search dataset based on the ESCI dataset Reddy et al. [2022].
We will first describe the dataset and its generation before we describe how we adapted it to best suit the track.
Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search The Shopping Queries
Dataset, or ESCI, is a large-scale benchmark that focuses on a set of product search
use cases and provides frameworks through which improvements can be studied. Unlike other product retrieval datasets,
the ESCI corpus contains English, Japanese, and Spanish queries.
The dataset centers around three tasks that are important to product search and can be used to improve customer
experience: Query-Product Ranking, Multi-class Product Classification, and Product Substitute Identification. For
Query-Product Ranking, given a user-specified query and the top 40 products retrieved by a commercial product
search engine, the goal is to rank the products so that more relevant products are ranked above non-relevant products.
For Multi-class Product Classification, given a query and a result list of products, the goal is to classify each product
as an exact match, substitute, complement, or irrelevant. For Product Substitute Identification, given a product and a
list of potential substitutes, the goal is to identify which could be substituted.
Across the three tasks, there are two variants of the product collection, with the product ranking task using the smaller
collection and the other tasks using the larger one. Given our focus on retrieval, we leverage the former. Within each
task, there is a large training set that contains query-product pairs annotated as exact match (E),
substitute (S), complement (C), or irrelevant (I). The data contains the following fields: example id, query, query id,
product id, product locale, ESCI label, small version, large version, split, product title, product description, product
bullet point, product brand, product color, and source.
The smaller ranking dataset has 48,300 unique queries and 1,118,011 relevance judgments. The data sets are stratified
into train, dev, and test, of which only the labels for the train and dev have been released publicly. On average, each
query has 20 judgments for English and 28 for other languages.
Item Instances Notes
Collection 1,661,907 90+% of products have at least 1 image
Train Queries 30,734 Train + Dev
Train QREL 392,119 N/A
2023 Test Queries 926 N/A
2023 Test Queries (Judged) 182 N/A
Table 1: High-level statistics on the size of the collection and queries of the TREC Product Search 2023 Collection
Product Search Track Corpus While the full ESCI dataset is multilingual and features over 3 million items, we
narrowed our focus to English only. We attempt to enrich the dataset with additional metadata and images for these
English products, as we believe this can be very important for product search. The ESCI dataset focuses on textual
information and ignores the behavioral, categorical, visual, and numerical features that can be used for ranking. Product
metadata enrichment improves product representations by including additional helpful information such as reviews,
attributes such as size and color, and categorical ordering, extracted from Amazon's online catalog (metadata was
extracted from https://github.com/shuttie/esci-s/).
Figure 2 shows an example product with its additional metadata. Using the ASIN from this enrichment, we also crawled
and retrieved images for each product. Product images contain one to ten thumbnail-size images for a given product,
which were pulled from Amazon and joined with the ESCI dataset. Since these images are extracted from product
thumbnails, each image is only 64x64, which keeps the entire collection relatively small. Some product image
examples can be found in figure 3. Numerical details on the collection can be found in table 1.
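To make the difference between the simple and metadata-enhanced collections concrete, the sketch below flattens a product record into the two text representations; the field names follow the ESCI schema described above, while the layout of the enriched `metadata` attribute is an assumption.

```python
# Sketch: flatten a product record into the "simple" and "metadata-enhanced"
# text representations. Field names follow the ESCI schema; the enriched
# `metadata` dict layout is an assumption, not the exact track preprocessing.

def simple_text(product: dict) -> str:
    parts = [product.get("product_title", ""), product.get("product_description", "")]
    return " ".join(p for p in parts if p)

def metadata_text(product: dict) -> str:
    parts = [
        simple_text(product),
        product.get("product_bullet_point", ""),
        product.get("product_brand", ""),
        product.get("product_color", ""),
        # Enriched attributes such as reviews, dimensions, and category path.
        " ".join(f"{k}: {v}" for k, v in product.get("metadata", {}).items()),
    ]
    return " ".join(p for p in parts if p)
```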
4 Results and analysis
Submitted runs A total of 4 groups participated in the TREC 2023 Product Search Track, including baseline runs
submitted by the track coordinators. Across all groups, we received 62 run submissions, including 49 baseline runs.
Table 2 and figure 4 summarize the submission statistics for this year's track. This set of runs led to 182 evaluated
queries, which we believe will likely make this a highly reusable collection apt for future experimentation.
This year, we had fewer participating groups than we hoped for compared to similar tracks (Deep learning had
15 groups in 2019, 25 in 2020, and 19 in 2021). We believe this might indicate the broader saturation of the IR
community by large-scale datasets focused on single-stage retrieval via neural language models.
Overall results Table 3 presents a standard set of relevance quality metrics for product search ranking runs. Reported
metrics include Normalized Discounted Cumulative Gain (NDCG) [Järvelin and Kekäläinen, 2002] at depths 10 and
100, Recall at depths 10 and 100, and inferred average precision (infAP) [Yilmaz and Aslam, 2006]. Results
represent the mean score across the 182 queries that NIST assessed, and scores are computed using trec_eval.
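Official scores were produced with trec_eval; as a rough equivalent, a minimal sketch using the pytrec_eval bindings (the input dictionaries are assumed to be loaded from the qrels and run files elsewhere):

```python
import pytrec_eval

# Sketch: score a run with pytrec_eval. `qrels` maps query id -> {product id:
# graded judgment}; `run` maps query id -> {product id: retrieval score}.

def score_run(qrels, run):
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut", "recall"})
    per_query = evaluator.evaluate(run)

    def mean(metric):
        return sum(q[metric] for q in per_query.values()) / len(per_query)

    return {m: mean(m) for m in ("ndcg_cut_10", "ndcg_cut_100", "recall_10", "recall_100")}
```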
Table 2: TREC 2023 Product Search Track run submission statistics.
                         All Groups   Coordinator Baselines
Number of groups              4                 1
Number of total runs         62                39
Figure 4: Relative system ordering based on mean NDCG@10.
None of these results leverage the existing development portion of ESCI or the unreleased eval set. In subsequent
discussions, we employ NDCG@10 as our primary evaluation metric to analyze the ranking quality produced by
different methods.
To analyze how different approaches perform in the high-recall regime, we employ recall at 100 (R@100), which
measures how often the relevant products are present in the top 100 candidates even if they are not ranked highly. In
product search, users often apply sorting and filtering by price, size, color, etc. When a user filters out portions of
the ranked candidate set, recall over a larger candidate set becomes highly important.
Looking at the results in table 3, we see clear gains from hybrid retrieval systems that leverage multiple retrievers to
improve performance (f_splade_bm25, cfdaclip_MR_A). We further see that traditional retrieval methods such as
BM25 perform consistently well in this domain; BM25 is one of the top-performing systems
despite a lack of collection-specific optimization.
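The exact fusion strategies of the hybrid runs are described in the participant papers; as a generic illustration of combining a lexical and a dense ranked list, a minimal reciprocal rank fusion sketch:

```python
from collections import defaultdict

# Sketch: reciprocal rank fusion (RRF) of two or more ranked lists, e.g. BM25
# and a dense retriever. This illustrates a generic hybrid strategy, not the
# exact method used by any submitted run.

def rrf(rankings, k=60, depth=100):
    """`rankings` is a list of ranked lists of product ids (best first)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:depth]
```

Because RRF uses only ranks, it is insensitive to differences in score scales between the lexical and dense retrievers, which is why it is a common default for hybrid runs.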
When we evaluate specific queries as shown in table 4, we find that for some queries, such as Elite (Elite)
Volano/drivo/Kura For Body 329770001, small measuring rice bin Asvel, or Dinosaur Pee Pee Teepee Wee, one or a
few retrieval systems achieve high NDCG@10 scores while most systems score 0. Each of these queries is looking
for a specific item, and surprisingly, the systems that excel at spear-fishing each product are inconsistent across
queries.
Performance on long vs. short queries To better understand the effect of query length, we stratified the queries by
length to analyze whether system ordering depends on it. We study this because shorter queries, such as Google
Wi-Fi System Mesh, tend to be broader in the world of product search. In contrast, longer queries, such as 21x21
beige sun shade sail patio UV protection outdoor backyard, focus on finding specific products. We stratify queries by
treating queries with 7 or more words as long and everything else as short. This leaves 182 total queries: 81 short
queries and 101 long queries. When we use these stratified sets of queries, we find high Kendall's tau values between
system orderings: 0.9344 (p = 7.13e-27) for short queries and 0.9640 (p = 1.70e-28) for long queries. This agreement
is surprisingly high given the large variation in NDCG shown in table 11, where, for example, with BM25, the
NDCG@10 difference between long and short queries is a relative change of more than 10%.
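A minimal sketch of the agreement computation, assuming it compares the run ordering on a query stratum against the ordering on the full query set (the per-run score dictionaries are assumed to be computed elsewhere):

```python
from scipy.stats import kendalltau

# Sketch: Kendall's tau between two system orderings, e.g. mean NDCG@10 on the
# full query set vs. on a stratum (short or long queries). The input dicts map
# run name -> mean NDCG@10.

def ordering_agreement(scores_full, scores_stratum):
    runs = sorted(set(scores_full) & set(scores_stratum))
    tau, p_value = kendalltau(
        [scores_full[r] for r in runs],
        [scores_stratum[r] for r in runs],
    )
    return tau, p_value
```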
Metadata vs. Simple Collection To understand the impact of enriching the collection with textual metadata in our
baselines, we provided runs that use the simple collection and the metadata-enhanced collection.
Table 3: Summary of results for all runs.
Run Group InfAP NDCG @10 NDCG @100 R@10 R@100
f_splade_bm25 F 0.6068 0.7505 0.7244 0.4919 0.8015
f_splade_clip_bm25 F 0.5731 0.7327 0.7143 0.4739 0.8001
cfdaclip_MR_A JBNU 0.5910 0.7257 0.7019 0.4766 0.7857
cfdaclip_ER_B JBNU 0.5905 0.7256 0.7010 0.4766 0.7862
cfdaclip_ER_A JBNU 0.5902 0.7252 0.7008 0.4765 0.7840
JBNU-C JBNU 0.5885 0.7251 0.7074 0.4700 0.7870
cfdaclip_MR_B JBNU 0.5903 0.7251 0.7010 0.4765 0.7859
metadata-enhanced-gte-small-zero-shot Baselines 0.4955 0.6647 0.6500 0.4363 0.7416
simple-gte-small-zero-shot Baselines 0.4818 0.6612 0.6492 0.4375 0.7372
JBNU-2 JBNU 0.4792 0.6583 0.6208 0.4359 0.7272
BM25-pyserini-simple-collection Baselines 0.4769 0.6540 0.6148 0.4287 0.7241
JBNU-1 JBNU 0.4828 0.6531 0.6185 0.4092 0.7272
BM25-pyserini-metadata-collection Baselines 0.4729 0.6408 0.6160 0.4254 0.7272
f_gpt_rerank F 0.4673 0.6225 0.6599 0.3765 0.8001
JBNU-A JBNU 0.4500 0.5989 0.5607 0.3772 0.6636
r_gpt3d5_turbo r 0.4174 0.5950 0.5889 0.3806 0.7272
metadata-enhanced-all-mpnet-base-v2-zero-shot Baselines 0.4144 0.5937 0.5541 0.3862 0.6512
simple-all-mpnet-base-v2-zero-shot Baselines 0.4000 0.5895 0.5508 0.3806 0.6348
JBNU-B JBNU 0.4349 0.5763 0.5380 0.3580 0.6339
metadata-enhanced-all-MiniLM-L12-v2-zero-shot Baselines 0.3844 0.5660 0.5309 0.3821 0.6558
metadata-enhanced-all-MiniLM-L6-v2-zero-shot Baselines 0.3688 0.5328 0.5164 0.3654 0.6415
simple-all-MiniLM-L12-v2-zero-shot Baselines 0.3483 0.5288 0.5161 0.3502 0.6365
metadata-enhanced-trec-product-search-gte-small Baselines 0.3520 0.5168 0.5101 0.3443 0.5859
metadata-enhanced-trec-product-search-e5-small-v2 Baselines 0.3488 0.5119 0.5082 0.3481 0.6096
metadata-enhanced-trec-product-search-gte-base Baselines 0.3423 0.5009 0.5004 0.3400 0.5895
simple-e5-large-zero-shot Baselines 0.3339 0.4998 0.4490 0.3428 0.5537
simple-all-MiniLM-L6-v2-zero-shot Baselines 0.3261 0.4952 0.4924 0.3334 0.6099
simple-trec-product-search-gte-small Baselines 0.3194 0.4901 0.4902 0.3080 0.5692
simple-trec-product-search-gte-base Baselines 0.3067 0.4777 0.4813 0.3123 0.5676
simple-trec-product-search-all-miniLM-L12-v2 Baselines 0.3060 0.4763 0.4589 0.3100 0.5351
metadata-enhanced-trec-product-search-bge-small-en Baselines 0.3193 0.4721 0.4708 0.3012 0.5565
metadata-trec-product-search-all-miniLM-L12-v2 Baselines 0.3129 0.4681 0.4603 0.3081 0.5581
metadata-trec-product-search-all-miniLM-L6-v2 Baselines 0.3144 0.4673 0.4675 0.3181 0.5528
simple-trec-product-search-all-miniLM-L6-v2 Baselines 0.3008 0.4591 0.4599 0.2931 0.5429
metadata-enhanced-gte-large-zero-shot Baselines 0.2503 0.4501 0.4103 0.2698 0.4978
simple-trec-product-search-bge-small-en Baselines 0.2726 0.4379 0.4328 0.2741 0.5080
search-dpr-bert-base Baselines 0.2648 0.4272 0.4333 0.2796 0.5068
metadata-enhanced-trec-product-search-e5-base-v2 Baselines 0.2703 0.4242 0.4118 0.2793 0.5148
metadata-enhanced-trec-product-search-bge-base-en Baselines 0.2709 0.4237 0.4165 0.2938 0.4955
metadata-enhanced-trec-product-search-dpr-bert Baselines 0.2636 0.4165 0.4276 0.2774 0.5208
simple-trec-product-search-all-mpnet-base-v2 Baselines 0.2377 0.4090 0.4013 0.2507 0.4747
metadata-trec-product-search-all-mpnet-base-v2 Baselines 0.2611 0.4089 0.4118 0.2643 0.5006
simple-trec-product-search-bge-base-en Baselines 0.2448 0.4064 0.4027 0.2628 0.4728
simple-gte-large-zero-shot Baselines 0.2146 0.3930 0.3654 0.2319 0.4294
simple-bge-small-zero-shot Baselines 0.1898 0.3680 0.3475 0.2059 0.4188
metadata-enhanced-bge-base-en-zero-shot Baselines 0.1919 0.3396 0.3290 0.2211 0.4301
simple-bge-base-zero-shot Baselines 0.1178 0.2948 0.2458 0.1479 0.2664
simple-gte-base-zero-shot Baselines 0.0522 0.1493 0.1131 0.0581 0.0965
simple-bge-large-zero-shot Baselines 0.0498 0.1486 0.1056 0.0537 0.0787
simple-e5-base-zero-shot Baselines 0.0439 0.1168 0.0938 0.0527 0.0894
metadata-enhanced-e5-base-v2-zero-shot Baselines 0.0276 0.0936 0.0861 0.0375 0.1021
metadata-enhanced-gte-base-zero-shot Baselines 0.0285 0.0604 0.0614 0.0333 0.0940
simple-bert-base-uncased-zero-shot Baselines 0.0074 0.0352 0.0294 0.0100 0.0374
metadata-enhanced-bge-large-en-zero-shot Baselines 0.0101 0.0323 0.0287 0.0108 0.0314
simple-contriever-base-zero-shot Baselines 0.0049 0.0159 0.0159 0.0062 0.0372
metadata-enhanced-e5-small-v2-zero-shot Baselines 0.0102 0.0142 0.0116 0.0090 0.0126
simple-e5-small-zero-shot Baselines 0.0071 0.0113 0.0098 0.0089 0.0130
metadata-enhanced-contriever-base-msmarco Baselines 0.0022 0.0081 0.0108 0.0026 0.0332
metadata-enhanced-trec-product-search-bge-large-en Baselines 0.0000 0.0021 0.0008 0.0000 0.0000
simple-trec-product-search-gte-large Baselines 0.0000 0.0015 0.0007 0.0001 0.0001
metadata-enhanced-trec-product-search-e5-large-v2 Baselines 0.0000 0.0015 0.0007 0.0000 0.0000
metadata-enhanced-trec-product-search-gte-large Baselines 0.0000 0.0011 0.0006 0.0000 0.0000
Tables 5 and 6 provide detailed results on the impact of using metadata vs. simple data across some baseline runs.
Based on this data, we note that when focusing on NDCG@10, the introduction of metadata gives a somewhat mixed
picture. Some models benefit from the additional information, and others see losses.
Despite this variability, the impact tends to be relatively small, as strong retrieval models show an effect of less than
5%. When we focus on recall, the picture changes, as most models see an improvement in recall when using metadata.
Query   Max-Mean Gap   Systems with NDCG@10 = 0 (%)
Elite (Elite) Volano/drivo/Kura For Body 329770001 0.9781 96.77%
small measuring rice bin Asvel 0.8037 70.97%
Switch protective film Japanese glass blue light reduction water repellent anti-fingerprint 0.7968 77.42%
Elegant satin floral lace ribbon lingerie set 0.7264 59.68%
onlypuff Pocket Shirts for Women Casual 0.6750 27.42%
Ekouaer Long Nightgown,Women’s Loungewear Short Sleeve 0.6590 33.87%
Lugz Women’s Empire Hi Wvt Fashion Boot 0.6559 24.19%
Dinosaur Pee Pee Teepee Wee 0.6553 41.94%
10th birthday decorations for girl 0.6548 19.35%
juDanzy kids knee high tube socks with grips 0.6461 58.06%
Matching Delivery Robe and Swaddle Blanket 0.6429 41.94%
UCGOU Bubble Mailers 7.25x12 Inch Teal 25 Pack 0.6277 24.19%
Cicy Bell Women’s Sunflower 0.6211 62.90%
Canomo Lamp Light Kit Make a 0.6209 19.35%
5L matte black rectangular trash can with soft close lid and anti-bag slip liner for bathroom or kitchen 0.6206 50.00%
Women’s UPF 50+ cotton linen bucket sun hat beige small 0.6163 24.19%
Small breed wet dog food Hill’s Science Diet Chicken & Barley Recipe 0.6038 40.32%
DKB Evian Jetted Whirlpool 0.6031 27.42%
OtterBox Symmetry Disney Princess Mulan iPhone Xs iPhone X case 0.6002 45.16%
Stars in the Desert book 0.5958 58.06%
DC Collectibles Batman Arkham Origins 0.5930 35.48%
fall sunflower pumpkin placemats set of 6 cotton linen washable table mats 0.5853 27.42%
300 piece jigsaw puzzle Kitchen Memories by Steve Crisp 0.5821 30.65%
Marvel Avengers Endgame Gauntlet T-Shirt 0.5816 20.97%
Pahajim Women Fashion Purses Handbags Shoulder Tote Bags Top Handle Satchel 0.5811 46.77%
Mai Puru Endo Mai’s First Photo Collection 0.5806 66.13%
girls princess dress up costume headband accessories 0.5764 19.35%
Nanatang Badflower Logo Men’s Long Sleeve Sweatshirt’s 0.5756 14.52%
HAPY SHOP 80 Pcs Silver Alligator Hair 0.5705 45.16%
VERYKE L-Shaped sectional sofa chenille fabric golden legs living room 0.5689 27.42%
ZINUS Owen Wood Platform 0.5667 29.03%
Xperia 10 II Blue Light Cut Glass Film Asahi Japanese Ultra Thin Anti-Bubble Anti-Fingerprint 0.5630 67.74%
Acer V6 V196LB 19" HD 0.5619 50.00%
Table 4: Per-query gap between mean NDCG@10 and max NDCG@10, and the percentage of retrieval systems with
an NDCG@10 of zero.
Run Zero Shot NDCG @10 (metadata) NDCG @10 (simple) Impact
BM25 Y 0.6408 0.6540 -0.0133
all-MiniLM-L12-v2 Y 0.5660 0.5288 0.0371
all-MiniLM-L6-v2 Y 0.5328 0.4952 0.0376
all-mpnet-base-v2 Y 0.5937 0.5895 0.0043
bge-base-en Y 0.3396 0.2948 0.0448
bge-large-en Y 0.0323 0.1486 -0.1163
contriever-base Y 0.0081 0.0159 -0.0079
e5-base-v2 Y 0.0936 0.1168 -0.0232
e5-small-v2 Y 0.0142 0.0113 0.0029
gte-base Y 0.0604 0.1493 -0.0889
gte-large Y 0.4501 0.3930 0.0571
gte-small Y 0.6647 0.6612 0.0035
bge-base-en N 0.4237 0.4064 0.0174
bge-small-en N 0.4721 0.4379 0.0342
Bert-base N 0.4165 0.4272 -0.0107
gte-base N 0.5009 0.4777 0.0233
gte-large N 0.0011 0.0015 -0.0005
gte-small N 0.5168 0.4901 0.0267
all-miniLM-L12-v2 N 0.4681 0.4763 -0.0082
all-miniLM-L6-v2 N 0.4673 0.4591 0.0082
all-mpnet-base-v2 N 0.4089 0.4090 -0.0001
Table 5: NDCG@10 performance of retrieval methods using the simple collection and the metadata-enhanced collection.
However, like the impact on the top ten, recall improvements are minor, with a few exceptions. Across both metrics,
we do not see any impact trend related to fine-tuned vs. zero-shot models.

Finetune vs. Zero-Shot As part of our baselines, we evaluated a set of naive runs where we fine-tune each model in a
single, uniform fashion and compare the impact of fine-tuning across runs.
Run Zero Shot R@100 (Metadata) R@100 (Simple) Impact
BM25 Y 0.7272 0.7241 0.0032
all-MiniLM-L12-v2 Y 0.6558 0.6365 0.0193
all-MiniLM-L6-v2 Y 0.6415 0.6099 0.0316
all-mpnet-base-v2 Y 0.6512 0.6348 0.0163
bge-base-en Y 0.4301 0.2664 0.1637
bge-large-en Y 0.0314 0.0787 -0.0473
contriever-base Y 0.0332 0.0372 -0.0040
e5-base-v2 Y 0.1021 0.0894 0.0127
e5-small-v2 Y 0.0126 0.0130 -0.0005
gte-base Y 0.0940 0.0965 -0.0026
gte-large Y 0.4978 0.4294 0.0684
gte-small Y 0.7416 0.7372 0.0044
bge-base-en N 0.4955 0.4728 0.0227
bge-small-en N 0.5565 0.5080 0.0485
Bert-base N 0.5208 0.5068 0.0140
gte-base N 0.5895 0.5676 0.0219
gte-large N 0.0000 0.0001 -0.0001
gte-small N 0.5859 0.5692 0.0167
all-miniLM-L12-v2 N 0.5581 0.5351 0.0230
all-miniLM-L6-v2 N 0.5528 0.5429 0.0099
all-mpnet-base-v2 N 0.5006 0.4747 0.0259
Table 6: Recall@100 performance of retrieval methods using the simple collection and the metadata-enhanced collection.
We leverage the Tevatron library and follow the training procedure from its NQ example
(https://github.com/texttron/tevatron/blob/main/examples/). We train each model for 40 epochs on 4 A100 GPUs with
a batch size of 128, cross-device negatives, and learning rates of 1e-5, 2e-5, 3e-5, 5e-5, and 1e-4, selecting the model
with the lowest validation loss at the end. These runs are not meant to be highly optimized fine-tuning runs but general
explorations of the impact of fine-tuning. As shown in tables 9 and 10, we see a consistent trend: the larger models
suffer from fine-tuning (indicating the fine-tuning recipe was not well suited to them), while smaller models see large
gains, in some cases going from completely unusable to highly competitive.
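The baselines above were trained with Tevatron; as a simpler illustrative stand-in, the sketch below fine-tunes a small bi-encoder with sentence-transformers and in-batch negatives. The training pairs and hyperparameters shown are assumptions for illustration, not the exact recipe used for the track runs.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative fine-tuning sketch with sentence-transformers and in-batch
# negatives. The track baselines used Tevatron with cross-device negatives;
# this is a simplified stand-in.

def finetune(model_name, pairs, output_path, epochs=40, batch_size=128, lr=2e-5):
    """`pairs` is an assumed list of (query, relevant_product_text) tuples."""
    model = SentenceTransformer(model_name)
    examples = [InputExample(texts=[query, product]) for query, product in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives
    model.fit(
        train_objectives=[(loader, loss)],
        epochs=epochs,
        optimizer_params={"lr": lr},
        output_path=output_path,
    )
    return model
```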
5 Conclusion
This is the first year of the TREC Product Search Track. In our first year, we aimed to create a complete collection that
could reliably evaluate the performance of different retrieval methods in the product search domain. In creating this
initial collection, we introduce forms of collection enrichment via product images and metadata. Using this collection,
we create a large set of artificial queries using a large language model and evaluate system performance using pooled
judgments of query-product pairs.
While this year’s participation was light, we believe that the strong fundamentals of this collection pave the way to
broader experimentation. This report summarizes the product search track’s creation, the systems’ high-level perfor-
mance, and the perceived impact of additional metadata.
Acknowledgement
We thank Noveen Sachdeva, who helped find, extract, and process the images for this track.
References
Luiz Henrique Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. Inpars: Data augmentation
for information retrieval using large language models. ArXiv, abs/2202.05144, 2022. URL https://api.
semanticscholar.org/CorpusID:246705967.
Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Ellen M Voorhees, and Ian Soboroff. Trec deep
learning track: Reusable test collections in the large data regime. In Proc. SIGIR, pages 2369–2375, 2021.
Luyu Gao, Xueguang Ma, Jimmy J. Lin, and Jamie Callan. Tevatron: An efficient and flexible toolkit for dense
retrieval. ArXiv, abs/2203.05765, 2022. URL https://api.semanticscholar.org/CorpusID:247411217.
K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM TOIS, 20(4):422–446, 2002.
Vitor Jeronymo, Luiz Henrique Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto de Alencar Lotufo, Jakub Zavrel,
and Rodrigo Nogueira. Inpars-v2: Large language models as efficient dataset generators for information retrieval.
ArXiv, abs/2301.01820, 2023. URL https://api.semanticscholar.org/CorpusID:255440689.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle
Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei
Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc V. Le, and Slav Petrov. Natural questions: A benchmark for
question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
URL https://api.semanticscholar.org/CorpusID:86611921.
Jimmy J. Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, Rodrigo Nogueira, and David R.
Cheriton. Pyserini: A python toolkit for reproducible information retrieval research with sparse and dense represen-
tations. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information
Retrieval, 2021. URL https://api.semanticscholar.org/CorpusID:235366815.
Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. Mteb: Massive text embedding benchmark,
2023.
Chandan K. Reddy, Lluís Màrquez i Villodre, Francisco B. Valero, Nikhil S. Rao, Hugo Zaragoza, Sambaran Bandy-
opadhyay, Arnab Biswas, Anlu Xing, and Karthik Subbian. Shopping queries dataset: A large-scale esci bench-
mark for improving product search. ArXiv, abs/2206.06588, 2022. URL https://api.semanticscholar.org/
CorpusID:249642102.
Emine Yilmaz and Javed A. Aslam. Estimating average precision with incomplete and imperfect judgments. In Inter-
national Conference on Information and Knowledge Management, 2006. URL https://api.semanticscholar.
org/CorpusID:7500319.
A Per Query Performance
B Impact of Fine-tuning on Single-Stage Language Model Based Retrieval
C Stratifying Based on Query Length
Query   Max   Mean   Gap (Max vs Mean)   Systems with NDCG@10 = 0 (%)
Elite (Elite) Volano/drivo/Kura For Body 329770001 1.0000 0.0219 0.9781 96.77%
small measuring rice bin Asvel 1.0000 0.1963 0.8037 70.97%
Switch protective film Japanese glass blue light reduction water repellent anti-fingerprint 0.9056 0.1088 0.7968 77.42%
Elegant satin floral lace ribbon lingerie set 0.8122 0.0858 0.7264 59.68%
onlypuff Pocket Shirts for Women Casual 1.0000 0.3250 0.6750 27.42%
Ekouaer Long Nightgown,Women’s Loungewear Short Sleeve 0.8415 0.1825 0.6590 33.87%
Lugz Women’s Empire Hi Wvt Fashion Boot 1.0000 0.3441 0.6559 24.19%
Dinosaur Pee Pee Teepee Wee 1.0000 0.3447 0.6553 41.94%
10th birthday decorations for girl 0.9558 0.3010 0.6548 19.35%
juDanzy kids knee high tube socks with grips 1.0000 0.3539 0.6461 58.06%
Matching Delivery Robe and Swaddle Blanket 1.0000 0.3571 0.6429 41.94%
UCGOU Bubble Mailers 7.25x12 Inch Teal 25 Pack 0.9117 0.2840 0.6277 24.19%
Cicy Bell Women’s Sunflower 0.8891 0.2680 0.6211 62.90%
Canomo Lamp Light Kit Make a 0.9157 0.2948 0.6209 19.35%
5L matte black rectangular trash can with soft close lid and anti-bag slip liner for bathroom or kitchen 0.8923 0.2717 0.6206 50.00%
Women’s UPF 50+ cotton linen bucket sun hat beige small 1.0000 0.3837 0.6163 24.19%
Small breed wet dog food Hill’s Science Diet Chicken & Barley Recipe 0.9904 0.3866 0.6038 40.32%
DKB Evian Jetted Whirlpool 0.9083 0.3052 0.6031 27.42%
OtterBox Symmetry Disney Princess Mulan iPhone Xs iPhone X case 0.9242 0.3240 0.6002 45.16%
Stars in the Desert book 1.0000 0.4042 0.5958 58.06%
DC Collectibles Batman Arkham Origins 0.9369 0.3439 0.5930 35.48%
fall sunflower pumpkin placemats set of 6 cotton linen washable table mats 0.9813 0.3960 0.5853 27.42%
300 piece jigsaw puzzle Kitchen Memories by Steve Crisp 1.0000 0.4179 0.5821 30.65%
Marvel Avengers Endgame Gauntlet T-Shirt 0.9842 0.4026 0.5816 20.97%
Pahajim Women Fashion Purses Handbags Shoulder Tote Bags Top Handle Satchel 1.0000 0.4189 0.5811 46.77%
Mai Puru Endo Mai’s First Photo Collection 0.6353 0.0547 0.5806 66.13%
girls princess dress up costume headband accessories 0.8645 0.2881 0.5764 19.35%
Nanatang Badflower Logo Men’s Long Sleeve Sweatshirt’s 0.9375 0.3619 0.5756 14.52%
HAPY SHOP 80 Pcs Silver Alligator Hair 0.8525 0.2820 0.5705 45.16%
VERYKE L-Shaped sectional sofa chenille fabric golden legs living room 0.9095 0.3406 0.5689 27.42%
ZINUS Owen Wood Platform 1.0000 0.4333 0.5667 29.03%
Xperia 10 II Blue Light Cut Glass Film Asahi Japanese Ultra Thin Anti-Bubble Anti-Fingerprint 0.6367 0.0737 0.5630 67.74%
Acer V6 V196LB 19" HD 0.6797 0.1178 0.5619 50.00%
eyebrow tattoo stickers natural black dark brown 0.5980 0.0393 0.5587 80.65%
buffalo plaid flannel pajama pants for women with pockets 0.7937 0.2350 0.5587 35.48%
Women’s Hotouch fringe vest faux suede 70s hippie sleeveless cardigan 1.0000 0.4460 0.5540 35.48%
Microsoft Surface Laptop 2 0.9698 0.4175 0.5523 25.81%
men’s sauna suit diet pants sportswear fat burner fitness wear exercise running weight loss black 0.6477 0.0972 0.5505 59.68%
Fabri.YWL 1955 Ford Thunderbird Vintage Look Reproduction 0.9639 0.4182 0.5457 37.10%
Figma Avengers Age of Ultron Iron Man Mark 43 Exclusive 0.8974 0.3579 0.5395 20.97%
Under Armour Men’s Jungle Rat Tactical 0.9614 0.4224 0.5390 24.19%
Black Cast Iron Norfolk Door Latch 8" Tall Thumb Lock with Mounting Hardware 0.9056 0.3707 0.5349 35.48%
CRAFTSMAN CMXZVBE38759 2-1/2 in. x 20 ft. POS-I-LOCK 0.7560 0.2226 0.5334 58.06%
Columbia Women’s Newton Ridge Waterproof Hiking Boot lightweight comfortable 0.9567 0.4238 0.5329 17.74%
KATUMO Baby Crib Mobile Woodland Hot Air Balloons Birds Clouds Nursery Decoration 1.0000 0.4707 0.5293 22.58%
Spider-Man Miles Morales Venom 3 Pack T-Shirts for kids 0.7881 0.2589 0.5292 53.23%
double zipper smart key case genuine leather 0.8514 0.3239 0.5275 14.52%
Alex Evenings womens Plus Size Midi Scoop Neck Shift 0.8538 0.3269 0.5269 33.87%
cotton candy party kit 100 cones five flavors 0.8898 0.3647 0.5251 25.81%
black wood electric guitar ornament 1.0000 0.4852 0.5148 24.19%
Motorola Moto E6 Play 5.5" pantalla, 2GB RAM, 32GB almacenamiento, Android 9.0, Dual SIM, Gris 0.9307 0.4170 0.5137 20.97%
Harley-Davidson Enjoy Ride Oval Embossed Tin Sign, 18 1.0000 0.4875 0.5125 24.19%
Quiksilver Men’s Long Sleeve Rashguard UPF 50 Sun Protection Surf Shirt 0.8611 0.3492 0.5119 16.13%
Eternal Nail Polish Set (BEACH WALK) 0.9788 0.4681 0.5107 20.97%
CINRA 20PCS Disposable Tattoo Tubes Grips Mixed Sizes 0.8786 0.3683 0.5103 27.42%
Dotty the Dalmatian plush mascot costume men’s 1.0000 0.4950 0.5050 29.03%
Levi’s Women’s High Waisted Mom Shorts 1.0000 0.4954 0.5046 14.52%
HDMI to VGA DVI HDMI adapter 4 in 1 video converter for laptop monitor projector 0.7962 0.2922 0.5040 32.26%
Nike Dri-Fit Men’s Half Zip Golf Top 0.8553 0.3514 0.5039 17.74%
MixMatchy Women’s Striped Print Ribbed Knit Crop 1.0000 0.4976 0.5024 12.90%
Dr. Martens Women’s Shriver Hi Fashion Boot chunky sole 0.8888 0.3906 0.4982 25.81%
Thankful Deluxe Fall Party Dinnerware Bundle with Dinner Plates, Dessert Plates, and Large Napkins 0.9658 0.4710 0.4948 19.35%
Summer Clean Rinse Baby Bather (Gray) 0.9713 0.4771 0.4942 25.81%
large black hair claw clips for thick hair women girls 0.7120 0.2186 0.4934 27.42%
green camouflage RFID trifold canvas wallet for men with mini coin purse and front pocket for kids 0.8033 0.3148 0.4885 32.26%
Find RAINBEAN foldable laptop table for bed with storage space and bamboo wood grain 0.9647 0.4786 0.4861 19.35%
short black cosplay wig for men 0.8409 0.3579 0.4830 20.97%
Detroit Axle - 4WD 8-Lug Front Wheel 0.8406 0.3578 0.4828 24.19%
Apple Lime Green Deco Mesh Ribbon 10 inch for Christmas Wreath and Spring Decorations 0.9260 0.4444 0.4816 24.19%
RJ-Sport No-Tie, Elastic Shoelaces, 1.0000 0.5212 0.4788 22.58%
hypoallergenic baby bottle dishwashing liquid fragrance free 0.9267 0.4495 0.4772 24.19%
AUGYMER stainless steel serrated bread knife 7.9 inches 1.0000 0.5245 0.4755 20.97%
Kaenon Men’s Polarized Sunglasses Burnet Full Coverage Matte Tortoise 1.0000 0.5278 0.4722 20.97%
2-pack aluminum no firearms guns or weapons allowed sign 10x7 waterproof 0.9922 0.5203 0.4719 25.81%
LED Candles, Ymenow Warm 0.6399 0.1688 0.4711 37.10%
Ralph Lauren Meadow Lane Kaley King Comforter Blue Multi 0.9902 0.5192 0.4710 19.35%
Portable RGB Gaming Mechanical Keyboard USB Type 0.8618 0.3913 0.4705 17.74%
Coppertone Defend & Care Sensitive Skin Sunscreen Lotion Broad 0.8670 0.4012 0.4658 24.19%
Nike Kids’ Grade School Zoom Air Pegasus 0.9933 0.5282 0.4651 27.42%
Blue rectangular plastic tablecloth 54x108 inch for parties and outdoor 0.9292 0.4659 0.4633 14.52%
Talking Products, Talking Tile 0.6971 0.2342 0.4629 51.61%
Double sided rug tape for laminate flooring 2 inch 30 yards 0.9273 0.4690 0.4583 19.35%
Nordic Ware Platinum Collection Anniversary Bundtlette 0.8968 0.4404 0.4564 30.65%
glass essential oil diffuser ultrasonic aromatherapy humidifier wood natural 0.9373 0.4842 0.4531 20.97%
purple ergonomic document copy holder with side arm 0.8840 0.4311 0.4529 25.81%
iPhone 6 Plus replacement touch screen LCD panel with repair tools 0.7864 0.3340 0.4524 17.74%
TRUE LINE Automotive Universal 7 Inch 0.4998 0.0486 0.4512 77.42%
Kiss Tweetheart False Nails with accents and super hold adhesive 1.0000 0.5496 0.4504 19.35%
Mayton wooden bunkie board slats Twin XL Beige 1.0000 0.5506 0.4494 29.03%
16 oz clear plastic cups with flat lids set of 50 0.9556 0.5091 0.4465 11.29%
Benefit Boi ing Hydrating Concealer Light Medium 1.0000 0.5560 0.4440 27.42%
Google Wi-Fi System Mesh 1.0000 0.5565 0.4435 16.13%
YONEX AC1025P Tennis Badminton Grip 0.4920 0.0488 0.4432 70.97%
ASICS Women’s Gel-Venture 7 trail running shoes 1.0000 0.5592 0.4408 32.26%
Blushing Cherry Blossom Fragrance Oil (60ml) For 0.8977 0.4585 0.4392 19.35%
Ateco Pastry Leaf Tube Set White 1.0000 0.5610 0.4390 24.19%
Garmin ECHOMAP Plus 73cv Ice 0.9540 0.5151 0.4390 24.19%
7x5ft Black and Gold Balloons Backdrop for Birthday Party Photography 0.7982 0.3599 0.4383 22.58%
5 Pcs Colorful Mini Silicone Whisks Stainless Steel Non Stick for Cooking Baking 0.9337 0.4956 0.4381 17.74%
Custom Self Inking Rubber Stamp 4 Lines Extra Ink Pad A1848 0.8462 0.4123 0.4339 20.97%
Organic baby toddler short sleeve tight fit pajamas Lamaze 0.7029 0.2693 0.4336 29.03%
Table 7: Per-query NDCG@10 scores and variability on the first 100 queries. For each query, we report the max, the
mean, the gap between max and mean, and the percentage of systems with an NDCG@10 of 0, denoting a completely
irrelevant retrieved set.
Query   Max   Mean   Gap (Max vs Mean)   Systems with NDCG@10 = 0 (%)
Microsoft Surface Pro Signature Type Cover Platinum FFP-00141 for Surface Pro 7 0.9721 0.5385 0.4336 17.74%
Kids sandwich cutter and sealer set with star, heart, and circle shapes for lunchbox and bento box 0.9181 0.4863 0.4318 22.58%
ergonomic stool adjustable height footrest ring parquet wheels 0.6176 0.1858 0.4318 24.19%
Onvian Bike Alarm with Remote, Wireless 1.0000 0.5709 0.4291 22.58%
acrylic lollipop holder cake pop stand with sticks, bags, and twist ties 0.8547 0.4272 0.4275 16.13%
90 degree right angled Micro USB 2.0 Male to Female extension cable 50cm for tablet phone 0.9359 0.5108 0.4251 14.52%
Goodman 4 Ton 14 0.7710 0.3463 0.4247 24.19%
OTTERBOX COMMUTER SERIES iPhone 11 Pro case with PopSockets PopGrip 0.8303 0.4067 0.4236 29.03%
Probrico flat black cabinet pulls 3 inch hole centers T bar handle kitchen dresser 10 pack 0.8018 0.3829 0.4190 22.58%
Madagascar Bourbon Planifolia Grade A Vanilla Beans 5-6 inches 0.8182 0.4013 0.4169 27.42%
Nerf Mega Accustrike Dart Refill Combat Blaster 0.8542 0.4395 0.4147 17.74%
100% Cotton Throw Blanket for Couch Sofa Bed Outdoors Hypoallergenic 83"x70" Brown 0.8888 0.4753 0.4135 8.06%
Zoostliss Crimping Tool for Coaxial Cable RG6 RG59 with Blue F Connector 0.9872 0.5770 0.4102 17.74%
adidas ladies climalite short-sleeve polo 0.8080 0.3980 0.4100 16.13%
Hofdeco nautical indoor outdoor pillow cover navy blue compass anchor 12x20 set of 2 0.8266 0.4170 0.4096 20.97%
Hurley Men’s Printed Backpack Light Carbon 0.9072 0.4995 0.4077 27.42%
Giorgio Armani Code Colonia men’s Eau de Toilette Spray 2.5 oz 0.9571 0.5504 0.4067 20.97%
Extra Large Moist Heating Pad 1.0000 0.5952 0.4048 20.97%
HOBO Vintage Euro Slide Credit Card Holder Wallet 0.7783 0.3753 0.4030 20.97%
Gueray Portable CD Player, 0.9758 0.5729 0.4029 16.13%
HDMI to RCA converter for TV, Roku, Fire Stick, DVD, Blu-ray player 1.0000 0.5973 0.4027 12.90%
Ombre highlight short bob human hair wig brown to blonde Brazilian straight remy for black women 0.9140 0.5142 0.3998 12.90%
Muga Black Felt Letter Board 12x12 inch with 485 Precut Letters and Stand 0.7829 0.3846 0.3983 19.35%
Deflecto 5-bin horizontal tilt bin storage system black 0.9789 0.5811 0.3978 17.74%
Iridescent Metal Lighter Case for BIC Lighters, Lighter 0.9098 0.5131 0.3967 20.97%
Jill & Joey Maternity Belt Belly Band Medium Beige Pregnancy Support 0.8930 0.4980 0.3950 20.97%
Calvin Klein Women’s Invisibles Hipster Panty 1.0000 0.6086 0.3914 19.35%
KOKUYO Campus Notebook B5 Dotted A-Ruled 5-Colors 6 Pack 0.9047 0.5154 0.3893 24.19%
magnetic dart board for kids boys girls gifts 1.0000 0.6108 0.3892 22.58%
Rokinon Cine DS Lens Kit Micro Four Thirds 16mm 35mm 50mm 85mm 0.9597 0.5726 0.3871 22.58%
Zealer 1800pcs Crystals AB Nail Art Rhinestones Flatback Glass Charms for Nails Decoration Eye Makeup Clothes Shoes Mix Sizes 0.7793 0.3936 0.3857 24.19%
Perricone MD Neuropeptide Facial Conformer 0.8735 0.4879 0.3856 22.58%
Soda Pop Can Covers Made in USA BPA-Free Retains Fizz 0.9023 0.5233 0.3790 17.74%
Baby Toddler Girls Long Sleeve 1.0000 0.6216 0.3784 9.68%
Blue pet grooming gloves for hair removal and massage 0.9742 0.5976 0.3766 19.35%
AnnTec LED Candle Light, LED Glass, 0.4367 0.0602 0.3765 67.74%
Osprey Daylite Shoulder Sling for daily essentials and quick hikes 0.8261 0.4508 0.3753 12.90%
Tminnov Baby Diaper Caddy 0.8567 0.4832 0.3735 12.90%
NFL Women’s OTS Fleece Hoodie 0.9003 0.5308 0.3695 17.74%
39 watt halogen PAR20 clear medium base bulb 0.7402 0.3713 0.3689 17.74%
Daiwa Liberty Club Short Swing 0.3904 0.0224 0.3680 91.94%
4 ton hydraulic low profile floor jack with dual piston quick lift pump 0.9062 0.5384 0.3678 20.97%
universal knife and tool sharpener with adjustable angle guides 0.6768 0.3095 0.3673 22.58%
OTC 4842 Heavy Duty Valve Spring Compressor 1.0000 0.6335 0.3665 19.35%
Spencer 50’/15M Logger Tape Refill 0.9121 0.5463 0.3658 19.35%
Toozey Dog Pooper Scooper, Upgraded Adjustable Long 1.0000 0.6383 0.3617 16.13%
Lovelyshop Blue Gems Rhinestone 0.4991 0.1385 0.3606 22.58%
sparkly rhinestone mesh face mask for women masquerade party glitter bling Christmas 0.9120 0.5517 0.3603 12.90%
Enameled Cast Iron Dutch Oven 6.5 quart Olive Green 0.9013 0.5429 0.3584 16.13%
Huffy 12V Battery Powered Ride On ATV for kids aged 2-5 0.7172 0.3600 0.3572 24.19%
artificial eucalyptus plant for wedding jungle theme party home decor 1.0000 0.6433 0.3567 19.35%
Klein Tools Mini Tube Cutter for copper and aluminum tubing 0.7522 0.3980 0.3542 25.81%
An Untamed Heart Red River of the North book 1.0000 0.6463 0.3537 24.19%
New American citizen mug US flag coffee cup gift 0.9053 0.5531 0.3522 12.90%
Ernie Ball Custom Gauge 11 Nickel Guitar String 6 Pack 0.8578 0.5087 0.3491 24.19%
Bobbi Brown Lip Color Rum Raisin 3.4g 1.0000 0.6513 0.3487 19.35%
21x21 beige sun shade sail patio UV protection outdoor backyard 0.7790 0.4350 0.3440 12.90%
Carhartt Men’s Heavyweight Short-Sleeve Pocket T-Shirt 1.0000 0.6572 0.3428 16.13%
Hokkaido Deer Lotion CICA Ceramide Vitamin C 5.1 fl oz 0.4177 0.0752 0.3425 56.45%
Rise Pea Protein Bar, 0.9030 0.5638 0.3393 19.35%
Hunter Watson Indoor ceiling fan with LED light and pull chain control, New Bronze finish 0.9504 0.6115 0.3389 16.13%
Belkin QODE Ultimate Pro Keyboard Case for iPad Air 2 White 0.9291 0.5909 0.3382 17.74%
The Fifth Agreement book self-mastery Toltec Wisdom 0.9427 0.6060 0.3367 14.52%
Echo Dot Wall Mount Holder Black AhaStyle 0.9878 0.6540 0.3338 4.84%
Soft Spiked Light Up Bracelets with Flashing Blinking LED Lights 0.7561 0.4228 0.3333 14.52%
migraine relief hat hot cold therapy reusable MarkGifts 0.9336 0.6005 0.3331 20.97%
cordless portable blender rechargeable battery smoothie mixer travel 1.0000 0.6693 0.3307 19.35%
LYXOTO Baby Hair Clips set of 10 bow knot cute stylish birthday gift 0.7265 0.3989 0.3276 11.29%
Plaskidy plastic forks for kids set of 16 BPA free dishwasher safe 0.7287 0.4106 0.3181 12.90%
Nantucket Neighbors (Nantucket Beach Plum Cove Book 2) 1.0000 0.6838 0.3162 16.13%
Physicians Formula Highlighter Makeup Powder Mineral Glow Pearls, Light Bronze 0.9623 0.6466 0.3157 20.97%
Organic Kabuli Chana 4 lb White 0.7598 0.4503 0.3095 20.97%
PEARL IZUMI Elite Thermal Arm Warmer for cool weather cycling with water-shedding fleece and unisex sizing 0.9928 0.6874 0.3054 19.35%
GE 6-Device Backlit Universal Remote Control for Samsung, Vizio, LG, Sony, Sharp, Roku, Apple TV, Blu-Ray, DVD, Master Volume Control, Rose Gold 1.0000 0.6963 0.3037 19.35%
SPLOTY Tire Inflator Air Compressor Portable 12V 1.0000 0.6979 0.3021 19.35%
VAKA Luminous Light Up Quad Roller Skate 0.6643 0.3624 0.3019 19.35%
Black+Decker 5-Cup Coffeemaker Dust Proof Cover 0.6907 0.3924 0.2983 20.97%
Celestial Seasonings Honey Vanilla Chamomile Herbal Tea 20 Count 0.9059 0.6086 0.2973 17.74%
large washable donut dog bed with calming cuddler and head support 0.8315 0.5356 0.2959 12.90%
professional hair cutting scissors extremely sharp blades 6 inch barber scissors set for men and women Fagaci 0.7568 0.4615 0.2953 17.74%
Stoplight Sleep Enhancing Alarm Clock for Kids Train Car Clock 0.7001 0.4072 0.2929 17.74%
McCormick Coconut Extract 2 fl oz gluten-free non-GMO 0.6143 0.3433 0.2710 19.35%
XR Extinction Rebellion Rebel For Life T-Shirt 0.9449 0.6863 0.2586 11.29%
Dogfish 500GB Msata Internal SSD 0.2658 0.0126 0.2532 91.94%
Car glass coating agent with shampoo conditioner and water repellent for glossy finish 0.2038 0.0398 0.1640 53.23%
Table 8: Per-query NDCG@10 scores and variability on the remaining 82 queries. For each query, we report the max,
the mean, the gap between max and mean, and the percentage of systems with an NDCG@10 of 0, denoting a completely
irrelevant retrieved set.
Run Collection Type Zero Shot Fine-tune Impact
all-MiniLM-L12-v2 Simple 0.4952 0.4763 -0.0189
all-MiniLM-L12-v2 Metadata 0.5328 0.4681 -0.0647
all-miniLM-L6-v2 Simple 0.5895 0.4591 -0.1303
all-miniLM-L6-v2 Metadata 0.5937 0.4673 -0.1264
all-mpnet-base-v2 Simple 0.5895 0.4090 -0.1804
all-mpnet-base-v2 Metadata 0.5937 0.4089 -0.1848
bge-base-en Simple 0.2948 0.4064 0.1116
bge-base-en Metadata 0.3396 0.4237 0.0841
bge-large-en Metadata 0.0323 0.0021 -0.0302
bge-small-en Simple 0.3680 0.4379 0.0699
e5-large-v2 Simple 0.4998 0.0015 -0.4983
e5-base-v2 Metadata 0.0936 0.4242 0.3306
e5-small-v2 Metadata 0.0142 0.5119 0.4977
gte-base Simple 0.1493 0.4777 0.3284
gte-base Metadata 0.0604 0.5009 0.4405
gte-large Simple 0.3930 0.0015 -0.3915
gte-large Metadata 0.4501 0.0011 -0.4490
gte-small Simple 0.6612 0.4901 -0.1711
gte-small Metadata 0.6647 0.5168 -0.1479
bert-base Metadata 0.4165 0.0352 -0.3813
Table 9: Impact on NDCG@10 of fine-tuning across baseline runs.
Run Collection Type Zero Shot Fine-tune Impact
all-MiniLM-L12-v2 Simple 0.6099 0.5351 -0.0747
all-MiniLM-L12-v2 Metadata 0.6415 0.5581 -0.0833
all-miniLM-L6-v2 Simple 0.6348 0.5429 -0.0920
all-miniLM-L6-v2 Metadata 0.6512 0.5528 -0.0984
all-mpnet-base-v2 Simple 0.6348 0.4747 -0.1602
all-mpnet-base-v2 Metadata 0.6512 0.5006 -0.1506
bge-base-en Simple 0.2664 0.4728 0.2064
bge-base-en Metadata 0.4301 0.4955 0.0654
bge-large-en Metadata 0.0314 0.0000 -0.0314
bge-small-en Simple 0.4188 0.5080 0.0893
e5-large-v2 Simple 0.5537 0.0000 -0.5537
e5-base-v2 Metadata 0.1021 0.5148 0.4127
e5-small-v2 Metadata 0.0126 0.6096 0.5970
gte-base Simple 0.0965 0.5676 0.4711
gte-base Metadata 0.0940 0.5895 0.4956
gte-large Simple 0.4294 0.0001 -0.4293
gte-large Metadata 0.4978 0.0000 -0.4978
gte-small Simple 0.7372 0.5692 -0.1679
gte-small Metadata 0.7416 0.5859 -0.1557
bert-base Metadata 0.5208 0.0374 -0.4834
Table 10: Impact on Recall@100 of fine-tuning across baseline runs.
Run NDCG @10 (All) NDCG@10 (Short) NDCG@10 (Long)
f_splade_bm25 0.7505 0.7407 0.7578
f_splade_clip_bm25 0.7327 0.7078 0.7511
cfdaclip_ER_B 0.7256 0.7119 0.7358
cfdaclip_MR_B 0.7251 0.7108 0.7357
cfdaclip_MR_A 0.7257 0.7122 0.7357
cfdaclip_ER_A 0.7252 0.7111 0.7356
JBNU-C 0.7251 0.7130 0.7341
JBNU-1 0.6531 0.6060 0.6878
simple-gte-small-zero-shot 0.6612 0.6253 0.6877
metadata-enhanced-gte-small-zero-shot 0.6647 0.6343 0.6872
JBNU-2 0.6583 0.6238 0.6838
BM25-pyserini-simple-collection 0.6540 0.6184 0.6803
BM25-pyserini-metadata-collection 0.6408 0.6043 0.6643
f_gpt_rerank 0.6225 0.6162 0.6273
JBNU-A 0.5989 0.5616 0.6265
metadata-enhanced-all-mpnet-base-v2-zero-shot 0.5937 0.5586 0.6197
JBNU-B 0.5763 0.5294 0.6110
metadata-enhanced-all-MiniLM-L12-v2-zero-shot 0.5660 0.5072 0.6093
r_gpt3d5_turbo 0.5950 0.5766 0.6087
simple-all-mpnet-base-v2-zero-shot 0.5895 0.5687 0.6048
simple-all-MiniLM-L12-v2-zero-shot 0.5288 0.4721 0.5707
metadata-enhanced-trec-product-search-gte-small 0.5168 0.4457 0.5693
metadata-enhanced-all-MiniLM-L6-v2-zero-shot 0.5328 0.4938 0.5616
simple-e5-large-zero-shot 0.4998 0.4230 0.5565
metadata-enhanced-trec-product-search-e5-small-v2 0.5119 0.4583 0.5514
metadata-enhanced-trec-product-search-gte-base 0.5009 0.4471 0.5407
simple-trec-product-search-gte-small 0.4901 0.4535 0.5171
metadata-enhanced-trec-product-search-bge-small-en 0.4721 0.4192 0.5112
simple-all-MiniLM-L6-v2-zero-shot 0.4952 0.4746 0.5104
metadata-trec-product-search-all-miniLM-L12-v2 0.4681 0.4193 0.5042
simple-trec-product-search-gte-base 0.4777 0.4430 0.5032
metadata-trec-product-search-all-miniLM-L6-v2 0.4673 0.4427 0.4856
simple-trec-product-search-all-miniLM-L12-v2 0.4763 0.4650 0.4847
simple-trec-product-search-all-miniLM-L6-v2 0.4591 0.4444 0.4700
metadata-enhanced-trec-product-search-e5-base-v2 0.4242 0.3698 0.4644
metadata-enhanced-gte-large-zero-shot 0.4501 0.4361 0.4604
metadata-enhanced-trec-product-search-bge-base-en 0.4237 0.3847 0.4525
simple-trec-product-search-bge-small-en 0.4379 0.4267 0.4462
search-dpr-bert-base 0.4272 0.4029 0.4452
metadata-enhanced-trec-product-search-dpr-bert 0.4165 0.3805 0.4430
metadata-trec-product-search-all-mpnet-base-v2 0.4089 0.3784 0.4315
simple-trec-product-search-all-mpnet-base-v2 0.4090 0.3851 0.4267
simple-trec-product-search-bge-base-en 0.4064 0.3932 0.4161
simple-gte-large-zero-shot 0.3930 0.3647 0.4140
simple-bge-small-zero-shot 0.3680 0.3378 0.3903
metadata-enhanced-bge-base-en-zero-shot 0.3396 0.2902 0.3761
simple-bge-base-zero-shot 0.2948 0.2923 0.2967
simple-gte-base-zero-shot 0.1493 0.1696 0.1342
simple-bge-large-zero-shot 0.1486 0.1805 0.1250
simple-e5-base-zero-shot 0.1168 0.1350 0.1033
metadata-enhanced-e5-base-v2-zero-shot 0.0936 0.1127 0.0795
metadata-enhanced-gte-base-zero-shot 0.0604 0.0580 0.0621
simple-bert-base-uncased-zero-shot 0.0352 0.0417 0.0305
simple-contriever-base-zero-shot 0.0159 0.0044 0.0244
metadata-enhanced-e5-small-v2-zero-shot 0.0142 0.0046 0.0212
metadata-enhanced-bge-large-en-zero-shot 0.0323 0.0495 0.0197
simple-e5-small-zero-shot 0.0113 0.0046 0.0163
metadata-enhanced-contriever-base-msmarco 0.0081 0.0032 0.0116
metadata-enhanced-trec-product-search-bge-large-en 0.0021 0.0024 0.0019
simple-trec-product-search-gte-large 0.0015 0.0022 0.0010
metadata-enhanced-trec-product-search-gte-large 0.0011 0.0011 0.0010
metadata-enhanced-trec-product-search-e5-large-v2 0.0015 0.0023 0.0009
Table 11: Mean NDCG@10 of each system on the full set of judged queries (182), short queries (81), and long
queries (101).