RRF - Reciprocal Rank Fusion

A ranking algorithm for retrieving relevant documents by combining sparse and dense embeddings in RAG applications.
RRF
ranking
retrieval
Author

Shataxi Dubey

Published

April 26, 2026

import numpy as np
import pandas as pd
from openai import OpenAI
from dotenv import load_dotenv
from fastembed import SparseTextEmbedding
from transformers import AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity

load_dotenv()
True
tokenizer = AutoTokenizer.from_pretrained("prithivida/Splade_PP_en_v1")  

sparse_embedding_model=SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")

RRF (Reciprocal Rank Fusion) is a retrieval strategy that combines the results of dense-embedding and sparse-embedding retrieval. Each document is scored purely by its position in each ranked result list, summing 1/(k + rank) across the lists (commonly with k = 60). The original similarity scores never enter the fused score; only the ranks matter.
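The fusion step itself can be sketched as a small stand-alone function. The document ids and ranked lists below are illustrative; k = 60 is the smoothing constant commonly used in practice:

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF).
# Assumption: each retriever returns a list of document ids ordered
# from most relevant to least relevant.
def rrf_fuse(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1/(k + rank) for the document
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Return ids sorted by fused score, highest first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical dense and sparse rankings over three documents
print(rrf_fuse([["a", "b", "c"], ["b", "c", "a"]]))
```

A document ranked near the top of either list accumulates a larger score, so documents that both retrievers agree on rise to the top of the fused ranking.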

query = "what is the reason for inflation in the current times?"

doc_1= """Housing has been the primary driver of inflation from March 2025 through March 2026, accounting for roughly half of the overall 3.3% annual price increase. Within the Consumer Price Index, shelter-related costs alone contributed 1.5 percentage points to that figure. 
 Mortgage rates hovering near 6% continue to price out first-time buyers, while tight inventory fuels bidding wars on entry-level homes, keeping housing costs persistently elevated. 
"""

doc_2="""
During 2021 and 2022, COVID-19 lockdowns disrupted manufacturing globally, while Russia's invasion of Ukraine sent oil and grain prices soaring. By 2025, however, global shipping costs had declined, goods began moving efficiently again, and core goods inflation largely flattened. 
NCHStats
 In 2022, annual price growth had surged to eight percent — driven first by pandemic disruptions and later by commodity market turmoil — prompting the Federal Reserve to initiate a series of interest rate hikes to bring price growth back under control. 
"""

doc_3="""Among the core drivers of inflation in 2026 are the lagged effects of tariffs, an expanding fiscal deficit potentially exceeding 7% of GDP, a tighter labor market stemming from shifts in immigration policy, and monetary conditions that are looser than commonly appreciated. 
 Economists note that companies have now depleted the inventories they stockpiled ahead of tariff implementation, and while CEOs initially avoided sharp one-time price increases, they have been raising prices in smaller increments over a longer period. """

doc_4="""Food prices rose 3.1% through 2025, with food away from home climbing 4.1%. Energy costs also increased 2.3%, driven by a 10.8% jump in utility gas prices and a 6.7% rise in electricity, even as gasoline prices continued their multi-year decline. 
 Food inflation has cooled compared to 2022 peaks, though certain items like eggs remain volatile due to supply disruptions such as avian flu outbreaks, and these categories could flare up again if new shocks occur."""

doc_5="""Global core inflation is projected to remain stable at around 2.8% in 2026, though regional divergences are increasingly coming to the fore — with inflation expected to accelerate in the U.S. while moderating in Europe, partly driven by notable currency moves and shifting goods price pressures. 
 The labor market remains resilient, with global unemployment projected to hold near 4.9%, and while wage growth has slowed from its 2022–2023 pace, upward wage pressure in migrant-dependent sectors continues to feed into services inflation."""
df = pd.DataFrame(np.zeros((5,2)), columns = ["dense", "sparse"])
df
dense sparse
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
client = OpenAI()
resp = client.embeddings.create(
            model='text-embedding-3-small',
            input=[query, doc_1, doc_2, doc_3, doc_4, doc_5],
        )
query_embed = resp.data[0].embedding
doc_1_embed = resp.data[1].embedding
doc_2_embed = resp.data[2].embedding
doc_3_embed = resp.data[3].embedding
doc_4_embed = resp.data[4].embedding
doc_5_embed = resp.data[5].embedding
print("=========OpenAI Raw Text Similarity========")

a = np.array(query_embed).reshape(1, -1)        
docs_embed = [doc_1_embed, doc_2_embed, doc_3_embed, doc_4_embed, doc_5_embed]

for i in range(5):
    b = np.array(docs_embed[i]).reshape(1, -1)
    similarity = cosine_similarity(a, b)[0][0]   
    df.loc[i, "dense"] = similarity
    print(f"cosine similarity query with doc_{i+1}: {similarity}")
=========OpenAI Raw Text Similarity========
cosine similarity query with doc_1: 0.4758122599555489
cosine similarity query with doc_2: 0.43875027996847515
cosine similarity query with doc_3: 0.49595345989719
cosine similarity query with doc_4: 0.3906074219882445
cosine similarity query with doc_5: 0.4352872059064374
df
dense sparse
0 0.475812 0.0
1 0.438750 0.0
2 0.495953 0.0
3 0.390607 0.0
4 0.435287 0.0
def sparse_cosine_similarity(indices_a, values_a, indices_b, values_b):
    vec_a = dict(zip(indices_a, values_a))
    vec_b = dict(zip(indices_b, values_b))

    # Dot product only over the token indices both sparse vectors share
    common_indices = set(vec_a.keys()) & set(vec_b.keys())
    print("=====common words=====")
    common_tokens = [tokenizer.convert_ids_to_tokens(int(idx)) for idx in common_indices]
    print(common_tokens)
    dot = sum(vec_a[i] * vec_b[i] for i in common_indices)

    # Magnitudes of each sparse vector
    mag_a = sum(v ** 2 for v in vec_a.values()) ** 0.5
    mag_b = sum(v ** 2 for v in vec_b.values()) ** 0.5

    if mag_a == 0 or mag_b == 0:
        return 0.0

    return dot / (mag_a * mag_b)
results = list(sparse_embedding_model.embed([query, doc_1, doc_2, doc_3, doc_4, doc_5]))    
query_sparse=results[0]
doc_1_sparse=results[1]
doc_2_sparse=results[2]
doc_3_sparse=results[3]
doc_4_sparse=results[4]
doc_5_sparse=results[5]
docs_sparse = [doc_1_sparse, doc_2_sparse, doc_3_sparse, doc_4_sparse, doc_5_sparse]

for i in range(5):
    similarity = sparse_cosine_similarity(query_sparse.indices, query_sparse.values, docs_sparse[i].indices, docs_sparse[i].values)
    df.loc[i, "sparse"] = similarity
    print(f"Keyword similarity between query and doc_{i+1}:",similarity)
=====common words=====
['conflict', 'deficit', 'increase', 'motivation', 'rate', 'inflation', 'index']
Keyword similarity between query and doc_1: 0.22604841315833812
=====common words=====
['conflict', 'cause', 'economic', 'economics', 'increase', 'reason', 'rate', 'inflation', 'because', 'problem', 'index']
Keyword similarity between query and doc_2: 0.14548275063493354
=====common words=====
['correlation', 'policy', 'motivation', 'times', 'increase', 'economics', 'reason', 'index', 'causes', 'because', 'problem', 'cause', 'economic', 'deficit', 'phenomenon', 'null', 'rate', 'inflation', 'currency']
Keyword similarity between query and doc_3: 0.26174080943580597
=====common words=====
['conflict', 'cause', 'economic', 'deficit', 'economics', 'increase', 'rate', 'inflation', 'because', 'problem', 'index']
Keyword similarity between query and doc_4: 0.1859540064692551
=====common words=====
['currency', 'economic', 'economics', 'increase', 'motivation', 'rate', 'inflation', 'because', 'index', 'current']
Keyword similarity between query and doc_5: 0.22267342240390803
df
dense sparse
0 0.475812 0.226048
1 0.438750 0.145483
2 0.495953 0.261741
3 0.390607 0.185954
4 0.435287 0.222673
df = df.sort_values(by = ["dense"], ascending = False)
df
dense sparse
2 0.495953 0.261741
0 0.475812 0.226048
1 0.438750 0.145483
4 0.435287 0.222673
3 0.390607 0.185954
df['dense_rank'] = range(1, 6)
df
dense sparse dense_rank
2 0.495953 0.261741 1
0 0.475812 0.226048 2
1 0.438750 0.145483 3
4 0.435287 0.222673 4
3 0.390607 0.185954 5
df = df.sort_values(by = ["sparse"], ascending = False)
df['sparse_rank'] = range(1, 6)
df
dense sparse dense_rank sparse_rank
2 0.495953 0.261741 1 1
0 0.475812 0.226048 2 2
4 0.435287 0.222673 4 3
3 0.390607 0.185954 5 4
1 0.438750 0.145483 3 5
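The sort-then-assign pattern above can also be expressed with pandas' built-in `rank`, which avoids reordering the DataFrame. This sketch reuses the similarity scores from the table above:

```python
import pandas as pd

# Dense and sparse similarity scores copied from the tables above
df = pd.DataFrame({
    "dense": [0.475812, 0.438750, 0.495953, 0.390607, 0.435287],
    "sparse": [0.226048, 0.145483, 0.261741, 0.185954, 0.222673],
})

# rank(ascending=False) assigns rank 1 to the highest score; no sorting needed
df["dense_rank"] = df["dense"].rank(ascending=False).astype(int)
df["sparse_rank"] = df["sparse"].rank(ascending=False).astype(int)
print(df)
```

Because the rows stay in their original order, each document's row index keeps matching its document number throughout.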
# RRF score with the standard smoothing constant k = 60.
# df.loc[i] indexes by label, which works here because the
# index labels are still 0-4 after sorting.
df['rrf_score'] = np.zeros(5)
for i in range(5):
    df.loc[i, "rrf_score"] = 1/(60 + df.loc[i, "dense_rank"]) + 1/(60 + df.loc[i, "sparse_rank"])
df = df.sort_values(by=["rrf_score"], ascending = False)
df
dense sparse dense_rank sparse_rank rrf_score
2 0.495953 0.261741 1 1 0.032787
0 0.475812 0.226048 2 2 0.032258
4 0.435287 0.222673 4 3 0.031498
1 0.438750 0.145483 3 5 0.031258
3 0.390607 0.185954 5 4 0.031010
df["rrf_rank"] = range(1, 6)
df
dense sparse dense_rank sparse_rank rrf_score rrf_rank
2 0.495953 0.261741 1 1 0.032787 1
0 0.475812 0.226048 2 2 0.032258 2
4 0.435287 0.222673 4 3 0.031498 3
1 0.438750 0.145483 3 5 0.031258 4
3 0.390607 0.185954 5 4 0.031010 5
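The scoring loop can also be written as a single vectorized expression over the rank columns. This sketch reuses the rank values from the table above, with the same k = 60 constant:

```python
import pandas as pd

# dense_rank and sparse_rank values copied from the table above
df = pd.DataFrame({
    "dense_rank": [1, 2, 4, 3, 5],
    "sparse_rank": [1, 2, 3, 5, 4],
})

k = 60
# Vectorized RRF: pandas applies 1/(k + rank) element-wise to each column
df["rrf_score"] = 1 / (k + df["dense_rank"]) + 1 / (k + df["sparse_rank"])
print(df)
```

The result matches the loop version row for row; the first row (rank 1 in both lists) scores 2/61 ≈ 0.032787, as in the table above.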

Based on the fused RRF ranking, the documents are ordered by relevance to the query as Doc_3 > Doc_1 > Doc_5 > Doc_2 > Doc_4.