SPLADE

A neural-network-based sparse embedding generator, well suited to keyword matching in RAG.
SPLADE
sparse embedding
retrieval
keyword matching
Author

Shataxi Dubey

Published

April 26, 2026

from fastembed import SparseTextEmbedding
from transformers import AutoTokenizer

There are multiple SPLADE models:

model_id="naver/splade-v3" is gated and requires requesting access on Hugging Face.

model_id="naver/splade-cocondenser-selfdistil" is one of the strongest SPLADE checkpoints.

Here, we will use model_id="prithivida/Splade_PP_en_v1", which is openly available.

tokenizer = AutoTokenizer.from_pretrained("prithivida/Splade_PP_en_v1")  

sparse_embedding_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
def sparse_cosine_similarity(indices_a, values_a, indices_b, values_b):
    # Represent each sparse vector as a {token_id: weight} dict
    vec_a = dict(zip(indices_a, values_a))
    vec_b = dict(zip(indices_b, values_b))

    # Dot product over the indices present in both vectors
    common_indices = set(vec_a.keys()) & set(vec_b.keys())
    print("=====common words=====")
    common_tokens = [
        tokenizer.convert_ids_to_tokens(int(idx)) for idx in common_indices
    ]
    print(common_tokens)
    dot = sum(vec_a[i] * vec_b[i] for i in common_indices)

    # L2 magnitudes
    mag_a = sum(v ** 2 for v in vec_a.values()) ** 0.5
    mag_b = sum(v ** 2 for v in vec_b.values()) ** 0.5

    if mag_a == 0 or mag_b == 0:
        return 0.0

    return dot / (mag_a * mag_b)
# jd_req_skills = '''
# ● Classroom Management ● Conflict Resolution ● Student Engagement and Motivation ● Lesson Plan Execution ● Multitasking and Organization ● Compassion and Patience ● Creative Problem Solving ● Proficiency with Microsoft Office and basic educational tools ● Strong communication and interpersonal skills ● Ability to work with students from diverse backgrounds
# '''

# candidate_skills='''Technical Skills ● Cloud & Devops ○ Kubernetes ○ EKS, Minikube, K3s ○ ArgoCD & GitOps ○ Helm Charts ○ CI-CD Pipelines | Gitlab CI | Github Actions ○ IaC - Terraform, Cloudformation ○ AWS Cloud Services - Solutions Architect Associate (AWS Certified) ○ System Design and Architecture ○ Decentralized Apps ○ Monitoring - Grafana, Loki, Prometheus ● Web3 and Blockchain ○ EVM chains | Ethereum | Polygon ○ Solidity Smart Contracts | Hardhat ○ SSI (Self Sovereign Identity) - Verifiable Credentials, VPs, OpenID and w3c standards ○ DID protocols, Cryptography, Privado iD ○ Selective Disclosure and Zero Knowledge Proofs ○ Private Chains | Hyperledger Besu ○ Chainlink, The Graph Protocol, Ocean Protocol, Uniswap, AAVE, more ○ OpenZeppelin Contracts | Defi | NFTs | DAOs ○ Upgradable Smart Contracts, Account Abstraction ○ Decentralized Storage | IPFS ○ NodeJS / Typescript - Restful APIs ○ Blockchain Backend Services ● AI and Development Tools ○ Claude Code (skills, agents, workflows) ○ Antigravity/Cursor, AI-assisted Development ○ Prompt Engineering Other Skills ● Project Management ● Solutions Architect ● System Design and Architecture'''

# jd_req_skills='''● Bachelor’s degree in Education or a related field ● Valid State Teaching License (Texas) ● PRAXIS II Certification'''
# candidate_skills = '''EDUCATION AND LICENSING Jun 2008 - Present PRAXIS II Certification Austin, Texas Sep 2005 - Jun 2008 B.A. in Education Austin, Texas'''

# jd_req_skills = "Certified Kubernetes Administrator"
# candidate_skills = "CKA"

# jd_req_skills = "machine learning engineer"
# candidate_skills = "Artificial intelligence engineer"

jd_req_skills = "Experience with Kubernetes orchestration"
candidate_skills = "Managed AWS infrastructure, deployed 50+ microservices"

results = list(sparse_embedding_model.embed([jd_req_skills, candidate_skills]))                                                                                                                                    
jd_sparse = results[0]
candidate_sparse = results[1]

print(jd_sparse.indices)
print(jd_sparse.values)
[ 1022  2007  2063  2189  2229  2299  2316  2514  2614  2638  2836  3004
  3153  3194  3315  3325  3370  3466  3508  3850  4032  4164  4378  4543
  5281  5677  6028  6189  6322  6512  6602  6907  7159  7241  7589  7603
 13970]
[6.67957842e-01 5.02769470e-01 9.84465554e-02 1.07767475e+00
 1.16471696e+00 2.91791826e-01 4.91769649e-02 4.98197109e-01
 2.42059282e-03 2.03349993e-01 2.53558844e-01 1.54326886e-01
 5.77651978e-01 1.90327749e-01 3.02954823e-01 2.55235314e+00
 2.13443618e-02 3.26343775e-01 1.06131005e+00 7.85874665e-01
 2.20954323e+00 2.67894026e-02 1.30836979e-01 6.78200126e-02
 1.05577260e-01 1.78682709e+00 2.49487787e-01 6.27155304e-01
 1.45783818e+00 4.34518978e-02 4.22243416e-01 1.12758875e-01
 2.12012386e+00 2.52186060e-01 2.14359894e-01 1.32606119e-01
 2.22890830e+00]

SPLADE expands each input into vocabulary terms related to its words, not just the words that literally appear. Since "orchestration" is lexically and semantically close to "orchestra", the model also activates music-related tokens; this is why words like "music" and "opera" show up in the JD expansion below.

# Map index → token for a sparse result
jd_tokens = [
    tokenizer.convert_ids_to_tokens(int(idx)) for idx in jd_sparse.indices
]
print('======JD required skills=====')
print(jd_tokens)
print(len(jd_tokens))              # number of activated vocabulary tokens
print(len(jd_req_skills.split()))  # number of words in the raw text

candidate_tokens = [
    tokenizer.convert_ids_to_tokens(int(idx)) for idx in candidate_sparse.indices
]

print('======Candidate skills=====')
print(candidate_tokens)
print(len(candidate_tokens))
print(len(candidate_skills.split()))
======JD required skills=====
['8', 'with', '##e', 'music', '##es', 'song', 'band', 'feel', 'sound', '##ne', 'performance', 'theatre', 'dance', 'engine', 'musical', 'experience', '##ation', 'effect', '##tion', 'opera', 'orchestra', 'concert', 'audience', 'composer', 'experienced', '##ber', 'technique', 'symphony', 'experiences', 'arrangement', 'instrument', 'genre', '##net', 'ensemble', 'conductor', 'emotion', 'ku']
37
4
======Candidate skills=====
['+', '##s', 'many', 'company', 'small', 'service', '2000', '50', '40', 'network', 'management', 'technology', 'systems', 'industry', 'key', '500', 'manager', 'managed', '45', '80', 'equipment', 'capacity', 'software', 'vehicle', 'brand', 'plus', 'fifty', 'strategy', 'manage', 'infrastructure', 'deployed', '##vic', '##ser', 'hardware', 'deployment', 'micro', 'deploy', 'aw']
38
6
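
The expansion is weighted, so we can rank the activated tokens by their SPLADE weight; for this JD text the heaviest tokens include "experience", "ku", "orchestra", and "##net". A minimal sketch over the jd_sparse result computed above:

# Pair each activated token with its SPLADE weight and sort descending
pairs = sorted(
    zip(jd_sparse.indices, jd_sparse.values),
    key=lambda p: p[1],
    reverse=True,
)
for idx, weight in pairs[:10]:
    print(f"{tokenizer.convert_ids_to_tokens(int(idx)):<15} {weight:.3f}")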

The cosine similarity computed here is always non-negative because SPLADE weights are non-negative: each vocabulary weight comes from a log(1 + ReLU(·)) activation pooled over the input tokens. A similarity close to 0 means the two texts activate (almost) disjoint vocabularies; a similarity close to 1 means they activate largely the same tokens with similar weights.

similarity = sparse_cosine_similarity(jd_sparse.indices, jd_sparse.values, candidate_sparse.indices, candidate_sparse.values)
print(similarity)
=====common words=====
[]
0.0
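
The 0.0 above reflects the empty token overlap: the JD and candidate texts activate disjoint vocabularies. As a sanity check that the function behaves as expected when vectors do share indices, here is a toy call with made-up token ids and weights (the ids are arbitrary, so the printed common words are just whatever vocabulary entries those ids map to):

# Toy sparse vectors sharing one index (2003); the values are made up.
# dot = 2.0 * 3.0 = 6.0; |a| = sqrt(5); |b| = sqrt(10) -> 6 / sqrt(50) ≈ 0.8485
sim = sparse_cosine_similarity([2001, 2003], [1.0, 2.0], [2003, 2005], [3.0, 1.0])
print(sim)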

Differences between various sparse embedding models

UniCOIL sparse embeddings (for comparison with SPLADE and BM25)

UniCOIL (a simplified variant of COIL, the COntextualized Inverted List model):

- Uses BERT to produce a CONTEXTUAL hidden state per token
- Projects each token's hidden state to a single scalar via Linear(hidden_size → 1) + ReLU
- Only tokens present in the input get a weight; NO vocabulary expansion

Where it sits between BM25 and SPLADE:

- BM25: exact tokens, statistical weights (TF×IDF), no neural network
- UniCOIL: exact tokens, CONTEXTUAL learned weights, neural (BERT backbone)
- SPLADE: expands to related vocab terms, contextual weights, neural (BERT backbone)

Example: "Kubernetes orchestration"

- BM25: activates → [kubernetes, orchestration] (weights: TF×IDF counts)
- UniCOIL: activates → [ku, ##ber, ##netes, or, ##che, ##stration, …] (same WordPiece subwords as SPLADE, but weights are contextual scalars)
- SPLADE: activates → [kubernetes, orchestration, container, cluster, k8, …] (expands to semantically related vocabulary)
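
To make the BM25 row concrete: with whitespace tokenization, the JD and candidate texts used earlier share no surface tokens, so any exact-token scorer gives zero, for the same reason the SPLADE similarity above came out 0.0. A pure-Python toy, not a full BM25 (which would add TF×IDF weighting and length normalization):

# Exact-token overlap between the two texts; an empty set means a BM25 score of 0
jd_surface = set(jd_req_skills.lower().split())
cand_surface = set(candidate_skills.lower().split())
print(jd_surface & cand_surface)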

NOTE: The linear layer below uses random weights since loading the full pre-trained checkpoint (castorini/unicoil-noexp-msmarco-passage) requires mapping custom state-dict keys. For production use, load that checkpoint. The architecture and token activation pattern are correct regardless.