Dense embedding

Dense embeddings capture the semantic relationship between the query and the documents in RAG.
dense embedding
retrieval
semantic matching
topic matching
Author

Shataxi Dubey

Published

April 26, 2026

import numpy as np
from openai import OpenAI
from dotenv import load_dotenv
from sklearn.metrics.pairwise import cosine_similarity

load_dotenv()
True
# jd_req_skills = '''
# ● Classroom Management ● Conflict Resolution ● Student Engagement and Motivation ● Lesson Plan Execution ● Multitasking and Organization ● Compassion and Patience ● Creative Problem Solving ● Proficiency with Microsoft Office and basic educational tools ● Strong communication and interpersonal skills ● Ability to work with students from diverse backgrounds
# '''

# candidate_skills='''Technical Skills ● Cloud & Devops ○ Kubernetes ○ EKS, Minikube, K3s ○ ArgoCD & GitOps ○ Helm Charts ○ CI-CD Pipelines | Gitlab CI | Github Actions ○ IaC - Terraform, Cloudformation ○ AWS Cloud Services - Solutions Architect Associate (AWS Certified) ○ System Design and Architecture ○ Decentralized Apps ○ Monitoring - Grafana, Loki, Prometheus ● Web3 and Blockchain ○ EVM chains | Ethereum | Polygon ○ Solidity Smart Contracts | Hardhat ○ SSI (Self Sovereign Identity) - Verifiable Credentials, VPs, OpenID and w3c standards ○ DID protocols, Cryptography, Privado iD ○ Selective Disclosure and Zero Knowledge Proofs ○ Private Chains | Hyperledger Besu ○ Chainlink, The Graph Protocol, Ocean Protocol, Uniswap, AAVE, more ○ OpenZeppelin Contracts | Defi | NFTs | DAOs ○ Upgradable Smart Contracts, Account Abstraction ○ Decentralized Storage | IPFS ○ NodeJS / Typescript - Restful APIs ○ Blockchain Backend Services ● AI and Development Tools ○ Claude Code (skills, agents, workflows) ○ Antigravity/Cursor, AI-assisted Development ○ Prompt Engineering Other Skills ● Project Management ● Solutions Architect ● System Design and Architecture'''

# candidate_work_exp='''Work experience A) Cloud, DevOps and Blockchain @ smartSense Consulting Solutions (2022 - Present) Projects: ● PredictSwitch - DevOps ○ Project: AI-powered recruitment platform for financial advisors. ○ Tech Stack: AWS, Terraform, Github Actions, ArgoCD, Helm Charts ○ Responsibilities: i. Manage AWS Infrastructure using terraform for consistent and reproducible environments ii. Managing K8S cluster using kubeadm on AWS servers iii. Prepare Github Actions for all applications for CI-CD pipelines iv. Integrated ArgoCD for GitOps-based continuous delivery ● Dynamo - DevOps and Backend ○ Project: Aggregated marketplace for cloud service providers across the EU region. ○ Tech Stack: Kubernetes, Helm Charts, ArgoCD, Gitlab CI, Hashicorp Vault, Loki-stack, Gaia-X services and more ○ Responsibilities: i. Managing Different Environments and making sure all the Apps are up and running ii. Designing efficient automation pipelines for easy deployment of the apps and environment cleanups iii. Self-Hosted Gitlab runners for various pipelines iv. Release management with Gitlab, JIRA and various automation tools v. Worked on NodeJS backend responsible for cryptographic operations used to sign Verifiable Credentials and get Compliance from Gaia-X services vi. Worked on various open source tools and tech including Saleor components(marketplace), Gaia-X Services for SSI and VCs for different service offerings, Backend and Frontend applications (GraphQL APIs) vii. Logging and Monitoring - Grafana and loki-stack for monitoring apps and getting alerts ● Ostrich AI - DevOps and Team Management ○ Project: Unified compliant platform for secure AI deployment, on-demand compute scaling, and custom ML/GenAI model development ○ Tech Stack: Kubernetes, Minikube, AWS Cloud services, Blockchain(Besu), NFTs, AI/ML ○ Responsibilities: i. 
Team management and coordination - Coordinated with the Client to understand the business requirements and refine them, and then conveying workable functionalities/features to the development teams including AI/ML, Backend and Frontend Teams ii. Worked on smart contracts and services which are used to tokenize Data Assets and track their usage using blockchain - This is a Patented(IP) Technology now! iii. Also worked on a concept of Decentralized network of hardwares which can be used to run AI/ML jobs on them for an incentive in a compliant manner ● smartProof - Web3 and Backend ○ Project: Blockchain-based certification anchoring engine on Ethereum and Polygon ○ Tech Stack: Solidity Smart Contracts, ETH & Polygon Chains, NodeJS Backend, Hardhat ○ Production Running Application which is a certification anchoring engine on Ethereum and Polygon Blockchains ○ Upgradable Smart Contracts and Meta Transactions(Gasless txs) ○ Gas Optimization, Backend functionalities, Key management and more ● Gaia-X - SSI, Cryptography, Backend ○ Project: Open federated ecosystem enabling self-sovereign identity and compliant data exchange across EU region ○ Gaia-X focuses on SSI and open federated eco-systems in the European region where data can flow freely but in a fully compliant manner ○ Got to work on various new age technologies and concepts like SSI - Self Sovereign Identity, Verifiable Credentials and Verifiable Presentations ○ NodeJS, NestJS, GxDCH - Gaia-X Digital Clearing House ● XFSC - DevOps, SSI ○ We were part of a workshop which was focused on using the XFSC components, their internal workings and showcasing their deployment on cloud ○ Worked on about 25 of XFSC apps/components and tools across multiple languages which made the whole XFSC architecture ● Ocean Protocol - POC ○ Developed and worked on a POC using Ocean protocol which is basically a Smart Contract protocol used to tokenize Data Assets(AI models and algorithms) on Chain and perform Computations on them in 
isolated environments to preserve data privacy ● NFT Marketplace - POC ○ Initial learnings in blockchain tools and tech including Smart Contracts, ERC 721 NFTs, Indexing tools to query blockchain B) Trainee Software Engineer @ India Film Project - Asia's Largest Content Festival (1 year) ● Tech Stack: Wordpress ● Role & Responsibilities: ○ Involved in developing WordPress Website ○ Managed the festival website and Worked with the Brands & Solution team ○ Closely worked with presales and post sales team C) Software Intern @ Silverwing Technologies Pvt. Ltd (1 year) ● Tech Stack: JAVA ● Role & Responsibilities ○ Developed a dynamic web project in MVC pattern using JAVA'''

# jd_req_skills='''● Bachelor’s degree in Education or a related field ● Valid State Teaching License (Texas) ● PRAXIS II Certification'''
# candidate_skills = '''EDUCATION AND LICENSING Jun 2008 - Present PRAXIS II Certification Austin, Texas Sep 2005 - Jun 2008 B.A. in Education Austin, Texas'''

# jd_req_skills = '''
# Classroom Management Conflict Resolution Student Engagement and Motivation Lesson Plan Execution
# '''

# candidate_skills='''Cloud & Devops Kubernetes EKS Minikube K3s  ArgoCD GitOps Helm Charts CI-CD Pipelines 
# '''

# jd_req_skills = "happy"
# candidate_skills = "sad"

# jd_req_skills = "Certified Kubernetes Administrator"
# candidate_skills = "CKA"

# jd_req_skills = "machine learning engineer"
# candidate_skills = "AI engineer"

# jd_req_skills="Strong Python skills"
# candidate_skills="Architected Python microservices, improved latency by 40%"

# jd_req_skills = "Experience with Kubernetes orchestration"
# candidate_skills = "Managed AWS infrastructure, deployed 50+ microservices"

# jd_req_skills = "I do not want a java developer"
# candidate_skills = "I am skilled in java"

# jd_req_skills = "I want a java developer"
# candidate_skills = "I am skilled in java"
def openai_embeddings_similarity(client, jd_req_skills, candidate_skills):
    """Embed both texts in one API call and print their similarity three ways."""
    resp = client.embeddings.create(
        model='text-embedding-3-small',
        input=[jd_req_skills, candidate_skills],
    )
    # data[0] is the JD text, data[1] is the candidate text
    jd_skills_embedding = resp.data[0].embedding
    candidate_skills_embedding = resp.data[1].embedding

    print("=========OpenAI Raw Text Similarity========")
    # Raw dot product; close to the cosine similarity when both norms are ~1
    similarity = np.dot(jd_skills_embedding, candidate_skills_embedding)
    print(similarity)

    # Cosine similarity: dot product divided by the product of the two norms
    similarity = similarity / (np.linalg.norm(jd_skills_embedding) * np.linalg.norm(candidate_skills_embedding))
    print(f"Similarity computed using numpy dot product: {similarity}")

    # Same quantity via scikit-learn, which expects 2D arrays (n_samples, n_features)
    a = np.array(jd_skills_embedding).reshape(1, -1)
    b = np.array(candidate_skills_embedding).reshape(1, -1)
    similarity = cosine_similarity(a, b)[0][0]
    print(f"Similarity computed using scikit learn cosine similarity: {similarity}")

    # Norm of the JD embedding, to check whether the vectors are unit length
    print(f'Norm of a vector {np.linalg.norm(jd_skills_embedding)}')
client = OpenAI()
jd_req_skills = "I do not want a java developer"
candidate_skills = "I am skilled in java"

print(f"Cosine similarity for JD {jd_req_skills} and candidate_skills {candidate_skills}")
openai_embeddings_similarity(client, jd_req_skills, candidate_skills)

jd_req_skills = "I want a java developer"
candidate_skills = "I am skilled in java"

print(f"\nCosine similarity for JD {jd_req_skills} and candidate_skills {candidate_skills}")
openai_embeddings_similarity(client, jd_req_skills, candidate_skills)
Cosine similarity for JD I do not want a java developer and candidate_skills I am skilled in java
=========OpenAI Raw Text Similarity========
0.5604065811329946
Similarity computed using numpy dot product: 0.560503588679478
Similarity computed using scikit learn cosine similarity: 0.560503588679478
Norm of a vector 0.9995877295552451

Cosine similarity for JD I want a java developer and candidate_skills I am skilled in java
=========OpenAI Raw Text Similarity========
0.6364499261881065
Similarity computed using numpy dot product: 0.6361443178296838
Similarity computed using scikit learn cosine similarity: 0.636144317829684
Norm of a vector 1.0003751365017943

The norm of each vector is close to 1 because OpenAI returns normalized (unit-length) embeddings, which is why the raw dot product and the cosine similarity nearly coincide.
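The relationship between the norm and the two similarity values can be checked without calling the API. The snippet below is a minimal sketch that uses random vectors as stand-ins for embeddings; the dimension, the seed, and the helper name `cosine` are arbitrary choices for illustration and are not part of the notebook above.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Random stand-ins for two embeddings (hypothetical data, not real embeddings)
rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Rescale each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

raw_dot = float(np.dot(a, b))             # depends on the vector lengths
unit_dot = float(np.dot(a_unit, b_unit))  # dot product of unit vectors
cos = cosine(a, b)                        # unaffected by vector length

print(f"raw dot product:        {raw_dot}")
print(f"dot of unit vectors:    {unit_dot}")
print(f"cosine similarity:      {cos}")
print(f"norm after rescaling:   {np.linalg.norm(a_unit)}")
```

For unit-length vectors the dot product and the cosine similarity are the same number, so the small gap between the first two printed values in the output above comes only from the norms being slightly different from 1.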

Here we see that when the JD is “I do not want a java developer” and the candidate skill is “I am skilled in java”, the cosine similarity is 0.56. It is not close to 0 even though the two sentences are logically opposite. This happens because the vector representation of a sentence is based on the context in which its words appear, and the two statements appear in the same context: both are about the same topic, i.e. “Java developer”. The two sentences therefore occupy the same region of the semantic space, which gives a cosine similarity above 0.5.

However, when the JD is “I want a java developer” and the candidate skill is “I am skilled in java”, the cosine similarity rises to 0.64. The two sentences are semantically closer because the JD no longer contains “not”, so they align at both the word level and the intent level.
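A toy bag-of-words comparison can help build intuition for why shared topic words keep the score well above 0 and why a single “not” only lowers it slightly. This is only an analogy: it is not how `text-embedding-3-small` computes its vectors, and the helper `bow_cosine` is a hypothetical name introduced for this sketch.

```python
import math
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)


candidate = "I am skilled in java"
with_negation = bow_cosine("I do not want a java developer", candidate)
without_negation = bow_cosine("I want a java developer", candidate)

print(f"with 'not':    {with_negation:.3f}")
print(f"without 'not': {without_negation:.3f}")
```

In this toy setting the extra words “do not” only add length to the JD vector, so the score drops a little while the shared words still dominate. Dense embeddings are far richer than word counts, but the scores reported above follow the same ordering.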