This post is organized as a narrative centered on how the intention to “measure distance” between texts pushed methods forward and, after the Transformer, converged on semantic distance.

In the early web era, the starting point was the need to handle scale while quickly identifying related documents and near-duplicates. TF-IDF measured “closeness” as cosine similarity over term-weight vectors (BM25 refined that weighting for ranking), while shingling → MinHash/LSH and then SimHash “fingerprinted” documents to find near-duplicates fast. Here, distance mostly captured surface-level overlap, not meaning.
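
To make the surface-level notion concrete, here is a minimal sketch of TF-IDF cosine similarity. The use of scikit-learn and the toy documents are my choices for illustration; any TF-IDF implementation behaves the same way.

```python
# Surface-level "closeness": cosine similarity over TF-IDF term weights.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices fell sharply today",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # sparse document-term matrix
sims = cosine_similarity(tfidf)                 # pairwise cosine similarities

# Scores are high only when the same words overlap; paraphrases that use
# different words score low, which is exactly the gap later methods address.
print(sims.round(2))
```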

Soon the problem that surface overlap misses semantic closeness came to the fore, bringing in latent topic models such as LSA and LDA. They built low-dimensional spaces from word co-occurrence and measured distance within that latent structure. Word order and richer context were still poorly captured, but distance moved one step beyond surface matching.
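
A minimal sketch of the LSA idea, again assuming scikit-learn: factor the term-document matrix and compare documents in the reduced space. The tiny corpus is illustrative only, so the actual similarity values carry little meaning.

```python
# Latent structure: LSA projects the TF-IDF space onto a few components driven
# by word co-occurrence, so distance is measured in a low-dimensional space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "doctors treat patients in hospitals",
    "the physician examined the patient",
    "the striker scored a late goal",
]

X = TfidfVectorizer().fit_transform(docs)                     # term-document matrix
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(cosine_similarity(lsa).round(2))                        # similarity in the latent space
```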

Word2Vec and GloVe then became a turning point by learning distributed word meanings directly from large corpora, via prediction and co-occurrence objectives respectively. With cleaner geometry in the vector space, Doc2Vec and pooled embeddings made sentence and document vectors practical. Distance shifted its center of gravity from “frequency” to the geometry of meaning.
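
The simplest way such word vectors turn into a document distance is mean pooling. In the sketch below, the `word_vecs` table is a random placeholder standing in for a pretrained Word2Vec/GloVe lookup, so only the mechanics are real.

```python
# From word vectors to a sentence vector: mean-pool pretrained embeddings,
# then compare sentences by cosine similarity.
import numpy as np

# Placeholder lookup: in practice these come from pretrained Word2Vec/GloVe files.
rng = np.random.default_rng(0)
word_vecs = {w: rng.normal(size=50) for w in
             "the cat sat on mat a kitten rested rug".split()}

def sentence_vec(tokens):
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vecs, axis=0)          # mean pooling over word vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_vec("the cat sat on the mat".split())
s2 = sentence_vec("a kitten rested on a rug".split())
print(cosine(s1, s2))                     # meaningful only with real pretrained vectors
```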

In machine translation, RNN Seq2Seq with attention made long-range dependencies visible as soft alignments, weighting which source words influence each output word. That pushed toward better internal representations of sentence meaning and laid a foothold for the idea that a good encoder yields a good semantic distance.

The Transformer accelerated this trend decisively. Self-attention constructs a relation graph over the whole sequence at once, capturing order and long-range dependencies in parallel. The motivation was to surpass RNN limits and improve translation, but the consequence was the spread of high-quality contextual representations, rapidly firming up the basis for measuring distance in semantic space.
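
A bare-bones sketch of scaled dot-product self-attention (single head, no masking, NumPy only) shows the “relation graph over the whole sequence” computed in one step. The shapes and random inputs are illustrative assumptions.

```python
# Scaled dot-product self-attention: every position attends to every other
# position at once, so order and long-range links are modeled in parallel.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_len, seq_len) relation graph
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over key positions
    return weights @ V                                # contextualized representations

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))               # token embeddings (plus positions)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (5, 16)
```

Stacking such layers, with multiple heads, positional information, and feed-forward blocks, is what produces the contextual representations that later sentence encoders build on.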

BERT, USE, and SBERT then supplied sentence representations in which distance tracks meaning directly, via pretraining plus adaptation. SBERT and SimCSE in particular use contrastive learning to explicitly pull “things that should be close” together and push “things that should be far” apart, making the alignment of distance with semantic relatedness the training objective itself.
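
A simplified sketch of that contrastive objective: an InfoNCE-style loss with in-batch negatives, in the spirit of SimCSE-like training. The random arrays stand in for encoder outputs, so this is only the loss mechanics, not a training recipe.

```python
# Contrastive objective (simplified): pull each sentence toward its positive
# pair and push it away from the other sentences in the batch.
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    # Normalize so that the dot product equals cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = a @ p.T / temperature                  # (batch, batch) similarity matrix
    # Row i's "correct" column is i; every other column acts as a negative.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 128))                        # encoder outputs for sentences
positives = anchors + 0.1 * rng.normal(size=(8, 128))      # e.g. paraphrases / dropout views
print(info_nce(anchors, positives))
```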

Search itself entered the “learned” phase. DPR, and ColBERT with its token-level late interaction, learn query and document embeddings so that a similarity score such as the inner product maps directly to retrieval relevance. Distance is no longer a byproduct; it becomes a task-aligned objective, unifying near-duplicate detection, similarity search, clustering, and recommendation within the same semantic space.
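
At query time the dual-encoder picture is very simple. The sketch below uses random arrays in place of trained DPR-style encoders, so only the scoring mechanics are real; after training, the inner product correlates with relevance by construction.

```python
# Dual-encoder retrieval (simplified): queries and documents are embedded
# separately, and the inner product is itself the retrieval score.
import numpy as np

rng = np.random.default_rng(0)
d = 128
doc_embs = rng.normal(size=(1000, d))     # placeholder for document-encoder outputs
query_emb = rng.normal(size=(d,))         # placeholder for the query-encoder output

scores = doc_embs @ query_emb             # inner product = learned relevance score
top_k = np.argsort(-scores)[:5]           # indices of the highest-scoring documents
print(top_k, scores[top_k].round(2))
```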

Finally, on the infrastructure side, FAISS, HNSW, ScaNN, and vector databases made nearest-neighbor search practical at web scale. In production, hybrid BM25 + vector retrieval balances precision, recall, and speed, and RAG, deduplication, summarization, and recommendation all connect on the same embedding foundation.
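
A minimal approximate-nearest-neighbor sketch, assuming the faiss-cpu package is installed; the embeddings are random placeholders and the HNSW parameter is an illustrative default, not a tuned setting.

```python
# Approximate nearest-neighbor search at scale with a FAISS HNSW index.
# With unit-normalized vectors, L2 ranking matches cosine/inner-product ranking.
import numpy as np
import faiss

d, n = 128, 10_000
rng = np.random.default_rng(0)
xb = rng.normal(size=(n, d)).astype("float32")
xb /= np.linalg.norm(xb, axis=1, keepdims=True)     # normalize document embeddings

index = faiss.IndexHNSWFlat(d, 32)                  # HNSW graph, 32 neighbors per node
index.add(xb)                                       # index the corpus once

xq = xb[:3] + 0.01                                  # pretend queries near known docs
distances, ids = index.search(xq.astype("float32"), 5)
print(ids)                                          # top-5 neighbor ids per query
```

In a hybrid setup, these vector hits are typically merged or re-ranked together with BM25 results before anything reaches the user.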

In short, the intention has been consistent: “At massive scale, find what’s close—quickly and without missing it.” Its resolution advanced through
surface overlap → latent structure → prediction-learned semantic spaces → task-aligned learned distances.
Transformer—born to “make translation better,” not to detect duplicates—nonetheless boosted the quality and scale of contextual representations, and thereby cemented the era where better embeddings mean better distance.
