Adrien Grand
@jpountz.bsky.social
📤 57
📥 28
📝 73
#Lucene
developer
It's interesting how the Elasticsearch and Datadog (
www.datadoghq.com/blog/enginee...
) approaches to wildcard search differ. Both use n-gram indexes, but with different strategies to contain storage amplification. Datadog hashes 4-grams while ES aggressively normalizes 3-grams.
Inside Husky’s query engine: Real-time access to 100 trillion events | Datadog
See how Husky enables interactive querying across 100 trillion events daily by combining caching, smart indexing, and query pruning.
https://www.datadoghq.com/blog/engineering/husky-query-architecture/
about 2 months ago
0
0
0
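Neither blog's implementation is reproduced here, but a minimal sketch of the two n-gram flavors mentioned in the post above might look like this (class and method names are invented for illustration, not ES or Datadog code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the two n-gram indexing flavors described in the post above.
public class NGramSketch {

  // ES-style: index (aggressively normalized) 3-grams of the value. A query like
  // *error* can then be answered by AND-ing the 3-grams of "error" as a pre-filter,
  // before re-checking candidate documents against the original pattern.
  static List<String> threeGrams(String value) {
    String v = value.toLowerCase(); // stand-in for the real normalization
    List<String> grams = new ArrayList<>();
    for (int i = 0; i + 3 <= v.length(); i++) {
      grams.add(v.substring(i, i + 3));
    }
    return grams;
  }

  // Datadog-style: hash 4-grams into fixed-size tokens so the token dictionary
  // stays bounded regardless of value cardinality.
  static List<Integer> hashedFourGrams(String value) {
    List<Integer> grams = new ArrayList<>();
    for (int i = 0; i + 4 <= value.length(); i++) {
      grams.add(value.substring(i, i + 4).hashCode() & 0x7fffffff);
    }
    return grams;
  }
}
```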
Ge Song merged a good ~15% speedup for BM25F queries in Lucene
benchmarks.mikemccandless.com/CombinedOrHi...
(last data point)
github.com/apache/lucen...
Lucene CombinedOrHighMed queries/sec
https://benchmarks.mikemccandless.com/CombinedOrHighMed.html
about 2 months ago
0
2
0
New blog: vectorized evaluation of disjunctive queries
jpountz.github.io/2025/10/11/v...
It explains how Lucene manages to be fast at evaluating top hits by BM25 score, even with hard queries that consist only of stop words or contain tens of terms.
Vectorized evaluation of disjunctive queries
In a previous blog post, I explained how Lucene significantly improved query evaluation efficiency by migrating to a vectorized execution model, and described the algorithm that Lucene uses to evaluat...
https://jpountz.github.io/2025/10/11/vectorized-evaluation-of-disjunctive-queries.html
about 2 months ago
0
0
1
reposted by
Adrien Grand
Doug Turnbull
3 months ago
BM25F is an adjustment to BM25 that accounts for multiple fields, beating out naive summing of BM25 scores
softwaredoug.com/blog/2025/09...
BM25F from scratch
BM25 run across multiple fields isn’t as simple as summing a bunch of field-level BM25 scores.
https://softwaredoug.com/blog/2025/09/18/bm25f-from-scratch
0
1
1
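For reference, one common formulation of BM25F (following Robertson & Zaragoza) first combines per-field, length-normalized term frequencies using field weights, and only then applies saturation — which is exactly what distinguishes it from summing per-field BM25 scores:

```latex
\tilde{tf}_{t,d} = \sum_{f} w_f \cdot
  \frac{tf_{t,f,d}}{1 + b_f\left(\frac{l_{f,d}}{\bar{l}_f} - 1\right)}
\qquad
\mathrm{BM25F}(q,d) = \sum_{t \in q} \mathrm{idf}(t) \cdot
  \frac{\tilde{tf}_{t,d}}{k_1 + \tilde{tf}_{t,d}}
```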
Lucene 10.3 is out with 40% faster lexical search, 15% faster dense vector search and 30% faster terms dictionary lookups.
lucene.apache.org/core/corenew...
Lucene™ Core News
Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for...
https://lucene.apache.org/core/corenews.html#apache-lucenetm-1030-available
3 months ago
1
3
3
Lucene just bumped the block size of its postings lists from 128 to 256. This gave very good speedups (up to 45%) to most queries, and up to 10-15% slowdowns to filtered term queries.
benchmarks.mikemccandless.com/2025.09.10.1...
https://benchmarks.mikemccandless.com/2025.09.10.18.04.41.html
3 months ago
1
1
0
#Lucene
just switched from a binary heap to a ternary heap to collect top hits by score. This helps a bit when computing top-100 hits (~2% on the fastest queries) but up to 15% when computing top-1000 hits, thanks to better cache efficiency.
github.com/apache/lucen...
Adding 3-ary LongHeap to speed up collectors like TopDoc*Collectors by RamakrishnaChilaka · Pull Request #15140 · apache/lucene
Description This PR updates LongHeap from a fixed 2-ary heap to a 3-ary heap (the code is generic with n-ary Heap). The change improves cache locality and reduces heap operations for larger heaps, ...
https://github.com/apache/lucene/pull/15140
3 months ago
0
1
0
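For illustration, here is the core of a 3-ary sift-down over a long[] min-heap; this is not the LongHeap code from the PR, just the shape of the change (a shallower tree whose children sit next to each other in memory):

```java
// Minimal 3-ary min-heap sift-down over a 0-based long[] (not Lucene's code).
static void downHeap(long[] heap, int size, int i) {
  long node = heap[i];
  while (true) {
    int firstChild = 3 * i + 1;
    if (firstChild >= size) {
      break; // no children
    }
    // pick the smallest of up to 3 adjacent children (good cache locality)
    int smallest = firstChild;
    for (int c = firstChild + 1; c < firstChild + 3 && c < size; c++) {
      if (heap[c] < heap[smallest]) {
        smallest = c;
      }
    }
    if (heap[smallest] >= node) {
      break; // heap property restored
    }
    heap[i] = heap[smallest]; // move the smaller child up
    i = smallest;
  }
  heap[i] = node;
}
```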
I just ran the Tantivy benchmark (
tantivy-search.github.io/bench/
) on Lucene 10.2 vs a Lucene 10.3 snapshot build. Lucene 10.2 already performed very well, but Lucene 10.3 is on another level. Very exciting.
3 months ago
0
1
0
I just blogged about how Lucene improved query evaluation efficiency by ~40% through vectorization:
jpountz.github.io/2025/08/28/c...
Compilation vs. vectorization, search engine edition
Virtual function calls are quite expensive, which is why database systems have been looking into ways to avoid performing one or more virtual function calls per record when processing a query. Two mai...
https://jpountz.github.io/2025/08/28/compiled-vs-vectorized-search-engine-edition.html
3 months ago
0
2
1
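A crude way to picture the difference described in that post (made-up interfaces, not Lucene's actual APIs): in the "compiled" model each document costs a virtual call, while in the vectorized model the virtual dispatch is paid once per window and the per-document work becomes a tight loop over primitive arrays that the JIT can unroll and auto-vectorize.

```java
// Made-up interfaces to contrast the two execution models; not Lucene APIs.
interface PerDocScorer {
  float score(int doc); // one virtual call per document
}

interface WindowScorer {
  // one virtual call per window of documents
  void scoreWindow(int[] docs, float[] scores, int count);
}

class ConstantWindowScorer implements WindowScorer {
  @Override
  public void scoreWindow(int[] docs, float[] scores, int count) {
    for (int i = 0; i < count; i++) {
      scores[i] = 1f; // tight, branch-free loop over primitive arrays
    }
  }
}
```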
Why you probably should configure an index sort on your Lucene indexes:
jpountz.github.io/2025/07/26/w...
Why you should configure an index sort on your Lucene indexes
Some time ago, I wrote that “if you do not configure an index sort on your Lucene indexes, you are missing search-time efficiency benefits that are almost certainly worth the (low) index-time overhead...
https://jpountz.github.io/2025/07/26/why-you-should-configure-an-index-sort.html
4 months ago
0
3
1
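Configuring an index sort is a one-liner on IndexWriterConfig (real Lucene API); the field name and order below are just an example, and the field must also be indexed with doc values (e.g. NumericDocValuesField) so segments can be sorted on it.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

class IndexSortExample {
  // Example only: sort segments by a hypothetical "timestamp" field, descending.
  static IndexWriterConfig sortedConfig() {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    config.setIndexSort(new Sort(new SortField("timestamp", SortField.Type.LONG, true)));
    return config;
  }
}
```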
I spent some time looking at the Vespa source code to see how it compares with Lucene
jpountz.github.io/2025/07/25/m...
More on Vespa vs. Lucene/Elasticsearch
In a previous post, I took a look at the Vespa vs. Elasticsearch benchmark that the Vespa people run. The results made me want to dig a little deeper to see how Vespa and Lucene/Elasticsearch differ i...
https://jpountz.github.io/2025/07/25/more-on-vespa-vs-elasticsearch.html
4 months ago
0
3
1
This small change yielded a ~5% speedup on several queries of Lucene's nightly benchmarks (see last data point at
benchmarks.mikemccandless.com/OrStopWords....
). Can you guess why?
5 months ago
1
2
0
Last month, Lucene changed query evaluation to work in a more term-at-a-time fashion within small-ish windows of doc IDs. This yielded a good speedup on its own (annotation IL
benchmarks.mikemccandless.com/OrHighMed.html
).
Lucene BooleanQuery (OR, high freq, medium freq term) queries/sec
https://benchmarks.mikemccandless.com/OrHighMed.html
5 months ago
1
0
0
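A rough sketch of what "term-at-a-time within windows" means (hypothetical PostingsClause interface, not Lucene's actual code): each clause dumps its score contributions for the current window into a small dense buffer before any hit is emitted.

```java
// Hypothetical clause abstraction; exhausted iterators return Integer.MAX_VALUE.
interface PostingsClause {
  int advance(int target); // first doc >= target
  int nextDoc();
  float score();
}

class WindowedScoring {
  // Accumulate each clause's contributions for docs in [windowBase, windowBase + windowSize).
  static void scoreWindow(PostingsClause[] clauses, int windowBase, int windowSize,
                          float[] buffer /* length >= windowSize, zeroed */) {
    for (PostingsClause clause : clauses) {
      for (int doc = clause.advance(windowBase);
           doc < windowBase + windowSize;
           doc = clause.nextDoc()) {
        buffer[doc - windowBase] += clause.score();
      }
    }
    // ...then scan the buffer for documents whose accumulated score is competitive.
  }
}
```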
Lucene is getting an increasing number of high-quality contributions from ByteDance employees, especially around performance. Good to see that this project keeps attracting contributors from all around the world.
5 months ago
0
1
0
Another common point I did not expect: Vespa's distinction between strict and unstrict iterators is quite similar to Lucene's two-phase iteration. And both projects use this feature to effectively combine dynamic pruning with filtering (a hard and underappreciated problem IMO).
5 months ago
0
0
0
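For the Lucene side of that comparison, this is how a consumer drives a TwoPhaseIterator (real API): the cheap approximation proposes candidate doc IDs and matches() confirms them, which lets a filter or a pruning check agree on a candidate before the expensive verification runs.

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.TwoPhaseIterator;

class TwoPhaseExample {
  static void drain(TwoPhaseIterator twoPhase) throws IOException {
    DocIdSetIterator approximation = twoPhase.approximation();
    for (int doc = approximation.nextDoc();
         doc != DocIdSetIterator.NO_MORE_DOCS;
         doc = approximation.nextDoc()) {
      if (twoPhase.matches()) {
        // confirmed match for the current doc; collect it here
      }
    }
  }
}
```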
Someone asked me for my opinion on the Vespa vs. Elasticsearch performance comparison today at Berlin Buzzwords, so I gave it a try:
jpountz.github.io/2025/06/17/a...
A look at the Vespa vs. Elasticsearch benchmark
I was attending Berlin Buzzwords today and someone asked me about the Elasticsearch vs. Vespa comparison produced by the Vespa people, so I thought I’d publish my thoughts.
https://jpountz.github.io/2025/06/17/analysis-of-Elasticsearch-vs-Vespa.html
6 months ago
0
1
0
Andrei Dan kindly captured pictures of Luca and me telling the story of how the Lucene 10 release went.
6 months ago
0
1
0
Via
@rmuir.org
: Linux 6.15 introduced a big speedup for Lucene on AMD processors
benchmarks.mikemccandless.com/FilteredOrHi...
(last data point, not annotated yet) thanks to faster TLB invalidation
www.phoronix.com/review/amd-i...
Lucene FilteredOrHighMed queries/sec
https://benchmarks.mikemccandless.com/FilteredOrHighMed.html
6 months ago
0
1
0
Uwe now explains how Lucene takes advantage of Panama foreign memory and vector support even though these features are still preview/incubating in the JDK.
6 months ago
0
3
0
Uwe Schindler gives a short history of Apache Lucene at
#bbuzzz
6 months ago
0
1
0
Lucene is getting faster at deep search by switching to a more efficient heap implementation to collect top hits.
github.com/apache/lucen...
Move HitQueue in TopScoreDocCollector to a LongHeap by gf2121 · Pull Request #14714 · apache/lucene
This tries to encode ScoreDoc#score and ScoreDoc#doc to a comparable long and use a LongHeap instead of HitQueue. This seems to help apparently when i increase topN = 1000 (mikemccand/luceneutil#35...
https://github.com/apache/lucene/pull/14714
6 months ago
0
2
0
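The packing trick is roughly this (a sketch in the spirit of the PR, not its exact code): since BM25 scores are non-negative, their raw float bits order the same way as the values, so score and doc fit into one comparable long, with the doc ID flipped so that ties favor smaller doc IDs.

```java
class ScoreDocPacking {
  // Pack (score, doc) into one long: higher long == higher score, and on equal
  // scores, smaller doc ID. Assumes score >= 0, which holds for BM25.
  static long encode(float score, int doc) {
    return ((long) Float.floatToIntBits(score) << 32) | (Integer.MAX_VALUE - doc);
  }

  static float decodeScore(long packed) {
    return Float.intBitsToFloat((int) (packed >>> 32));
  }

  static int decodeDoc(long packed) {
    return Integer.MAX_VALUE - (int) (packed & 0xFFFFFFFFL);
  }
}
```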
A nice optimization landed on the hash table that Lucene uses to build inverted indexes:
github.com/apache/lucen...
. Some previously unused bits are now used to cache hash codes, effectively making collisions cheaper to resolve.
Cache high-order bits of hashcode to speed up BytesRefHash by bugmakerrrrrr · Pull Request #14720 · apache/lucene
Description This PR tries to utilize the unused part of the id to cache the high-order bits of the hashcode to speed up BytesRefHash. I used 1 million 16-byte UUIDs to benchmark this change, and t...
https://github.com/apache/lucene/pull/14720
6 months ago
0
1
0
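The idea looks roughly like this (an illustrative open-addressing lookup, not the PR's actual layout): a few high-order bits of each slot cache part of the hash code, so most collisions are rejected with an integer compare instead of a byte-by-byte comparison.

```java
import java.util.Arrays;

class CachedHashBitsLookup {
  // Top 7 bits of each slot cache hash bits; the rest hold the entry id.
  // The bit split and table layout are assumptions for the example.
  static final int HASH_BITS_MASK = 0xFE000000;
  static final int ID_MASK = ~HASH_BITS_MASK;

  // table: packed slots (-1 = empty), length a power of two; entries: id -> stored bytes.
  static int find(int[] table, byte[][] entries, byte[] key, int hash) {
    int cachedBits = hash & HASH_BITS_MASK;
    for (int slot = hash & (table.length - 1); ; slot = (slot + 1) & (table.length - 1)) {
      int packed = table[slot];
      if (packed == -1) {
        return -1; // empty slot: key is absent
      }
      if ((packed & HASH_BITS_MASK) == cachedBits            // cheap reject first
          && Arrays.equals(entries[packed & ID_MASK], key)) { // compare bytes only on near-hits
        return packed & ID_MASK;
      }
    }
  }
}
```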
There has been a big regression in Lucene's nightly benchmarks recently after a kernel upgrade. Mike and
@rmuir.org
found that it was caused by a change in the Linux scheduler configuration.
github.com/apache/lucen...
Nightly benchmark regression on 2025.05.01 · Issue #14630 · apache/lucene
Description I'm seeing a big performance change (mostly regression) on 2025.05.01 benchmark, without an annotation. There are many commits diff for this run, i have not managed to identify but mayb...
https://github.com/apache/lucene/issues/14630#issuecomment-2889212582
7 months ago
1
2
0
I wanted to share what I learned from Tantivy's "Search Benchmark, the Game", so I set up GitHub Pages and wrote two blog posts: one with general observations on the benchmark
jpountz.github.io/2025/05/12/a...
and one on how it helped drive performance improvements in Lucene
jpountz.github.io/2025/04/12/w...
An analysis of Search Benchmark, the Game
“Search Benchmark, the Game” is maintained at https://github.com/quickwit-oss/search-benchmark-game by the Tantivy folks and published at https://tantivy-search.github.io/bench/. I don’t know the full...
https://jpountz.github.io/2025/05/12/analysis-of-Search-Benchmark-the-Game.html
7 months ago
0
1
1
Yelp's nrtSearch was just upgraded to Lucene 10. It also switched from persistent storage to object storage as the source of truth, and plans to do NRT replication via object storage instead of over the network. Very similar to Elasticsearch Serverless.
engineeringblog.yelp.com/2025/05/nrts...
Nrtsearch 1.0.0: Incremental Backups, Lucene 10, and More
Nrtsearch 1.0.0: Incremental Backups, Lucene 10, and More Sarthak Nandi and Andrew Prudhomme May 8, 2025 It has been over 3 years since we published our Nrtsearch blog post and...
https://engineeringblog.yelp.com/2025/05/nrtsearch-v1-release.html
7 months ago
0
0
0
The search library benchmark from the Tantivy folks was just updated with Lucene 10.2
tantivy-search.github.io/bench
. Lucene now performs much better at the COUNT collection type, a bit better at TOP_K. Still somewhat slow at TOP_100_COUNT and phrase queries across all collection types.
Search benchmark, the game
The search benchmark
https://tantivy-search.github.io/bench
7 months ago
0
2
0
Lucene's histogram collector is becoming more sophisticated: it can now take advantage of points indexes when the query fully matches a segment, which can give a severalfold performance boost.
github.com/apache/lucen...
Logic for collecting Histogram efficiently using Point Trees by jainankitk · Pull Request #14439 · apache/lucene
Description This PR adds multi range traversal logic to collect the histogram on numeric field indexed as pointValues for MATCH_ALL cases. Even for non-match all cases like PointRangeQuery, if the ...
https://github.com/apache/lucene/pull/14439
7 months ago
0
0
0
I feel bad when I see users needing to shard their indexes just to avoid hitting the 2B doc count limit. 2B was a lot when Lucene was created, but not anymore. We should fix it.
github.com/apache/lucen...
7 months ago
0
3
2
Julie's argument for principled approaches in her BM25F blog reminded me of this paper
www.microsoft.com/en-us/resear...
which discusses folding features like page rank into the BM25 score by making their contribution look like the contribution of a term.
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/craswell_sigir05.pdf
8 months ago
0
0
0
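The general shape discussed in that line of work is an additive, saturating transform of the static feature S(d) (e.g. PageRank), so that its contribution behaves like one more term's contribution rather than overwhelming the BM25 part; w, k and a are tuning parameters, and the exact transforms studied in the paper vary:

```latex
\mathrm{score}(d, q) \;=\; \mathrm{BM25}(d, q) \;+\; w \cdot \frac{S(d)^{a}}{k^{a} + S(d)^{a}}
```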
Finally took the time to read this great blog on BM25F by Julie Tibshirani
sourcegraph.com/blog/keeping...
. Great to see more usage of BM25F; lots of applications still combine scores across fields via sum/max when BM25F would be a more robust choice.
https://sourcegraph.com/blog/keeping-i…
8 months ago
0
3
0
Elasticsearch 9, now on Lucene 10.
8 months ago
0
1
0
In case you missed it: since last year, Lucene's nightly benchmarks include queries on whole paragraphs.
github.com/mikemccand/l...
benchmarks.mikemccandless.com/OrMany.html
This is not typical for user-facing search, but many users do this, especially with the rise of RAG.
8 months ago
1
0
0
I'm glad to share that Luca Cavanna and I will be speaking about the
#Lucene
10.0 release at Berlin Buzzwords in June.
8 months ago
0
4
1
It's time to redo benchmarks!
#Lucene
10.2 was just released, with:
- huge speedups to non-scoring boolean queries, range queries and filtered vector search
- better merging defaults for faster search
- much faster merging of vectors
And more...
lucene.apache.org/core/corenew...
Lucene™ Core News
Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for...
https://lucene.apache.org/core/corenews.html#apache-lucenetm-1020-available
8 months ago
1
6
1
Something that I wish was better known: if you do not configure an index sort on your
#Lucene
indexes, you are missing search-time efficiency benefits that are almost certainly worth the (low) index-time overhead.
8 months ago
1
2
1
Guo Feng contributed a 2.5x (!) speedup to
#Lucene's
numeric range queries by using vectorization: annotation HZ sped up query evaluation, and annotation ID sped up decoding data from the index. Lots of great performance improvements coming in Lucene 10.2.
9 months ago
0
3
0
Two good speedups on queries sorted by field on
#Lucene's
nightly benchmarks last night. This is due to a recent optimization for conjunctive queries being generalized to intersecting a query with a collector.
github.com/apache/lucen...
9 months ago
0
2
0
reposted by
Adrien Grand
Ben Trent
9 months ago
Filtered vector search is crazy important. So we made HNSW filtered search in Apache Lucene better. At similar recall, it can be 3-5x faster!
1
5
1
You may not have heard of the Terrier IR platform
terrier.org
but both Lucene's Similarity framework
lucene.apache.org/core/10_1_0/...
and Elasticsearch's Retriever framework
www.elastic.co/guide/en/ela...
are heavily inspired by Terrier.
Terrier IR Platform - Homepage
Terrier Information Retrieval Platform Homepage
http://terrier.org/
10 months ago
0
1
0
2.5y ago, Tencent published how they optimize "histogram queries" with Lucene
www.vldb.org/pvldb/vol15/...
(section 4.5.3). I'm adding an implementation of this idea to Lucene, which takes advantage of the sparse indexing introduced in Lucene 10.0
github.com/apache/lucen...
.
https://www.vldb.org/pvldb/vol15/p3472-yu.pdf
10 months ago
0
2
0
The change has been merged and nightly benchmarks just caught up:
benchmarks.mikemccandless.com/2025.01.14.1...
. I don't remember many improvements of this magnitude to non-trivial workloads in Lucene's history.
11 months ago
0
5
2
Common wisdom is that indexing is slow (it is!), so when ingestion is slow, it's natural to blame indexing. But there are often overlooked costs, like JSON parsing, date parsing
blunders.io/posts/es-ben...
or number parsing
lemire.me/blog/2020/03...
Elasticsearch Benchmarking, Part 1: Date Parsing
https://blunders.io/posts/es-benchmark-1-date-parse
11 months ago
2
1
0
Quiz: I'm working on a change that produces the following speedup when running the Lucene benchmark. Can you guess what the change does? I'll open a PR shortly.
11 months ago
1
3
1
Mike Sokolov is exploring applying recursive graph bisection to HNSW graphs in Lucene. This gives significantly faster search (as we hoped) but also faster indexing (unexpected!)
github.com/apache/lucen...
HNSW BP reordering by msokolov · Pull Request #14097 · apache/lucene
Description This is similar to the previous PR (#13683) on the same issue, but the difference here is: rebased on main (there was a 10.0 major release in between) refactored to make BpVectorReorde...
https://github.com/apache/lucene/pull/14097
11 months ago
0
2
0
reposted by
Adrien Grand
Ben Trent
11 months ago
Something a little different from my typical blogs. This line of code in Apache Lucene took me 3 days to write. For fixing bugs, it's about the journey, not necessarily the destination.
www.elastic.co/search-labs/...
(the cover art was provided by one of my kids :))
Lucene bug adventures: Fixing a corrupted index exception - Elasticsearch Labs
Sometimes, a single line of code takes days to write. Here, we get a glimpse of an engineer's pain and debugging over multiple days to fix a potential Apache Lucene index corruption.
https://www.elastic.co/search-labs/blog/lucene-corrupted-index-exception
1
4
1
Lucene has been evaluating disjunctive queries by loading (windows of) postings into a bit set and or-ing these bit sets for 20+ years. It started using the same approach for conjunctive queries a few days ago.
benchmarks.mikemccandless.com/CountAndHigh...
(annotation HS)
Lucene CountAndHighHigh queries/sec
https://benchmarks.mikemccandless.com/CountAndHighHigh.html
12 months ago
1
2
1
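A rough sketch of the window-at-a-time bit set approach (reusing the hypothetical PostingsClause interface from the windowed-scoring sketch earlier, not Lucene's code): each clause materializes its matches in the current window as a bit set, the sets are OR-ed for disjunctions or, with the recent change, AND-ed for conjunctions, and the surviving bits are counted.

```java
class WindowedCounting {
  // Count matches in [windowBase, windowBase + windowSize) using per-clause bit sets.
  // PostingsClause is the hypothetical interface from the earlier sketch; exhausted
  // iterators return Integer.MAX_VALUE. Assumes at least one clause.
  static int countWindow(PostingsClause[] clauses, int windowBase, int windowSize,
                         boolean conjunction) {
    long[] acc = null;
    for (PostingsClause clause : clauses) {
      long[] bits = new long[(windowSize + 63) >>> 6];
      for (int doc = clause.advance(windowBase);
           doc < windowBase + windowSize;
           doc = clause.nextDoc()) {
        int rel = doc - windowBase;
        bits[rel >>> 6] |= 1L << rel; // shift is implicitly mod 64
      }
      if (acc == null) {
        acc = bits;
      } else {
        for (int i = 0; i < acc.length; i++) {
          acc[i] = conjunction ? (acc[i] & bits[i]) : (acc[i] | bits[i]);
        }
      }
    }
    int count = 0;
    for (long word : acc) {
      count += Long.bitCount(word);
    }
    return count;
  }
}
```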
Lucene just got a big speedup for exhaustive evaluation of disjunctive queries when not computing scores:
benchmarks.mikemccandless.com/CountOrHighH...
(annotation HR). Same algorithm, but the code was refactored to help the JVM compiler auto-vectorize a hot loop.
Lucene CountOrHighHigh queries/sec
https://benchmarks.mikemccandless.com/CountOrHighHigh.html
12 months ago
0
2
0
Filtered disjunctions just got a good speedup on Lucene's nightly benchmarks, especially when the disjunction has many terms
benchmarks.mikemccandless.com/CountFiltere...
(annotation HQ).
Lucene CountFilteredOrMany queries/sec
https://benchmarks.mikemccandless.com/CountFilteredOrMany.html
12 months ago
1
2
2
DisjunctionMaxQuery just had a good speedup:
benchmarks.mikemccandless.com/DismaxOrHigh...
Even though Lucene now has support for BM25F, DisjunctionMaxQuery is still a popular way to query across several fields by taking the maximum score; see e.g. Elasticsearch's best_fields option.
https://benchmarks.mikemccandless.com/DismaxOrHighMe…
12 months ago
1
1
0
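As documented, DisjunctionMaxQuery scores a document with the best field's score plus a tie-breaking increment for the other matching fields, which in code amounts to:

```java
class DisMaxScoring {
  // score = best field score + tieBreakerMultiplier * (sum of the other matching scores)
  static float disMaxScore(float[] fieldScores, float tieBreakerMultiplier) {
    float max = 0f;
    float sum = 0f;
    for (float score : fieldScores) {
      max = Math.max(max, score);
      sum += score;
    }
    return max + tieBreakerMultiplier * (sum - max);
  }
}
```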
Phrase queries got a speedup lately by removing as much bookkeeping about positions as possible when advancing the doc ID. This helps because positions are only read when all phrase terms match. The speedup is even bigger on filtered phrase queries.
benchmarks.mikemccandless.com/FilteredPhra...
Lucene FilteredPhrase queries/sec
https://benchmarks.mikemccandless.com/FilteredPhrase.html
12 months ago
0
2
0