Commit graph

3,576 commits

Author SHA1 Message Date
Pascal Seitz
1e859fd78d fix term aggregation u32::MAX overflow issue 2026-06-18 17:07:43 +08:00
Pascal Seitz
f451fa938f explain why naive scorer must accumulate scores in WAND order 2026-06-17 18:58:58 +08:00
Pascal Seitz
2a82dd6f64 fix flaky test 2026-06-17 18:58:58 +08:00
Pascal Seitz
c096b2ad89 aggregation/terms: charge fused term_counts to the memory limit
term_counts (one u32/term) was allocated but not charged to
AggregationLimitsGuard, so a memory limit could be exceeded silently.
Charge it, skip allocating it when unbounded, and add a regression test.
2026-06-16 21:23:23 +08:00
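The accounting described in commit c096b2ad89 can be illustrated with a minimal sketch. `LimitsGuardSketch`, `charge`, and `alloc_term_counts` are hypothetical names invented here; they are not tantivy's `AggregationLimitsGuard` API, and the real code may differ in how it decides when the buffer is needed.

```rust
/// Hypothetical stand-in for a memory-limit guard (assumption, not tantivy's type).
struct LimitsGuardSketch {
    used_bytes: u64,
    limit_bytes: u64,
}

impl LimitsGuardSketch {
    /// Records `bytes` as consumed, or fails if the limit would be exceeded.
    fn charge(&mut self, bytes: u64) -> Result<(), String> {
        let total = self.used_bytes.saturating_add(bytes);
        if total > self.limit_bytes {
            return Err(format!(
                "memory limit exceeded: {} > {}",
                total, self.limit_bytes
            ));
        }
        self.used_bytes = total;
        Ok(())
    }
}

/// Allocates the per-term counter buffer (one u32 per term) only when it is
/// needed, and charges it to the guard before allocating.
fn alloc_term_counts(
    num_terms: usize,
    has_hard_bounds: bool,
    guard: &mut LimitsGuardSketch,
) -> Result<Option<Vec<u32>>, String> {
    if !has_hard_bounds {
        // Unbounded case: nothing is allocated, so nothing is charged.
        return Ok(None);
    }
    let bytes = (num_terms as u64).saturating_mul(std::mem::size_of::<u32>() as u64);
    guard.charge(bytes)?;
    Ok(Some(vec![0u32; num_terms]))
}
```

Charging before allocating means an over-limit request fails without ever reserving the memory.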
Pascal Seitz
ac7a3d347c add comment, hoist variables 2026-06-16 21:23:23 +08:00
Pascal Seitz
03520a0719 add top level comment 2026-06-16 21:23:23 +08:00
Pascal Seitz
86a4c47bed merge loops, histo with bounds may benefit from single vec opt 2026-06-16 21:23:23 +08:00
Pascal Seitz
fb23e8908f add histogram with bounds 2026-06-16 21:23:23 +08:00
Pascal Seitz
3ca510dff0 aggregation/terms: tidy fused term×histogram grid construction
Rename the value threaded through build_segment_term_collector and
maybe_build_collector from max_term_id to col_max_val/max_column_val — it
is the column's max value, only later reused as the max term id. Make the
grid-size arithmetic overflow-/zero-safe (saturating_add, checked_div).
2026-06-16 21:23:23 +08:00
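As a rough illustration of the overflow-safe and zero-safe arithmetic that commit 3ca510dff0 mentions, the sketch below sizes a term × bucket grid with `saturating_add` and `checked_div`. The function `fused_grid_cells` and its `max_cells` budget parameter are assumptions made for this example; the actual grid-size computation in tantivy is not shown in the commit message.

```rust
/// Returns the number of grid cells (terms × buckets), or `None` when the grid
/// would be empty or would exceed `max_cells`. Hypothetical helper.
fn fused_grid_cells(col_max_val: u64, num_buckets: u64, max_cells: u64) -> Option<u64> {
    // The column's max value is reused as the max term id, so the term count
    // is max + 1. saturating_add avoids wrapping to 0 when col_max_val == u64::MAX.
    let num_terms = col_max_val.saturating_add(1);
    // checked_div yields None for num_buckets == 0 instead of panicking.
    let max_terms_allowed = max_cells.checked_div(num_buckets)?;
    if num_terms > max_terms_allowed {
        return None;
    }
    // Safe: num_terms * num_buckets <= max_cells, so this cannot overflow.
    Some(num_terms * num_buckets)
}
```

Comparing against `max_cells / num_buckets` rather than multiplying first keeps the size check itself from overflowing.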
Pascal Seitz
3cb400c300 clarify counts/term_counts field docs
Spell out that `counts` is the flattened per-term × time-bucket grid (each
term's own contiguous slice) and that `term_counts` is only needed when the
per-term total can't be derived from that grid (i.e. with hard bounds).
2026-06-16 21:23:23 +08:00
Pascal Seitz
ef13489d63 skip hard_bounds that can't exclude any value
When a histogram's hard_bounds are wider than the column's value range, the
per-doc `bounds.contains` check can never fail. Collapse such bounds to the
unbounded sentinel in `normalize_histogram_req`, so both the general histogram
hot loop and the fused term×histogram path skip the check — the latter then
derives per-term counts from the grid (the ~17% win) instead of falling back to
per-doc counting just because `bounds != [MIN, MAX]`.

Only the collect-time filter is affected: empty-bucket emission reads
`req.hard_bounds` directly, and hard_bounds only ever clips that range, so a
wider-than-data bound leaves results unchanged. Covered by new tests on the
general and fused paths, including mid-interval (bucket-splitting) bounds.

Also tighten the fused-path u32-overflow guard to bound on `num_vals()` (the
per-value increment count) rather than `num_docs()`, and document why the fused
collector's hot-loop fields are hoisted into locals (re-reading them from memory
each iteration measured ~15% slower).
2026-06-16 21:23:23 +08:00
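The bounds normalization described in commit ef13489d63 can be sketched as follows. `Bounds`, `UNBOUNDED`, and `normalize_bounds` are illustrative names chosen here; tantivy's `normalize_histogram_req` and its hard-bounds type may be shaped differently.

```rust
/// Hypothetical inclusive bounds type (assumption for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bounds {
    min: f64,
    max: f64,
}

/// Sentinel meaning "no filtering needed": [MIN, MAX].
const UNBOUNDED: Bounds = Bounds { min: f64::MIN, max: f64::MAX };

impl Bounds {
    /// The per-document check that the hot loop would like to skip.
    fn contains(&self, value: f64) -> bool {
        value >= self.min && value <= self.max
    }
}

/// If the bounds cover the column's whole value range, `contains` can never
/// fail, so collapse them to the sentinel and let callers skip the check.
fn normalize_bounds(bounds: Bounds, col_min: f64, col_max: f64) -> Bounds {
    if bounds.min <= col_min && bounds.max >= col_max {
        UNBOUNDED
    } else {
        bounds
    }
}
```

A collector can then compare the normalized bounds against `UNBOUNDED` once, outside the hot loop, and pick the path without the per-document check.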
Pascal Seitz
9f7aea4765 derive term counts 2026-06-16 21:23:23 +08:00
Pascal Seitz
2c8536ab11 add specialized TermHistogram 2026-06-16 21:23:23 +08:00
Pascal Seitz
05f4c02ac5 add dense histogram, optional sub-buckets 2026-06-16 21:23:23 +08:00
Pascal Seitz
d137779219 add no sub-agg fastpath 2026-06-16 21:23:23 +08:00
Pascal Seitz
8f9846ac80 use get_range when possible 2026-06-16 21:23:23 +08:00
Pascal Seitz
52e24a9757 add status -> date histogram bench 2026-06-16 21:23:23 +08:00
trinity-1686a
00714326af
Merge pull request #2960 from Darkheir/fix/query_grammar_boost_and_escape
fix(query-grammar): Fix issues on boosted and regex queries
2026-06-16 12:03:23 +02:00
Mohammad Dashti
799f7b4646 Built SUM final result in each branch directly.
Keeps the empty-bucket coercion visible at the boundary instead of a
shared binding, following the reviewer's suggested shape.
2026-06-16 03:10:30 +08:00
Mohammad Dashti
fc88d80726 docs: drop downstream-specific name from none_if_no_match doc
The flag's purpose is described well enough by "SQL-style consumers";
no need to call out a specific downstream.
2026-06-16 03:10:30 +08:00
Mohammad Dashti
6a684e7c38 feat: opt-in none_if_no_match flag on SumAggregation for SQL-style null
Switch the default serialized output of `sum` on empty / all-missing
buckets back to `"value": 0` to match Elasticsearch, and gate the
SQL-style `"value": null` behavior behind a new
`none_if_no_match: Option<bool>` flag on `SumAggregation`.

`IntermediateSum::finalize` still returns `Option<f64>` internally so
the Rust API stays parallel to min/max/avg, but the ES-vs-SQL choice is
made at the boundary in `IntermediateMetricResult::into_final_metric_result`:
`None` is coerced to `Some(0.0)` unless `none_if_no_match` is set on the
aggregation request.

Adds `AggregationVariants::as_sum()` accessor for that boundary check
and two end-to-end tests covering both the default ES behavior and the
opt-in null behavior on an empty index.
2026-06-16 03:10:30 +08:00
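The boundary coercion described in commit 6a684e7c38 can be sketched in a few lines. `IntermediateSumSketch` and `into_final_sum` are simplified stand-ins written for this example, not the actual `IntermediateSum` or `IntermediateMetricResult::into_final_metric_result` code.

```rust
/// Simplified stand-in for an intermediate sum (assumption for this sketch).
#[derive(Default)]
struct IntermediateSumSketch {
    sum: f64,
    count: u64,
}

impl IntermediateSumSketch {
    fn collect(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }

    /// Internally, "no values collected" is `None`, parallel to min/max/avg.
    fn finalize(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum)
        }
    }
}

/// The ES-vs-SQL choice is made once, at the output boundary: `None` becomes
/// `Some(0.0)` unless the request opted in with `none_if_no_match`.
fn into_final_sum(sum: &IntermediateSumSketch, none_if_no_match: Option<bool>) -> Option<f64> {
    match sum.finalize() {
        Some(value) => Some(value),
        None if none_if_no_match.unwrap_or(false) => None,
        None => Some(0.0),
    }
}
```

With the flag unset, an empty bucket yields `Some(0.0)`, which serializes like Elasticsearch's `"value": 0`; with `Some(true)` it stays `None`, which a SQL-style consumer can map to NULL.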
Mohammad Dashti
94fe52cc67 docs: clarify SUM finalize returning None diverges from Elasticsearch
Surface the trade-off in the doc comment so future reviewers see why
this differs from ES (which returns "value": 0 for sum over
empty/all-missing buckets) and what consumers (ParadeDB SQL NULL) the
None variant is meant to serve.
2026-06-16 03:10:30 +08:00
Mohammad Dashti
2ff39f6f7f fix: return None from SUM when no values were collected
IntermediateSum::finalize() returned Some(0.0) even when count==0
(all documents had missing/NULL values). This differs from MIN, MAX,
and AVG which all return None for count==0.

The 0.0 came from IntermediateStats' default sum initialization.
Consumers (like ParadeDB) that map None to SQL NULL were incorrectly
getting 0 for SUM on all-NULL groups.

Fixes paradedb/paradedb#4621
2026-06-16 03:10:30 +08:00
Windforce17
1d06328cb3 Add BlockSegmentPostings::rank() for skip-list-based positional counting
Add a public rank(target) method on BlockSegmentPostings that returns the
number of docs with a doc id strictly smaller than target. It jumps to the
candidate block through the skip list and decodes a single block, so the cost
is O(skip-list entries) + one block decode rather than O(doc_freq).

This is a useful primitive for range counting over a posting list (e.g. number
of matches in a [lo, hi) doc-id window) without iterating every matched doc.

To support it, expose SkipReader::remaining_docs() (pub(crate)). Like seek(),
rank() advances the cursor forward only and must be called with non-decreasing,
valid (<= TERMINATED) targets. Adds a unit test covering multi-block lists and
the below-first / above-last / empty edge cases.
2026-06-15 18:56:49 +08:00
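To make the idea behind commit 1d06328cb3 concrete, here is a minimal sketch of rank over a block-structured sorted list. `BlockedPostings`, the tiny `BLOCK_SIZE`, and the in-memory "skip list" of last doc ids are all assumptions for illustration. Unlike the real `BlockSegmentPostings::rank()`, this version is stateless and does not model the forward-only cursor or compressed blocks.

```rust
/// Kept tiny for readability; an assumption of this sketch.
const BLOCK_SIZE: usize = 4;

/// Hypothetical posting list: sorted doc ids plus one skip entry per block.
struct BlockedPostings {
    docs: Vec<u32>,       // strictly increasing doc ids
    last_docs: Vec<u32>,  // last doc id of each block (stands in for the skip list)
}

impl BlockedPostings {
    fn new(docs: Vec<u32>) -> Self {
        let last_docs = docs
            .chunks(BLOCK_SIZE)
            .map(|block| block[block.len() - 1])
            .collect();
        BlockedPostings { docs, last_docs }
    }

    /// Number of docs with a doc id strictly smaller than `target`.
    fn rank(&self, target: u32) -> usize {
        // Walk the skip entries: every block whose last doc is < target lies
        // entirely below the target and is counted without being inspected.
        let block_idx = self
            .last_docs
            .iter()
            .take_while(|&&last| last < target)
            .count();
        let start = block_idx * BLOCK_SIZE;
        if start >= self.docs.len() {
            // The target is above the last doc (or the list is empty).
            return self.docs.len();
        }
        // Inspect a single candidate block.
        let end = (start + BLOCK_SIZE).min(self.docs.len());
        let in_block = self.docs[start..end]
            .iter()
            .take_while(|&&doc| doc < target)
            .count();
        start + in_block
    }
}
```

Counting matches in a `[lo, hi)` doc-id window is then `rank(hi) - rank(lo)`, which touches at most two blocks instead of every matched doc.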
Darkheir
7fd1dbe9f5
fix(query-grammar): Fix issues on boosted and regex queries
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
2026-06-15 10:50:07 +02:00
Pascal Seitz
b19f0ddc77 fix clippy 2026-06-09 23:14:12 +08:00
Pascal Seitz
b4acfcf881 cleanup AggregationsSegmentCtx
The metric/cardinality/histogram _mut getters had no callers needing
mutation; their two uses already pass the resulting reference as &T.

simplify req_data ownership: clone into collectors, Rc only for filter BitSet

Replace Vec<Option<Box<T>>> + take/put-back round-trip with Vec<T> +
direct clone into collector. Collectors now own their per-segment
request data outright, removing the borrow-checker dance that the
take/put-back pattern existed to satisfy.

The structural clones are cheap (Column<u64> is Arc-internal) except
for the filter aggregation, whose DocumentQueryEvaluator carries a
precomputed per-segment BitSet sized by max_doc. Wrap that in
Rc<DocumentQueryEvaluator> so FilterAggReqData::clone() bumps a
refcount instead of duplicating the BitSet. Move SegmentFilterCollector's
matching_docs_buffer out of FilterAggReqData so its pre-allocated
capacity is preserved per collector instead of being lost on every clone.
2026-06-09 23:14:12 +08:00
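The ownership change described in commit b4acfcf881 relies on `Rc` making clones of an expensive per-segment structure cheap. The sketch below shows that pattern with invented types (`EvaluatorSketch`, `FilterReqSketch`, `FilterCollectorSketch`); the real `DocumentQueryEvaluator` and `FilterAggReqData` carry more state than shown here.

```rust
use std::rc::Rc;

/// Stand-in for an evaluator holding a bitset sized by max_doc (assumption).
struct EvaluatorSketch {
    bitset: Vec<u64>,
}

/// Per-segment request data. Cloning it bumps the Rc refcount instead of
/// copying the bitset.
#[derive(Clone)]
struct FilterReqSketch {
    evaluator: Rc<EvaluatorSketch>,
}

/// Each collector owns its request data and its own scratch buffer, so the
/// buffer's pre-allocated capacity is not lost when the request is cloned.
struct FilterCollectorSketch {
    req: FilterReqSketch,
    matching_docs_buffer: Vec<u32>,
}

fn new_request(max_doc: usize) -> FilterReqSketch {
    let words = (max_doc + 63) / 64;
    FilterReqSketch {
        evaluator: Rc::new(EvaluatorSketch { bitset: vec![0u64; words] }),
    }
}

fn build_collector(req: &FilterReqSketch) -> FilterCollectorSketch {
    FilterCollectorSketch {
        req: req.clone(), // cheap: shares the evaluator
        matching_docs_buffer: Vec::with_capacity(64),
    }
}
```

`Rc` rather than `Arc` suffices in this sketch because the data stays on one thread; whether that holds in the real collector is outside what the commit message states.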
dependabot[bot]
3a8240b123 Bump codecov/codecov-action from 6.0.0 to 7.0.0
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 6.0.0 to 7.0.0.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](57e3a136b7...fb8b3582c8)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-version: 7.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-06-09 14:48:17 +08:00
dependabot[bot]
fd9713e1ca
Bump actions/checkout from 6.0.2 to 6.0.3 (#2949)
Bumps [actions/checkout](https://github.com/actions/checkout) from 6.0.2 to 6.0.3.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](de0fac2e45...df4cb1c069)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: 6.0.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-08 10:55:54 +02:00
dependabot[bot]
96f3784f79
Bump github/codeql-action from 4.35.2 to 4.36.1 (#2948)
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.35.2 to 4.36.1.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](95e58e9a2c...87557b9c84)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.36.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-08 10:49:04 +02:00
dependabot[bot]
87a6679a79
Bump actions/upload-artifact from 7.0.0 to 7.0.1 (#2917)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 7.0.0 to 7.0.1.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](bbbca2ddaa...043fb46d1a)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: 7.0.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-08 10:48:48 +02:00
dependabot[bot]
864a6aa72c
Update murmurhash32 requirement from 0.3 to 0.4 (#2894)
Updates the requirements on [murmurhash32](https://github.com/quickwit-inc/murmurhash32) to permit the latest version.
- [Commits](https://github.com/quickwit-inc/murmurhash32/commits)

---
updated-dependencies:
- dependency-name: murmurhash32
  dependency-version: 0.4.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-08 10:48:32 +02:00
Paul Masurel
abcf6754a2
CR comments from https://github.com/quickwit-oss/tantivy/pull/2940 (#2952)
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-06-08 10:47:58 +02:00
Kanishk Sachan
70a8e56ee5 test(postings): add unit tests for TermFrequencyRecorder
Closes #2285

The TermFrequencyRecorder was completely untested. Add six focused tests:

- term_frequency_recorder_has_term_freq: verifies the recorder
  correctly advertises term-frequency support via has_term_freq()
- term_frequency_recorder_zero_docs: term_doc_freq() returns Some(0)
  before any documents are recorded
- term_frequency_recorder_term_doc_freq_single_doc: one document with
  two occurrences yields term_doc_freq() == Some(1)
- term_frequency_recorder_term_doc_freq_multiple_docs: three documents
  with varying term frequencies yield term_doc_freq() == Some(3),
  confirming the count tracks documents, not occurrences
- term_frequency_recorder_single_occurrence_per_doc: each of three
  documents has exactly one occurrence
- term_frequency_recorder_high_frequency_doc: a single document with
  1000 occurrences still yields term_doc_freq() == Some(1)
2026-06-06 14:44:51 +08:00
Paul Masurel
62705526e8
Add sve + neon filter vec implementation as spotted by Adam (#2940)
* Add filter_vec benchmarks (dense, sparse, full coverage)

Uses get_ids_for_value_range to exercise both the bitpacking decode and
the filter_vec SIMD path together under realistic cache conditions.

* Add NEON and SVE implementations for filter_vec

Adds aarch64-specific SIMD paths (NEON always available on aarch64;
SVE gated on nightly + non-Apple target) with routing logic in mod.rs
that selects the best available instruction set at runtime.

* Using asm! to work around the lack of stabilized SVE intrinsics

* showing instruction set

* improved proptesting

* removing build.rs

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-06-04 17:51:26 +02:00
Paul Masurel
a27c64998f
Cargo clippy fix (#2943)
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-06-01 14:39:44 +02:00
Paul Masurel
46b3fb9ed3
Relying on upstream version of datasketch and stop using HLL 4. (#2936)
We were relying on a fork for:

- a bugfix in LIST serialization
- a better API exposing a new Coupon type, required for caching coupons.

We also stop using HLL8 in the hope of fixing
https://datadoghq.atlassian.net/browse/CLOUDPREM-625

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-05-19 13:29:35 +02:00
trinity-1686a
fbe620b9b4
Merge pull request #2933 from quickwit-oss/1686a/sstable-opt
optimise sstable index access pattern
2026-05-19 11:43:17 +02:00
trinity-1686a
95d8a3989a
cr 2026-05-19 11:38:48 +02:00
trinity-1686a
ea61a68db4
skip sstable index binary search when ordinal is in same block 2026-05-16 11:35:38 +02:00
trinity-1686a
c367df37c1
refactor sstable index 2026-05-16 11:30:02 +02:00
Mohammad Dashti
d99a5d4e91 Rename validate_aggregation_fields to validate_aggregation_fields_exist
Applies @PSeitz's review suggestion to make the function name more
descriptive of what it checks. Also adds a doc note clarifying why
validation is opt-in rather than enforced by default.
2026-05-16 15:45:20 +08:00
Mohammad Dashti
2de6f075ce Fixed the example 2026-05-16 15:45:20 +08:00
Mohammad Dashti
18080067c7 Applied PR comment:
I would move it outside of the aggregation. You can fetch the fields from the aggregation request and do a validation in a helper function
2026-05-16 15:45:20 +08:00
Mohammad Dashti
95db7d2e5c Revert "Revert all impl."
This reverts commit d5e0991549a05bf80f19f853f7689ad69f96e7e5.
2026-05-16 15:45:20 +08:00
Mohammad Dashti
fc017c4c74 Applied PR comments. 2026-05-16 15:45:20 +08:00
Mohammad Dashti
141c91d028 Added a flag: strict_validation 2026-05-16 15:45:20 +08:00
Mohammad Dashti
36a83e7c1a Fixed agg validation 2026-05-16 15:45:20 +08:00
jinhelin
be11f8a6a1 Fix opening positions file error 2026-05-14 15:55:59 +08:00
dependabot[bot]
4305e4029e Update binggan requirement from 0.16.1 to 0.17.0
Updates the requirements on [binggan](https://github.com/pseitz/binggan) to permit the latest version.
- [Changelog](https://github.com/PSeitz/binggan/blob/main/CHANGELOG.md)
- [Commits](https://github.com/pseitz/binggan/commits)

---
updated-dependencies:
- dependency-name: binggan
  dependency-version: 0.17.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-12 15:10:20 +08:00