Dynamic entropy-compressed sequences and full-text indexes
0202 electrical engineering, electronic engineering, information engineering
SUFFIX ARRAYS
02 engineering and technology
DOI:
10.1145/1367064.1367072
Publication Date:
2008-07-02T12:09:19Z
AUTHORS (2)
ABSTRACT
We give new solutions to the Searchable Partial Sums with Indels problem. Given a sequence of
n
k
-bit numbers, we present a structure taking
kn
+
o
(
kn
) bits of space, able of performing operations
sum
,
search
,
insert
, and
delete
, all in
O
(log
n
) worst-case time, for any
k
=
O
(log
n
). This extends previous results by Hon et al. [2003c] achieving the same space and
O
(log
n
/log log
n
) time complexities for the queries, yet offering complexities for
insert
and
delete
that are amortized and worse than ours, and supported only for
k
=
O
(1). Our result matches an existing lower bound for large values of
k
.
We also give new solutions to the Dynamic Sequence problem. Given a sequence of
n
symbols in the range [1,σ] with binary zero-order entropy
H
0
, we present a dynamic data structure that requires
nH
0
+
o
(
n
log σ) bits of space, which is able of performing
rank
and
select
, as well as inserting and deleting symbols at arbitrary positions, in
O
(log
n
log σ) time. Our result is the
first
entropy-bound dynamic data structure for
rank
and
select
over general sequences.
In the case σ = 2, where both previous problems coincide, we improve the dynamic solution of Hon et al. [2003c] in that we compress the sequence. The only previous result with entropy-bound space for dynamic binary sequences is by Blandford and Blelloch [2004], which has the same complexities as our structure, but does not achieve constant 1 multiplying the entropy term in the space complexity.
Finally, we present a new dynamic compressed full-text self-index, for a collection of texts over an alphabet of size σ, of overall length
n
and
h
th order empirical entropy
H
h
. The index requires
nH
h
+
o
(
n
log σ) bits of space, for any
h
≤ α log
sigma
n
and constant 0 < α < 1. It can count the number of occurrences of a pattern of length
m
in time
O
(
m
log
n
log σ). Each such occurrence can be reported in
O
(log
2
n
log log
n
) time, and displaying a context of length ℓ from a text takes time
O
(log
n
(ℓ log σ + log
n
log log
n
)). Insertion/deletion of a text to/from the collection takes
O
(log
n
log σ) time per symbol. This significantly improves the space of a previous result by Chan et al. [2004] in exchange for a slight time complexity penalty. We achieve at the same time the
first
dynamic index requiring essentially
nH
h
bits of space, and the
first
construction of a compressed full-text self-index within that working space. Previous results achieve at best
O
(
nH
h
space with constants larger than 1 [Ferragina and Manzini 2000; Arroyuelo and Navarro 2005] and higher time complexities.
An important result we prove in this paper is that the wavelet tree of the Burrows-Wheeler transform of a text, if compressed with a technique that achieves zero-order compression locally (e.g., Raman et al. [2002]), automatically achieves
h
th order entropy space for any
h
. This unforeseen relation is essential for the results of the previous paragraph, but it also derives into significant simplifications on many existing static compressed full-text self-indexes that build on wavelet trees.
SUPPLEMENTAL MATERIAL
Coming soon ....
REFERENCES (40)
CITATIONS (67)
EXTERNAL LINKS
PlumX Metrics
RECOMMENDATIONS
FAIR ASSESSMENT
Coming soon ....
JUPYTER LAB
Coming soon ....