Efficient Neural Compression with Inference-time Decoding

Neural decoding
DOI: 10.48550/arxiv.2406.06237 Publication Date: 2024-06-10
ABSTRACT
This paper explores the combination of neural network quantization and entropy coding for memory footprint minimization. Edge deployment of quantized models is hampered by the harsh Pareto frontier of the accuracy-to-bitwidth tradeoff, which causes a dramatic accuracy loss below a certain bitwidth. This can be alleviated by mixed-precision quantization, which allows a more flexible bitwidth allocation. However, the benefits of standard mixed precision remain limited by the 1-bit frontier, which forces each parameter to be encoded on at least 1 bit of data. This paper introduces an approach that combines mixed precision, zero-point quantization and entropy coding to push the compression boundary of ResNets beyond the 1-bit frontier, with a 1% accuracy drop on the ImageNet benchmark. From an implementation standpoint, a compact decoder architecture features reduced latency and thus enables inference-compatible decoding.
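
ILLUSTRATIVE SKETCH
The abstract's central claim, that entropy coding can push the storage cost of quantized weights below the 1-bit frontier, can be illustrated with a small self-contained sketch. This is not the paper's code: the function names, the 2-bit setting, and the synthetic Gaussian weights are assumptions chosen only to show the mechanism. Zero-point quantization concentrates most weights on the zero-point symbol, so the empirical entropy of the quantized symbols, i.e. the average cost an ideal entropy coder would pay, falls well below the fixed bitwidth.

# Minimal sketch (assumptions: NumPy only, synthetic Gaussian weights,
# hypothetical helper names; not the paper's actual pipeline).
import numpy as np

def zero_point_quantize(w, n_bits=2):
    """Uniform affine (zero-point) quantization of a weight tensor."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = np.round(qmin - w.min() / scale)
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.int64)
    return q, scale, zero_point

def empirical_entropy_bits(q):
    """Average bits/symbol achievable by an ideal entropy coder."""
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Sharply peaked weights, loosely mimicking a trained conv layer.
    w = rng.normal(0.0, 0.02, size=1_000_000)
    q, scale, zp = zero_point_quantize(w, n_bits=2)
    print("fixed-width cost :", 2.0, "bits/param")
    print("entropy-coded    :", round(empirical_entropy_bits(q), 2), "bits/param")

On such a peaked distribution the entropy-coded cost lands well under 1 bit per parameter, while a fixed 2-bit encoding cannot go below 2 bits; the paper's inference-compatible decoder is what makes recovering the weights from such a compressed stream practical at runtime.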