Efficient Neural Compression with Inference-time Decoding
DOI:
10.48550/arxiv.2406.06237
Publication Date:
2024-06-10
AUTHORS (3)
ABSTRACT
This paper explores the combination of neural network quantization and entropy coding for memory footprint minimization. Edge deployment of quantized models is hampered by the harsh Pareto frontier of the accuracy-to-bitwidth tradeoff, causing dramatic accuracy loss below a certain bitwidth. This loss can be alleviated thanks to mixed precision quantization, which allows a more flexible bitwidth allocation. However, standard mixed precision benefits remain limited due to the 1-bit frontier, which forces each parameter to be encoded on at least 1 bit of data. This paper introduces an approach that combines mixed precision, zero-point quantization and entropy coding to push the compression boundary of Resnets beyond the 1-bit frontier, with a 1% accuracy drop on the ImageNet benchmark. From an implementation standpoint, a compact decoder architecture features reduced latency, enabling inference-compatible decoding.
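
As a rough illustration of the mechanism described in the abstract, the sketch below (a hypothetical NumPy example, not the authors' implementation; the function names zero_point_quantize and bits_per_parameter are assumptions) applies zero-point quantization to a weight tensor and estimates the bits per parameter an ideal entropy coder would spend on the resulting symbols. When the quantized values concentrate on the zero-point, that entropy drops below 1 bit per parameter, which is what crossing the 1-bit frontier refers to.

```python
# Hypothetical sketch: zero-point quantization followed by an entropy
# estimate of the quantized symbols. Not taken from the paper.
import numpy as np

def zero_point_quantize(w, num_bits):
    """Asymmetric (zero-point) uniform quantization to 2**num_bits levels."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = np.round(qmin - w.min() / scale)
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.int64)
    return q, scale, zero_point

def bits_per_parameter(q):
    """Empirical entropy of the symbols: the cost an ideal entropy coder pays."""
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Toy example: a sharply peaked (Laplacian-like) weight distribution, as is
# typical for trained network weights.
w = np.random.laplace(scale=0.02, size=100_000)
q, scale, zp = zero_point_quantize(w, num_bits=2)
print(f"nominal bitwidth: 2 bits, entropy-coded cost: {bits_per_parameter(q):.3f} bits/param")
```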