
compress and decompress function should be revisited #31

Closed
Oldes opened this issue Apr 15, 2021 · 8 comments

Comments

@Oldes
Owner

Oldes commented Apr 15, 2021

At this moment there is:

compress: make native! [[
    {Compresses data. Default is deflate with Adler32 checksum and uncompressed size in last 4 bytes.}
    data [binary! string!] "If string, it will be UTF8 encoded"
    /part length "Length of source data"
    /zlib {Use ZLIB (Adler32 checksum) without uncompressed length appended}
    /gzip "Use ZLIB with GZIP envelope (using CRC32 checksum)"
    /lzma "Use LZMA compression"
    /level lvl [integer!] "Compression level 0-9"
]]

and

decompress: make native! [[
    "Decompresses data. Result is binary."
    data [binary!] "Source data to decompress"
    /part "Limits source data to a given length or position"
    length [number! series!] "Length of compressed data (must match end marker)"
    /zlib "Data are in ZLIB format with Adler32 checksum"
    /gzip "Data are in ZLIB format with CRC32 checksum"
    /lzma "Data are in LZMA format"
    /deflate "Data are raw DEFLATE data"
    /size
    bytes [integer!] {Number of decompressed bytes. If not used, size is detected from last 4 source data bytes.}
]]

The compress doc-string for the /zlib refinement is wrong: there is no difference between the output of compress and compress/zlib:

>> compress "aha"
== #{789C4BCC4804000257012B03000000}

>> compress/zlib "aha"
== #{789C4BCC4804000257012B03000000}
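The identical output can be reproduced with any zlib binding. The sketch below (Python, used purely as an illustration) shows the structure of the shared bytes: an RFC 1950 zlib stream ending in the Adler32 checksum, with the uncompressed size appended as four little-endian bytes, matching the trailing 03000000 above:

```python
import struct
import zlib

data = b"aha"

# RFC 1950 zlib stream: 2-byte header (78 9C) + DEFLATE data + Adler32 checksum
z = zlib.compress(data)
print(z.hex())  # 789c4bcc4804000257012b (0257012b = Adler32 of "aha")

# Rebol's plain `compress` additionally appends the uncompressed size
# as 4 little-endian bytes, matching the trailing 03000000 above
rebol_style = z + struct.pack("<I", len(data))
print(rebol_style.hex())  # 789c4bcc4804000257012b03000000
```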

In Red these functions are currently defined as:

compress: make native! [[
    {compresses data. return GZIP format (RFC 1952) by default.}
    data         [any-string! binary!] 
     /zlib      "Return ZLIB format (RFC 1950)."
     /deflate   "Return DEFLATE format (RFC 1951)."
]]
decompress: make native! [[
    {Decompresses data. Data in GZIP format (RFC 1952) by default. }
    data         [binary!] 
     /zlib      "Data in ZLIB format (RFC 1950)."
        size         [integer!] "Uncompressed data size. Use 0 if don't know."
     /deflate   "Data in DEFLATE format (RFC 1951)."
        size         [integer!] "Uncompressed data size. Use 0 if don't know."
]]

But I think that using refinements for (de)compression types is not the right way... these functions should be flexible enough to handle more types once they become available :-/

@acook

acook commented Apr 16, 2021

Is the compress function used internally?
Is there a particular reason those algorithms were chosen?

It seems odd to me to want to use compress arbitrarily for all possible algorithms.
But if the goal was to provide a unified interface to all compression libraries through compress, then I could agree that refinements are not the most flexible way to do that.

Personally I would say, just pick your default compression scheme and split the rest of them into dedicated functions.

zip and zip/expand (or zip/decompress or unzip) with the /part refinement and maybe the RFCs as refinements too.
lzma and lzma/expand etc

@Oldes
Owner Author

Oldes commented Apr 16, 2021

Thanks for your opinion.
I don't know what you mean by "used internally". I consider Rebol to be a universal tool, so any available feature may be needed somewhere. For example, I'm still actively working with SWF files, and these use both zlib and lzma compression.
The HTTP scheme handles gzip and deflate, and so on.

I don't think it is good to add a dedicated function for each compression algorithm. I think it should work the way checksum already does, so I can write:

foreach method system/catalog/checksums [printf [10] reduce[method checksum "hello" method]]

and see:

adler32   103547413
crc24     52719
crc32     907060870
tcp       10943
md4       866437CB7A794BCE2B727ACC0362EE27
md5       5D41402ABC4B2A76B9719D911017C592
sha1      AAF4C61DDCC5E8A2DABEDE0F3B482CD9AEA9434D
sha224    EA09AE9CC6768C50FCEE903ED054556E5BFC8347907F12598AA24193
sha256    2CF24DBA5FB0A30E26E83B2AC5B9E29E1B161E5C1FA7425E73043362938B9824
sha384    59E1748777448C69DE6B800D7A33BBFB9FF1B463E44354C3553BCDB9C666FA90125A3C79F90397BDF5F6A13DE828684F
sha512    9B71D224BD62F3785D96D46AD3EA3D73319BFBC2890CAADAE2DFF72519673CA72323C3D99BA5C11D7C7ACC6E14B8C5DA0C4663475C2E5C3ADEF46F73BCDEC043
ripemd160 108F07B8382412612C048D07D13F814118445ACD
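For comparison, a quick Python sketch (illustrative only) that walks a similar catalog — the zlib module for the rolling checksums, hashlib for the digests — reproduces several of the values above:

```python
import hashlib
import zlib

data = b"hello"

# Non-cryptographic checksums come from the zlib module
print(f"{'adler32':<10}{zlib.adler32(data)}")
print(f"{'crc32':<10}{zlib.crc32(data)}")

# Cryptographic digests come from hashlib, keyed by name like the catalog above
for name in ("md5", "sha1", "sha256", "sha512"):
    print(f"{name:<10}{hashlib.new(name, data).hexdigest().upper()}")
```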

I would like to have compression schemes as well, like what is already done for checksums, where I can already do:

>> p: open checksum:md5
>> write p "hello"
>> read p
== #{5D41402ABC4B2A76B9719D911017C592}

>> write p "hello"
>> read p
== #{23B431ACFEB41E15D466D75DE822307C}

which is the same as:

>> checksum "hellohello" 'md5
== #{23B431ACFEB41E15D466D75DE822307C}
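The port behaves like an incremental hash: each write feeds more input, and read returns the digest of everything written so far. A Python analogy using hashlib's incremental update (shown only for illustration):

```python
import hashlib

# Each update feeds more data, like writing to the checksum port
h = hashlib.md5()
h.update(b"hello")
first = h.hexdigest()   # digest of "hello"
h.update(b"hello")
second = h.hexdigest()  # digest of "hellohello"

# Incremental hashing matches hashing the concatenation in one shot
assert second == hashlib.md5(b"hellohello").hexdigest()
```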

@Oldes
Owner Author

Oldes commented Apr 16, 2021

But this issue is mainly a strong reminder for me that something must be done one day, because the current state is bad.

@Oldes
Owner Author

Oldes commented Apr 16, 2021

I should also note that the system should be extensible in some way, because if I wanted to add, for example, the brotli algorithm, it should probably come in some form of extension; the simple brotli stand-alone utility is over 775kB (335 compressed), which is quite a lot to add just in case someone may need it.

I still like the idea, that there should be single entry point for multiple compression algorithms.

@Oldes
Owner Author

Oldes commented Apr 16, 2021

It should also be noted that the compression algorithm is one thing and a codec is another.
For example, there is already a GZIP codec:

>> save %/r/hello.gz to binary! "hello"
== %/r/hello.gz

>> read %/r/hello.gz
== #{1F8B080000000000020ACB48CDC9C9070086A6103605000000}

>> to string! load %/r/hello.gz
== "hello"

>> codecs/gzip/decode #{1F8B080000000000020ACB48CDC9C9070086A6103605000000}
== #{68656C6C6F}

and (so far) read-only ZIP and TAR codecs.
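The bytes written by the GZIP codec are a standard RFC 1952 stream, so any gzip implementation can decode them. For instance, in Python (illustrative):

```python
import gzip

# The exact bytes produced by the GZIP codec in the transcript above
blob = bytes.fromhex("1F8B080000000000020ACB48CDC9C9070086A6103605000000")

# RFC 1952 layout: 10-byte header, raw DEFLATE data, CRC32, uncompressed size
print(gzip.decompress(blob))  # b'hello'
```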

@acook

acook commented Apr 16, 2021

By "internal" I mean some part of Rebol's own codebase relies on it heavily, not how it's used by you or me outside of the main distribution.

The idea of making compress a port with various algorithms extendable at compile time and/or runtime is interesting. If that worked for streaming data (not just files) that could be really useful and versatile.

@Oldes
Owner Author

Oldes commented Oct 18, 2021

Meanwhile in Red:

compress: make native! [[
    "Compresses data" 
    data [any-string! binary!] 
    method [word!] "zlib deflate gzip" 
    return: [binary!]
]]

decompress: make native! [[
    "Decompresses data" 
    data [binary!] 
    method [word!] "zlib deflate gzip" 
    /size {Specify an uncompressed data size (ignored for GZIP)} 
    sz [integer!] "Uncompressed data size; must not be negative" 
    return: [binary!]
]]

Oldes closed this as completed in 2cd3c8f Oct 19, 2021
@Oldes
Owner Author

Oldes commented Oct 19, 2021

With the above commit it's now:

compress: make native! [[
    "Compresses data."
    data [binary! string!] "If string, it will be UTF8 encoded"
    method [word!] "zlib deflate gzip lzma"
    /part length "Length of source data"
    /level lvl [integer!] "Compression level 0-9"
]]

decompress: make native! [[
    "Decompresses data."
    data [binary!] "Source data to decompress"
    method [word!] "zlib deflate gzip lzma"
    /part "Limits source data to a given length or position"
    length [number! series!] "Length of compressed data (must match end marker)"
    /size
    bytes [integer!] "Number of uncompressed bytes."
]]
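The method-as-argument design keeps a single entry point and lets new algorithms plug in as table entries rather than new refinements. A hypothetical Python sketch of the same dispatch idea (the names and table structure are mine, not Rebol's):

```python
import gzip
import lzma
import zlib

def _raw_deflate(data: bytes) -> bytes:
    # Negative wbits selects a raw RFC 1951 DEFLATE stream (no header/checksum)
    c = zlib.compressobj(wbits=-15)
    return c.compress(data) + c.flush()

# One table per direction; adding an algorithm means adding an entry,
# not adding a refinement to the function signature
COMPRESS = {
    "zlib":    zlib.compress,
    "gzip":    gzip.compress,
    "lzma":    lzma.compress,
    "deflate": _raw_deflate,
}
DECOMPRESS = {
    "zlib":    zlib.decompress,
    "gzip":    gzip.decompress,
    "lzma":    lzma.decompress,
    "deflate": lambda d: zlib.decompress(d, -15),
}

def compress(data: bytes, method: str) -> bytes:
    return COMPRESS[method](data)

def decompress(data: bytes, method: str) -> bytes:
    return DECOMPRESS[method](data)
```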

Oldes added a commit that referenced this issue Mar 4, 2023
FIX: 32bit builds crashes when struct is being recycled