Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-40286: Add randbytes() method to random.Random #19527

Merged
merged 1 commit into from
Apr 17, 2020
Merged

bpo-40286: Add randbytes() method to random.Random #19527

merged 1 commit into from
Apr 17, 2020

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Apr 14, 2020

Add random.randbytes() function and random.Random.randbytes()
method to generate random bytes.

Modify secrets.token_bytes() to use SystemRandom.randbytes()
rather than calling directly os.urandom().

Rename also genrand_int32() to genrand_uint32(), since it returns an
unsigned 32-bit integer, not a signed integer.

https://bugs.python.org/issue40286

@vstinner
Copy link
Member Author

TODO: handle properly inheritance. Currently, random.Random is kind of weird. The base class directly implements Mersenne Twister, it's not an abstract class. Maybe a getrandbytes() implementation using getrandbits() should be implemented for subclasses?

cc @pitrou @serhiy-storchaka

@vstinner
Copy link
Member Author

I'm not sure if my Mersenne Twister implementation is correct. Not sure if it should handle endianness differently for example.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your MT implementation is fine. I don't think we guarantee identical results on different architectures. Otherwise you'll need to choose conventional endianness indeed. @mdickinson

I'm not sure if we should really care about subclassing here other than the classes bundled with Python. Otherwise you'll need to do a bit of plumbing in __init_subclass__.

Lib/random.py Outdated Show resolved Hide resolved
Modules/_randommodule.c Outdated Show resolved Hide resolved
Modules/_randommodule.c Outdated Show resolved Hide resolved
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Allow n = 0.
  • The result should be deterministic and do not depend on endianess. Ideally it should be identical to getrandbits(n*8).to_bytes(n, 'little') (or 'big').
  • See also Raymod's suggestion about the name.

Would be nice to add the following tests:

  • Call seek() with some constant and compare the result of getrandbytes() with a constant.
  • Compare the results of several sequential calls of getrandbits() and getrandbytes() for identical Random objects (or after seek() with the same seed).

Comment on lines 542 to 545
size_t chunk = Py_MIN(n, 4);
memcpy(ptr, &word, chunk);
ptr += chunk;
n -= chunk;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write

Suggested change
size_t chunk = Py_MIN(n, 4);
memcpy(ptr, &word, chunk);
ptr += chunk;
n -= chunk;
if (n < 4) {
memcpy(ptr, &word, n);
break;
}
memcpy(ptr, &word, 4);
ptr += 4;
n -= 4;

memcpy() and incrementing/decrementing with constant size may be faster.

Also do not forget about endianess.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write (...)

Done.

Also do not forget about endianess.

I don't know how to handle the endianness. Should use htonl() to convert word to big endian?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can use htonl() or swap bytes manually, as it is done in many places in CPython for performace. Search WORDS_BIGENDIAN, PY_BIG_ENDIAN and PY_LITTLE_ENDIAN. Perhaps the most efficient way is:

x = ...
x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
x = ((x & 0x0000FFFFu) << 16) | ((x >> 16) & 0x0000FFFFu));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created https://bugs.python.org/issue40302 "Add _Py_bswap32() function to pyport.h".

@vstinner vstinner changed the title bpo-40286: Add getrandbytes() method to random.Random bpo-40286: Add randbytes() method to random.Random Apr 15, 2020
@vstinner
Copy link
Member Author

I updated my PR to rename the method to randbytes().

@vstinner
Copy link
Member Author

Allow n = 0.

Done.


Generate *n* random bytes.

If *n* is equal to zero, always returns an empty string for convenience.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think it is needed. It may confuse the user if the degenerate case be explicitly mentioned for one and only one method.

Copy link
Member Author

@vstinner vstinner Apr 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of surprising that a function documented as "generate random bytes" returns a deterministic (non random) result, so I consider that it's worth it to document it.

... explicitly mentioned for one and only one method.

What do you mean? Currently, only randbytes() has this convenience behavior no? I asked Antoine to add a similar sentence for his getrandbits(0) PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

randrange(1, 2), uniform(10, 10) and choice([7]) all return deterministic result. I think that most if not all of random functions can return deterministic result for some arguments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I remove this sentence.

Lib/random.py Outdated Show resolved Hide resolved
Modules/_randommodule.c Show resolved Hide resolved
@vstinner
Copy link
Member Author

Maybe I should wait until https://bugs.python.org/issue40302 "Add _Py_bswap32() function to pyport.h" is solved, rather than using htonl(). I'm not 100% sure that htonl() uses uint32_t on all platforms. It may use "long" type on some platforms and long may be larger than uint32_t.

@serhiy-storchaka
Copy link
Member

htonl() should use 32-bit integers. The name is historical. But it is better to wait for _Py_bswap32(). AFAIK htonl() is not so optimized on all platforms.

Lib/test/test_random.py Outdated Show resolved Hide resolved
Lib/test/test_random.py Outdated Show resolved Hide resolved
@serhiy-storchaka
Copy link
Member

If genrand_uint32() produces numbers 0x01234567 and 0x89abcdef, then getrandbits(6*8) will return 0x89ab01234567, but randbytes(6) will return bytes.fromhex('0123456789ab'). It combines bits in different order.

@vstinner
Copy link
Member Author

If genrand_uint32() produces numbers 0x01234567 and 0x89abcdef, then getrandbits(6*8) will return 0x89ab01234567, but randbytes(6) will return bytes.fromhex('0123456789ab'). It combines bits in different order.

Is it a question? Are you asking to change the result?

@pitrou
Copy link
Member

pitrou commented Apr 17, 2020

FWIW, I don't think there is a problem if two different methods produce different results.

@serhiy-storchaka
Copy link
Member

It would be nice if there is a simple relation between getrandbits() and randbytes(). This would help to implement a fallback for older Python versions.

@vstinner
Copy link
Member Author

It would be nice if there is a simple relation between getrandbits() and randbytes(). This would help to implement a fallback for older Python versions.

Here you have:

import random

def randbytes(gen, n):
    data = bytearray(n)
    i = 0
    while i < n:
        chunk = 4
        word = gen.getrandbits(32)
        word = word.to_bytes(4, 'big')
        chunk = min(n, 4)
        data[i:i+chunk] = word[:chunk]
        i += chunk
    return bytes(data)

gen = random.Random()
gen.seed(8675309)
assert randbytes(gen, 16) == b'f\xf9\xa836\xd0\xa4\xf4\x82\x9f\x8f\x19\xf0eo\x02'

The last line (test) is copied from the unit test of this PR: Python 3.7 with this script and Python 3.9 with my PR produces the same bytes.

@vstinner
Copy link
Member Author

I rebased my PR to fix a merge conflict, I fixed the NEWS entry and added more tests to address Serhiy's latest review.

@serhiy-storchaka
Copy link
Member

I think that the following implementation would be simpler:

def randbytes(gen, n):
    return gen.getrandbits(n * 8).to_bytes(n, 'little')

It is actually the most efficient code that is used now in cases when you need randbytes().

@vstinner
Copy link
Member Author

I prefer to design randbytes() to be efficient and reliable, rather than trying to be compatible with various recipes to generate bytes currently used in the wild.

Serhiy:

I think that the following implementation would be simpler: (...)

People already worked around the lack of randbytes() and have their own implementation, see: https://bugs.python.org/issue40286 for various recipes.

Antoine:

FWIW, I don't think there is a problem if two different methods produce different results.

Yeah, I don't think that we should bother with compatibility with users creative ways to generate random bytes.

For example, some people use sha1(str(random.getrandbits(8*20))).digest() which gives a completely different result than randbyte() of this PR.

Some other people prefer an "explicit" bytes(random.randint(0, 255) for i in range(size)) which consumes 4x more entropy than my proposed implementation: each Python randint() call Mersenne Twister genrand_int32() which consumes 32-bits at once, even if only 8 bits are used.

Lib/random.py Outdated Show resolved Hide resolved
Add random.randbytes() function and random.Random.randbytes()
method to generate random bytes.

Modify secrets.token_bytes() to use SystemRandom.randbytes()
rather than calling directly os.urandom().

Rename also genrand_int32() to genrand_uint32(), since it returns an
unsigned 32-bit integer, not a signed integer.

The _random module is now built with Py_BUILD_CORE_MODULE defined.
@vstinner
Copy link
Member Author

I rebased my PR to get https://bugs.python.org/issue40302 _Py_bswap32() and squashed commits. I replaced the htohl() call with:

#if PY_LITTLE_ENDIAN
        /* Convert to big endian */
        word = _Py_bswap32(word);
#endif

@vstinner vstinner merged commit 9f5fe79 into python:master Apr 17, 2020
@vstinner vstinner deleted the getrandbytes branch April 17, 2020 17:05
@vstinner
Copy link
Member Author

Thanks for the review @serhiy-storchaka and @pitrou!

CuriousLearner added a commit to CuriousLearner/cpython that referenced this pull request May 29, 2020
* master: (1985 commits)
  bpo-40179: Fix translation of #elif in Argument Clinic (pythonGH-19364)
  bpo-35967: Skip test with `uname -p` on Android (pythonGH-19577)
  bpo-40257: Improve help for the typing module (pythonGH-19546)
  Fix two typos in multiprocessing (pythonGH-19571)
  bpo-40286: Use random.randbytes() in tests (pythonGH-19575)
  bpo-40286: Makes simpler the relation between randbytes() and getrandbits() (pythonGH-19574)
  bpo-39894: Route calls from pathlib.Path.samefile() to os.stat() via the path accessor (pythonGH-18836)
  bpo-39897: Remove needless `Path(self.parent)` call, which makes `is_mount()` misbehave in `Path` subclasses. (pythonGH-18839)
  bpo-40282: Allow random.getrandbits(0) (pythonGH-19539)
  bpo-40302: UTF-32 encoder SWAB4() macro use a|b rather than a+b (pythonGH-19572)
  bpo-40302: Replace PY_INT64_T with int64_t (pythonGH-19573)
  bpo-40286: Add randbytes() method to random.Random (pythonGH-19527)
  bpo-39901: Move `pathlib.Path.owner()` and `group()` implementations into the path accessor. (pythonGH-18844)
  bpo-40300: Allow empty logging.Formatter.default_msec_format. (pythonGH-19551)
  bpo-40302: Add pycore_byteswap.h header file (pythonGH-19552)
  bpo-40287: Fix SpooledTemporaryFile.seek() return value (pythonGH-19540)
  Minor modernization and readability improvement to the tokenizer example (pythonGH-19558)
  bpo-40294: Fix _asyncio when module is loaded/unloaded multiple times (pythonGH-19542)
  Fix parameter names in assertIn() docs (pythonGH-18829)
  bpo-39793: use the same domain on make_msgid tests (python#18698)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants