wo competing objectives: 1) A small block size results in more replicates and generally better statistics. 2) A large block size better amortizes the cost of `timer` invocation, and results in a less biased measurement. This is important because CUDA synchronization time is non-trivial (order single to low double digit microseconds) and would otherwise bias the measurement. blocked_autorange sets block_size by running a warmup period, increasing block size until timer overhead is less than 0.1% of the overall computation. This value is then used for the main measurement loop. Returns: A `Measurement` object that contains measured runtimes and repetition counts, and can be used to compute statistics. (mean, median, etc.) r