benchmark/tools/gbench/report.py
Roman Lebedev a6a1b0d765 Benchmarking is hard. Making sense of the benchmarking results is even harder. (#593)
The first problem you have to solve yourself. The second one can be aided.
The benchmark library can compute some statistics over the repetitions,
which helps with grasping the results somewhat.

But that is only for the one set of results. It does not really help to compare
the two benchmark results, which is the interesting bit. Thankfully, there are
these bundled `tools/compare.py` and `tools/compare_bench.py` scripts.

They can provide a diff between two benchmarking results. Yay!
Except not really, it's just a diff, while it is very informative and better than
nothing, it does not really help answer The Question - am i just looking at the noise?
It's like not having these per-benchmark statistics...

Roughly, we can formulate the question as:
> Are these two benchmarks the same?
> Did my change actually change anything, or is the difference below the noise level?

Well, this really sounds like a [null hypothesis](https://en.wikipedia.org/wiki/Null_hypothesis), does it not?
So maybe we can use statistics here, and solve all our problems?
lol, no, it won't solve all the problems. But maybe it will act as a tool,
to better understand the output, just like the usual statistics on the repetitions...

I'm making an assumption here that most of the people care about the change
of average value, not the standard deviation. Thus i believe we can use T-Test,
be it either [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test), or [Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test).
**EDIT**: however, after @dominichamon review, it was decided that it is better
to use more robust [Mann–Whitney U test](https://en.wikipedia.org/wiki/Mann–Whitney_U_test)
I'm using [scipy.stats.mannwhitneyu](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html#scipy.stats.mannwhitneyu).

There are two new user-facing knobs:
```
$ ./compare.py --help
usage: compare.py [-h] [-u] [--alpha UTEST_ALPHA]
                  {benchmarks,filters,benchmarksfiltered} ...

versatile benchmark output compare tool
<...>
optional arguments:
  -h, --help            show this help message and exit

  -u, --utest           Do a two-tailed Mann-Whitney U test with the null
                        hypothesis that it is equally likely that a randomly
                        selected value from one sample will be less than or
                        greater than a randomly selected value from a second
                        sample. WARNING: requires **LARGE** (9 or more)
                        number of repetitions to be meaningful!
  --alpha UTEST_ALPHA   significance level alpha. if the calculated p-value is
                        below this value, then the result is said to be
                        statistically significant and the null hypothesis is
                        rejected. (default: 0.0500)
```

Example output:
![screenshot_20180512_175517](https://user-images.githubusercontent.com/88600/39958581-ae897924-560d-11e8-81b9-806db6c3e691.png)
As you can guess, the alpha does affect anything but the coloring of the computed p-values.
If it is green, then the change in the average values is statistically-significant.

I'm detecting the repetitions by matching name. This way, no changes to the json are _needed_.
Caveats:
* This won't work if the json is not in the same order as outputted by the benchmark,
   or if the parsing does not retain the ordering.
* This won't work if after the grouped repetitions there isn't at least one row with
  different name (e.g. statistic). Since there isn't a knob to disable printing of statistics
  (only the other way around), i'm not too worried about this.
* **The results will be wrong if the repetition count is different between the two benchmarks being compared.**
* Even though i have added (hopefully full) test coverage, the code of these python tools is staring
  to look a bit jumbled.
* So far i have added this only to the `tools/compare.py`.
  Should i add it to `tools/compare_bench.py` too?
  Or should we deduplicate them (by removing the latter one)?
2018-05-29 11:13:28 +01:00

325 lines
12 KiB
Python

"""report.py - Utilities for reporting statistics about benchmark results
"""
import os
import re
import copy
from scipy.stats import mannwhitneyu
class BenchmarkColor(object):
def __init__(self, name, code):
self.name = name
self.code = code
def __repr__(self):
return '%s%r' % (self.__class__.__name__,
(self.name, self.code))
def __format__(self, format):
return self.code
# Benchmark Colors Enumeration
BC_NONE = BenchmarkColor('NONE', '')
BC_MAGENTA = BenchmarkColor('MAGENTA', '\033[95m')
BC_CYAN = BenchmarkColor('CYAN', '\033[96m')
BC_OKBLUE = BenchmarkColor('OKBLUE', '\033[94m')
BC_OKGREEN = BenchmarkColor('OKGREEN', '\033[32m')
BC_HEADER = BenchmarkColor('HEADER', '\033[92m')
BC_WARNING = BenchmarkColor('WARNING', '\033[93m')
BC_WHITE = BenchmarkColor('WHITE', '\033[97m')
BC_FAIL = BenchmarkColor('FAIL', '\033[91m')
BC_ENDC = BenchmarkColor('ENDC', '\033[0m')
BC_BOLD = BenchmarkColor('BOLD', '\033[1m')
BC_UNDERLINE = BenchmarkColor('UNDERLINE', '\033[4m')
def color_format(use_color, fmt_str, *args, **kwargs):
"""
Return the result of 'fmt_str.format(*args, **kwargs)' after transforming
'args' and 'kwargs' according to the value of 'use_color'. If 'use_color'
is False then all color codes in 'args' and 'kwargs' are replaced with
the empty string.
"""
assert use_color is True or use_color is False
if not use_color:
args = [arg if not isinstance(arg, BenchmarkColor) else BC_NONE
for arg in args]
kwargs = {key: arg if not isinstance(arg, BenchmarkColor) else BC_NONE
for key, arg in kwargs.items()}
return fmt_str.format(*args, **kwargs)
def find_longest_name(benchmark_list):
"""
Return the length of the longest benchmark name in a given list of
benchmark JSON objects
"""
longest_name = 1
for bc in benchmark_list:
if len(bc['name']) > longest_name:
longest_name = len(bc['name'])
return longest_name
def calculate_change(old_val, new_val):
"""
Return a float representing the decimal change between old_val and new_val.
"""
if old_val == 0 and new_val == 0:
return 0.0
if old_val == 0:
return float(new_val - old_val) / (float(old_val + new_val) / 2)
return float(new_val - old_val) / abs(old_val)
def filter_benchmark(json_orig, family, replacement=""):
"""
Apply a filter to the json, and only leave the 'family' of benchmarks.
"""
regex = re.compile(family)
filtered = {}
filtered['benchmarks'] = []
for be in json_orig['benchmarks']:
if not regex.search(be['name']):
continue
filteredbench = copy.deepcopy(be) # Do NOT modify the old name!
filteredbench['name'] = regex.sub(replacement, filteredbench['name'])
filtered['benchmarks'].append(filteredbench)
return filtered
def generate_difference_report(
json1,
json2,
utest=False,
utest_alpha=0.05,
use_color=True):
"""
Calculate and report the difference between each test of two benchmarks
runs specified as 'json1' and 'json2'.
"""
assert utest is True or utest is False
first_col_width = find_longest_name(json1['benchmarks'])
def find_test(name):
for b in json2['benchmarks']:
if b['name'] == name:
return b
return None
utest_col_name = "U-test (p-value)"
first_col_width = max(
first_col_width,
len('Benchmark'),
len(utest_col_name))
first_line = "{:<{}s}Time CPU Time Old Time New CPU Old CPU New".format(
'Benchmark', 12 + first_col_width)
output_strs = [first_line, '-' * len(first_line)]
last_name = None
timings_time = [[], []]
timings_cpu = [[], []]
gen = (bn for bn in json1['benchmarks']
if 'real_time' in bn and 'cpu_time' in bn)
for bn in gen:
fmt_str = "{}{:<{}s}{endc}{}{:+16.4f}{endc}{}{:+16.4f}{endc}{:14.0f}{:14.0f}{endc}{:14.0f}{:14.0f}"
special_str = "{}{:<{}s}{endc}{}{:16.4f}{endc}{}{:16.4f}"
if last_name is None:
last_name = bn['name']
if last_name != bn['name']:
MIN_REPETITIONS = 2
if ((len(timings_time[0]) >= MIN_REPETITIONS) and
(len(timings_time[1]) >= MIN_REPETITIONS) and
(len(timings_cpu[0]) >= MIN_REPETITIONS) and
(len(timings_cpu[1]) >= MIN_REPETITIONS)):
if utest:
def get_utest_color(pval):
if pval >= utest_alpha:
return BC_FAIL
else:
return BC_OKGREEN
time_pvalue = mannwhitneyu(
timings_time[0], timings_time[1], alternative='two-sided').pvalue
cpu_pvalue = mannwhitneyu(
timings_cpu[0], timings_cpu[1], alternative='two-sided').pvalue
output_strs += [color_format(use_color,
special_str,
BC_HEADER,
utest_col_name,
first_col_width,
get_utest_color(time_pvalue),
time_pvalue,
get_utest_color(cpu_pvalue),
cpu_pvalue,
endc=BC_ENDC)]
last_name = bn['name']
timings_time = [[], []]
timings_cpu = [[], []]
other_bench = find_test(bn['name'])
if not other_bench:
continue
if bn['time_unit'] != other_bench['time_unit']:
continue
def get_color(res):
if res > 0.05:
return BC_FAIL
elif res > -0.07:
return BC_WHITE
else:
return BC_CYAN
timings_time[0].append(bn['real_time'])
timings_time[1].append(other_bench['real_time'])
timings_cpu[0].append(bn['cpu_time'])
timings_cpu[1].append(other_bench['cpu_time'])
tres = calculate_change(timings_time[0][-1], timings_time[1][-1])
cpures = calculate_change(timings_cpu[0][-1], timings_cpu[1][-1])
output_strs += [color_format(use_color,
fmt_str,
BC_HEADER,
bn['name'],
first_col_width,
get_color(tres),
tres,
get_color(cpures),
cpures,
timings_time[0][-1],
timings_time[1][-1],
timings_cpu[0][-1],
timings_cpu[1][-1],
endc=BC_ENDC)]
return output_strs
###############################################################################
# Unit tests
import unittest
class TestReportDifference(unittest.TestCase):
def load_results(self):
import json
testInputs = os.path.join(
os.path.dirname(
os.path.realpath(__file__)),
'Inputs')
testOutput1 = os.path.join(testInputs, 'test1_run1.json')
testOutput2 = os.path.join(testInputs, 'test1_run2.json')
with open(testOutput1, 'r') as f:
json1 = json.load(f)
with open(testOutput2, 'r') as f:
json2 = json.load(f)
return json1, json2
def test_basic(self):
expect_lines = [
['BM_SameTimes', '+0.0000', '+0.0000', '10', '10', '10', '10'],
['BM_2xFaster', '-0.5000', '-0.5000', '50', '25', '50', '25'],
['BM_2xSlower', '+1.0000', '+1.0000', '50', '100', '50', '100'],
['BM_1PercentFaster', '-0.0100', '-0.0100', '100', '99', '100', '99'],
['BM_1PercentSlower', '+0.0100', '+0.0100', '100', '101', '100', '101'],
['BM_10PercentFaster', '-0.1000', '-0.1000', '100', '90', '100', '90'],
['BM_10PercentSlower', '+0.1000', '+0.1000', '100', '110', '100', '110'],
['BM_100xSlower', '+99.0000', '+99.0000', '100', '10000', '100', '10000'],
['BM_100xFaster', '-0.9900', '-0.9900', '10000', '100', '10000', '100'],
['BM_10PercentCPUToTime', '+0.1000', '-0.1000', '100', '110', '100', '90'],
['BM_ThirdFaster', '-0.3333', '-0.3334', '100', '67', '100', '67'],
['BM_BadTimeUnit', '-0.9000', '+0.2000', '0', '0', '0', '1'],
]
json1, json2 = self.load_results()
output_lines_with_header = generate_difference_report(
json1, json2, use_color=False)
output_lines = output_lines_with_header[2:]
print("\n".join(output_lines_with_header))
self.assertEqual(len(output_lines), len(expect_lines))
for i in range(0, len(output_lines)):
parts = [x for x in output_lines[i].split(' ') if x]
self.assertEqual(len(parts), 7)
self.assertEqual(parts, expect_lines[i])
class TestReportDifferenceBetweenFamilies(unittest.TestCase):
def load_result(self):
import json
testInputs = os.path.join(
os.path.dirname(
os.path.realpath(__file__)),
'Inputs')
testOutput = os.path.join(testInputs, 'test2_run.json')
with open(testOutput, 'r') as f:
json = json.load(f)
return json
def test_basic(self):
expect_lines = [
['.', '-0.5000', '-0.5000', '10', '5', '10', '5'],
['./4', '-0.5000', '-0.5000', '40', '20', '40', '20'],
['Prefix/.', '-0.5000', '-0.5000', '20', '10', '20', '10'],
['Prefix/./3', '-0.5000', '-0.5000', '30', '15', '30', '15'],
]
json = self.load_result()
json1 = filter_benchmark(json, "BM_Z.ro", ".")
json2 = filter_benchmark(json, "BM_O.e", ".")
output_lines_with_header = generate_difference_report(
json1, json2, use_color=False)
output_lines = output_lines_with_header[2:]
print("\n")
print("\n".join(output_lines_with_header))
self.assertEqual(len(output_lines), len(expect_lines))
for i in range(0, len(output_lines)):
parts = [x for x in output_lines[i].split(' ') if x]
self.assertEqual(len(parts), 7)
self.assertEqual(parts, expect_lines[i])
class TestReportDifferenceWithUTest(unittest.TestCase):
def load_results(self):
import json
testInputs = os.path.join(
os.path.dirname(
os.path.realpath(__file__)),
'Inputs')
testOutput1 = os.path.join(testInputs, 'test3_run0.json')
testOutput2 = os.path.join(testInputs, 'test3_run1.json')
with open(testOutput1, 'r') as f:
json1 = json.load(f)
with open(testOutput2, 'r') as f:
json2 = json.load(f)
return json1, json2
def test_utest(self):
expect_lines = []
expect_lines = [
['BM_One', '-0.1000', '+0.1000', '10', '9', '100', '110'],
['BM_Two', '+0.1111', '-0.0111', '9', '10', '90', '89'],
['BM_Two', '+0.2500', '+0.1125', '8', '10', '80', '89'],
['U-test', '(p-value)', '0.2207', '0.6831'],
['BM_Two_stat', '+0.0000', '+0.0000', '8', '8', '80', '80'],
]
json1, json2 = self.load_results()
output_lines_with_header = generate_difference_report(
json1, json2, True, 0.05, use_color=False)
output_lines = output_lines_with_header[2:]
print("\n".join(output_lines_with_header))
self.assertEqual(len(output_lines), len(expect_lines))
for i in range(0, len(output_lines)):
parts = [x for x in output_lines[i].split(' ') if x]
self.assertEqual(parts, expect_lines[i])
if __name__ == '__main__':
unittest.main()
# vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4
# kate: tab-width: 4; replace-tabs on; indent-width 4; tab-indents: off;
# kate: indent-mode python; remove-trailing-spaces modified;