Testing Infrastructure

Download the archive of our testing infrastructure or clone it as follows:

.../ae $ git clone https://github.com/plum-umd/adaptive-concretization.git

This infrastructure is tested under Python 2.7.x. If you are using Python 3.2/3.3, replace the use of subprocess32 at line 8 of psketch.py with subprocess. If you are using Python 2.x and want to examine the back-end behavior for various randdegrees (which corresponds to Table 1 in the paper), you need to install subprocess32. If your Python is older than 2.7.9 or 3.4 (the first versions that bundle pip), you need to install pip first.
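
If you would rather not edit psketch.py by hand, an import fallback along the following lines works on both Python 2 (with subprocess32 installed) and Python 3; this is only a sketch, not the code shipped in psketch.py:

# Prefer the subprocess32 backport when available; otherwise fall back to
# the standard subprocess module.
try:
    import subprocess32 as subprocess
except ImportError:
    import subprocess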

On Debian and Ubuntu,

$ sudo apt-get install python-dev python-pip

On Mac OS X,

$ sudo easy_install pip

On other OSes, download get-pip.py and run the following (which may require administrator access):

$ (sudo) python get-pip.py

Once pip is ready, install subprocess32:

$ sudo pip install subprocess32

To post-analyze Sketch output, we use SciPy (along with NumPy and matplotlib).

On Debian and Ubuntu,

$ sudo apt-get install python-numpy python-scipy python-matplotlib

On Mac OS X,

$ sudo pip install numpy scipy matplotlib

Database setup

We thoroughly tested our algorithm by running Sketch over 12K times across different sets of benchmarks, different degrees, and different numbers of cores. Needless to say, it would be very painful to run the test cases, post-analyze the outputs, and build tables like Tables 2 and 3 in the paper by hand. The main purpose of this testing infrastructure is to conduct such intensive experimental evaluations in an automated way. Our first step towards that automation is using a database to save and retrieve experiment results easily. (You can still see the raw data under the data/ folder.)

On Debian and Ubuntu,

$ sudo apt-get install mysql-server mysql-client

On Mac OS X,

$ brew install mysql
$ mysql.server restart

The prompt (or installer GUI) will ask you for a password for the MySQL root user. (This is not the system root password.) Type whatever password you want; note that it will be used later.

Enter current password for root (enter for none):

OK, successfully used password, moving on...

Once you've installed MySQL, you need to initialize it and run the MySQL secure-installation script. (Mac users with brew can skip to the second command below.)

$ sudo mysql_install_db
$ sudo /usr/bin/mysql_secure_installation

The script will then ask about several options you can configure. Since we are using this database for experimental purposes only, we recommend setting it up in the most secure manner, e.g., no anonymous access and no remote access; the easiest way is simply to answer Yes to all options (except for changing the root password):

...
Remove anonymous users? [Y/n] Y
 ... Success!

...
Disallow root login remotely? [Y/n] Y
 ... Success!

...
Remove test database and access to it? [Y/n] Y
 - Dropping test database...
 ... Success!
 - Removing privileges on test database...
 ... Success!

...
Reload privilege tables now? [Y/n] Y
 ... Success!

Connect to the DB server and create an account for this experiment:

$ mysql --user=root -p -h 127.0.0.1 mysql
password: *******
mysql> CREATE USER 'sketchperf'@'localhost';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'sketchperf'@'localhost';
mysql> exit
Bye

Connect to the DB server with that account and create a database:

$ mysql --user=sketchperf -h 127.0.0.1
mysql> CREATE DATABASE concretization;
mysql> SHOW DATABASES;
mysql> exit
Bye

From now on, you can connect to that database like this:

$ mysql --user=sketchperf -h 127.0.0.1 concretization

If you want to use different account and database names, pass the --user and --db options to db.py accordingly:

.../ae/adaptive-concretization $ ./db.py --user user_name --db db_name ...

To connect to the database from Python, you need to install the MySQL connector.

On Debian and Ubuntu,

$ sudo apt-get install python-mysql.connector

On Mac OS X,

$ sudo pip install --allow-external mysql-connector-python mysql-connector-python
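
Once the connector is installed, a quick sanity check along the following lines (just a sketch; it assumes the sketchperf account and concretization database created above) confirms that Python can reach the database:

# Quick connectivity check for the experiment database (illustrative only).
import mysql.connector

cnx = mysql.connector.connect(user="sketchperf", host="127.0.0.1",
                              database="concretization")
cur = cnx.cursor()
cur.execute("SHOW TABLES")    # empty until `./db.py -c init` has been run
for (table,) in cur:
    print(table)
cnx.close()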

Inside the database, we will have the following tables:

  1. Experiment: groups each experiment
  2. RunS: records single-threaded runs (for Table 1)
  3. RunP: records (strategical) parallel runs (for Tables 2 and 3)
  4. Dag: records DAG sizes for every harness
  5. Hole: records hole concretization

Experiment is simply a pair of EID and RID, i.e., runs that have the same EID are grouped into one experiment. All statistics will be calculated per experiment. Dag and Hole are connected to RunS via RID; these were used for our preliminary experiments and are not included in the paper. Refer to the comments around lines 21--45 in db.py for the schemas of the aforementioned tables.
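
For instance, assuming the EID/RID layout described above (consult db.py for the authoritative schema and column names), single-threaded runs can be grouped per experiment with a query along these lines:

# Group single-threaded runs by experiment via the shared EID/RID keys.
# Table and column names follow the description above; this is a sketch,
# not a query taken from db.py.
import mysql.connector

cnx = mysql.connector.connect(user="sketchperf", host="127.0.0.1",
                              database="concretization")
cur = cnx.cursor()
cur.execute("SELECT E.EID, COUNT(*) "
            "FROM Experiment E JOIN RunS R ON E.RID = R.RID "
            "GROUP BY E.EID")
for eid, n_runs in cur:
    print("%s: %s runs" % (eid, n_runs))
cnx.close()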

You can easily (re)define and clean the tables in that database:

.../ae/adaptive-concretization $ ./db.py -c init [-v]
.../ae/adaptive-concretization $ ./db.py -c clean [-v]
.../ae/adaptive-concretization $ ./db.py -c reset [-v]

You can also see the detailed queries by running db.py with the verbosity option -v. (This option is applicable to all the other commands explained below.) All of these commands are guarded by either IF EXISTS or IF NOT EXISTS, and thus they are safe to execute under any conditions.

Experiment (short version)

run.py has a number of options through which you can specify benchmark(s), the number of cores, randdegree(s) of interest, and which strategies (such as vanilla Sketch, fixed randdegree, or adaptive concretization) to use. For your (and our) convenience, there are configuration files in which we can configure all the experiments we want to conduct: config.short.json (for the short version of the experiments) is a subset of config.full.json (for the full version), and run.py uses config.short.json by default. One thing that is not specified in those configuration files is the number of times to repeat the experiments: -r n.

All Sketch output files will be placed under the data/ folder. You can check those outputs by running post.py; register the post-analyzed outputs in the database via the -c register command of db.py; or run and register at the same time (the --register option of run.py). Once the experiments are done, you can calculate the statistics via the -c stat command of db.py. (All detailed usages will be explained again with concrete examples below.)

If you only want to run the experiments specified in the configuration, just follow these instructions:

.../ae/adaptive-concretization $ ./clean.sh
.../ae/adaptive-concretization $ ./run.py -r 3
.../ae/adaptive-concretization $ ./run.py -r 3 --vanilla
.../ae/adaptive-concretization $ ./db.py -c register -s
.../ae/adaptive-concretization $ ./db.py -c register
.../ae/adaptive-concretization $ ./db.py -c stat -s
.../ae/adaptive-concretization $ ./db.py -c stat

The second-to-last command will retrieve the statistics for Table 1, while the last command will show you the statistics for Tables 2 and 3. For non-adaptive concretization, which is the last column in Table 2, please refer to the last part of the experiments below. The following instructions explain the detailed experiments step by step.

The first experiment examines Sketch back-end behavior with various randdegrees, to support the hypothesis that the optimal randdegree varies from benchmark to benchmark. run.py has an option, -s, that runs back-end instances in parallel via a Python wrapper: psketch.py. That wrapper simply forks the number of back-end instances you specify and lets them run without any interruption; that is, it does nothing clever to manage them, e.g., stopping all other instances once one finds a solution. (In contrast, parallel running via the Sketch front-end, which is used for all the other experiments below, manages the back-end instances and stops all active instances once one finds a solution.) In this way, we can properly collect empirical success rates.
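
The gist of that fork-and-wait behavior can be sketched as follows. This is hypothetical, highly simplified code, not the actual psketch.py, which also builds the Sketch command lines and handles timeouts and output collection:

import subprocess32 as subprocess   # or plain `subprocess` on Python 3

def run_backends(backend_cmd, n):
    # backend_cmd is a placeholder for the back-end command line (a list of strings).
    procs = [subprocess.Popen(backend_cmd) for _ in range(n)]
    # Wait for *all* instances; never kill the others when one succeeds,
    # so the empirical success rate (zero exit codes out of n) is meaningful.
    return [p.wait() for p in procs]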

.../ae/adaptive-concretization $ ./clean.sh
.../ae/adaptive-concretization $ ./run.py -s -r n

You can run particular benchmarks with particular degrees:

.../ae/adaptive-concretization $ ./run.py -s -r n [-b name]* [-d degree]*
.../ae/adaptive-concretization $ ./run.py -s -r 1024 -b p_color -b deriv2 -d 64 -d 1024

Check and register Sketch outputs as follows. (Notice that the -s option is required for the other scripts, too.)

.../ae/adaptive-concretization $ ./post.py -s
.../ae/adaptive-concretization $ ./db.py -c register -s [-e EID]

As mentioned earlier, you can run and register at the same time, but this is applicable to the -s option only:

.../ae/adaptive-concretization $ ./run.py -s -r n --register

You can retrieve the statistics from the database:

.../ae/adaptive-concretization $ ./db.py -c stat -s [-e EID] [--detail] [-v]

The next experiment concerns the parallel scalability of our algorithm. To obtain statistics about plain Sketch (as in Tables 2 and 3), use the --vanilla option of run.py.

.../ae/adaptive-concretization $ ./run.py --vanilla -r n [...]
.../ae/adaptive-concretization $ ./post.py
.../ae/adaptive-concretization $ ./db.py -c register [-e EID]
.../ae/adaptive-concretization $ ./db.py -c stat [-e EID]

To obtain statistics about our algorithm, specify the strategy: --strategy WILCOXON. Otherwise, run.py will repeat the first experiment using the Sketch front-end's parallel running. (This is done so that all experiment settings are encompassed in one place.)

.../ae/adaptive-concretization $ ./run.py --strategy WILCOXON -r n [...]
(The post-analysis, registration, and statistics steps are the same as above.)
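
The WILCOXON strategy takes its name from the statistical test used to compare candidate degrees during the adaptation phase. Purely as an illustration of that kind of comparison (this is not the front-end's implementation, and the timing samples below are made up), SciPy exposes such a test directly:

# Illustrative only: compare paired running-time samples of two candidate
# degrees with a Wilcoxon signed-rank test.  Hypothetical data.
from scipy import stats

times_d64   = [4.2, 3.9, 5.1, 4.8, 4.0, 4.4, 5.3, 4.1]   # degree 64 (made up)
times_d1024 = [2.9, 3.1, 6.0, 2.8, 3.3, 3.0, 5.7, 3.2]   # degree 1024 (made up)

stat, pvalue = stats.wilcoxon(times_d64, times_d1024)
if pvalue < 0.05:
    print("significant difference: prefer the faster degree")
else:
    print("no significant difference yet: keep sampling")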

From the statistics, you can learn which degree was chosen most often during the adaptation phase. To test the efficiency of the adaptation phase, we can run Sketch with that particular degree, as if we had known it in advance:

.../ae/adaptive-concretization $ ./run.py --strategy NOT_SET -d degree -r n [...]
(The post-analysis, registration, and statistics steps are the same as above.)

Experiment (full version)

The only difference between the short and full versions of the experiments is how many (big) benchmarks are used. (If you skipped the previous section, please go back and read through it.) Simply put, config.full.json includes all the benchmarks we used for our paper. To make run.py use a different configuration file, pass the following option: --config config.full.json. For example,

.../ae/adaptive-concretization $ ./run.py --config config.full.json -r n [--timeout t] [...]
(The post-analysis, registration, and statistics steps are the same as above.)

Notice that you may need to set a timeout because Sketch might take too long in some cases (e.g., plain Sketch on p_menu). In sum, the simplest way to conduct the experiments is to use the configuration as follows:

.../ae/adaptive-concretization $ ./clean.sh
.../ae/adaptive-concretization $ ./run.py --config config.full.json -r 13
.../ae/adaptive-concretization $ ./run.py --config config.full.json -r 13 --vanilla
.../ae/adaptive-concretization $ ./db.py -c register -s
.../ae/adaptive-concretization $ ./db.py -c register
.../ae/adaptive-concretization $ ./db.py -c stat -s
.../ae/adaptive-concretization $ ./db.py -c stat

Output

Once Sketch outputs have been collected, post-analyzed, and registered, we can retrieve detailed statistics from the database. The first experiment, about back-end behavior for various randdegrees, shows that the optimal degree varies from benchmark to benchmark. Statistics about this experiment (with the -s option) look like the following:

benchmark: name, degree: d
success rate: x / n
lucky runs' solution time: Mean (Var)
        quantiles: [ 0% | 25% | 50% | 75% | 100% ]
unlucky runs' solution time: Mean (Var)
        quantiles: [ 0% | 25% | 50% | 75% | 100% ]
...
 & \timeout{} & \timeout{} & \mso{Median}{SIQR}{} & ...

where SIQR stands for Semi-InterQuartile Range. For each benchmark and degree, we measured the empirical success rate and the elapsed times of both successful and failed runs. Based on those, we calculated expected running times and printed them out in TeX format: \timeout{} means a zero success rate, and \mso{}{}{} is one of our own TeX commands that represents the median and SIQR together.
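
For reference, the median and SIQR in this output can be recomputed from a raw sample of running times as in the small NumPy sketch below (the reported numbers themselves come from post.py and db.py):

# Median and semi-interquartile range (SIQR) of a sample of running times.
# Illustrative helper only.
import numpy as np

def median_siqr(times):
    q1, med, q3 = np.percentile(times, [25, 50, 75])
    return med, (q3 - q1) / 2.0   # SIQR = half the interquartile range

# e.g. median_siqr([3.2, 4.1, 4.4, 5.0, 9.7]) -> (4.4, 0.45)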

The second experiment compares our adaptive concretization against plain Sketch and shows its scalability by varying the number of cores. Statistics about this experiment look like the following:

benchmark: name, core: c, strategy: S
success rate: x / n
TRIAL: Mean (Var)
        quantiles: [ 0% | 25% | 50% | 75% | 100% ]
CTIME: Mean (Var)
        quantiles: [ 0% | 25% | 50% | 75% | 100% ]
FTIME: Mean (Var)
        quantiles: [ 0% | 25% | 50% | 75% | 100% ]
STIME: Mean (Var)
        quantiles: [ 0% | 25% | 50% | 75% | 100% ]
TTIME: Mean (Var)
        quantiles: [ 0% | 25% | 50% | 75% | 100% ]
DEGREE: Mean (Var)
        quantiles: [ 0% | 25% | 50% | 75% | 100% ]
degree histogram
  d: n (n')
  ...

where FTIME, STIME, TTIME, and CTIME stand for failed, successful, wall-clock, and CPU times, respectively. (CTIME is missing in the dumped database for the submitted version.) In the degree histogram, n (n') means that n' out of n selections were successful.

At the end, the script prints out per-benchmark statistics in TeX format:

Strategy1 & degree1 & column1 & \mso{Median}{SIQR}{}
Strategy1 & degree1 & column2 & \mso{Median}{SIQR}{}
...
Strategy1 & degree2 & column1 & \mso{Median}{SIQR}{}
Strategy1 & degree2 & column2 & \mso{Median}{SIQR}{}
...
Strategy2 & degree1 & column1 & \mso{Median}{SIQR}{}
...

where the possible strategies are VANILLA, FIXED, and WILCOXON.

Raw Data

You can download our database dumps and inspect the data inside them:

$ mysql --user=sketchperf -h 127.0.0.1
mysql> CREATE DATABASE new_db_name;
mysql> exit
Bye
$ mysql -u sketchperf new_db_name < dump.sql