Estimate Remaining Runtimes

Estimates the runtimes of jobs using the random forest implemented in ranger. Observed runtimes are retrieved from the Registry and runtimes are predicted for unfinished jobs.

The estimated remaining time is calculated in the print method. You may also pass `n` to the print method to set the number of parallel jobs; a simple Longest Processing Time (LPT) algorithm then uses it to estimate the parallel runtime.
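As an illustration of the LPT idea mentioned above, here is a minimal sketch in base R. The function `lpt_sketch` and its return structure are hypothetical names introduced for this example; it is not the code the print method runs, only the textbook heuristic: sort jobs by decreasing runtime and always give the next job to the least loaded worker.

```r
# Longest Processing Time (LPT) heuristic, illustrative only.
# `lpt_sketch` is a hypothetical helper, not part of batchtools.
lpt_sketch = function(runtimes, n) {
  load = numeric(n)                        # accumulated runtime per worker
  assignment = integer(length(runtimes))   # worker index for each job
  for (i in order(runtimes, decreasing = TRUE)) {
    w = which.min(load)                    # least loaded worker (first on ties)
    assignment[i] = w
    load[w] = load[w] + runtimes[i]
  }
  # The parallel runtime estimate is the load of the busiest worker.
  list(assignment = assignment, load = load, makespan = max(load))
}

# Six jobs with runtimes in seconds, scheduled on two workers
res = lpt_sketch(c(1500, 1100, 500, 500, 100, 100), n = 2L)
res$load
res$makespan
```

With `n = 1` the makespan is simply the sum of all runtimes, which is why a parallel estimate is only of interest for `n > 1`.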

estimateRuntimes(tab, ..., reg = getDefaultRegistry())

# S3 method for RuntimeEstimate
print(x, n = 1L, ...)

Arguments

tab	[`data.table`] Table with column “job.id” and additional columns to predict the runtime. Observed runtimes will be looked up in the registry and serve as dependent variable. All columns in `tab` except “job.id” will be passed to `ranger` as independent variables to fit the model.
...	[ANY] Additional parameters passed to `ranger`. Ignored for the `print` method.
reg	[`Registry`] Registry. If not explicitly passed, uses the default registry (see `setDefaultRegistry`).
x	[`RuntimeEstimate`] Object to print.
n	[`integer(1)`] Number of parallel jobs to assume for runtime estimation.

Value

[`RuntimeEstimate`] A list with two named elements: “runtimes” is a data.table with columns “job.id”, “runtime” (in seconds) and “type” (“estimated” if the runtime was predicted, “observed” if the runtime was observed). The other element, named “model”, contains the fitted random forest object.
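The following sketch shows how a summary could be derived from a table shaped like “runtimes”. It assumes that the done time is the sum of observed runtimes and the remaining time is the sum of estimated ones. The data frame and the helper `fmt` are hypothetical stand-ins written in base R, not the package's own print code.

```r
# Toy stand-in for the "runtimes" element: one row per job, runtime in seconds.
runtimes = data.frame(
  job.id  = 1:6,
  type    = c("observed", "observed", "estimated",
              "estimated", "estimated", "estimated"),
  runtime = c(100, 500, 1100, 1500, 100, 500)
)

done      = sum(runtimes$runtime[runtimes$type == "observed"])
remaining = sum(runtimes$runtime[runtimes$type == "estimated"])
total     = done + remaining

# Hypothetical formatter: seconds -> days, hours, minutes, seconds.
fmt = function(s) {
  sprintf("%id %02ih %02im %.1fs",
    as.integer(s %/% 86400),
    as.integer((s %% 86400) %/% 3600),
    as.integer((s %% 3600) %/% 60),
    s %% 60)
}

cat("Done     :", fmt(done), "\n")
cat("Remaining:", fmt(remaining), "\n")
cat("Total    :", fmt(total), "\n")
```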

Examples

 batchtools:::example_push_temp(1) 
# Create a simple toy registry
set.seed(1)
tmp = makeExperimentRegistry(file.dir = NA, make.default = FALSE, seed = 1)
#> No readable configuration file found
#> Created registry in '/tmp/batchtools-example/reg' using cluster functions 'Interactive'
addProblem(name = "iris", data = iris, fun = function(data, ...) nrow(data), reg = tmp)
#> Adding problem 'iris'
addAlgorithm(name = "nrow", function(instance, ...) nrow(instance), reg = tmp)
#> Adding algorithm 'nrow'
addAlgorithm(name = "ncol", function(instance, ...) ncol(instance), reg = tmp)
#> Adding algorithm 'ncol'
addExperiments(algo.designs = list(nrow = data.table::CJ(x = 1:50, y = letters[1:5])), reg = tmp)
#> Adding 250 experiments ('iris'[1] x 'nrow'[250] x repls[1]) ...
addExperiments(algo.designs = list(ncol = data.table::CJ(x = 1:50, y = letters[1:5])), reg = tmp)
#> Adding 250 experiments ('iris'[1] x 'ncol'[250] x repls[1]) ...

# We use the job parameters to predict runtimes
tab = unwrap(getJobPars(reg = tmp))

# First we need to submit some jobs so that the forest can train on some data.
# Thus, we just sample some jobs from the registry while grouping by factor variables.
library(data.table)
ids = tab[, .SD[sample(nrow(.SD), 5)], by = c("problem", "algorithm", "y")]
setkeyv(ids, "job.id")
submitJobs(ids, reg = tmp)
#> Submitting 50 jobs in 50 chunks using cluster functions 'Interactive' ...
waitForJobs(reg = tmp)
#> [1] TRUE

# We "simulate" some more realistic runtimes here to demonstrate the functionality:
# - Algorithm "ncol" is 5 times more expensive than "nrow"
# - x has no effect on the runtime
# - If y is "a" or "b", the runtimes are really high
runtime = function(algorithm, x, y) {
  ifelse(algorithm == "nrow", 100L, 500L) + 1000L * (y %in% letters[1:2])
}
tmp$status[ids, done := done + tab[ids, runtime(algorithm, x, y)]]
#>      job.id def.id  submitted    started       done error mem.used resource.id
#>   1:      1      1         NA         NA         NA  <NA>       NA          NA
#>   2:      2      2         NA         NA         NA  <NA>       NA          NA
#>   3:      3      3         NA         NA         NA  <NA>       NA          NA
#>   4:      4      4         NA         NA         NA  <NA>       NA          NA
#>   5:      5      5         NA         NA         NA  <NA>       NA          NA
#>  ---                                                                          
#> 496:    496    496         NA         NA         NA  <NA>       NA          NA
#> 497:    497    497         NA         NA         NA  <NA>       NA          NA
#> 498:    498    498         NA         NA         NA  <NA>       NA          NA
#> 499:    499    499 1603265964 1603265964 1603266464  <NA>       NA           1
#> 500:    500    500         NA         NA         NA  <NA>       NA          NA
#>           batch.id log.file                            job.hash job.name repl
#>   1:          <NA>     <NA>                                <NA>     <NA>    1
#>   2:          <NA>     <NA>                                <NA>     <NA>    1
#>   3:          <NA>     <NA>                                <NA>     <NA>    1
#>   4:          <NA>     <NA>                                <NA>     <NA>    1
#>   5:          <NA>     <NA>                                <NA>     <NA>    1
#>  ---                                                                         
#> 496:          <NA>     <NA>                                <NA>     <NA>    1
#> 497:          <NA>     <NA>                                <NA>     <NA>    1
#> 498:          <NA>     <NA>                                <NA>     <NA>    1
#> 499: cfInteractive     <NA> joba17949c0e00e62405c8465e973297f1c     <NA>    1
#> 500:          <NA>     <NA>                                <NA>     <NA>    1
rjoin(sjoin(tab, ids), getJobStatus(ids, reg = tmp)[, c("job.id", "time.running")])
#>     job.id problem algorithm  x y   time.running
#>  1:     32    iris      nrow  7 b 1100.0026 secs
#>  2:     42    iris      nrow  9 b 1100.0024 secs
#>  3:     47    iris      nrow 10 b 1100.0023 secs
#>  4:     66    iris      nrow 14 a 1100.0052 secs
#>  5:     73    iris      nrow 15 c  100.0023 secs
#>  6:     75    iris      nrow 15 e  100.0024 secs
#>  7:     86    iris      nrow 18 a 1100.0025 secs
#>  8:    100    iris      nrow 20 e  100.0026 secs
#>  9:    101    iris      nrow 21 a 1100.0024 secs
#> 10:    103    iris      nrow 21 c  100.0024 secs
#> 11:    123    iris      nrow 25 c  100.0024 secs
#> 12:    125    iris      nrow 25 e  100.0028 secs
#> 13:    161    iris      nrow 33 a 1100.0026 secs
#> 14:    165    iris      nrow 33 e  100.0026 secs
#> 15:    169    iris      nrow 34 d  100.0026 secs
#> 16:    183    iris      nrow 37 c  100.0027 secs
#> 17:    184    iris      nrow 37 d  100.0027 secs
#> 18:    203    iris      nrow 41 c  100.0036 secs
#> 19:    207    iris      nrow 42 b 1100.0024 secs
#> 20:    209    iris      nrow 42 d  100.0029 secs
#> 21:    220    iris      nrow 44 e  100.0023 secs
#> 22:    227    iris      nrow 46 b 1100.0024 secs
#> 23:    229    iris      nrow 46 d  100.0023 secs
#> 24:    231    iris      nrow 47 a 1100.0023 secs
#> 25:    244    iris      nrow 49 d  100.0022 secs
#> 26:    260    iris      ncol  2 e  500.0024 secs
#> 27:    276    iris      ncol  6 a 1500.0025 secs
#> 28:    278    iris      ncol  6 c  500.0025 secs
#> 29:    279    iris      ncol  6 d  500.0024 secs
#> 30:    296    iris      ncol 10 a 1500.0025 secs
#> 31:    320    iris      ncol 14 e  500.0023 secs
#> 32:    340    iris      ncol 18 e  500.0023 secs
#> 33:    347    iris      ncol 20 b 1500.0023 secs
#> 34:    363    iris      ncol 23 c  500.0023 secs
#> 35:    369    iris      ncol 24 d  500.0023 secs
#> 36:    373    iris      ncol 25 c  500.0025 secs
#> 37:    387    iris      ncol 28 b 1500.0023 secs
#> 38:    410    iris      ncol 32 e  500.0024 secs
#> 39:    421    iris      ncol 35 a 1500.0024 secs
#> 40:    436    iris      ncol 38 a 1500.0024 secs
#> 41:    444    iris      ncol 39 d  500.0022 secs
#> 42:    448    iris      ncol 40 c  500.0022 secs
#> 43:    456    iris      ncol 42 a 1500.0023 secs
#> 44:    459    iris      ncol 42 d  500.0023 secs
#> 45:    467    iris      ncol 44 b 1500.0023 secs
#> 46:    468    iris      ncol 44 c  500.0023 secs
#> 47:    475    iris      ncol 45 e  500.0024 secs
#> 48:    482    iris      ncol 47 b 1500.0023 secs
#> 49:    492    iris      ncol 49 b 1500.0023 secs
#> 50:    499    iris      ncol 50 d  500.0023 secs
#>     job.id problem algorithm  x y   time.running

# Estimate runtimes:
est = estimateRuntimes(tab, reg = tmp)
print(est)
#> Runtime Estimate for 500 jobs with 1 CPUs
#>   Done     : 0d 09h 43m 20.1s
#>   Remaining: 3d 17h 37m 8.0s
#>   Total    : 4d 03h 20m 28.1s
rjoin(tab, est$runtimes)
#>      job.id problem algorithm  x y      type   runtime
#>   1:      1    iris      nrow  1 a estimated 1107.0568
#>   2:      2    iris      nrow  1 b estimated 1090.8508
#>   3:      3    iris      nrow  1 c estimated  338.2092
#>   4:      4    iris      nrow  1 d estimated  318.6349
#>   5:      5    iris      nrow  1 e estimated  317.3189
#>  ---                                                  
#> 496:    496    iris      ncol 50 a estimated 1381.9162
#> 497:    497    iris      ncol 50 b estimated 1389.1659
#> 498:    498    iris      ncol 50 c estimated  614.0596
#> 499:    499    iris      ncol 50 d  observed  500.0023
#> 500:    500    iris      ncol 50 e estimated  574.7851
print(est, n = 10)
#> Runtime Estimate for 500 jobs with 10 CPUs
#>   Done     : 0d 09h 43m 20.1s
#>   Remaining: 3d 17h 37m 8.0s
#>   Parallel : 0d 08h 58m 21.4s
#>   Total    : 4d 03h 20m 28.1s

# Submit jobs with longest runtime first:
ids = est$runtimes[type == "estimated"][order(runtime, decreasing = TRUE)]
print(ids)
#>      job.id      type   runtime
#>   1:    466 estimated 1420.0934
#>   2:    461 estimated 1418.7001
#>   3:    462 estimated 1415.5134
#>   4:    457 estimated 1414.7134
#>   5:    487 estimated 1413.4847
#>  ---                           
#> 446:    194 estimated  133.0456
#> 447:    185 estimated  133.0030
#> 448:    204 estimated  131.6954
#> 449:    174 estimated  131.5901
#> 450:    179 estimated  130.4434
if (FALSE) {
submitJobs(ids, reg = tmp)
}

# Group jobs into chunks with runtime < 1h
ids = est$runtimes[type == "estimated"]
ids[, chunk := binpack(runtime, 3600)]
#>      job.id      type   runtime chunk
#>   1:      1 estimated 1107.0568    47
#>   2:      2 estimated 1090.8508    51
#>   3:      3 estimated  338.2092    37
#>   4:      4 estimated  318.6349    33
#>   5:      5 estimated  317.3189    70
#>  ---                                 
#> 446:    495 estimated  581.7197    17
#> 447:    496 estimated 1381.9162    20
#> 448:    497 estimated 1389.1659    15
#> 449:    498 estimated  614.0596     4
#> 450:    500 estimated  574.7851    26
print(ids)
#>      job.id      type   runtime chunk
#>   1:      1 estimated 1107.0568    47
#>   2:      2 estimated 1090.8508    51
#>   3:      3 estimated  338.2092    37
#>   4:      4 estimated  318.6349    33
#>   5:      5 estimated  317.3189    70
#>  ---                                 
#> 446:    495 estimated  581.7197    17
#> 447:    496 estimated 1381.9162    20
#> 448:    497 estimated 1389.1659    15
#> 449:    498 estimated  614.0596     4
#> 450:    500 estimated  574.7851    26
print(ids[, list(runtime = sum(runtime)), by = chunk])
#>     chunk  runtime
#>  1:    47 3493.187
#>  2:    51 3593.783
#>  3:    37 3598.573
#>  4:    33 3599.900
#>  5:    70 3493.489
#>  6:    53 3598.723
#>  7:    71 3491.366
#>  8:    48 3491.841
#>  9:    52 3597.483
#> 10:    54 3587.877
#> 11:    68 3499.779
#> 12:    72 3489.223
#> 13:    55 3583.526
#> 14:    69 3496.272
#> 15:    73 3483.829
#> 16:    46 3519.591
#> 17:    50 3599.943
#> 18:    38 3597.396
#> 19:    65 3512.646
#> 20:    43 3571.763
#> 21:    62 3522.617
#> 22:    66 3511.003
#> 23:    39 3599.908
#> 24:    35 3599.575
#> 25:    61 3533.407
#> 26:    40 3598.645
#> 27:    56 3571.361
#> 28:    57 3565.133
#> 29:    49 3481.931
#> 30:    42 3583.160
#> 31:    58 3555.775
#> 32:    60 3535.954
#> 33:    41 3588.180
#> 34:    36 3599.425
#> 35:    59 3545.174
#> 36:    44 3541.279
#> 37:    34 3599.586
#> 38:    64 3514.492
#> 39:    45 3540.479
#> 40:    63 3517.610
#> 41:    67 3507.819
#> 42:    27 3598.911
#> 43:    24 3599.823
#> 44:    25 3590.607
#> 45:    26 3598.511
#> 46:    23 3599.593
#> 47:    28 3573.496
#> 48:    75 3599.916
#> 49:    12 3559.937
#> 50:    74 3474.824
#> 51:     8 3593.188
#> 52:    20 3521.159
#> 53:    31 3599.784
#> 54:     7 3595.855
#> 55:     5 3594.254
#> 56:    11 3563.352
#> 57:    10 3575.839
#> 58:     6 3599.450
#> 59:    32 3598.576
#> 60:    80 3492.129
#> 61:    82 3471.066
#> 62:    83 3599.780
#> 63:    79 3501.372
#> 64:    76 3593.842
#> 65:    85 3588.259
#> 66:    89 3553.760
#> 67:    91 2151.522
#> 68:    81 3481.753
#> 69:    78 3513.014
#> 70:    87 3570.795
#> 71:    88 3563.106
#> 72:    77 3529.443
#> 73:     3 3599.295
#> 74:    86 3578.904
#> 75:    90 3529.605
#> 76:     2 3599.210
#> 77:    84 3596.381
#> 78:     1 3599.788
#> 79:     4 3595.377
#> 80:     9 3583.777
#> 81:    29 3558.408
#> 82:    18 3572.866
#> 83:    15 3583.955
#> 84:    21 3599.004
#> 85:    19 3567.117
#> 86:    16 3582.283
#> 87:    30 3550.130
#> 88:    17 3578.532
#> 89:    22 3599.427
#> 90:    13 3599.265
#> 91:    14 3595.019
#>     chunk  runtime
if (FALSE) {
submitJobs(ids, reg = tmp)
}

# Group jobs into 10 chunks with similar runtime
ids = est$runtimes[type == "estimated"]
ids[, chunk := lpt(runtime, 10)]
#>      job.id      type   runtime chunk
#>   1:      1 estimated 1107.0568     4
#>   2:      2 estimated 1090.8508     9
#>   3:      3 estimated  338.2092     4
#>   4:      4 estimated  318.6349     8
#>   5:      5 estimated  317.3189     6
#>  ---                                 
#> 446:    495 estimated  581.7197     2
#> 447:    496 estimated 1381.9162     9
#> 448:    497 estimated 1389.1659     2
#> 449:    498 estimated  614.0596     2
#> 450:    500 estimated  574.7851     1
print(ids[, list(runtime = sum(runtime)), by = chunk])
#>     chunk  runtime
#>  1:     4 32227.40
#>  2:     9 32226.68
#>  3:     8 32231.22
#>  4:     6 32293.22
#>  5:     1 32226.47
#>  6:     3 32292.92
#>  7:    10 32227.16
#>  8:     5 32301.32
#>  9:     2 32301.36
#> 10:     7 32300.22
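
The chunking by a runtime limit shown above can be pictured with a greedy bin-packing heuristic. The sketch below uses first-fit decreasing in base R; `ffd_sketch` is a hypothetical helper, and the strategy is an assumption for illustration that may differ from what `binpack` does internally.

```r
# First-fit decreasing bin packing, illustrative only.
# `ffd_sketch` is a hypothetical helper, not part of batchtools.
ffd_sketch = function(x, capacity) {
  stopifnot(all(x <= capacity))      # every item must fit into an empty bin
  bins = numeric(0)                  # current fill level of each open bin
  chunk = integer(length(x))         # bin index for each item
  for (i in order(x, decreasing = TRUE)) {
    fits = which(bins + x[i] <= capacity)
    if (length(fits) > 0L) {
      b = fits[1L]                   # first open bin with enough room
    } else {
      bins = c(bins, 0)              # open a new bin
      b = length(bins)
    }
    bins[b] = bins[b] + x[i]
    chunk[i] = b
  }
  chunk
}

# Runtimes in seconds, packed into chunks of at most one hour
x = c(1500, 1100, 1100, 500, 500, 100)
chunk = ffd_sketch(x, 3600)
tapply(x, chunk, sum)
```

Unlike the LPT grouping, the number of chunks is not fixed in advance here; it follows from the capacity.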
