Jobs can be partitioned into “chunks” to be executed sequentially on the computational nodes. Chunks are defined by passing a data frame with columns “job.id” and “chunk” (integer) to submitJobs. All jobs with the same chunk number are grouped together on one node to form a single computational job.
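
For illustration, a minimal sketch of this workflow (assuming a registry reg that has already been populated with jobs, e.g. via batchMap; the chunk size of 50 is arbitrary):

ids = findJobs(reg = reg)                          # data.table with column job.id
ids[, chunk := chunk(job.id, chunk.size = 50)]     # at most 50 jobs per chunk
submitJobs(ids = ids, reg = reg)                   # one computational job per chunk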

The function chunk simply splits x into either a fixed number of groups, or into a variable number of groups with at most a fixed number of elements each.
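
For example, with an illustrative vector of nine elements, the two modes look like this:

x = 1:9
chunk(x, n.chunks = 3)    # fixed number of groups: three groups of three elements
chunk(x, chunk.size = 4)  # fixed maximum size: as many groups as needed, at most four elements each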

The function lpt also groups x into a fixed number of chunks, but uses the actual values of x in a greedy “Longest Processing Time” algorithm. As a result, the maximum sum of elements per chunk is minimized.
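
A small sketch of the balancing behaviour, with made-up weights (e.g. estimated runtimes):

w = c(8, 7, 6, 5, 4)
ch = lpt(w, n.chunks = 2)
sapply(split(w, ch), sum)  # the greedy assignment keeps the two sums close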

binpack splits x into a variable number of groups whose sums of elements do not exceed the upper limit given by chunk.size.
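
A small sketch, again with made-up weights and an upper limit of 1:

w = c(0.7, 0.3, 0.6, 0.4, 0.2)
ch = binpack(w, chunk.size = 1)
sapply(split(w, ch), sum)  # no group sum exceeds 1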

See the examples of estimateRuntimes for an application of binpack and lpt.

chunk(x, n.chunks = NULL, chunk.size = NULL, shuffle = TRUE)

lpt(x, n.chunks = 1L)

binpack(x, chunk.size = max(x))

Arguments

x

[numeric]
For chunk an atomic vector (usually the job.id). For binpack and lpt, the weights to group.

n.chunks

[integer(1)]
Requested number of chunks. The function chunk distributes the number of elements in x evenly across chunks, while lpt tries to even out the sum of elements in each chunk. If more chunks than necessary are requested, empty chunks are ignored. Mutually exclusive with chunk.size.

chunk.size

[integer(1)]
Requested size for each single chunk. For chunk this is the maximum number of elements per group; for binpack it is the maximum sum of values per group. Mutually exclusive with n.chunks.

shuffle

[logical(1)]
Shuffles the groups. Default is TRUE.
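
A minimal sketch of disabling the shuffling to obtain a deterministic assignment:

ch = chunk(1:10, n.chunks = 2, shuffle = FALSE)
table(ch)  # still five elements per group, but the grouping is reproducible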

Value

[integer] giving the chunk number for each element of x.

See also

estimateRuntimes, submitJobs

Examples

ch = chunk(1:10, n.chunks = 2)
table(ch)
#> ch
#> 1 2
#> 5 5
ch = chunk(rep(1, 10), chunk.size = 2)
table(ch)
#> ch
#> 1 2 3 4 5
#> 2 2 2 2 2
set.seed(1)
x = runif(10)
ch = lpt(x, n.chunks = 2)
sapply(split(x, ch), sum)
#>        1        2
#> 2.808393 2.706746
set.seed(1)
x = runif(10)
ch = binpack(x, 1)
sapply(split(x, ch), sum)
#>         1         2         3         4         5         6
#> 0.9446753 0.9699941 0.8983897 0.9263065 0.8307960 0.9449773
# Job chunking
tmp = makeRegistry(file.dir = NA, make.default = FALSE)
#> Sourcing configuration file '~/.batchtools.conf.R' ...
#> Created registry in '/tmp/batchtools-example/reg1' using cluster functions 'Interactive'
ids = batchMap(identity, 1:25, reg = tmp)
#> Adding 25 jobs ...
### Group into chunks with 10 jobs each
ids[, chunk := chunk(job.id, chunk.size = 10)]
#> Key: <job.id>
#>     job.id chunk
#>      <int> <int>
#>  1:      1     2
#>  2:      2     3
#>  3:      3     2
#>  4:      4     1
#>  5:      5     2
#>  6:      6     1
#>  7:      7     3
#>  8:      8     1
#>  9:      9     3
#> 10:     10     3
#> 11:     11     2
#> 12:     12     3
#> 13:     13     1
#> 14:     14     3
#> 15:     15     1
#> 16:     16     1
#> 17:     17     2
#> 18:     18     3
#> 19:     19     2
#> 20:     20     2
#> 21:     21     3
#> 22:     22     1
#> 23:     23     1
#> 24:     24     1
#> 25:     25     2
#>     job.id chunk
print(ids[, .N, by = chunk])
#>    chunk     N
#>    <int> <int>
#> 1:     2     8
#> 2:     3     8
#> 3:     1     9
### Group into 4 chunks
ids[, chunk := chunk(job.id, n.chunks = 4)]
#> Key: <job.id>
#>     job.id chunk
#>      <int> <int>
#>  1:      1     3
#>  2:      2     4
#>  3:      3     3
#>  4:      4     2
#>  5:      5     4
#>  6:      6     1
#>  7:      7     3
#>  8:      8     1
#>  9:      9     1
#> 10:     10     1
#> 11:     11     4
#> 12:     12     3
#> 13:     13     4
#> 14:     14     2
#> 15:     15     3
#> 16:     16     2
#> 17:     17     2
#> 18:     18     3
#> 19:     19     4
#> 20:     20     2
#> 21:     21     4
#> 22:     22     1
#> 23:     23     1
#> 24:     24     1
#> 25:     25     2
#>     job.id chunk
print(ids[, .N, by = chunk])
#>    chunk     N
#>    <int> <int>
#> 1:     3     6
#> 2:     4     6
#> 3:     2     6
#> 4:     1     7
### Submit to batch system
submitJobs(ids = ids, reg = tmp)
#> Submitting 25 jobs in 4 chunks using cluster functions 'Interactive' ...
#> ### [bt]: Setting seed to 6 ...
#> ### [bt]: Setting seed to 8 ...
#> ### [bt]: Setting seed to 9 ...
#> ### [bt]: Setting seed to 10 ...
#> ### [bt]: Setting seed to 22 ...
#> ### [bt]: Setting seed to 23 ...
#> ### [bt]: Setting seed to 24 ...
#> ### [bt]: Setting seed to 4 ...
#> ### [bt]: Setting seed to 14 ...
#> ### [bt]: Setting seed to 16 ...
#> ### [bt]: Setting seed to 17 ...
#> ### [bt]: Setting seed to 20 ...
#> ### [bt]: Setting seed to 25 ...
#> ### [bt]: Setting seed to 1 ...
#> ### [bt]: Setting seed to 3 ...
#> ### [bt]: Setting seed to 7 ...
#> ### [bt]: Setting seed to 12 ...
#> ### [bt]: Setting seed to 15 ...
#> ### [bt]: Setting seed to 18 ...
#> ### [bt]: Setting seed to 2 ...
#> ### [bt]: Setting seed to 5 ...
#> ### [bt]: Setting seed to 11 ...
#> ### [bt]: Setting seed to 13 ...
#> ### [bt]: Setting seed to 19 ...
#> ### [bt]: Setting seed to 21 ...
# Grouped chunking
tmp = makeExperimentRegistry(file.dir = NA, make.default = FALSE)
#> Sourcing configuration file '~/.batchtools.conf.R' ...
#> Created registry in '/tmp/batchtools-example/reg2' using cluster functions 'Interactive'
prob = addProblem(reg = tmp, "prob1", data = iris, fun = function(job, data) nrow(data))
#> Adding problem 'prob1'
prob = addProblem(reg = tmp, "prob2", data = Titanic, fun = function(job, data) nrow(data))
#> Adding problem 'prob2'
algo = addAlgorithm(reg = tmp, "algo", fun = function(job, data, instance, i, ...) problem)
#> Adding algorithm 'algo'
prob.designs = list(prob1 = data.table(), prob2 = data.table(x = 1:2))
algo.designs = list(algo = data.table(i = 1:3))
addExperiments(prob.designs, algo.designs, repls = 3, reg = tmp)
#> Adding 9 experiments ('prob1'[1] x 'algo'[3] x repls[3]) ...
#> Adding 18 experiments ('prob2'[2] x 'algo'[3] x repls[3]) ...
### Group into chunks of 5 jobs, but do not put multiple problems into the same chunk
# -> only one problem has to be loaded per chunk, and only once because it is cached
ids = getJobTable(reg = tmp)[, .(job.id, problem, algorithm)]
ids[, chunk := chunk(job.id, chunk.size = 5), by = "problem"]
#> Key: <job.id>
#>     job.id problem algorithm chunk
#>      <int>  <char>    <char> <int>
#>  1:      1   prob1      algo     1
#>  2:      2   prob1      algo     2
#>  3:      3   prob1      algo     1
#>  4:      4   prob1      algo     1
#>  5:      5   prob1      algo     2
#>  6:      6   prob1      algo     1
#>  7:      7   prob1      algo     1
#>  8:      8   prob1      algo     2
#>  9:      9   prob1      algo     2
#> 10:     10   prob2      algo     1
#> 11:     11   prob2      algo     2
#> 12:     12   prob2      algo     3
#> 13:     13   prob2      algo     2
#> 14:     14   prob2      algo     4
#> 15:     15   prob2      algo     3
#> 16:     16   prob2      algo     2
#> 17:     17   prob2      algo     2
#> 18:     18   prob2      algo     1
#> 19:     19   prob2      algo     4
#> 20:     20   prob2      algo     2
#> 21:     21   prob2      algo     3
#> 22:     22   prob2      algo     1
#> 23:     23   prob2      algo     1
#> 24:     24   prob2      algo     3
#> 25:     25   prob2      algo     1
#> 26:     26   prob2      algo     4
#> 27:     27   prob2      algo     4
#>     job.id problem algorithm chunk
ids[, chunk := .GRP, by = c("problem", "chunk")]
#> Key: <job.id>
#>     job.id problem algorithm chunk
#>      <int>  <char>    <char> <int>
#>  1:      1   prob1      algo     1
#>  2:      2   prob1      algo     2
#>  3:      3   prob1      algo     1
#>  4:      4   prob1      algo     1
#>  5:      5   prob1      algo     2
#>  6:      6   prob1      algo     1
#>  7:      7   prob1      algo     1
#>  8:      8   prob1      algo     2
#>  9:      9   prob1      algo     2
#> 10:     10   prob2      algo     3
#> 11:     11   prob2      algo     4
#> 12:     12   prob2      algo     5
#> 13:     13   prob2      algo     4
#> 14:     14   prob2      algo     6
#> 15:     15   prob2      algo     5
#> 16:     16   prob2      algo     4
#> 17:     17   prob2      algo     4
#> 18:     18   prob2      algo     3
#> 19:     19   prob2      algo     6
#> 20:     20   prob2      algo     4
#> 21:     21   prob2      algo     5
#> 22:     22   prob2      algo     3
#> 23:     23   prob2      algo     3
#> 24:     24   prob2      algo     5
#> 25:     25   prob2      algo     3
#> 26:     26   prob2      algo     6
#> 27:     27   prob2      algo     6
#>     job.id problem algorithm chunk
dcast(ids, chunk ~ problem)
#> Using 'chunk' as value column. Use 'value.var' to override
#> Aggregate function missing, defaulting to 'length'
#> Key: <chunk>
#>    chunk prob1 prob2
#>    <int> <int> <int>
#> 1:     1     5     0
#> 2:     2     4     0
#> 3:     3     0     5
#> 4:     4     0     5
#> 5:     5     0     4
#> 6:     6     0     4