I thought moving tensors on and off GPU memory would add significant lag when running batch LLM inference, but it turns out those ops are negligible. Makes sense when your GPU's memory bandwidth is 1450 GB/s.
Throughput:
- with mem ops: 155.7 tok/s
- without mem ops: 156.1 tok/s
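
The post doesn't include the benchmark itself, so here's a minimal PyTorch sketch of how one might time these copies with CUDA events; the tensor shape, iteration counts, and pinned-memory setup are all my assumptions, not the original setup.

```python
import torch

assert torch.cuda.is_available()

def time_op(fn, iters=50, warmup=5):
    """Time a CUDA op with events; returns milliseconds per call."""
    for _ in range(warmup):  # warm up allocator / copy engine
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait for all queued work before reading the timer
    return start.elapsed_time(end) / iters

# Stand-in buffer (~256 MB fp32); shape is arbitrary, for illustration only.
host = torch.randn(8192, 8192, pin_memory=True)
dev = host.to("cuda")

# Host->device staging (PCIe-bound) vs. an on-device copy,
# which is what the quoted HBM bandwidth figure governs.
h2d_ms = time_op(lambda: host.to("cuda", non_blocking=True))
d2d_ms = time_op(lambda: dev.clone())

nbytes = host.numel() * host.element_size()
print(f"host->device:   {h2d_ms:.3f} ms  ({nbytes / h2d_ms / 1e6:.0f} GB/s)")
print(f"on-device copy: {d2d_ms:.3f} ms  ({2 * nbytes / d2d_ms / 1e6:.0f} GB/s)")
```

At HBM-class bandwidth, copying a few hundred MB per batch costs well under a millisecond, which lines up with the ~0.4 tok/s difference above.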