split out backpropagation into finding the gradient, and using it in eg gradient descent (page on the latter); sketch of the split after this list
* page on directly solving neural networks instead of training them
* page/h3 on adapting when the network is non-differentiable? just refer to optimisation without differentiation, eg probabilistic (simulated annealing) or non-probabilistic (coordinate descent)
* split out alternatives, including the extreme learning machine
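A minimal sketch of that split on a toy one-parameter-pair model (names and numbers illustrative, not any particular library): `backprop` only computes the gradient; a separate `gradient_descent_step` consumes it, so the optimiser could be swapped out.

```python
import numpy as np

def forward(w, b, x):
    # toy one-layer "network": y_hat = w*x + b
    return w * x + b

def backprop(w, b, x, y):
    # finding the gradient: mean squared error, gradients worked out by hand
    err = forward(w, b, x) - y
    grad_w = np.mean(2 * err * x)
    grad_b = np.mean(2 * err)
    return grad_w, grad_b

def gradient_descent_step(w, b, grad_w, grad_b, lr=0.1):
    # using the gradient: one optimiser update, independent of how it was computed
    return w - lr * grad_w, b - lr * grad_b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1                      # target: w = 2, b = 1
w, b = 0.0, 0.0
for _ in range(200):
    gw, gb = backprop(w, b, x, y)
    w, b = gradient_descent_step(w, b, gw, gb)
print(round(w, 2), round(b, 2))    # approaches 2.0 and 1.0
```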
epoch: one complete pass of the entire training dataset through the network, each batch going forwards and then backwards
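A sketch of that definition, assuming a batched loop where `train_step` (a stand-in name) does the forward and backward pass on one batch:

```python
def run_epoch(dataset, batch_size, train_step):
    # one epoch: every example is seen exactly once,
    # each batch passed forwards and then backwards
    for start in range(0, len(dataset), batch_size):
        train_step(dataset[start:start + batch_size])

# usage: 3 epochs over a toy dataset (train_step is a no-op placeholder)
data = list(range(10))
for epoch in range(3):
    run_epoch(data, batch_size=4, train_step=lambda batch: None)
```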
neural network temperature; perplexity
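A sketch of both on a toy next-token distribution (illustrative only): temperature divides the logits before the softmax, so higher temperature flattens the distribution; perplexity is the exponential of the mean negative log-probability the model assigns to the observed tokens.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    # higher temperature flattens the distribution, lower sharpens it
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def perplexity(token_probs):
    # exp of the mean negative log-probability of the observed tokens
    return float(np.exp(-np.mean(np.log(token_probs))))

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.5))   # sharper
print(softmax_with_temperature(logits, 2.0))   # flatter
print(perplexity([0.5, 0.25, 0.1]))            # ~4.3; lower is better
```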
fine-tuning with human feedback vs fine-tuning without a human in the loop (eg asking a model to evaluate the outputs)
h3 regular:
+ gates and skipping; sketch of both after this list
  * highway network
  * residual network (ResNet)
  * both these can be used to address vanishing gradients?
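A minimal sketch of the two skip styles (numpy, names illustrative): a residual block adds the input straight onto the transformed output, while a highway block uses a learned gate to mix transformed output and carried input; in both, the near-identity path gives gradients a short route back, which is the vanishing-gradient argument.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_block(x, W):
    # ResNet-style: output = f(x) + x, identity skip connection
    return np.tanh(W @ x) + x

def highway_block(x, W_h, W_t):
    # highway-style: transform gate t decides how much transformed vs carried input
    t = sigmoid(W_t @ x)              # transform gate
    h = np.tanh(W_h @ x)              # candidate transformation
    return t * h + (1.0 - t) * x      # carry gate is (1 - t)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W_h = rng.normal(size=(4, 4)) * 0.1
W_t = rng.normal(size=(4, 4)) * 0.1
print(residual_block(x, W_h))
print(highway_block(x, W_h, W_t))
```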
Universal approximation theorem
+ a feedforward network with a single hidden layer (and enough hidden units) can approximate any continuous function on a compact domain to arbitrary accuracy
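One common form of the statement (single hidden layer, sigmoidal or more generally non-polynomial activation); the exact hypotheses vary between versions of the theorem:

```latex
% For any continuous f on a compact K \subset \mathbb{R}^n and any \varepsilon > 0,
% there exist N, weights w_i \in \mathbb{R}^n, biases b_i and coefficients c_i such that
\[
  \left| \, f(x) - \sum_{i=1}^{N} c_i \, \sigma(w_i^\top x + b_i) \, \right| < \varepsilon
  \quad \text{for all } x \in K .
\]
```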
One-shot learning; few-shot learning; is zero-shot learning a thing? (yes: generalising to classes or tasks with no labelled examples at all)
thing on quantisation and GPTQ in particular
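Not GPTQ itself (GPTQ quantises weights layer by layer, using approximate second-order information to compensate the remaining weights for the error introduced), but a minimal round-to-nearest int8 sketch to pin down what weight quantisation means; all names here are illustrative.

```python
import numpy as np

def quantise_int8(w):
    # symmetric round-to-nearest: map float weights onto integer levels in [-127, 127]
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantise_int8(w)
w_hat = dequantise(q, scale)
print("max error:", np.abs(w - w_hat).max())   # roughly bounded by scale / 2
```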
Low-Rank Adaptation (LoRA)
* alternative to fine-tuning the full checkpoint
* smaller outputs, faster
* LoRAs can be combined at runtime
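A minimal sketch of the LoRA idea (illustrative, not any particular library's API): the base weight W stays frozen while a low-rank update B @ A is trained and added on top; combining two LoRAs at runtime is just summing their scaled updates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 8, 2

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, rank))                 # starts at zero: no change initially
alpha = 1.0                                 # scaling factor for the update

def lora_forward(x):
    # base output plus the low-rank adaptation
    return W @ x + alpha * (B @ (A @ x))

# combining two LoRAs at runtime: add both low-rank updates on the same frozen W
A2 = rng.normal(size=(rank, d_in)) * 0.01
B2 = rng.normal(size=(d_out, rank)) * 0.01
def combined_forward(x):
    return W @ x + alpha * (B @ (A @ x)) + alpha * (B2 @ (A2 @ x))

x = rng.normal(size=d_in)
print(lora_forward(x).shape, combined_forward(x).shape)
```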