Should I expect a processor to optimize C = matmul(A, conjg(transpose(B))) so that it does not create two or three temporary arrays? Or should I write my own matmul with options for the transpose and conjugate, or call *GEMM directly?