Some small suggestions for the Intel instruction set

Programmers trying to make crypto run fast often say things like "Why can't the CPU designer just add a 128-bit multiplication instruction?" Sometimes these questions turn into academic papers analyzing the cycle counts that would be obtained from various instruction-set extensions. What's missing from most of these questions and papers is the CPU designer's perspective: new instructions cost chip area, and they compete with many other suggestions for productive ways to use the same chip area.