First up, let me say I don't like writing in assembler.  It is not portable,
dependent on the particular CPU architecture release and is generally a pig
to debug and get right.  Having said that, the x86 architecture is probably
the most important for speed due to the number of boxes and since
it appears to be the worst architecture to get
good C compilers for.  So due to this, I have lowered myself to do
assembler for the inner DES routines in libdes :-).

The file to implement in assembler is des_enc.c.  Replace the following
4 functions:
	des_encrypt(DES_LONG data[2],des_key_schedule ks, int encrypt);
	des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
	des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
	des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);

They encrypt/decrypt the 64 bits held in 'data' using
the 'ks' key schedules.  The only differences between the 4 functions are that
des_encrypt2() does not perform IP() or FP() on the data (this is an
optimization for when doing triple DES), and that des_encrypt3() and
des_decrypt3() perform triple DES.  The triple DES routines are in here
because it does make a big difference to have them located near the
des_encrypt2 function at link time.

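To see why skipping IP() and FP() helps triple DES, here is a minimal C
sketch.  It is not the libdes code: toy_ip(), toy_fp() and toy_encrypt2()
are made-up stand-ins (a pair of mutually inverse permutations and a tiny
4 round Feistel core), chosen only so the example is self-contained.  The
point is that when three full encryptions are chained, each inner FP() is
immediately undone by the IP() that follows it, so one IP(), three core
passes and one FP() should give the same answer.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ROUNDS 4	/* toy value; real DES uses 16 rounds */

/* Rotate left; n must be in 1..31. */
static uint32_t rotl32(uint32_t v, unsigned n)
{
	assert(n > 0 && n < 32);
	return (v << n) | (v >> (32 - n));
}

/* Stand-ins for IP() and FP(): any two mutually inverse permutations. */
static void toy_ip(uint32_t d[2])
{
	uint32_t t = rotl32(d[0], 3);

	d[0] = rotl32(d[1], 5);
	d[1] = t;
}

static void toy_fp(uint32_t d[2])
{
	uint32_t t = rotl32(d[1], 29);	/* undoes the rotate by 3 */

	d[1] = rotl32(d[0], 27);	/* undoes the rotate by 5 */
	d[0] = t;
}

/* Stand-in for des_encrypt2(): Feistel rounds only, no IP()/FP(). */
void toy_encrypt2(uint32_t d[2], const uint32_t ks[TOY_ROUNDS], int encrypt)
{
	uint32_t l = d[0], r = d[1], t;
	int i;

	for (i = 0; i < TOY_ROUNDS; i++) {
		/* decryption walks the key schedule backwards */
		t = l ^ (rotl32(r, 7) + ks[encrypt ? i : TOY_ROUNDS - 1 - i]);
		l = r;
		r = t;
	}
	d[0] = r;	/* final swap */
	d[1] = l;
}

/* Stand-in for des_encrypt(): IP(), rounds, FP(). */
void toy_encrypt(uint32_t d[2], const uint32_t ks[TOY_ROUNDS], int encrypt)
{
	toy_ip(d);
	toy_encrypt2(d, ks, encrypt);
	toy_fp(d);
}

/* Triple (encrypt, decrypt, encrypt) the slow way: three full passes. */
void toy_encrypt3_naive(uint32_t d[2], const uint32_t k1[TOY_ROUNDS],
	const uint32_t k2[TOY_ROUNDS], const uint32_t k3[TOY_ROUNDS])
{
	toy_encrypt(d, k1, 1);
	toy_encrypt(d, k2, 0);
	toy_encrypt(d, k3, 1);
}

/* Same result with a single IP() and FP() around three core passes. */
void toy_encrypt3(uint32_t d[2], const uint32_t k1[TOY_ROUNDS],
	const uint32_t k2[TOY_ROUNDS], const uint32_t k3[TOY_ROUNDS])
{
	toy_ip(d);
	toy_encrypt2(d, k1, 1);
	toy_encrypt2(d, k2, 0);
	toy_encrypt2(d, k3, 1);
	toy_fp(d);
}
```

Calling toy_encrypt3_naive() and toy_encrypt3() on copies of the same block
with the same three key schedules should leave both copies identical; the
second simply does four fewer permutations.
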
Now as we all know, there are lots of different operating systems running on
x86 boxes, and unfortunately they normally try to make sure their assembler
formatting is not the same as other people's.
The 4 main formats I know of are:
	Microsoft	Windows 95/Windows NT
	Elf		Includes Linux and FreeBSD(?).
	a.out		The older Linux.
	Solaris		Same as Elf but different comments :-(.

Now I was not overly keen to write 4 different copies of the same code,
so I wrote a few perl routines to output the correct assembler, given
a target assembler type.  This code is ugly and is just a hack.
The libraries are x86unix.pl and x86ms.pl.
des586.pl, des686.pl and des-som[23].pl are the programs to actually
generate the assembler.

So to generate elf assembler
	perl des-som3.pl elf >dx86-elf.s
For Windows 95/NT
	perl des-som2.pl win32 >win32.asm

[ update 4 Jan 1996 ]
I have added another way to do things.
	perl des-som3.pl cpp >dx86-cpp.s
generates a file that will be included by dx86unix.cpp when it is compiled.
To build for elf, a.out, solaris, bsdi etc,
	cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
	cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
	cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
	cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o
This was done to cut down the number of files in the distribution.

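The idea behind the cpp route can be sketched in C: one source, with the
C preprocessor picking the spelling each object format wants.  This does
not reproduce dx86unix.cpp; the SYM_PREFIX macro and the sym_name() helper
are made up for illustration, and the only format difference shown is the
leading underscore that a.out style symbol names carry and ELF ones do not.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Compile with -DOUT for a.out style names; the default is ELF style. */
#ifdef OUT
#define SYM_PREFIX "_"
#else
#define SYM_PREFIX ""
#endif

/*
 * Write the object-format specific spelling of a C symbol into buf.
 * Returns the length written, or -1 if buf is too small.
 */
int sym_name(char *buf, size_t len, const char *c_name)
{
	size_t need;

	assert(buf != NULL && c_name != NULL);
	need = strlen(SYM_PREFIX) + strlen(c_name) + 1;
	if (need > len)
		return -1;
	return snprintf(buf, len, "%s%s", SYM_PREFIX, c_name);
}
```

Built without any -D flag, sym_name(buf, sizeof buf, "des_encrypt") leaves
the name unchanged; built with -DOUT the same call would prepend an
underscore.
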
Now the ugly part.  I acquired my copy of Intel's
"Optimizations For Intel's 32-Bit Processors" and found a few interesting
things.  First, the aim of the exercise is to 'extract' one byte at a time
from a word and do an array lookup.  This involves getting the byte from
the 4 locations in the word and moving it to a new word and doing the lookup.
The most obvious way to do this is
	xor	eax,	eax				# clear word
	mov	al,	cl				# get low byte
	xor	edi,	DWORD PTR 0x100+des_SP[eax] 	# xor in word
	mov	al,	ch				# get next byte
	xor	edi,	DWORD PTR 0x300+des_SP[eax] 	# xor in word
	shr	ecx,	16
which seems ok.  For the pentium, this system appears to be the best.
One has to do instruction interleaving to keep both functional units
operating, but it is basically very efficient.

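In C terms, the sequence above amounts to roughly the following.  This is
only a sketch under stated assumptions: sp_table is a made-up stand-in for
des_SP, laid out as eight sub-tables of 64 words (0x100 bytes each, which
is what the 0x100 and 0x300 offsets suggest); the use of sub-tables 5 and 7
for the upper two bytes just continues that pattern; and every byte of the
input word is assumed to be a multiple of 4 already, since the assembler
uses it directly as a byte offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for des_SP: 8 sub-tables of 64 words each. */
static uint32_t sp_table[8][64];

/* Look up one entry by byte offset, the way the assembler addresses it. */
static uint32_t sp_at(unsigned table, unsigned byte_off)
{
	assert(table < 8 && byte_off < 0x100 && (byte_off & 3) == 0);
	return sp_table[table][byte_off >> 2];
}

/*
 * XOR the lookups for all four bytes of w into acc.  Each byte of w
 * must already be a multiple of 4, i.e. (w & 0xfcfcfcfc) == w.
 */
uint32_t sp_lookup_bytes(uint32_t acc, uint32_t w)
{
	acc ^= sp_at(1, (uint8_t)w);		/* mov al,cl */
	acc ^= sp_at(3, (uint8_t)(w >> 8));	/* mov al,ch */
	w >>= 16;				/* shr ecx,16 */
	acc ^= sp_at(5, (uint8_t)w);		/* low byte again */
	acc ^= sp_at(7, (uint8_t)(w >> 8));	/* next byte again */
	return acc;
}
```

The (uint8_t) casts play the part of the al/cl/ch partial registers: a C
compiler is free to turn them into byte moves or into shifts and masks.
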
Now the crunch.  When a full register is used after a partial write, e.g.
	mov	al,	cl
	xor	edi,	DWORD PTR 0x100+des_SP[eax]
	386	- 1 cycle stall
	486	- 1 cycle stall
	586	- 0 cycle stall
	686	- at least 7 cycle stall (page 22 of the above mentioned document).

So the technique that produces the best results on a pentium, according to
the documentation, will produce hideous results on a pentium pro.

To get around this, des686.pl will generate code that is not as fast on
a pentium, but should be very good on a pentium pro.
	mov	eax,	ecx				# copy word
	shr	ecx,	8				# line up next byte
	and	eax,	0fch				# mask byte
	xor	edi,	DWORD PTR 0x100+des_SP[eax] 	# xor in array lookup
	mov	eax,	ecx				# get word
	shr	ecx,	8				# line up next byte
	and	eax,	0fch				# mask byte
	xor	edi,	DWORD PTR 0x300+des_SP[eax] 	# xor in array lookup

Due to the execution units in the pentium, this actually works quite well.
For a pentium pro it should be very good.  This is the type of output
Visual C++ generates.

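The copy, shift and mask sequence can be written in C as below.  Again
this is a hedged sketch, not the generated code: sp_table is a made-up
stand-in for des_SP (eight sub-tables of 64 words), and walking through
sub-tables 1, 3, 5 and 7 is an assumption that extends the 0x100 and 0x300
offsets shown above.  Because every index is masked with 0xfc, only whole
registers are ever written.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for des_SP: 8 sub-tables of 64 words each. */
static uint32_t sp_table[8][64];

/* Look up one entry by byte offset, the way the assembler addresses it. */
static uint32_t sp_at(unsigned table, unsigned byte_off)
{
	assert(table < 8 && byte_off < 0x100 && (byte_off & 3) == 0);
	return sp_table[table][byte_off >> 2];
}

/* XOR the lookups for all four bytes of w into acc, masking each index. */
uint32_t sp_lookup_masked(uint32_t acc, uint32_t w)
{
	uint32_t idx;
	unsigned table;

	for (table = 1; table < 8; table += 2) {
		idx = w;			/* mov eax,ecx   copy word */
		w >>= 8;			/* shr ecx,8     line up next byte */
		idx &= 0xfc;			/* and eax,0fch  mask byte */
		acc ^= sp_at(table, idx);	/* xor in array lookup */
	}
	return acc;
}
```

Unlike the byte register version, this one does not need the input word
to be masked beforehand: the low two bits of each byte are simply ignored.
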
There is a third option.  Instead of using
	mov	al,	ch
which is bad on the pentium pro, one may be able to use
	movzx	eax,	ch
which may not incur the partial write penalty.  On the pentium,
this instruction takes 4 cycles so is not worth using but on the
pentium pro it appears it may be worthwhile.  I need access to one to
experiment :-).

eric (20 Oct 1996)

22 Nov 1996 - I have asked people to run the 2 different versions on pentium
pros and it appears that the intel documentation is wrong.  The
mov al,bh is still faster on a pentium pro, so just use des586.pl
instead of des686.pl.

3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these
functions into des_enc.c, since it does make a massive performance
difference on some boxes to have the functions' code located close to
the des_encrypt2() function.

9 Jan 1996 - des-som2.pl is now the correct perl script to use for
pentiums.  It contains an inner loop from
Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
273,000 per second.  He had a previous version at 250,000 and the best
I was able to get was 203,000.  The content has not changed; this is all
due to instruction sequencing (and actual instruction choice) which is able
to keep both functional units of the pentium going.
We may have lost the ugly register usage restrictions when x86 went 32 bit
but for the pentium they have been replaced by evil instruction ordering tricks.

13 Jan 1996 - des-som3.pl, more optimizations from Svend Olaf.
Raw DES at 281,000 per second on a pentium 100.