Here is a list of my papers.

Kang Su Gatlin,
"Portable High Performance Programming via Architecture-Cognizant Divide-and-Conquer Algorithms" (PDF file),
UCSD CSE PhD Dissertation, September 2000.

Kang Su Gatlin and Larry Carter,
"Faster FFTs via Architecture-Cognizance",
PACT 2000, October 2000.

Kang Su Gatlin and Larry Carter,
"Architecture-Cognizant Divide and Conquer Algorithms" (PDF file),
SuperComputing '99, Best Student Paper, November 1999.
The Director's Cut version of this paper can be downloaded by clicking here.

V. Getov and Y. Wei (University of Westminster, UK), L. Carter and K. Gatlin (UCSD),
"Performance Optimisations of the NPB FT Kernel by Special-Purpose Unroller",
7th Euromicro Workshop on Parallel and Distributed Processing, February 1999.

Kang Su Gatlin and Larry Carter,
"Memory Hierarchy Considerations for Fast Transpose and Bit-Reversals",
HPCA-5, January 1999.

Larry Carter and Kang Su Gatlin,
"Towards an Optimal Bit-Reverse Permutation Program",
Foundations of Computer Science '98, November 1998.

The pre-alpha source code for the Cache Optimal Bit-Reversal Algorithm (COBRA) is now available. It is very small (less than 4 KB) and, I think, pretty portable. Click to download.

A. Snavely and L. Carter (SDSC/UCSD), J. Boisseau and A. Majumdar (SDSC), K.S. Gatlin and N. Mitchell (UCSD), J. Feo and B. Koblenz (Tera Computer),
"Multi-processor Performance on the Tera MTA",
SuperComputing '98, Orlando, November 1998.

Boisseau, J., L. Carter, K.S. Gatlin, A. Majumdar and A. Snavely,
"NAS Benchmarks on the Tera MTA",
Workshop on Multi-Threaded Execution, Architecture and Compilation (M-TEAC 98), February 1998.

Carter, L., J. Ferrante, S. Flynn Hummel, B. Alpern, and K. S. Gatlin,
"Hierarchical Tiling: A Methodology for High Performance",
UCSD Tech Report CS96-508, Nov 1996.

S. B. Baden, R. Schreiber, K. S. Gatlin, S. J. Fink,
"A Preliminary Evaluation of HPF",
Proc. 8th SIAM Conf. on Parallel Proc. for Scientific Computing,
March 1997, Minneapolis, MN.

Alpern, B., L. Carter, and Kang Su Gatlin,
"Microparallelism and High-Performance Protein Matching",
SuperComputing '95.

Keywords: cache tlb memory hierarchy divide and conquer recursive skeletons registers hierarchical tiling recursion high performance programming bit reversal protein matching tera mta nas murphy reordering architecture cognizant cache oblivious cache aware fft mmx altivec multimedia instruction sets