鴻鵠國際 is the official partner vendor for the Taiwan region and can provide technical support as well as paid training courses.
Modern graphics processing units (GPUs) contain hundreds of arithmetic units and can be harnessed to provide tremendous acceleration for numerically intensive scientific applications such as molecular modeling. The increased capabilities and flexibility of recent GPU hardware, combined with high-level GPU programming languages such as CUDA and OpenCL, have unlocked this computational power and made it accessible to computational scientists. The key to effective GPU computing is the design and implementation of data-parallel algorithms that scale to hundreds of tightly coupled processing units. Many molecular modeling applications are well suited to GPUs, both because of their extensive computational requirements and because they lend themselves to data-parallel implementations. Several exemplary results from our GPU computing work are presented in Klaus Schulten's Keynote Lecture from the 2010 GPU Technology Conference.

Molecular Dynamics

Continuing increases in high performance computing technology have rapidly expanded the domain of biomolecular simulation from isolated proteins in solvent to complex aggregates, often in a lipid environment. Such systems routinely comprise 100,000 atoms, and several published NAMD simulations have exceeded 1,000,000 atoms. Studying the function of even the simplest biomolecular machines requires simulations of 100 ns or longer, even when employing simulation techniques for accelerating processes of interest. One of the most time-consuming calculations in a typical molecular dynamics simulation is the evaluation of forces between atoms that do not share bonds. With their high degree of parallelism and floating point arithmetic capability, GPUs can attain performance levels twenty times that of a single CPU core for this calculation. The twenty-fold acceleration provided by the GPU decreases the runtime for the nonbonded force evaluations such that it can be overlapped with bonded forces and PME long-range force calculations on the CPU.
These and other CPU-bound operations must be ported to the GPU before further acceleration of the entire NAMD application can be realized.

Multi-Resolution Molecular Surface Visualization

Molecular surface visualization allows researchers to see where structures are exposed to solvent, where structures come into contact, and to view the overall architecture of large biomolecular complexes such as transmembrane channels and virus capsids. Recently, we have developed a new GPU-accelerated multi-resolution molecular surface representation, enabling smooth interactive animation of moderate-sized biomolecular complexes consisting of a few hundred thousand to one million atoms, and interactive display of molecular surfaces for multi-million-atom complexes, e.g. large virus capsids. The GPU-accelerated QuickSurf representation in VMD achieves performance orders of magnitude faster than the conventional Surf and MSMS representations, and makes VMD the first molecular visualization tool capable of achieving smooth animations of surface representations for systems of up to one million atoms.

Molecular Orbital Display

Visualization of molecular orbitals (MOs) is important for analyzing the results of quantum chemistry simulations. The functions describing the MOs are computed on a three-dimensional lattice, and the resulting data can then be used for plotting isocontours or isosurfaces for visualization as well as for other types of analyses. Existing software packages that render MOs perform calculations on the CPU and require runtimes of tens to hundreds of seconds depending on the complexity of the molecular system. We have developed novel data-parallel algorithms for computing MOs on modern GPUs using CUDA. As recently reported, the fastest GPU algorithm achieves up to a 125-fold speedup over an optimized CPU implementation running on one CPU core.
We have implemented these algorithms within the popular molecular visualization program VMD, which can now produce high quality MO renderings for large systems in less than a second, and achieves the first-ever interactive animations of quantum chemistry simulation trajectories using only on-the-fly calculation.

Ion Placement

To best reproduce physiological conditions, molecular dynamics simulations must be run in the presence of appropriate ions. Generally such simulations are performed in the presence of sodium chloride, although in some cases (such as simulations including nucleic acid) other ions such as magnesium are necessary. Although many tools such as the VMD Autoionize plugin can place a random distribution of ions, molecules requiring counterions for their stability are better treated using ion placement methods which take the electrostatics of the solute into account. One method for doing this is to place important counterions at minima in the electrostatic potential field generated by the biomolecule of interest, iteratively updating the potential field after each ion is placed. While this method of ion placement is simple and computes ion positions matched to the specific target molecule, it can be very computationally demanding for large structures because it requires calculation of the electrostatic potential at all points on a high-resolution 3D lattice in the neighborhood of the solute. Coulomb-based ionization of very large structures such as viruses could require several days even using moderately sized clusters of computers. However, the calculation of a function on a lattice where all points are independent is an ideal application for GPU acceleration, and as recently reported in the Journal of Computational Chemistry, the use of GPUs to accelerate Coulomb-based ion placement leads to speedups of 100 times or more, allowing large structures to be properly ionized in less than an hour on a single desktop computer.
GPU-accelerated ion placement for large bacterial ribosome and STMV virus structures
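
The lattice calculation behind Coulomb-based ion placement can be illustrated with a minimal CPU sketch in Python. It is not the VMD or CUDA implementation; the function names, the flat x-fastest lattice layout, and the abstract units (each atom contributes charge divided by distance, with no physical constant or dielectric) are assumptions made purely for illustration. The point it shows is that every lattice point is computed independently of the others, which is what makes the problem a good fit for a GPU.

```python
import math


def direct_coulomb_lattice(atoms, origin, spacing, dims):
    """Direct Coulomb summation on a regular 3D lattice (illustrative only).

    atoms   : iterable of (x, y, z, charge) tuples
    origin  : (x0, y0, z0), coordinates of lattice point (0, 0, 0)
    spacing : lattice spacing, in the same length unit as the coordinates
    dims    : (nx, ny, nz), number of lattice points along each axis
    Returns a flat list of potentials with x varying fastest.
    """
    nx, ny, nz = dims
    x0, y0, z0 = origin
    potential = []
    # Each lattice point depends only on the atoms, never on another
    # lattice point, so a GPU could hand each point to its own thread.
    for k in range(nz):
        z = z0 + k * spacing
        for j in range(ny):
            y = y0 + j * spacing
            for i in range(nx):
                x = x0 + i * spacing
                total = 0.0
                # Loop over all atoms: total work is n_atoms * n_points.
                for ax, ay, az, q in atoms:
                    r = math.sqrt((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2)
                    if r > 0.0:  # skip a point that coincides with an atom
                        total += q / r
                potential.append(total)
    return potential


def index_of_minimum(potential):
    """Flat index of the lowest potential, a candidate site for a positive ion."""
    return min(range(len(potential)), key=potential.__getitem__)


def lattice_coordinates(index, origin, spacing, dims):
    """Convert a flat x-fastest index back to Cartesian coordinates."""
    nx, ny, _ = dims
    i = index % nx
    j = (index // nx) % ny
    k = index // (nx * ny)
    return (origin[0] + i * spacing,
            origin[1] + j * spacing,
            origin[2] + k * spacing)
```

In an iterative placement loop, one would call `direct_coulomb_lattice`, place an ion at the point returned by `index_of_minimum`, append that ion to `atoms`, and recompute the potential before placing the next ion.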
The direct summation of the Coulomb potential from all atoms to every lattice point requires computational work that grows quadratically, proportional to the product of the number of atoms and the number of lattice points. An algorithmic enhancement known as multilevel summation uses hierarchical interpolation of softened pairwise potentials from lattices of increasing coarseness to compute an approximation to the Coulomb potential. The amount of computational work for multilevel summation grows linearly, proportional to the sum of the number of atoms and the number of lattice points. Our reported GPU-assisted implementation of this method further reduces the time needed to obtain large ionized structures to just a few minutes on a single desktop computer. The accuracy of the implementation (an average difference from the direct approach in the range of 0.025% to 0.037%) is sufficient to give ion placement identical to that of the direct summation approach for small test molecules, and nearly identical results for the ribosome.

The GPU-accelerated Coulomb potential calculation can be directly applied to calculate time-averaged electrostatic potentials from molecular dynamics simulations. As we reported, a VMD calculation of the electrostatic potential for one frame of a molecular dynamics simulation of the ribosome takes 529 seconds on a single GPU, as opposed to 5.24 hours on a single CPU core. A multilevel summation calculation for a single frame requires 67 seconds on one GPU.

Multi-GPU Coulomb Summation

Just as scientific computing can be done on clusters composed of a large number of CPU cores, in some cases problems can be decomposed and run in parallel on multiple GPUs within a single host machine, achieving correspondingly higher levels of performance.
One of the drawbacks to the use of multi-core CPUs for scientific computing has been the limited amount of memory bandwidth available to each CPU socket, often severely limiting the performance of bandwidth-intensive scientific codes. Recently this problem has been further exacerbated, since the memory bandwidth available to each CPU socket has not kept pace with the increasing number of cores in current CPUs. Since GPUs contain their own on-board high performance memory, the memory bandwidth available for computational kernels scales as the number of GPUs is increased. This property can allow single-system multi-GPU codes to scale much better than their multi-core CPU based counterparts. Highly data-parallel and memory-bandwidth-intensive problems are often excellent candidates for such multi-GPU performance scaling.

The direct Coulomb summation algorithm implemented in VMD is an exemplary case for multi-GPU acceleration. The scaling efficiency for direct summation across multiple GPUs is nearly perfect: the use of four GPUs delivers an almost exactly 4X performance increase. A single GPU evaluates up to 39 billion atom potentials per second, performing 290 GFLOPS of floating point arithmetic. With the use of four GPUs, total performance increases to 157 billion atom potentials per second and 1.156 TFLOPS of floating point arithmetic, for a multi-GPU speedup of 3.99 and a scaling efficiency of 99.7%, as recently reported. To match this level of performance using CPUs, hundreds of state-of-the-art CPU cores would be required, along with their attendant cabling, power, and cooling requirements. While only one of the first steps in our exploration of the use of multiple GPUs, this result clearly demonstrates that it is possible to harness multiple GPUs in a single system with high efficiency.

Fluorescence microphotolysis is a noninvasive method of studying dynamics of cellular components using optical microscopy.
In this method, a small area of a fluorescent specimen is illuminated by a focused laser beam, and the fluorescence of the illuminated spot is recorded. By analyzing the change of the fluorescence signal with time, one can extract diffusion constants of the fluorescent molecules. However, such an analysis of experimental data often requires numerical calculations: a diffusion-reaction equation (a partial differential equation in time and 2D or 3D space) has to be solved. Numerical schemes for solving this equation on a grid feature a significant degree of parallelism; indeed, the scheme can be represented as a vector-matrix multiplication problem, which is common in graphics applications and can easily be computed on a GPU. On the other hand, the concentration of fluorescent molecules at a given point depends on the concentration at other points, introducing interdependencies that limit parallelism. Nevertheless, it has been demonstrated recently that GPU-accelerated computation of fluorescence microphotolysis signals achieves a significant speedup over CPU computation. A computation that took about 8 minutes on a CPU has been shown to run in 38 seconds on a GPU. Given that experimentalists need to perform multiple computation runs with various parameters to match the observed fluorescence signals, this roughly 12-fold speedup is very welcome. As we reported, the GPU-accelerated computation of fluorescence measurements opens new possibilities for experiments that employ new high-resolution microscopes (such as the so-called 4Pi microscope), because the intricate pattern of light distribution in such microscopes makes a numerical solution necessary to analyze experimental data. Further information on this topic is available here.
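
To make the grid-based scheme more concrete, here is a minimal Python sketch of one explicit time step for a 2D diffusion-reaction equation. The source does not specify the equation or the numerical scheme, so everything in it is an assumption: the form dc/dt = D * laplacian(c) - k(x, y) * c with a position-dependent bleaching rate `bleach_rate`, the forward-Euler five-point stencil, the zero-flux boundaries, and the function name. It shows both properties mentioned above: every grid cell can be updated independently within a step, yet each update reads the neighboring cells from the previous step.

```python
def diffusion_reaction_step(c, bleach_rate, D, dx, dt):
    """Advance a 2D concentration grid by one explicit time step.

    c           : list of rows, c[j][i] is the concentration at grid cell (i, j)
    bleach_rate : same shape as c, hypothetical local bleaching rate k(x, y)
    D           : diffusion constant
    dx          : grid spacing (same in x and y)
    dt          : time step; for pure diffusion an explicit scheme like this
                  one is only stable when dt <= dx*dx / (4*D)
    Returns a new grid; the input grid is left unchanged.
    """
    ny = len(c)
    nx = len(c[0])
    new = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            here = c[j][i]
            # Zero-flux boundaries: a missing neighbor is replaced by the
            # cell itself, so nothing diffuses out of the grid.
            left = c[j][i - 1] if i > 0 else here
            right = c[j][i + 1] if i < nx - 1 else here
            down = c[j - 1][i] if j > 0 else here
            up = c[j + 1][i] if j < ny - 1 else here
            laplacian = (left + right + down + up - 4.0 * here) / (dx * dx)
            # Each output cell reads only the previous step's values, so all
            # cells of one step could be computed in parallel.
            new[j][i] = here + dt * (D * laplacian - bleach_rate[j][i] * here)
    return new
```

Calling the function repeatedly, and summing the concentration over the illuminated cells after each step, would give a simulated signal to compare against a measured one. Matching an experiment would mean repeating such runs for many values of `D`, which is where a GPU speedup pays off.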

