Literature
You should also look at the literature folder of the main repository.
[[TOC]]
Repa is "REgular, shape-polymorphic, Parallel Arrays" and is documented in three research papers (see below).
- "Regular, shape-polymorphic, parallel arrays in Haskell" (ICFP 2010): introduces the Repa system
- "Efficient Parallel Stencil Convolution in Haskell" (Haskell Symposium 2011): extends Repa with stencil operations
- "Guiding Parallel Array Fusion with Indexed Types" (Haskell Symposium 2012): allows the library user to select between several different array representations
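As a minimal taste of what this looks like in code, here is a sketch using the Repa 3 API from the 2012 paper (assuming the repa package; U is the unboxed manifest representation and computeP forces a delayed array in parallel):

```haskell
import Data.Array.Repa as R

-- Square every element of a 1-D unboxed array, evaluated in parallel.
squareAll :: Array U DIM1 Double -> IO (Array U DIM1 Double)
squareAll arr = computeP (R.map (^ (2 :: Int)) arr)

main :: IO ()
main = do
  let xs = fromListUnboxed (Z :. (10 :: Int)) [0 .. 9 :: Double]
  ys <- squareAll xs
  print (toList ys)
```

The representation indices in the types (U for unboxed manifest, D for delayed) are exactly what the third paper is about.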
This article documents push-arrays: http://dl.acm.org/citation.cfm?id=2103740
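To see what the push-array idea amounts to, here is a minimal Haskell encoding (a sketch of the general concept only, not the embedded-GPU version from the cited paper): a push array is a computation that is handed a write callback and pushes its elements through it, which makes operations like append cheap because no intermediate array is built.

```haskell
-- A push array describes how to write its elements, given a write callback.
newtype Push a = Push { run :: (Int -> a -> IO ()) -> IO () }

-- A list viewed as a push array: push each element at its index.
fromList :: [a] -> Push a
fromList xs = Push (\write -> mapM_ (uncurry write) (zip [0 ..] xs))

-- Append without copying: the second array's writes are offset by len1.
append :: Int -> Push a -> Push a -> Push a
append len1 (Push p) (Push q) =
  Push (\write -> p write >> q (\i a -> write (i + len1) a))

-- Demo: print the writes performed by the appended array.
main :: IO ()
main = run (append 2 (fromList [1, 2 :: Int]) (fromList [3]))
           (\i a -> putStrLn (show i ++ " := " ++ show a))
```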
EmbArBB is a thin wrapper around Intel's ArBB that exposes a small DSL, but it still involves a lot of clutter. Not that promising in itself, but ArBB might be worth a look.
HArBB is an ArBB back-end for Accelerate, though it does not support all of Accelerate's features, and general folds are implemented efficiently only for certain operators (e.g. addition, multiplication and xor), not for arbitrary lambda expressions.
- Meta-Par
- hmatrix
Look at sections 4.2-4.4 of the Copperhead tech report for considerations and references on mapping nested data parallelism to CUDA. In this context the tech report cites:
[1] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC '09: Proc. Conference on High Performance Computing Networking, Storage and Analysis, pages 1-11. ACM, 2009.
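The standard trick behind such mappings (and behind the CSR sparse matrix-vector kernels in [1]) is to flatten a nested array into a flat value array plus a segment descriptor. A hedged Haskell sketch of a segmented sum over that representation (list-based for clarity, nothing like an actual CUDA kernel):

```haskell
-- The nested array [[1,2,3],[4,5]] is flattened into values [1,2,3,4,5]
-- and segment lengths [3,2]; a segmented reduction yields one result
-- per original subarray.
segmentedSum :: Num a => [Int] -> [a] -> [a]
segmentedSum []       _  = []
segmentedSum (n : ns) xs = sum seg : segmentedSum ns rest
  where (seg, rest) = splitAt n xs

main :: IO ()
main = print (segmentedSum [3, 2] [1 .. 5 :: Int])  -- [6,9]
```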
- "Implementation of a portable nested data-parallel language" (http://dl.acm.org/citation.cfm?id=155343) has a non-nested linefit example in figure 7.
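For reference, the same computation written with only flat data-parallel operations (maps, zips and reductions); a Haskell sketch in the spirit of that figure, since the paper's own version is NESL code:

```haskell
-- Least-squares fit of y = m*x + b, using only elementwise operations
-- and reductions, i.e. no nested parallelism.
linefit :: [Double] -> [Double] -> (Double, Double)
linefit xs ys = (m, b)
  where
    n   = fromIntegral (length xs)
    xm  = sum xs / n
    ym  = sum ys / n
    dxs = map (subtract xm) xs
    dys = map (subtract ym) ys
    m   = sum (zipWith (*) dxs dys) / sum (map (^ (2 :: Int)) dxs)
    b   = ym - m * xm

main :: IO ()
main = print (linefit [0, 1, 2, 3] [1, 3, 5, 7])  -- expect (2.0,1.0)
```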
- Intel Array Building Blocks
- Microsoft Accelerator
- Acceleware
- Copperhead (GPU programming in Python)
- Brook and BrookGPU
- Merge
A system for C++ that compiles to both CUDA and Intel TBB. The main contribution is to adaptively select how much of a computation is scheduled for the CPU and how much for the GPU, based on the input size N. They do this by making training runs with different input sizes for both the CPU and the GPU version and fitting linear functions to the measurements (x = input size, y = running time). Given a concrete problem instance of size N, the optimal division of labour can then be computed from these two functions (see the sketch after the notes below).
Other notes:
- Has a method of dividing any program into two parts that can be executed in parallel (one part for the CPU, another for the GPU) such that the results can be combined; this method is not described in the paper.
- Performs stream fusion
- Interfaces with CUBLAS for efficient versions of matrix multiplication etc.
- Analyzes memory requirements of programs before GPU code-generation and divides GPU programs further if the required memory is not available on the GPU. The individual smaller programs are then executed in serial and their results are combined.
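As a sketch of the balancing step described above (assuming the two parts run concurrently and the fitted models are linear; splitWork is a hypothetical helper, not code from the paper): both parts finish at the same time when aC*x + bC = aG*(N - x) + bG, i.e. x = (aG*N + bG - bC) / (aC + aG).

```haskell
-- Linear cost model fitted from training runs: time n = slope * n + intercept.
data Model = Model { slope :: Double, intercept :: Double }

-- Split work of size n between CPU and GPU so both finish together,
-- clamping the CPU share to [0, n]. Hypothetical helper, not from the paper.
splitWork :: Model -> Model -> Double -> (Double, Double)
splitWork cpu gpu n = (x, n - x)
  where
    x = max 0 (min n ((slope gpu * n + intercept gpu - intercept cpu)
                      / (slope cpu + slope gpu)))

main :: IO ()
main = print (splitWork (Model 2 1) (Model 1 4) 30)
  -- (11.0,19.0): the CPU is slower per element here, so it gets less work.
```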
- OpenCL specification
- CUDA by Example
- NVIDIA OpenCL programming guide
- Rolf's FAMØS article
- Michael & Joachim's thesis and paper
- SPJ Financial Contracts article
- Longstaff and Schwartz
- Coursera course on computational finance: https://class.coursera.org/compfinance-2012-001/class/index
- Eric Couffignal's dissertation http://eprints.maths.ox.ac.uk/927/1/eric_couffignals.pdf
- High-Performance Quasi-Monte Carlo Financial Simulation: FPGA vs. GPP vs. GPU
- Syntactic
- Deconstraining DSLs
- Embedded interpreters by Nick Benton
- Ken's master's dissertation
- Chalmers PFP course: http://www.cse.chalmers.se/edu/course/pfp/index.html
Funny side note: PFP is an abbreviation for both "Parallel Functional Programming" and "Probabilistic Functional Programming".