
You should also look at the literature folder of the main repository.

[[TOC]]

## Functional vector/array DSLs and libraries

### Accelerate

### Nikola

### Data Parallel Haskell

### Repa

Repa provides "REgular, shape-polymorphic, Parallel Arrays" and is documented in three research papers:

  • "Regular, shape-polymorphic, parallel arrays in Haskell" - ICFP 2010

    Introduces the repa system

  • "Efficient Parallel Stencil Convolution in Haskell" - Haskell Symposium 2011

    Extends repa with stencil operations

  • "Guiding Parallel Array Fusion with Indexed Types" - Haskell Symposium 2012

    Allows the library-user to select between several different array representations
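To give a feel for the representation-indexed API from the third paper, here is a minimal sketch (our own example, assuming repa 3.x; `U` is the manifest unboxed representation, `D` the delayed one produced by `R.map`):

```haskell
import Data.Array.Repa as R

-- Double every element: R.map yields a delayed (D) array, and
-- computeP forces it back to an unboxed (U) array in parallel.
doubleAll :: Array U DIM1 Double -> IO (Array U DIM1 Double)
doubleAll xs = computeP (R.map (* 2) xs)

main :: IO ()
main = do
  let xs = fromListUnboxed (Z :. 10) [0 .. 9] :: Array U DIM1 Double
  ys <- doubleAll xs
  print (toList ys)
```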

### Feldspar

### Obsidian

This article documents push arrays: http://dl.acm.org/citation.cfm?id=2103740
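As background on the idea, here is a plain-Haskell sketch of our own (not Obsidian's actual types): a pull array is an indexing function, while a push array is a loop that feeds elements to a writer, which makes operations like concatenation cheap:

```haskell
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Pull array: a length plus an indexing function.
data Pull a = Pull Int (Int -> a)

-- Push array: a length plus a loop that feeds elements to a writer.
data Push m a = Push Int ((Int -> a -> m ()) -> m ())

-- Converting pull to push just iterates the index space.
push :: Monad m => Pull a -> Push m a
push (Pull n ix) = Push n (\wr -> mapM_ (\i -> wr i (ix i)) [0 .. n - 1])

-- Concatenation is cheap for push arrays: run both loops and offset the
-- writer of the second; a pull-array append would instead need a
-- conditional inside every index computation.
appendP :: Monad m => Push m a -> Push m a -> Push m a
appendP (Push n1 f) (Push n2 g) =
  Push (n1 + n2) (\wr -> f wr >> g (\i a -> wr (i + n1) a))

-- Materialize a push array in IO by collecting (index, value) pairs.
freeze :: Push IO a -> IO [(Int, a)]
freeze (Push _ f) = do
  ref <- newIORef []
  f (\i a -> modifyIORef' ref ((i, a) :))
  reverse <$> readIORef ref

main :: IO ()
main = do
  let xs = push (Pull 3 (* 10)) :: Push IO Int
  print =<< freeze (appendP xs xs)
```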

### Data.Vector

### EmbArBB and HArBB

EmbArBB is a thin wrapper around Intel's ArBB that exposes a small DSL, but it still involves a lot of clutter. Not that promising in itself, but ArBB might be worth taking a look at.

HArBB is an ArBB back-end for Accelerate. It does not support all of Accelerate's features, and general folds are implemented efficiently only for certain operators (e.g. addition, multiplication and xor), not for general lambda expressions.

### Meta-Par

### hmatrix
### Copperhead

Look at Sections 4.2–4.4 of the Copperhead tech report for considerations and references on mapping nested data parallelism to CUDA (a sketch of the general flattening idea follows below).

In this context, the Copperhead tech report references:

[1] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC '09: Proc. Conference on High Performance Computing Networking, Storage and Analysis, pages 1–11. ACM, 2009.
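For background, here is a minimal Haskell illustration (our own, not Copperhead's implementation) of the flat, segmented representation that nested data parallelism is commonly compiled to: a nested vector becomes one flat data vector plus a vector of segment lengths, so a flat parallel machine can process all segments in a single pass.

```haskell
-- Segmented representation: segment lengths plus flattened data.
data Seg a = Seg { segLens :: [Int], segData :: [a] }

fromNested :: [[a]] -> Seg a
fromNested xss = Seg (map length xss) (concat xss)

-- The nested reduction 'map sum' becomes a single segmented fold
-- over the flat data vector.
segSums :: Num a => Seg a -> [a]
segSums (Seg ls xs) = go ls xs
  where
    go []       _  = []
    go (n : ns) ys = let (h, t) = splitAt n ys
                     in sum h : go ns t

main :: IO ()
main = print (segSums (fromNested [[1, 2, 3], [], [4, 5]]))
-- prints [6,0,9]
```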

### NESL

## Heterogeneous computing and other DSLs/libraries

- Intel Array Building Blocks
- Microsoft Accelerator
- Acceleware
- Copperhead (GPU programming in Python)
- Brook and BrookGPU
- Merge

### Qilin

A system for C++ that compiles to both CUDA and Intel TBB. The main contribution is to adaptively decide how much of a computation is scheduled on the CPU and how much on the GPU, based on the input size N. They do this by performing a training run with different input sizes for both the CPU and the GPU version and fitting linear functions to the measurements (x = input size, y = running time). Given a concrete problem instance of size N, the optimal division of labour can then be computed from these two functions.
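A minimal sketch of that partitioning step (our own names, not Qilin's API), assuming the fitted models are linear in the input size: the CPU's share is chosen so that both devices finish at the same time.

```haskell
-- Fitted linear model of running time: time(n) = intercept + slope * n.
data Fit = Fit { intercept :: Double, slope :: Double }

-- Solving  intercept_cpu + slope_cpu * (beta * n)
--       == intercept_gpu + slope_gpu * ((1 - beta) * n)
-- for beta gives the split where CPU and GPU finish simultaneously;
-- clamping to [0, 1] handles inputs one device should process alone.
cpuShare :: Fit -> Fit -> Double -> Double
cpuShare cpu gpu n =
  let beta = (intercept gpu + slope gpu * n - intercept cpu)
           / ((slope cpu + slope gpu) * n)
  in max 0 (min 1 beta)

main :: IO ()
main =
  -- Hypothetical fits: the CPU costs more per element, the GPU has a
  -- larger fixed start-up cost, so the GPU share grows with n.
  let cpu = Fit 1.0 0.020
      gpu = Fit 25.0 0.002
  in mapM_ (\n -> putStrLn (show n ++ " elements: CPU share = "
                            ++ show (cpuShare cpu gpu n)))
           [1.0e3, 1.0e4, 1.0e5]
```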

Other notes:

- Has a method of dividing any program into two parts that can be executed in parallel, such that the results can be combined (one part for the CPU, another for the GPU); this method is not described in the paper.
- Performs stream fusion.
- Interfaces with CUBLAS for efficient versions of matrix multiplication, etc.
- Analyzes the memory requirements of programs before GPU code generation and splits GPU programs further if the required memory is not available on the GPU. The resulting smaller programs are then executed serially and their results combined.

## GPU programming

- OpenCL specification
- CUDA by Example
- NVIDIA OpenCL programming guide

## Finance

## DSL construction and infrastructure

## Miscellaneous

### Coursera: Introduction to Computational Finance and Financial Econometrics

https://class.coursera.org/compfinance-2012-001/class/index

### Chalmers' Course on Parallel Functional Programming

http://www.cse.chalmers.se/edu/course/pfp/index.html

Funny side note: PFP is an abbreviation for both "Parallel Functional Programming" and "Probabilistic Functional Programming".

### Survey "VectorMARK" in progress
