Suyash Bagad
Research Team
Suyash Bagad
Invisible Garden 2024
Ingonyama
Nooo... to make my ZK prover faster I need to accelerate MSM and NTT for the BN254 curve. I need GPU/CPU coordination and I'll need to learn CUDA while my prover is written in Rust. Ughhh
Modular
arithmetic
NTT
Merkle
trees
MSM
G1 & G2
ECNTT
Hashes
Vector
ops
EC Group operations
Linear
Algebra
Polynomial API
(Univariate and Multivariate)
C++
Rust
Go
Credit: Karthik Inbasekar
Host
poly 1
poly 2
poly 10
Device
Host
poly 1
poly 2
poly 10
Device
Host
poly 1
poly 2
poly 10
Device
Host
poly 1
poly 2
poly 10
Device
Host
poly 1
poly 2
poly 10
Device
Host
poly 1
poly 2
poly 10
Device
Host
poly 1
poly 2
poly 10
Device
Can we perform computation and
data transfer simultaneously in icicle?
Host
poly 1
poly 2
poly 10
Device
One for compute,
one for data transfer
poly 3
Host
poly 1
poly 2
poly 10
Device
poly 3
Host
poly 1
poly 2
poly 10
Device
poly 3
Host
poly 1
poly 2
poly 10
Device
poly 3
Host
poly 1
poly 2
poly 10
Device
poly 3
Host
poly 1
poly 2
poly 10
Device
poly 3
Host
poly 1
poly 2
poly 10
Device
poly 3
How did we read and write simultaneously?
Device
New input
Output
Device
New input
Output
Device
New input
Output
Device
New input
Output
Device
New input
Output
Device
New input
Output
// Entry point: benchmarks batched in-place NTTs while overlapping device
// compute with host<->device data transfers on three dedicated streams
// (one for compute, one for host-to-device copies, one for device-to-host).
int main(int argc, char* argv[])
{
  try_load_and_set_backend_device(argc, argv);

  // Set these parameters to match the desired NTT size and batch size.
  const unsigned log_ntt_size = 20;
  const unsigned batch_size = 16;
  scalar_t basic_root = scalar_t::omega(log_ntt_size);
  const unsigned ntt_size = 1 << log_ntt_size;
  std::cout << "log NTT size: " << log_ntt_size << std::endl;
  std::cout << "Batch size: " << batch_size << std::endl;

  // Create separate streams so data transfers can overlap kernel execution.
  icicleStreamHandle stream_compute, stream_h2d, stream_d2h;
  ICICLE_CHECK(icicle_create_stream(&stream_compute));
  ICICLE_CHECK(icicle_create_stream(&stream_h2d));
  ICICLE_CHECK(icicle_create_stream(&stream_d2h));

  // Initialize the NTT twiddle-factor domain.
  std::cout << "Init NTT domain" << std::endl;
  auto ntt_init_domain_cfg = default_ntt_init_domain_config();
  // CUDA-backend-specific flag for init_domain (fast-twiddles mode).
  ConfigExtension backend_cfg_ext;
  backend_cfg_ext.set(CudaBackendConfig::CUDA_NTT_FAST_TWIDDLES_MODE, true);
  ntt_init_domain_cfg.ext = &backend_cfg_ext;
  ICICLE_CHECK(bn254_ntt_init_domain(&basic_root, &ntt_init_domain_cfg));

  std::cout << "Concurrent Download, Upload, and Compute In-place NTT" << std::endl;
  // Transfers are chunked into blocks so the download/upload of one device
  // buffer can be pipelined while the other buffer is being transformed.
  int nof_blocks = 32;
  int block_size = ntt_size * batch_size / nof_blocks; // elements per chunk
  // Fix: the original log labelled the element count as "Bytes"; report both.
  std::cout << "Number of blocks: " << nof_blocks << ", block size: " << block_size << " elements ("
            << block_size * sizeof(scalar_t) << " Bytes)" << std::endl;

  // Host-side staging buffers.
  // NOTE(review): these are allocated with new[], i.e. pageable memory — the
  // original comment called them "pinned". On CUDA, async copies only truly
  // overlap with compute when host memory is page-locked; confirm whether the
  // icicle backend pins/stages these internally, or switch to a pinned host
  // allocation if the runtime exposes one.
  scalar_t* h_inp[2];
  scalar_t* h_out[2];
  for (int i = 0; i < 2; i++) {
    h_inp[i] = new scalar_t[ntt_size * batch_size];
    h_out[i] = new scalar_t[ntt_size * batch_size];
  }

  // Two on-device in-place vectors: while one is being transformed, the other
  // is drained to the host and refilled with the next input.
  scalar_t* d_vec[2];
  for (int i = 0; i < 2; i++) {
    ICICLE_CHECK(icicle_malloc((void**)&d_vec[i], sizeof(scalar_t) * ntt_size * batch_size));
  }

  // Initialize host input data.
  initialize_input(ntt_size, batch_size, h_inp[0]);
  initialize_input(ntt_size, batch_size, h_inp[1]);

  // Fix: upload the initial inputs before entering the pipeline. Previously
  // the first iterations ran the NTT on uninitialized device memory, because
  // d_vec[*] was only ever written by the steady-state H2D copies inside the
  // run loop (which refill a buffer only after its results were downloaded).
  for (int i = 0; i < 2; i++) {
    ICICLE_CHECK(icicle_copy_async(d_vec[i], h_inp[i], sizeof(scalar_t) * ntt_size * batch_size, stream_h2d));
  }
  ICICLE_CHECK(icicle_stream_synchronize(stream_h2d));

  // NTT configuration: asynchronous, in-place on device, on the compute stream.
  NTTConfig<scalar_t> config_compute = default_ntt_config<scalar_t>();
  config_compute.batch_size = batch_size;
  config_compute.are_inputs_on_device = true;
  config_compute.are_outputs_on_device = true;
  config_compute.is_async = true;
  config_compute.stream = stream_compute;
  // Backend-specific config extension: select the mixed-radix NTT algorithm.
  ConfigExtension ntt_cfg_ext;
  ntt_cfg_ext.set(CudaBackendConfig::CUDA_NTT_ALGORITHM, CudaBackendConfig::NttAlgorithm::MixedRadix);
  config_compute.ext = &ntt_cfg_ext;

  for (int run = 0; run < 10; run++) {
    // Ping-pong: transform one device vector while transferring the other.
    int vec_compute = run % 2;
    int vec_transfer = (run + 1) % 2;
    std::cout << "Run: " << run << std::endl;
    std::cout << "Compute Vector: " << vec_compute << std::endl;
    std::cout << "Transfer Vector: " << vec_transfer << std::endl;
    START_TIMER(inplace);
    // Launch the in-place NTT asynchronously on the compute stream.
    bn254_ntt(d_vec[vec_compute], ntt_size, NTTDir::kForward, &config_compute, d_vec[vec_compute]);
    // Upload to the device is delayed by one block relative to download from
    // the device: a chunk is overwritten only after its result has been read
    // back, preserving write-after-read ordering on the transfer buffer.
    for (int i = 0; i <= nof_blocks; i++) {
      if (i < nof_blocks) {
        // Copy result chunk i from device to host.
        ICICLE_CHECK(icicle_copy_async(
          &h_out[vec_transfer][i * block_size], &d_vec[vec_transfer][i * block_size], sizeof(scalar_t) * block_size,
          stream_d2h));
      }
      if (i > 0) {
        // Refill the previously-drained chunk with the next input.
        ICICLE_CHECK(icicle_copy_async(
          &d_vec[vec_transfer][(i - 1) * block_size], &h_inp[vec_transfer][(i - 1) * block_size],
          sizeof(scalar_t) * block_size, stream_h2d));
      }
      // Synchronize both transfer streams at the end of each chunk to keep the
      // one-block offset between download and upload (data integrity).
      ICICLE_CHECK(icicle_stream_synchronize(stream_d2h));
      ICICLE_CHECK(icicle_stream_synchronize(stream_h2d));
    }
    // Wait for the asynchronous NTT to finish before ending the timed run
    // (the buffer it wrote becomes next run's transfer buffer).
    ICICLE_CHECK(icicle_stream_synchronize(stream_compute));
    END_TIMER(inplace, "Concurrent In-Place NTT");
  }

  // Clean-up: release device buffers, host buffers, and all three streams.
  for (int i = 0; i < 2; i++) {
    ICICLE_CHECK(icicle_free(d_vec[i]));
    delete[] (h_inp[i]);
    delete[] (h_out[i]);
  }
  ICICLE_CHECK(icicle_destroy_stream(stream_compute));
  ICICLE_CHECK(icicle_destroy_stream(stream_d2h));
  ICICLE_CHECK(icicle_destroy_stream(stream_h2d));
  return 0;
}
// Demonstrates arithmetic with ICICLE's Polynomial API. All operators take
// their operands by reference, so each use borrows a polynomial rather than
// consuming it, allowing p1/p2 to be reused across expressions.
// Given polynomials p1 and p2:
// Addition and multiplication: (p1 + p2)^2
let poly_sum_squared = &(&p1 + &p2) * &(&p1 + &p2);
// Subtraction: (p1 - p2)^2
let poly_diff_squared = &(&p1 - &p2) * &(&p1 - &p2);
// Check the identity: (p1 + p2)^2 + (p1 - p2)^2 = 2 * (p1^2 + p2^2)
let identity1_left = &poly_sum_squared + &poly_diff_squared;
// NOTE(review): assumes `constant_two` is the constant polynomial 2 — it is
// defined outside this snippet; confirm at the call site.
let identity1_right = &(&(&p1 * &p1) + &(&p2 * &p2)) * &constant_two;
// Check the identity: (p1 + p2)^2 - (p1 - p2)^2 = 4 * p1 * p2
let identity2_left = &poly_sum_squared - &poly_diff_squared;
// NOTE(review): assumes `constant_four` is the constant polynomial 4 — it is
// defined outside this snippet; confirm at the call site.
let identity2_right = &(&p1 * &p2) * &constant_four;
v1
Accelerated primitives
v2
Polynomial API
v3
Multi-platform
By Suyash Bagad