Barretenberg
The ZK-SNARK library at the core of Aztec
Prime field documentation

Barretenberg has its own implementation of finite field arithmetic. The implementation targets 254-bit (bn254, grumpkin) and 256-bit (secp256k1, secp256r1) fields. Internally the field is represented as a little-endian C-array of 4 uint64_t limbs. For 254-bit fields, the internal representation must be in the range $[0, 2p)$ (which we refer to as the coarse representation), while for 256-bit fields the internal representation is an arbitrary uint256_t.
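As an illustration, the limb layout can be modeled in Python (a sketch of the representation only; the actual implementation is C++):

```python
# Model of the internal representation: a little-endian array of 4 uint64_t limbs.
MASK_64 = (1 << 64) - 1

def to_limbs(x):
    """Split a 256-bit integer into 4 little-endian 64-bit limbs."""
    return [(x >> (64 * i)) & MASK_64 for i in range(4)]

def from_limbs(limbs):
    """Recombine 4 little-endian 64-bit limbs into an integer."""
    result = 0
    for i, v in enumerate(limbs):
        result |= v << (64 * i)
    return result

x = 0x0123456789abcdef_fedcba9876543210_0011223344556677_8899aabbccddeeff
assert from_limbs(to_limbs(x)) == x
assert to_limbs(x)[0] == 0x8899aabbccddeeff  # least-significant limb comes first
```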

Field arithmetic

Introduction to Montgomery form

We use Montgomery multiplication to speed up field multiplication. For an original element $ a \in \mathbb F_p$ the element is represented internally as $$ a⋅R\ mod\ p$$ where $R = 2^d\ mod\ p$. The chosen $d$ depends on the build configuration:

  1. $d=29⋅9=261$ for builds that don't support the uint128_t type (for example, the WASM build)
  2. $d=64⋅4=256$ for standard builds (x86_64).

The goal of using Montgomery form is to avoid heavy division modulo $p$. To compute a representative of element $$c = a⋅b\ mod\ p$$ we compute $$c⋅R = (a⋅R)⋅(b⋅R) / R\ mod\ p,$$ but we use an efficient division trick to avoid the naive modular division. Let's look into the standard 4⋅64 case:

  1. First, we compute the value $$c_r=c⋅R⋅R = aR⋅bR$$ in integers and get a value with 8 64-bit limbs
  2. Then we take the lowest limb of $c_r$ (i.e., $c_r[0]$) and multiply it by a special precomputed value $$r_{inv} = -1 ⋅ p^{-1}\ mod\ 2^{64}$$ As a result we get $$k = r_{inv}⋅ c_r[0]\ mod\ 2^{64}$$
  3. Next we update $c_r$ in integers by adding $k⋅p$: $$c_r += k⋅p$$ You might notice that the value of $c_r\ mod\ p$ hasn't changed, since we've added a multiple of the modulus. At the same time, if we look at the expression modulo $2^{64}$: $$c_r + k⋅p = c_r + c_r⋅r_{inv}⋅p = c_r + c_r⋅ (-1)⋅p^{-1}⋅p = c_r - c_r = 0\ mod\ 2^{64}.$$ The result is equivalent modulo $p$, but we zeroed out the lowest limb
  4. We perform the same operation for $c_r[1]$, but instead of adding $k⋅p$, we add $2^{64}⋅k⋅p$. In the implementation, instead of adding $k⋅ p$ to limbs of $c_r$ starting with zero, we just start with limb 1. This ensures that $c_r[1]=0$. We then perform the same operation for 2 more limbs.
  5. At this stage the array $c_r$ has the property that the first 4 limbs of the total 8 limbs are zero. So if we treat the 4 high limbs as a separate integer $c_{r.high}$, $$c_r = c_{r.high}⋅2^{256}=c_{r.high}⋅R\ mod\ p \Rightarrow c_{r.high} = c\cdot R\ mod\ p$$ and we can get the evaluation simply by taking the 4 high limbs of $c_r$.
  6. The previous step has reduced the intermediate value of $cR$ to range $[0,2p)$, so we must check if it is more than $p$ and subtract the modulus once if it overflows.

On a high level, what we are doing is iteratively adding a multiple of $p$ until the current bottom limb is zero, then shifting by a limb (amounting to dividing by $2^{64}$).
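The steps above can be modeled on big integers in Python (a minimal sketch, assuming the bn254 scalar field modulus for illustration; the real code operates on 64-bit limb arrays with carries):

```python
# Python model of the 4x64 Montgomery reduction described above.
# P is the bn254 scalar field modulus (an illustrative choice).
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617
R_INV_64 = (-pow(P, -1, 1 << 64)) % (1 << 64)  # r_inv = -p^{-1} mod 2^64

def montgomery_reduce(c_r):
    """Given c_r = aR * bR as an integer, return a*b*R mod p."""
    for i in range(4):
        limb = (c_r >> (64 * i)) & ((1 << 64) - 1)  # current lowest unreduced limb
        k = (limb * R_INV_64) & ((1 << 64) - 1)
        c_r += (k * P) << (64 * i)                  # add k*p*2^{64i} to zero out limb i
    result = c_r >> 256                             # take the 4 high limbs
    return result - P if result >= P else result    # final conditional subtraction

R = pow(2, 256, P)
a, b = 123456789, 987654321
aR, bR = a * R % P, b * R % P
assert montgomery_reduce(aR * bR) == a * b * R % P
```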

Bounds analysis

Why does this work? We present several versions of the analysis, for completeness.

  • Suppose both $aR$ and $bR$ are less than the modulus $p$ in integers, so $$aR\cdot bR \le (p-1)^2.$$ During each of the $k\cdot p$ addition rounds we can add at most $(2^{64}-1)p$ to the corresponding digits, so at most we add $(2^{256}-1)p$ and the total is $$aR\cdot bR + k_{0,1,2,3}p \le (p-1)^2+(2^{256}-1)p < 2\cdot 2^{256}p \Rightarrow c_{r.high} = \frac{aR\cdot bR + k_{0,1,2,3}p}{2^{256}} < 2p.$$
  • For our 256-bit fields, we cannot assume that $aR$ and $bR$ are less than the modulus $p$; we simply know that they are 256-bit numbers. Nonetheless, the same analysis shows that the output is less than $2^{256} + p -1$. This means that (conditionally) subtracting one copy of $p$ is enough to get us to the valid range of $[0, 2^{256})$.
  • For 254-bit fields (e.g. the BN-254 base and scalar fields) we can do even better by employing a simple trick. Note that 4 64-bit limbs allow 256 bits of storage. We relax the internal representation to use values in range $[0,2p)$. The addition, negation and subtraction operation logic doesn't change, we simply replace the modulus $p$ with $2p$, but the multiplication becomes more efficient. The multiplicands are in range $[0,2p)$, but we add multiples of modulus $p$ to reduce limbs, not $2p$. If we revisit the $c_r$ formula: $$aR\cdot bR + k_{0,1,2,3}p \le (2p-1)^2+(2^{256}-1)p = 2^{256}p+4p^2-5p+1 \Rightarrow$$ $$\Rightarrow c_{r.high} = \frac{aR\cdot bR + k_{0,1,2,3}p}{2^{256}} \le \frac{2^{256}p+4p^2-5p+1}{2^{256}}=p +\frac{4p^2 - 5p +1}{2^{256}}, 4p < 2^{256} \Rightarrow$$ $$\Rightarrow p +\frac{4p^2 - 5p +1}{2^{256}} < 2p$$ So we ended in the same range and we don't have to perform additional reductions.

N.B. In the code we refer to this form, when the limbs are only constrained to be in the range $[0,2p)$, as the coarse-representation.

Yuval reduction

For our 254-bit multiplication in WASM, we use a reduction technique found by Yuval. For a reference, please see this hackmd.

Recall that in standard Montgomery reduction, we zero out the lowest limb by adding a carefully chosen multiple of the modulus $p$. In particular, if we were to use standard Montgomery reduction given our limb-decomposition for WASM: given an accumulator $x = \sum_{i=0}^{n} \text{result}_i \cdot 2^{29i}$, we compute $k = \text{result}_0 \cdot (-p^{-1}) \mod 2^{29}$ and add $k \cdot p$ to $x$. This makes the lowest 29 bits zero (since $\text{result}_0 + k \cdot p_0 \equiv 0 \mod 2^{29}$), allowing us to "shift right" by discarding the zeroed limb.

Yuval's method takes a different approach. Instead of adding a multiple of $p$ to zero out the low bits, we directly compute the equivalent value after the divide by $2^{29}$ step. Given the same accumulator $x$, we want to find $x / 2^{29} \mod p$. We can rewrite this as: $$x / 2^{29} = (x - \text{result}_0) / 2^{29} + \text{result}_0 / 2^{29} \mod p.$$

The first term $(x - \text{result}_0) / 2^{29}$ is simply the higher limbs shifted down. The second term requires computing $\text{result}_0 \cdot 2^{-29} \mod p$, which we precompute as r_inv_wasm (stored in 9 limbs).

So instead of computing $k = \text{result}_0 \cdot (-p^{-1})$ and adding $k \cdot p$ (9 multiply-accumulates), we compute $\text{result}_0 \cdot \text{r\_inv\_wasm}$ and add it to the higher limbs (also 9 multiply-accumulates). Both approaches require the same number of operations, but Yuval's method avoids the need for a separate "zero out and shift" step: the shift is implicit in how we interpret the result.

In code, wasm_reduce_yuval implements this as:

result_1 += result_0_masked * wasm_r_inv[0] + (result_0 >> 29);
result_2 += result_0_masked * wasm_r_inv[1];
// ... and so on for result_3 through result_9

The term (result_0 >> 29) handles any overflow bits in result_0 beyond the lowest 29 bits, propagating them to result_1. After this operation, result_0 is effectively discarded, and result_1 through result_9 hold the Montgomery-reduced value.
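A single Yuval reduction step can be modeled on big integers (a sketch, assuming the bn254 scalar field modulus for illustration; the real code works on arrays of 29-bit limbs):

```python
# One Yuval reduction step, modeled on a Python integer accumulator.
# P is the bn254 scalar field modulus (an illustrative choice).
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617
R_INV = pow(2, -29, P)  # precomputed 2^{-29} mod p

def yuval_step(x):
    """Return a value congruent to x * 2^{-29} (mod P).

    Instead of adding a multiple of P to zero out the low 29 bits and then
    shifting, we shift immediately and add result_0 * (2^{-29} mod P).
    """
    result_0 = x & ((1 << 29) - 1)       # lowest 29 bits
    return (x >> 29) + result_0 * R_INV  # (x - result_0)/2^29 + result_0 * 2^{-29}

x = 123456789123456789123456789
assert yuval_step(x) % P == x * R_INV % P  # congruent to x / 2^29 mod P
```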

Structure of WASM Montgomery multiplication

In 254-bit WASM multiplication, the full Montgomery reduction requires 9 limb-reductions (to divide by $2^{261} = 2^{29 \cdot 9}$). We apply Yuval's method for the first 8 reductions, and standard Montgomery reduction for the 9th (final) reduction.

Why not use Yuval for all 9? The key issue is that Yuval's method takes a 10-limb input and produces a 10-limb output (the reduced value spans 9 limbs, shifted up by one position). If we used Yuval for the 9th reduction, we would end up with a 10-limb result instead of the desired 9-limb result. The standard Montgomery reduction (wasm_reduce), by contrast, takes 9 limbs and produces 9 limbs (with the lowest limb zeroed and discardable), giving us exactly the 9-limb output we need.
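The 8-plus-1 structure can be modeled end-to-end (a congruence check only, with the bn254 scalar field modulus assumed for illustration; the actual wasm_reduce_yuval/wasm_reduce functions track 29-bit limb arrays and carries):

```python
# Model of the full WASM Montgomery reduction: divide by 2^261 mod p
# using 8 Yuval steps followed by 1 standard Montgomery step.
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617
D = 29
R_INV = pow(2, -D, P)                          # 2^{-29} mod p
NEG_P_INV = (-pow(P, -1, 1 << D)) % (1 << D)   # -p^{-1} mod 2^29

def yuval_step(x):
    # Shift down and add (low limb) * (2^{-29} mod p)
    low = x & ((1 << D) - 1)
    return (x >> D) + low * R_INV

def standard_step(x):
    # Zero out the low 29 bits with a multiple of p, then shift exactly
    k = (x * NEG_P_INV) & ((1 << D) - 1)
    return (x + k * P) >> D

def wasm_montgomery_reduce(x):
    """Return a value congruent to x / 2^261 (mod P)."""
    for _ in range(8):
        x = yuval_step(x)
    return standard_step(x)

R = pow(2, 9 * D, P)  # WASM Montgomery radix 2^261 mod p
a, b = 123456789, 987654321
res = wasm_montgomery_reduce((a * R % P) * (b * R % P))
assert res % P == a * b * R % P  # result is congruent to abR mod p
```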

Bounds analysis

We must verify that the output is in $[0, 2p)$ (the coarse representation) without requiring an additional subtraction of $p$.

After the 9 multiply-adds, we have $aR \cdot bR$ stored across 17 limbs. Since both $aR$ and $bR$ are in $[0, 2p)$, this product is at most $4p^2$.

After 8 Yuval reductions and 1 standard reduction, we have computed: $$\frac{aR \cdot bR + k_0 \cdot r_{inv} + k_1 \cdot r_{inv} + \cdots + k_7 \cdot r_{inv} + k_8 \cdot p}{2^{261}}$$

where each $k_i$ is the masked low 29 bits at reduction step $i$. By construction:

  • $k_0 < 2^{29}$
  • $k_1 < 2^{58}$ (since it includes carries from the previous step)
  • For $i < 8$, we have $k_i < 2^{29(i+1)}$
  • The sum $\sum_{i=0}^{7} k_i < 2^{232}$ (geometric series)
  • $k_8 < 2^{261} - 2^{232}$

Since $r_{inv} = 2^{-29} \mod p < p$, the total added via Yuval reductions is bounded by $(2^{232} - 1) \cdot p$. The final standard reduction adds at most $(2^{261} - 2^{232}) \cdot p$.

Therefore, the numerator is bounded by: $$4p^2 + (2^{232} - 1) \cdot p + (2^{261} - 2^{232}) \cdot p < 4p^2 + 2^{261} \cdot p$$

Dividing by $2^{261}$: $$\frac{4p^2 + 2^{261} \cdot p}{2^{261}} = p + \frac{4p^2}{2^{261}}$$

For 254-bit primes, $4p < 2^{256}$, so: $$\frac{4p^2}{2^{261}} = \frac{4p \cdot p}{2^{261}} < \frac{2^{256} \cdot p}{2^{261}} = \frac{p}{32}$$

Thus the result is less than $p + p/32 < 2p$, which is within the coarse representation range $[0, 2p)$. No additional reduction is required.

Converting to and from Montgomery form

Obviously we want to avoid standard-form division when converting between forms, so we use Montgomery multiplication itself to perform the conversion. Note that a value $a\ mod\ p$ is the Montgomery form of $a\cdot R^{-1}\ mod\ p$, so if we want to get $aR$ from it, we need to multiply it by the Montgomery form of $R\ mod\ p$, which is $R\cdot R\ mod\ p$. Using Montgomery multiplication we compute

$$a \cdot R^2 / R = a\cdot R\ mod\ p$$

To convert from Montgomery form into standard form we multiply the element in Montgomery form by 1:

$$ aR \cdot 1 / R = a\ mod\ p$$
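Both conversions can be sketched in Python (Montgomery multiplication is modeled here with a modular inverse rather than the limb-wise algorithm; the bn254 scalar field modulus is assumed for illustration):

```python
# Converting to/from Montgomery form via Montgomery multiplication itself.
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617
R = pow(2, 256, P)
R_SQUARED = R * R % P  # the precomputed r_squared parameter

def mont_mul(x, y):
    """Model of Montgomery multiplication: x * y / R mod p."""
    return x * y * pow(R, -1, P) % P

def to_montgomery(a):
    return mont_mul(a, R_SQUARED)  # a * R^2 / R = a*R mod p

def from_montgomery(a_mont):
    return mont_mul(a_mont, 1)     # aR * 1 / R = a mod p

a = 42
assert to_montgomery(a) == a * R % P
assert from_montgomery(to_montgomery(a)) == a
```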

Architecture details

For each multiplication or squaring primitive there are 3 implementations:

  1. Generic 64-bit implementation for when the uint128_t type is available (i.e., when 64-bit multiplication with a 128-bit result is efficient)
  2. Assembly 64-bit implementation (Intel ADX and no Intel ADX versions)
  3. Implementation targeting WASM

The generic implementation has 2 purposes:

  1. Building barretenberg on platforms we haven't targeted in the past (new ARM-based Macs, for example)
  2. Compile-time computation of constant expressions, since we can't use the assembly implementation for those.

The assembly implementation for x86_64 is optimized. There are 2 versions:

  1. General x86_64 implementation that uses 64-bit registers. The squaring operation simply reuses multiplication, both for simplicity and because the original dedicated squaring implementation was quite buggy.
  2. Implementation using Intel ADX. It allows two addition-with-carry operations (adox and adcx) to run simultaneously on separate execution units of the same core, which almost halves the time spent adding up the results of uint64_t multiplications.

Implementation for WASM:

We use 9 29-bit limbs for computation (storage stays the same) and we change the Montgomery form. The reason for a different architecture is that WASM doesn't have:

  1. 128-bit result 64*64 bit multiplication
  2. 64-bit addition with carry

In the past we implemented a version with 32-bit limbs, but as a result, when accumulating limb products we always had to split the 64-bit results of 32-bit multiplications back into 32-bit chunks. Had we not, adding two 64-bit products could have overflowed, losing the carry and producing an incorrect result. There were 2 issues with this:

  1. This resulted in a lot of masking operations
  2. We couldn't use more efficient algorithms for squaring, because doubling intermediate products would once again overflow.

Switching to 9 29-bit limbs increased the number of multiplications from 136 to 171. However, since the product of 2 limbs is 58 bits, we can safely accumulate 64 of those before we have to reduce. This allowed us to get rid of a lot of intermediate masking operations, shifts and additions, so the resulting computation turned out to be more efficient.
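The headroom claim can be checked directly (a sketch):

```python
# A product of two 29-bit limbs fits in 58 bits, so a 64-bit accumulator
# has 6 bits of headroom: it can absorb 2^6 = 64 such products before
# a reduction is needed.
LIMB_MAX = (1 << 29) - 1
PRODUCT_MAX = LIMB_MAX * LIMB_MAX

assert PRODUCT_MAX < 1 << 58
assert 64 * PRODUCT_MAX < 1 << 64  # 64 products accumulate without overflow
```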

Interaction of field object with other objects

Most of the time the field type is used with uint64_t or uint256_t in our codebase, but there is general logic for how we generate field elements from integers:

  1. Converting from a signed int takes the sign into account: the absolute value is converted to Montgomery form, and the result is negated if the original value was negative
  2. Unsigned integers (<= 64 bits) are simply converted to Montgomery form
  3. uint256_t and uint512_t:
    1. Truncate to 256 bits
    2. Subtract the modulus until the value is within the field
    3. Convert to Montgomery form

Conversion from field elements exists only to unsigned integers and bools. The value is converted from Montgomery form, and the appropriate number of lowest bits is used to initialize the result.

N.B. Functions for converting from uint256_t and back are not bijective, since values $ \ge p$ will be reduced.
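The uint256_t path can be sketched as follows (the bn254 scalar field modulus is assumed for illustration, and `field_from_uint256` is a hypothetical name, not the library's API):

```python
# Model of constructing a field element from a (possibly oversized) integer:
# truncate, reduce below the modulus, then convert to Montgomery form.
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617
R = pow(2, 256, P)

def field_from_uint256(x):
    x &= (1 << 256) - 1   # 1. truncate to 256 bits
    while x >= P:         # 2. subtract the modulus until the value is within the field
        x -= P
    return x * R % P      # 3. convert to Montgomery form

# Values >= p are reduced, so the mapping is not bijective:
assert field_from_uint256(P + 5) == field_from_uint256(5)
```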

Field parameters

The field template is instantiated with field parameter classes, for example, class bb::Bn254FqParams. Each such class contains at least the modulus (in 64-bit and 29-bit limb form), r_inv (used for efficient reductions) and 2 versions of r_squared used for converting to Montgomery form (a 64-bit and a WASM/29-bit version). r_squared and other parameters (such as cube_root, primitive_root and coset_generator) are defined separately for WASM, because the values represent an element already in Montgomery form, and the WASM Montgomery form differs.

Helpful python snippets

Parse field parameters out of a parameter class (recovers, but doesn't check, the endomorphism parameters; checks the correctness of everything else)

import re

def parse_field_params(s):
    def parse_number(line):
        """Expects a string without whitespaces"""
        line = line.replace('U', '').replace('L', '')  # Clear away all postfixes
        if line.find('0x') != -1:  # We have to parse hex
            value = int(line, 16)
        else:
            value = int(line)
        return value

    def recover_single_value(name):
        nonlocal s
        index = s.find(name)
        if index == -1:
            raise ValueError("Couldn't find value with name " + name)
        eq_position = s[index:].find('=')
        line_end = s[index:].find(';')
        return parse_number(s[index + eq_position + 1:index + line_end])

    def recover_single_value_if_present(name):
        nonlocal s
        index = s.find(name)
        if index == -1:
            return None
        eq_position = s[index:].find('=')
        line_end = s[index:].find(';')
        return parse_number(s[index + eq_position + 1:index + line_end])

    def recover_array(name):
        nonlocal s
        index = s.find(name)
        number_of_elements = int(re.findall(r'(?<=' + name + r'\[)\d+', s)[0])
        start_index = s[index:].find('{')
        end_index = s[index:].find('}')
        all_values = s[index + start_index + 1:index + end_index]
        result = [parse_number(x) for (i, x) in enumerate(all_values.split(',')) if i < number_of_elements]
        return result

    def recover_multiple_arrays(prefix):
        chunk_names = re.findall(prefix + r'_\d+', s)
        recovered = dict()
        for name in chunk_names:
            recovered[name] = recover_array(name)
        return recovered

    def recover_element_from_parts(prefix, shift):
        """Recover a field element from its parts"""
        chunk_names = re.findall(prefix + r'_\d+', s)
        val_dict = dict()
        for name in chunk_names:
            val_dict[int(name[len(prefix) + 1:])] = recover_single_value(name)
        result = 0
        for i in range(len(val_dict)):
            result |= val_dict[i] << (i * shift)
        return result

    def reconstruct_field_from_4_parts(arr):
        result = 0
        for i, v in enumerate(arr):
            result |= v << (i * 64)
        return result

    parameter_dictionary = dict()
    parameter_dictionary['modulus'] = recover_element_from_parts('modulus', 64)
    parameter_dictionary['r_squared'] = recover_element_from_parts('r_squared', 64)
    parameter_dictionary['cube_root'] = recover_element_from_parts('cube_root', 64)
    parameter_dictionary['primitive_root'] = recover_element_from_parts('primitive_root', 64)
    parameter_dictionary['modulus_wasm'] = recover_element_from_parts('modulus_wasm', 29)
    parameter_dictionary['r_squared_wasm'] = recover_element_from_parts('r_squared_wasm', 64)
    parameter_dictionary['cube_root_wasm'] = recover_element_from_parts('cube_root_wasm', 64)
    parameter_dictionary['primitive_root_wasm'] = recover_element_from_parts('primitive_root_wasm', 64)
    parameter_dictionary = {**parameter_dictionary, **recover_multiple_arrays('coset_generators')}
    parameter_dictionary = {**parameter_dictionary, **recover_multiple_arrays('coset_generators_wasm')}
    parameter_dictionary['endo_g1_lo'] = recover_single_value_if_present('endo_g1_lo')
    parameter_dictionary['endo_g1_mid'] = recover_single_value_if_present('endo_g1_mid')
    parameter_dictionary['endo_g1_hi'] = recover_single_value_if_present('endo_g1_hi')
    parameter_dictionary['endo_g2_lo'] = recover_single_value_if_present('endo_g2_lo')
    parameter_dictionary['endo_g2_mid'] = recover_single_value_if_present('endo_g2_mid')
    parameter_dictionary['endo_minus_b1_lo'] = recover_single_value_if_present('endo_minus_b1_lo')
    parameter_dictionary['endo_minus_b1_mid'] = recover_single_value_if_present('endo_minus_b1_mid')
    parameter_dictionary['endo_b2_lo'] = recover_single_value_if_present('endo_b2_lo')
    parameter_dictionary['endo_b2_mid'] = recover_single_value_if_present('endo_b2_mid')
    assert(parameter_dictionary['modulus'] == parameter_dictionary['modulus_wasm'])  # Check modulus representations are equivalent
    modulus = parameter_dictionary['modulus']
    r_wasm_divided_by_r_regular = 2**(261 - 256)
    assert(parameter_dictionary['r_squared'] == pow(2, 512, modulus))  # Check r_squared
    assert(parameter_dictionary['r_squared_wasm'] == pow(2, 9 * 29 * 2, modulus))  # Check r_squared_wasm
    assert(parameter_dictionary['cube_root'] * r_wasm_divided_by_r_regular % modulus == parameter_dictionary['cube_root_wasm'])
    assert(pow(parameter_dictionary['cube_root'] * pow(2, -256, modulus), 3, modulus) == 1)  # Check cube root
    assert(pow(parameter_dictionary['cube_root_wasm'] * pow(2, -29 * 9, modulus), 3, modulus) == 1)  # Check cube root for wasm
    assert(parameter_dictionary['primitive_root'] * r_wasm_divided_by_r_regular % modulus == parameter_dictionary['primitive_root_wasm'])  # Check primitive roots are equivalent
    for i in range(8):
        regular_coset_generator = reconstruct_field_from_4_parts([parameter_dictionary[f'coset_generators_{j}'][i] for j in range(4)])
        wasm_coset_generator = reconstruct_field_from_4_parts([parameter_dictionary[f'coset_generators_wasm_{j}'][i] for j in range(4)])
        assert(regular_coset_generator * r_wasm_divided_by_r_regular % modulus == wasm_coset_generator)
    return parameter_dictionary

Convert value from python to string for easy addition to bb's tests:

def to_ff(value):
    print("FF(uint256_t{" + ','.join(["0x%xUL" % ((value >> (i * 64)) & ((1 << 64) - 1)) for i in range(4)]) + "})")