## Programmable Logic Matrices

The 1801 series microprocessors are mainly built around the one programmable
logic matrix - PLM. This matrix combines the horizontal microcode, the state
machine control and the operation decode logic and can be considered as simple
combinatorial functions of input logic vector X providing the output vector Y:

```
y[m:0] = f(x[n:0])
```

The state is defined as the address of "microinstruction", the some number
of least significant bits of the input vector is the micro address field
and the output vector contains the new value to be assigned to the micro
address register - "next address" field". These next address bits might be
spread over the output vector bits due to the bit order is defined in the
reverse engineering process, and it is not always obvious where these bits are.
The sixteen most significant bits of the input vector is always the current
PDP-11 instruction opcode.

The matrix encompasses the logical functions in the disjunctive normal form.
There is a set of products compared against the input vector with masking
and the matched products are merged to the output vector with logical OR.

The table presents the matrix parameters for the processors of the 1801 series:

| Model    | Input  | Output | Address | Products |
|----------|--------|--------|---------|----------|
| 1801BM1A | 31     | 34     | 7       | 248      |
| 1801BM1G | 31     | 34     | 7       | 249      |
| 1801BM2  | 28     | 37     | 6       | 198      |

## Matrix routine

The matrix routine implements the matrix function. The supported width
of input vector is limited by 32 bits and the supported width of output
vector is limited by 64 bits, the prototype of the routine looks like:

```
uint64_t f(uint32_t x);
```

The main algorithm is to check all the products and to OR the matching ones:

```
for each product { if product matches input { add to result } }
```

There are some straightforward optimizations possible. Because product masked
matching is the logical XOR and AND, it is possible to use SIMD
instructions to handle multiple vectors in one operation. The SSE2
implementation handles four vectors per product loop iteration,
AVX2 implementation can handle eight vectors per iteration.

Another optimization is to split the input vector values in the multiple
groups, using the values of some selected bits. Then we can check all the
products and assign only matching ones to each group. At the calculation
time, we check the selected bit values and decide the products for which group
should be looped to calculate the results. Because the products that do not
match to this given vector are excluded (missing in the array of products
for this group) the number of loop iterations is lesser.

For example, we have selected the four least significant bits as a group index,
and have built (once at the initialization time) the sixteen arrays of products
matching with group indices 0..15. The input vectors xxxx0 will use
the products of group 0 to calculate the result, the vectors xxxx5 will use
group 5 and so on.

The choice of the most optimal bits is not so straightforward,
the code implements checking all possible bit combinations for the given
matrix and chooses the combination producing the tree (set of product arrays)
of the minimum height.

## Microcode Analyzing Application

The sample application provides testing, scanning matrices for specified
input vector patterns and performance measurement means.

There is the reference matrix for 1801BM1A is implemented in the very
straightforward fashion and this one was proven with Verilog simulation while
documenting the microcode execution. The new matrix routine implementations
including SIMD and tree optimized ones can be directly compared with proven
reference, this functionality is enabled with --match key and provided for
1801BM1A matrix only.

The --speed key enables the performance measurement tests, all input range is
calculated and the resulting time is reported. There are some results obtained
on the developer machines, single core, HT on:

 - Haswell i7-4770, @3.4GHz
 - Ryzen 7 9700X, @5.5GHz

| Matrix/method            | i4770-32  | i4770-64  | 9700x-64  |
|--------------------------|-----------|-----------|-----------|
| vm1a reference           | 5:50      | 5:37      | 2:37      |
| vm1a 32b/tree(4)         | 5:50/0:65 | 5:53/0:51 | 2:37/0:22 |
| vm1a 64b/tree(4)         | 5:27/1:23 | 4:27/0:50 | 1:53/0:15 |
| vm1a 128b (SSE2)/tree(4) | 2:57/0:34 | 2:51/0:27 | 1:37/0:10 |
| vm1a 256b (AVX2)/tree(4) | 1:31/0:28 | 1:25/0:21 | 0:34/0:07 |
| vm2 32b/tree(4)          | -         | 0:36/0:06 | 0:17/0:02 |
| vm2 64b/tree(4)          | -         | 0:29/0:04 | 0:12/0:02 |
| vm2 128b (SSE2)/tree(4)  | -         | 0:17/0:03 | 0:10/0:01 |
| vm2 256b (AVX2)/tree(4)  | -         | 0:09/0:02 | 0:03/0:01 |

The --table method builds the micro address tables and gathers the statistics 
about jumps to target micro addresses.

The application was compiled/tested on Windows x86/x64 environment. Also
was compiled in RHEL 8.2 with x86-64 settings, Makefile is included.