This is a meta issue listing possible performance improvements that:
are not too hard, but they aren't easy either: some knowledge of computer science is necessary.
do not involve original research, or changes to multiple parts of the VM.
should produce a worthwhile performance improvement
are self contained:
Not increasing coupling or complexity in the code base
Can be worked on without troublesome merge conflicts
Since this is a meta issue, please make sure there is an issue for the sub-issue before working on it.
In no particular order:
Convert basic blocks to extended basic blocks in the bytecode compiler
Many local optimizations in the bytecode compiler are limited to a single basic block, but would be more effective and still correct applied to extended basic blocks.
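To make the idea concrete, here is a minimal sketch of grouping blocks into extended basic blocks (trees of blocks in which every block except the root has exactly one predecessor). The CFG representation and helper names are invented for illustration; CPython's compiler uses its own internal flowgraph structures.

```python
def predecessors(cfg):
    """Map each block to the set of blocks that jump to it."""
    preds = {block: set() for block in cfg}
    for block, succs in cfg.items():
        for succ in succs:
            preds[succ].add(block)
    return preds

def extended_basic_blocks(cfg, entry):
    """Partition blocks into EBBs: each EBB is rooted at the entry or at a
    join point, and extends only through single-predecessor successors."""
    preds = predecessors(cfg)
    roots = [b for b in cfg if b == entry or len(preds[b]) != 1]
    ebbs = []
    for root in roots:
        ebb, work = [], [root]
        while work:
            block = work.pop()
            ebb.append(block)
            for succ in cfg[block]:
                # Extend the EBB only through single-predecessor successors.
                if len(preds[succ]) == 1 and succ not in roots:
                    work.append(succ)
        ebbs.append(ebb)
    return ebbs

# Diamond CFG: A branches to B and C, both rejoin at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(extended_basic_blocks(cfg, "A"))   # [['A', 'C', 'B'], ['D']]
```

A peephole pass that tracks facts along `['A', 'C', 'B']` can safely use what it learned in `A` while scanning `B` or `C`, which a strictly per-block pass cannot.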
Better conversion of LOAD_FAST to LOAD_FAST_BORROW in the bytecode compiler
It is possible that extended basic blocks would fix this, or it might be a separate problem.
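A toy sketch of the kind of per-block rewrite involved. The safety rule here is deliberately simplified (and stricter than CPython's real analysis): a load may borrow only if the local is never stored to or deleted later in the same block, so the frame keeps owning a reference while the borrowed value can still be on the stack.

```python
def borrow_fast_loads(instructions):
    """Rewrite LOAD_FAST to LOAD_FAST_BORROW where the local is never
    clobbered later in the block (simplified, per-block safety rule)."""
    rewritten = list(instructions)
    for i, (op, arg) in enumerate(rewritten):
        if op != "LOAD_FAST":
            continue
        clobbered = any(
            later_op in ("STORE_FAST", "DELETE_FAST") and later_arg == arg
            for later_op, later_arg in rewritten[i + 1:]
        )
        if not clobbered:
            rewritten[i] = ("LOAD_FAST_BORROW", arg)
    return rewritten

block = [
    ("LOAD_FAST", "x"),      # x is reassigned below: keep the strong reference
    ("LOAD_CONST", 1),
    ("BINARY_OP", "+"),
    ("STORE_FAST", "x"),
    ("LOAD_FAST", "x"),      # x is never clobbered again: safe to borrow
    ("RETURN_VALUE", None),
]
result = borrow_fast_loads(block)
```

With extended basic blocks, the `rewritten[i + 1:]` scan could cover the whole EBB rather than one block, converting more loads.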
Replace _CHECK_STACK_SPACE with _CHECK_STACK_SPACE_OPERAND in the JIT
We removed the optimization that did this because it tried to convert multiple _CHECK_STACK_SPACEs into a single _CHECK_STACK_SPACE_OPERAND. Replacing them one by one should be much simpler.
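The one-for-one version could look like the following toy sketch: every _CHECK_STACK_SPACE uop is replaced by a _CHECK_STACK_SPACE_OPERAND carrying the callee's frame size as an explicit operand. The `(op, operand)` tuple encoding and the framesize lookup are invented for illustration.

```python
def lower_stack_checks(trace, framesizes):
    """Replace each _CHECK_STACK_SPACE with _CHECK_STACK_SPACE_OPERAND;
    framesizes[i] is the frame size needed at the i-th check."""
    lowered, check_index = [], 0
    for op, operand in trace:
        if op == "_CHECK_STACK_SPACE":
            lowered.append(("_CHECK_STACK_SPACE_OPERAND", framesizes[check_index]))
            check_index += 1
        else:
            lowered.append((op, operand))
    return lowered

trace = [
    ("_CHECK_STACK_SPACE", None),
    ("_PUSH_FRAME", None),
    ("_CHECK_STACK_SPACE", None),
    ("_PUSH_FRAME", None),
]
print(lower_stack_checks(trace, [24, 16]))
```

Because each check is rewritten independently, there is no need to reason about which checks can be merged, which is what made the removed optimization fragile.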
Function, and maybe code, watchers
We have class and dictionary watchers, and we use them effectively in the JIT. There are a number of optimizations we would like to do, but cannot because functions and code objects can change at runtime and we don't have watchers for them.
We might not need code watchers, as we do a complete de-optimization when any code objects are instrumented. Having code watchers might allow more targeted de-optimizations. We should do function watchers first, though.
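The invalidation pattern a function watcher would enable is familiar from the dict and type watchers: JIT-compiled traces record which functions they specialized on, and a watcher callback discards those traces when a function is mutated. The pure-Python registry below is hypothetical and only illustrates the shape of that dependency tracking; the real watchers are a C-level callback API.

```python
class FunctionWatcherRegistry:
    """Hypothetical dependency tracker: maps functions to the JIT traces
    that specialized on them, so a watcher callback can invalidate."""

    def __init__(self):
        self.traces_by_function = {}   # id(function) -> set of trace names

    def record_dependency(self, func, trace):
        self.traces_by_function.setdefault(id(func), set()).add(trace)

    def on_function_modified(self, func):
        """Watcher callback: drop every trace specialized on this function."""
        return self.traces_by_function.pop(id(func), set())

registry = FunctionWatcherRegistry()

def f():
    return 1

registry.record_dependency(f, "trace-0017")
# Someone replaces f.__code__ (or __defaults__, etc.) at runtime; the
# watcher fires, and the dependent trace must be discarded rather than
# run stale specialized code.
invalidated = registry.on_function_modified(f)
print(invalidated)   # {'trace-0017'}
```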
Track which locals are NULL/immortal/borrowed in the bytecode compiler
We could then use this information to speed up RETURN_VALUE, as it wouldn't need to DECREF those locals. This might make sense in the interpreter, but would probably only be of value in the JIT.
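A minimal sketch of how the tracked flags would be consumed at return time. The flag names and the frame model are invented for illustration; the point is that only locals the compiler could not classify still need a DECREF.

```python
# Per-local classifications the bytecode compiler might track.
NEEDS_DECREF, IS_NULL, IS_IMMORTAL, IS_BORROWED = range(4)

def emit_return_cleanup(local_flags):
    """Return the local slots that still need a DECREF on frame exit:
    NULL slots hold nothing, immortal values are never freed, and
    borrowed references are not owned by this frame."""
    return [
        slot
        for slot, flag in enumerate(local_flags)
        if flag == NEEDS_DECREF
    ]

# Four locals: an owned object, an unset slot, the immortal constant
# None, and a borrowed reference; only slot 0 needs a DECREF.
flags = [NEEDS_DECREF, IS_NULL, IS_IMMORTAL, IS_BORROWED]
print(emit_return_cleanup(flags))   # [0]
```

In the JIT, this list is known at trace-compile time, so the cleanup loop and its branches can be replaced by straight-line DECREFs for exactly the slots that need them.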
Reduce or eliminate the cost of updating the insertion order when initializing an object with STORE_ATTR_INSTANCE_VALUE
STORE_ATTR_INSTANCE_VALUE does three things:
Stores the new value
Decrefs the old value, if there is one
Updates the insertion order array
Updating the insertion order array is possibly the most expensive part of this, and could easily be optimized.
We could:
Instead of recording the position, record the delta from the "natural" position. In many cases this would be zero and we could skip the write
In the JIT determine cases where we would make no write and eliminate the code for that.
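The delta idea, as a toy sketch: record each attribute's insertion position as a delta from its "natural" position (its index in the class's shared key order). When an object is initialized by setting attributes in that natural order, which is the common case, every delta is zero and the write can be skipped entirely. The data layout here is invented for illustration.

```python
def store_attr(order_deltas, shared_keys, key, insertion_counter):
    """Record the insertion-order delta for one attribute store.
    Returns the number of writes performed (0 in the common case)."""
    natural = shared_keys.index(key)
    delta = insertion_counter - natural
    writes = 0
    if delta != 0:               # common case: delta == 0, skip the write
        order_deltas[natural] = delta
        writes = 1
    return writes

shared_keys = ["x", "y", "z"]

# __init__ sets attributes in declaration order: no writes at all.
order_deltas = [0, 0, 0]
writes = sum(
    store_attr(order_deltas, shared_keys, key, i)
    for i, key in enumerate(["x", "y", "z"])
)
print(writes)                    # 0

# Out-of-order initialization still records the permutation via deltas.
order_deltas = [0, 0, 0]
writes = sum(
    store_attr(order_deltas, shared_keys, key, i)
    for i, key in enumerate(["y", "x", "z"])
)
print(writes, order_deltas)      # 2 [1, -1, 0]
```

The JIT version of the second bullet would specialize on the known key order of the trace and emit no insertion-order code at all when every delta is provably zero.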
Optimize reference tracking and eliminate branching during returns and yields (#144540)
Optimize _LOAD_SPECIAL to a type check and constant load.
The instruction LOAD_SPECIAL expands to the uop sequence _INSERT_NULL + _LOAD_SPECIAL, which can be optimized to _GUARD_TYPE_VERSION + _LOAD_CONST_INLINE + _SWAP 2.
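A toy sketch of that peephole: when the operand's type is known (guarded by its version tag), the pair collapses into a type guard, an inline constant load of the resolved special method, and a stack swap. The `(op, operand)` uop encoding and operand values are invented for illustration.

```python
def optimize_load_special(uops, type_version, resolved_method):
    """Rewrite each _INSERT_NULL + _LOAD_SPECIAL pair into
    _GUARD_TYPE_VERSION + _LOAD_CONST_INLINE + _SWAP 2."""
    optimized, i = [], 0
    while i < len(uops):
        if (
            i + 1 < len(uops)
            and uops[i][0] == "_INSERT_NULL"
            and uops[i + 1][0] == "_LOAD_SPECIAL"
        ):
            optimized += [
                ("_GUARD_TYPE_VERSION", type_version),
                ("_LOAD_CONST_INLINE", resolved_method),
                ("_SWAP", 2),
            ]
            i += 2
        else:
            optimized.append(uops[i])
            i += 1
    return optimized

uops = [("_LOAD_FAST", 0), ("_INSERT_NULL", None), ("_LOAD_SPECIAL", "__enter__")]
print(optimize_load_special(uops, 0xABCD, "<bound __enter__>"))
```

The guard makes the rewrite safe: if the type's version tag ever changes, the trace deoptimizes instead of loading a stale method.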