This is a meta issue listing possible performance improvements that:
are not too hard, but they aren't easy either: some knowledge of computer science is necessary.
do not involve original research, or changes to multiple parts of the VM.
should produce a worthwhile performance improvement
are self contained:
Not increasing coupling or complexity in the code base
Can be worked on without troublesome merge conflicts
Since this is a meta issue, please make sure there is an issue for the sub-issue before working on it.
In no particular order:
Convert basic blocks to extended basic blocks in the bytecode compiler
Many local optimizations in the bytecode compiler are limited to a single basic block, but would be more effective and still correct applied to extended basic blocks.
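To make the idea concrete, here is a minimal sketch of grouping blocks into extended basic blocks (trees of blocks in which every block except the root has exactly one predecessor). The CFG representation and helper names are invented for illustration; CPython's compiler uses its own internal flowgraph structures.

```python
def predecessors(cfg):
    """Map each block to the set of blocks that jump to it."""
    preds = {block: set() for block in cfg}
    for block, succs in cfg.items():
        for succ in succs:
            preds[succ].add(block)
    return preds

def extended_basic_blocks(cfg, entry):
    """Partition blocks into EBBs: each EBB is rooted at the entry or at a
    join point, and extends only through single-predecessor successors."""
    preds = predecessors(cfg)
    roots = [b for b in cfg if b == entry or len(preds[b]) != 1]
    ebbs = []
    for root in roots:
        ebb, work = [], [root]
        while work:
            block = work.pop()
            ebb.append(block)
            for succ in cfg[block]:
                # Extend the EBB only through single-predecessor successors.
                if len(preds[succ]) == 1 and succ not in roots:
                    work.append(succ)
        ebbs.append(ebb)
    return ebbs

# Diamond CFG: A branches to B and C, both rejoin at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(extended_basic_blocks(cfg, "A"))   # [['A', 'C', 'B'], ['D']]
```

A peephole pass that tracks facts along `['A', 'C', 'B']` can safely use what it learned in `A` while scanning `B` or `C`, which a strictly per-block pass cannot.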
Better conversion of LOAD_FAST to LOAD_FAST_BORROW in the bytecode compiler
It is possible that extended basic blocks would fix this, or it might be a separate problem.
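A toy sketch of the kind of per-block rewrite involved. The safety rule here is deliberately simplified (and stricter than CPython's real analysis): a load may borrow only if the local is never stored to or deleted later in the same block, so the frame keeps owning a reference while the borrowed value can still be on the stack.

```python
def borrow_fast_loads(instructions):
    """Rewrite LOAD_FAST to LOAD_FAST_BORROW where the local is never
    clobbered later in the block (simplified, per-block safety rule)."""
    rewritten = list(instructions)
    for i, (op, arg) in enumerate(rewritten):
        if op != "LOAD_FAST":
            continue
        clobbered = any(
            later_op in ("STORE_FAST", "DELETE_FAST") and later_arg == arg
            for later_op, later_arg in rewritten[i + 1:]
        )
        if not clobbered:
            rewritten[i] = ("LOAD_FAST_BORROW", arg)
    return rewritten

block = [
    ("LOAD_FAST", "x"),      # x is reassigned below: keep the strong reference
    ("LOAD_CONST", 1),
    ("BINARY_OP", "+"),
    ("STORE_FAST", "x"),
    ("LOAD_FAST", "x"),      # x is never clobbered again: safe to borrow
    ("RETURN_VALUE", None),
]
result = borrow_fast_loads(block)
```

With extended basic blocks, the `rewritten[i + 1:]` scan could cover the whole EBB rather than one block, converting more loads.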
Replace _CHECK_STACK_SPACE with _CHECK_STACK_SPACE_OPERAND in the JIT
We removed the optimization that did this because it tried to convert multiple _CHECK_STACK_SPACEs into a single _CHECK_STACK_SPACE_OPERAND. Replacing them one by one should be much simpler.
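The one-for-one version could look like the following toy sketch: every _CHECK_STACK_SPACE uop is replaced by a _CHECK_STACK_SPACE_OPERAND carrying the callee's frame size as an explicit operand. The `(op, operand)` tuple encoding and the framesize lookup are invented for illustration.

```python
def lower_stack_checks(trace, framesizes):
    """Replace each _CHECK_STACK_SPACE with _CHECK_STACK_SPACE_OPERAND;
    framesizes[i] is the frame size needed at the i-th check."""
    lowered, check_index = [], 0
    for op, operand in trace:
        if op == "_CHECK_STACK_SPACE":
            lowered.append(("_CHECK_STACK_SPACE_OPERAND", framesizes[check_index]))
            check_index += 1
        else:
            lowered.append((op, operand))
    return lowered

trace = [
    ("_CHECK_STACK_SPACE", None),
    ("_PUSH_FRAME", None),
    ("_CHECK_STACK_SPACE", None),
    ("_PUSH_FRAME", None),
]
print(lower_stack_checks(trace, [24, 16]))
```

Because each check is rewritten independently, there is no need to reason about which checks can be merged, which is what made the removed optimization fragile.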
Function, and maybe code, watchers
We have class and dictionary watchers, and we use them effectively in the JIT. There are a number of optimizations we would like to do, but cannot because functions and code objects can change at runtime and we don't have watchers for them.
We might not need code watchers, as we do a complete de-optimization when any code objects are instrumented. Having code watchers might allow more targeted de-optimizations. We should do function watchers first, though.
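The invalidation pattern a function watcher would enable is familiar from the dict and type watchers: JIT-compiled traces record which functions they specialized on, and a watcher callback discards those traces when a function is mutated. The pure-Python registry below is hypothetical and only illustrates the shape of that dependency tracking; the real watchers are a C-level callback API.

```python
class FunctionWatcherRegistry:
    """Hypothetical dependency tracker: maps functions to the JIT traces
    that specialized on them, so a watcher callback can invalidate."""

    def __init__(self):
        self.traces_by_function = {}   # id(function) -> set of trace names

    def record_dependency(self, func, trace):
        self.traces_by_function.setdefault(id(func), set()).add(trace)

    def on_function_modified(self, func):
        """Watcher callback: drop every trace specialized on this function."""
        return self.traces_by_function.pop(id(func), set())

registry = FunctionWatcherRegistry()

def f():
    return 1

registry.record_dependency(f, "trace-0017")
# Someone replaces f.__code__ (or __defaults__, etc.) at runtime; the
# watcher fires, and the dependent trace must be discarded rather than
# run stale specialized code.
invalidated = registry.on_function_modified(f)
print(invalidated)   # {'trace-0017'}
```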
Track which locals are NULL/immortal/borrowed in the bytecode compiler
We could then use this information to speed up RETURN_VALUE, as it wouldn't need to DECREF those locals. This might make sense in the interpreter, but would probably only be of value in the JIT.
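A minimal sketch of how the tracked flags would be consumed at return time. The flag names and the frame model are invented for illustration; the point is that only locals the compiler could not classify still need a DECREF.

```python
# Per-local classifications the bytecode compiler might track.
NEEDS_DECREF, IS_NULL, IS_IMMORTAL, IS_BORROWED = range(4)

def emit_return_cleanup(local_flags):
    """Return the local slots that still need a DECREF on frame exit:
    NULL slots hold nothing, immortal values are never freed, and
    borrowed references are not owned by this frame."""
    return [
        slot
        for slot, flag in enumerate(local_flags)
        if flag == NEEDS_DECREF
    ]

# Four locals: an owned object, an unset slot, the immortal constant
# None, and a borrowed reference; only slot 0 needs a DECREF.
flags = [NEEDS_DECREF, IS_NULL, IS_IMMORTAL, IS_BORROWED]
print(emit_return_cleanup(flags))   # [0]
```

In the JIT, this list is known at trace-compile time, so the cleanup loop and its branches can be replaced by straight-line DECREFs for exactly the slots that need them.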
Reduce or eliminate the cost of updating the insertion order when initializing an object with STORE_ATTR_INSTANCE_VALUE
STORE_ATTR_INSTANCE_VALUE does three things:
Stores the new value
Decrefs the old value, if there is one
Updates the insertion order array
Updating the insertion order array is possibly the most expensive part of this, and could easily be optimized.
We could:
Instead of recording the position, record the delta from the "natural" position. In many cases this would be zero and we could skip the write
In the JIT determine cases where we would make no write and eliminate the code for that.
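The delta idea, as a toy sketch: record each attribute's insertion position as a delta from its "natural" position (its index in the class's shared key order). When an object is initialized by setting attributes in that natural order, which is the common case, every delta is zero and the write can be skipped entirely. The data layout here is invented for illustration.

```python
def store_attr(order_deltas, shared_keys, key, insertion_counter):
    """Record the insertion-order delta for one attribute store.
    Returns the number of writes performed (0 in the common case)."""
    natural = shared_keys.index(key)
    delta = insertion_counter - natural
    writes = 0
    if delta != 0:               # common case: delta == 0, skip the write
        order_deltas[natural] = delta
        writes = 1
    return writes

shared_keys = ["x", "y", "z"]

# __init__ sets attributes in declaration order: no writes at all.
order_deltas = [0, 0, 0]
writes = sum(
    store_attr(order_deltas, shared_keys, key, i)
    for i, key in enumerate(["x", "y", "z"])
)
print(writes)                    # 0

# Out-of-order initialization still records the permutation via deltas.
order_deltas = [0, 0, 0]
writes = sum(
    store_attr(order_deltas, shared_keys, key, i)
    for i, key in enumerate(["y", "x", "z"])
)
print(writes, order_deltas)      # 2 [1, -1, 0]
```

The JIT version of the second bullet would specialize on the known key order of the trace and emit no insertion-order code at all when every delta is provably zero.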
Optimize reference tracking and eliminate branching during returns and yields (#144540)
Optimize _LOAD_SPECIAL to a type check and constant load.
The instruction LOAD_SPECIAL expands to the uop sequence _INSERT_NULL + _LOAD_SPECIAL, which can be optimized to _GUARD_TYPE_VERSION + _LOAD_CONST_INLINE + _SWAP 2.
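A toy sketch of that peephole: when the operand's type is known (guarded by its version tag), the pair collapses into a type guard, an inline constant load of the resolved special method, and a stack swap. The `(op, operand)` uop encoding and operand values are invented for illustration.

```python
def optimize_load_special(uops, type_version, resolved_method):
    """Rewrite each _INSERT_NULL + _LOAD_SPECIAL pair into
    _GUARD_TYPE_VERSION + _LOAD_CONST_INLINE + _SWAP 2."""
    optimized, i = [], 0
    while i < len(uops):
        if (
            i + 1 < len(uops)
            and uops[i][0] == "_INSERT_NULL"
            and uops[i + 1][0] == "_LOAD_SPECIAL"
        ):
            optimized += [
                ("_GUARD_TYPE_VERSION", type_version),
                ("_LOAD_CONST_INLINE", resolved_method),
                ("_SWAP", 2),
            ]
            i += 2
        else:
            optimized.append(uops[i])
            i += 1
    return optimized

uops = [("_LOAD_FAST", 0), ("_INSERT_NULL", None), ("_LOAD_SPECIAL", "__enter__")]
print(optimize_load_special(uops, 0xABCD, "<bound __enter__>"))
```

The guard makes the rewrite safe: if the type's version tag ever changes, the trace deoptimizes instead of loading a stale method.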