PICO-8 keeps track of CPU usage using two values: Lua cycles and system cycles. Most operations affect Lua cycles, but some functions have an additional system cycle cost.
There are 8,388,608 cycles per second (2^23), which is about 139,810 cycles per frame at 60 FPS, or 279,620 cycles per frame at 30 FPS. The function call stat(1) returns the total fraction of the current frame spent on Lua cycles + system cycles, and stat(2) returns the fraction spent on just system cycles.
For example, cls() uses 4 Lua cycles and 2048 system cycles, for a total of 2052, so if we assume PICO-8 is running at 60 FPS, we can calculate how many times per frame we can call it: 8,388,608 cycles/s / 60 frames/s / 2052 cycles ≈ 68 times.
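The budget arithmetic above can be checked off-device. The following is a minimal Python sketch, not PICO-8 code; the helper names `cycles_per_frame` and `max_calls_per_frame` are made up for illustration, and the only inputs are the figures quoted above.

```python
# Cycle rate quoted above: 2^23 cycles per second.
CYCLES_PER_SECOND = 2 ** 23  # 8,388,608


def cycles_per_frame(fps):
    """Cycle budget of one frame at the given frame rate."""
    return CYCLES_PER_SECOND / fps


def max_calls_per_frame(lua_cycles, system_cycles, fps=60):
    """How many whole calls of a given total cost fit into one frame."""
    total = lua_cycles + system_cycles
    return int(cycles_per_frame(fps) // total)


print(int(cycles_per_frame(60)))     # 139810
print(int(cycles_per_frame(30)))     # 279620
print(max_calls_per_frame(4, 2048))  # 68 (the cls() example)
```

The same helper works for any function whose Lua and system costs are known, as long as nothing else runs in the frame.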
Optimization Tips
Some tips for when your code isn't running fast enough (these will increase your code's size and reduce its clarity, however; it's a tradeoff):
 First, make sure you know why your code is running slow: which part is costing the most time? Use time() or stat(1) calls to measure this, or just delete blocks of code to see where the problem lies.
 Focus on just the code causing the most slowdown (usually a while/for loop), and only until the desired speed is achieved, as optimizing your whole code will quickly run you out of tokens for no actual gain. (Often, 99% of the time is spent in 1% of the code. Optimizing the rest of the code is pointless.)
 Printing stat(1) with printh just before the end of _update and _draw (or before the flip) is invaluable here: it shows how your game's actual performance is improving (or not) as you make optimizations.
 If an optimization doesn't seem to help actual performance (as measured by the stat(1) reading from the previous point), you've probably failed to find the actual problem point; try spending more time on that.
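To judge whether a change was worth it, it can help to turn two stat(1) readings into a cycle count. This is a small Python sketch for doing that arithmetic off-device; the function names and the readings 0.85 and 0.60 are hypothetical, and only the 2^23 cycles-per-second figure comes from this page.

```python
CYCLES_PER_SECOND = 2 ** 23


def fraction_to_cycles(fraction, fps=60):
    """Convert a stat(1)-style frame fraction into cycles."""
    return fraction * CYCLES_PER_SECOND / fps


def cycles_saved(before, after, fps=60):
    """Cycles gained per frame between two stat(1) readings."""
    return fraction_to_cycles(before - after, fps)


# Hypothetical readings taken before and after an optimization.
print(round(cycles_saved(0.85, 0.60)))  # 34953
```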
Now that you've found the code causing the slowdown:
 You can always remove it if it's not essential. That's one of the only optimizations that will improve your code size and clarity, too!
 Forget about the code for a moment and consider what it's supposed to be doing: what's the fastest way that can be implemented? Can a clever algorithm or data structure be used to avoid pointless calculation?
 For example, PICO-8 has a fair(ish) amount of Lua memory (2 MB), so a function that has a small (or sometimes not-so-small) set of possible inputs and does slow computations on them can often be replaced with a lookup table (which could be computed at startup time, if too large to fit in the code).
 Now onto the micro-optimizations:
 Function calls cost, so inlining short calls (replacing the calls with the code inside the function) can help performance (in exchange for severely harming code size and clarity, so use with care).
 Access to global or non-local variables (locals from other functions) is slower than access to local variables, so use local variables instead, if possible. If a global or non-local variable is read multiple times, it'd save cycles to cache it in a local variable first (this helps a bit even if the variable is only read twice).
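As a rough feel for what inlining buys, the call and return costs listed in the Lua cycles section (4 cycles + 2 per argument for the call, 2 cycles + 2 per return value for the return) can be added up. The Python sketch below does only that arithmetic; `call_overhead` is a made-up helper, the figure of 1000 calls per frame is hypothetical, and the result is a lower bound because argument expressions may cost more than 2 cycles each.

```python
CYCLES_PER_FRAME_60 = 2 ** 23 / 60


def call_overhead(num_args, num_returns):
    """Lua cycles spent on calling and returning, excluding the body."""
    call_cost = 4 + 2 * num_args       # function call
    return_cost = 2 + 2 * num_returns  # function return
    return call_cost + return_cost


per_call = call_overhead(2, 1)   # helper with 2 arguments, 1 return value
calls_per_frame = 1000           # hypothetical hot-loop call count
saved = per_call * calls_per_frame

print(per_call)                                       # 12
print(saved)                                          # 12000
print(round(100 * saved / CYCLES_PER_FRAME_60, 1))    # 8.6 (% of a 60 FPS frame)
```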
Lua cycles
Some standard Lua operation costs: (tested on 0.2.0h)
 Variable access (read):
 Local variables in same function: 0 cycles.
 Global variables: 2 cycles.
 Upvalues (local variables in another function): 2 cycles.
 Assignment statement:
 Simple (x=y): 0 cycles if right side of expression already has a cycle cost. 2 cycles otherwise. (Yes, this means x=x+y is cheaper than x=y.) [Note: this may be changed/fixed in the future]
 Multiple (x1,x2,..,xn = y1,y2,..,yk): (max(n,k) - 1) * 2 cycles, plus 2 cycles for each right side expression without a cycle cost. (E.g. x,y=y,x is 6 cycles). [Note: this used to be cheaper. It might get changed/fixed again in the future]
 Arithmetic operators:
 additive operators (+, -): 1 cycle
 multiplicative operators (*, /, %, \): 2 cycles
 unary minus (-): 2 cycles [Note: this is odd]
 exponentiation (^): 2 regular cycles plus a considerable system cycle cost, described in the System cycles section
 Local Declaration:
 Default-initialized (local x,y,z): 2 cycles, regardless of the number of locals.
 Initialized: 2 cycles per initialized local.
 Bitwise operators (&, |, ^^, <<, >>, >>>, <<>, >><, ~): 1 cycle.
 Logical operators:
 and/or: 0 cycles if short-circuited, 2 cycles otherwise. +2 extra cycles unless directly inside an if/while/and/or.
 unary not: 2 cycles.
 Relational operators (<, >, <=, >=, ==, !=): 2 cycles. +2 extra cycles unless directly inside an if/while/and/or.
 String concatenation operator (..): 6 cycles
 Memory peek operators (@, %, $): 1 cycle
 Table element access: 2 cycles
 Table construction:
 With at least one positional (list-style) element: 4 cycles + 2 cycles per (any) element.
 Otherwise: 2 cycles + 2 cycles per (any) element.
 The 2 cycles per element cost is max'ed with the cost of the expression that defines that element. (So {1+2} costs 4 cycles, not 5)
 (Note: This means that {a,b} is 8 cycles, but {[1]=a,[2]=b} is 6 cycles. Funny)
 Table length (#): 2 cycles.
 Function construction: 2 cycles. [Todo: even if it captures locals? That definitely wasn't the case before...]
 Function call: 4 cycles + 2 cycles per argument.
 The 2 cycles per argument cost is max'ed with the cost of the expression that defines that argument. (So func(1+2) costs 6 cycles, not 7)
 This cost is the same regardless of whether the function is accessed through a local, a global, or an upvalue.
 Function return: 2 cycles + 2 cycles per return value.
 If a function returns without an explicit return statement, that also costs 2 cycles. (You can think of it as an implicit return statement)
 If statement: 2 cycles per evaluated if/elseif.
 This cost is max'ed with the cost of the expression in the if/elseif.
 While loop: 2 + 4n cycles, where n is the number of iterations. (Todo: that much?! Need to double-check)
 2 cycles per iteration are max'ed with the cost of the expression in the while.
 Numeric for loop: 10 + 2n, where n is the number of iterations.
 do … end: 0 cycles
 Metamethod access: 0 cycles (doesn't include cost of the metamethod itself)
Lua CPU stats were only updated every 2048 cycles as of 0.1.12c, but in 0.2.0 they started being updated at a precision closer to once per conceptual operation.
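A few of the formulas in the list above can be written down as plain functions to compare alternatives. This is a Python sketch of the arithmetic only, with invented function names; it reads the multiple-assignment formula as (max(n,k) - 1) * 2, which reproduces the 6-cycle x,y=y,x example, and it ignores the rule that per-element costs are max'ed with the cost of the defining expression.

```python
def multiple_assignment(n_targets, n_values, n_free_values):
    """x1,..,xn = y1,..,yk; n_free_values counts right-side
    expressions that have no cycle cost of their own."""
    return (max(n_targets, n_values) - 1) * 2 + 2 * n_free_values


def table_construction(n_elements, has_positional):
    """Table constructor with simple (2-cycle) elements."""
    base = 4 if has_positional else 2
    return base + 2 * n_elements


def while_loop(iterations):
    """Loop overhead only, excluding the body."""
    return 2 + 4 * iterations


def numeric_for(iterations):
    """Loop overhead only, excluding the body."""
    return 10 + 2 * iterations


print(multiple_assignment(2, 2, 2))   # 6  (x,y=y,x)
print(table_construction(2, True))    # 8  ({a,b})
print(table_construction(2, False))   # 6  ({[1]=a,[2]=b})
print(while_loop(100))                # 402
print(numeric_for(100))               # 210
```

Under these figures a numeric for loop has roughly half the overhead of a while loop over the same number of iterations, although the while-loop cost is flagged above as needing a double-check.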
Functions that add negative Lua cycles
Some functions have negative Lua cycles associated with them that get subtracted from the Lua cycle count by the PICO-8 runtime. This mechanism allows PICO-8 to make these functions artificially cheaper.
For instance, poke(x,y) should cost 8 cycles because it is a function call with two arguments, but each call subtracts 4 cycles from the Lua cycle counter, for a total of 4 cycles.
The table below lists functions that have their total cost tweaked in this way.
| Function | Adjusted cycles | Notes |
|---|---|---|
| peek(x), peek2(x), peek4(x) | 4 | Only when called with 1 argument. Operators are faster still. |
| poke(x,y), poke2(x,y), poke4(x,y) | 4 | Only when called with 2 arguments. |
| band(x,y), bor(x,y), bxor(x,y) | 4 | Only when called with 2 arguments. Operators are faster still. |
| bnot(x) | 4 | Only when called with 1 argument. Operators are faster still. |
| shl(x,y), shr(x,y), lshr(x,y) | 4 | Only when called with 2 arguments. Operators are faster still. |
| rotl(x,y), rotr(x,y) | 4 | Only when called with 2 arguments. Operators are faster still. |
| flr(x), ceil(x) | 4 | Only when called with 1 argument. |
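The poke(x,y) example from the text can be worked through numerically. The Python sketch below uses only the figures stated for poke (an 8-cycle standard call, minus 4 cycles); `adjusted_call_cost` is a made-up helper, and the count of 8192 calls is just an illustrative workload (one call per byte of an 8 KiB region), excluding loop and argument-expression costs.

```python
CYCLES_PER_FRAME_60 = 2 ** 23 / 60


def adjusted_call_cost(num_args, adjustment):
    """Standard call cost (4 + 2 per argument) plus a signed adjustment."""
    return 4 + 2 * num_args + adjustment


poke_cost = adjusted_call_cost(2, -4)   # poke(x,y) as described in the text
total = poke_cost * 8192                # illustrative: 8192 calls in one frame

print(poke_cost)                               # 4
print(total)                                   # 32768
print(round(total / CYCLES_PER_FRAME_60, 3))   # 0.234 of a 60 FPS frame
```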
Functions that add Lua cycles
A few functions consume additional Lua cycles (in addition to the standard cost of 2+(#arguments)):
Out of date: measured on PICO-8 0.1.12d RC10.
| Function | Additional cycles | Notes |
|---|---|---|
| add() | 10 | |
| all() | ??? | TODO: results wildly unclear. |
| del() | if n-s > 0 then 8+(2+n-s)*6 else 8 | n is the size of the table. s is 1 if deleted and 0 otherwise. |
| foreach() | if n > 0 then 4+n*(10+c) else 24 | n is the size of the table. c is the cost of the function passed to the foreach. |
| tostr() | if table then 28 else 18 | table is true if the argument is a table. |
| printh() | 32 | |
| menuitem() | 32 | |
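The conditional formulas in this table are easier to compare once written as functions. The following Python sketch transcribes them under one stated assumption: the del() expression is read as using n-s (table size minus the deleted flag), since n and s are defined separately in the notes. Function names are invented, and the numbers inherit the "out of date" caveat above.

```python
def del_extra(n, s):
    """Additional cycles for del(); n = table size, s = 1 if an element
    was deleted, else 0. Assumes the formula uses n - s."""
    remaining = n - s
    return 8 + (2 + remaining) * 6 if remaining > 0 else 8


def foreach_extra(n, c):
    """Additional cycles for foreach(); n = table size,
    c = cost of the function passed in."""
    return 4 + n * (10 + c) if n > 0 else 24


def tostr_extra(is_table):
    """Additional cycles for tostr()."""
    return 28 if is_table else 18


print(del_extra(5, 1))        # 44
print(del_extra(1, 1))        # 8
print(foreach_extra(10, 6))   # 164
print(foreach_extra(0, 6))    # 24
print(tostr_extra(False))     # 18
```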
The following functions neither add nor subtract cycles, and cost the standard amount:
sgn(), abs(), sin(), cos(), atan2().
camera(), clip(), cursor(), fillp(), pal(), palt().
fget(), fset(), mget(), mset(), pget(), pset(), sget(), sset().
cocreate(), coresume(), costatus(), dget(), dset(), time(), type().
getmetatable(), setmetatable(), pairs(), next(), rawget(), rawset().
System cycles
A few functions consume system cycles. Note that they will add to their standard Lua cycle cost.
System CPU stats are updated after each call.
Out of date: measured on PICO-8 0.1.11g.
| Function | Cycles | Notes |
|---|---|---|
| cls() | 2048 | Same cost as rectfill of same size. |
| print() | 4+n*16 | n is the number of characters in the string, even those not rendered; spaces, newlines, and double-width glyphs each count as one character. |
| spr() | 2*n | n is the number of pixels drawn, including transparent pixels (width × height of the sprite rectangle); cost is 0 if first argument is outside the [0, 255] range. |
| sspr() | 2*n | n is the number of pixels drawn, including transparent pixels (width × height of the destination rectangle). |
| rect() | 2*max(1,2*ceil(a/4)) + 2*max(0,2*ceil(b/21)) | |
| rectfill() | 2*max(1,flr(n/16)) | n is the number of pixels drawn (width × height). |
| circ() | 4+n*8 | Warning: that formula is incomplete for clipped circles. |
| circfill() | 2*n*flr((n+9)/4) | Warning: that formula is incomplete for clipped circles. |
| line() | 2*ceil(n/2) | n is the number of pixels drawn; there is an additional cost of 1 if at least one pixel had to be clipped. |
| map() / mapdraw() | 2*max(1,n*64) | n is the number of sprites rendered; only cells that are not zero in the map are considered. |
| music() | 32 | No cost if no argument. |
| sfx() | 32 | No cost if no argument. |
| memcpy() | 2*(n+1) | n is the number of bytes copied. |
| memset() | 2*max(1,ceil(n/2)) | n is the number of bytes set. |
| cstore() | 2*max(1,n*64) | n is the number of bytes stored. |
| reload() | 2*max(1,n*8) | n is the number of bytes reloaded. |
| btn() | 8 | No cost if no argument. |
| btnp() | 8 | No cost if no argument. |
| rnd() | 8 | |
| srand() | 16 | |
| sqrt() | 48 | Only 32 if argument is zero. |
| x^y | 16*(n+1) | n is the position of the last fractional bit in y; for instance, cost is 8 for any integer such as y == 13, and is 8*3 for y == 1.25. |
| stat() | 32 | |
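Several of the system-cycle formulas above are simple enough to evaluate by hand. The Python sketch below transcribes a few of them with invented function names; it counts system cycles only (the Lua call cost comes on top), ignores clipping, and inherits the "out of date" caveat of the table. As a sanity check, a full-screen 128×128 rectfill comes out at the same 2048 cycles as cls().

```python
import math

CYCLES_PER_FRAME_60 = 2 ** 23 / 60


def print_cost(n_chars):
    return 4 + n_chars * 16


def spr_cost(width, height):
    return 2 * width * height


def rectfill_cost(width, height):
    return 2 * max(1, (width * height) // 16)


def line_cost(n_pixels):
    return 2 * math.ceil(n_pixels / 2)


def memcpy_cost(n_bytes):
    return 2 * (n_bytes + 1)


def memset_cost(n_bytes):
    return 2 * max(1, math.ceil(n_bytes / 2))


print(print_cost(11))            # 180  (an 11-character string)
print(spr_cost(8, 8))            # 128  (one 8x8 sprite)
print(rectfill_cost(128, 128))   # 2048 (full screen, same as cls())
print(line_cost(5))              # 6
print(memcpy_cost(8192))         # 16386
print(memset_cost(8192))         # 8192

# Upper bound on 8x8 sprites per 60 FPS frame, system cycles alone.
print(int(CYCLES_PER_FRAME_60 // spr_cost(8, 8)))  # 1092
```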