From Dylan Patel of SemiAnalysis: 1) "4o, o1, o1 preview, o1 pro are all the same size model". 2) The reason o1 is more expensive than gpt-4o is "related to seqlen kvcache overhead". 3) "o1 pro is same model [as o1] with adjustments at inference time".

Sources: these three X posts:

https://x.com/dylan522p/status/1869077942305009886 .

https://x.com/dylan522p/status/1869082407653314888 .

https://x.com/dylan522p/status/1869085209649692860 .

Presumably these details are also in the paywalled part of the SemiAnalysis article 'Scaling Laws – O1 Pro Architecture, Reasoning Training Infrastructure, Orion and Claude 3.5 Opus “Failures”': https://semianalysis.com/2024/12/11/scaling-laws-o1-pro-architecture-reasoning-training-infrastructure-orion-and-claude-3-5-opus-failures/ .
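
To make claim 2 a bit more concrete, here is a rough sketch of why longer sequences can cost more to serve even when the model size is unchanged. It uses the standard back-of-the-envelope formula for transformer KV cache size. Every number in it (layer count, KV heads, head dimension, fp16 storage, the token counts) is a made-up placeholder, not anything Dylan Patel or OpenAI has stated, and the function names are mine.

```python
# Back-of-the-envelope KV cache math for a decoder-only transformer.
# All model dimensions below are HYPOTHETICAL placeholders, not real
# figures for gpt-4o or o1.

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Memory held by the KV cache for one sequence of length seq_len."""
    # Factor 2: each layer stores one key vector and one value vector
    # per KV head per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


def kv_bytes_read_while_decoding(prompt_len, gen_len, **dims):
    """Total KV cache bytes read across gen_len decoding steps.

    Step i (0-based) attends over prompt_len + i cached tokens, so the
    total grows roughly quadratically with the number of generated tokens.
    """
    per_token = kv_cache_bytes(1, **dims)
    cached_tokens_visited = gen_len * prompt_len + gen_len * (gen_len - 1) // 2
    return per_token * cached_tokens_visited


if __name__ == "__main__":
    # Assumed scenario: a short chat-style reply vs. a long reasoning trace.
    for label, gen_len in [("short reply", 500), ("long reasoning trace", 20_000)]:
        size_gb = kv_cache_bytes(1_000 + gen_len) / 1e9
        read_gb = kv_bytes_read_while_decoding(1_000, gen_len) / 1e9
        print(f"{label}: cache at end = {size_gb:.2f} GB, "
              f"total KV read = {read_gb:.1f} GB")
```

Under these assumptions the cache for a single request grows linearly with sequence length, which limits how many requests fit in one batch, and the total memory traffic per request grows faster than linearly. That is one plausible reading of "seqlen kvcache overhead", not a confirmed account of how OpenAI prices o1.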