Kimi K2.7 Code, Opus vs GLM 5.2, E2E Testing — AI Daily Jun 22
529 messages · 84 active members
Overview
Topics
Kimi K2.7 Code dropped on platform.kimi.ai. @samtome got beta access and found it blazing fast but not smarter than GPT-5.5. In @mikeconner's benchmark, Opus cleanly built a 3D WebGL platformer in half the time, while GLM 5.2 produced rough-but-complete output at ~1/5 the token cost. @rmktg found Grok handles UGC ad demographic analysis that Codex refuses outright. Sakana's Fugu Ultra was clarified as a harness, not a model.
@tounano argued real-browser E2E tests are 100x slower, hard to parallelize, and burn LLM cache positions. Recommended approach: happy-dom emulated browser for first-boundary-to-end tests, with 1-2 real Playwright tests only for core journeys plus isolated browser features (clipboard, drag-and-drop). When tests fail, walk git history to find the regression rather than letting the LLM blind-patch.
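The "walk git history" step can be sketched as a binary search over commits. This is a minimal illustration, not @tounano's actual setup: `first_bad_commit`, the `passes` callback, and the sample history are hypothetical names, and the sketch assumes commits are ordered oldest to newest, the oldest passes, the newest fails, and the failure persists once introduced.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], passes: Callable[[str], bool]) -> str:
    """Return the earliest commit at which the test fails.

    Assumes: commits are ordered oldest to newest, commits[0] passes,
    commits[-1] fails, and the failure persists once introduced.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(commits[mid]):
            lo = mid  # regression was introduced after mid
        else:
            hi = mid  # regression is at mid or earlier
    return commits[hi]


# Hypothetical history in which the regression lands in "d4".
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
culprit = first_bad_commit(history, lambda sha: sha < "d4")
```

In a real repository, `passes` would check out the commit and run the failing test; `git bisect run` with a test command automates the same search. Handing the agent the culprit commit's diff gives it a concrete cause to fix instead of a symptom to patch.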
@iamgalba detailed his agentic media buying stack at tegra.co: 20-30 presell pages/month, an image skill trained on 5000 proven BOF/native ads that one-shots hundreds, and a claymation video pipeline (concepts → briefs → timelines → voiceovers). @mb29266 confirmed similar workflows wiring ad account data into image skills for self-ideation. @navuud also walked through getting an iOS web2app video editor approved by Apple by avoiding any offsite payment mention.
@sibunting runs 5 concurrent agents (dev, data, QA, marketing, orchestrator) over Telegram and solved simultaneous compaction by injecting context percentages into the orchestrator. @navuud injects the last ~20k tokens of verbatim messages on compact for seamless continuity. @thewildzeno (FunnelFlux founder) shared a detailed tmux-based agent orchestration setup with adversarial CC↔Codex review loops and bash-scripted integration testing.
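Both compaction tricks above can be sketched in a few lines. This is an illustration under stated assumptions, not either member's code: `estimate_tokens`, `verbatim_tail`, and `context_status` are hypothetical helpers, and the 4-characters-per-token estimate is a rough stand-in for a real tokenizer.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token; swap in a real tokenizer.
    return max(1, len(text) // 4)


def verbatim_tail(messages: List[str], budget: int = 20_000) -> List[str]:
    """Return the newest messages, in original order, that fit in the token budget."""
    kept: List[str] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def context_status(usage: Dict[str, float]) -> str:
    """Format per-agent context usage as one line for the orchestrator prompt."""
    return ", ".join(f"{name}: {pct:.0%}" for name, pct in sorted(usage.items()))


# Hypothetical usage: three 10-token messages, with room for only two.
log = ["a" * 40, "b" * 40, "c" * 40]
tail = verbatim_tail(log, budget=25)
status = context_status({"qa": 0.5, "dev": 0.82})
```

The tail would be appended after whatever summary the compaction step produces, so the agent resumes with its most recent exchanges intact. The status line gives the orchestrator enough information to stagger compactions instead of letting every agent hit the limit at once.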
Multiple builders (@nlrocks, @csoares12, @arielletolome, @mb29266) reported Claude and Codex guardrailing harder this week, especially on copywriting flows (with compliance and fact-checking notes injected into the output) and ad event testing. @mb29266 flagged Claude Code refusing mild prompts inside the CLI for the first time. @nlrocks predicted forced moves to local models; @Guto_gouvea noted Grok still runs with fewer guardrails.
Key Takeaways
- Skip real-browser E2E for most tests — use happy-dom for 100x faster, parallelizable coverage; reserve Playwright for 1-2 core journeys plus clipboard/drag-and-drop.
- Kimi K2.7 Code is fast but not smarter than GPT-5.5; Opus still wins on UI/3D polish over GLM 5.2, which completes the same build end to end at roughly 1/5 the token cost.
- Grok via API handles marketing analysis tasks (demographics, edgy copy) that Codex and Claude refuse — keep it in the harness for sensitive work.
- To preserve context fidelity in multi-agent setups, inject recent verbatim messages on compact rather than relying on auto-summarization alone.
- Claude/Codex guardrails are tightening fast on copywriting and ad workflows — plan local model fallbacks now, and avoid offsite payment mentions for iOS web2app approval.
Hot Threads
Why E2E Playwright tests are wasteful — use happy-dom instead
End-to-end agentic system for Google Ads, presells, and claymation video
Managing context compaction across 5 concurrent Claude agents