Kimi K2.7 Code, Opus vs GLM 5.2, E2E Testing — AI Daily Jun 22

529 messages · 84 active members

Top contributors: @tounano, @mb29266, @iamgalba

Overview

Today's discussions spanned testing strategy, model releases, and the widening guardrail gap between providers. @tounano kicked off a deep thread arguing real-browser E2E tests are wasteful: builders should default to happy-dom for 100x faster, parallelizable coverage and reserve Playwright for 1-2 core journeys plus clipboard/drag-and-drop. Consensus: when E2E fails, walk git history rather than letting LLMs blindly patch.

On the model front, Kimi K2.7 Code dropped (fast but not smarter than GPT-5.5), @mikeconner benchmarked Opus vs GLM-5.2 on a 3D WebGL platformer (Opus cleaner in half the time; GLM 5x cheaper but rough), and @rmktg showed Grok handling sensitive UGC ad analysis that Codex refused. Sakana's Fugu Ultra turned out to be a harness, not a frontier model.

Agentic systems dominated builder shares: @iamgalba detailed his tegra.co stack (presells, claymation video, 5000-image trained skill); @sibunting and @navuud compared multi-agent context-compaction tactics across 5+ Claude agents over Telegram; and @thewildzeno introduced tmux-based adversarial CC↔Codex review loops. Meanwhile @navuud walked through getting an iOS web2app video editor approved by Apple, and multiple builders flagged Claude/Codex guardrails tightening fast on copywriting and ad workflows.

Topics

Kimi K2.7 Code dropped on platform.kimi.ai — @samtome got beta access; verdict is blazing fast but not smarter than GPT-5.5. @mikeconner's benchmark showed Opus cleanly building a 3D WebGL platformer in half the time vs GLM-5.2's rough-but-complete output at ~1/5 the token cost. @rmktg found Grok handles UGC ad demographic analysis that Codex refuses outright. Sakana's Fugu Ultra was clarified as a harness, not a model.

@tounano argued real-browser E2E tests are 100x slower, hard to parallelize, and burn LLM cache positions. Recommended approach: happy-dom emulated browser for first-boundary-to-end tests, with 1-2 real Playwright tests only for core journeys plus isolated browser features (clipboard, drag-and-drop). When tests fail, walk git history to find the regression rather than letting the LLM blind-patch.
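One way to make "walk git history" systematic is a binary search over commits, which is what the built-in `git bisect` command automates. The thread summary does not say which method was used, so the following is only a minimal sketch of the idea. The function name `first_bad_commit`, the commit labels, and the `is_good` callback are all hypothetical; in practice `is_good` would check out a commit and run the failing test.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Binary-search an oldest-to-newest commit list for the first failing commit.

    Assumptions: commits[0] passes, commits[-1] fails, and once the test
    breaks it stays broken in every later commit.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid  # regression was introduced after mid
        else:
            hi = mid  # regression was introduced at or before mid
    return commits[hi]


# Hypothetical history in which the test starts failing at "d4".
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
culprit = first_bad_commit(history, lambda c: history.index(c) < 3)
```

Once the offending commit is known, its diff is a much smaller target to hand to a person or an LLM than the whole failing test run.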

@iamgalba detailed his agentic media buying stack at tegra.co: 20-30 presell pages/month, an image skill trained on 5000 proven BOF/native ads that one-shots hundreds, and a claymation video pipeline (concepts → briefs → timelines → voiceovers). @mb29266 confirmed similar workflows wiring ad account data into image skills for self-ideation. @navuud also walked through getting an iOS web2app video editor approved by Apple by avoiding any offsite payment mention.

@sibunting runs 5 concurrent agents (dev, data, QA, marketing, orchestrator) over Telegram and solved simultaneous compaction by injecting context percentages into the orchestrator. @navuud injects the last ~20k tokens of verbatim messages on compact for seamless continuity. @thewildzeno (FunnelFlux founder) shared a detailed tmux-based agent orchestration setup with adversarial CC↔Codex review loops and bash-scripted integration testing.
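The two compaction tactics can be sketched roughly as follows. This is an illustration under stated assumptions, not either builder's actual code: `estimate_tokens` is a crude four-characters-per-token stand-in for a real tokenizer, the 80% threshold is invented, and picking the single fullest agent is only one possible way an orchestrator might use context percentages to avoid simultaneous compaction.

```python
from typing import Dict, List, Optional


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def verbatim_tail(messages: List[str], budget: int = 20_000) -> List[str]:
    """Return the most recent messages, in original order, that fit within the token budget."""
    kept: List[str] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def next_to_compact(context_pct: Dict[str, float], threshold: float = 80.0) -> Optional[str]:
    """Pick the fullest agent at or above the threshold so only one compacts at a time."""
    over = {name: pct for name, pct in context_pct.items() if pct >= threshold}
    if not over:
        return None
    return max(over, key=lambda name: over[name])
```

On compaction, the list returned by `verbatim_tail` would be re-injected into the fresh context alongside whatever summary the agent produces, so the most recent exchange survives word for word.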

Multiple builders (@nlrocks, @csoares12, @arielletolome, @mb29266) reported Claude and Codex guardrailing harder this week, especially on copywriting flows (compliance/fact-checking injected) and ad event testing. @mb29266 flagged Claude Code refusing mild prompts inside CLI for the first time. @nlrocks predicted forced moves to local models; @Guto_gouvea noted Grok still runs with fewer guardrails.

Key Takeaways

  • Skip real-browser E2E for most tests — use happy-dom for 100x faster, parallelizable coverage; reserve Playwright for 1-2 core journeys plus clipboard/drag-and-drop.
  • Kimi K2.7 Code is fast but not smarter than GPT-5.5; Opus still wins on UI/3D polish vs GLM-5.2, which finished the same build end to end at roughly one-fifth the token cost.
  • Grok via API handles marketing analysis tasks (demographics, edgy copy) that Codex and Claude refuse — keep it in the harness for sensitive work.
  • For multi-agent setups, inject recent verbatim messages on compact rather than relying on auto-summarization to preserve context fidelity.
  • Claude/Codex guardrails are tightening fast on copywriting and ad workflows — plan local model fallbacks now, and avoid offsite payment mentions for iOS web2app approval.

Hot Threads

@tounano started

Why E2E Playwright tests are wasteful — use happy-dom instead

30 replies · 4 participants

@iamgalba started

End-to-end agentic system for Google Ads, presells, and claymation video

18 replies · 6 participants

@sibunting started

Managing context compaction across 5 concurrent Claude agents

10 replies · 4 participants

Linked Items