# 01-lang-cpp-performance > 用于 C/C++ 性能优化:profiling、benchmark、缓存与数据布局、SIMD、并发与无锁优化; 当出现性能瓶颈或需要系统性调优时使用。 - Author: wdfk-prog - Repository: wdfk-prog/skills - Version: 20260126112154 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/wdfk-prog/skills - Web: https://mule.run/skillshub/@@wdfk-prog/skills~01-lang-cpp-performance:20260126112154 --- --- # ═══════════════════════════════════════════════════════════════════════════════ # SKILL: Performance # Version: 3.0.0 | SASMP v1.3.0 Compliant | Production-Grade # ═══════════════════════════════════════════════════════════════════════════════ # ───────────────────────────────────────────────────────────────────────────── # IDENTITY # ───────────────────────────────────────────────────────────────────────────── name: 01-lang-cpp-performance version: "3.0.0" description: > 用于 C/C++ 性能优化:profiling、benchmark、缓存与数据布局、SIMD、并发与无锁优化; 当出现性能瓶颈或需要系统性调优时使用。 # ───────────────────────────────────────────────────────────────────────────── # COMPLIANCE # ───────────────────────────────────────────────────────────────────────────── sasmp_version: "1.3.0" skill_version: "3.0.0" # ───────────────────────────────────────────────────────────────────────────── # BONDING # ───────────────────────────────────────────────────────────────────────────── bonded_agent: 05-performance-optimizer bond_type: PRIMARY_BOND category: development # ───────────────────────────────────────────────────────────────────────────── # PARAMETERS # ───────────────────────────────────────────────────────────────────────────── parameters: optimization_target: type: string required: false enum: [throughput, latency, memory, cpu, all] default: all description: "Primary optimization target" profiling_tool: type: string required: false enum: [perf, vtune, valgrind, tracy, instruments] description: "Profiling tool to use" optimization_level: type: string required: false enum: [quick_wins, moderate, aggressive] default: moderate description: "Depth of optimization effort" maintain_readability: type: boolean required: false default: true description: "Whether to prioritize code readability" # ───────────────────────────────────────────────────────────────────────────── # ERROR HANDLING # ───────────────────────────────────────────────────────────────────────────── error_handling: retry_logic: max_attempts: 3 backoff: exponential initial_delay_ms: 1000 max_delay_ms: 16000 jitter: true fallback: on_benchmark_unstable: "increase_iterations" on_profiling_fail: "use_alternative_tool" on_no_improvement: "try_different_approach" on_regression: "rollback_and_analyze" validation: verify_no_regression: true statistical_significance: true test_multiple_inputs: true --- # Performance Skill **Production-Grade Development Skill** | C++ Performance Engineering Optimize C++ code for maximum performance through profiling, analysis, and targeted optimization. --- ## Golden Rules ``` ┌─────────────────────────────────────────────────────────────────┐ │ 1. MEASURE first - never optimize without profiling data │ │ 2. OPTIMIZE hotspots - focus on the 20% that takes 80% time │ │ 3. VERIFY improvements - benchmark before and after │ │ 4. MAINTAIN readability - premature optimization is evil │ └─────────────────────────────────────────────────────────────────┘ ``` --- ## Profiling Tools ### Linux perf ```bash # Record CPU profile perf record -g ./program perf report # Flamegraph generation perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg # Hardware counters perf stat -e cache-misses,cache-references,instructions,cycles ./program # Specific function profiling perf record -g -e cycles:u --call-graph dwarf ./program ``` ### Valgrind Callgrind ```bash # Instruction-level profiling valgrind --tool=callgrind ./program kcachegrind callgrind.out.* # Cache simulation valgrind --tool=cachegrind ./program cg_annotate cachegrind.out.* ``` ### Google Benchmark ```cpp #include static void BM_VectorPushBack(benchmark::State& state) { for (auto _ : state) { std::vector v; v.reserve(state.range(0)); // Fair comparison for (int i = 0; i < state.range(0); ++i) { v.push_back(i); } benchmark::DoNotOptimize(v.data()); benchmark::ClobberMemory(); } state.SetComplexityN(state.range(0)); } BENCHMARK(BM_VectorPushBack) ->Range(8, 8 << 10) ->Complexity(benchmark::oN); BENCHMARK_MAIN(); ``` --- ## Cache Optimization ### Data Layout: AoS vs SoA ```cpp // ❌ Array of Structures (AoS) - cache unfriendly for iteration struct ParticleAoS { float x, y, z; // Position float vx, vy, vz; // Velocity float mass; int id; }; std::vector particles; // 32 bytes per particle // ✅ Structure of Arrays (SoA) - cache friendly struct ParticlesSoA { std::vector x, y, z; // Contiguous positions std::vector vx, vy, vz; // Contiguous velocities std::vector mass; std::vector id; void update_positions(float dt) { const size_t n = x.size(); for (size_t i = 0; i < n; ++i) { x[i] += vx[i] * dt; // Full cache line utilization y[i] += vy[i] * dt; z[i] += vz[i] * dt; } } }; ``` ### Cache Line Alignment ```cpp // Avoid false sharing with cache line alignment struct alignas(64) CacheAlignedCounter { std::atomic count{0}; char padding[56]; // Ensure 64-byte alignment }; // Per-thread counters without false sharing std::array thread_counters; // Hot/cold data separation struct HotData { int frequently_accessed; int also_frequent; }; struct ColdData { std::string rarely_used; std::vector debug_info; }; struct OptimizedNode { HotData hot; ColdData* cold; // Pointer to cold data (loaded on demand) }; ``` --- ## SIMD Vectorization ### Auto-vectorization Hints ```cpp // Help compiler vectorize with restrict and pragmas void add_arrays(float* __restrict a, float* __restrict b, float* __restrict result, size_t n) { #pragma omp simd for (size_t i = 0; i < n; ++i) { result[i] = a[i] + b[i]; } } // Alignment for better vectorization void process_aligned(float* data, size_t n) { float* __restrict aligned_data = std::assume_aligned<32>(data); // C++20 for (size_t i = 0; i < n; ++i) { aligned_data[i] *= 2.0f; } } ``` ### Explicit SIMD (AVX) ```cpp #include void add_vectors_avx(const float* a, const float* b, float* result, size_t n) { size_t i = 0; // Process 8 floats at a time with AVX for (; i + 8 <= n; i += 8) { __m256 va = _mm256_loadu_ps(&a[i]); __m256 vb = _mm256_loadu_ps(&b[i]); __m256 vr = _mm256_add_ps(va, vb); _mm256_storeu_ps(&result[i], vr); } // Handle remainder for (; i < n; ++i) { result[i] = a[i] + b[i]; } } // Horizontal sum with AVX float horizontal_sum_avx(__m256 v) { __m128 lo = _mm256_castps256_ps128(v); __m128 hi = _mm256_extractf128_ps(v, 1); lo = _mm_add_ps(lo, hi); lo = _mm_hadd_ps(lo, lo); lo = _mm_hadd_ps(lo, lo); return _mm_cvtss_f32(lo); } ``` --- ## Multithreading ### Parallel Algorithms (C++17) ```cpp #include #include #include std::vector data(1'000'000); // Parallel sort std::sort(std::execution::par_unseq, data.begin(), data.end()); // Parallel transform std::transform(std::execution::par, data.begin(), data.end(), data.begin(), [](int x) { return x * 2; }); // Parallel reduce long sum = std::reduce(std::execution::par, data.begin(), data.end(), 0L); // Parallel for_each std::for_each(std::execution::par_unseq, data.begin(), data.end(), [](int& x) { x = process(x); }); ``` ### Thread Pool ```cpp #include #include #include #include #include class ThreadPool { std::vector workers_; std::queue> tasks_; std::mutex mutex_; std::condition_variable cv_; std::atomic stop_{false}; public: explicit ThreadPool(size_t threads = std::thread::hardware_concurrency()) { for (size_t i = 0; i < threads; ++i) { workers_.emplace_back([this] { while (true) { std::function task; { std::unique_lock lock(mutex_); cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); }); if (stop_ && tasks_.empty()) return; task = std::move(tasks_.front()); tasks_.pop(); } task(); } }); } } template auto enqueue(F&& f, Args&&... args) -> std::future> { using return_type = std::invoke_result_t; auto task = std::make_shared>( std::bind(std::forward(f), std::forward(args)...) ); std::future res = task->get_future(); { std::lock_guard lock(mutex_); tasks_.emplace([task]() { (*task)(); }); } cv_.notify_one(); return res; } ~ThreadPool() { stop_ = true; cv_.notify_all(); for (auto& worker : workers_) { worker.join(); } } }; ``` --- ## Quick Wins Checklist ### Immediate Optimizations - [ ] Use `reserve()` for vectors with known size - [ ] Prefer `emplace_back()` over `push_back()` - [ ] Move instead of copy when possible - [ ] Use `string_view` for read-only strings - [ ] Avoid unnecessary allocations in loops - [ ] Use `[[likely]]` / `[[unlikely]]` for branch hints ### Data Layout - [ ] Profile cache misses first - [ ] Consider SoA vs AoS for large datasets - [ ] Align hot data to cache lines - [ ] Separate hot and cold data ### Algorithmic - [ ] Choose right container for access pattern - [ ] Use binary search on sorted data - [ ] Avoid redundant computation - [ ] Consider lookup tables for expensive functions --- ## Performance Workflow ``` ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │ PROFILE │───▶│ IDENTIFY │───▶│ OPTIMIZE │ │ (measure) │ │ (hotspots) │ │ (implement) │ └─────────────┘ └──────────────┘ └───────────────┘ ▲ │ │ ▼ │ ┌──────────────┐ ┌───────────────┐ └──────────────│ VERIFY │◀───│ BENCHMARK │ │ (improved?) │ │ (measure) │ └──────────────┘ └───────────────┘ ``` --- ## Troubleshooting Decision Tree ``` Performance issue? ├── High CPU, low throughput │ ├── Check cache misses → perf stat -e cache-misses │ ├── Check branch mispredictions → perf stat -e branch-misses │ └── Profile hotspots → perf record + flamegraph ├── High latency spikes │ ├── Check for locks → Look for mutex contention │ ├── Check allocations → Use custom allocator │ └── Check I/O blocking → Use async I/O ├── Memory growing │ ├── Memory leak → Valgrind / ASan │ ├── Fragmentation → Custom allocator │ └── Retained references → Check lifetimes └── Inconsistent performance ├── CPU throttling → Check power management ├── NUMA effects → Pin threads to cores └── Context switches → Reduce thread count ``` --- ## Unit Test Template ```cpp #include #include #include class PerformanceTest : public ::testing::Test { protected: static constexpr size_t ITERATIONS = 1000; template auto measure(Func&& f) { auto start = std::chrono::high_resolution_clock::now(); for (size_t i = 0; i < ITERATIONS; ++i) { f(); } auto end = std::chrono::high_resolution_clock::now(); return std::chrono::duration_cast( end - start).count() / ITERATIONS; } }; TEST_F(PerformanceTest, VectorReserveIsFaster) { auto without_reserve = measure([]{ std::vector v; for (int i = 0; i < 1000; ++i) v.push_back(i); }); auto with_reserve = measure([]{ std::vector v; v.reserve(1000); for (int i = 0; i < 1000; ++i) v.push_back(i); }); EXPECT_LT(with_reserve, without_reserve); } TEST_F(PerformanceTest, SoAFasterThanAoS) { // Test cache efficiency auto aos_time = measure([this]{ process_aos(); }); auto soa_time = measure([this]{ process_soa(); }); EXPECT_LT(soa_time, aos_time * 0.8); // At least 20% faster } ``` --- ## Integration Points | Component | Interface | |-----------|-----------| | `build-engineer` | Optimization flags | | `modern-cpp-expert` | Move semantics | | `memory-specialist` | Allocation patterns | | `cpp-debugger-agent` | Performance debugging | --- *C++ Plugin v3.0.0 - Production-Grade Development Skill*