Case study · 2026 · Solo · Dissertation

Password Analyzer

ML-augmented password security combining a rules engine, breach detection, and a Random Forest classifier — with a deterministic policy engine that prevents any single layer from dominating.

Highlights

  • Metric: Zero weak-to-strong misclassifications
  • Metric: 85.39% accuracy, 0.956 macro-AUC
  • Architecture: Defence-in-depth policy engine
  • Security: Privacy-preserving breach detection
  • Architecture: Parallel analytical services

99.99%

Random Forest recall on independently-labelled strong-class passwords. zxcvbn's recall on the same set: 78.47%.

Weak and reused passwords remain a primary vector for account compromise. Rule-based strength meters can be gamed by predictable transformations; breach detection alone cannot evaluate novel inputs; ML classifiers without rigorous benchmarking add complexity rather than measurable value. The Password Analyzer is a defence-in-depth response, integrating three independent analytical layers behind a deterministic policy engine — and evaluated against zxcvbn using stratified cross-validation, ablation, and a McNemar significance test.

Three layers, no single point of judgement

The most consequential architectural decision was refusing to let any single layer dictate the final rating. The rules engine produces an interpretable structural score. The HIBP service performs privacy-preserving breach detection via k-anonymity. A Random Forest classifier — trained on 13 engineered features grounded in password-guessing literature — produces a probabilistic class prediction and confidence.

These signals are resolved by a deterministic policy engine that prioritises security over probability. A confirmed breach forces a weak rating regardless of any other input. A low rules score forces weak before the ML label is consulted. ML can confirm a strong rating only when the rules score is already high and the prediction is confident; it can never unilaterally elevate a structurally weaker password. When the breach check is unavailable, or the ML signal uncertain, the engine falls back to moderate — failing toward caution rather than convenience.
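The precedence order described above can be sketched as a small pure function. This is an illustrative reconstruction, not the project's actual code; the names (`Signals`, `resolve_rating`) and thresholds are hypothetical, chosen only to make the ordering of the rules concrete.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Signals:
    breached: Optional[bool]   # None means the HIBP check was unavailable
    rules_score: float         # structural score from the rules engine, 0.0-1.0
    ml_label: str              # "weak" | "moderate" | "strong"
    ml_confidence: float       # classifier confidence in ml_label

def resolve_rating(s: Signals,
                   rules_high: float = 0.8,
                   rules_low: float = 0.4,
                   ml_confident: float = 0.9) -> str:
    # 1. A confirmed breach forces weak regardless of any other input.
    if s.breached is True:
        return "weak"
    # 2. A low structural score forces weak before ML is consulted.
    if s.rules_score < rules_low:
        return "weak"
    # 3. ML may only *confirm* strong when the rules score is already high
    #    and the prediction is confident; it never elevates on its own.
    #    An unavailable breach check (None) cannot satisfy this branch.
    if (s.rules_score >= rules_high
            and s.ml_label == "strong"
            and s.ml_confidence >= ml_confident
            and s.breached is False):
        return "strong"
    # 4. Everything else falls back to moderate: failing toward caution.
    return "moderate"
```

Note how the fallback branch handles both an unavailable breach check and an uncertain ML signal without any special-casing: neither can reach the `strong` branch, so both degrade to `moderate`.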

This composition is the project's intellectual contribution. Each layer covers another's blind spots: rules give immediate transparent feedback, breach data catches reused credentials that would score acceptably on structural metrics, ML captures non-linear interactions static heuristics cannot model.

A Diceware passphrase analysed by the system. All three layers agree: rules report high complexity, HIBP reports the password as clean, and the Random Forest predicts the strong class with 93.3% confidence. The policy engine returns a Strong rating.

Random Forest over alternatives — and an honest construct dependency

A Random Forest was selected over neural alternatives for three reasons: tabular ensemble methods handle engineered structural features without large training corpora; feature-importance analysis preserves interpretability, which matters in security contexts; and inference cost is sub-millisecond, suitable for real-time interactive use.
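To make "engineered structural features" concrete, here is a minimal sketch of the kind of tabular feature vector such a classifier consumes. The dissertation's actual 13 features are not enumerated in this case study, so the features below are plausible examples from the password-guessing literature, not the project's feature set.

```python
import math
from collections import Counter

def extract_features(pw: str) -> dict:
    # Illustrative structural features for a tabular classifier.
    # These are examples only; the project's 13 features may differ.
    counts = Counter(pw)
    probs = [c / len(pw) for c in counts.values()] if pw else []
    shannon = -sum(p * math.log2(p) for p in probs)
    return {
        "length": len(pw),
        "n_lower": sum(c.islower() for c in pw),
        "n_upper": sum(c.isupper() for c in pw),
        "n_digit": sum(c.isdigit() for c in pw),
        "n_symbol": sum(not c.isalnum() for c in pw),
        "n_unique": len(counts),
        "shannon_entropy": shannon,
    }
```

Features like these are cheap to compute per keystroke, which is what keeps end-to-end inference sub-millisecond in an interactive setting.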

The harder methodological decision concerned labelling. Weak and moderate classes were drawn from RockYou and labelled by zxcvbn score thresholds — the same tool used as the evaluation benchmark. This is a real construct dependency: per-class comparison on weak and moderate categories is partially tautological because zxcvbn is being measured against labels it effectively defined.

Rather than hide this, the dissertation acknowledges it explicitly and bounds the comparison. The strong class was constructed independently using Diceware passphrases and cryptographically random strings, and treated as the primary methodologically valid test. On that independent class the Random Forest achieves 99.99% recall against zxcvbn's 78.47%. McNemar's test on paired predictions (χ² = 1,519.65, p < 0.001) confirms the disagreement between the two systems is statistically significant — and the error profiles are complementary, not competing: zxcvbn covers common human patterns better; the RF captures high-entropy structures zxcvbn systematically underrates. The policy engine is designed to exploit that complementarity.
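McNemar's test uses only the discordant pairs: inputs one system classifies correctly and the other does not. A minimal sketch of the continuity-corrected statistic (the dissertation does not state which variant it used, so this is an assumption):

```python
def mcnemar_chi2(b: int, c: int) -> float:
    """Continuity-corrected McNemar statistic, 1 degree of freedom.

    b: pairs where system A is correct and system B is wrong.
    c: pairs where system B is correct and system A is wrong.
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    return (abs(b - c) - 1) ** 2 / (b + c)
```

With 1 degree of freedom, any value above roughly 10.83 is significant at p < 0.001, so a statistic of 1,519.65 implies a very large, lopsided set of discordant pairs.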

The same UI analysing "Password123". HIBP reports the password as compromised, with 1,505,362 occurrences in the breach corpus. The policy engine returns Weak with a structured rationale, overriding the structural rules and ML signals.

Privacy-preserving breach detection with fail-open semantics

HIBP integration uses the Pwned Passwords k-anonymity API: the password is hashed locally with SHA-1, only the first five hexadecimal characters of the digest are transmitted, and suffix matching is performed locally. The full password and full hash never leave the application boundary. An lru_cache bounds repeated lookups, configurable via environment variable so strict-privacy deployments can disable caching entirely.

The subtle decision is fail-open. On timeout or an unreachable host, the function returns breached=None — not breached=False. Most implementations conflate the two, treating an unavailable API as "clean," which gives users false assurance during outages. Here, the policy engine treats None conservatively: it will not elevate a final rating when paired with weak structural signals or low ML confidence. An attacker cannot exploit API unavailability to obtain unwarranted strength validation.
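The k-anonymity flow and the tri-state return value can be sketched as follows. This is an illustrative reconstruction, not the project's code: the function names are hypothetical, and the return convention here (None = unknown, 0 = clean, >0 = breach count) is one reasonable encoding of the breached=None / breached=False distinction described above.

```python
import hashlib
import urllib.request
from functools import lru_cache
from typing import Optional

def hash_split(password: str) -> tuple[str, str]:
    # SHA-1 is computed locally; only the 5-character prefix is transmitted.
    digest = hashlib.sha1(password.encode()).hexdigest().upper()
    return digest[:5], digest[5:]

def match_count(range_body: str, suffix: str) -> int:
    # Suffix matching over the returned range happens entirely locally,
    # so the full hash never leaves the application boundary.
    for line in range_body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip() == suffix:
            return int(count)
    return 0

@lru_cache(maxsize=1024)
def check_breached(password: str, timeout: float = 2.0) -> Optional[int]:
    prefix, suffix = hash_split(password)
    url = f"https://api.pwnedpasswords.com/range/{prefix}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode()
    except OSError:          # URLError and timeouts are OSError subclasses
        return None          # unavailable means "unknown", never "clean"
    return match_count(body, suffix)   # 0 = clean, >0 = breach count
```

The single `except` branch is where the design decision lives: an outage produces None, and the policy engine refuses to treat None as evidence of cleanliness.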

Outcome

Across 5-fold stratified cross-validation on a balanced 60,000-sample dataset: 85.39% mean accuracy (±0.38%), macro-F1 of 85.34%, macro-AUC of 0.956, and the security-critical headline result — zero weak-to-strong misclassifications across all folds. The most damaging failure mode in strength estimation, where a weak password is falsely elevated to strong, did not occur once across 20,000 weak-class samples.
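The headline metric is not a standard library output; it is one cell of the confusion matrix, checked per fold. A minimal sketch of that check (hypothetical helper name, string labels assumed):

```python
def weak_to_strong_errors(y_true, y_pred) -> int:
    # Count the security-critical failure mode: a truly weak password
    # rated strong. The result reported above is that this count is
    # zero on every fold.
    return sum(t == "weak" and p == "strong"
               for t, p in zip(y_true, y_pred))
```

Tracking this single asymmetric cell separately from accuracy is what lets the evaluation distinguish a harmless weak-to-moderate error from the one misclassification that actually endangers users.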

What I'd change

The labelling dependency on zxcvbn is the most significant methodological limitation. A cleaner approach would derive ground-truth labels from attack-resistance measurements — PCFG-based guess counts or neural password guessers — eliminating the circularity entirely. Training data is drawn from RockYou (2009) and would not generalise to contemporary or non-English construction patterns; replication on more recent corpora is needed. And the evaluation is entirely quantitative — a controlled user study would test whether layered feedback actually changes password creation behaviour, which is the question that ultimately matters.

The contribution of the work is narrower than "we built a better password strength estimator." It's a principled, evidence-based step toward more security-aligned strength estimation, evaluated honestly enough that its limits are as visible as its results.

Stack
Python · Flask · Machine Learning · Security · Vue.js

Links: GitHub