In practice, the most workable approach is to measure a composite “civility score” built from multiple indicators.
1) Model-based indicators (machine learning)
- Toxicity/Aggression: Use Jigsaw’s Perspective API to obtain probabilities for “TOXICITY,” “INSULT,” “PROFANITY,” “IDENTITY_ATTACK,” etc. Japanese is supported (see the official language table); a minimal request sketch follows this list.
- Politeness: Stanford’s research established a framework for estimating “politeness” from markers like request forms, hedges, honorifics, etc. It’s English-centric, but the methodology can be adapted to Japanese. The R package politeness is also useful.
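As a concrete starting point, here is a minimal sketch of a Perspective API request for a Japanese comment. It assumes you have an API key, and you should confirm that each requested attribute is actually available for Japanese in the official language table before relying on it.

```python
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_scores(text: str, api_key: str) -> dict:
    """Request Perspective attribute probabilities for a Japanese comment.

    Returns a dict like {"TOXICITY": 0.12, "INSULT": 0.05, ...}.
    """
    payload = {
        "comment": {"text": text},
        "languages": ["ja"],
        # Requested attributes; check Japanese availability for each one
        # in the official language table.
        "requestedAttributes": {
            "TOXICITY": {},
            "INSULT": {},
            "PROFANITY": {},
            "IDENTITY_ATTACK": {},
        },
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key},
                         json=payload, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Summary scores are probabilities in [0, 1].
    return {name: attr["summaryScore"]["value"]
            for name, attr in data["attributeScores"].items()}

# Usage (hypothetical key and comment text):
# scores = toxicity_scores("ご意見ありがとうございます。", "<YOUR_API_KEY>")
# print(scores.get("TOXICITY"))
```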
2) Lightweight rules for Japanese (highly interpretable)
- Honorific/hedge rate: Ratios of “です/ます,” “〜でしょうか,” “お手数ですが,” “お願いします,” and the like.
- Presence of slurs/derogatory terms: A custom NG-word list (including figurative or euphemistic forms).
- Imperatives/strong assertions: Frequency of “〜しろ,” “〜に決まってる,” heavy use of exclamation marks, ALL-KATAKANA bursts, etc.
- Consideration/evidence markers: Signs of dialogic and verifiable style such as “根拠:,” “出典:,” “もし〜なら” (a regex-based sketch combining these rule features follows this list).
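One way to turn these rules into a single 0–1 heuristic feature is sketched below. The marker lists, NG words, and per-100-character normalization are placeholders to be tuned on real data, not a fixed specification.

```python
import re

# Placeholder marker lists; extend and tune these on real data.
POLITE_MARKERS = ["です", "ます", "でしょうか", "お手数ですが", "お願いします"]
NG_WORDS = ["バカ", "クズ"]  # custom slur/derogatory list (illustrative entries only)
# Blunt imperatives, strong assertions, exclamation runs, long katakana bursts.
ASSERTIVE_PATTERNS = [r"しろ[。!！]?", r"に決まってる", r"[!！]{2,}", r"[ァ-ヶー]{10,}"]
EVIDENCE_MARKERS = ["根拠:", "出典:", "もし"]

def _rate(text: str, markers: list[str], per_chars: int = 100) -> float:
    """Marker hits per `per_chars` characters, clipped to [0, 1]."""
    if not text:
        return 0.0
    hits = sum(text.count(m) for m in markers)
    return min(1.0, hits * per_chars / len(text))

def heuristic_score(text: str) -> float:
    """Average of normalized rule features in [0, 1]; higher means more civil."""
    politeness = _rate(text, POLITE_MARKERS)
    evidence = _rate(text, EVIDENCE_MARKERS)
    ng_penalty = 1.0 if any(w in text for w in NG_WORDS) else 0.0
    assertive_hits = sum(bool(re.search(p, text)) for p in ASSERTIVE_PATTERNS)
    assertive_penalty = assertive_hits / len(ASSERTIVE_PATTERNS)
    features = [politeness, evidence, 1.0 - ng_penalty, 1.0 - assertive_penalty]
    return sum(features) / len(features)

# print(heuristic_score("お手数ですが、出典: を添えてご確認をお願いします。"))
```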
3) Interaction signals (thread health)
- Reports/blocks, constructiveness of replies, churn/exit rate, etc., also reflect “civility.” Toxic posts are empirically associated with reduced participation (a toy aggregation is sketched below).
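One hypothetical way to fold these thread-level signals into a 0–1 value; the input fields and the equal weighting are assumptions, not a prescribed schema.

```python
def thread_health(reports: int, blocks: int, replies: int,
                  constructive_replies: int, viewers: int) -> float:
    """Combine thread-level signals into a 0-1 health value (illustrative weights)."""
    if viewers <= 0:
        return 0.5  # no exposure yet; fall back to a neutral prior
    report_rate = min(1.0, (reports + blocks) / viewers)
    constructive_rate = constructive_replies / replies if replies else 0.5
    # A low report/block rate and a high share of constructive replies both help.
    return 0.5 * (1.0 - report_rate) + 0.5 * constructive_rate

# Example: 1 report, 0 blocks, 10 replies (7 constructive), 200 viewers.
# print(thread_health(1, 0, 10, 7, 200))
```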
4) Composite score (example)
Compute 0–100 (one possible formula; tune weights with data):
- Base: Civility = 40*(1 - Toxicity) + 20*Politeness + 20*Heuristics + 20*ThreadHealth
- Toxicity: Perspective “TOXICITY” probability (0–1).
- Politeness: Probability from a politeness classifier (0–1).
- Heuristics: Averaged, normalized rule features (0–1).
- ThreadHealth: 0–1 from low report rate, high constructive-reply rate, etc. (a code transcription of this formula follows this list).
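One possible transcription of this formula in code, with action buckets matching the thresholds suggested under Fusion in the ops flow below; the weights are a starting point, not a calibrated result.

```python
def civility_score(toxicity: float, politeness: float,
                   heuristics: float, thread_health: float) -> float:
    """Composite 0-100 civility score; all inputs are probabilities/ratios in [0, 1]."""
    return (40 * (1 - toxicity)
            + 20 * politeness
            + 20 * heuristics
            + 20 * thread_health)

def civility_bucket(score: float) -> str:
    """Map the score to the action buckets used in the ops flow below."""
    if score >= 80:
        return "Exemplary"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Caution"
    return "Needs action"

# Example: low toxicity, fairly polite, decent heuristics, healthy thread.
# print(civility_bucket(civility_score(0.05, 0.7, 0.8, 0.9)))  # -> "Exemplary" (score 86)
```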
5) Ops flow (implementation tips)
- Preprocess: Keep URLs/emojis; tokenize with MeCab/Sudachi (see the tokenization sketch after this list).
- Inference: Perspective for toxicity; politeness via transfer learning from English, or start with a rule-based baseline.
- Rule features: Honorific rate, slur hits, imperative/assertion strength, etc.
- Fusion: Combine with the weights above and apply thresholds (e.g., 80 or above = Exemplary, 60–79 = Good, 40–59 = Caution, below 40 = Needs action).
- Validation: Compare to human labels; evaluate with AUC/F1 (see the evaluation sketch after this list). Run a bias audit (e.g., over-flagging sentences that contain identity terms).
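For the Preprocess step, a minimal tokenization sketch using SudachiPy (assuming the sudachipy package and a core dictionary such as sudachidict_core are installed); MeCab works just as well.

```python
from sudachipy import dictionary, tokenizer

# Build one tokenizer instance and reuse it; dictionary loading is the slow part.
_tokenizer = dictionary.Dictionary().create()
_MODE = tokenizer.Tokenizer.SplitMode.C  # longest-unit segmentation

def tokenize(text: str) -> list[str]:
    """Return surface forms; URLs and emoji are kept in the text per the ops flow."""
    return [m.surface() for m in _tokenizer.tokenize(text, _MODE)]

# print(tokenize("お手数ですが、ご確認をお願いします。"))
```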
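For the Validation step, a sketch of the comparison against human labels with scikit-learn, plus a deliberately simple flag-rate gap as the bias audit. The label convention (1 = civil) and the flag threshold are assumptions for illustration.

```python
from sklearn.metrics import roc_auc_score, f1_score

def evaluate(scores, human_labels, flag_threshold=40):
    """Compare predicted civility scores (0-100) to binary human labels (1 = civil)."""
    auc = roc_auc_score(human_labels, [s / 100 for s in scores])
    flagged = [int(s < flag_threshold) for s in scores]    # 1 = flagged as uncivil
    f1 = f1_score([1 - y for y in human_labels], flagged)  # F1 on the "uncivil" class
    return {"auc": auc, "f1": f1}

def flag_rate_gap(scores, has_identity_term, flag_threshold=40):
    """Simple bias audit: flag-rate difference for texts with vs. without identity terms."""
    def rate(group):
        return sum(s < flag_threshold for s in group) / len(group) if group else 0.0
    with_terms = [s for s, has in zip(scores, has_identity_term) if has]
    without = [s for s, has in zip(scores, has_identity_term) if not has]
    return rate(with_terms) - rate(without)
```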
6) Caveats (very important)
- Language/dialect bias: Models can misfire across languages and styles; check for language-specific errors.
- Quoting/irony/reposts: Quoting harmful text to criticize it may still be flagged as “toxic.”
- Chilling effects: Over-strict automation can suppress diversity of expression and deter participation.