In practice, the most workable approach is to measure a composite “civility score” built from multiple indicators.
1) Model-based indicators (machine learning)
- Toxicity/Aggression: Use Jigsaw’s Perspective API to obtain probabilities for “TOXICITY,” “INSULT,” “PROFANITY,” “IDENTITY_ATTACK,” etc. Japanese is supported (see the official language table); a minimal request sketch follows this list.
- Politeness: Stanford’s research established a framework for estimating “politeness” from markers like request forms, hedges, honorifics, etc. It’s English-centric, but the methodology can be adapted to Japanese. The R package politeness is also useful.
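As a concrete starting point, here is a minimal sketch of a Perspective API request for a Japanese comment. It assumes you have an API key, and you should confirm that each requested attribute is actually available for Japanese in the official language table before relying on it.

```python
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_scores(text: str, api_key: str) -> dict:
    """Request Perspective attribute probabilities for a Japanese comment.

    Returns a dict like {"TOXICITY": 0.12, "INSULT": 0.05, ...}.
    """
    payload = {
        "comment": {"text": text},
        "languages": ["ja"],
        # Requested attributes; check Japanese availability for each one
        # in the official language table.
        "requestedAttributes": {
            "TOXICITY": {},
            "INSULT": {},
            "PROFANITY": {},
            "IDENTITY_ATTACK": {},
        },
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key},
                         json=payload, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Summary scores are probabilities in [0, 1].
    return {name: attr["summaryScore"]["value"]
            for name, attr in data["attributeScores"].items()}

# Usage (hypothetical key and comment text):
# scores = toxicity_scores("ご意見ありがとうございます。", "<YOUR_API_KEY>")
# print(scores.get("TOXICITY"))
```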
2) Lightweight rules for Japanese (highly interpretable)
- Honorific/hedge rate: Ratios of “です/ます,” “〜でしょうか,” “お手数ですが,” “お願いします,” and the like.
- Presence of slurs/derogatory terms: A custom NG-word list (including figurative or euphemistic forms).
- Imperatives/strong assertions: Frequency of “〜しろ,” “〜に決まってる,” heavy use of exclamation marks, ALL-KATAKANA bursts, etc.
- Consideration/evidence markers: Signs of dialogic and verifiable style such as “根拠:,” “出典:,” “もし〜なら” (a regex-based sketch combining these rule features follows this list).
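One way to turn these rules into a single 0–1 heuristic feature is sketched below. The marker lists, NG words, and per-100-character normalization are placeholders to be tuned on real data, not a fixed specification.

```python
import re

# Placeholder marker lists; extend and tune these on real data.
POLITE_MARKERS = ["です", "ます", "でしょうか", "お手数ですが", "お願いします"]
NG_WORDS = ["バカ", "クズ"]  # custom slur/derogatory list (illustrative entries only)
# Blunt imperatives, strong assertions, exclamation runs, long katakana bursts.
ASSERTIVE_PATTERNS = [r"しろ[。!！]?", r"に決まってる", r"[!！]{2,}", r"[ァ-ヶー]{10,}"]
EVIDENCE_MARKERS = ["根拠:", "出典:", "もし"]

def _rate(text: str, markers: list[str], per_chars: int = 100) -> float:
    """Marker hits per `per_chars` characters, clipped to [0, 1]."""
    if not text:
        return 0.0
    hits = sum(text.count(m) for m in markers)
    return min(1.0, hits * per_chars / len(text))

def heuristic_score(text: str) -> float:
    """Average of normalized rule features in [0, 1]; higher means more civil."""
    politeness = _rate(text, POLITE_MARKERS)
    evidence = _rate(text, EVIDENCE_MARKERS)
    ng_penalty = 1.0 if any(w in text for w in NG_WORDS) else 0.0
    assertive_hits = sum(bool(re.search(p, text)) for p in ASSERTIVE_PATTERNS)
    assertive_penalty = assertive_hits / len(ASSERTIVE_PATTERNS)
    features = [politeness, evidence, 1.0 - ng_penalty, 1.0 - assertive_penalty]
    return sum(features) / len(features)

# print(heuristic_score("お手数ですが、出典: を添えてご確認をお願いします。"))
```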
3) Interaction signals (thread health)
- Reports/blocks, constructiveness of replies, churn/exit rate, etc., also reflect “civility.” Toxic posts are empirically associated with reduced participation (a toy aggregation is sketched below).
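One hypothetical way to fold these thread-level signals into a 0–1 value; the input fields and the equal weighting are assumptions, not a prescribed schema.

```python
def thread_health(reports: int, blocks: int, replies: int,
                  constructive_replies: int, viewers: int) -> float:
    """Combine thread-level signals into a 0-1 health value (illustrative weights)."""
    if viewers <= 0:
        return 0.5  # no exposure yet; fall back to a neutral prior
    report_rate = min(1.0, (reports + blocks) / viewers)
    constructive_rate = constructive_replies / replies if replies else 0.5
    # A low report/block rate and a high share of constructive replies both help.
    return 0.5 * (1.0 - report_rate) + 0.5 * constructive_rate

# Example: 1 report, 0 blocks, 10 replies (7 constructive), 200 viewers.
# print(thread_health(1, 0, 10, 7, 200))
```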
4) Composite score (example)
Compute 0–100 (one possible formula; tune weights with data):
- Base: Civility = 40*(1 - Toxicity) + 20*Politeness + 20*Heuristics + 20*ThreadHealth
- Toxicity: Perspective “TOXICITY” probability (0–1).
- Politeness: Probability from a politeness classifier (0–1).
- Heuristics: Averaged, normalized rule features (0–1).
- ThreadHealth: 0–1 from low report rate, high constructive-reply rate, etc. (a code transcription of this formula follows this list).
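One possible transcription of this formula in code, with action buckets matching the thresholds suggested under Fusion in the ops flow below; the weights are a starting point, not a calibrated result.

```python
def civility_score(toxicity: float, politeness: float,
                   heuristics: float, thread_health: float) -> float:
    """Composite 0-100 civility score; all inputs are probabilities/ratios in [0, 1]."""
    return (40 * (1 - toxicity)
            + 20 * politeness
            + 20 * heuristics
            + 20 * thread_health)

def civility_bucket(score: float) -> str:
    """Map the score to the action buckets used in the ops flow below."""
    if score >= 80:
        return "Exemplary"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Caution"
    return "Needs action"

# Example: low toxicity, fairly polite, decent heuristics, healthy thread.
# print(civility_bucket(civility_score(0.05, 0.7, 0.8, 0.9)))  # -> "Exemplary" (score 86)
```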
5) Ops flow (implementation tips)
- Preprocess: Keep URLs/emojis; tokenize with MeCab/Sudachi (see the tokenization sketch after this list).
- Inference: Perspective for toxicity; politeness via transfer learning from English, or start with a rule-based baseline.
- Rule features: Honorific rate, slur hits, imperative/assertion strength, etc.
- Fusion: Combine with the weights above and apply thresholds (e.g., 80 or above = Exemplary, 60–79 = Good, 40–59 = Caution, below 40 = Needs action).
- Validation: Compare to human labels; evaluate with AUC/F1 (see the evaluation sketch after this list). Run a bias audit (e.g., over-flagging sentences that contain identity terms).
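For the Preprocess step, a minimal tokenization sketch using SudachiPy (assuming the sudachipy package and a core dictionary such as sudachidict_core are installed); MeCab works just as well.

```python
from sudachipy import dictionary, tokenizer

# Build one tokenizer instance and reuse it; dictionary loading is the slow part.
_tokenizer = dictionary.Dictionary().create()
_MODE = tokenizer.Tokenizer.SplitMode.C  # longest-unit segmentation

def tokenize(text: str) -> list[str]:
    """Return surface forms; URLs and emoji are kept in the text per the ops flow."""
    return [m.surface() for m in _tokenizer.tokenize(text, _MODE)]

# print(tokenize("お手数ですが、ご確認をお願いします。"))
```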
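For the Validation step, a sketch of the comparison against human labels with scikit-learn, plus a deliberately simple flag-rate gap as the bias audit. The label convention (1 = civil) and the flag threshold are assumptions for illustration.

```python
from sklearn.metrics import roc_auc_score, f1_score

def evaluate(scores, human_labels, flag_threshold=40):
    """Compare predicted civility scores (0-100) to binary human labels (1 = civil)."""
    auc = roc_auc_score(human_labels, [s / 100 for s in scores])
    flagged = [int(s < flag_threshold) for s in scores]    # 1 = flagged as uncivil
    f1 = f1_score([1 - y for y in human_labels], flagged)  # F1 on the "uncivil" class
    return {"auc": auc, "f1": f1}

def flag_rate_gap(scores, has_identity_term, flag_threshold=40):
    """Simple bias audit: flag-rate difference for texts with vs. without identity terms."""
    def rate(group):
        return sum(s < flag_threshold for s in group) / len(group) if group else 0.0
    with_terms = [s for s, has in zip(scores, has_identity_term) if has]
    without = [s for s, has in zip(scores, has_identity_term) if not has]
    return rate(with_terms) - rate(without)
```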
6) Caveats (very important)
- Language/dialect bias: Models can misfire across languages and styles; check for language-specific errors.
- Quoting/irony/reposts: Quoting harmful text to criticize it may still be flagged as “toxic.”
- Chilling effects: Over-strict automation can suppress diversity of expression and deter participation.