Week 3: Conjugate Bayes

Agenda

Welcome back 0:00
Conjugacy as a pattern 0:10
Beta-Binomial 0:25
Break 1:00
Gaussian-Gaussian with N observations 1:10
Subjective randomness (G&T 2001) 1:45
Close 1:55

Paper presentation assignments

Your paper presentation slots

Week	Presenter	Paper
4	Shohei	Tenenbaum & Xu (2000), Word learning as Bayesian inference
11	Imai	Mikolov et al. (2013), Efficient estimation of word representations in vector space (word2vec)
12	Tenzin	Miller, Raine & Groh (2023), AI hyperrealism: why AI faces are perceived as more real than human ones

How the presentation works

~15 min talk + 5–10 min discussion that you facilitate
Focus: how the math connects to cognitive science, and the authors’ evidence
Papers have more than fits in 15 min — pick the 2–3 ideas that matter most
End with at least 3 discussion questions
Worth 7.5% of your grade
If you’re concerned about your presentation, DM me for a 1-on-1 (at least one week before your slot) — we’ll plan what to focus on and what to skip

Framing: six questions to organize your talk

(Adapted from Tom Griffiths. Also good for reading ANY paper in this course.)

What is the cognitive science question the paper explores?
What is the context? What previous approaches have been tried?
What is the underlying computational problem? How is it formulated?
What is the solution? Core mathematical ideas — intuition over rigor.
How does the solution connect to human behavior?
What can we conclude? Your own interpretation.

Full rubric + tips: hml.chibatech.dev/presentation-guidelines.html

Conjugacy as a pattern

The move we saw in Week 2

Prior: \(\mu \sim N(500, 20^2)\)

Likelihood: observation \(D = 510\), known \(\sigma = 30\)

Posterior: \(\mu \mid D \sim N(503.1, 16.6^2)\)

Notice: prior is Gaussian → posterior is Gaussian. Same family.

What “conjugate” means: in words

Conjugate = the posterior stays in the same family as the prior.

Gaussian prior + Gaussian likelihood → Gaussian posterior.

Different parameters, same shape of distribution.

We just saw it once. The next question: is this lucky, or a pattern?

What “conjugate” means: formally

A prior family \(\mathcal{F}\) is conjugate to a likelihood \(p(D \mid \theta)\) when the posterior stays in \(\mathcal{F}\) (same functional form, updated params):

\[ p(\theta) \in \mathcal{F} \;\; \Longrightarrow \;\; p(\theta \mid D) \in \mathcal{F} \]

The likelihood typically lives in a different family — e.g. Beta prior + Binomial likelihood → Beta posterior. Conjugacy is a property of the pair (prior family, likelihood).

Last week’s Gaussian-Gaussian was the special case where both happened to be in the same family — not the general rule.

Why conjugacy matters

Property	Why you care
Closed-form posterior	No integration, no sampling
Sequential updates	Today’s posterior = tomorrow’s prior
Interpretable hyperparameters	Prior knowledge as “pseudo-observations”
Fast, exact	Good pedagogy + fast enough to compute on the fly

Beta-Binomial

Back to the bentos, but now with counts

Chibany’s prior belief about tonkatsu rate \(\theta\):

70% tonkatsu, 30% hamburger — but how confident?

This semester’s data: 27 tonkatsu out of 40 bentos.

What’s Chibany’s updated belief about \(\theta\)?

The Binomial likelihood: what you already know

With rate \(\theta\) fixed, \(n\) bentos give \(k\) tonkatsus with probability:

\[ p(k \mid \theta, n) = \binom{n}{k} \, \theta^{k} (1-\theta)^{n-k} \]

Each bento is iid Bernoulli(\(\theta\)). \(\binom{n}{k}\) counts which \(k\) of the \(n\) were tonkatsu.

Now flip your viewpoint

What if we fix \(k\) and \(n\) (we saw them) and ask: which \(\theta\) made these data likely?

\[ \underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{a function of } \theta} \]

Same expression — now read as a curve over \(\theta \in [0,1]\). The \(\binom{n}{k}\) drops out: it doesn’t depend on \(\theta\).

That curve has a name: Beta

Normalize \(\theta^{k}(1-\theta)^{n-k}\) over \(\theta \in [0,1]\) — it integrates to a constant. Call that constant \(B(k+1, n-k+1)\).

\[ \text{Beta}(\theta; \, \alpha, \beta) \;\propto\; \theta^{\alpha - 1}(1-\theta)^{\beta - 1} \]

Setting \(\alpha = k+1\), \(\beta = n-k+1\) recovers exactly the likelihood-as-curve. Beta is the family that generalizes that shape — any \(\alpha, \beta > 0\) allowed, including non-integer.
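
A quick numerical check of the normalization claim, as a sketch rather than part of the original slides: integrating the likelihood curve by brute force should reproduce \(B(k+1, n-k+1)\), computed here through the standard identity \(B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)\). The helper names and the midpoint rule are illustrative choices; the counts are this semester's 27 out of 40.

```python
import math


def beta_function(a: float, b: float) -> float:
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b), computed in log space."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def integrate_likelihood_curve(k: int, n: int, steps: int = 100_000) -> float:
    """Midpoint-rule integral of theta^k * (1 - theta)^(n - k) over [0, 1]."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * width
        total += theta ** k * (1.0 - theta) ** (n - k)
    return total * width


k, n = 27, 40  # tonkatsu count and total bentos from the running example
numeric = integrate_likelihood_curve(k, n)
closed_form = beta_function(k + 1, n - k + 1)
print(f"numeric integral: {numeric:.6e}")
print(f"B(k+1, n-k+1):    {closed_form:.6e}")
```

The two printed numbers should agree to many digits, which is all "the constant has a name" means in practice.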

One Beta to start: Chibany’s prior

\(\text{Beta}(8, 3)\): mass concentrated above \(0.5\), peak around \(0.78\). Moderate confidence — not razor-sharp, not flat.

5 samples: \(0.755, \; 0.638, \; 0.520, \; 0.748, \; 0.647\)

Four Betas, same mean: different shapes

All four have mean \(0.5\) (because \(\alpha = \beta\)) — but they look completely different. The \(\alpha + \beta\) dial controls concentration, not location.
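
To make the "concentration, not location" point concrete, here is a small sketch using the standard Beta mean and standard deviation formulas. The four \((\alpha, \beta)\) pairs are an assumption: the text does not list which four the plot shows, so the symmetric pairs from the poll options are used instead.

```python
import math


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


def beta_sd(alpha: float, beta: float) -> float:
    """Standard deviation of Beta(alpha, beta)."""
    total = alpha + beta
    return math.sqrt(alpha * beta / (total ** 2 * (total + 1)))


# ASSUMPTION: illustrative symmetric pairs, not necessarily the four plotted.
examples = [(0.5, 0.5), (1, 1), (2, 2), (10, 10)]
summary = [(a, b, beta_mean(a, b), beta_sd(a, b)) for a, b in examples]

for a, b, mean, sd in summary:
    print(f"Beta({a}, {b}): alpha+beta={a + b}, mean={mean:.2f}, sd={sd:.3f}")
```

The mean stays at 0.5 in every row while the standard deviation shrinks as \(\alpha + \beta\) grows.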

Poll: Tanaka’s attic of marbles

Tanaka finds bags of marbles in his parents’ attic. Each bag is mostly one color (white or black), but overall the count is ~50/50.

He wants to encode this in a Beta prior over \(\theta\) = probability of drawing white. Which \(\text{Beta}(\alpha, \beta)\)?

A. \(\text{Beta}(1, 1)\) — uniform, no info
B. \(\text{Beta}(2, 2)\) — gentle center at \(0.5\)
C. \(\text{Beta}(10, 10)\) — strong center at \(0.5\)
D. \(\text{Beta}(0.5, 0.5)\) — U-shaped at the edges

Poll: answer

D. \(\text{Beta}(0.5, 0.5)\).

U-shaped: mass piles up near \(0\) and \(1\) (bag-level extremity), symmetric overall. \(\text{Beta}(2,2)\) and \(\text{Beta}(10,10)\) are unimodal at \(0.5\) — they encode “around half white” within a bag, which is the opposite of what Tanaka saw.

Beta-Binomial: set up the pieces

Prior: \(\theta \sim \text{Beta}(\alpha, \beta)\) → \(p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\)

Likelihood: \(k\) tonkatsus in \(n\) bentos → \(p(k \mid \theta) \propto \theta^{k}(1-\theta)^{n-k}\)

Beta-Binomial: multiply and read off

Posterior \(\propto\) Prior \(\times\) Likelihood:

\[ p(\theta \mid k) \;\propto\; \theta^{\alpha-1}(1-\theta)^{\beta-1} \cdot \theta^{k}(1-\theta)^{n-k} \]

\[ = \; \theta^{(\alpha + k) - 1}(1-\theta)^{(\beta + n - k) - 1} \]

Recognize this: it’s \(\text{Beta}(\alpha + k, \; \beta + n - k)\).

Beta-Binomial conjugate update

Prior: \(\theta \sim \text{Beta}(\alpha, \beta)\)

Data: \(k\) successes in \(n\) trials

Posterior: \(\theta \mid k \sim \text{Beta}(\alpha + k, \; \beta + n - k)\)

Just add the counts. Successes bump \(\alpha\), failures bump \(\beta\).

Worked example: Chibany’s semester

Prior: \(\theta \sim \text{Beta}(7, 3)\) — “70/30 with low confidence” · Data: 27 tonkatsu in 40 · Posterior: \(\text{Beta}(7+27, \; 3+13) = \text{Beta}(34, 16)\)

Mean barely moved (\(0.70 \to 0.68\)) — but the posterior is much sharper. 40 observations of moderate signal added a lot of certainty.
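
The "add the counts" rule and the sharpening claim can be checked in a few lines. This is a minimal sketch: the `Beta` class and its method names are hypothetical helpers written for this example, and the mean and standard deviation are the standard Beta formulas.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Hypothetical helper holding the two Beta hyperparameters."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def sd(self) -> float:
        total = self.alpha + self.beta
        return math.sqrt(self.alpha * self.beta / (total ** 2 * (total + 1)))

    def update(self, successes: int, trials: int) -> "Beta":
        """Conjugate update: successes bump alpha, failures bump beta."""
        return Beta(self.alpha + successes, self.beta + (trials - successes))


prior = Beta(7, 3)                                  # "70/30 with low confidence"
posterior = prior.update(successes=27, trials=40)   # 27 tonkatsu in 40 bentos

print(posterior)
print(f"mean: {prior.mean:.2f} -> {posterior.mean:.2f}")
print(f"sd:   {prior.sd:.3f} -> {posterior.sd:.3f}")
```

Under these formulas the standard deviation drops from about 0.14 to about 0.07, which is the sense in which the posterior is "much sharper" even though the mean barely moves.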

Break

Resume: Gaussian-Gaussian with N observations

Agenda so far: Beta-Binomial ✓

Now: what if Chibany weighs N bentos, not one?

Gaussian-Gaussian: notation lock-in

Symbol	What it is
\(\mu_0, \sigma_0^2\)	Prior mean and variance of \(\mu\)
\(\sigma^2\)	Data noise (known, fixed)
\(D_1, \ldots, D_N\)	\(N\) iid observations
\(\sum_i D_i\)	Sum over the \(N\) observations: \(D_1 + D_2 + \cdots + D_N\)
\(\mu_N, \sigma_N^2\)	Posterior mean and variance of \(\mu\) after seeing \(N\) data

Gaussian-Gaussian: precision is additive

Posterior precision:

\[ \underbrace{\frac{1}{\sigma_N^2}}_{\text{posterior}} \;=\; \underbrace{\frac{1}{\sigma_0^2}}_{\text{prior}} \;+\; \underbrace{\frac{N}{\sigma^2}}_{N \text{ data}} \]

Precision = 1/variance. Each observation adds \(1/\sigma^2\) units. \(N\) observations add \(N/\sigma^2\).

Sanity check: at \(N = 1\), this matches Week 2’s single-observation case.

Gaussian-Gaussian: posterior mean

\[ \mu_N \;=\; \sigma_N^2 \left( \underbrace{\frac{\mu_0}{\sigma_0^2}}_{\text{prior precision} \times \text{prior mean}} \;+\; \underbrace{\frac{\sum_i D_i}{\sigma^2}}_{\text{data precision} \times \text{data sum}} \right) \]

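\(\mu_N\) is a precision-weighted average of the prior mean \(\mu_0\) (weight \(1/\sigma_0^2\)) and the sample mean \(\bar{D} = \frac{1}{N}\sum_i D_i\) (weight \(N/\sigma^2\)), since \(\sum_i D_i / \sigma^2 = (N/\sigma^2)\,\bar{D}\). Whoever has more precision wins.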
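
The two formulas above translate almost line for line into code. The sketch below is illustrative: the function names are made up for this example, the prior and noise values are the Week 2 numbers (\(\mu_0 = 500\), \(\sigma_0 = 20\), \(\sigma = 30\)), and the second call simply adds a further observation of 498.

```python
import math
from typing import Sequence, Tuple


def gaussian_posterior(mu0: float, sigma0: float, sigma: float,
                       data: Sequence[float]) -> Tuple[float, float]:
    """Posterior mean and standard deviation of mu, with known noise sd sigma."""
    n = len(data)
    posterior_precision = 1.0 / sigma0 ** 2 + n / sigma ** 2   # precisions add
    posterior_var = 1.0 / posterior_precision
    posterior_mean = posterior_var * (mu0 / sigma0 ** 2 + sum(data) / sigma ** 2)
    return posterior_mean, math.sqrt(posterior_var)


def weighted_average_form(mu0: float, sigma0: float, sigma: float,
                          data: Sequence[float]) -> float:
    """Same posterior mean, written as a weighted average of mu0 and the sample mean."""
    n = len(data)
    w_prior = 1.0 / sigma0 ** 2
    w_data = n / sigma ** 2
    sample_mean = sum(data) / n
    return (w_prior * mu0 + w_data * sample_mean) / (w_prior + w_data)


mean_1, sd_1 = gaussian_posterior(500.0, 20.0, 30.0, [510.0])          # N = 1
mean_2, sd_2 = gaussian_posterior(500.0, 20.0, 30.0, [510.0, 498.0])   # N = 2
print(f"N=1: mean={mean_1:.2f}, sd={sd_1:.2f}")
print(f"N=2: mean={mean_2:.2f}, sd={sd_2:.2f}")
```

Both functions return the same posterior mean; the second form just makes the weights explicit.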

Poll: Jamal’s shortcut

While deriving a posterior over \(\mu\), Jamal notices the non-constant terms (w.r.t. \(\mu\)) have the form of a Gaussian. He drops everything else and concludes the posterior is Gaussian with parameters read off the surviving form. Is he correct?

A. Yes — the dropped terms are absorbed into the normalization constant
B. Yes — you can drop any term, even those involving \(\mu\)
C. No — he dropped some terms that involve \(\mu\)
D. Only if he later multiplies his answer by the dropped terms

Poll: answer

A. Yes — the dropped terms are part of the normalization constant.

A posterior is a probability density in \(\mu\). Anything not depending on \(\mu\) is a multiplicative constant, absorbed into the normalizer \(Z = \int p(D \mid \mu)\, p(\mu)\, d\mu\).

Recognize the functional form → read off parameters → normalization handles itself.
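
Jamal's shortcut can be checked numerically. The sketch below keeps only the \(\mu\)-dependent factors of prior times likelihood, normalizes them by brute force on a grid, and compares the result with the Gaussian whose parameters are read off from the closed-form update. The grid range, step count, and function names are illustrative assumptions; the numbers are the Week 2 example.

```python
import math

MU0, SIGMA0, SIGMA, D = 500.0, 20.0, 30.0, 510.0   # Week 2 example values


def unnormalized_posterior(mu: float) -> float:
    """Prior times likelihood with every factor that is constant in mu dropped."""
    return math.exp(-(mu - MU0) ** 2 / (2 * SIGMA0 ** 2)
                    - (D - mu) ** 2 / (2 * SIGMA ** 2))


def normal_pdf(x: float, mean: float, sd: float) -> float:
    return math.exp(-(x - mean) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))


# Parameters read off from the closed-form conjugate update.
post_var = 1.0 / (1.0 / SIGMA0 ** 2 + 1.0 / SIGMA ** 2)
post_mean = post_var * (MU0 / SIGMA0 ** 2 + D / SIGMA ** 2)
post_sd = math.sqrt(post_var)

# Brute-force normalization: midpoint rule on a wide grid around the prior mean.
lo, hi, steps = 300.0, 700.0, 40_000
width = (hi - lo) / steps
grid = [lo + (i + 0.5) * width for i in range(steps)]
z = sum(unnormalized_posterior(mu) for mu in grid) * width

max_gap = max(abs(unnormalized_posterior(mu) / z - normal_pdf(mu, post_mean, post_sd))
              for mu in grid[::400])
print(f"largest pointwise gap between the two densities: {max_gap:.2e}")
```

The gap is at the level of numerical error, so dropping the constants and letting normalization "handle itself" loses nothing.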

Sequential updating: same rule, no new math

Observations arrive one at a time. Posterior after \(k\) observations becomes prior for observation \(k+1\).

\[ \text{Beta}(34, 16) \xrightarrow[+1 \text{ hamb}]{\text{see 1 more}} \text{Beta}(34, 17) \]

\[ N(503.1, 16.6^2) \xrightarrow[D = 498]{\text{see 1 more}} N(501.9, 14.6^2) \]

This is why conjugacy is useful in practice: online updates, no re-fit.
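
A short sketch of the online view: each step takes the current posterior as the prior for the next observation. The step functions are hypothetical helpers; the Beta run starts from \(\text{Beta}(7, 3)\), feeds in the semester's 27 tonkatsu and 13 hamburgers one at a time, then one more hamburger, and the Gaussian run uses observations 510 and then 498 with the Week 2 prior.

```python
from typing import Tuple


def beta_step(alpha: float, beta: float, saw_tonkatsu: bool) -> Tuple[float, float]:
    """One-observation Beta update: a success bumps alpha, a failure bumps beta."""
    return (alpha + 1, beta) if saw_tonkatsu else (alpha, beta + 1)


def gaussian_step(mean: float, var: float, observation: float,
                  noise_var: float) -> Tuple[float, float]:
    """One-observation Gaussian update with known noise variance."""
    post_var = 1.0 / (1.0 / var + 1.0 / noise_var)
    post_mean = post_var * (mean / var + observation / noise_var)
    return post_mean, post_var


# Beta: the order of the bentos does not matter, only the counts.
alpha, beta = 7.0, 3.0
for saw_tonkatsu in [True] * 27 + [False] * 13 + [False]:
    alpha, beta = beta_step(alpha, beta, saw_tonkatsu)
print(f"Beta({alpha:.0f}, {beta:.0f})")

# Gaussian: two sequential steps versus one batch update with N = 2.
mean, var = 500.0, 20.0 ** 2
for observation in (510.0, 498.0):
    mean, var = gaussian_step(mean, var, observation, noise_var=30.0 ** 2)

batch_var = 1.0 / (1.0 / 20.0 ** 2 + 2 / 30.0 ** 2)
batch_mean = batch_var * (500.0 / 20.0 ** 2 + (510.0 + 498.0) / 30.0 ** 2)
print(f"sequential: mean={mean:.2f}, var={var:.2f}")
print(f"batch:      mean={batch_mean:.2f}, var={batch_var:.2f}")
```

Sequential and batch updates land on the same posterior, which is why no re-fit is needed when a new observation arrives.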

Three conjugate pairs, one pattern

Prior	Likelihood	Posterior
\(\text{Beta}(\alpha, \beta)\)	Binomial\((n, \theta)\)	\(\text{Beta}(\alpha + k, \beta + n - k)\)
\(\text{Dirichlet}(\vec{\alpha})\)	Multinomial\((n, \vec{\theta})\)	\(\text{Dirichlet}(\vec{\alpha} + \vec{k})\)
\(N(\mu_0, \sigma_0^2)\)	\(N(\mu, \sigma^2)\)	\(N(\mu_N, \sigma_N^2)\)

Row 2 is the multi-category generalization: \(\vec{\alpha} = (\alpha_1, \ldots, \alpha_K)\), \(\vec{k} = (k_1, \ldots, k_K)\). Same “add the counts” rule.
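
The Dirichlet row is the same arithmetic applied elementwise. In this sketch the function names are illustrative, and the three-category bento example (tonkatsu, hamburger, karaage) with its pseudo-counts and data is entirely hypothetical; the two-category call shows that the rule reduces to the Beta-Binomial update.

```python
from typing import List, Sequence


def dirichlet_update(alpha: Sequence[float], counts: Sequence[int]) -> List[float]:
    """Conjugate update: add each category's observed count to its pseudo-count."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same number of categories")
    return [a + k for a, k in zip(alpha, counts)]


def dirichlet_mean(alpha: Sequence[float]) -> List[float]:
    """Mean of a Dirichlet: each pseudo-count divided by their total."""
    total = sum(alpha)
    return [a / total for a in alpha]


# K = 2 is exactly the Beta-Binomial case: Beta(7, 3) plus 27 and 13.
two_category = dirichlet_update([7, 3], [27, 13])

# HYPOTHETICAL three-category example: tonkatsu, hamburger, karaage.
three_category = dirichlet_update([7, 3, 2], [27, 13, 5])

print(two_category, dirichlet_mean(two_category))
print(three_category, dirichlet_mean(three_category))
```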

Stretch question: pushing the pattern

Three pairs all worked. But how strict is the “same family” rule?

Prior on \(\mu\): bimodal (mixture of two Gaussians)

Likelihood: Gaussian

Posterior?

A. Gaussian (likelihood dominates)
B. Bimodal Gaussian mixture
C. Uniform (prior × likelihood cancels)
D. Not closed-form — needs numerics

Stretch question: answer

B and D are both defensible — depending on what “family” means.

D (strict): the posterior isn’t a single Gaussian, so we left the Gaussian family. Conjugacy to Gaussians fails.
B (generalized): components update independently, weights re-balance — still closed-form. The richer “mixture of Gaussians” family IS conjugate.

Lesson: conjugacy is a property of the (prior family, likelihood) pair, not the prior alone.
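
Reading B can be made concrete with a sketch of the mixture update for a single observation. Each component gets the usual Gaussian-Gaussian update, and each weight is rescaled by how well that component predicted the observation, that is, by the marginal density \(N(D;\, m_j,\, s_j^2 + \sigma^2)\). The function names and the bimodal prior below (equal weights on \(N(480, 10^2)\) and \(N(520, 10^2)\)) are hypothetical choices for illustration.

```python
import math
from typing import List, Sequence, Tuple


def normal_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def update_mixture(weights: Sequence[float], means: Sequence[float],
                   variances: Sequence[float], observation: float,
                   noise_var: float) -> Tuple[List[float], List[float], List[float]]:
    """Posterior of a Gaussian-mixture prior on mu after one Gaussian observation."""
    new_weights, new_means, new_vars = [], [], []
    for w, m, v in zip(weights, means, variances):
        post_var = 1.0 / (1.0 / v + 1.0 / noise_var)            # per-component update
        post_mean = post_var * (m / v + observation / noise_var)
        evidence = normal_pdf(observation, m, v + noise_var)    # component's prediction
        new_weights.append(w * evidence)
        new_means.append(post_mean)
        new_vars.append(post_var)
    total = sum(new_weights)
    return [w / total for w in new_weights], new_means, new_vars


# HYPOTHETICAL bimodal prior; noise sd 30 and observation 510 as in the running example.
weights, means, variances = update_mixture(
    weights=[0.5, 0.5], means=[480.0, 520.0], variances=[100.0, 100.0],
    observation=510.0, noise_var=900.0)

for w, m, v in zip(weights, means, variances):
    print(f"weight={w:.3f}, mean={m:.1f}, sd={math.sqrt(v):.2f}")
```

The posterior is still a two-component Gaussian mixture in closed form; the weight shifts toward the component nearer the observation.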

Subjective randomness: a Bayesian story

Which feels more random?

\[ \text{(a)} \quad \mathtt{H \; H \; T \; H \; T \; T \; T \; H} \]

\[ \text{(b)} \quad \mathtt{H \; H \; H \; H \; H \; H \; H \; H} \]

Why does HHTHTTTH look “more random” than HHHHHHHH?

Both sequences have the same probability under a fair coin: \((1/2)^8 = 1/256\).

So why do people consistently say the first is “more random”?

Griffiths & Tenenbaum (2001): a single likelihood \(P(x \mid \text{random})\) can’t decide anything by itself — you need at least two hypotheses to compare. “Is \(x\) random?” only has an answer if you also ask “compared to what?” — e.g. “or did some regularity in the world produce \(x\)?”

The reframe: likelihood ratio, not likelihood

Not \(P(x \mid \text{random})\). Instead, \(P(x \mid \text{random})\) vs. \(P(x \mid \text{regular})\).

\[ \text{subjective randomness}(x) \;=\; \log \frac{P(x \mid \text{random})}{P(x \mid \text{regular})} \]

A log likelihood ratio. With equal prior probability on the two hypotheses, this equals the log posterior odds.
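
To see the ratio produce different scores for the two sequences, we need some concrete "regular" hypothesis. The sketch below makes one up, and it is an assumption, not the model from the paper: "regular" means a coin of unknown bias \(\theta\) with a uniform \(\text{Beta}(1, 1)\) prior, so \(P(x \mid \text{regular}) = B(h+1, t+1)\) for a sequence with \(h\) heads and \(t\) tails. The function names are likewise illustrative.

```python
import math


def log_p_random(sequence: str) -> float:
    """Fair coin: every length-n sequence has probability (1/2)^n."""
    return len(sequence) * math.log(0.5)


def log_p_regular(sequence: str) -> float:
    """ASSUMPTION: 'regular' = coin with unknown bias theta ~ Beta(1, 1).

    Integrating theta^h * (1 - theta)^t over [0, 1] gives B(h + 1, t + 1).
    """
    heads = sequence.count("H")
    tails = len(sequence) - heads
    return math.lgamma(heads + 1) + math.lgamma(tails + 1) - math.lgamma(heads + tails + 2)


def randomness_score(sequence: str) -> float:
    """log P(x | random) - log P(x | regular); higher means 'looks more random'."""
    return log_p_random(sequence) - log_p_regular(sequence)


score_a = randomness_score("HHTHTTTH")
score_b = randomness_score("HHHHHHHH")
print(f"(a) HHTHTTTH: {score_a:+.2f}")
print(f"(b) HHHHHHHH: {score_b:+.2f}")
```

Under this assumed alternative, (a) scores about +0.90 and (b) about -3.35, even though both have probability 1/256 under the fair coin. The toy alternative only looks at head and tail counts, so it would not flag a sequence like HTHTHTHT; a richer "regular" hypothesis is needed for that.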

The model: local representativeness

At step \(k\), count heads \(H_i\) and tails \(T_i\) in the suffix going back \(i\) steps. Score how much choosing H vs T at step \(k\) keeps the suffixes balanced:

\[ L_k \;=\; \sum_{i=1}^{k-1} \log \frac{P(\,H_i + 1,\; T_i \mid \text{random}\,)}{P(\,H_i,\; T_i + 1 \mid \text{random}\,)} \]

Then \(P(R_k = \text{H}) = \sigma(\lambda L_k)\), where \(\sigma(z) = \dfrac{1}{1 + e^{-z}}\) squashes any real number into \([0, 1]\).

A long run of H’s drives \(L_k\) negative (every suffix already looks H-heavy), so the model strongly prefers T next — no free “switch preference” parameter needed.
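
The response rule \(P(R_k = \text{H}) = \sigma(\lambda L_k)\) is easy to play with in code. This sketch takes \(L_k\) as a given input rather than computing it from a sequence, and the example values of \(L_k\) are made up; only the logistic function and the role of \(\lambda\) are illustrated.

```python
import math


def logistic(z: float) -> float:
    """sigma(z) = 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def p_heads(l_k: float, lam: float = 0.6) -> float:
    """Probability of responding H at step k, given the evidence term L_k."""
    return logistic(lam * l_k)


# HYPOTHETICAL L_k values: negative means the recent suffixes look H-heavy.
for l_k in (-4.0, -2.0, 0.0, 2.0):
    print(f"L_k={l_k:+.1f}: P(H)={p_heads(l_k):.3f}")

# Effect of lambda at a fixed L_k = -2: 0 gives indifference, large gives a near-step.
for lam in (0.0, 0.6, 5.0):
    print(f"lambda={lam}: P(H)={p_heads(-2.0, lam):.3f}")
```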

~3-4 min. The model spec. This slide has more notation than usual — slow down.

WHAT σ MEANS HERE (the load-bearing notation explanation):

σ is the LOGISTIC function (also called the sigmoid). Four facts to convey:

Definition: σ(z) = 1 / (1 + e^(-z)). On the board, draw the S-curve: it passes through (0, 0.5), asymptotes to 0 as z → -∞ and to 1 as z → +∞. The curve is monotonically increasing.
What it does FOR US: L_k is a real number that can take any value from -∞ to +∞. But we need a PROBABILITY for the response. The logistic squashes the real line into [0, 1] so the output is a valid probability. σ(0) = 0.5 (“indifferent → pick H half the time”), σ(large positive) → 1 (“very strongly prefer H”), σ(large negative) → 0 (“very strongly prefer T”).
NOTATION WARNING: σ is overloaded. In Weeks 2-3 we used σ for the standard deviation of a Gaussian (e.g. σ², σ₀² in the Gaussian-Gaussian update). HERE σ is a FUNCTION, not a standard deviation. Same Greek letter, totally different meaning. The argument tells you which: σ(z) with parentheses = logistic function; σ alone = std dev. Flag this to the class verbally; it’s the kind of notation collision that trips up careful students.
λ is INSIDE σ. λ multiplies L_k before squashing. Bigger λ → σ steepens around 0 → small differences in L_k become large differences in probability. λ → 0 → σ(0) = 0.5 always, model becomes indifferent. λ → ∞ → step function, model is deterministic. G&T fit λ ≈ 0.6, which is moderately strong but not deterministic.

KEY CONCEPTUAL MOVES TO LAND (beyond σ):

1. The model is SEQUENTIAL: each response depends on what came before.
2. “Local representativeness” = each suffix subsequence should look ~50/50.
3. L_k is a SUM of log-likelihood-ratios over different suffix lengths (this is exactly the conjugacy “drop the normalizer, recognize the form” move applied across multiple Bernoulli likelihoods).
4. The logistic transform converts L_k into a response probability.
5. λ is the model’s ONE free parameter. G&T fit λ ≈ 0.6 to Zenith data.

Importantly: the model has NO free “switch preference” parameter. The switching bias is an EMERGENT property of computing the likelihood ratio. That’s the difference between description and explanation.

If students push back on σ: just draw it. The S-curve is a 30-second sketch and once they see “any real number in, a probability out” it becomes intuitive. Don’t get stuck deriving why this particular function — other squashing functions exist; G&T just picked the logistic because it’s smooth and analytically convenient.

Zenith radio data: the binary sequence experiment

1937 publicity stunt: Zenith broadcast 5 H/T sequences via radio, asked 20,099 listeners to “transmit” their guesses via ESP. Sequences collapsed to 16 length-5 patterns (initial choice ignored).

\(\lambda = 0.6\) fits with \(r = 0.95\). The bias toward sequences like 01010 falls out of the math — no free “switch preference” knob.

The takeaway for today’s class

The “mistake” in subjective randomness is not that people are bad at probability.
They’re computing \(\log \dfrac{P(x \mid \text{random})}{P(x \mid \text{regular})}\), a perfectly Bayesian quantity.
Today’s conjugate-update mechanics (“drop the normalizer, recognize the form”) are exactly the operation behind this model.

Open question: is human cognition Bayesian-by-default, or just Bayesian-when-tractable?

Close

Next week: Week 4 preview

Ira leads. Hierarchical Bayes.

Chibany’s bento rate isn’t the same across every semester, but semesters aren’t totally independent either. How do we share information without collapsing?

Read T3 Ch 5 before class.
