Week 4: Bayesian Generalization

Agenda

Welcome + Clusters walkthrough 0:00
The generalization problem 0:10
The Bayesian generalization framework + size principle 0:18
Rectangle game + number game 0:40
Break 1:08
Student presentation: Shohei 1:15
No Free Lunch 1:40
Hierarchical Bayes + close 1:50

Assignment 1: Clusters

Clusters: what and when

Assignment 1: Gaussians, Categories, and Clusters

Due Fri Jun 5, 2026, 8:00 PM — worth 7.5%
Problem 1 — Gaussian-Gaussian conjugate model → you can do this today (it’s Week 3 material)
Problem 2 — Gaussian mixture / categorization → Bayes’ rule over two category Gaussians
Problem 3 — clustering → leans on this week + T3 Ch 5

Clusters: which notebook

clusters.ipynb is the canonical stencil — the GenJAX path
clusters_python.ipynb and clusters_nosoln.Rmd — non-GenJAX paths, same math, same credit
Matlab available on request
“Open in Colab” links are live on the assignments page — one click, no local setup
Read the assignment PDF first — it has the problem statements and all the math

Assignments page: hml.chibatech.dev/assignments.html

The generalization problem

Chibany’s lunches

Chibany has had three tonkatsu lunches this week. Each weighed about 500 g.

Today a lunch arrives weighing 480 g. Is it tonkatsu?

What about 700 g? What about 350 g?

What just happened

Generalization — deciding when to extend a property from observed examples to a novel stimulus.

No two stimuli are ever identical → you must generalize to act at all
It is everywhere: word learning, categorization, object recognition, property induction, stereotypes
It is the core problem of inductive inference

Shepard’s universal law

Shepard (1987) — across species and domains, the probability of generalization decays exponentially with distance in psychological space (the perceived distance, after the mind represents the stimulus — not raw physical space).

One law, one equation: \[g(d) = e^{-d}\]

\(g\) — probability of generalization
\(d\) — distance in psychological space from the referent

The same curve holds across species, senses, and stimulus types — pitch, color, line length, faces. Shepard called it universal for that reason.

But the law is descriptive: it says generalization decays exponentially, not why. Today’s Bayesian framework will derive that exponential — it falls out of the posterior.
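
A minimal Python sketch of the law, applied to the lunch weights from the opening example. It assumes, purely for illustration, that psychological distance is the weight difference from the 500 g referent divided by a hypothetical scale of 100 g per unit. Neither the linear mapping nor the scale comes from Shepard; the names `psychological_distance`, `REFERENT_G`, and `SCALE_G` are made up for this example.

```python
import math

def shepard_generalization(d):
    """Shepard's law: probability of generalization at psychological distance d."""
    return math.exp(-d)

# Assumption (not from the lecture): psychological distance is linear in grams,
# with a hypothetical scale of 100 g per unit of distance.
REFERENT_G = 500.0
SCALE_G = 100.0

def psychological_distance(weight_g):
    """Hypothetical mapping from lunch weight to distance from the referent."""
    return abs(weight_g - REFERENT_G) / SCALE_G

for weight in (480, 700, 350):
    d = psychological_distance(weight)
    print(f"{weight} g: d = {d:.2f}, g(d) = {shepard_generalization(d):.3f}")
```

Under these assumptions the 480 g lunch gets the highest generalization probability and the 700 g lunch the lowest. Changing `SCALE_G` changes the numbers but not the ordering.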

Poll: Shepard’s universal law

Shepard’s universal law of generalization says generalization …

A. decays exponentially in psychological space
B. decays exponentially in stimulus space
C. grows exponentially in psychological space
D. grows exponentially in stimulus space

Poll: answer

A. Decays exponentially in psychological space.

“Stimulus space” is the trap — generalization isn’t governed by physical distance but by perceived distance. A model of generalization therefore needs a model of the psychological space. That is exactly what the Bayesian framework supplies next.

The Bayesian generalization framework

The idea: concepts as hypotheses

Instead of measuring distance directly, posit a space of candidate concepts and let Bayes do the generalizing.

A concept = a set of stimuli that share the property
Shepard’s term: a consequential subset — the subset of things the property “applies to”
A feature is a hypothesis too: “has stripes” picks out a set of stimuli — generalization is just asking which feature-sets the examples imply
Generalization becomes: which concepts are consistent with the examples, and do they contain the new stimulus?

Notation lock-in

\(h\) — a hypothesis: one candidate concept, i.e. a set of stimuli
\(\mathcal{H}\) — the hypothesis space: all candidate \(h\)
\(X = \{x_1, \dots, x_n\}\) — the observed examples of the concept
\(y\) — a novel stimulus we must judge
\(C\) — the (unknown) true concept

The three ingredients

Prior \(p(h)\) — domain knowledge: which concepts are natural before any data.

Likelihood \(p(X \mid h)\) — how probable the examples are if \(h\) is the true concept.

Posterior \(p(h \mid X)\) — belief in \(h\) after seeing the examples: \[p(h \mid X) \;\propto\; p(X \mid h)\; p(h)\]

The hypothesis space IS a prior

Choosing \(\mathcal{H}\) is already a strong prior.

Any concept not in \(\mathcal{H}\) has \(p(h) = 0\) — it can never be learned, no matter the data.

So: a learner’s inductive bias lives in which hypotheses it even considers.

Generalization = a posterior-weighted vote

Probability that the novel stimulus \(y\) has the property: \[p(y \in C \mid X) \;=\; \sum_{h \in \mathcal{H}} \mathbf{1}[\,y \in h\,]\; p(h \mid X)\]

Every hypothesis votes. Its vote is its posterior weight, and it votes “yes” only if it contains \(y\).

\(\mathbf{1}[\,y \in h\,]\): the indicator, \(1\) if \(y\) is in the set \(h\), else \(0\).

One datum → the posterior-weighted vote

The hypotheses. Observe one datum \(x\). Every interval containing \(x\) is a live hypothesis \(h\). Bar thickness \(= 1/|h|\) (the strong-sampling likelihood); flat prior, so posterior \(\propto\) likelihood.

The vote. For each candidate \(y\), sum the posterior of every \(h\) that contains it: \[p(y \in C \mid x) = \sum_{h} \mathbf{1}[\,y \in h\,]\, p(h \mid x)\]

Next three slides: walk \(y\) outward — \(x\), \(x{+}1\), \(x{+}2\).

Vote for \(y = x\)

The question. Does the property generalize to \(y = x\), the observed datum itself?

Who votes. Every hypothesis was built to contain \(x\), so all 7 of 7 vote “yes”.

The bar. The vote sums all the posterior weight — \(\sum_h p(h \mid x) = 1\). The tallest bar the gradient can have.

The baseline. This peak is what every other \(y\) is measured against — step away and the bar can only fall.

Vote for \(y = x + 1\)

The question. Step one unit out: does the property generalize to \(y = x + 1\)?

Who drops. The two smallest intervals no longer reach \(y\) — they gray out and vote “no”. 5 of 7 still vote “yes”.

The bar. Fewer hypotheses contribute → a shorter bar than at \(x\).

Why it falls fast. The dropouts are the smallest intervals — by the size principle the heaviest-weight votes. Losing those first is why the gradient decays steeply near \(x\).

Vote for \(y = x + 2\)

The question. Two steps out: does the property generalize to \(y = x + 2\)?

Who’s left. Only the 2 widest intervals still reach \(y\), and those carry the lowest posterior weight (thinnest bars). Only 2 of 7 vote “yes”.

The bar. Few votes, and only low-weight ones → the bar is short.

The payoff. Sweep \(y\) across every point and the bar heights trace an approximately exponential decay — Shepard’s universal law, derived from the model, not assumed.
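
The walk from \(y = x\) outward can be sketched in a few lines of Python. This is a minimal sketch, not the slide's exact figure: it assumes an integer number line and a hypothetical hypothesis space made of every integer interval that contains \(x\) and reaches at most 3 units away from it (16 intervals rather than the 7 drawn on the slides), with a flat prior and the strong-sampling likelihood \(1/|h|\).

```python
def interval_hypotheses(x, max_reach):
    """All integer intervals [lo, hi] that contain x and reach at most max_reach from it."""
    return [(lo, hi)
            for lo in range(x - max_reach, x + 1)
            for hi in range(x, x + max_reach + 1)]

def posterior_one_datum(hyps):
    """Flat prior + strong sampling with one datum: posterior proportional to 1/|h|."""
    weights = [1.0 / (hi - lo + 1) for lo, hi in hyps]
    total = sum(weights)
    return [w / total for w in weights]

def generalization(y, hyps, post):
    """Posterior-weighted vote: sum p(h | x) over the hypotheses that contain y."""
    return sum(p for (lo, hi), p in zip(hyps, post) if lo <= y <= hi)

x = 0  # hypothetical observed datum
hyps = interval_hypotheses(x, max_reach=3)
post = posterior_one_datum(hyps)
gradient = [generalization(x + k, hyps, post) for k in range(4)]
for k, g in enumerate(gradient):
    print(f"y = x + {k}: p(y in C | x) = {g:.3f}")
```

At \(y = x\) every hypothesis votes “yes” and the vote is 1. Each step outward removes the snuggest remaining intervals, so the vote can only fall.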

The size principle

Where did the examples come from?

The likelihood \(p(X \mid h)\) depends on how you assume the examples were generated. Two assumptions:

Weak sampling — examples generated some other way, then labeled by whether they fall in \(h\).

Strong sampling — each example drawn uniformly at random from within \(h\).

The two likelihoods

Weak sampling

\[p(X \mid h) = \begin{cases} 1 & \text{all } x_i \in h \\ 0 & \text{else}\end{cases}\]

Size-blind: a hypothesis either contains the data or it doesn’t.

Does not depend on \(|h|\).

Strong sampling

\[p(x \mid h) = \frac{1}{|h|} \;\;\Rightarrow\;\; p(X \mid h) = \left(\frac{1}{|h|}\right)^{\!n}\]

Each example drawn uniformly from inside \(h\).

Smaller \(|h|\) → higher likelihood, exponentially so in \(n\).

\(|h|\): the size of hypothesis \(h\) (how many stimuli it contains).
\(n\): the number of examples.

The size principle

Under strong sampling: smaller hypotheses get higher likelihood — and exponentially more so as the number of examples \(n\) grows.

\[p(X \mid h) = \left(\frac{1}{|h|}\right)^{n}\]

So a small \(|h|\) wins — and wins faster with more data, because the exponent \(n\) magnifies any size advantage.

The mechanism behind both games today.
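
The two likelihoods, and the way the exponent magnifies a size advantage, can be sketched as follows. This is a minimal sketch: hypotheses are represented as Python sets, the two concepts `small` and `large` and the example values are hypothetical, and the strong-sampling likelihood is set to 0 when an example falls outside the hypothesis (the usual convention, stated here as an assumption).

```python
def weak_likelihood(examples, hypothesis):
    """Weak sampling: 1 if every example is in the hypothesis, else 0."""
    return 1.0 if all(x in hypothesis for x in examples) else 0.0

def strong_likelihood(examples, hypothesis):
    """Strong sampling: (1/|h|)^n if every example is in the hypothesis, else 0."""
    if not all(x in hypothesis for x in examples):
        return 0.0
    return (1.0 / len(hypothesis)) ** len(examples)

small = set(range(1, 11))   # hypothetical concept with |h| = 10
large = set(range(1, 51))   # hypothetical concept with |h| = 50
data = [3, 7, 1, 9]         # hypothetical examples that both concepts contain

for n in (1, 2, 4):
    examples = data[:n]
    ratio = strong_likelihood(examples, small) / strong_likelihood(examples, large)
    print(f"n = {n}: weak = {weak_likelihood(examples, small)} vs "
          f"{weak_likelihood(examples, large)}, strong ratio = {ratio:.0f}")
```

The weak-sampling likelihoods stay tied at 1 for every \(n\), while the strong-sampling ratio is \(5^n\): the size advantage compounds with each example.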

Why: the suspicious coincidence

You see the examples \(\{60, 80, 10, 30\}\).

If the concept is “multiples of 10” — unremarkable, that’s just what its members look like.
If the concept is “even numbers” — it’s a suspicious coincidence that not one of the four was 2, 4, 6, 8, …

Strong sampling penalizes the big hypothesis for failing to predict the tight clustering you actually saw.
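
To put a number on the coincidence, here is a worked step that assumes the numbers range over 1 to 100 and that the four examples are drawn uniformly and independently from the concept. “Even numbers” then has 50 members, 10 of which are multiples of 10, so

\[p(\text{all four examples are multiples of 10} \mid \text{even numbers}) = \left(\frac{10}{50}\right)^{4} = \frac{1}{625} = 0.0016\]

Under “multiples of 10” the same event has probability 1.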

Poll: strong sampling

What is strong sampling?

A. Each stimulus is generated uniformly at random from the true hypothesis
B. A stimulus has probability one given the true hypothesis
C. Larger hypotheses are given smaller prior probability
D. Smaller hypotheses are given smaller prior probability

Poll: answer

A. Uniformly at random from the true hypothesis.

B describes weak sampling — membership gives likelihood \(1\), regardless of size. C / D describe a prior over hypotheses; the size principle is about the likelihood, not the prior. Strong sampling = “the examples were drawn from inside the concept” — and that is what makes size matter.

Rectangle game: continuous concepts

Gnarbles

A gnarble is a rectangle whose dimensions fall in some interval.

Tenenbaum’s cover story: a 2-D concept is an axis-aligned rectangle in a feature space — e.g. healthy levels of insulin × cholesterol.

You observe a few examples drawn from inside the rectangle. Which other points are gnarbles?

Start in 1-D

Concept = an interval \([\ell, u]\) on one dimension.

\(\mathcal{H}\) = all intervals
Strong sampling → likelihood \(\propto \left(\dfrac{1}{u - \ell}\right)^{n}\)
Hypothesis size = the length \(u - \ell\)

Every interval that contains the data is a hypothesis; the snug ones get the most posterior weight.

The generalization gradient

Plot \(p(y \in C \mid X)\) as \(y\) moves along the dimension:

High inside the range of the examples
Decays as \(y\) moves outside that range

The decay is approximately exponential: Shepard’s universal law, now derived rather than assumed.

The 1-D gradient: built from votes

The same posterior-weighted vote as the number-line construction — now with several examples \(X\).

Who votes. Every interval that contains all of \(X\) is a live hypothesis; a snugger interval is thicker (more posterior).

The gradient. Sum the posterior of every interval that contains \(y\): \[p(y \in C \mid X) = \sum_h \mathbf{1}[\,y \in h\,]\,p(h \mid X)\]

Flat across the data, decaying outside — the gradient is the posterior-weighted vote, one \(y\) at a time.

More examples → tighter generalization

One example → broad, diffuse generalization (many interval sizes survive)
Many examples → tight generalization, hugging the data range

Why: the size principle. With large \(n\), big intervals lose likelihood exponentially fast — only the small, snug intervals keep posterior mass.
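
The tightening can be checked numerically. The sketch below is an illustration under stated assumptions, not the model from the paper: it discretizes the dimension to the integers 0 to 100, takes every integer interval that contains all the examples as a hypothesis, and uses a flat prior with the strong-sampling likelihood \((1/|h|)^n\). The example values are hypothetical; both data sets span the same range (45 to 55) so that only \(n\) differs.

```python
def gradient_1d(examples, y_values, lo_bound=0, hi_bound=100):
    """p(y in C | X) over integer intervals [lo, hi], flat prior, strong sampling."""
    n = len(examples)
    x_min, x_max = min(examples), max(examples)
    # Only intervals that contain every example have nonzero likelihood.
    hyps = [(lo, hi)
            for lo in range(lo_bound, x_min + 1)
            for hi in range(x_max, hi_bound + 1)]
    weights = [(1.0 / (hi - lo + 1)) ** n for lo, hi in hyps]
    total = sum(weights)
    return [sum(w for (lo, hi), w in zip(hyps, weights) if lo <= y <= hi) / total
            for y in y_values]

ys = [50, 60, 70]
few = gradient_1d([45, 55], ys)                           # n = 2
many = gradient_1d([45, 47, 49, 50, 51, 53, 55], ys)      # n = 7, same range

for y, g_few, g_many in zip(ys, few, many):
    print(f"y = {y}: n=2 -> {g_few:.3f}, n=7 -> {g_many:.3f}")
```

Inside the data range both gradients sit at 1. Outside it, the seven-example gradient drops much faster than the two-example one, because the larger exponent strips weight from the wide intervals.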

Into 2-D: the rectangle game

Same machinery, one dimension up: a concept is an axis-aligned rectangle.

Observe \(n\) dots inside the true rectangle
Every rectangle that encloses all \(n\) is a hypothesis
Smaller rectangle → bigger likelihood (brighter / thicker)

\(r\) — the range the data spans.

\(d\) — how far a rectangle extends past that range.

The rectangle experiment

Tenenbaum (1999) ran this as a behavioral experiment.

On each trial, subjects saw \(n\) dots drawn from “an arbitrary rectangle of healthy insulin / cholesterol levels”
They drew the rectangle they thought the dots came from
\(n\) varied from 2 to 50; the data range \(r\) varied across trials
The measure: \(d\) — how far past the data range \(r\) the drawn rectangle extends

The result: \(d\) vs. \(r\), by \(n\)

Solid = human, dashed = model. One color per \(n\).

The human pattern. Fewer examples → generalize further (\(n=2\) on top); \(d\) rises with \(r\) but saturates.

The model — likelihood only. The size principle with a flat (uninformative) prior. It captures the \(n\)-ordering…

…but the curves run straight — the model over-extends, badly for small \(n\) and large \(r\). It misses the human saturation.

One fix: an exponential prior

A flat prior lets the rectangle run straight — it over-extends. The fix: a prior that makes large rectangles less likely.

The exponential distribution — our first new distribution today. For a size \(s \ge 0\): \[p(s) = \lambda\, e^{-\lambda s}\]

Always decreasing — small \(s\) favored
One parameter \(\lambda > 0\); the mean is \(1/\lambda\)
Larger \(\lambda\) → faster decay → stronger pull toward small rectangles
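
The properties in the list above can be verified with a short sketch. It uses only the density \(\lambda e^{-\lambda s}\) from this slide; the rate values 0.2 and 1.0, the integration cutoff, and the helper name `numerical_mean` are arbitrary choices made for illustration.

```python
import math

def exponential_pdf(s, lam):
    """Density lam * exp(-lam * s) for s >= 0, and 0 for s < 0."""
    if s < 0:
        return 0.0
    return lam * math.exp(-lam * s)

def numerical_mean(lam, upper=200.0, steps=200_000):
    """Midpoint-rule estimate of E[s] on [0, upper]; should be close to 1/lam."""
    ds = upper / steps
    return sum((i + 0.5) * ds * exponential_pdf((i + 0.5) * ds, lam) * ds
               for i in range(steps))

for lam in (0.2, 1.0):
    # How much less probable a size-10 rectangle is than a size-1 rectangle.
    penalty = exponential_pdf(10.0, lam) / exponential_pdf(1.0, lam)
    print(f"lambda = {lam}: mean ~ {numerical_mean(lam):.3f}, "
          f"p(10)/p(1) = {penalty:.2e}")
```

The estimated mean tracks \(1/\lambda\), and the larger rate penalizes the size-10 rectangle far more heavily relative to the size-1 rectangle.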

This is the first time the class meets the exponential distribution — define it properly: density λ·exp(-λs) on s ≥ 0, one rate parameter λ, mean 1/λ, monotonically decreasing. The plot shows the shape. Then the point: an exponential prior over rectangle size penalizes big rectangles, and size principle (likelihood) + this prior together fit the human data — neither alone. Justified by the task framing (monitor edges, prior experience with rectangles). ~2 min.

ANTICIPATED STUDENT QUESTION — how is the scale parameter determined? The exponential prior has one free parameter: equivalently the rate λ, or the mean scale 1/λ (Tenenbaum writes it as σ, the prior’s expected rectangle size). It is NOT learned from the n dots of a single trial — those go into the likelihood. It is a property of the prior, fixed across trials. In the 1999 paper σ is fit once to the average human data (he reports σ = 5 units in the 24-unit window gives an excellent fit) — i.e. it is calibrated to the population, then held constant. Conceptually it encodes prior experience with the scale of “healthy-level” rectangles. The principled move (Week 4’s hierarchical-Bayes block) is to put a hyperprior on σ and infer it too — then the scale is learned from many concepts rather than hand-set. Have this ready but do not lecture it unprompted.

With the prior: the fit

Same axes, same data — now the model carries the exponential prior over rectangle size.

The dashed curves bend. The straight, over-extending lines collapse onto the human curves — \(d\) now saturates with \(r\), just as people do.

Likelihood (size principle) and prior together — neither alone — give the fit.

Same lesson for every paper: lay the model on top of the human data and read off where it bends.

The rectangle game in one line

The continuous concept learner is just the framework equation with \(\mathcal{H}\) = intervals / rectangles.

\[p(y \in C \mid X) = \sum_{h} \mathbf{1}[\,y \in h\,]\; p(h \mid X)\]

Nothing new — only the choice of \(\mathcal{H}\).

Number game: discrete concepts

The number game

A simple task (Tenenbaum, 1999):

I have a concept — a set of numbers between 1 and 100
You see one or more “yes” examples
You judge: is some other number a “yes”?

Watch what your own judgments do as examples come in.

What people actually do

Human generalization judgments (Tenenbaum’s \(N = 20\) subjects):

Examples \(\{60\}\) → diffuse similarity: many numbers get moderate “yes”
Examples \(\{60, 80, 10, 30\}\) → sharp “multiples of 10”
Examples \(\{60, 52, 57, 55\}\) → sharp “numbers near 60”

One example → graded. Four examples → a crisp rule.

Two things to explain

Generalization can look similarity-based (graded) or rule-based (all-or-none) — and people switch between them.
People learn a concept from just a few examples.

One model — the Bayesian framework — produces both, with no extra machinery.

The discrete hypothesis space

\(\mathcal{H}\) for the number game has two kinds of hypothesis:

Mathematical properties (~24) — even, odd, primes, squares, cubes, multiples of \(k\), powers of \(k\).

Magnitude intervals — “numbers in \([a, b]\)”: e.g. 10–20, 30–45.

The prior \(p(h)\) weights these families against each other.
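
A toy version of this hypothesis space is enough to reproduce the graded-versus-sharp pattern. The sketch below is a simplified stand-in, not Tenenbaum's actual model: the particular properties, the interval widths (10 and 20, starting every 5), and the flat prior over all hypotheses are assumptions made to keep the example short, whereas the real model weights the two families with a non-flat prior.

```python
def build_hypotheses(max_n=100):
    """A small, hypothetical hypothesis space: a few math properties plus intervals."""
    numbers = range(1, max_n + 1)
    hyps = {
        "even": {n for n in numbers if n % 2 == 0},
        "odd": {n for n in numbers if n % 2 == 1},
        "squares": {n * n for n in range(1, 11)},
        "powers of 2": {2 ** k for k in range(1, 7)},
    }
    for k in range(3, 11):
        hyps[f"multiples of {k}"] = {n for n in numbers if n % k == 0}
    # Magnitude intervals of width 10 and 20 (an arbitrary illustrative subset).
    for width in (10, 20):
        for start in range(1, max_n - width + 2, 5):
            hyps[f"interval {start}-{start + width - 1}"] = set(range(start, start + width))
    return hyps

def generalization(y, examples, hyps):
    """p(y in C | X) with a flat prior and strong-sampling likelihood (1/|h|)^n."""
    n = len(examples)
    weights = {name: (1.0 / len(h)) ** n
               for name, h in hyps.items() if all(x in h for x in examples)}
    total = sum(weights.values())
    return sum(w for name, w in weights.items() if y in hyps[name]) / total

hyps = build_hypotheses()
for examples in ([60], [60, 80, 10, 30], [60, 52, 57, 55]):
    probs = {y: round(generalization(y, examples, hyps), 3) for y in (58, 62, 70, 80)}
    print(examples, probs)
```

With the single example, nearby numbers and other multiples all get moderate probability. With four examples the same code commits almost entirely to “multiples of 10” or to a tight interval around 60, depending on the data.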

Size principle, by the numbers

Two candidate concepts for numbers in 1–100:

Multiples of 2 — 50 numbers → \(p(x \mid h) = \dfrac{1}{50} = 2\%\) each
Multiples of 10 — 10 numbers → \(p(x \mid h) = \dfrac{1}{10} = 10\%\) each

One example: \(x = 60\)

\[p(60 \mid \text{mult-2}) = \frac{1}{50} \qquad p(60 \mid \text{mult-10}) = \frac{1}{10}\]

Multiples of 10 is already 5× more likely, but only 5×. With one example, many hypotheses stay in contention → graded generalization.

Four examples: \(\{10, 30, 60, 80\}\)

\[p(X \mid \text{mult-2}) = \left(\tfrac{1}{50}\right)^{4} \approx 1.6 \times 10^{-7}\]

\[p(X \mid \text{mult-10}) = \left(\tfrac{1}{10}\right)^{4} = 10^{-4}\]

Now multiples of 10 is ~625× more likely. The 5× edge got raised to the 4th power → a crisp rule.

From likelihood to posterior

Likelihoods aren’t beliefs yet. Take a two-hypothesis model — \(\mathcal{H} = \{\,\)multiples of 10, even numbers\(\,\}\), both containing every example — with a flat prior, and turn the crank:

\[p(h \mid X) = \frac{p(X \mid h)\,p(h)}{\sum_{h'} p(X \mid h')\,p(h')}\]

Next: the posterior under strong vs. weak sampling, for \(X = \{60\}\) then \(X = \{60, 80, 10, 30\}\).
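
The crank-turning for this two-hypothesis model can be sketched directly from Bayes’ rule. The function name `posterior` and the `sampling` switch are illustrative choices; the hypotheses, the flat prior of 0.5 each, and the two data sets are the ones named on this slide.

```python
def posterior(examples, hyps, prior, sampling):
    """Bayes' rule over named hypotheses (sets); sampling is 'strong' or 'weak'."""
    n = len(examples)
    unnorm = {}
    for name, h in hyps.items():
        if not all(x in h for x in examples):
            like = 0.0                      # a datum outside h rules it out
        elif sampling == "strong":
            like = (1.0 / len(h)) ** n      # size principle
        else:
            like = 1.0                      # weak sampling ignores |h|
        unnorm[name] = like * prior[name]
    total = sum(unnorm.values())
    return {name: v / total for name, v in unnorm.items()}

hyps = {
    "multiples of 10": set(range(10, 101, 10)),   # 10 numbers in 1-100
    "even numbers": set(range(2, 101, 2)),        # 50 numbers in 1-100
}
prior = {name: 0.5 for name in hyps}

for examples in ([60], [60, 80, 10, 30]):
    for sampling in ("strong", "weak"):
        post = posterior(examples, hyps, prior, sampling)
        print(examples, sampling, {k: round(v, 3) for k, v in post.items()})
```

Strong sampling moves the posterior on “multiples of 10” from about 0.83 with one example to about 0.998 with four, while weak sampling leaves both hypotheses at 0.5 in either case.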

Strong sampling: one example

The data: \(X = \{60\}\) — a single example. Strong sampling: \[p(h \mid X) \propto \left(\tfrac{1}{|h|}\right)^{1}\!\! \cdot 0.5\]

mult-10: likelihood \(\tfrac{1}{10}\) · even: \(\tfrac{1}{50}\)
Normalize the two → 0.83 vs. 0.17

Read it: one example already tilts belief toward the smaller hypothesis — but only gently. A 5:1 likelihood ratio is not decisive, so plenty of posterior mass still sits on “even numbers”.

This is the graded regime — belief shifts, but no rule yet.

Strong sampling: four examples

\(X = \{60, 80, 10, 30\}\), strong sampling: \[p(h \mid X) \propto \left(\tfrac{1}{|h|}\right)^{4}\!\! \cdot 0.5\]

The \(5{:}1\) ratio is now \(5^4 = 625{:}1\)
Normalize → 0.998 vs. 0.002

All four examples are even, yet every one is also a multiple of 10: that is the suspicious coincidence. Strong sampling all but rules “even” out.

Weak sampling: eliminate, but don’t rank

Weak-sampling likelihood is \(1\) for any \(h\) that contains the data, \(0\) otherwise — no \(|h|\) dependence.

What it can do: a datum outside \(h\) kills it — likelihood \(0\). Weak sampling does move the posterior, by ruling hypotheses out.

What it can’t do: among the hypotheses that all still contain the data, it has no preference — every survivor keeps likelihood \(1\), so the posterior over them stays at the prior.

Here both hypotheses contain every example — nothing is ruled out: \[p(h \mid X) = \frac{1 \cdot 0.5}{1\cdot 0.5 + 1\cdot 0.5} = 0.5\]

So weak sampling can’t see the suspicious coincidence — it can’t tell “multiples of 10” from “even numbers”.
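
Both halves of this slide, eliminating and failing to rank, show up in a small sketch. It adds two things that are not in the lecture, purely for illustration: a third hypothesis, “odd numbers”, so that there is something to eliminate, and a hypothetical fifth datum, 62. The sketch assumes at least one hypothesis contains the data, so the normalizer is never zero.

```python
def weak_posterior(examples, hyps, prior):
    """Weak sampling: likelihood 1 if h contains every example, else 0."""
    unnorm = {name: (prior[name] if all(x in h for x in examples) else 0.0)
              for name, h in hyps.items()}
    total = sum(unnorm.values())  # assumes at least one hypothesis survives
    return {name: v / total for name, v in unnorm.items()}

hyps = {
    "multiples of 10": set(range(10, 101, 10)),
    "even numbers": set(range(2, 101, 2)),
    "odd numbers": set(range(1, 101, 2)),   # hypothetical third hypothesis
}
prior = {name: 1.0 / len(hyps) for name in hyps}

before = weak_posterior([60, 80, 10, 30], hyps, prior)
after = weak_posterior([60, 80, 10, 30, 62], hyps, prior)  # 62 is a hypothetical extra datum
print("four examples:", before)
print("plus 62:      ", after)
```

The four examples eliminate “odd numbers” but leave the two survivors tied at the renormalized prior. Only a datum outside “multiples of 10”, such as 62, separates them.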

Weak sampling, made precise. Correct the common over-statement: weak sampling is NOT inert. A datum that falls outside a hypothesis gives it likelihood 0 — the hypothesis is ruled out, and the posterior moves. What weak sampling cannot do is rank the hypotheses that all still contain the data: every survivor keeps likelihood 1, so the posterior over the survivors is just the renormalized prior. In THIS example both hypotheses contain every example, so nothing is eliminated and the posterior sits at 0.5 / 0.5 (and the 0.5 is an artifact of two hypotheses + a flat prior — five survivors would give 0.2 each). The teaching point: weak sampling can eliminate but cannot prefer the smaller hypothesis, so it cannot produce the suspicious-coincidence effect — that needs strong sampling. ~2 min.

The number game in one line

Same equation as the rectangle game — \(\mathcal{H}\) is now discrete.

\[p(y \in C \mid X) = \sum_{h} \mathbf{1}[\,y \in h\,]\; p(h \mid X)\]

Graded vs. rule-like generalization both fall out of the posterior — no extra mechanism. The size principle’s exponent does the switching.

Break

Bridge: Shohei’s paper

Up next: Shohei presents Tenenbaum & Xu (2000), Word learning as Bayesian inference.

A child hears “this is a dax” pointing at three Dalmatians. Is a poodle a dax? A cat?

This is the number game, with \(\mathcal{H}\) = candidate word meanings (subordinate / basic-level / superordinate). The size principle explains why three subordinate examples → a subordinate meaning.

Watch for the size principle doing the work.

Student presentation: Shohei

No Free Lunch

The No Free Lunch theorem

Wolpert (1996) — averaged over all possible worlds, no learning algorithm beats any other. Here is why, concretely.

The task. You see the bits \(\,0, 1\,\) and must predict \(x_3\).

Your rule. Say it predicts \(x_3 = 0\). In the world \(\,0,1,\mathbf{0}\,\) it is right.

The mirror world. But the world \(\,0,1,\mathbf{1}\,\) is just as possible — same data \(0,1\), opposite answer. There your rule is wrong.

They pair off. Every world where your rule scores a point has a mirror world — identical data, flipped continuation — where it loses one.

Sum the pair. Right + wrong \(= 1\) hit out of \(2\). Average over all worlds: every algorithm, every rule, scores exactly \(1/2\).

No rule can win the average, because the data \(0,1\) says nothing about \(x_3\) until you assume some worlds are more likely than others.
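
The pairing argument can be checked by brute force. The sketch below enumerates every 3-bit world and every deterministic rule that maps the two observed bits to a prediction, which is a slight generalization of the slide's single prefix \(0,1\). The non-flat prior at the end, which favors worlds where \(x_3\) repeats \(x_1\), is a made-up example of an inductive bias, not something from the lecture.

```python
from itertools import product

worlds = list(product((0, 1), repeat=3))     # all 8 possible bit sequences
prefixes = list(product((0, 1), repeat=2))   # the 4 possible observed prefixes

def all_rules():
    """Every deterministic rule: a table mapping each 2-bit prefix to a predicted bit."""
    for outputs in product((0, 1), repeat=len(prefixes)):
        yield dict(zip(prefixes, outputs))

def expected_accuracy(rule, world_prob):
    """Probability, under world_prob, that the rule predicts x3 correctly."""
    return sum(p for w, p in world_prob.items() if rule[w[:2]] == w[2])

flat = {w: 1.0 / len(worlds) for w in worlds}
flat_scores = [expected_accuracy(rule, flat) for rule in all_rules()]
print("flat prior, distinct scores:", set(flat_scores))

# Hypothetical non-flat prior: worlds where x3 == x1 are four times as likely.
biased = {w: (0.2 if w[2] == w[0] else 0.05) for w in worlds}
copy_first = {prefix: prefix[0] for prefix in prefixes}
print("biased prior, copy-first rule:", expected_accuracy(copy_first, biased))
```

Under the flat prior all 16 rules score exactly 1/2. Once the prior is non-flat, a rule that matches the bias can score above 1/2, which is the escape hatch the slide describes.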

ペアになる。 あなたの規則が得点する世界には必ず鏡の世界がある — 同じデータ、反転した継続 — そこでは失点する。

ペアを合計。 正解 + 不正解 \(= 2\) 回中 \(1\) 回的中。すべての世界で平均すると、どのアルゴリズムも、どの規則も、ちょうど \(1/2\)。

どの規則も平均で勝てない。データ \(0,1\) は、ある世界が他より起こりやすいと仮定するまで \(x_3\) について何も語らないから。

Walk the concrete instance, slowly — this is the intuition pump for the whole NFL → prior payoff.

The setup: data is the two bits 0,1; predict the third bit. There are exactly two continuations, 0,1,0 and 0,1,1, and with a flat prior over worlds they are equally likely. Whatever your rule outputs, it is right in one of those two worlds and wrong in the other. Worlds pair off: for every world your rule gets right, the mirror world (same observed data, opposite continuation) cancels it. So averaged over all worlds the score is exactly 1/2 — and this is true of EVERY rule, including the cleverest one, including “always predict the majority bit.” That is the No Free Lunch theorem made concrete.

The escape hatch (next slide): the only way out is to drop the flat prior — to assert that 0,1,0 is more likely than 0,1,1 (or vice versa). That assertion IS the inductive bias. NFL doesn’t say learning is impossible; it says learning is impossible WITHOUT a prior. ~2.5 min.
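The pairing argument can be verified by brute force. The sketch below enumerates every binary world of length 3, weights them equally (the flat prior), and scores a few deterministic rules. The rule names and the helper `average_accuracy` are illustrative choices, not part of the lecture materials. The slide fixes the observed data at 0,1, while this version averages over all four possible prefixes, which is the "all worlds" average.

```python
from itertools import product

def average_accuracy(rule, n_observed=2):
    """Average accuracy of `rule` over all binary worlds of length n_observed + 1,
    with every world weighted equally (the flat prior over worlds)."""
    worlds = list(product([0, 1], repeat=n_observed + 1))
    hits = sum(rule(world[:-1]) == world[-1] for world in worlds)
    return hits / len(worlds)

# Hypothetical deterministic rules: each maps the observed bits to a prediction.
RULES = {
    "always 0": lambda seen: 0,
    "repeat last bit": lambda seen: seen[-1],
    "majority bit (ties -> 1)": lambda seen: int(2 * sum(seen) >= len(seen)),
}

for name, rule in RULES.items():
    print(name, average_accuracy(rule))  # each rule scores 0.5
```

Whatever a rule predicts for a given prefix, exactly one of the two continuations of that prefix agrees with it, so the count of hits is always half the number of worlds.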

What NFL means for usNFLが意味すること

A learner only works because the distribution over worlds is constrained — i.e. because it has a non-flat prior.

Generalization is impossible without inductive bias
Recall: the hypothesis space \(\mathcal{H}\) and the prior \(p(h)\) from Block 3
They are not bookkeeping — they are the entire reason learning is possible

学習者が機能するのは、世界に対する分布が制約されているからにすぎない — すなわち、平坦でない事前を持つから。

帰納的バイアスなしに一般化は不可能
思い出そう: ブロック3の仮説空間 \(\mathcal{H}\) と事前 \(p(h)\)
それらは事務的なものではない — 学習が可能であるまさにその理由

Poll — No Free Lunchポール — ノーフリーランチ

What is the No Free Lunch theorem (for prediction)?

ノーフリーランチ定理（予測について）とは何か?

A. When all hypotheses are possible, there’s nothing you can learn to predict
B. Learning one hypothesis hurts learning other hypotheses
C. If someone gives you lunch for free, they’ll expect something back
D. Generalizing to new stimuli can hurt a learner

A. すべての仮説が可能なとき、予測のために学べることは何もない
B. 1つの仮説を学ぶと、他の仮説の学習が損なわれる
C. 誰かが無料で昼食をくれたら、後で見返りを期待される
D. 新しい刺激への一般化は学習者を害しうる

Poll — answerポール — 答え

A. When all hypotheses are possible, there’s nothing you can learn to predict.

A. すべての仮説が可能なとき、予測のために学べることは何もない。

A flat prior over every possible world means data carries no leverage — every continuation stays 50/50. To learn, you must commit to some worlds being more likely than others. That commitment is your prior — and it is why your prior matters.

あらゆる可能な世界に平坦な事前を置くと、データはてこにならず、どの継続も50/50のまま。学ぶには、ある世界が他より起こりやすいとコミットしなければならない。そのコミットメントが事前であり、事前が重要である理由。
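To see the prior doing the work, the sketch below computes \(P(x_3 = 1 \mid 0, 1)\) twice: once under the flat prior, and once under a non-flat prior. The `alternation_prior` and its `weight` parameter are hypothetical, invented here only to stand in for "some worlds are more likely than others".

```python
from itertools import product

def predictive_next_bit(observed, prior):
    """P(next bit = 1 | observed) under an unnormalised prior over worlds
    of length len(observed) + 1."""
    worlds = product([0, 1], repeat=len(observed) + 1)
    consistent = [w for w in worlds if w[:-1] == tuple(observed)]
    total = sum(prior(w) for w in consistent)
    return sum(prior(w) for w in consistent if w[-1] == 1) / total

def flat_prior(world):
    return 1.0  # every world equally likely

def alternation_prior(world, weight=4.0):
    # Hypothetical inductive bias: each change between neighbouring bits
    # multiplies the world's weight, so alternating worlds are favoured.
    flips = sum(a != b for a, b in zip(world, world[1:]))
    return weight ** flips

print(predictive_next_bit([0, 1], flat_prior))         # 0.5
print(predictive_next_bit([0, 1], alternation_prior))  # 0.2
```

Under the flat prior the data leave the prediction at 1/2. Under the alternation prior the world 0,1,0 has weight 16 and 0,1,1 has weight 4, so the same data now favor \(x_3 = 0\).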

Hierarchical Bayes階層ベイズ

Back to Chibany’s classチバニーのクラスに戻る

Every student in Chibany’s class has their own tonkatsu-vs-hamburger rate \(\theta_i\).

The rates aren’t identical — but they aren’t unrelated either. They’re all Chibany’s customers.

Should learning Aoi’s rate tell us anything about Ben’s?

チバニーのクラスの各学生は、自分自身のとんかつ対ハンバーグの率 \(\theta_i\) を持つ。

率は同一ではない — だが無関係でもない。全員チバニーの客なのだ。

アオイの率を学ぶことは、ベンの率について何か教えてくれるべきか?

Priors over priors事前の上の事前

Instead of fixing the prior’s parameters \((a, b)\), treat them as unknown and give them a prior of their own. Each student’s rate is then drawn from the shared Beta: \[\theta_i \sim \text{Beta}(a, b)\]

The two-level model: \((a, b) \;\longrightarrow\; \theta_i \;\longrightarrow\;\) observed bentos.

\((a, b)\) is the shared structure; each \(\theta_i\) is a student.

事前のパラメータの上に、さらに事前を置く — そして \((a, b)\) 自体を未知として扱う: \[\theta_i \sim \text{Beta}(a, b)\]

2階層モデル: \((a, b) \;\longrightarrow\; \theta_i \;\longrightarrow\;\) 観測された弁当。

\((a, b)\) が共有された構造; 各 \(\theta_i\) が1人の学生。

Why it mattersなぜ重要か

A hierarchical model lets a learner learn the prior from data — exactly the inductive bias No Free Lunch said you cannot do without.

\((a, b)\) is learned from all the students together
It is how “overhypotheses” get acquired — the shape bias, object vs. substance kinds (Kemp, Perfors & Tenenbaum, 2007)

階層モデルは、学習者がデータから事前を学ぶことを可能にする — まさにノーフリーランチが「なしでは済まない」と言った帰納的バイアス。

\((a, b)\) はすべての学生から一緒に学ばれる
これが「上位仮説（overhypotheses）」 — 形状バイアス、物体対物質の種類 — の獲得のされ方（Kemp, Perfors & Tenenbaum, 2007）

Three ways to pool the dataデータをまとめる3つの方法

Six students, six bento records. How do you estimate their rates?

Approach	Model	Trade-off
Complete pooling	one shared \(\theta\) for everyone	ignores that students differ
No pooling	a separate \(\theta_i\), all unrelated	ignores that they’re all Chibany’s customers
Hierarchical	\(\theta_i \sim \text{Beta}(a,b)\)	the middle path — borrows strength

Hierarchical Bayes = partial pooling.

6人の学生、6つの弁当記録。彼らの率をどう推定する?

方法	モデル	トレードオフ
完全プーリング	全員で1つの共有 \(\theta\)	学生が異なることを無視
プーリングなし	別々の \(\theta_i\)、互いに無関係	全員チバニーの客であることを無視
階層	\(\theta_i \sim \text{Beta}(a,b)\)	中間の道 — 強さを借りる

階層ベイズ = 部分プーリング（partial pooling）。
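The three approaches in the table can be compared side by side. In this sketch the six bento records are made-up numbers, and the partial-pooling estimate holds \((a, b)\) fixed at a hypothetical value. That is a simplification: in the full hierarchical model \((a, b)\) is inferred from the data rather than chosen by hand.

```python
# Hypothetical records: (tonkatsu count k_i, total bentos n_i) for six students.
RECORDS = [(1, 2), (9, 10), (3, 20), (12, 30), (0, 3), (25, 40)]

def complete_pooling(records):
    """One shared rate for everyone: total tonkatsu / total bentos."""
    return sum(k for k, _ in records) / sum(n for _, n in records)

def no_pooling(records):
    """A separate raw fraction per student, ignoring everyone else."""
    return [k / n for k, n in records]

def partial_pooling(records, a, b):
    """Posterior mean of each theta_i under a Beta(a, b) prior with (a, b) held fixed."""
    return [(a + k) / (a + b + n) for k, n in records]

A, B = 2.0, 2.0  # assumed values, standing in for a learned (a, b)
pooled = complete_pooling(RECORDS)
for (k, n), raw, part in zip(RECORDS, no_pooling(RECORDS), partial_pooling(RECORDS, A, B)):
    print(f"k={k:2d} n={n:2d}  pooled={pooled:.2f}  raw={raw:.2f}  partial={part:.2f}")
```

The partial-pooling column always lies between the student's raw fraction and the prior mean \(a/(a+b)\), which is the "middle path" in the table.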

The two-level model2階層モデル

Built up, top to bottom:

\[(a, b) \;\sim\; \text{prior}\]

上から下へ組み上げる:

\[(a, b) \;\sim\; \text{事前}\]

\[\theta_i \mid a, b \;\sim\; \text{Beta}(a, b)\]

\[k_i \mid \theta_i \;\sim\; \text{Binomial}(n_i, \theta_i)\]

\(k_i\): tonkatsu count for student \(i\). \(n_i\): that student’s total bentos.

\(k_i\)：学生 \(i\) のとんかつの回数。 \(n_i\)：その学生の弁当の総数。
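Read top to bottom, the three lines are a recipe for generating data. The sketch below forward-samples one class using only the Python standard library. The slides leave the top-level prior unspecified, so the choice here (each of \(a\) and \(b\) is 0.5 plus an Exponential(1) draw) is purely an assumption for illustration, as is the function name `sample_class`.

```python
import random

def sample_class(n_bentos, seed=0):
    """Forward-sample the two-level model for a list of per-student bento totals."""
    rng = random.Random(seed)
    # Level 1: (a, b) ~ prior. ASSUMED prior, not specified in the slides.
    a = 0.5 + rng.expovariate(1.0)
    b = 0.5 + rng.expovariate(1.0)
    students = []
    for n in n_bentos:
        # Level 2: theta_i | a, b ~ Beta(a, b)
        theta = rng.betavariate(a, b)
        # Level 3: k_i | theta_i ~ Binomial(n_i, theta_i), as a sum of Bernoulli draws
        k = sum(rng.random() < theta for _ in range(n))
        students.append((theta, k, n))
    return (a, b), students

(a, b), students = sample_class([5, 10, 20, 40], seed=1)
print(f"a={a:.2f} b={b:.2f}")
for theta, k, n in students:
    print(f"theta={theta:.2f}  k={k}  n={n}")
```

Inference runs this recipe in reverse: given only the \((k_i, n_i)\) pairs, recover what \((a, b)\) and each \(\theta_i\) must have been.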

Inference — no closed form推論 — 閉じた形はない

We want the posterior over everything: \[p\big(a, b, \{\theta_i\} \mid \text{data}\big)\]

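Unlike the Beta-Binomial of Week 3, this joint posterior has no clean closed form: given \((a, b)\) each \(\theta_i\) is still Beta-distributed, but the posterior over \((a, b)\) itself is not a standard distribution.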

This is where sampling comes in — and where GenJAX earns its place (the hierarchical bento_day() exercise builds exactly this model).

私たちはすべてに対する事後確率を求めたい: \[p\big(a, b, \{\theta_i\} \mid \text{データ}\big)\]

第3週のベータ-二項とは違い、これにはきれいな閉じた形がない。

ここでサンプリングの出番 — そして GenJAX が本領を発揮する（階層版 bento_day() 演習は、まさにこのモデルを組み立てる）。
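For a model this small, a rough numerical answer is possible even without a sampler. The sketch below is not the GenJAX `bento_day()` exercise and does not use sampling: it swaps in a plain grid approximation over \((a, b)\), with each \(\theta_i\) integrated out analytically via the Beta-Binomial marginal. The data, the grid range, and the flat prior over grid points are all assumptions made for illustration.

```python
import math

def log_beta(x, y):
    """log of the Beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

def log_marginal(records, a, b):
    """log p(data | a, b) with every theta_i integrated out (Beta-Binomial).
    Binomial coefficients are dropped because they do not depend on (a, b)."""
    return sum(log_beta(a + k, b + n - k) - log_beta(a, b) for k, n in records)

def grid_posterior(records, grid):
    """Normalised posterior over (a, b) on a grid, with a flat prior over grid points."""
    points = [(a, b) for a in grid for b in grid]
    logs = [log_marginal(records, a, b) for a, b in points]
    top = max(logs)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(value - top) for value in logs]
    z = sum(weights)
    return [(a, b, w / z) for (a, b), w in zip(points, weights)]

def posterior_means(records, post):
    """E[theta_i | data]: the conditional mean (a + k)/(a + b + n) averaged over (a, b)."""
    return [sum(w * (a + k) / (a + b + n) for a, b, w in post) for k, n in records]

RECORDS = [(1, 2), (9, 10), (3, 20), (12, 30), (0, 3), (25, 40)]  # hypothetical data
GRID = [0.5 * i for i in range(1, 41)]  # a, b in {0.5, 1.0, ..., 20.0} (assumed range)

post = grid_posterior(RECORDS, GRID)
for (k, n), mean in zip(RECORDS, posterior_means(RECORDS, post)):
    print(f"raw {k / n:.2f} -> posterior mean {mean:.2f}")
```

A grid works here only because there are two hyperparameters. With more levels or more parameters the grid grows exponentially, which is the practical reason to reach for sampling.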

Shrinkage — borrowing strength縮約 — 強さを借りる

For each student, compare:

their raw tonkatsu fraction \(k_i / n_i\)
their posterior mean \(\hat\theta_i\)

The posterior means are pulled toward the group mean — a lot when a student has little data, barely at all when they have a lot.

This is hierarchical Bayes automatically borrowing strength across students.

各学生について、比較する:

生のとんかつ割合 \(k_i / n_i\)
事後平均 \(\hat\theta_i\)

事後平均はグループ平均の方へ引き寄せられる。データが少ない学生は大きく、データが多い学生はほとんど動かない。

これが階層ベイズが学生間で自動的に強さを借りること。
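One way to see why the pull depends on how much data a student has: hold \((a, b)\) fixed. This is a simplification, since in the full model they are inferred and the result is averaged over their posterior. Beta-Binomial conjugacy then writes the posterior mean as a weighted average of the raw fraction and the prior mean \(a/(a+b)\), which plays the role of the group mean. The weight \(w_i\) is notation introduced here, not in the slides.

```latex
\[
\hat\theta_i
\;=\; \mathbb{E}[\theta_i \mid k_i, a, b]
\;=\; \frac{a + k_i}{a + b + n_i}
\;=\; w_i \,\frac{k_i}{n_i} \;+\; (1 - w_i)\,\frac{a}{a + b},
\qquad
w_i = \frac{n_i}{a + b + n_i}.
\]
```

For example, with \(a + b = 4\), a student with \(n_i = 2\) bentos has \(w_i = 1/3\), so the estimate sits mostly at the group mean. A student with \(n_i = 200\) has \(w_i \approx 0.98\), so the estimate barely moves from the raw fraction.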

Overhypotheses — learning the bias上位仮説 — バイアスを学ぶ

With the hierarchy in hand, overhypotheses become precise.

The shape bias; the object-vs-substance distinction — these are second-level hypotheses, learned as a distribution over kinds of concept (Kemp, Perfors & Tenenbaum, 2007).

No Free Lunch said a learner needs inductive bias. The hierarchy is where a learner acquires it — instead of being born with it.

階層を手にすると、上位仮説（overhypotheses）が精密になる。

形状バイアスや物体対物質の区別は第2階層の仮説であり、概念の種類に対する分布として学ばれる（Kemp, Perfors & Tenenbaum, 2007）。

ノーフリーランチは、学習者には帰納的バイアスが必要だと言った。階層は、学習者がそれを獲得する場所 — 生まれつき持つのではなく。

Closeまとめ

Before next week来週までに

Read T3 Ch 5 — Mixture models before Week 5.

It formalizes the same partial-pooling idea you just saw, and feeds directly into Problem 3 of the Clusters assignment.

Clusters is due Fri Jun 5, 8:00 PM.

第5週までに T3 第5章 — 混合モデル を読むこと。

今見たのと同じ部分プーリングの考え方を形式化し、クラスタ課題の問題3に直結する。

クラスタ課題の提出期限は 6月5日（金）20:00。

Next week — Week 5来週 — 第5週

Bayes nets + causal Bayes nets.

From “concepts as sets” to structured probabilistic models — graphs that encode how variables depend on each other, and what happens when you intervene.

Check the readings page for the Week 5 paper + presenter.

ベイズネット + 因果ベイズネット。

「集合としての概念」から、構造化された確率モデルへ — 変数が互いにどう依存するか、そして介入したとき何が起こるかを符号化するグラフ。

第5週の論文と発表者は、リーディングのページで確認すること。
