Zero-tsuku 2 (15) - らんだむな記憶

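I have finished a first read-through of the book, but I only skimmed it to follow the overall flow without thinking deeply, so I want to look at a few points that caught my attention in a bit more detail.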

This time: ch05.

$ cat ptb.train.txt
aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter 
 pierre <unk> N years old will join the board as a nonexecutive director nov. N 
 mr. <unk> is chairman of <unk> n.v. the dutch publishing group ...

I have no idea what language the opening line is supposed to be.

from dataset import ptb

corpus, word_to_id, id_to_word = ptb.load_data('train')
print(' '.join([id_to_word[id] for id in corpus]))

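The code above yields the same content. In fact, I looked at this output first, and only checked the text file afterwards because the output looked like nonsense.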
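To make the relationship between the text file and the `corpus`, `word_to_id`, `id_to_word` triple more concrete, here is a minimal sketch of how such a mapping could be built. The helper name `build_corpus` and the choice to turn each newline into an `<eos>` token are assumptions for illustration; the actual `ptb.load_data` may differ in its details.

```python
def build_corpus(text):
    """Hypothetical helper: map whitespace-separated words to integer IDs."""
    # Assumption: each newline becomes an explicit <eos> token.
    words = text.replace('\n', ' <eos> ').split()
    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)  # IDs are assigned in order of first appearance
            word_to_id[word] = new_id
            id_to_word[new_id] = word
    corpus = [word_to_id[w] for w in words]
    return corpus, word_to_id, id_to_word


sample = "aer banknote berlitz\npierre <unk> N years old\n"
corpus, word_to_id, id_to_word = build_corpus(sample)

# Decoding the IDs back to words reproduces the text, with <eos> at line ends.
decoded = ' '.join(id_to_word[i] for i in corpus)
print(decoded)
```

Under these assumptions, decoding the ID sequence gives back the original words in order, which is why the printed corpus and the text file look alike.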

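In terms of how get_batch is implemented: if x has a total length of 50 and a batch of 5 word sequences is cut out, the sequences start at indices 0, 10, 20, 30, and 40. For example, if the corpus is the following 50-word text, the words at those indices (Alice, her, do, her, conversations) become the first words of the rows of the first batch.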

Alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it and what is the use of a
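The start positions can be checked with a short sketch. It assumes the offsets are evenly spaced, that is `data_size // batch_size` apart, which matches the indices 0, 10, 20, 30, 40 mentioned above; the variable names are chosen for illustration.

```python
text = (
    "Alice was beginning to get very tired of sitting by her sister on the "
    "bank and of having nothing to do once or twice she had peeped into the "
    "book her sister was reading but it had no pictures or conversations in "
    "it and what is the use of a"
)
words = text.split()

data_size = len(words)            # total length of the sequence
batch_size = 5                    # number of rows in one batch
jump = data_size // batch_size    # distance between row start positions
offsets = [i * jump for i in range(batch_size)]

# The word at each offset is the first word of that row in the first batch.
first_words = [words[o] for o in offsets]
print(offsets)
print(first_words)
```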

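time_size corresponds to the length of Truncated BPTT. If we set it to 4, the first batch consists of the following word sequences:

Alice was beginning to
her sister on the
do once or twice
her sister was reading
conversations in it and

The next batch picks up where each of these left off:

get very tired of
bank and of having
she had peeped into
but it had no
what is the use

If the end of x is reached and there is no next word, the code appears to wrap around to the beginning and keep cutting out words (which may read a little unnaturally).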
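The batch cutting and the wrap-around at the end of x can be sketched as follows. This is a simplified illustration, not the book's code: the class name `BatchCutter` is hypothetical, it works on words rather than IDs for readability, it omits the target sequence, and it assumes the wrap-around is done with a modulo on the index.

```python
class BatchCutter:
    """Hypothetical stand-in for the state that get_batch keeps between calls."""

    def __init__(self, x):
        self.x = x          # the whole sequence (words here, IDs in practice)
        self.time_idx = 0   # how far every row has advanced so far

    def get_batch(self, batch_size, time_size):
        data_size = len(self.x)
        jump = data_size // batch_size
        offsets = [i * jump for i in range(batch_size)]  # row start positions
        batch = []
        for offset in offsets:
            start = offset + self.time_idx
            # Assumption: the modulo sends positions past the end back to the start.
            row = [self.x[(start + t) % data_size] for t in range(time_size)]
            batch.append(row)
        self.time_idx += time_size  # the next call continues where this one stopped
        return batch


text = (
    "Alice was beginning to get very tired of sitting by her sister on the "
    "bank and of having nothing to do once or twice she had peeped into the "
    "book her sister was reading but it had no pictures or conversations in "
    "it and what is the use of a"
)
cutter = BatchCutter(text.split())
first = cutter.get_batch(batch_size=5, time_size=4)
second = cutter.get_batch(batch_size=5, time_size=4)
third = cutter.get_batch(batch_size=5, time_size=4)

for row in third:
    print(' '.join(row))
```

In this sketch the last row of the third batch starts at index 48 and runs off the end of the 50 words, so it comes out as "of a Alice was", which is the slightly unnatural join mentioned above.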