runs.MD


Run 0 - baseline next-token

step 0: train loss 10.9890, val loss 10.9904
iter 0: loss 10.9887, time 34661.01ms
iter 10: loss 10.4005, time 246.79ms
iter 20: loss 9.7791, time 247.28ms
iter 30: loss 9.5240, time 247.06ms
iter 40: loss 9.3031, time 278.72ms
iter 50: loss 9.1578, time 245.99ms
iter 60: loss 8.7869, time 248.42ms
iter 70: loss 8.8794, time 257.60ms
iter 80: loss 8.6168, time 265.11ms
iter 90: loss 8.4142, time 248.87ms
iter 100: loss 8.0754, time 357.93ms
iter 110: loss 7.9212, time 247.20ms
iter 120: loss 7.7288, time 247.92ms
iter 130: loss 7.5622, time 250.02ms
iter 140: loss 7.3395, time 249.00ms
iter 150: loss 7.3733, time 248.21ms
iter 160: loss 7.1828, time 276.31ms
iter 170: loss 6.9898, time 250.92ms
iter 180: loss 7.0126, time 248.37ms
iter 190: loss 7.0673, time 249.54ms
iter 200: loss 6.6969, time 357.24ms
iter 210: loss 6.7007, time 249.08ms
iter 220: loss 6.8312, time 248.52ms
iter 230: loss 6.5788, time 249.17ms
iter 240: loss 6.6638, time 249.28ms
iter 250: loss 6.6676, time 305.09ms
iter 260: loss 6.3102, time 248.84ms
iter 270: loss 6.5648, time 249.04ms
iter 280: loss 6.5139, time 248.65ms
iter 290: loss 6.3673, time 250.27ms
iter 300: loss 6.4838, time 402.03ms
iter 310: loss 6.3994, time 251.62ms
iter 320: loss 6.4925, time 249.45ms
iter 330: loss 6.4402, time 246.81ms
iter 340: loss 6.3253, time 254.92ms
iter 350: loss 6.2690, time 247.50ms
iter 360: loss 6.2100, time 322.94ms
iter 370: loss 6.4289, time 250.93ms
iter 380: loss 6.4137, time 255.44ms
iter 390: loss 6.2782, time 252.83ms
iter 400: loss 5.9514, time 392.19ms
iter 410: loss 6.0770, time 272.11ms
iter 420: loss 6.2226, time 290.20ms
iter 430: loss 6.3764, time 341.80ms
iter 440: loss 6.2910, time 281.35ms
iter 450: loss 6.1476, time 272.77ms
iter 460: loss 6.0961, time 274.95ms
iter 470: loss 6.1179, time 272.92ms
iter 480: loss 6.2284, time 313.07ms
iter 490: loss 6.1643, time 271.87ms
iter 500: loss 6.0181, time 500.63ms
iter 510: loss 6.2621, time 271.73ms
iter 520: loss 5.9727, time 274.61ms
iter 530: loss 6.0925, time 247.15ms
iter 540: loss 5.8810, time 246.93ms
iter 550: loss 6.0139, time 247.16ms
iter 560: loss 5.9516, time 247.88ms
iter 570: loss 5.8696, time 247.90ms
iter 580: loss 6.2730, time 246.96ms
iter 590: loss 6.0671, time 247.23ms
iter 600: loss 5.9611, time 354.36ms
iter 610: loss 5.9033, time 247.39ms
iter 620: loss 5.9834, time 247.55ms
iter 630: loss 5.8416, time 247.36ms
iter 640: loss 5.8474, time 247.50ms
iter 650: loss 5.7414, time 247.67ms
iter 660: loss 5.6912, time 248.78ms
iter 670: loss 5.6427, time 271.97ms
iter 680: loss 5.8749, time 367.46ms
iter 690: loss 5.7253, time 272.73ms
iter 700: loss 5.6741, time 484.66ms
iter 710: loss 5.6360, time 271.26ms
iter 720: loss 5.7388, time 270.85ms
iter 730: loss 5.7777, time 270.42ms
iter 740: loss 5.5708, time 274.08ms
iter 750: loss 5.8506, time 374.06ms
iter 760: loss 5.5813, time 276.44ms
iter 770: loss 5.6497, time 274.68ms
iter 780: loss 5.6110, time 270.92ms
iter 790: loss 5.7139, time 272.29ms
iter 800: loss 5.8326, time 487.05ms
iter 810: loss 5.6761, time 272.23ms
iter 820: loss 5.6035, time 377.89ms
iter 830: loss 5.5972, time 275.36ms
iter 840: loss 5.6055, time 269.20ms
iter 850: loss 5.5467, time 272.70ms
iter 860: loss 5.7642, time 273.11ms
iter 870: loss 5.5938, time 364.68ms
iter 880: loss 5.5071, time 271.46ms
iter 890: loss 5.4761, time 272.19ms
iter 900: loss 5.4601, time 391.58ms
iter 910: loss 5.4526, time 272.81ms
iter 920: loss 5.5604, time 376.89ms
iter 930: loss 5.3856, time 247.01ms
iter 940: loss 5.5549, time 247.07ms
iter 950: loss 5.3871, time 246.64ms
iter 960: loss 5.3593, time 271.18ms
iter 970: loss 5.6842, time 247.51ms
iter 980: loss 5.5166, time 277.32ms
iter 990: loss 5.5334, time 285.71ms
step 1000: train loss 5.4129, val loss 5.4066
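The log entries above follow two regular patterns (`iter N: loss X, time Yms` and `step N: train loss X, val loss Y`), so they can be pulled into structured records for plotting or comparison. The following is a minimal sketch; the function name `parse_log` and the regular expressions are illustrative assumptions, not part of the training script that produced these logs.

```python
import re

# Hypothetical patterns matching the two kinds of log entries shown above.
ITER_RE = re.compile(r"iter (\d+): loss ([\d.]+), time ([\d.]+)ms")
EVAL_RE = re.compile(r"step (\d+): train loss ([\d.]+), val loss ([\d.]+)")


def parse_log(text):
    """Return (iters, evals) parsed from raw log text.

    iters: list of (iteration, loss, time_ms)
    evals: list of (step, train_loss, val_loss)
    Works whether entries are newline- or space-separated.
    """
    iters = [(int(i), float(loss), float(ms)) for i, loss, ms in ITER_RE.findall(text)]
    evals = [(int(s), float(tr), float(va)) for s, tr, va in EVAL_RE.findall(text)]
    return iters, evals


# A short excerpt of the Run 0 log, copied from above.
sample = (
    "step 0: train loss 10.9890, val loss 10.9904 "
    "iter 0: loss 10.9887, time 34661.01ms "
    "iter 10: loss 10.4005, time 246.79ms "
    "step 1000: train loss 5.4129, val loss 5.4066"
)
iters, evals = parse_log(sample)
```

Because the patterns use `findall` over the whole text, the same function handles the flattened single-line form as well as one entry per line.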

Run 1 - predict second-next token

step 0: train loss 10.9975, val loss 10.9998
iter 0: loss 10.9931, time 34873.11ms
iter 10: loss 10.4293, time 248.09ms
iter 20: loss 9.8310, time 247.44ms
iter 30: loss 9.6324, time 246.40ms
iter 40: loss 9.4881, time 246.59ms
iter 50: loss 9.3857, time 246.77ms
iter 60: loss 9.1080, time 247.25ms
iter 70: loss 9.1631, time 248.80ms
iter 80: loss 8.9193, time 247.12ms
iter 90: loss 8.7087, time 245.91ms
iter 100: loss 8.4375, time 357.29ms
iter 110: loss 8.2523, time 247.99ms
iter 120: loss 8.0453, time 247.59ms
iter 130: loss 7.8801, time 251.51ms
iter 140: loss 7.6632, time 246.37ms
iter 150: loss 7.7151, time 247.53ms
iter 160: loss 7.5573, time 245.70ms
iter 170: loss 7.3852, time 260.50ms
iter 180: loss 7.4523, time 246.81ms
iter 190: loss 7.5213, time 246.36ms
iter 200: loss 7.2428, time 501.23ms
iter 210: loss 7.2376, time 253.33ms
iter 220: loss 7.3865, time 256.47ms
iter 230: loss 7.2044, time 251.68ms
iter 240: loss 7.3044, time 248.67ms
iter 250: loss 7.2840, time 249.86ms
iter 260: loss 7.0394, time 301.18ms
iter 270: loss 7.2253, time 253.68ms
iter 280: loss 7.2091, time 246.70ms
iter 290: loss 7.1290, time 246.62ms
iter 300: loss 7.2332, time 361.93ms
iter 310: loss 7.1704, time 250.47ms
iter 320: loss 7.2204, time 249.03ms
iter 330: loss 7.1990, time 312.94ms
iter 340: loss 7.1055, time 251.12ms
iter 350: loss 7.0401, time 250.64ms
iter 360: loss 7.0515, time 252.31ms
iter 370: loss 7.2171, time 249.61ms
iter 380: loss 7.1739, time 249.15ms
iter 390: loss 7.0981, time 247.48ms
iter 400: loss 6.8173, time 409.07ms
iter 410: loss 6.9547, time 248.96ms
iter 420: loss 7.0677, time 250.80ms
iter 430: loss 7.1677, time 246.82ms
iter 440: loss 7.0908, time 246.43ms
iter 450: loss 6.9865, time 248.74ms
iter 460: loss 6.9538, time 250.55ms
iter 470: loss 6.9559, time 249.41ms
iter 480: loss 7.0736, time 248.73ms
iter 490: loss 7.0242, time 246.24ms
iter 500: loss 6.8602, time 356.75ms
iter 510: loss 7.1096, time 246.99ms
iter 520: loss 6.8615, time 249.09ms
iter 530: loss 6.9252, time 299.09ms
iter 540: loss 6.7552, time 248.77ms
iter 550: loss 6.8735, time 249.09ms
iter 560: loss 6.8601, time 250.21ms
iter 570: loss 6.7675, time 248.49ms
iter 580: loss 7.0854, time 248.17ms
iter 590: loss 6.9424, time 249.01ms
iter 600: loss 6.8708, time 363.17ms
iter 610: loss 6.8040, time 248.17ms
iter 620: loss 6.8815, time 248.03ms
iter 630: loss 6.7418, time 246.68ms
iter 640: loss 6.7845, time 246.44ms
iter 650: loss 6.6588, time 249.42ms
iter 660: loss 6.6129, time 248.22ms
iter 670: loss 6.6387, time 248.48ms
iter 680: loss 6.7973, time 248.84ms
iter 690: loss 6.6688, time 250.45ms
iter 700: loss 6.6359, time 358.13ms
iter 710: loss 6.6163, time 248.25ms
iter 720: loss 6.7099, time 248.25ms
iter 730: loss 6.7519, time 317.31ms
iter 740: loss 6.5570, time 250.32ms
iter 750: loss 6.7808, time 250.85ms
iter 760: loss 6.5925, time 246.77ms
iter 770: loss 6.6528, time 264.86ms
iter 780: loss 6.6301, time 260.80ms
iter 790: loss 6.7204, time 300.05ms
iter 800: loss 6.8279, time 361.06ms
iter 810: loss 6.7059, time 284.70ms
iter 820: loss 6.6234, time 394.47ms
iter 830: loss 6.6054, time 249.64ms
iter 840: loss 6.6389, time 247.04ms
iter 850: loss 6.5963, time 246.53ms
iter 860: loss 6.7992, time 246.53ms
iter 870: loss 6.6609, time 247.37ms
iter 880: loss 6.5680, time 249.63ms
iter 890: loss 6.5586, time 247.61ms
iter 900: loss 6.5042, time 356.19ms
iter 910: loss 6.5028, time 247.62ms
iter 920: loss 6.6318, time 246.61ms
iter 930: loss 6.5346, time 246.73ms
iter 940: loss 6.6517, time 247.17ms
iter 950: loss 6.5495, time 247.77ms
iter 960: loss 6.4836, time 247.94ms
iter 970: loss 6.7566, time 247.82ms
iter 980: loss 6.5640, time 246.79ms
iter 990: loss 6.6469, time 247.03ms
step 1000: train loss 6.5478, val loss 6.5453
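The run headings describe training against a target that sits further ahead than the usual next token. The training code is not shown here, so the following is only a sketch of one plausible way to build such targets: the target window is the input window shifted by `offset` positions. The function `make_example` and its `offset` parameter are hypothetical, and reading "second-next token" as `offset=2` and "5th token" as `offset=5` is an assumption.

```python
def make_example(tokens, start, block_size, offset=1):
    """Build one (x, y) training pair from a token sequence.

    x is a window of block_size tokens beginning at `start`.
    y is the same window shifted right by `offset`, so y[i] is the
    token `offset` positions after x[i]. offset=1 is ordinary
    next-token prediction (assumed to correspond to the baseline run).
    """
    if offset < 1:
        raise ValueError("offset must be at least 1")
    x = tokens[start : start + block_size]
    y = tokens[start + offset : start + offset + block_size]
    if len(x) < block_size or len(y) < block_size:
        raise ValueError("not enough tokens for this start/offset")
    return x, y


# Toy sequence where each token id equals its position.
tokens = list(range(10))
x1, y1 = make_example(tokens, 0, 4, offset=1)  # next token
x2, y2 = make_example(tokens, 0, 4, offset=2)  # second-next token (assumed)
x5, y5 = make_example(tokens, 0, 4, offset=5)  # 5th token ahead (assumed)
```

With a larger offset the model has less directly relevant context for each target, which is consistent with the higher losses in the later runs.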

Run 2 - predict 5th token

iter 600: loss 7.3925, time 417.89ms
iter 610: loss 7.2789, time 295.33ms
iter 620: loss 7.3833, time 297.89ms
iter 630: loss 7.2497, time 301.87ms
iter 640: loss 7.2910, time 387.50ms
iter 650: loss 7.2023, time 294.53ms
iter 660: loss 7.1731, time 295.44ms
iter 670: loss 7.1974, time 301.17ms
iter 680: loss 7.3208, time 288.67ms
iter 690: loss 7.2210, time 416.83ms
iter 700: loss 7.2181, time 540.25ms
iter 710: loss 7.1763, time 298.60ms
iter 720: loss 7.2652, time 294.39ms
iter 730: loss 7.3362, time 298.83ms
iter 740: loss 7.1410, time 396.14ms
iter 750: loss 7.3354, time 375.67ms
iter 760: loss 7.1934, time 295.08ms
iter 770: loss 7.2405, time 294.40ms
iter 780: loss 7.2562, time 303.40ms
iter 790: loss 7.3366, time 295.88ms
iter 800: loss 7.4129, time 513.73ms
iter 810: loss 7.3115, time 295.65ms
iter 820: loss 7.2731, time 296.35ms
iter 830: loss 7.2388, time 296.12ms
iter 840: loss 7.2655, time 298.19ms
iter 850: loss 7.1931, time 387.63ms
iter 860: loss 7.4127, time 349.97ms
iter 870: loss 7.2589, time 300.50ms
iter 880: loss 7.1842, time 298.38ms
iter 890: loss 7.2101, time 295.32ms
iter 900: loss 7.1455, time 523.19ms
iter 910: loss 7.1161, time 339.52ms
iter 920: loss 7.2902, time 295.93ms
iter 930: loss 7.1797, time 297.58ms
iter 940: loss 7.2723, time 294.01ms
iter 950: loss 7.2280, time 396.76ms
iter 960: loss 7.1732, time 293.49ms
iter 970: loss 7.3873, time 294.45ms
iter 980: loss 7.2396, time 297.43ms
iter 990: loss 7.2889, time 276.55ms
step 1000: train loss 7.2186, val loss 7.2190
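To compare the three runs on a more intuitive scale, the final validation losses at step 1000 can be converted to perplexity. This sketch assumes the reported loss is a mean cross-entropy in nats, which the logs do not state explicitly; `perplexity` and `summarize` are illustrative helper names.

```python
import math

# Final validation losses at step 1000, copied from the three run logs.
final_val_loss = {
    "Run 0 (next token)": 5.4066,
    "Run 1 (second-next token)": 6.5453,
    "Run 2 (5th token)": 7.2190,
}


def perplexity(loss_nats):
    """Convert a mean cross-entropy loss (assumed to be in nats) to perplexity."""
    return math.exp(loss_nats)


def summarize(losses):
    """Map each run name to its perplexity."""
    return {name: perplexity(loss) for name, loss in losses.items()}


for name, ppl in summarize(final_val_loss).items():
    print(f"{name}: val loss {final_val_loss[name]:.4f}, perplexity ~{ppl:.0f}")
```

Because perplexity is exponential in the loss, the gap of roughly 1.1 nats between Run 0 and Run 1 corresponds to about a threefold increase in perplexity.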