Data/tokenizer loaded in 0.1s
Time budget: 300s
Gradient accumulation steps: 2
Model compiled in 3.4s
step 00000 (0.0%) | loss: 9.011286 | lrm: 1.00 | dt: 5206ms | tok/sec: 12,588 | epoch: 1 | remaining: 300s
step 00001 (0.0%) | loss: 9.020128 | lrm: 1.00 | dt: 3524ms | tok/sec: 18,596 | epoch: 1 | remaining: 296s
step 00002 (1.2%) | loss: 8.699729 | lrm: 1.00 | dt: 3609ms | tok/sec: 18,160 | epoch: 1 | remaining: 293s
step 00003 (2.4%) | loss: 8.423139 | lrm: 1.00 | dt: 3494ms | tok/sec: 18,754 | epoch: 1 | remaining: 289s
step 00004 (3.5%) | loss: 8.220221 | lrm: 1.00 | dt: 3438ms | tok/sec: 19,064 | epoch: 1 | remaining: 286s
step 00005 (4.7%) | loss: 8.070054 | lrm: 1.00 | dt: 3334ms | tok/sec: 19,655 | epoch: 1 | remaining: 283s
step 00006 (5.8%) | loss: 7.972273 | lrm: 1.00 | dt: 3333ms | tok/sec: 19,660 | epoch: 1 | remaining: 279s
step 00007 (6.9%) | loss: 7.900758 | lrm: 1.00 | dt: 3359ms | tok/sec: 19,510 | epoch: 1 | remaining: 276s
step 00008 (8.0%) | loss: 7.833145 | lrm: 1.00 | dt: 3489ms | tok/sec: 18,783 | epoch: 1 | remaining: 272s
step 00009 (9.2%) | loss: 7.769759 | lrm: 1.00 | dt: 3417ms | tok/sec: 19,177 | epoch: 1 | remaining: 269s
step 00010 (10.3%) | loss: 7.711980 | lrm: 1.00 | dt: 3428ms | tok/sec: 19,116 | epoch: 1 | remaining: 266s
step 00011 (11.5%) | loss: 7.665034 | lrm: 1.00 | dt: 3445ms | tok/sec: 19,023 | epoch: 1 | remaining: 262s
step 00012 (12.6%) | loss: 7.628830 | lrm: 1.00 | dt: 3472ms | tok/sec: 18,875 | epoch: 1 | remaining: 259s
step 00013 (13.8%) | loss: 7.599584 | lrm: 1.00 | dt: 3626ms | tok/sec: 18,071 | epoch: 1 | remaining: 255s
step 00014 (15.0%) | loss: 7.564150 | lrm: 1.00 | dt: 3417ms | tok/sec: 19,180 | epoch: 1 | remaining: 252s
step 00015 (16.1%) | loss: 7.534053 | lrm: 1.00 | dt: 3482ms | tok/sec: 18,818 | epoch: 1 | remaining: 248s
step 00016 (17.3%) | loss: 7.512029 | lrm: 1.00 | dt: 3495ms | tok/sec: 18,753 | epoch: 1 | remaining: 245s
step 00017 (18.5%) | loss: 7.494451 | lrm: 1.00 | dt: 3468ms | tok/sec: 18,897 | epoch: 1 | remaining: 241s
step 00018 (19.6%) | loss: 7.477713 | lrm: 1.00 | dt: 3520ms | tok/sec: 18,620 | epoch: 1 | remaining: 238s
step 00019 (20.8%) | loss: 7.451545 | lrm: 1.00 | dt: 3468ms | tok/sec: 18,899 | epoch: 1 | remaining: 234s
step 00020 (21.9%) | loss: 7.430358 | lrm: 1.00 | dt: 3640ms | tok/sec: 18,004 | epoch: 1 | remaining: 231s
step 00021 (23.2%) | loss: 7.405802 | lrm: 1.00 | dt: 3523ms | tok/sec: 18,604 | epoch: 1 | remaining: 227s
step 00022 (24.3%) | loss: 7.390393 | lrm: 1.00 | dt: 3475ms | tok/sec: 18,857 | epoch: 1 | remaining: 224s
step 00023 (25.5%) | loss: 7.373971 | lrm: 1.00 | dt: 3468ms | tok/sec: 18,898 | epoch: 1 | remaining: 220s
step 00024 (26.6%) | loss: 7.357853 | lrm: 1.00 | dt: 3581ms | tok/sec: 18,298 | epoch: 1 | remaining: 216s
step 00025 (27.8%) | loss: 7.341590 | lrm: 1.00 | dt: 3512ms | tok/sec: 18,659 | epoch: 1 | remaining: 213s
step 00026 (29.0%) | loss: 7.321334 | lrm: 1.00 | dt: 3519ms | tok/sec: 18,624 | epoch: 1 | remaining: 209s
step 00027 (30.2%) | loss: 7.305466 | lrm: 1.00 | dt: 3480ms | tok/sec: 18,831 | epoch: 1 | remaining: 206s
step 00028 (31.3%) | loss: 7.284185 | lrm: 1.00 | dt: 3481ms | tok/sec: 18,829 | epoch: 1 | remaining: 203s
step 00029 (32.5%) | loss: 7.263010 | lrm: 1.00 | dt: 3548ms | tok/sec: 18,469 | epoch: 1 | remaining: 199s
step 00030 (33.7%) | loss: 7.248785 | lrm: 1.00 | dt: 3510ms | tok/sec: 18,673 | epoch: 1 | remaining: 195s
step 00031 (34.9%) | loss: 7.240683 | lrm: 1.00 | dt: 3499ms | tok/sec: 18,727 | epoch: 1 | remaining: 192s
step 00032 (36.0%) | loss: 7.230092 | lrm: 1.00 | dt: 3511ms | tok/sec: 18,668 | epoch: 1 | remaining: 188s
step 00033 (37.2%) | loss: 7.212867 | lrm: 1.00 | dt: 3687ms | tok/sec: 17,775 | epoch: 1 | remaining: 185s
step 00034 (38.4%) | loss: 7.199191 | lrm: 1.00 | dt: 3543ms | tok/sec: 18,498 | epoch: 1 | remaining: 181s
step 00035 (39.6%) | loss: 7.172128 | lrm: 1.00 | dt: 3586ms | tok/sec: 18,273 | epoch: 1 | remaining: 178s
step 00036 (40.8%) | loss: 7.150679 | lrm: 1.00 | dt: 3574ms | tok/sec: 18,336 | epoch: 1 | remaining: 174s
step 00037 (42.0%) | loss: 7.131968 | lrm: 1.00 | dt: 3547ms | tok/sec: 18,477 | epoch: 1 | remaining: 170s
step 00038 (43.2%) | loss: 7.111455 | lrm: 1.00 | dt: 3573ms | tok/sec: 18,344 | epoch: 1 | remaining: 167s
step 00039 (44.4%) | loss: 7.090769 | lrm: 1.00 | dt: 3795ms | tok/sec: 17,270 | epoch: 1 | remaining: 163s
step 00040 (45.6%) | loss: 7.075271 | lrm: 1.00 | dt: 3453ms | tok/sec: 18,978 | epoch: 1 | remaining: 160s
step 00041 (46.8%) | loss: 7.065164 | lrm: 1.00 | dt: 3333ms | tok/sec: 19,662 | epoch: 1 | remaining: 156s
step 00042 (47.9%) | loss: 7.047905 | lrm: 1.00 | dt: 3339ms | tok/sec: 19,627 | epoch: 1 | remaining: 153s
step 00043 (49.0%) | loss: 7.029499 | lrm: 1.00 | dt: 3355ms | tok/sec: 19,532 | epoch: 1 | remaining: 150s
step 00044 (50.1%) | loss: 7.010608 | lrm: 1.00 | dt: 3310ms | tok/sec: 19,798 | epoch: 1 | remaining: 146s
step 00045 (51.2%) | loss: 6.986023 | lrm: 0.98 | dt: 3308ms | tok/sec: 19,813 | epoch: 1 | remaining: 143s
step 00046 (52.3%) | loss: 6.959619 | lrm: 0.95 | dt: 3282ms | tok/sec: 19,967 | epoch: 1 | remaining: 140s
step 00047 (53.4%) | loss: 6.928925 | lrm: 0.93 | dt: 3547ms | tok/sec: 18,475 | epoch: 1 | remaining: 136s
step 00048 (54.6%) | loss: 6.897151 | lrm: 0.91 | dt: 3288ms | tok/sec: 19,929 | epoch: 1 | remaining: 133s
step 00049 (55.7%) | loss: 6.866053 | lrm: 0.89 | dt: 3398ms | tok/sec: 19,284 | epoch: 1 | remaining: 130s
step 00050 (56.8%) | loss: 6.839681 | lrm: 0.86 | dt: 3378ms | tok/sec: 19,398 | epoch: 1 | remaining: 126s
step 00051 (58.0%) | loss: 6.809443 | lrm: 0.84 | dt: 3221ms | tok/sec: 20,343 | epoch: 1 | remaining: 123s
step 00052 (59.0%) | loss: 6.787951 | lrm: 0.82 | dt: 3223ms | tok/sec: 20,336 | epoch: 1 | remaining: 120s
step 00053 (60.1%) | loss: 6.763101 | lrm: 0.80 | dt: 3238ms | tok/sec: 20,240 | epoch: 1 | remaining: 116s
step 00054 (61.2%) | loss: 6.745751 | lrm: 0.78 | dt: 4016ms | tok/sec: 16,320 | epoch: 1 | remaining: 112s
step 00055 (62.5%) | loss: 6.729373 | lrm: 0.75 | dt: 4091ms | tok/sec: 16,020 | epoch: 1 | remaining: 108s
step 00056 (63.9%) | loss: 6.718591 | lrm: 0.72 | dt: 3727ms | tok/sec: 17,583 | epoch: 1 | remaining: 105s
step 00057 (65.1%) | loss: 6.701484 | lrm: 0.70 | dt: 3587ms | tok/sec: 18,269 | epoch: 1 | remaining: 101s
step 00058 (66.3%) | loss: 6.677972 | lrm: 0.67 | dt: 3571ms | tok/sec: 18,352 | epoch: 1 | remaining: 97s
step 00059 (67.5%) | loss: 6.663926 | lrm: 0.65 | dt: 3539ms | tok/sec: 18,517 | epoch: 1 | remaining: 94s
step 00060 (68.7%) | loss: 6.641185 | lrm: 0.63 | dt: 3531ms | tok/sec: 18,562 | epoch: 1 | remaining: 90s
step 00061 (69.9%) | loss: 6.621978 | lrm: 0.60 | dt: 3488ms | tok/sec: 18,789 | epoch: 1 | remaining: 87s
step 00062 (71.0%) | loss: 6.601048 | lrm: 0.58 | dt: 3499ms | tok/sec: 18,731 | epoch: 1 | remaining: 83s
step 00063 (72.2%) | loss: 6.583518 | lrm: 0.56 | dt: 3470ms | tok/sec: 18,885 | epoch: 1 | remaining: 80s
step 00064 (73.4%) | loss: 6.560730 | lrm: 0.53 | dt: 3527ms | tok/sec: 18,579 | epoch: 1 | remaining: 76s
step 00065 (74.5%) | loss: 6.543296 | lrm: 0.51 | dt: 3502ms | tok/sec: 18,716 | epoch: 1 | remaining: 73s
step 00066 (75.7%) | loss: 6.524907 | lrm: 0.49 | dt: 3462ms | tok/sec: 18,927 | epoch: 1 | remaining: 69s
step 00067 (76.9%) | loss: 6.509836 | lrm: 0.46 | dt: 3592ms | tok/sec: 18,247 | epoch: 1 | remaining: 66s
step 00068 (78.0%) | loss: 6.497096 | lrm: 0.44 | dt: 3484ms | tok/sec: 18,812 | epoch: 1 | remaining: 62s
step 00069 (79.2%) | loss: 6.484238 | lrm: 0.42 | dt: 3512ms | tok/sec: 18,659 | epoch: 1 | remaining: 59s
step 00070 (80.4%) | loss: 6.469563 | lrm: 0.39 | dt: 3695ms | tok/sec: 17,734 | epoch: 1 | remaining: 55s
step 00071 (81.6%) | loss: 6.455906 | lrm: 0.37 | dt: 3482ms | tok/sec: 18,820 | epoch: 1 | remaining: 52s
step 00072 (82.8%) | loss: 6.439289 | lrm: 0.34 | dt: 3485ms | tok/sec: 18,805 | epoch: 1 | remaining: 48s
step 00073 (83.9%) | loss: 6.424992 | lrm: 0.32 | dt: 3523ms | tok/sec: 18,602 | epoch: 1 | remaining: 45s
step 00074 (85.1%) | loss: 6.417309 | lrm: 0.30 | dt: 3450ms | tok/sec: 18,997 | epoch: 1 | remaining: 41s
step 00075 (86.3%) | loss: 6.401845 | lrm: 0.27 | dt: 3452ms | tok/sec: 18,983 | epoch: 1 | remaining: 38s
step 00076 (87.4%) | loss: 6.384847 | lrm: 0.25 | dt: 3444ms | tok/sec: 19,031 | epoch: 1 | remaining: 34s
step 00077 (88.6%) | loss: 6.374213 | lrm: 0.23 | dt: 3457ms | tok/sec: 18,956 | epoch: 1 | remaining: 31s
step 00078 (89.7%) | loss: 6.364342 | lrm: 0.21 | dt: 3460ms | tok/sec: 18,941 | epoch: 1 | remaining: 27s
step 00079 (90.9%) | loss: 6.350873 | lrm: 0.18 | dt: 3438ms | tok/sec: 19,060 | epoch: 1 | remaining: 24s
step 00080 (92.0%) | loss: 6.345540 | lrm: 0.16 | dt: 3453ms | tok/sec: 18,979 | epoch: 1 | remaining: 21s
step 00081 (93.2%) | loss: 6.329259 | lrm: 0.14 | dt: 3615ms | tok/sec: 18,130 | epoch: 1 | remaining: 17s
step 00082 (94.4%) | loss: 6.321964 | lrm: 0.11 | dt: 3554ms | tok/sec: 18,437 | epoch: 1 | remaining: 13s
step 00083 (95.5%) | loss: 6.316718 | lrm: 0.09 | dt: 3660ms | tok/sec: 17,908 | epoch: 1 | remaining: 10s
step 00084 (96.8%) | loss: 6.307665 | lrm: 0.06 | dt: 3540ms | tok/sec: 18,512 | epoch: 1 | remaining: 6s
step 00085 (97.9%) | loss: 6.301593 | lrm: 0.04 | dt: 3446ms | tok/sec: 19,018 | epoch: 1 | remaining: 3s
step 00086 (99.1%) | loss: 6.287576 | lrm: 0.02 | dt: 3498ms | tok/sec: 18,736 | epoch: 1 | remaining: 0s
Training completed in 302.7s
Starting final eval...
Final eval batch size: 64
Final eval completed in 22.0s
---
val_bpb: 2.213268
training_seconds: 300.8
total_seconds: 328.2
peak_vram_mb: 11023.9
mfu_percent: 0.00
total_tokens_M: 5.7
num_steps: 87
num_params_M: 11.5
depth: 4