<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>fritzm.github.io</title>
<meta name="description" content="">
<meta name="author" content="Fritz Mueller">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
<script src="https://fritzm.github.io/theme/html5.js"></script>
<![endif]-->
<!-- Le styles -->
<link href="https://fritzm.github.io/theme/bootstrap.min.css" rel="stylesheet">
<link href="https://fritzm.github.io/theme/bootstrap.min.responsive.css" rel="stylesheet">
<link href="https://fritzm.github.io/theme/local.css" rel="stylesheet">
<link href="https://fritzm.github.io/theme/pygments.css" rel="stylesheet">
<!-- Photoswipe -->
<link rel="stylesheet" href="https://fritzm.github.io/theme/photoswipe.css">
<link rel="stylesheet" href="https://fritzm.github.io/theme/default-skin/default-skin.css">
<script src="https://fritzm.github.io/theme/photoswipe.min.js"></script>
<script src="https://fritzm.github.io/theme/photoswipe-ui-default.min.js"></script>
<script src="https://fritzm.github.io/galleries.js"></script>
<script type="text/javascript">
var pswipe = function(gname, index) {
var pswpElement = document.querySelectorAll('.pswp')[0];
var items = galleries[gname];
var options = { index: index };
var gallery = new PhotoSwipe(pswpElement, PhotoSwipeUI_Default, items, options);
gallery.init();
};
</script>
<!-- So Firefox can bookmark->"abo this site" -->
<link href="https://fritzm.github.io/feeds/all.rss.xml" rel="alternate" title="fritzm.github.io" type="application/rss+xml">
</head>
<body>
<div class="navbar">
<div class="navbar-inner">
<div class="container">
<a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</a>
<a class="brand" href="https://fritzm.github.io">fritzm.github.io</a>
<div class="nav-collapse">
<ul class="nav">
</ul>
</div>
</div>
</div>
</div>
<div class="container">
<div class="content">
<div class="row">
<div class="span9">
<div class='article'>
<div class="content-title">
<a href="https://fritzm.github.io/dl11-bodge.html"><h1>PDP-11/45: Reversing a vintage DL11 hack</h1></a>
Fri 27 November 2020
by <a class="url fn" href="https://fritzm.github.io/author/fritz-mueller.html">Fritz Mueller</a>
</div>
<div><p>I recently had need to assess and repair several DL11 serial interfaces in my stock of spares. One of these
had had some sort of end-user hack applied; in the course of putting the board back to factory condition, I
did some analysis of the hack and its intended purpose, documented here.</p>
<p><img src='/images/pdp11/dl11-user-hack_thumbnail_tall.jpg' title='DL11 with end-user hack' onclick='pswipe("pdp11",85);'/>
<img src='/images/pdp11/dl11-hack-front_thumbnail_tall.png' title='DL11 user hack front' onclick='pswipe("pdp11",86);'/>
<img src='/images/pdp11/dl11-hack-back_thumbnail_tall.png' title='DL11 user hack back' onclick='pswipe("pdp11",87);'/></p>
<p>Easy enough to beep this out and reverse to a schematic:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" src="/images/pdp11/dl11-hack-schem.png"
title="Schematic of DL11 hack"/></p>
<p>So, the hack appears to dynamically alter the CSR address and interrupt vector of the card, choosing between
two hard-wired presets, based on whether P1A/P1B are connected together or not.</p>
<p>The CSR jumpers on a stock DL11 operate with pull-ups upstream of the address decode logic, so these can be
directly driven by the hack so long as the jumpers for the bits-to-be-hacked are left open on the board. The
vector address bits, however, must be driven by the DL11 onto the Unibus contingent on an appropriate
global enable. On a stock DL11, drivers for <em>all</em> configurable vector bits are activated by a single global
enable, and jumpers downstream of the drivers control which of these activated bits will be admitted to the bus.
So, for the vector address part of the hack to function, hack control must be asserted instead of the global
enable for each of the to-be-driven bits, and the corresponding jumpers for these bits must be left in. And
indeed, upon inspection of the DL11 there are trace cuts that have been done (marked here with "X") to lift
the global enable and allow individual hack control of each of the affected bits:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" src="/images/pdp11/dl11-hack-cuts.png"
title="Trace cuts for DL11 hack"/></p>
<p><br/></p>
<p>Last, we can look at the board jumpering and the wiring of the hack to determine the specific CSR and
vector addresses at play:</p>
<style>
.bitlist { border-collapse: collapse; margin-left: auto; margin-right: auto; margin-bottom: 2ex; }
.bitlist caption { font-weight: bold; }
.bitlist .hacked { font-weight: bold; }
.bitlist tr:nth-child(even) :not(th) { background-color: #f2f2f2; }
.bitlist td:nth-child(3n+2) { border-left-color: #000000; }
.bitlist td:nth-child(3n+1) { border-right-color: #000000; }
.bitlist th, .bitlist td { padding: 5px; }
.bitlist td { border: 1px solid lightgray; font-family: Menlo,Consolas,monospace; }
.bitlist tr:first-child td { border-top-color: #000000; }
.bitlist tr:last-child td { border-bottom-color: #000000; }
</style>
<table class="bitlist">
<thead><tr>
<th></th>
<th>A11</th><th>A10</th><th>A9</th>
<th>A8</th><th>A7</th><th>A6</th>
<th>A5</th><th>A4</th><th>A3</th>
<th>A2</th><th>A1</th><th>A0</th>
<th></th>
</tr></thead>
<tbody><tr>
<th>P1 Open</th>
<td>1</td>
<td>1</td>
<td class="hacked">0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td class="hacked">0</td>
<td class="hacked">0</td>
<td class="hacked">1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<th>776510</th>
</tr><tr>
<th>P1 Closed</th>
<td>1</td>
<td>1</td>
<td class="hacked">1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td class="hacked">1</td>
<td class="hacked">1</td>
<td class="hacked">0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<th>777560</th>
</tr></tbody>
</table>
<table class="bitlist">
<thead><tr>
<th></th>
<th>V8</th><th>V7</th><th>V6</th>
<th>V5</th><th>V4</th><th>V3</th>
<th>V2</th><th>V1</th><th>V0</th>
<th></th>
</tr></thead>
<tbody><tr>
<th>P1 Open</th>
<td>0</td>
<td class="hacked">1</td>
<td class="hacked">1</td>
<td class="hacked">0</td>
<td class="hacked">0</td>
<td class="hacked">1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<th>310</th>
</tr><tr>
<th>P1 Closed</th>
<td>0</td>
<td class="hacked">0</td>
<td class="hacked">0</td>
<td class="hacked">1</td>
<td class="hacked">1</td>
<td class="hacked">0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<th>060</th>
</tr></tbody>
</table>
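<p>As a cross-check of the two tables above, here is a minimal Python sketch that folds each bit row back into
its octal value. It assumes that the upper Unibus address bits A17..A12 are all ones (the I/O page), which is
what the octal column implies; the helper names are made up for illustration.</p>

```python
# Rebuild the octal CSR and vector values from the bit rows in the tables.
# Assumption: Unibus address bits A17..A12 are all ones (I/O page), as the
# octal column of the CSR table implies.

def bits_to_int(bits):
    """Fold a most-significant-bit-first list of 0/1 values into an integer."""
    value = 0
    for bit in bits:
        value = value * 2 + bit
    return value

def csr_address(a11_to_a0):
    """18-bit CSR address from the A11..A0 row, with A17..A12 assumed high."""
    return 0o770000 + bits_to_int(a11_to_a0)

def vector_address(v8_to_v0):
    """Interrupt vector from the V8..V0 row."""
    return bits_to_int(v8_to_v0)

CSR_OPEN = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]
CSR_CLOSED = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]
VEC_OPEN = [0, 1, 1, 0, 0, 1, 0, 0, 0]
VEC_CLOSED = [0, 0, 0, 1, 1, 0, 0, 0, 0]

for name, csr, vec in (("P1 Open", CSR_OPEN, VEC_OPEN),
                       ("P1 Closed", CSR_CLOSED, VEC_CLOSED)):
    print(f"{name}: CSR {csr_address(csr):06o}, vector {vector_address(vec):03o}")
```

<p>Run as-is, this should print 776510/310 for the open case and 777560/060 for the closed case, matching
the right-hand columns of the tables.</p>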
<p><br/></p>
<p>We see from these specific addresses that closing the contacts of P1 would dynamically re-jumper the board
from assignment as the 2nd non-console interface to assignment as the console interface. So perhaps this was
once used (in conjunction with another similarly hacked interface?) to swap console terminals with the flip of
a single switch.</p></div>
<hr />
</div>
<div class='article'>
<div class="content-title">
<a href="https://fritzm.github.io/fp11-again.html"><h1>PDP-11/45: Some more floating point trouble</h1></a>
Sat 21 November 2020
by <a class="url fn" href="https://fritzm.github.io/author/fritz-mueller.html">Fritz Mueller</a>
</div>
<div><p><em>[A catch-up article, documenting events of April/May 2020]</em></p>
<p>In late April, I offered to give a video demonstration of the '11/45 to some interested work colleagues. Since
I hadn't had it on in a while, I fired it up to make sure everything was still in working order. The machine
behaved well from the front panel and was able to boot both V6 Unix and RSTS V06C. Great! Typed a very simple
demo program in to RSTS (print a multiplication table) and that ran, but produced some very strange results.
Uh oh... </p>
<p>Asked RSTS to <code>PRINT PI</code>, and it spat out a value somewhere around 3.7... :-)</p>
<p>So, time to try the floating point MAINDECS... Sure enough, failures all over the place, starting with the
very first diagnostic in the floating point suite, CFPAB0. This diagnostic covers utility operations like
LDFPS/STFPS, SETI/SETL, SETF/SETD, etc.</p>
<p>I do not have listings for the diagnostics in this suite, but it is usually simple enough to reproduce
failures with short toggle-in programs given the names and descriptions of the failing diagnostics. In this
case, the following simple code to exercise an LDFPS/STFPS sequence from the front panel switches and lights
showed that bits 10 and 11 of the floating point status/control word would come back erroneously toggled:</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1
2
3
4
5</pre></div></td><td class="code"><div class="highlight"><pre><span></span><span class="nt">001000</span> <span class="nt">170137</span> <span class="nt">START</span><span class="o">:</span> <span class="nt">LDFPS</span> <span class="o">@</span><span class="p">#</span><span class="nn">177570</span> <span class="o">;</span><span class="nt">LOAD</span> <span class="nt">FPS</span> <span class="nt">FROM</span> <span class="nt">SWITCH</span> <span class="nt">REGISTER</span>
<span class="nt">177570</span>
<span class="nt">001004</span> <span class="nt">170237</span> <span class="nt">STFPS</span> <span class="o">@</span><span class="p">#</span><span class="nn">177570</span> <span class="o">;</span><span class="nt">AND</span> <span class="nt">STORE</span> <span class="nt">BACK</span> <span class="nt">TO</span> <span class="nt">DISPLAY</span> <span class="nt">REGISTER</span>
<span class="nt">177570</span>
<span class="nt">001010</span> <span class="nt">000773</span> <span class="nt">BR</span> <span class="nt">START</span> <span class="o">;</span><span class="nt">REPEAT</span>
</pre></div>
</td></tr></table>
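<p>When reading the result off the lights, it can help to reduce the comparison to a list of bit numbers. The
following Python sketch does that for a written/read-back pair; the function name and the sample values are
illustrative assumptions, not captured data.</p>

```python
# Compare a value written with LDFPS against the value read back with STFPS
# and report which bit positions differ. Sample values are illustrative only.

def differing_bits(written, read_back):
    """Bit numbers (0 = least significant) where two 16-bit words disagree."""
    delta = (written ^ read_back) & 0o177777
    return [bit for bit in range(16) if (delta >> bit) & 1]

# A switch-register value of 0 that comes back as 006000 has bits 10 and 11 flipped.
print(differing_bits(0o000000, 0o006000))
```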
<p>First things first, check power to the FPU and its clock; these look fine. Next, plug the KM11 into the
floating point slot and check the FPU microcode sequences while executing LDFPS and STFPS instructions.
These also look fine:</p>
<ul>
<li>
<p>For <code>LDFPS @#177570</code> I see <code>RDY.00</code>, <code>RDY.10</code>, <code>RDY.20</code>, <code>RDY.30</code>, <code>RDY.70</code>, <code>LD.50</code></p>
</li>
<li>
<p>For <code>STFPS @#177570</code> I see <code>RDY.00</code>, <code>RDY.10</code>, <code>RDY.20</code>, <code>RDY.30</code>, <code>RDY.80</code>, <code>STR.30</code>, <code>STR.08</code></p>
</li>
</ul>
<p>Most of the data paths of interest regarding the FPS register are on the fraction low (FRL) board, so this
goes out on extenders so the microcode can be stepped and gate-level logic inspected with a logic probe.</p>
<p>Here is the block diagram of data paths in the FPU, for reference in discussion below:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" src="/images/pdp11/fp11-data-paths.png"
title="FP11-B data paths"/>
<p style="text-align: center;"><em>FP11-B data paths</em></p></p>
<p>So, one thing to note with regard to the FPS register is that it is gated through the ACMX multiplexer and
written into scratch pad register AC7[0] during microcode state <code>RDY.00</code> which is the first state in the
common prolog of every FPU instruction:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" src="/images/pdp11/fp11-ucode-prolog.png"
title="FP11-B microcode prolog" width="200px"/>
<p style="text-align: center;"><em>FP11-B microcode prolog</em></p></p>
<p>Stopping in state <code>RDY.00</code> and examining the ACMX inputs, selects, and outputs for bits 10 and 11 immediately
reveals a problem. These bits of ACMX are implemented by a 74153 dual 4-input mux, E71 on sheet FRLB of the
FP11-B engineering drawings:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" src="/images/pdp11/fp11-acmx-e71.png"
title="FP11-B ACMX &lt;11:10&gt;" width="400px"/>
<p style="text-align: center;"><em>FP11-B ACMX &lt;11:10&gt;</em></p></p>
<p>Inputs from the FPS register on pins 6 and 10 appear correct, as do the selector signals on pins 14 and 2.
But outputs on pins 7 and 9 appear to be inverted. So E71 appears bad. Pulled this, socketed, and replaced.
After this fix, LDFPS/STFPS function correctly in the toggle-in test program, and MAINDEC CFPAB0 passes.</p>
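<p>For reference, the expected behaviour of one half of a 74153 can be modelled in a few lines of Python. This
is a sketch under assumptions: pin roles follow the standard 74153 pinout (pins 6 and 10 are the C0 inputs of
the two halves, pin 14 is select A, pin 2 is select B), both selects are taken to be low in state
<code>RDY.00</code>, and the signal values are illustrative rather than measured.</p>

```python
# Behavioural model of one half of a 74153 dual 4-input multiplexer.
# Select inputs A and B are shared by both halves; the strobe is active low.
# Values used below are illustrative, not measurements from the FRL board.

def mux_74153(select_b, select_a, strobe_n, inputs):
    """Return Y for data inputs (C0, C1, C2, C3); Y is low while the strobe is high."""
    if strobe_n:
        return 0
    return inputs[select_b * 2 + select_a]

def output_disagrees(select_b, select_a, strobe_n, inputs, observed_y):
    """True when the level probed on the output pin differs from the model."""
    return mux_74153(select_b, select_a, strobe_n, inputs) != observed_y

# An FPS bit of 1 on C0 with both selects low should appear unchanged on Y.
fps_bit = 1
print(mux_74153(0, 0, 0, (fps_bit, 0, 0, 0)))
# Probing a 0 on the output in that situation flags the part as suspect.
print(output_disagrees(0, 0, 0, (fps_bit, 0, 0, 0), observed_y=0))
```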
<p>Not out of the woods yet, though... Progressing down the sequence of MAINDECS, diagnostic CFPDC0
(add/subtract) now fails :-( For this, we bring back the simple "add two floats" diagnostic used during
previous FP11 debug:</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre> 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17</pre></div></td><td class="code"><div class="highlight"><pre><span></span> <span class="nt">000000</span> <span class="nt">AC0</span><span class="o">=%</span><span class="nt">0</span>
<span class="nt">000001</span> <span class="nt">AC1</span><span class="o">=%</span><span class="nt">1</span>
<span class="nt">000000</span> <span class="p">.</span><span class="nc">ASECT</span>
<span class="nt">001000</span> <span class="o">.=</span><span class="nt">1000</span>
<span class="nt">001000</span> <span class="nt">170011</span> <span class="nt">START</span><span class="o">:</span> <span class="nt">SETD</span> <span class="o">;</span><span class="nt">SET</span> <span class="nt">DOUBLE</span> <span class="nt">PRECISION</span> <span class="nt">MODE</span>
<span class="nt">001002</span> <span class="nt">172467</span> <span class="nt">000014</span> <span class="nt">LDD</span> <span class="nt">D1</span><span class="o">,</span><span class="nt">AC0</span> <span class="o">;</span><span class="nt">FETCH</span> <span class="nt">FIRST</span> <span class="nt">ADDEND</span> <span class="nt">FROM</span> <span class="nt">D1</span>
<span class="nt">001006</span> <span class="nt">172567</span> <span class="nt">000020</span> <span class="nt">LDD</span> <span class="nt">D2</span><span class="o">,</span><span class="nt">AC1</span> <span class="o">;</span><span class="nt">FETCH</span> <span class="nt">SECOND</span> <span class="nt">ADDEND</span> <span class="nt">FROM</span> <span class="nt">D2</span>
<span class="nt">001012</span> <span class="nt">172100</span> <span class="nt">ADDD</span> <span class="nt">AC0</span><span class="o">,</span><span class="nt">AC1</span> <span class="o">;</span><span class="nt">ADD</span> <span class="nt">THEM</span> <span class="o">(</span><span class="nt">RESULT</span> <span class="nt">IN</span> <span class="nt">AC1</span><span class="o">)</span>
<span class="nt">001014</span> <span class="nt">174167</span> <span class="nt">000022</span> <span class="nt">STD</span> <span class="nt">AC1</span><span class="o">,</span><span class="nt">D3</span> <span class="o">;</span><span class="nt">STORE</span> <span class="nt">RESULT</span> <span class="nt">TO</span> <span class="nt">D3</span>
<span class="nt">001020</span> <span class="nt">000000</span> <span class="nt">HALT</span>
<span class="nt">001022</span> <span class="nt">040200</span> <span class="nt">000000</span> <span class="nt">000000</span> <span class="nt">D1</span><span class="o">:</span> <span class="p">.</span><span class="nc">WORD</span> <span class="nt">040000</span><span class="o">,</span><span class="nt">000000</span><span class="o">,</span><span class="nt">000000</span><span class="o">,</span><span class="nt">000000</span> <span class="o">;</span><span class="nt">0</span><span class="p">.</span><span class="nc">5</span>
<span class="nt">001030</span> <span class="nt">000000</span>
<span class="nt">001032</span> <span class="nt">040200</span> <span class="nt">000000</span> <span class="nt">000000</span> <span class="nt">D2</span><span class="o">:</span> <span class="p">.</span><span class="nc">WORD</span> <span class="nt">040000</span><span class="o">,</span><span class="nt">000000</span><span class="o">,</span><span class="nt">000000</span><span class="o">,</span><span class="nt">000000</span> <span class="o">;</span><span class="nt">0</span><span class="p">.</span><span class="nc">5</span>
<span class="nt">001040</span> <span class="nt">000000</span>
<span class="nt">001042</span> <span class="nt">000000</span> <span class="nt">000000</span> <span class="nt">000000</span> <span class="nt">D3</span><span class="o">:</span> <span class="p">.</span><span class="nc">WORD</span> <span class="nt">000000</span><span class="o">,</span><span class="nt">000000</span><span class="o">,</span><span class="nt">000000</span><span class="o">,</span><span class="nt">000000</span>
<span class="nt">001050</span> <span class="nt">000000</span>
<span class="nt">001000</span> <span class="p">.</span><span class="nc">END</span> <span class="nt">START</span>
</pre></div>
</td></tr></table>
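<p>To read the operand words in the listing, a small Python decoder for the high word of a PDP-11
floating-point value is handy. This is a sketch that assumes the lower-order words are zero, as they are
here; the function name is made up. The listing shows 040200 in the object column and 040000 in the
<code>.WORD</code> operands, so both are decoded.</p>

```python
# Decode the first (most significant) word of a PDP-11 floating-point value.
# Layout of that word: bit 15 sign, bits 14..7 excess-128 exponent, bits 6..0
# the top of the fraction, with a hidden leading 1 (fraction is 0.1xxx binary).
# Lower-order words are taken as zero in this sketch.

def decode_high_word(word):
    sign = -1.0 if word & 0o100000 else 1.0
    exponent = (word >> 7) & 0o377
    if exponent == 0:
        # An exponent field of zero is treated as 0.0 here; the sign-set
        # special case is ignored in this sketch.
        return 0.0
    fraction = (0o200 | (word & 0o177)) / 256.0  # hidden bit restored
    return sign * fraction * 2.0 ** (exponent - 128)

for word in (0o040200, 0o040000):
    print(f"{word:06o} decodes to {decode_high_word(word)}")
```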
<p>Sure enough, this is producing incorrect results. The microcode flows for add/subtract/compare are a bit more
involved than the simple load/store sequences above. The sequence starts with common prolog <code>RDY.00</code>,
<code>RDY.10</code>, <code>RDY.20</code>, <code>RDY.30</code>, same as above. The first fork after <code>RDY.30</code> goes to <code>RDY.60</code>, since
add/subtract/compare are "no memory class" instructions (FP accumulator register operands only). The second
fork after <code>RDY.60</code> takes us to <code>ADD.00</code> on sheet FP11 FLOWS 8.</p>
<p>The left side of FLOWS 8 is a decision tree for zero operands and/or whether or not we are executing a compare
instruction. Traversal of these states sets up fraction and exponent operands and, if necessary, a comparison
of operand exponents in the EALU. In our case (addition of two double-precision non-zero operands), the
sequence is: <code>ADD.00</code>, <code>ADD.04</code>, <code>ADD.06</code>, <code>ADD.02</code>, <code>ADD.08</code>, <code>ADD.12</code>.</p>
<p>We then end up at state <code>ADD.22</code> at the top of the right side of FLOWS 8. The previously set up exponent
difference is used to index into a 256x4 "range ROM"; output bits from this ROM inform the subsequent
microcode fork which determines which operand shift, if any, to apply before the upcoming fraction ALU
operation.</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" src="/images/pdp11/fp11-exp-compare.png"
title="FP11-B Exponent Comparison Flow"/></p>
<p>Here a problem is evident. We should fork to <code>ADD.24</code>, for equal exponents, but instead we end up at
<code>ADD.30</code>, for destination exponent less than source exponent. Putting the FXP board out on the extender and
pausing in this state, the operands and operation codes on the EALU bit-slices appear to be correct, but
signal FRMH ALU CIN L is erroneously asserted at E34 pin 7 (sheet FXPA). This extra carry (borrow, really,
since the operation is a subtract) into the least significant bit-slice causes the EALU output to be -1
instead of 0.</p>
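<p>The effect of the stray borrow can be illustrated with a short Python sketch. It assumes an 8-bit
subtract for simplicity, and the sign test below merely stands in for the range ROM lookup, whose real
contents are not reproduced here; only the two forks named above are modelled.</p>

```python
# Sketch of the exponent comparison: an 8-bit subtract with an optional borrow in.
# The classification stands in for the range ROM; real ROM contents are not shown.

def ealu_subtract(dst_exp, src_exp, borrow_in=0):
    """8-bit two's-complement result of dst_exp - src_exp - borrow_in."""
    return (dst_exp - src_exp - borrow_in) % 256

def as_signed(byte):
    """Interpret an 8-bit value as a signed number."""
    return byte - 256 if byte >= 128 else byte

def next_state(diff):
    """Fork taken for a given exponent difference (only two cases are named in the text)."""
    signed = as_signed(diff)
    if signed == 0:
        return "ADD.24"  # equal exponents
    if signed < 0:
        return "ADD.30"  # destination exponent less than source exponent
    return "other"       # remaining forks are outside this sketch

EXP = 0o200  # two operands with equal exponent fields (value is illustrative)
print(next_state(ealu_subtract(EXP, EXP)))               # healthy carry-in
print(next_state(ealu_subtract(EXP, EXP, borrow_in=1)))  # stuck borrow
```

<p>With equal exponents the healthy case yields a difference of 0 and the fork to <code>ADD.24</code>, while the
extra borrow yields -1 and the fork to <code>ADD.30</code>, as observed.</p>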
<p>Moving back to the source of this signal on the FRM board, it turns out that FRM E20, a 74H40 dual quad-input
NAND, is outputting an invalid logic level at pin 8. Pulled this, socketed, replaced, and the problem appears
to be fixed.</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" src="/images/pdp11/FRM-E20.png"
title="FP11-B FRMH ALU CIN L"/></p>
<p>After this second repair, the full suite of FP11-B diagnostics is passing again. And RSTS/E has a much less
fanciful interpretation of <code>PI</code>...</p></div>
<hr />
</div>
<div class='article'>
<div class="content-title">
<a href="https://fritzm.github.io/unix-v6-trouble-2.html"><h1>PDP-11/45: V6 Unix Troubleshooting, Part II</h1></a>
Sun 25 October 2020
by <a class="url fn" href="https://fritzm.github.io/author/fritz-mueller.html">Fritz Mueller</a>
</div>
<div><p><em>[A catch-up article, documenting discoveries of Feb 2019]</em></p>
<p>In early 2019, I made a V6 Unix pack from the Ken Wellsch tape image, as mentioned in <a href="https://fritzm.github.io/unix-and-ms11.html">this blog
entry</a>. It booted on my machine, but dumped core on the first <code>ls</code> in single-user
mode, or as soon as I did any heavy lifting in multi-user mode.</p>
<p>The following is the conclusion of a chronology of the troubleshooting campaign that took place over the next
month and a half, culminating in a hardware fix and successful operation of V6 Unix on the machine (part I is
<a href="https://fritzm.github.io/unix-v6-trouble-1.html">here</a>.) This was largely a collaborative effort between Noel Chiappa and
myself via direct email correspondence, though some help was received from others via the cctalk mailing list
as well.</p>
<p>By this point, the nature of the <code>ls</code> problem had been fairly well characterized: part of the <code>ls</code> process
address space ended up holding an incorrect portion of its program text; subsequently, when execution jumped
to some of these unexpected bits, an out-of-bounds memory access would occur triggering a memory management
trap. Efforts now focus on understanding how and why the bad bits got there...</p>
<h3>February 7</h3>
<p>[Here and below, block-quoted content is excerpted from email correspondence.]</p>
<p>Fritz:</p>
<blockquote>
<p>Noel, is it possible for you to deduce where Unix <em>should</em> be placing these "bad" bits (from file offset octal
4220)? Maybe a comparison of addresses where the bits should be, with addresses where the "bad" copy ends
up, could point us at some particular failure modes to check in the KT11, CPU, or RK11...</p>
</blockquote>
<p>Noel:</p>
<blockquote>
<p>Yes, it's quite simple: just add the virtual address in the code to the physical address of the bottom of
the text segment (given in UISA0). The VA is actually 04200, though: the 04220 includes 020 to hold the
a.out header at the start of the command file.</p>
<p>So, with UISA0 containing 01614, that gives us PA:161400 + 04200 = PA:165600, I think. And it wound up at
PA:171600 - off by 04000 (higher) - which is obviously an interesting number.</p>
</blockquote>
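<p>The arithmetic in Noel's reply can be reproduced with a short Python sketch. The constants come from the
message above; the function and constant names are made up for illustration.</p>

```python
# Where a word of program text should land in physical memory, per the email:
# PA = (UISA0 click address * 64) + (file offset - a.out header size).

CLICK_SHIFT = 6      # a memory-management "click" is 64 bytes
AOUT_HEADER = 0o20   # bytes of a.out header ahead of the text in the file

def text_physical_address(uisa0, file_offset):
    base = uisa0 << CLICK_SHIFT
    virtual = file_offset - AOUT_HEADER
    return base + virtual

expected = text_physical_address(0o1614, 0o4220)
observed = 0o171600
print(f"expected PA:{expected:o}, observed PA:{observed:o}, "
      f"off by {observed - expected:o}")
```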
<hr>
<blockquote>
<p>Here's where it gets 'interesting'.</p>
<p>Executing a command with pure text on V6 is a very complicated process. The shell fork()s a copy of itself,
and does an exec() system call to overlay the entire memory in the new process with a copy of the command
(which sounds fairly simple, at a high level) - but the code path to do the exec() with a pure text is
incredibly hairy, in detail. In particular, for a variety of reasons, the memory of the process can get
swapped in and out several times during that. I apparently used to understand how this all worked, see this
message:</p>
<p><a href="https://minnie.tuhs.org/pipermail/tuhs/2018-February/014299.html">https://minnie.tuhs.org/pipermail/tuhs/2018-February/014299.html</a></p>
<p>but it's so complicated it's going to take a while to really comprehend it again. (The little grey cells are
aging too, sigh...)</p>
<p>The interesting point is that when V6 first copies the text in from the file holding the command (using
readi(), Lions 6221 for anyone who's masochistic enough to try and actually follow this :-), it reads it in
starting from the bottom, one disk block at a time (since in V6, files are not stored contiguously).</p>
<p>So, if it starts from the bottom, and copies the wrong thing from low in the file <em>up</em> to VA:010200, when it
later gets to VA:010200 in the file contents, that <em>should</em> over-write the stuff that got put there in the
wrong place <em>earlier</em>. Unless there's <em>another</em> problem which causes that later write to <em>also</em> go somewhere
wrong...</p>
<p>So, I'm not sure when this trashage is happening, but because of the above, my <em>guess</em> is that it's in one
of the two swap operations on the text (out, and then back in). (Although it might be interesting to look at
PA:165600 and see what's actually <em>there</em>.) Unix does swapping of pure texts in a single, multi-block
transfer (although not always as an integral number of blocks, as we found out the hard way with the QSIC
:-).</p>
<p>So my suspicions have now switched back to the RK11... One way to proceed would be to stop the system after
the pure text is first read in (say around Lions 4465), and look to see what the text looks like in main
memory at <em>that</em> point. (This will require looking at KT11 registers to see where it's holding the text
segment, first.)</p>
<p>If that all looks good, we'll have to figure out how to stop the system after the pure text is read back in
(which does not happen in exec(), it's done by the normal system operation to swap in the text and data of a
process which is ready to run).</p>
<p>We could also stop the system after the text is swapped out, and key in a short (~ a dozen words) program to
read the text back in from the swap device, and examine it - although we'd have to grub around in the system
a bit to figure out where it got written to. (It might be just easier to stop it at, say, Lions 5196 and
look at the arguments on the kernel stack.)</p>
</blockquote>
<p>Fritz:</p>
<blockquote>
<blockquote>
<p>...it might be interesting to look at PA:165600 and see what's actually <em>there</em></p>
</blockquote>
<p>A sea of zeros, as it turns out.</p>
</blockquote>
<hr>
<blockquote>
<blockquote>
<p>The most valuable thing ... would be to look at the text segment, after it's read in and before it's
swapped out. I can work out where to put a halt, if you want to try that.</p>
</blockquote>
<p>Yes, this sounds like a good plan to me! Is this as simple as dropping a HALT at VA:0 in the text? </p>
</blockquote>
<p>Noel:</p>
<blockquote>
<p>No; actually, probably easier! :-) Probably easiest is to, just before you type 'ls', put a HALT in the OS
just after 4467 in Lions. Halt the machine momentarily, patch the kernel, and CONT. (Basically the same as
your patch to the trap vector, just a different address.) That'll be at 021320 (should contain 062706),
physical or virtual. :-)</p>
<p>When the system halts, you'll need to look at the text in memory. Two ways to find the location: look on the
kernel stack, the address should be the second thing down:</p>
<div class="highlight"><pre><span></span>mov 16(r3),-(sp)
add $20,(sp)
mov (r4),-(sp)
jsr pc,*$_swap
</pre></div>
<p>(i.e. the thing that 020 got added to). Probably easier, though, is just to look in UISA0 (which at this
point is pointing to the block of memory that's been allocated to read the text into, Lions 4459-60).</p>
<p>That number in UISA0, T, will be the click address of the text. So PA:T00 should be the start of the text
(170011 010600, etc). So then PA:(T00+010200) should be the trashed chunk of text: 110024 010400 000167
000016 010500 etc (right) or 016162 004767 000224 000414 016700 (wrong).</p>
</blockquote>
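<p>A small Python sketch of that check: compute where to look from the value in UISA0, then compare the first
five words read from the front panel against the two patterns Noel quotes. The helper names are hypothetical,
and the UISA0 value used in the example is the one from the earlier run, which may differ at the halt.</p>

```python
# Expected and observed leading words of the chunk, as quoted in the email.
RIGHT_WORDS = (0o110024, 0o010400, 0o000167, 0o000016, 0o010500)
WRONG_WORDS = (0o016162, 0o004767, 0o000224, 0o000414, 0o016700)
CHUNK_OFFSET = 0o10200

def chunk_physical_address(uisa0):
    """PA of the suspect chunk: the click address in UISA0 times 64, plus the offset."""
    return uisa0 * 64 + CHUNK_OFFSET

def classify_chunk(words):
    """Compare the first five words examined against both known patterns."""
    head = tuple(words[:5])
    if head == RIGHT_WORDS:
        return "right"
    if head == WRONG_WORDS:
        return "wrong"
    return "unknown"

print(f"examine PA:{chunk_physical_address(0o1614):o}")
```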
<h3>February 8</h3>
<p>Noel:</p>
<blockquote>
<p>In addition to the info I already sent about how to [set the breakpoint], if you could note down the top 3
words on the kernel stack, and the contents of the RK registers, those would be really useful; the first
will allow us to work out what <em>should</em> be in the RK registers after the swap I/O operation completes - I
don't think the RK11 will be asked to do anything after that finishes and before the system hits that halt
in xalloc().</p>
<p>To find the kernel stack.... read out KISA6, S. This value will point to the 'user' area of that process,
plus the kernel stack. The kernel SP should be something like 01417xx; subtract 140000 (the segment number),
and add what's left to S00. Alternatively, you can probably use the rotating switch on the front panel to
just look up VA:1417xx (whatever's in R6) directly.</p>
<p>Oh, if you need some bed-time reading to put you to sleep, check out the bottom section ("exec() and
pure-text images") in:</p>
<p><a href="http://gunkies.org/wiki/Unix_V6_internals">http://gunkies.org/wiki/Unix_V6_internals</a></p>
<p>which will explain what's going on here with the swapping in and out, which is sorta complicated.</p>
</blockquote>
<h3>February 9</h3>
<p>Noel:</p>
<blockquote>
<blockquote>
<p>just halt the machine after the text is swapped in</p>
</blockquote>
<p>The code we need is at Lions 2034, where the pure text of a process is swapped in (and this should only be
traversed once; I don't think the system will need to swap in the text of the shell); just put a HALT in (in
the usual manner, just before trying 'ls') at 015406, which should contain a 062706 (again).</p>
<p>At that point, since the text size is 010400, and the location of the text in physical memory is 0161400,
the BAR <em>should</em> contain 0172000. If not, and it's 0232000 (note that the 0200000 bit will be in the CSR,
the lower XM bit) instead, Bazinga!, it's nailed (unless the system somehow snuck another RK operation in
there, but I don't see anything that could do that).</p>
</blockquote>
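<p>The address arithmetic behind that prediction can be checked in a few lines of Python. This is only a sketch: the helper name <code>split_unibus_address</code> is made up for illustration, and it assumes the 18-bit address is held as 16 bits in the BAR plus two extended-memory (XM) bits in the CSR, as the message describes.</p>

```python
def split_unibus_address(addr18):
    """Split an 18-bit address into (XM bits, 16-bit BAR value)."""
    return addr18 >> 16, addr18 & 0o177777

text_base = 0o161400   # physical address of the text, from the message
text_size = 0o010400   # text size in bytes, from the message
end = text_base + text_size

print(oct(end))                                           # 0o172000
print([oct(v) for v in split_unibus_address(end)])        # ['0o0', '0o172000']
print([oct(v) for v in split_unibus_address(0o232000)])   # ['0o1', '0o32000']
```

<p>In the bad case the BAR alone would read 032000, with the lower XM bit set in the CSR.</p>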
<p>I finally get some time back in front of the machine, after a few days in bed with a cold:</p>
<blockquote>
<blockquote>
<p>...put a HALT in the OS just after 4467 in Lions. Halt the machine momentarily, patch the kernel, and CONT.
(Basically the same as your patch to the trap vector, just a different address.) That'll be at 021320
(should contain 062706)...</p>
</blockquote>
<p>But alas, it does not. [PA:021320] = 010246. Furthermore, [PA:015406] = 016504.</p>
</blockquote>
<hr>
<blockquote>
<p>I just tried under SIMH, also, and got consistent results:</p>
<div class="highlight"><pre><span></span>[PA:015406] = 016504
[PA:021320] = 010246
</pre></div>
<p>...so, one would think, my rkunix and yours are different?</p>
</blockquote>
<p>Noel:</p>
<blockquote>
<p>That must be it. I thought we were both working from the V6 distribution? Oh, yours prints out that Western
Electric copyright notice, I don't think mine has that...</p>
</blockquote>
<h3>February 10</h3>
<p>The first part of the day is spent sorting out and comparing the "Wellsch" V6 distribution that I have been
using, and the "Ritchie" version that Noel has been using. Noel comes to the conclusion that the only
differences in the kernel sources are in fact the four <code>printfs</code> for the copyright notice, but this is enough
to perturb the locations of various symbols of interest between the two kernels. He also finds the binaries
<code>ls</code>, <code>cc</code>, <code>as</code>, <code>as2</code>, <code>ld</code>, <code>c0</code>, <code>c1</code>, and <code>c2</code> all match; as do liba.a, libc.a and crt0.o.</p>
<p>Getting back on the trail of the bug:</p>
<blockquote>
<p>So the first place I'd like to try HALTing is just after the call to swap, Lions 4467; at that point, the
text should be in main memory, and also just written to disk. Should be at 021320 (old contents should be
062706).</p>
<p>Fun things to do here: look at the text in main memory (0161400 and up), see if it's correct at this point.
Also: pull the arguments off the top of the stack, and write a small program to read it back in...</p>
</blockquote>
<p>The address given turns out to be the product of one last typo ("rkunix" vs. "rrkunix" on Noel's part), which
resulted in incorrect symbol addresses for my kernel, but I'm hip to Noel's curveballs now, so:</p>
<blockquote>
<p>Okay, using today's newly acquired 'db' skillz :-), in my rkunix, that spot is at PA:21420. Firing up the
machine again and trying that now...</p>
</blockquote>
<p>It works; I end up stopped at the breakpoint and start extracting data:</p>
<blockquote>
<p>Hmmm:</p>
<div class="highlight"><pre><span></span><span class="n">PA</span><span class="o">:</span><span class="mi">161400</span><span class="o">:</span> <span class="mi">141644</span> <span class="mi">141660</span> <span class="mi">000000</span> <span class="mi">000000</span> <span class="mi">000000</span> <span class="mi">000000</span> <span class="mi">000000</span> <span class="mi">000000</span>
<span class="n">PA</span><span class="o">:</span><span class="mi">161420</span><span class="o">:</span> <span class="mi">000000</span> <span class="mi">000000</span> <span class="mi">000000</span> <span class="mi">000000</span> <span class="mi">000000</span> <span class="mi">000000</span> <span class="mi">000000</span> <span class="mi">000000</span>
</pre></div>
</blockquote>
<p>Noel:</p>
<blockquote>
<p>The text is probably at a different location in PA at this point. Read out UISA0 for its base.</p>
</blockquote>
<p>Fritz:</p>
<blockquote>
<div class="highlight"><pre><span></span><span class="n">UISA0</span><span class="o">:</span> <span class="mi">001654</span>
<span class="n">PA</span><span class="o">:</span><span class="mi">165400</span><span class="o">:</span> <span class="mi">170011</span> <span class="mi">010600</span> <span class="mi">011046</span> <span class="mi">005720</span> <span class="mi">010066</span> <span class="mi">000002</span> <span class="mi">004767</span> <span class="mi">000010</span>
<span class="n">KSP</span><span class="o">:</span> <span class="mi">141656</span> <span class="o">-></span> <span class="n">PA</span><span class="o">:</span><span class="mi">165256</span>
<span class="n">PA</span><span class="o">:</span><span class="mi">165256</span><span class="o">:</span> <span class="mi">007656</span> <span class="mi">001654</span> <span class="mi">000104</span> <span class="mi">000000</span> <span class="mi">101602</span> <span class="mi">066312</span> <span class="mi">000000</span> <span class="mi">141726</span>
<span class="n">PA</span><span class="o">:</span><span class="mi">175600</span><span class="o">:</span> <span class="mi">110024</span> <span class="mi">010400</span> <span class="mi">000167</span> <span class="mi">000016</span> <span class="mi">010500</span> <span class="mi">010605</span> <span class="mi">101446</span> <span class="mi">010346</span>
</pre></div>
<p>So far so good -- both beginning and eventually-bogus sections of text check out at this point!</p>
</blockquote>
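<p>The conversions used in these readouts (click number to physical address, and kernel stack pointer to physical address) can be sketched as follows. The helper names are hypothetical, and the KISA6 value of 01634 is an assumption: it is not quoted in the thread, only inferred from the KSP and PA figures above.</p>

```python
CLICK = 0o100          # bytes per click (64)
KSEG6 = 0o140000       # base virtual address of kernel segment 6

def click_to_pa(clicks):
    """Physical byte address of a click number (append two octal zeros)."""
    return clicks * CLICK

def kernel_va_to_pa(kisa6, va):
    """Translate a segment-6 kernel virtual address using the KISA6 value."""
    return click_to_pa(kisa6) + (va - KSEG6)

uisa0 = 0o1654
text_pa = click_to_pa(uisa0)
print(oct(text_pa))             # 0o165400, start of the text
print(oct(text_pa + 0o10200))   # 0o175600, the chunk that later goes bad

kisa6 = 0o1634   # assumed, inferred from KSP 141656 -> PA 165256
print(oct(kernel_va_to_pa(kisa6, 0o141656)))   # 0o165256
```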
<p>Noel:</p>
<blockquote>
<p>Woo-Hoo!!!! YEAH!!!!</p>
<p>So that part of the text <em>is</em> right at this point.</p>
<p>Needless to say, this is <em>very</em>, very important data.</p>
<p>So chances are very strong, at this point, that it's the RK11.</p>
<p>What did you want to do next? You could start with the RK11 registers. Also, use PDP11GUI to read the copy
off the swap device, once I decipher the stack?</p>
</blockquote>
<hr>
<blockquote>
<div class="highlight"><pre><span></span><span class="n">PA</span><span class="o">:</span><span class="mi">165256</span><span class="o">:</span> <span class="mi">007656</span> <span class="mi">001654</span> <span class="mi">000104</span> <span class="mi">000000</span> <span class="mi">101602</span> <span class="mi">066312</span> <span class="mi">000000</span> <span class="mi">141726</span>
</pre></div>
<p>OK, so the 01654 is the start address in PA (in clicks) for the area to swap out, and that matches UISA0.
0104 is the text length (also in clicks), and that also matches. The 0 is a flag which says it's a write
(read is 01). And the 07656 is the block number (4014.).</p>
</blockquote>
<p>Fritz:</p>
<blockquote>
<p>I should have a valid swap on the disk from before I shut down... Going to fire up PDP11GUI and grab it now
to have a look. We want blocks 4014-4022, then? (9 x 512-byte blocks = 0110 clicks if I got that right?)</p>
</blockquote>
<p>Noel:</p>
<blockquote>
<p>4014.-4023., I think...</p>
<blockquote>
<p>(9 x 512-byte blocks = 0110 clicks if I got that right?)</p>
</blockquote>
<p>I think 8-1/2 or so; text is 010400 bytes (a little less, actually, but that's what the system is using),
01000 bytes/block, = 010.4 blocks.</p>
</blockquote>
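<p>The block count being debated here is easy to work out mechanically. The sketch below (with a hypothetical helper, <code>swap_extent</code>) takes the block number and click count read off the kernel stack and counts the 512-byte blocks the transfer touches; by this count the last block is only half used.</p>

```python
BLOCK = 0o1000   # bytes per disk block (512)
CLICK = 0o100    # bytes per click (64)

def swap_extent(first_block, length_clicks):
    """Return (byte count, blocks touched, last block) for a swap transfer."""
    nbytes = length_clicks * CLICK
    nblocks = -(-nbytes // BLOCK)          # ceiling division
    return nbytes, nblocks, first_block + nblocks - 1

nbytes, nblocks, last = swap_extent(0o7656, 0o104)
print(oct(nbytes), nbytes / BLOCK)   # 0o10400 8.5
print(0o7656, nblocks, last)         # 4014 9 4022
```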
<p>Fritz:</p>
<p>Hmm, the beginning looks good, but it seems to cut off too soon:</p>
<blockquote>
<div class="highlight"><pre><span></span>0000000 000000 000000 000000 000000 000000 000000 000000 000000
*
7656000 170011 010600 011046 005720 010066 000002 004767 000010
7656020 010016 004737 006374 104401 004567 010154 162706 000044
7656040 012716 000001 004737 004652 010067 022314 010516 062716
7656060 177762 004737 006346 016500 177762 062700 177413 010067
|
7660320 000137 002346 016516 000004 012746 020452 004737 003562
7660340 005726 000137 002542 005067 017552 012704 022336 005003
7660360 012716 021050 004737 005042 110024 005203 022703 000020
7660400 000000 000000 000000 000000 000000 000000 000000 000000
*
11410000
</pre></div>
</blockquote>
<p>Noel:</p>
<blockquote>
<blockquote>
<div class="highlight"><pre><span></span>7656000 170011 010600 011046 005720 010066 000002 004767 000010
</pre></div>
</blockquote>
<p>Yup, good start; SETD, etc.</p>
<blockquote>
<div class="highlight"><pre><span></span>7660360 012716 021050 004737 005042 110024 005203 022703 000020
7660400 000000 000000 000000 000000 000000 000000 000000 000000
</pre></div>
</blockquote>
<p>Hunh; not good. (Might be worth looking at that location in main memory, see if it's zeros or not.)</p>
<p>That's so odd that it's all zeros - I wonder where they came from? Maybe they were already on the disk, and
the write stopped way early? (At 01000 bytes per block, it stopped after 2-1/2 blocks; 056000s, 057000s,
stopped half-way through the 060000's.)</p>
<p>Would be useful to have the RK register contents after the swap() call returns...</p>
</blockquote>
<p>Fritz:</p>
<blockquote>
<p>Okay, the write should be from PA:165400 - PA:175777, to sectors 07656 - 07667. Block 7667 encodes to an
RKDA value of 012363.</p>
<p>After the halt, I find:</p>
<div class="highlight"><pre><span></span><span class="n">RKDS</span><span class="o">:</span> <span class="mi">004707</span> <span class="o">(</span><span class="n">OK</span><span class="o">)</span>
<span class="n">RKER</span><span class="o">:</span> <span class="mi">000000</span> <span class="o">(</span><span class="n">OK</span><span class="o">)</span>
<span class="n">RKCS</span><span class="o">:</span> <span class="mi">000322</span> <span class="o">(</span><span class="n">BOGUS</span><span class="o">!</span> <span class="n">EX</span><span class="o">.</span><span class="na">MEM</span> <span class="o">=</span> <span class="mi">01</span><span class="o">)</span>
<span class="n">RKWC</span><span class="o">:</span> <span class="mi">000000</span> <span class="o">(</span><span class="n">OK</span><span class="o">)</span>
<span class="n">RKBA</span><span class="o">:</span> <span class="mi">176000</span> <span class="o">(</span><span class="n">OK</span><span class="o">)</span>
<span class="n">RKDA</span><span class="o">:</span> <span class="mi">012363</span> <span class="o">(</span><span class="n">OK</span><span class="o">)</span>
</pre></div>
<p>So, EX.MEM are the smoking bits here! I will review the associated designs and come up with things to
try/check.</p>
</blockquote>
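<p>For readers following along, here is a rough Python sketch of how those register values can be decoded. The bit positions are assumptions drawn from my understanding of the RK11 register layout (GO in bit 0, function in bits 3-1, EX MEM in bits 5-4, interrupt enable in bit 6, ready in bit 7; RKDA as track and sector with 12 sectors per track), and the helper names are made up for illustration.</p>

```python
RK_FUNCTIONS = ["control reset", "write", "read", "write check",
                "seek", "read check", "drive reset", "write lock"]

def decode_rkcs(rkcs):
    """Pick out the low-order RKCS fields used in this hunt."""
    return {
        "go": rkcs & 1,
        "function": RK_FUNCTIONS[(rkcs >> 1) & 7],
        "ex_mem": (rkcs >> 4) & 3,
        "ide": (rkcs >> 6) & 1,
        "ready": (rkcs >> 7) & 1,
    }

def block_to_rkda(block, drive=0):
    """Encode a block number as drive, track (cylinder and surface), sector."""
    track, sector = divmod(block, 12)      # 12 sectors per track (assumed)
    return (drive << 13) | (track << 4) | sector

fields = decode_rkcs(0o322)
print(fields["function"], fields["ex_mem"])    # write 1
print(oct(block_to_rkda(0o7667)))              # 0o12363
print(oct((0o165400 + 0o10400) & 0o177777))    # 0o176000, expected RKBA
```

<p>With a healthy controller the EX MEM field would be expected to read 0 here, since the transfer never crosses a 64 KB boundary.</p>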
<hr>
<blockquote>
<p>Okay, taking a look:</p>
<p>RKBA is implemented in the M795 module in slots AB07, as detailed on sheet RK11-C-15. The M795 is a generic
WC/BA Unibus interfacing module. The BA part only covers 16 bits, but generates an overflow out "D15
RKBA=ALL 1 L".</p>
<p>EX MEM 01 and EX MEM 02 are maintained on the M239 module in slot A17, as detailed on sheet RK11-C-03. The
M239 is a 3x 4-bit counter/register module, so this also implements counting up these bits, when triggered
by "D15 RKBA = ALL 1 L".</p>
<p>Based on where we see the data on disk fall off (offset 2400) and the start PA (165400), I'm guessing we get
a false trigger on this "ALL 1" at RKBA 167777. So that looks like a false "1" detect on RKBA bit 12.</p>
<p>So I think the thing to do is to put the M795 out on an extender, load RKBA with 167777, and have a check at
E28 pin 5, and E34 pin 8!</p>
<p>And we leave the cliffhanger there, for now, at least until tomorrow evening. Because due to the way the
RK11-C is mounted, in order to do the above I'm going to have to spin the whole machine around (it's a dual
H960), extend the RK05's so there is room to physically climb in the back, rig a work light, and get on in
there...</p>
</blockquote>
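<p>The hypothesis can be played out in a small simulation. This is a toy model under stated assumptions, not a description of the actual M795/M239 logic: it assumes the bus address register steps by one word at a time, that the "ALL 1" detector looks at bits 15 through 1, and that each detection bumps the EX MEM counter. The <code>stuck_bits</code> parameter (a made-up name) marks bits the detector wrongly sees as 1.</p>

```python
def simulate_transfer(start, nbytes, stuck_bits=0):
    """Step a 16-bit bus address register word by word.

    Returns (final RKBA, EX MEM count, byte offset of the first word moved
    after a carry, or None if no carry happened).
    """
    rkba, ex_mem, first_bad = start, 0, None
    for _ in range(nbytes // 2):               # one 16-bit word per step
        all_ones = (rkba | stuck_bits | 1) == 0o177777
        rkba = (rkba + 2) & 0o177777
        if all_ones:
            ex_mem = (ex_mem + 1) & 3
            if first_bad is None:
                first_bad = (rkba - start) & 0o177777
    return rkba, ex_mem, first_bad

good = simulate_transfer(0o165400, 0o10400)
bad = simulate_transfer(0o165400, 0o10400, stuck_bits=0o10000)
print(oct(good[0]), good[1], good[2])      # 0o176000 0 None
print(oct(bad[0]), bad[1], oct(bad[2]))    # 0o176000 1 0o2400
```

<p>Under this model a false "1" on bit 12 reproduces both observations: RKBA ends at 176000 with EX MEM = 01, and the carry lands 02400 bytes into the transfer, which is where the data on disk falls off.</p>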
<h3>February 11</h3>
<blockquote>
<p>SUCCESS!!</p>
<p>Put the M795 out on an extender, loaded 167777 in RKBAR, and had a look around with a logic probe. Narrowed
it down to E34 (a 7430 8-input NAND). Pulled, socketed, replaced, and off she goes!</p>
<p>I can now successfully boot and run both V6 Unix and RSTS/E V06C from disk.</p>
<p><em>THAT</em> was a really fun and rewarding hunt :-) First message in the thread was back on Dec 30, 2018. Lots
of debugging and DRAM repairs, then the final long assault to this single, failed gate...</p>
<p>Thanks to all here for the help and resources, and particular shout-outs for Noel and Paul who gave
generously of their time and attention working through the densest bits, both on and off the list.</p>
<p>I predict a long happy weekend and a big power bill at the end of the month :-)</p>
</blockquote>
<p><img style="display:block; margin-left:auto; margin-right:auto" src="/images/pdp11/M795.png"
title="M795 WC/BA Module"/></p>
<p style="text-align: center;"><em>M795 module and the single failed gate</em></p></div>
<hr />
</div>
<div class='article'>
<div class="content-title">
<a href="https://fritzm.github.io/unix-v6-trouble-1.html"><h1>PDP-11/45: V6 Unix Troubleshooting</h1></a>
Sat 24 October 2020
by <a class="url fn" href="https://fritzm.github.io/author/fritz-mueller.html">Fritz Mueller</a>
</div>
<div><p><em>[A catch-up article, documenting discoveries of Jan/Feb 2019]</em></p>
<p>In early 2019, I made a V6 Unix pack from the Ken Wellsch tape image, as mentioned in <a href="https://fritzm.github.io/unix-and-ms11.html">this blog
entry</a>. It booted on my machine, but dumped core on the first <code>ls</code> in single-user
mode, or as soon as I did any heavy lifting in multi-user mode.</p>
<p>The following is the first part of a chronology of the troubleshooting campaign that took place over the next
month and a half, culminating in a smoking gun hardware fix and successful operation of V6 Unix on the
machine. This was largely a collaborative effort between Noel Chiappa and myself via direct email
correspondence, though help was received from others via the cctalk mailing list as well.</p>
<h3>January 8-9</h3>
<p>Initial experiments. Described the <code>ls</code> crashes to Noel. He theorizes that the reason <code>ls</code> works in one case and
crashes in another is that it lands in a different spot in memory in each case.</p>
<p>Luckily, a subsequent <code>od</code> on the core file does not crash, and a core file is successfully extracted:</p>
<div class="highlight"><pre><span></span>140004 000000 141710 141724
$DK
@rkunix
mem = 1035
RESTRICTED RIGHTS
Use, duplication or disclosure is subject to
restrictions stated in Contract with Western
Electric Company, Inc.
# LS
MEMORY FAULT -- CORE DUMPED
# OD CORE
0000000 141552 141562 000000 000000 000000 000000 000000 000000
0000020 000000
0000060 000000 000000 000000 000001 000000 000000 063260 140076
0000100 001700 000000 000104 066112 067543 062562 000000 000000
0000120 000000 000000 000000 060221 000567 067543 062562 000000
0000140 000000 000000 000000 000000 066112 000000 000020 000000
0000160 000000 000000 000000 000000 177701 000000 000020 000000
0000200 000000 000000 000000 000000 177701 041402 016006 000000
0000220 000000 000000 000000 000000 066016 041402 016006 000000
0000240 000000 000000 000000 000000 066016 075120 075120 075120
0000260 000000
0000300 000000 000000 000000 000000 000013 010400 001050 002366
0000320 000000 000104 000035 000024 000000 141732 141742 141664
0000340 141674 000000 000000 000000 000000 000000 000000 000000
0000360 000000
0000400 000000 000000 000000 000000 000000 000000 000012 000000
0000420 000000 000000 000000 141772 000000 000000 000000 000000
0000440 000000
0001500 000000 025334 003602 001236 025334 003602 002454 003602
0001520 063260 177716 000000 141542 016070 001176 000000 003602
0001540 063260 177716 000000 141562 016070 001176 066352 030300
0001560 063260 025334 003602 077572 000013 107564 141626 000512
0001600 000000 141604 141616 000300 074616 025334 003602 000217
0001620 000203 107404 020276 000512 000000 141634 141640 003602
0001640 000007 000135 107454 141662 014314 003602 066352 005674
0001660 000000 141712 013640 074616 000000 001000 000000 000000
0001700 001000 074616 063260 066352 000013 141726 023730 066352
0001720 063260 000000 000013 141742 023502 003602 000000 177760
0001740 000013 141756 022050 000013 000000 000000 000000 000034
0001760 000444 000031 177760 000000 030351 177770 010210 170010
0002000 000001 177777 177777 023436 023436 020264 000162 000262
0002020 000262 000202 000262 000256 000210 000262 000250 000262
0002040 000262 000216 000262 000262 000262 000262 000262 000224
0002060 000170 000234 000242 000003 100000 000144 040000 000142
0002100 020000 000143 000055 000001 000400 000162 000055 000001
0002120 000200 000167 000055 000002 004000 000163 000100 000170
0002140 000055 000001 000040 000162 000055 000001 000020 000167
0002160 000055 000002 002000 000163 000010 000170 000055 000001
0002200 000004 000162 000055 000001 000002 000167 000055 000001
0002220 000001 000170 000055 000001 010000 000164 000040 020066
0002240 020106 020116 020126 020142 020152 020162 020176 020206
0002260 020216 020226 000056 062457 061564 070057 071541 073563
0002300 000144 062457 061564 063457 067562 070165 005000 071445
0002320 005072 072000 072157 066141 022440 005144 022400 062065
0002340 000040 031045 020144 022400 033055 033056 000163 026445
0002360 062066 022400 062063 022454 062063 022400 071467 020000
0002400 026445 027067 071467 022440 032055 032056 020163 020000
0002420 026445 031061 030456 071462 000040 032045 020144 022400
0002440 005163 022400 030456 071464 000012 071445 072440 071156
0002460 060545 060544 066142 005145 022400 020163 067556 020164
0002500 067546 067165 005144 000000 003750 000144 004076 000157
0002520 004070 000170 004172 000146 004210 000145 004026 000143
0002540 004044 000163 003764 000154 004226 000162 000000 000000
0002560 177774 177760 177775 177770 104404 022376 000000 104405
0002600 000000 000000 104403 000000 001000 104405 000000 000000
0002620 104421 000000 023436 104423 000000 000000 104422 000000
0002640 000000 000037 000034 000037 000036 000037 000036 000037
0002660 000037 000036 000037 000036 000037 043120 020712 020716
0002700 000001 000005 000515 000072 000457 051505 000124 042105
0002720 000124 060504 020171 067515 020156 030060 030040 035060
0002740 030060 030072 020060 034461 030060 000012 072523 046556
0002760 067157 072524 053545 062145 064124 043165 064562 060523
0003000 000164 060512 043156 061145 060515 040562 071160 060515
0003020 045171 067165 072512 040554 063565 062523 047560 072143
0003040 067516 042166 061545 000000 000000 000000 000000 000000
0003060 000000
0010060 000000 000020 000001 177770 177774 177777 071554 000000
0010100
#
</pre></div>
<p>Noel prepares to analyze the core file (block quotes here and further below taken from email correspondence):</p>
<blockquote>
<p>I just checked, and the binary for the 'ls' command is what's called 'pure code'; i.e. the instructions are
in a separate (potentially shared) block of memory from the process' data (un-shared).</p>
</blockquote>
<hr>
<blockquote>
<p>On another front, that error message ("Memory error") is produced when a process gets a 'memory management
trap' (trap to 0250). This could be caused by any number of things (it's a pity we don't know the contents
of SR0 when the trap happened, that would tell us exactly what the cause was).</p>
</blockquote>
<hr>
<blockquote>
<p>[Memory management registers in the core dump] are 'prototypes', later modified for actual use by adding in
the actual address in main memory. Still trying to understand how that works - the code (in sureg() in
main.c) is kind of obscure.</p>
</blockquote>
<h3>January 10-24</h3>
<p>Further communication with Noel and the cctalk list raises some suspicion about the memory in my machine.
Though I had done spot checks and repairs on this in the past, which had been sufficient to pass most MAINDEC
diagnostics and to boot and run RT11, in fact the memory had not yet been exhaustively tested.</p>
<p>Over the course of some days, memory test codes are developed and run, and several additional failed DRAMs in
the MS11 memory system are isolated and repaired. These efforts have previously been reported in detail in
<a href="https://fritzm.github.io/unix-and-ms11.html">this blog entry</a>.</p>
<p>After these repairs, the MAINDEC MS11 memory diagnostics and KT11-C MMU diagnostics, both of which are beastly
and exhaustive, are found to pass robustly with one caveat: memory parity tests. A deep-dive into the design
and implementation of memory parity on the PDP-11/45 follows. At the end it is concluded that the machine, a
very early serial no. in its line, is in fact functioning per-design. These efforts are documented in <a href="https://fritzm.github.io/parity-handling.html">this
blog entry</a>.</p>
<p>Even though the memory system looks solid after this, the V6 Unix crash behavior remains exactly the same...</p>
<h3>January 27-29</h3>
<p>With the KT11 and memory now verified, Noel takes up the core dump again:</p>
<blockquote>
<p>The problem is that Unix does not save enough info in the core dump for me to thoroughly diagnose the MM
fault; e.g. 'ls' is a 'pure text' program/command, and the code's not included in the core dump (in normal
operation, there's no need/use for it), so I don't have the code that was running at the time, just the data
and swappable per-process kernel data - which is not all the per-process data, e.g. it doesn't include the
location of the process's code and data segments in main memory.</p>
<p>Also, I'll look at the V6 code that sets up the KT11 registers to make sure I understand what it's doing.
(The dump contains the 'prototype' for those contents, but the values are modified, by adding the actual
memory location, before being stored in the KT11.)</p>
</blockquote>
<hr>
<blockquote>
<p>I did find out that the PC at the time of the segmentation fault was 010210, which I thought looked awfully
big (so I was wondering if somehow it went crazy), but in fact the text size is 010400, so it's just inside
the pure text.</p>
</blockquote>
<p>We agree to use
<a href="https://en.wikipedia.org/wiki/Lions%27_Commentary_on_UNIX_6th_Edition,_with_Source_Code"><em>Lions</em></a> as a common
reference point for detailed discussion of the loading and running of "ls" and what may be seen in the core
dump.</p>
<h3>January 30</h3>
<p>Noel:</p>
<blockquote>
<p>So, a bit more from my examination of the swappable per-process kernel data (the 'user' structure - not sure
how much of a Unix internals person you are).</p>
<p>It gives the following for the text, data and stack sizes:</p>
<div class="highlight"><pre><span></span>tsize 000104
dsize 000035
ssize 000024
</pre></div>
<p>which seems reasonable/correct, because looking at the header for 'ls' we see:</p>
<div class="highlight"><pre><span></span>000410 010400 001050 002366 000000 000000 000000 000001
</pre></div>
<p>'0410' says it's pure text, non-split; the 010400 is the text size, which matches (those sizes above are in
'clicks', i.e. the 0100 byte quantum used in the PDP-11 memory management).</p>
<p>The data size also appears to be correct:</p>
<div class="highlight"><pre><span></span>001050 (initialized)
002366 (BSS)
------
003436
</pre></div>
<p>which again matches (round up and divide by 0100).</p>
<p>I have yet to dig around through the system sources and see what the initial stack allocation is, to see if
that's reasonable (of course, it may have been extended during execution).</p>
<p>And here are the 'prototype' segmentation register contents:</p>
<div class="highlight"><pre><span></span>UISA 000000 000020 000000 000000 000000 000000 000000 177701
UDSA 000000 000020 000000 000000 000000 000000 000000 177701
UISD 041402 016006 000000 000000 000000 000000 000000 066016
UDSD 041402 016006 000000 000000 000000 000000 000000 066016
</pre></div>
<p>Since it's not split, the D-space ones are clones of the I-space (which is what the code does - I don't
think it turns user D off and on, depending on what the process has: I'd have made context switching faster
by not having to set up the D-space registers for non-split processes, but I guess the extra overhead is
pretty minimal).</p>
<p>I have yet to check all the contents to make sure they look good, but the U?SA registers look OK; the '020'
is for the data, and that's kept contiguous with the 'user' area, so the '020' is to offset past that.</p>
<p>The PC at fault time of 010210 seems to point to the following code (assuming what was in main memory was
actually the same as the binary on the disk):</p>
<div class="highlight"><pre><span></span> mov r4,r0
jmp 10226
210: mov r5,r0
mov sp,r5
</pre></div>
<p>We don't have SSR2, which points to the failing instruction, and I forget whether the saved PC on an MMU
fault points to the failing instruction, or the next one; I'm going to assume the latter.</p>
<p>But either way, this is very puzzling, because I don't see an instruction there that could have gotten an
MMU fault! The jump is to a location within the text segment (albeit at the end), and everything else is
just register-register moves!</p>
<p>And how could the fault depend on the location in main memory?!?!</p>
<p>If you want to poke around in the core dump yourself, to verify that I haven't made a mistake, see this
page:</p>
<p><a href="http://gunkies.org/wiki/Unix_V6_dump_analysis">http://gunkies.org/wiki/Unix_V6_dump_analysis</a></p>
<p>which gives useful offsets. (The ones in the user table I verified by writing a short program which did
things like 'printf("%o", &0->u_uisa)', and the data at those locations looks like what should be there, so
I'm pretty sure that table is good. For the other one, core(5) (in the V6 man pages) gives the register
offsets (albeit in a different form), so you can check that I worked them out correctly.)</p>
<p>Two things you could try to get rid of potential pattern sensitivities: before doing the 'ls', say 'sleep
360 &' first; that running in the background <em>should</em> cause the 'ls' to be loaded and run from a different
address in main memory. The other thing you could try is 'cp /bin/ls xls' and then 'xls', to load the
command from a different disk location. (Both of these assume that you don't get another fault, of course!)</p>
</blockquote>
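<p>The size checks in the message above come down to rounding byte counts up to 0100-byte clicks. A minimal sketch, using the header words quoted there (the helper name <code>to_clicks</code> is made up for illustration):</p>

```python
CLICK = 0o100   # bytes per click

def to_clicks(nbytes):
    """Round a byte count up to whole clicks."""
    return (nbytes + CLICK - 1) // CLICK

header = [0o410, 0o10400, 0o1050, 0o2366, 0, 0, 0, 1]
magic, text, data, bss = header[:4]

print(oct(magic))                    # 0o410: pure text, non-split
print(oct(to_clicks(text)))          # 0o104, matches tsize
print(oct(data + bss))               # 0o3436
print(oct(to_clicks(data + bss)))    # 0o35, matches dsize
```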
<hr>
<blockquote>
<p>[Initial stack size] is 20. clicks, which is what it still is (024 clicks) in the process core dump, so
the stack has <em>not</em> been extended. So any MM fault you see after starting 'ls' will <em>probably</em> be the one
that's causing the process to blow out.</p>
</blockquote>
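<p>The same sizes can be read back out of the 'prototype' descriptor registers quoted earlier. The sketch below assumes the KT11-C page descriptor layout as I understand it (page length field in bits 14-8, expand-down flag in bit 3, access code in bits 2-0); the function name is hypothetical.</p>

```python
def decode_pdr(pdr):
    """Return (length in clicks, expands_down, access code) for a PDR."""
    plf = (pdr >> 8) & 0o177          # page length field
    expands_down = bool(pdr & 0o10)   # set for stack-style pages
    acf = pdr & 0o7                   # access control field
    length = (128 - plf) if expands_down else (plf + 1)
    return length, expands_down, acf

for name, pdr in [("text", 0o41402), ("data", 0o16006), ("stack", 0o66016)]:
    length, down, acf = decode_pdr(pdr)
    print(name, oct(length), down, oct(acf))
# text 0o104 False 0o2
# data 0o35 False 0o6
# stack 0o24 True 0o6
```

<p>The three lengths agree with tsize, dsize and ssize from the 'user' structure, and the stack length of 024 clicks is the unextended initial allocation.</p>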
<hr>
<blockquote>
<p>I tried to re-create that exact version of the 'ls' binary, because the one in the distro is stripped, and I
wanted one with symbols to look at. I failed, because a library routine (for dates) has changed on my
machine, see here:</p>
<p><a href="http://www.chiappa.net/~jnc/tech/V6Unix.html#Issues">http://www.chiappa.net/~jnc/tech/V6Unix.html#Issues</a></p>
<p>However, I did verify that the binary for ls.o is identical to what I can produce (using the -O flag). It's
just that library routine which is different. I don't think it's worth backing out my library; I did manage
to hand-produce a stub of the symbol table for where the error is happening in the old 'ls' binary:</p>
<div class="highlight"><pre><span></span>010210T csv
010226T cret
010244T cerror
010262T _ldiv
010304T _lrem
010324T _dpadd
</pre></div>
<p>The fault does indeed seem to be happening at either the last instruction in the previous routine (ct_year,
in ctime.c), or the first of csv.</p>
<p>(I should explain that PDP-11 C uses two small chunks of code, CSV and CRET, to construct and take down
stack frames on procedure entry and exit. So on exit from <em>any</em> C procedure, the last instruction is always
a PC-relative jump to CRET.)</p>
<p>It looks like that's what's blowing up - but it apparently works with the command at a different location in
main memory! So it pretty much has to be a pattern sensitivity.</p>
<p>However, I think the KT11 does the bounds checking <em>before</em> it does the relocation - the bounds checking is
done on virtual, un-relocated addresses. So <em>that</em> part of it <em>should</em> be the same for both locations! So
here's my analysis:</p>
<p>Is it actually an indexed jump that's blowing up? I've been looking at the command binary, but that might
not be what's in main memory. Or the CPU might be looking somewhere else (because of a KT error). (If we
don't find the problem soon, we might want to put in that breakpoint so we can look in main memory and see
what inst is actually at the location where SSR2 says the failing inst was; that can rule out a whole bunch
of potential causes in one go - e.g. RK11 errors.)</p>
<p>If it is actually that jump that's failing - how? The PC hasn't been updated yet, so it can't be the fetch
of the next instruction that's failing. Is the fetch of the index word producing the MM fault?</p>
</blockquote>
<p>Fritz:</p>
<blockquote>
<p>It occurs to me that we don't even <em>really</em> know if the fault occurs from the same address every time, since
we have a core sample size of 1; I should duplicate the fail and extract another core file to compare.</p>
</blockquote>
<hr>
<blockquote>
<p>Another thing I thought I might try tonight: deposit a trap catcher in the memory mgmt trap location from
the front panel, just before issuing the 'ls' command. I can then check the PSW, PC, SP, and KT11 regs
right at the time of fault.</p>
</blockquote>
<p>Experiments begin from the front panel, and continue on into the early hours, producing:</p>
<p>Core #2:</p>
<div class="highlight"><pre><span></span>140004 000000 141710 141724
$DK
@rkunix
mem = 1035
RESTRICTED RIGHTS
Use, duplication or disclosure is subject to
restrictions stated in Contract with Western
Electric Company, Inc.
# RM CORE
# LS
MEMORY FAULT -- CORE DUMPED