Why does my Linux server keep rebooting?

I've got a VA Linux Systems 2200 server (dual P3/600, 1 GB RAM) running Ubuntu, and for some reason it keeps rebooting. I don't have many details: I can hear it happen (the floppy drive seeks at POST), but I've never seen the screen when it occurs. I can't pinpoint a pattern or a cause. I'm not sure whether it's hardware (dying) or software (a cron job, etc.).

Here's what I can tell you: the bottom chunk of the syslog file (more available if required).

Jul 30 00:17:54 floyd kernel: [42949379.870000] 00:0b: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
Jul 30 00:17:54 floyd kernel: [42949379.870000] mice: PS/2 mouse device common for all mice
Jul 30 00:17:54 floyd kernel: [42949379.870000] RAMDISK driver initialized: 16 RAM disks of 65536K size 1024 blocksize
Jul 30 00:17:54 floyd kernel: [42949379.870000] Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
Jul 30 00:17:54 floyd kernel: [42949379.870000] ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Jul 30 00:17:54 floyd kernel: [42949379.880000] PNP: PS/2 Controller [PNP0303:KBC,PNP0f13:MOUS] at 0x60,0x64 irq 1,12
Jul 30 00:17:54 floyd kernel: [42949379.880000] serio: i8042 AUX port at 0x60,0x64 irq 12
Jul 30 00:17:54 floyd kernel: [42949379.880000] serio: i8042 KBD port at 0x60,0x64 irq 1
Jul 30 00:17:54 floyd kernel: [42949379.880000] EISA: Probing bus 0 at eisa.0
Jul 30 00:17:54 floyd kernel: [42949379.880000] Cannot allocate resource for EISA slot 1
Jul 30 00:17:54 floyd kernel: [42949379.880000] Cannot allocate resource for EISA slot 2
Jul 30 00:17:54 floyd kernel: [42949379.880000] EISA: Detected 0 cards.
Jul 30 00:17:54 floyd kernel: [42949379.880000] TCP bic registered
Jul 30 00:17:54 floyd kernel: [42949379.880000] NET: Registered protocol family 1
Jul 30 00:17:54 floyd kernel: [42949379.880000] NET: Registered protocol family 8
Jul 30 00:17:54 floyd kernel: [42949379.880000] NET: Registered protocol family 20
Jul 30 00:17:54 floyd kernel: [42949379.880000] Starting balanced_irq
Jul 30 00:17:54 floyd kernel: [42949379.880000] Using IPI No-Shortcut mode
Jul 30 00:17:54 floyd kernel: [42949379.880000] ACPI: (supports S0 S1 S4 S5)
Jul 30 00:17:54 floyd kernel: [42949379.880000] Freeing unused kernel memory: 312k freed
Jul 30 00:17:54 floyd kernel: [42949379.910000] input: AT Translated Set 2 keyboard as /class/input/input0
Jul 30 00:17:54 floyd kernel: [42949381.150000] Capability LSM initialized
Jul 30 00:17:54 floyd kernel: [42949382.700000] SCSI subsystem initialized
Jul 30 00:17:54 floyd kernel: [42949382.720000] ACPI: PCI Interrupt 0000:00:0c.0[A] -> GSI 19 (level, low) -> IRQ 169
Jul 30 00:17:54 floyd kernel: [42949397.940000] scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
Jul 30 00:17:54 floyd kernel: [42949397.940000]         <Adaptec aic7896/97 Ultra2 SCSI adapter>
Jul 30 00:17:54 floyd kernel: [42949397.940000]         aic7896/97: Ultra2 Wide Channel A, SCSI Id=7, 32/253 SCBs
Jul 30 00:17:54 floyd kernel: [42949397.940000]
Jul 30 00:17:54 floyd kernel: [42949397.950000] ACPI: PCI Interrupt 0000:00:0c.1[A] -> GSI 19 (level, low) -> IRQ 169
Jul 30 00:17:54 floyd kernel: [42949413.170000] scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
Jul 30 00:17:54 floyd kernel: [42949413.170000]         <Adaptec aic7896/97 Ultra2 SCSI adapter>
Jul 30 00:17:54 floyd kernel: [42949413.170000]         aic7896/97: Ultra2 Wide Channel B, SCSI Id=7, 32/253 SCBs
Jul 30 00:17:54 floyd kernel: [42949413.170000]
Jul 30 00:17:54 floyd kernel: [42949413.380000] PIIX4: IDE controller at PCI slot 0000:00:12.1
Jul 30 00:17:54 floyd kernel: [42949413.380000] PIIX4: chipset revision 1
Jul 30 00:17:54 floyd kernel: [42949413.380000] PIIX4: not 100%% native mode: will probe irqs later
Jul 30 00:17:54 floyd kernel: [42949413.380000]     ide0: BM-DMA at 0x2860-0x2867, BIOS settings: hda:DMA, hdb:pio
Jul 30 00:17:54 floyd kernel: [42949413.380000]     ide1: BM-DMA at 0x2868-0x286f, BIOS settings: hdc:DMA, hdd:pio
Jul 30 00:17:54 floyd kernel: [42949413.380000] Probing IDE interface ide0...
Jul 30 00:17:54 floyd kernel: [42949413.690000] hda: WDC WD400BB-00DEA0, ATA DISK drive
Jul 30 00:17:54 floyd kernel: [42949414.420000] ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Jul 30 00:17:54 floyd kernel: [42949414.420000] Probing IDE interface ide1...
Jul 30 00:17:54 floyd kernel: [42949415.200000] hdc: CD-540E, ATAPI CD/DVD-ROM drive
Jul 30 00:17:54 floyd kernel: [42949415.920000] ide1 at 0x170-0x177,0x376 on irq 15
Jul 30 00:17:54 floyd kernel: [42949415.950000] hda: max request size: 128KiB
Jul 30 00:17:54 floyd kernel: [42949416.080000] hda: 78165360 sectors (40020 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(33)
Jul 30 00:17:54 floyd kernel: [42949416.080000] hda: cache flushes not supported
Jul 30 00:17:54 floyd kernel: [42949416.080000]  hda: hda1 hda2 < hda5 >
Jul 30 00:17:54 floyd kernel: [42949416.110000] hdc: ATAPI 40X CD-ROM drive, 128kB Cache, UDMA(33)
Jul 30 00:17:54 floyd kernel: [42949416.110000] Uniform CD-ROM driver Revision: 3.20
Jul 30 00:17:54 floyd kernel: [42949416.560000] usbcore: registered new driver usbfs
Jul 30 00:17:54 floyd kernel: [42949416.560000] usbcore: registered new driver hub
Jul 30 00:17:54 floyd kernel: [42949416.560000] USB Universal Host Controller Interface driver v3.0
Jul 30 00:17:54 floyd kernel: [42949416.560000] ACPI: PCI Interrupt 0000:00:12.2[D] -> GSI 21 (level, low) -> IRQ 177
Jul 30 00:17:54 floyd kernel: [42949416.560000] uhci_hcd 0000:00:12.2: UHCI Host Controller
Jul 30 00:17:54 floyd kernel: [42949416.560000] uhci_hcd 0000:00:12.2: new USB bus registered, assigned bus number 1
Jul 30 00:17:54 floyd kernel: [42949416.560000] uhci_hcd 0000:00:12.2: irq 177, io base 0x00002840
Jul 30 00:17:54 floyd kernel: [42949416.560000] usb usb1: configuration #1 chosen from 1 choice
Jul 30 00:17:54 floyd kernel: [42949416.560000] hub 1-0:1.0: USB hub found
Jul 30 00:17:54 floyd kernel: [42949416.560000] hub 1-0:1.0: 2 ports detected
Jul 30 00:17:54 floyd kernel: [42949416.790000] Attempting manual resume
Jul 30 00:17:54 floyd kernel: [42949416.830000] EXT3-fs: INFO: recovery required on readonly filesystem.
Jul 30 00:17:54 floyd kernel: [42949416.830000] EXT3-fs: write access will be enabled during recovery.
Jul 30 00:17:54 floyd kernel: [42949417.340000] kjournald starting.  Commit interval 5 seconds
Jul 30 00:17:54 floyd kernel: [42949417.340000] EXT3-fs: recovery complete.
Jul 30 00:17:54 floyd kernel: [42949417.340000] EXT3-fs: mounted filesystem with ordered data mode.
Jul 30 00:17:54 floyd kernel: [42949428.680000] Linux agpgart interface v0.101 (c) Dave Jones
Jul 30 00:17:54 floyd kernel: [42949428.690000] agpgart: Detected an Intel 440GX Chipset.
Jul 30 00:17:54 floyd kernel: [42949428.690000] agpgart: AGP aperture is 64M @ 0xf8000000
Jul 30 00:17:54 floyd kernel: [42949429.360000] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
Jul 30 00:17:54 floyd kernel: [42949429.380000] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
Jul 30 00:17:54 floyd kernel: [42949429.590000] input: PC Speaker as /class/input/input1
Jul 30 00:17:54 floyd kernel: [42949429.780000] piix4_smbus 0000:00:12.3: Found 0000:00:12.3 device
Jul 30 00:17:54 floyd kernel: [42949430.080000] Floppy drive(s): fd0 is 1.44M
Jul 30 00:17:54 floyd kernel: [42949430.110000] input: ImPS/2 Generic Wheel Mouse as /class/input/input2
Jul 30 00:17:54 floyd kernel: [42949430.170000] e100: Intel(R) PRO/100 Network Driver, 3.5.10-k2-NAPI
Jul 30 00:17:54 floyd kernel: [42949430.170000] e100: Copyright(c) 1999-2005 Intel Corporation
Jul 30 00:17:54 floyd kernel: [42949430.170000] ACPI: PCI Interrupt 0000:00:0e.0[A] -> GSI 21 (level, low) -> IRQ 177
Jul 30 00:17:54 floyd kernel: [42949430.190000] e100: eth0: e100_probe: addr 0xf4102000, irq 177, MAC addr 00:D0:B7:88:74:91
Jul 30 00:17:54 floyd kernel: [42949430.360000] FDC 0 is a National Semiconductor PC87306
Jul 30 00:17:54 floyd kernel: [42949430.410000] parport: PnPBIOS parport detected.
Jul 30 00:17:54 floyd kernel: [42949430.410000] parport0: PC-style at 0x378 (0x778), irq 7, dma 3 [PCSPP,TRISTATE,COMPAT,ECP,DMA]
Jul 30 00:17:54 floyd kernel: [42949430.530000] ts: Compaq touchscreen protocol output
Jul 30 00:17:54 floyd kernel: [42949430.640000] e100: eth0: e100_watchdog: link up, 100Mbps, full-duplex
Jul 30 00:17:54 floyd kernel: [42949430.940000] NET: Registered protocol family 10
Jul 30 00:17:54 floyd kernel: [42949430.940000] lo: Disabled Privacy Extensions
Jul 30 00:17:54 floyd kernel: [42949430.940000] IPv6 over IPv4 tunneling driver
Jul 30 00:17:54 floyd kernel: [42949431.270000] lp0: using parport0 (interrupt-driven).
Jul 30 00:17:54 floyd kernel: [42949431.370000] Adding 1614492k swap on /dev/disk/by-uuid/b630a92e-9618-46e2-9c5c-a13496081940.  Priority:-1 extents:1 across:1614492k
Jul 30 00:17:54 floyd kernel: [42949431.490000] EXT3 FS on hda1, internal journal
Jul 30 00:17:54 floyd kernel: [42949434.260000] ACPI: Power Button (FF) [PWRF]
Jul 30 00:17:54 floyd kernel: [42949434.630000] ibm_acpi: ec object not found
Jul 30 00:17:54 floyd kernel: [42949434.760000] pcc_acpi: loading...
Jul 30 00:17:59 floyd named[3847]: starting BIND 9.3.2 -u bind
Jul 30 00:17:59 floyd named[3847]: found 2 CPUs, using 2 worker threads
Jul 30 00:17:59 floyd named[3847]: loading configuration from '/etc/bind/named.conf'
Jul 30 00:17:59 floyd named[3847]: listening on IPv4 interface lo, 127.0.0.1#53
Jul 30 00:17:59 floyd named[3847]: listening on IPv4 interface eth0, 172.16.1.201#53
Jul 30 00:17:59 floyd named[3847]: command channel listening on 127.0.0.1#953
Jul 30 00:17:59 floyd named[3847]: command channel listening on ::1#953
Jul 30 00:17:59 floyd named[3847]: zone 0.in-addr.arpa/IN: loaded serial 1
Jul 30 00:17:59 floyd named[3847]: zone 127.in-addr.arpa/IN: loaded serial 1
Jul 30 00:17:59 floyd named[3847]: zone 255.in-addr.arpa/IN: loaded serial 1
Jul 30 00:17:59 floyd named[3847]: zone ooglenetworks.com/IN: loaded serial 2007062719
Jul 30 00:17:59 floyd named[3847]: zone localhost/IN: loaded serial 1
Jul 30 00:17:59 floyd named[3847]: zone ooglenetworks.com/IN: sending notifies (serial 2007062719)
Jul 30 00:17:59 floyd named[3847]: running
Jul 30 00:17:59 floyd kernel: [42949441.610000] eth0: no IPv6 routers present
Jul 30 00:18:00 floyd hpiod: 1.6.9 accepting connections at 2208...
Jul 30 00:18:03 floyd spamd[3930]: logger: removing stderr method
Jul 30 00:18:03 floyd kernel: [42949445.440000] apm: BIOS not found.
Jul 30 00:18:05 floyd spamd[3934]: rules: meta test DIGEST_MULTIPLE has undefined dependency 'DCC_CHECK'
Jul 30 00:18:06 floyd spamd[3934]: spamd: server started on port 783/tcp (running version 3.1.7-deb)
Jul 30 00:18:06 floyd spamd[3934]: spamd: server pid: 3934
Jul 30 00:18:06 floyd spamd[3934]: spamd: server successfully spawned child process, pid 3992
Jul 30 00:18:06 floyd spamd[3934]: spamd: server successfully spawned child process, pid 3993
Jul 30 00:18:06 floyd spamd[3934]: prefork: child states: II
Jul 30 00:18:12 floyd hcid[4271]: Bluetooth HCI daemon
Jul 30 00:18:12 floyd kernel: [42949454.000000] Bluetooth: Core ver 2.8
Jul 30 00:18:12 floyd kernel: [42949454.000000] NET: Registered protocol family 31
Jul 30 00:18:12 floyd kernel: [42949454.000000] Bluetooth: HCI device and connection manager initialized
Jul 30 00:18:12 floyd kernel: [42949454.000000] Bluetooth: HCI socket layer initialized
Jul 30 00:18:12 floyd kernel: [42949454.040000] Bluetooth: L2CAP ver 2.8
Jul 30 00:18:12 floyd kernel: [42949454.040000] Bluetooth: L2CAP socket layer initialized
Jul 30 00:18:12 floyd kernel: [42949454.060000] Bluetooth: RFCOMM socket layer initialized
Jul 30 00:18:12 floyd sdpd[4275]: Bluetooth SDP daemon
Jul 30 00:18:12 floyd kernel: [42949454.060000] Bluetooth: RFCOMM TTY layer initialized
Jul 30 00:18:12 floyd kernel: [42949454.060000] Bluetooth: RFCOMM ver 1.7
Jul 30 00:18:12 floyd hcid[4271]: Register path:/org/bluez fallback:1
Jul 30 00:18:12 floyd anacron[4319]: Anacron 2.3 started on 2007-07-30
Jul 30 00:18:12 floyd anacron[4319]: Will run job `cron.daily' in 5 min.
Jul 30 00:18:12 floyd anacron[4319]: Jobs will be executed sequentially
Jul 30 00:18:12 floyd /usr/sbin/cron[4345]: (CRON) INFO (pidfile fd = 3)
Jul 30 00:18:12 floyd /usr/sbin/cron[4346]: (CRON) STARTUP (fork ok)
Jul 30 00:18:12 floyd /usr/sbin/cron[4346]: (CRON) INFO (Running @reboot jobs)
Jul 30 00:23:01 floyd /USR/SBIN/CRON[4527]: (mail) CMD (  if [ -x /usr/lib/exim/exim3 -a -f /etc/exim/exim.conf ]; then /usr/lib/exim/exim3 -q ; fi)
Jul 30 00:23:12 floyd anacron[4319]: Job `cron.daily' started
Jul 30 00:23:12 floyd anacron[4532]: Updated timestamp for job `cron.daily' to 2007-07-30
Jul 30 00:23:59 floyd exiting on signal 15



The server is running email (Exim) and web services (Apache) - nothing fancy going on. I'm no Linux guru, but no newbie either - somewhere in between. I'd appreciate it if anybody could steer me in the right direction.

j



Asked by: jkittle99
 
m1tk4 Commented:
I've seen a problem like this that was eventually tracked down to lm_sensors reporting incorrect (too high) CPU temperatures and the kernel shutting down because of it. Incidentally, that machine had a VIA chipset.

However, there could be 100s of other reasons - please provide /var/log/messages and /var/log/boot.log immediately BEFORE the reboot.
 
jkittle99 (Author) Commented:
I just set up syslog to send its output to an external server; hopefully this will help pinpoint the problem.

messages (see bottom)

Jul 30 21:16:18 floyd kernel: [42949433.730000] ACPI: Power Button (FF) [PWRF]
Jul 30 21:16:18 floyd kernel: [42949434.210000] pcc_acpi: loading...
Jul 30 21:16:22 floyd hpiod: 1.6.9 accepting connections at 2208...
Jul 30 21:16:26 floyd kernel: [42949443.750000] apm: BIOS not found.
Jul 30 21:16:34 floyd kernel: [42949452.340000] Bluetooth: Core ver 2.8
Jul 30 21:16:34 floyd kernel: [42949452.340000] NET: Registered protocol family 31
Jul 30 21:16:34 floyd kernel: [42949452.340000] Bluetooth: HCI device and connection manager initialized
Jul 30 21:16:34 floyd kernel: [42949452.340000] Bluetooth: HCI socket layer initialized
Jul 30 21:16:34 floyd kernel: [42949452.370000] Bluetooth: L2CAP ver 2.8
Jul 30 21:16:34 floyd kernel: [42949452.370000] Bluetooth: L2CAP socket layer initialized
Jul 30 21:16:34 floyd kernel: [42949452.390000] Bluetooth: RFCOMM socket layer initialized
Jul 30 21:16:34 floyd kernel: [42949452.390000] Bluetooth: RFCOMM TTY layer initialized
Jul 30 21:16:34 floyd kernel: [42949452.390000] Bluetooth: RFCOMM ver 1.7
Jul 30 21:23:57 floyd exiting on signal 15
Jul 30 21:23:58 floyd syslogd 1.4.1#18ubuntu6: restart.
Jul 30 21:26:28 floyd exiting on signal 15
Jul 30 21:26:29 floyd syslogd 1.4.1#18ubuntu6: restart.
Jul 30 21:26:29 floyd kernel: [42950046.910000] process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT


boot

Jul 30 17:16:11 rcS:  * Reading files needed to boot...                  [ ok ]
Jul 30 17:16:12 rcS:  * Setting preliminary keymap...                    [ ok ]
Jul 30 17:16:12 rcS:  * Starting basic networking...                     [ ok ]
Jul 30 17:16:12 rcS:  * Starting kernel event manager...                 [ ok ]
Jul 30 21:16:12 rcS:  * Loading hardware drivers...                      [ ok ]
Jul 30 21:16:13 rcS:  * Loading manual drivers...                        [ ok ]
Jul 30 21:17:01 rcS:  * Mounting local filesystems...                    [ ok ]
Jul 30 21:17:01 rcS:  * Activating swapfile swap...                      [ ok ]
Jul 30 21:17:01 rcS:  * Configuring network interfaces...                [ ok ]
Jul 30 21:17:02 rcS:  * Setting up console keymap...                     [ ok ]
Jul 30 21:16:16 rc2:  * Loading ACPI modules...                          [ ok ]
Jul 30 21:16:17 rc2:  * Starting ACPI services...                        [ ok ]
Jul 30 21:16:17 rc2:  * Starting system log...                           [ ok ]
Jul 30 21:16:17 rc2:  * Starting kernel log...                           [ ok ]
Jul 30 21:16:19 rc2:  * Starting GNOME Display Manager...                [ ok ]
Jul 30 21:16:22 rc2:  * Starting domain name service...                  [ ok ]
Jul 30 21:16:22 rc2:  * Starting Common Unix Printing System: cupsd      [ ok ]
Jul 30 21:16:23 rc2:  * Starting HP Linux Printing and Imaging System    [ ok ]
Jul 30 21:16:26 rc2: Starting SpamAssassin Mail Filter Daemon: spamd.
Jul 30 21:16:26 rc2:  * Starting system message bus dbus                 [ ok ]
Jul 30 21:16:31 rc2:  * Starting Hardware abstraction layer hald         [ ok ]
Jul 30 21:16:32 rc2:  * Starting System Tools Backends system-tools-backe[ ok ]
Jul 30 21:16:33 rc2: Starting MTA: exim4.
Jul 30 21:16:33 rc2: Starting pop daemon: popa3d.
Jul 30 21:16:33 rc2:  * Starting powernowd...                                   /etc/rc2.d/S20powernowd: 156: cannot create /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: Directory nonexistent
Jul 30 21:16:33 rc2:  * CPU frequency scaling not supported
Jul 30 21:16:33 rc2:                                                     [ ok ]
Jul 30 21:16:34 rc2:  * Starting Samba daemons...                        [ ok ]
Jul 30 21:16:34 rc2:  * Starting OpenBSD Secure Shell server...          [ ok ]
Jul 30 21:16:34 rc2:  * Starting Bluetooth services                      [ ok ]
Jul 30 21:16:34 rc2:  * Starting anac(h)ronistic cron: anacron           [ ok ]
Jul 30 21:16:34 rc2:  * Starting deferred execution scheduler atd        [ ok ]
Jul 30 21:16:35 rc2:  * Starting periodic command scheduler...           [ ok ]
Jul 30 21:16:35 rc2:  * Enabling additional executable binary formats bin[ ok ] port
Jul 30 21:16:35 rc2:  * Starting apache 2.0 web server...                       apache2: Could not determine the server's fully qualified domain name, using 127.0.0.1 for ServerName
Jul 30 21:16:35 rc2:                                                     [ ok ]
Jul 30 21:16:35 rc2:  * Checking battery state...                        [ ok ]
Jul 30 21:16:35 rc2:  * Running local boot scripts (/etc/rc.local)       [ ok ]


Nothing jumping out at me, but then again, that's why I'm asking :)
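
A rough way to separate real reboots from plain syslogd restarts in a messages file like the one above is to watch the bracketed kernel counter: it keeps climbing while the machine stays up and drops back only after a boot. The Python sketch below assumes kernel lines carry that `[seconds.fraction]` counter, as they do in this thread. `find_reboots` and the "clean"/"unclean" labels are illustrative names, not part of any existing tool, and the last sample line is made up to show a counter drop.

```python
import re

# Matches the bracketed kernel counter, e.g. "kernel: [42949452.390000]".
KERNEL_TS = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def find_reboots(lines):
    """Return (line_number, kind) for each point where the kernel counter drops.

    kind is "clean" if syslogd logged "exiting on signal 15" just before the
    drop (an orderly shutdown), otherwise "unclean" (reset, panic, power loss).
    """
    reboots = []
    last_ts = None
    sigterm_pending = False
    for number, line in enumerate(lines, start=1):
        if "exiting on signal 15" in line:
            sigterm_pending = True
            continue
        if "syslogd" in line and "restart" in line:
            # A syslogd restart line alone says nothing about a reboot.
            continue
        match = KERNEL_TS.search(line)
        if match:
            ts = float(match.group(1))
            if last_ts is not None and ts < last_ts:
                kind = "clean" if sigterm_pending else "unclean"
                reboots.append((number, kind))
            last_ts = ts
        sigterm_pending = False
    return reboots


# Four lines taken from the messages excerpt in this thread.
SAMPLE = [
    "Jul 30 21:16:34 floyd kernel: [42949452.390000] Bluetooth: RFCOMM ver 1.7",
    "Jul 30 21:23:57 floyd exiting on signal 15",
    "Jul 30 21:23:58 floyd syslogd 1.4.1#18ubuntu6: restart.",
    "Jul 30 21:26:29 floyd kernel: [42950046.910000] process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT",
]

# Hypothetical line (not from the thread): the counter falls back, as after a boot.
HYPOTHETICAL_BOOT_LINE = (
    "Jul 30 23:01:02 floyd kernel: [42949379.870000] mice: PS/2 mouse device common for all mice"
)

print(find_reboots(SAMPLE))
print(find_reboots(SAMPLE + [HYPOTHETICAL_BOOT_LINE]))
```

In the excerpt above, the counter rises by about 595 seconds between 21:16:34 and 21:26:29, which matches the wall clock, so the two `restart` lines look like syslogd being restarted rather than the box rebooting.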

 
jkittle99 (Author) Commented:
It just happened again - I got nada in the remote syslog prior to the reboot (it's set up for *.*    @syslogserveraddress) - I will gladly change this if I should.
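
Since nothing reaches the remote server before the box goes down, one option is a periodic heartbeat message, so that the last one received brackets the time of the crash. This is a minimal Python sketch, assuming the remote syslog server accepts UDP on port 514 (the usual default). `make_heartbeat_logger`, `run_heartbeat`, and the "reboot-heartbeat" logger name are made up for illustration.

```python
import logging
import logging.handlers
import time


def make_heartbeat_logger(host, port=514):
    """Build a logger that sends each record to a remote syslog server over UDP."""
    logger = logging.getLogger("reboot-heartbeat")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("heartbeat: %(message)s"))
    logger.addHandler(handler)
    return logger


def run_heartbeat(logger, count, interval_seconds=60, sleep=time.sleep):
    """Send `count` numbered heartbeats, pausing between them."""
    for seq in range(count):
        logger.info("seq=%d", seq)
        sleep(interval_seconds)
```

Something like `run_heartbeat(make_heartbeat_logger("syslogserveraddress"), count=1440)` would cover a day at one message per minute. A gap in the sequence on the remote side then shows roughly when the machine went away, even if it logged nothing itself.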

j
 
jkittle99 (Author) Commented:
My boot log is getting overwritten on every reboot, so I've got no way to show what it contained before the bounce. How do I change this?
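
One low-tech way around the overwrite is to copy the boot log to a timestamped file on every startup, for example from `/etc/rc.local`, which the boot output above shows being run. The sketch below is Python. The `/var/log/boot` path, the archive directory, and the `archive_boot_log` name are assumptions to adjust for the actual system.

```python
import shutil
import time
from pathlib import Path


def archive_boot_log(source="/var/log/boot",
                     archive_dir="/var/log/boot-archive",
                     now=None):
    """Copy the boot log to archive_dir/<name>.<YYYYmmdd-HHMMSS>.

    Returns the path of the copy, or None if the source file does not exist.
    Both default paths are assumptions, not confirmed for this system.
    """
    src = Path(source)
    if not src.is_file():
        return None
    # time.localtime(None) means "the current time".
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{src.name}.{stamp}"
    shutil.copy2(src, dest)  # copy2 also preserves the modification time
    return dest
```

The copy taken during one boot survives the next one, so after a crash the previous startup log is still on disk. It only records startup messages, so it will not show what happened just before the reset.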
 
m1tk4 Commented:
Looks like a kernel panic to me - this would leave nothing in the logs. Can you physically get to the machine and see what is output to the screen?

If not, I'd try disabling the following and seeing if that solves the problem:

ACPI
powernowd

Another thing that might help troubleshooting is to see if you can get SMART information from your hard drive(s) using smartctl and if there are SMART errors on the drives. Something like this could be caused by a failing drive.
 
jkittle99 (Author) Commented:
I have physical access - but the odds of catching the reboot "live" while I stand there are slim. It did this a few weeks ago; I thought it was overheating, so I cranked up the air conditioning. It ran stable for 21 days, and then started doing this again last week.

How do I disable ACPI and powernowd?

 
jkittle99 (Author) Commented:
OK, oddly enough I just saw it happen. It was at an Ubuntu login screen - the screen went black, and the system began to POST.
 
m1tk4 Commented:
I'm about 95% sure it's a hardware problem then. Before disabling any services, run hardware tests on the system. These should be either on a service partition (Dell, HP) or on the manufacturer's CD.

Most likely, the problem is with the memory, then I'd look at the hard drives (SMART info will help here), power supply, then the rest.

The fact that cranking up AC made it run more stable for at least a while is telling a lot. Chances are, you have a memory stick slowly dying.

If you want to try disabling the services before trying to diagnose the hardware, it's done with update-rc.d. Another thing that might be worth trying is disabling the graphical boot (going from runlevel 5 to runlevel 3). However, if this were my server, I would not waste the time on that.
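
On a SysV-style layout like this one (the boot log shows `/etc/rc2.d/S20powernowd`), a service starts in a runlevel when an `S` link for it exists in that runlevel's directory, and `update-rc.d` manages those links. Below is a small Python sketch for seeing where a service is enabled before changing anything. `start_links` is an illustrative helper, and the `/etc` root is a parameter so it can be pointed at a copy.

```python
import re
from pathlib import Path


def start_links(service, etc_root="/etc"):
    """List the S<NN><service> start links found under <etc_root>/rc?.d."""
    pattern = re.compile(r"^S\d\d" + re.escape(service) + r"$")
    found = []
    for rc_dir in sorted(Path(etc_root).glob("rc?.d")):
        if not rc_dir.is_dir():
            continue
        for entry in sorted(rc_dir.iterdir()):
            if pattern.match(entry.name):
                found.append(str(entry))
    return found
```

If `start_links("powernowd")` returns anything, `update-rc.d -f powernowd remove` is the usual Debian/Ubuntu way to drop the links. Check the man page on the installed release before running it.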
 
jkittle99 (Author) Commented:
I'll boot a livecd and run a memory test and see where that takes us.
 
jkittle99 (Author) Commented:
OK, it's dying right away on a memtest. It's running ECC memory, so I'm going to try pulling one stick or the other to see if it's just one stick - I'm not sure whether to trust the memtest app yet, since this is my first time using it (and the hardware is not so familiar).
 
m1tk4 Commented:
memtest is usually pretty good about figuring out faulty places in the memory. Try swapping out all sticks first for some proven memory - if it's still dying on a memtest, try adjusting the timings in BIOS to "fail-safe" levels, and if BIOS allows, check if the voltage supplied is OK. If after this tinkering it still fails I'd probably be shopping for a new server. You can get a decent replacement for a dual P3 for so little now it's not worth spending more of your time on that old hardware.
 
Nopius Commented:
> You can get a decent replacement for a dual P3 for so little now it's not worth spending more of your time on that old hardware.

I don't recommend buying hardware that is too old, because the average lifetime of an electrolytic capacitor is about 7-8 years. You will find them on the motherboard. If some of them are bulging, they are probably dying or dead, and you may get unexpected reboots or unreliable power-ons.

I agree with all of m1tk4's comments, but I don't recommend using hardware that old for your server...
 
m1tk4 Commented:
I was actually thinking about an off-lease brand-name like an HP, Compaq or a Dell. They are normally 3-4 years old and very economical.