Discussion:
[pve-devel] Blacklisting HP hardware watchdog timer module ?
Emmanuel Kasper
2015-12-02 10:29:41 UTC
Permalink
Hi
It seems that the HP Watchdog timer does not work properly: it triggers
a kernel panic instead of rebooting the server.

The issue came up here in this thread:

http://forum.proxmox.com/threads/24015-VE-4-0-Kernel-Panic-on-HP-Proliant-servers

At least 3 users seem to have solved the problem by blacklisting the
corresponding hpwdt kernel module.

As remarked by Alexandre, the Ubuntu folks actually decided to blacklist
*all* the hardware watchdog timers some time ago
( https://lists.ubuntu.com/archives/kernel-team/2015-March/054512.html )

Should we add hpwdt to our list of blacklisted modules?

Emmanuel
Alexandre DERUMIER
2015-12-02 10:54:37 UTC
Permalink
I don't have an HP server to test,

but on Dell servers the iDRAC watchdog is not used by default (its module is not loaded):
if the motherboard Intel watchdog (iTCO_wdt) is loaded, the iDRAC/BMC watchdog module is not loaded.
(To get the iDRAC watchdog working, I need to set nmi_watchdog=0 in grub.cfg to disable the motherboard watchdog.)


I think the problem with HP is that both are loaded (motherboard and iLO), but the iLO watchdog is not updated.

According to https://www.kernel.org/doc/Documentation/watchdog/hpwdt.txt,
the module needs to be loaded with priority=1:


1. If the kernel has not been booted with nmi_watchdog turned off then
edit /boot/grub/menu.lst and place the nmi_watchdog=0 at the end of the
currently booting kernel line.
2. reboot the server
3. Once the system comes up perform a rmmod hpwdt
4. insmod /lib/modules/`uname -r`/kernel/drivers/char/watchdog/hpwdt.ko priority=1




I don't know what the advantage is of using a BMC/iLO/iDRAC watchdog versus the motherboard watchdog.


Alexandre DERUMIER
2015-12-02 10:56:16 UTC
Permalink
Also, if somebody has a Red Hat subscription,
it seems that they have an explanation of the problem:

https://access.redhat.com/solutions/707563


(I don't have access, sorry.)

Dietmar Maurer
2015-12-02 11:08:32 UTC
Permalink
I will ship a blacklist for all watchdog modules with newer kernels:

https://git.proxmox.com/?p=pve-kernel.git;a=commitdiff;h=acea917faa9ef4974f984b9b2dc612e73d22d220

With the latest kernel you can show that blacklist with:

# cat /lib/modprobe.d/blacklist_pve-kernel-4.2.6-1-pve.conf
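To see which watchdog drivers such a file covers, the `blacklist` entries can be extracted with a few lines of Python. This is only a minimal sketch: the function name `blacklisted_modules` and the sample text are made up for illustration, and the sample is not the real content of the PVE blacklist file.

```python
def blacklisted_modules(conf_text):
    """Return the module names listed in 'blacklist <module>' lines."""
    modules = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) == 2 and parts[0] == "blacklist":
            modules.append(parts[1])
    return modules


# Illustrative sample only, not the real PVE blacklist file.
SAMPLE = """\
# watchdog drivers
blacklist hpwdt
blacklist iTCO_wdt
"""

print(blacklisted_modules(SAMPLE))
```

On a real node the text would come from reading the file shown above instead of the sample string.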
Alexandre DERUMIER
2015-12-02 11:13:07 UTC
Permalink
Seems to be a good idea.

(Maybe add a wiki note about the different watchdog modules that users can define in /etc/modules.)

Dietmar Maurer
2015-12-02 11:22:59 UTC
Permalink
Post by Alexandre DERUMIER
(Maybe add a wiki note about the different watchdog modules that users can define
in /etc/modules.)
Yes, that is also a good idea ;-)
lyt_yudi
2015-12-02 19:33:08 UTC
Permalink
https://git.proxmox.com/?p=pve-kernel.git;a=commitdiff;h=acea917faa9ef4974f984b9b2dc612e73d22d220
# cat /lib/modprobe.d/blacklist_pve-kernel-4.2.6-1-pve.conf
For me, maybe this can also work on the Dell R710.

I have seen this a few times:
———————
NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [swapper/5:0]
———————

http://mirrors.myccdn.info/images/dell-r710-bug.txt

thanks.
Alexandre DERUMIER
2015-12-03 05:24:01 UTC
Permalink
About the NMI watchdog:

I have it enabled, even if no watchdog module is loaded.

The only way I have found to disable it is to pass "nmi_watchdog=0"
on the kernel command line in grub.


Run cat /proc/sys/kernel/nmi_watchdog to see whether it is enabled or not.

(I can't load the iDRAC ipmi_watchdog module until it is disabled.)
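The check described above can be scripted. The following is a minimal sketch, assuming the sysctl file contains `1` when the NMI watchdog is enabled and `0` otherwise; the function names are hypothetical, and the command-line parser only looks at the `nmi_watchdog=` parameter.

```python
from pathlib import Path

NMI_SYSCTL = Path("/proc/sys/kernel/nmi_watchdog")


def nmi_watchdog_enabled(sysctl_text):
    """Interpret the content of /proc/sys/kernel/nmi_watchdog ('1' = enabled)."""
    return sysctl_text.strip() == "1"


def cmdline_disables_nmi_watchdog(cmdline):
    """Return True if the kernel command line carries nmi_watchdog=...0."""
    for param in cmdline.split():
        if param.startswith("nmi_watchdog="):
            # The value may carry a prefix such as "panic,"; the number comes last.
            return param.split("=", 1)[1].split(",")[-1] == "0"
    return False


if __name__ == "__main__" and NMI_SYSCTL.exists():
    state = "enabled" if nmi_watchdog_enabled(NMI_SYSCTL.read_text()) else "disabled"
    print("NMI watchdog is", state)
```

The second helper can be fed the content of /proc/cmdline to confirm that the boot parameter was actually applied after a reboot.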


Alexandre DERUMIER
2015-12-03 05:47:15 UTC
Permalink
There is also a "nowatchdog" kernel option (set via grub), which disables both the NMI watchdog (hard-lockup detector) and the soft-lockup detector:


https://lkml.org/lkml/2015/3/2/651

nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
Format: [panic,][nopanic,][num]
- Valid num: 0
+ Valid num: 0 or 1
0 - turn nmi_watchdog off
+ 1 - turn nmi_watchdog on
When panic is specified, panic when an NMI watchdog
timeout occurs (or 'nopanic' to override the opposite
default).
@@ -2460,7 +2461,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.

nousb [USB] Disable the USB subsystem

- nowatchdog [KNL] Disable the lockup detector (NMI watchdog).
+ nowatchdog [KNL] Disable both lockup detectors, i.e.
+ soft-lockup and NMI watchdog (hard-lockup).

nowb [ARM]






Alexandre Derumier
Systems and storage engineer


Landline: 03 20 68 90 88
Fax: 03 20 68 90 81


45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris


MonSiteEstLent.com - Blog dedicated to web performance and handling traffic peaks


Alexandre DERUMIER
2015-12-03 07:10:08 UTC
Permalink
So,

I think that if we blacklist all hardware watchdog modules, we can keep the NMI watchdog enabled.

But in the wiki,

we need to add a note:

if you want to use a hardware watchdog, you need to disable the NMI watchdog (nmi_watchdog=0)
and load the hardware watchdog module via /etc/modules.


Common hardware watchdogs are:

Intel iTCO (ICH chipset motherboards, almost any Intel motherboard of the last 15 years):

/etc/modules
------------
iTCO_wdt
iTCO_vendor_support


Dell iDRAC or generic IPMI:

/etc/modules
------------
ipmi_watchdog


HP iLO:

/etc/modules
------------
hpwdt
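The list above can be captured as a small lookup table, for instance for a helper script that prints the lines to add to /etc/modules. This is only a sketch: the module names come from the message above, while the dictionary keys and the function name are hypothetical labels chosen for the example.

```python
# Module names as listed in the message; the keys are hypothetical labels.
WATCHDOG_MODULES = {
    "intel-itco": ["iTCO_wdt", "iTCO_vendor_support"],
    "ipmi": ["ipmi_watchdog"],  # Dell iDRAC or generic IPMI
    "hp-ilo": ["hpwdt"],
}


def modules_file_lines(kind):
    """Return the lines to add to /etc/modules for one watchdog type."""
    try:
        return list(WATCHDOG_MODULES[kind])
    except KeyError:
        raise ValueError(f"unknown watchdog type: {kind!r}") from None


print("\n".join(modules_file_lines("ipmi")))
```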





lyt_yudi
2015-12-03 07:16:28 UTC
Permalink
Post by Alexandre DERUMIER
About nmi watchdog,
I have it enabled, even if no watchdog module is loaded.
The only way I have found to disable it is to pass "nmi_watchdog=0"
to grub.
cat /proc/sys/kernel/nmi_watchdog to see if it's enabled or not.
(I can't load idrac ipmi_watchdog until it's disabled)
Hi,

I just got a new error because of the watchdog.

Dec 3 14:41:19 ctc.fp12 kernel: [174867.983204] Call Trace:
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983206] <IRQ> [<ffffffff81800818>] dump_stack+0x45/0x57
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983218] [<ffffffff8107b65a>] warn_slowpath_common+0x8a/0xc0
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983220] [<ffffffff8107b6e5>] warn_slowpath_fmt+0x55/0x70
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983223] [<ffffffff8170df58>] dev_watchdog+0x228/0x240
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983225] [<ffffffff8170dd30>] ? dev_deactivate_queue.constprop.34+0x70/0x70
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983229] [<ffffffff810e5629>] call_timer_fn+0x39/0x100
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983231] [<ffffffff8170dd30>] ? dev_deactivate_queue.constprop.34+0x70/0x70
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983234] [<ffffffff810e6e12>] run_timer_softirq+0x182/0x2c0
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983236] [<ffffffff8107faf5>] __do_softirq+0x105/0x260
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983238] [<ffffffff8107fdae>] irq_exit+0x8e/0x90
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983243] [<ffffffff8180a486>] smp_apic_timer_interrupt+0x46/0x60
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983245] [<ffffffff8180861b>] apic_timer_interrupt+0x6b/0x70
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983246] <EOI> [<ffffffff8168b474>] ? poll_idle+0x54/0xa0
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983251] [<ffffffff8168ad45>] cpuidle_enter_state+0xb5/0x220
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983256] [<ffffffff8168aee7>] cpuidle_enter+0x17/0x20
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983261] [<ffffffff810bdb1b>] call_cpuidle+0x3b/0x70
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983294] [<ffffffff8168aec3>] ? cpuidle_select+0x13/0x20
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983296] [<ffffffff810bdde7>] cpu_startup_entry+0x297/0x360
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983301] [<ffffffff817f4cfc>] rest_init+0x7c/0x80
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983307] [<ffffffff81f66029>] start_kernel+0x49a/0x4bb
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983309] [<ffffffff81f65120>] ? early_idt_handler_array+0x120/0x120
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983311] [<ffffffff81f654d7>] x86_64_start_reservations+0x2a/0x2c
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983313] [<ffffffff81f65623>] x86_64_start_kernel+0x14a/0x16d
Dec 3 14:41:19 ctc.fp12 kernel: [174867.983315] ---[ end trace 1242808342f6897e ]---

Detailed log:

http://mirrors.myccdn.info/images/dell-r510.txt

thanks.
Alexandre DERUMIER
2015-12-03 07:28:59 UTC
Permalink
Post by lyt_yudi
just got new error, because the watchdog.
I see that ipmi_watchdog is loaded and the NMI watchdog is enabled too; maybe you can try to disable the NMI watchdog.



Can you try the following:

edit
/etc/default/grub

add
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"


# update-grub


edit

/etc/modprobe.d/pve-blacklist.conf

add


blacklist iTCO_wdt
blacklist iTCO_vendor_support


edit

/etc/modules

add

ipmi_watchdog
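The grub step above can be done by hand, but for several nodes a small text transformation may be handy. The following is a minimal sketch that appends a parameter to the `GRUB_CMDLINE_LINUX_DEFAULT` line of a grub defaults file passed in as a string; the function name is hypothetical, the sketch assumes the value is enclosed in double quotes, and it does not write the file or run update-grub.

```python
import re

# Matches the double-quoted value of GRUB_CMDLINE_LINUX_DEFAULT on its own line.
_CMDLINE_RE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)


def add_kernel_param(grub_text, param="nmi_watchdog=0"):
    """Append param to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there."""

    def _append(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return _CMDLINE_RE.sub(_append, grub_text)


before = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet"\n'
after = add_kernel_param(before)
print(after)
```

Applying the function twice leaves the text unchanged, so it is safe to re-run; update-grub and a reboot are still needed for the change to take effect.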

lyt_yudi
2015-12-03 07:43:24 UTC
Permalink
Post by Alexandre DERUMIER
Post by lyt_yudi
just got new error, because the watchdog.
I see that ipmi_watchdog is loaded and nmi too, maybe can you try to disable nmi.
edit
/etc/default/grub
add
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"
# update-grub
edit
/etc/modprobe.d/pve-blacklist.conf
add
blacklist iTCO_wdt
blacklist iTCO_vendor_support
edit
/etc/modules
ipmi_watchdog
Thanks, these have been added.
Alexandre DERUMIER
2015-12-03 07:53:03 UTC
Permalink
@Dietmar:

if we blacklist all modules in pve-blacklist.conf,

I can't force one to load by defining it in /etc/modules.

Example:
/etc/modprobe.d/pve-blacklist.conf
blacklist ipmi_watchdog


/etc/modules
ipmi_watchdog


dmesg:

Dec 03 08:45:56 kvmtest1.odiso.net systemd-modules-load[229]: Module 'ipmi_watchdog' is blacklisted
Dec 03 08:45:56 kvmtest1.odiso.net systemd-modules-load[229]: Module 'ipmi_watchdog' is blacklisted


I don't know how to manage that without removing the blacklist entry from pve-blacklist.conf,
but I think it will be overwritten at each kernel update?
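The conflict reported above (a module listed in /etc/modules that is also blacklisted, so systemd-modules-load refuses it) can be detected ahead of a reboot. The following is a minimal sketch operating on the two files' contents as strings; the function names are hypothetical and only plain `blacklist <module>` lines are considered.

```python
def _entries(text):
    """Yield the whitespace-split fields of each non-blank, non-comment line."""
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            yield line.split()


def _norm(name):
    # modprobe treats '-' and '_' in module names as equivalent.
    return name.replace("-", "_")


def blocked_autoloads(modules_text, blacklist_text):
    """Return modules from /etc/modules that the blacklist would block."""
    blacklisted = {
        _norm(fields[1])
        for fields in _entries(blacklist_text)
        if len(fields) == 2 and fields[0] == "blacklist"
    }
    return [
        fields[0]
        for fields in _entries(modules_text)
        if _norm(fields[0]) in blacklisted
    ]


print(blocked_autoloads("ipmi_watchdog\n", "blacklist ipmi_watchdog\n"))
```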


Dietmar Maurer
2015-12-03 09:07:19 UTC
Permalink
Post by Alexandre DERUMIER
Dec 03 08:45:56 kvmtest1.odiso.net systemd-modules-load[229]: Module
'ipmi_watchdog' is blacklisted
Dec 03 08:45:56 kvmtest1.odiso.net systemd-modules-load[229]: Module
'ipmi_watchdog' is blacklisted
sigh :-/
Post by Alexandre DERUMIER
I don't know how to manage that without removing the blacklist from pve-blacklist.conf,
but I think it'll be overwritten at each kernel update?
Yes.

But manually loading the modules using 'modprobe' works. So maybe we
just configure the watchdog module in:

/etc/default/pve-watchdog:
WATCHDOG=ipmi_watchdog

and then pass that to the watchdog-mux.service:

-------------------------
[Unit]
Description=Proxmox VE watchdog multiplexer

[Service]
EnvironmentFile=-/etc/default/pve-watchdog
ExecStart=/usr/sbin/watchdog-mux
OOMScoreAdjust=-1000
Restart=no
---------------------

which then does the modprobe to load the module?
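To illustrate the idea, here is a minimal Python sketch of how a service could turn such a defaults file into a modprobe call. It is not the watchdog-mux implementation: the parser handles only simple `KEY=VALUE` lines, the function names are hypothetical, and the softdog default is an assumption mirroring the patch discussed later in the thread.

```python
def parse_env_file(text):
    """Parse simple KEY=VALUE lines; '#' and ';' start comment lines."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        env[key.strip()] = value
    return env


def modprobe_argv(env, default="softdog"):
    """Build the modprobe command for the configured watchdog module."""
    return ["modprobe", "-q", env.get("WATCHDOG", default)]


print(modprobe_argv(parse_env_file("WATCHDOG=ipmi_watchdog\n")))
```

Building an argument list rather than a shell string avoids quoting problems if the value ever contains unexpected characters.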
Alexandre DERUMIER
2015-12-03 09:25:50 UTC
Permalink
Post by Alexandre DERUMIER
Post by Dietmar Maurer
which do the modprobe to load the module?
Yes, it should work. (Loading the module with modprobe works for me.)


Dietmar Maurer
2015-12-03 10:16:48 UTC
Permalink
Post by Alexandre DERUMIER
Post by Dietmar Maurer
which do the modprobe to load the module?
yes, it should work. (modprobe module is working for me)
Please can you test?

https://git.proxmox.com/?p=pve-ha-manager.git;a=commitdiff;h=6263c81dfe8bf78c1f47ab8a1e33aa896202dba0
Alexandre DERUMIER
2015-12-03 10:33:18 UTC
Permalink
I'll test this afternoon.

Also, maybe you could add a fallback to softdog if the modprobe of the configured watchdog module does not work (a wrong module name, for example)?


 if (stat(WATCHDOG_DEV, &fs) == -1) {
-    system("modprobe -q softdog"); // load softdog by default
+    char *wd_module = getenv("WATCHDOG_MODULE");
+    if (wd_module) {
+        char *cmd = NULL;
+        if ((asprintf(&cmd, "modprobe -q %s", wd_module) == -1)) {
+            perror("assemble modprobe command failed");
+            exit(EXIT_FAILURE);
+        }
+        system(cmd);
++       if (stat(WATCHDOG_DEV, &fs) == -1) {
++           system("modprobe -q softdog"); // fallback
++       }
+    } else {
+        system("modprobe -q softdog"); // load softdog by default
+    }
 }
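The control flow of the proposed fallback can be modelled in a few lines of Python, which makes it easy to see which modules would be tried in each case. This is a sketch of the logic only, not of watchdog-mux itself: `device_exists` and `load_module` are hypothetical callables standing in for the stat() check and the modprobe call.

```python
def modules_to_try(env, device_exists, load_module):
    """Mirror the patched logic and return the modules that were attempted."""
    attempted = []
    if device_exists():
        return attempted  # watchdog device already present, nothing to load
    wanted = env.get("WATCHDOG_MODULE")
    if wanted:
        load_module(wanted)
        attempted.append(wanted)
        if not device_exists():
            # Suggested fallback: the configured module did not create the device.
            load_module("softdog")
            attempted.append("softdog")
    else:
        load_module("softdog")  # default when nothing is configured
        attempted.append("softdog")
    return attempted


loaded = []
print(modules_to_try({"WATCHDOG_MODULE": "bogus"}, lambda: False, loaded.append))
```

With a module name that never produces a watchdog device, the sketch ends up loading softdog as well, which is exactly the silent substitution discussed in the replies.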




Riccardo Gallazzi
2015-12-03 10:39:15 UTC
Permalink
Hi all, I had the kernel panic issue as described with iLO v3 but not with
v2; however, I've disabled hpwdt on all my servers (inside Proxmox). More
details, if needed, later, as I don't have a VPN at the moment.
Thank you for the workaround, I wouldn't have been able to test HA without it.

Ciao!

-Riccardo
Dietmar Maurer
2015-12-03 11:01:44 UTC
Permalink
Post by Alexandre DERUMIER
I'll test this afternoon.
Also, maybe you could add a fallback to softdog, if the modprobe of defined
watchdog module is not working ? (wrong module for example)
I thought about that, but it may be misleading (the user thinks everything is
working, but is still using softdog).
Alexandre DERUMIER
2015-12-03 13:24:11 UTC
Permalink
Post by Alexandre DERUMIER
Post by Dietmar Maurer
I thought about that, but it may be misleading (the user thinks everything is
working, but is still using softdog).
OK, you are right. (I don't know what the impact of not having softdog loaded is, though.)


I have tested the patch,

and it works fine with the ipmi_watchdog and iTCO_wdt modules.


BTW, is it possible to pass options to the module?

"ipmi_watchdog timeout=60" for example

Dietmar Maurer
2015-12-03 13:35:50 UTC
Permalink
Post by Alexandre DERUMIER
BTW, is it possible to pass option to module ?
"ipmi_watchdog timeout=60" for example
AFAIK the suggested way is to add an option line to

/lib/modprobe.d/aliases.conf

or some other modprobe config file. For example I use:

options softdog soft_noboot=1

for testing. But I guess you can also add it to /etc/default/pve-ha-manager:

WATCHDOG_MODULE="ipmi_watchdog timeout=60"

(I have not tested that)
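For the modprobe configuration route, an `options` line has the form `options <module> <name>=<value> ...`. The following minimal sketch just formats such a line; the function name is hypothetical, and the sketch neither validates parameter names nor writes any file.

```python
def options_line(module, **params):
    """Format an 'options' line for a modprobe configuration file."""
    if not params:
        raise ValueError("at least one parameter is required")
    args = " ".join(f"{name}={value}" for name, value in params.items())
    return f"options {module} {args}"


print(options_line("softdog", soft_noboot=1))
print(options_line("ipmi_watchdog", timeout=60))
```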
Alexandre DERUMIER
2015-12-03 15:04:09 UTC
Permalink
Post by Alexandre DERUMIER
Post by Dietmar Maurer
AFAIK the suggested way is to add an option line to
/lib/modprobe.d/aliases.conf
options softdog soft_noboot=1
for testing.
Yes, it works like that.
Post by Alexandre DERUMIER
Post by Dietmar Maurer
WATCHDOG_MODULE="ipmi_watchdog timeout=60"
(I have not tested that)
That doesn't work.

But no problem, modprobe.d is the right place to define it.


BTW, what is the best timeout for the watchdog?
I think that the PVE HA manager waits around 1 minute before migrating VMs?
If so, should the watchdog timeout be lower?


Another question: I did some tests two weeks ago with a customer,
and I think I had a problem when the node rebooted too fast
(pve-ha-manager sees the node down, but it comes up again before the VM has been migrated).
Is it a known bug?
If so, maybe forcing a shutdown/halt rather than a reset/reboot of the node would be better?




Dietmar Maurer
2015-12-03 16:28:55 UTC
Permalink
Post by Alexandre DERUMIER
BTW, what is the best timeout for the watchdog?
I think pve-ha-manager waits around 1 minute before migrating VMs?
If so, should the watchdog timeout be lower?
The timeout must be 60 seconds!! Never change that.

We set the timeout to 60s when we start watchdog-mux.
Post by Alexandre DERUMIER
Another question: I did some tests two weeks ago with a customer,
and I think I had a problem when the node rebooted too fast
(pve-ha-manager saw the node down, but it came back up before the VMs were migrated).
Is this a known bug?
What bug exactly?
Dietmar Maurer
2015-12-03 16:40:04 UTC
Permalink
Post by Dietmar Maurer
The timeout must be 60 seconds!! Never change that.
We set the timeout to 60s when we start watchdog-mux.
Sorry, above info was wrong! We set the timeout to 10 seconds
when we start watchdog-mux (watchdog-mux uses 60s timeouts
for the clients).

Anyways, you should not set the timeout manually.
Alexandre DERUMIER
2015-12-03 16:48:14 UTC
Permalink
Post by Alexandre DERUMIER
Post by Dietmar Maurer
The timeout must be 60 seconds!! Never change that.
We set the timeout to 60s when we start watchdog-mux.
Ah, OK. I thought we needed to define it manually.

What is the difference between these two timeouts?

+ int watchdog_timeout = 10;
+ int client_watchdog_timeout = 60;


ipmitool gives me 10s, so it seems to work fine :)
# ipmitool mc watchdog get
Initial Countdown: 10 sec
Post by Alexandre DERUMIER
Another question: I did some tests two weeks ago with a customer,
and I think I had a problem when the node rebooted too fast
(pve-ha-manager saw the node down, but it came back up before the VMs were migrated).
Is this a known bug?
Post by Dietmar Maurer
What bug exactly?
I don't remember exactly, but the LRM or CRM was stuck because the node (and its VMs) had rebooted too fast.

Sorry, I don't have access to the customer's logs.



Alexandre DERUMIER
2015-12-03 17:24:40 UTC
Permalink
I just found a strange bug with ipmi_watchdog, related to Dell OpenManage.

At boot, the timeout is correctly set to 10s:

***@kvmtest1 ~ # ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 10 sec
Present Countdown: 9 sec


But after a few minutes (5-10 min),
I see it at 480s:

# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0xc4)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 480 sec
Present Countdown: 479 sec


In Dell OpenManage, I see a reset configuration option set to 480s.

(I think it's the OpenManage service that overwrites the value.)

I'll add a note in the wiki about this too.
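
The drift shown above (10s at boot, 480s a few minutes later) could be detected by checking the "Initial Countdown" line of the `ipmitool mc watchdog get` output. The following is only a minimal sketch, assuming the output format shown in this message; the function names `parse_initial_countdown` and `countdown_matches` are hypothetical and not part of any Proxmox tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the "Initial Countdown" value (in seconds) found in the text
 * printed by `ipmitool mc watchdog get`, or -1 if no such line exists
 * or no number follows the label. */
int parse_initial_countdown(const char *text)
{
    const char *key = "Initial Countdown:";
    const char *p;
    int seconds;

    assert(text != NULL);
    p = strstr(text, key);
    if (p == NULL)
        return -1;
    /* %d skips the whitespace between the colon and the number. */
    if (sscanf(p + strlen(key), "%d", &seconds) != 1)
        return -1;
    return seconds;
}

/* Hypothetical helper: 1 if the BMC still reports the expected timeout,
 * 0 if it differs or could not be parsed. */
int countdown_matches(const char *text, int expected_seconds)
{
    return parse_initial_countdown(text) == expected_seconds;
}
```

Fed with the second output above, `countdown_matches(output, 10)` would return 0, which is the situation where something other than watchdog-mux has rewritten the timer.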


_______________________________________________
pve-devel mailing list
pve-***@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
Alexandre DERUMIER
2015-12-03 17:33:54 UTC
Permalink
Damn,

I can't force OpenManage to set the timer below 60s :(

# omconfig system recovery timer=10
Error! Recovery reset time must be between 60 and 720 seconds.

I'll try to see whether we can disable it.

Dietmar Maurer
2015-12-03 18:05:57 UTC
Permalink
Post by Alexandre DERUMIER
But after a few minutes (5-10 min),
I see it at 480s:
The assumption is that watchdog-mux is the only tool with access
to the watchdog. I think the watchdog is quite useless if
someone/something else can change or modify the watchdog timers.
IMHO very dangerous - maybe better to use softdog instead?
Alexandre DERUMIER
2015-12-04 05:14:39 UTC
Permalink
I finally found out how to disable watchdog management in Dell OpenManage.

Simply edit:

/opt/dell/srvadmin/etc/srvadmin-isvc/ini/dcwddy64.ini

and comment out:

;[HWC Configuration]
;watchDogObj.settings=0
;watchDogObj.expiryTime=480


That's all.


I'll add a note in the wiki.
Post by Alexandre DERUMIER
Post by Dietmar Maurer
IMHO very dangerous - maybe better to use softdog instead?
I think iTCO_wdt is also a good one for Intel motherboards (integrated in all ICH chipsets).


But IPMI can have some advantages on Dell, such as taking a screenshot of a kernel panic.


Alexandre DERUMIER
2015-12-04 05:37:22 UTC
Permalink
I have added a new hardware watchdog section to the wiki (it was previously under troubleshooting):

https://pve.proxmox.com/wiki/High_Availability_Cluster_4.x#Hardware_Watchdogs

with the new /etc/default/pve-ha-manager configuration file.


Dietmar Maurer
2015-12-04 07:07:59 UTC
Permalink
Post by Alexandre DERUMIER
I have added a new hardware watchdog section to the wiki (it was previously under troubleshooting):
https://pve.proxmox.com/wiki/High_Availability_Cluster_4.x#Hardware_Watchdogs
Great! Thanks.

But I do not really understand why someone should disable the nmi_watchdog?
Alexandre DERUMIER
2015-12-04 08:34:53 UTC
Permalink
Post by Alexandre DERUMIER
Post by Dietmar Maurer
But I do not really understand why someone should disable the nmi_watchdog?
For the HP watchdog, it's mandatory:
https://www.kernel.org/doc/Documentation/watchdog/hpwdt.txt

lyt_yudi has also reported crashes with nmi_watchdog on his Dell server when ipmi_watchdog is loaded.


And I have also seen a lot of bug reports on the net involving nmi_watchdog.


From my point of view, nmi_watchdog is only useful with softdog, so if you have a hardware watchdog,
disabling it can avoid problems.





Dietmar Maurer
2015-12-04 08:41:57 UTC
Permalink
Post by Alexandre DERUMIER
From my point of view, nmi_watchdog is only useful with softdog, so if you
have a hardware watchdog,
disabling it can avoid problems.
OK, thanks for the details.

Dietmar Maurer
2015-12-04 06:58:59 UTC
Permalink
Post by Alexandre DERUMIER
I finally found how to disable watchdog management from dell openmanage.
Is the source for that tool available? I really wonder how they can reset
an already opened/used watchdog device?
Alexandre DERUMIER
2015-12-04 08:22:29 UTC
Permalink
Post by Alexandre DERUMIER
Post by Dietmar Maurer
Is the source for that tool available? I really wonder how they can reset
an already opened/used watchdog device?
No, it's closed-source Java.

But I think they simply use the WDIOC_SETTIMEOUT ioctl. (I don't know whether that can be done when the watchdog is already in use.)
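
For reference, here is a minimal sketch of the standard Linux watchdog character-device interface, using the read-only counterpart WDIOC_GETTIMEOUT from <linux/watchdog.h>. It is only an illustration: the function name `query_watchdog_timeout` is hypothetical, and nothing in this thread establishes which path OpenManage actually uses. Watchdog drivers typically allow only one open of the device at a time, so while another process holds it the open() below would normally fail (usually with EBUSY) and the function returns -1.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Try to read the current timeout of a watchdog device.
 * Returns the timeout in seconds, or -1 if the device cannot be opened
 * or the ioctl fails.
 * WARNING: a successful open() arms the watchdog, so only try this on
 * a test machine. */
int query_watchdog_timeout(const char *device)
{
    int timeout = -1;
    int fd;

    assert(device != NULL);
    fd = open(device, O_WRONLY);
    if (fd < 0)
        return -1;

    if (ioctl(fd, WDIOC_GETTIMEOUT, &timeout) < 0)
        timeout = -1;

    /* "Magic close": ask the driver to disarm the timer before closing.
     * Drivers without WDIOF_MAGICCLOSE, or loaded with nowayout set,
     * ignore this and keep the timer running. */
    if (write(fd, "V", 1) < 0) {
        /* nothing useful to do about a failed write in this sketch */
    }
    close(fd);
    return timeout;
}
```

Setting a new value would use WDIOC_SETTIMEOUT on the same file descriptor, which again requires that the open() succeeds first.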




Dietmar Maurer
2015-12-03 17:52:23 UTC
Permalink
Post by Alexandre DERUMIER
Post by Dietmar Maurer
The timeout must be 60 seconds!! Never change that.
We set the timeout to 60s when we start watchdog-mux.
Ah, OK. I thought we needed to define it manually.
What is the difference between these two timeouts?
+ int watchdog_timeout = 10;
This is used for the hardware watchdog
Post by Alexandre DERUMIER
+ int client_watchdog_timeout = 60;
The purpose of watchdog-mux is to provide several 'virtual'
watchdog-like devices to clients (pve-ha-crm and pve-ha-lrm). Those
virtual watchdogs use a 60 second timeout.
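
The relationship between the two timeouts can be sketched as follows. This is not the actual watchdog-mux code: only the two constant values come from the patch quoted above, while the struct `mux_client` and the function `mux_may_feed_hardware` are hypothetical names used for illustration. The idea, as I understand the description, is that the multiplexer keeps feeding the 10-second hardware watchdog only while every active client has refreshed its virtual watchdog within the last 60 seconds.

```c
#include <assert.h>
#include <stddef.h>

#define HW_WATCHDOG_TIMEOUT     10  /* seconds, value from the patch */
#define CLIENT_WATCHDOG_TIMEOUT 60  /* seconds, value from the patch */

/* Hypothetical bookkeeping for one virtual watchdog client. */
struct mux_client {
    int  active;       /* non-zero once the client opened its virtual watchdog */
    long last_update;  /* time of the client's last keep-alive, in seconds */
};

/* Return 1 if the multiplexer may keep feeding the hardware watchdog at
 * time `now`, or 0 if some active client missed its 60-second deadline. */
int mux_may_feed_hardware(const struct mux_client *clients, size_t count, long now)
{
    size_t i;

    assert(clients != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (clients[i].active &&
            now - clients[i].last_update > CLIENT_WATCHDOG_TIMEOUT)
            return 0;  /* stop feeding; the hardware timer then runs out */
    }
    return 1;
}
```

Under these assumptions, a client that hangs is tolerated for up to 60 seconds, after which the hardware watchdog is no longer fed and fires within its own 10-second timeout.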