[pvrusb2] Kernel oops solved

Mike Isely isely at isely.net
Wed Sep 21 00:11:51 CDT 2005


Never mind about the additional logs.

I believe I've found the problem.  It's a linked list foul-up happening in
pvrusb2-io.c.  Here's the long explanation (so at least I don't forget
it):

All I/O (where really it's just "I") is handled through the circulation of
a set of buffers.  Each buffer is a control structure (struct pvr2_buffer)
attached to a piece of storage for the data and an URB (Usb Request Block)
for operation with the USB core.  Circulation is tracked by moving the
buffers through various linked lists as the buffer state changes.  These
are the states and their corresponding linked lists:

"none" (not on any list - happens only during initialization)
"idle" - buffer is not being used at all
"queued" - buffer has been posted into the USB core for I/O
"ready" - buffer has data ready for retrieval by the application

The circulation is normally idle --> queued --> ready --> idle.  Movement
from idle to queued happens when the application has pending data
requests.  Movement from queued to ready only happens when the URB's
callback is executed to indicate completion of the I/O, and the buffer is
moved back to idle from ready when the application slurps up the data.
The whole thing is a giant pipeline.  I made this code deliberately
generic and flexible so that the buffer pool's size can be tuned (even on
the fly) and different actual application I/O methods can be layered on
top.  Right now the only I/O method laid over this is just read() calls
(which always tunes the pool to 32 buffers), but in the future it should
be possible to layer the other V4L or DVB I/O methods over this if needed.

Anyway, the pool starts with all buffers in the idle state (having moved
there from "none" when the buffers are allocated).  The pool on the other
hand is shut down by first walking the entire queued list and cancelling
all pending associated URBs (moving each buffer to idle as it is
cancelled), and then the ready list is simply flushed to the idle list.
Notably, the list walk needed for the URB cancellation is an instance of
the kernel's list_for_each_entry_safe() macro.

Here's what is going wrong:

1. The stream is stopped.

2. The function pvr2_stream_internal_flush() is ultimately called to walk
that queued list and cancel all pending URBs.

3. HOWEVER during that walk, a URB or two actually concurrently completes
and its callback function runs, which moves that buffer to the ready list.

4. Unfortunately, since I'm using a for-loop style macro to walk the
queued list, there's a chance that the loop is pointing to one of these
just-completed buffers at just the wrong moment.

5. Internal logic actually avoids corrupting anything at this point since
the actual list manipulations are protected with a spinlock, however now
the for-loop inside list_for_each_entry_safe() is going to go astray
because it's using that "should have been queued but is really ready"
buffer as its placeholder.  So when the for-loop iterates to the next
buffer what it ends up getting is the next ready buffer, or it finds an
empty ready list.

6. Unfortunately the Linux kernel's linked list architecture uses the same
struct type for a list head as it uses for an element, and it detects
end-of-list by recognizing that the pointer is pointing to the list head
instead of a member.  BUT, in this case since we've leaked over to the
ready list instead of the queued list, the list head ends up being the
ready list now.  Yet the for-loop is looking for the list head for the
queued list.  So when it hits the list head for the ready list, it doesn't
recognize it and erroneously interprets it as another buffer pointer.
THAT's the bad pointer we're seeing.

I usually have a different style of implementing linked lists when I write
code.  I use null pointers for list termination and manage the list with a
struct type for the list that is different than the list elements.  This
IMHO is more type-safe and it's something I've been doing for far too many
years - so much so that my fingers can usually whip up such an
implementation in code with barely any thought.  You can find examples of
this other type of list construction elsewhere in the driver.  But that's
not how linked lists are done in the kernel and I expect to get complaints
when others try to main such code.  In this particular case in
pvrusb2-io.c I had decided to adopt the Linux kernel style lists and use
them as I probably should be doing everywhere.  Inexperience with this
list style caused me to miss this scenario.  Damnit.

I already know how to fix this.  Changing the for-loop style of list
walking into a do-while (e.g. grab first item, remove, lather rinse
repeat) plus a second sanity check after cancelling the URB should solve
it.  However it's midnight here right now and I have an early morning
tomorrow.  I will try to get a snapshot with the fix out tomorrow night,
when I'm thinking on all cylinders again.

For all who worked to reproduce this scenario, I thank you.  Oh, and I
REALLY appreciate that Arnaud turned on all the buffer logging as well.
I hadn't really asked for that (because I know that generates huge piles
of noise in the log), but that additional info was instrumental in
allowing me to reconstruct blow-by-blow what happened in the linked lists!

  -Mike



On Tue, 20 Sep 2005, Mike Isely wrote:

>
> Another thing:
>
> There is more to the pvrusb2 log that's missing.  I need to match up all
> the buffer addresses being reported when the buffers are created.  There
> are normally 32 of them but what you posted only starts with the 16th
> buffer - I can't see anything for the first 16 buffers.  This is immensely
> important because it will tell us if that bad buffer is from corrupted
> memory, deleted memory, or if that is not really a buffer but a wild
> pointer.
>
> Oh heck.  You say there's 60MB of log data?  I've got a 1.5Mb DSL downlink
> here.  Can you just put it up and I'll suck it down?  Then I'll be able to
> match up the USB addresses as well.
>
>   -Mike
>
>
> On Tue, 20 Sep 2005, Mike Isely wrote:
>
> >
> > This line is at the top of your kernel log just before the oops:
> >
> >   Sep 21 02:39:38 t42p ehci_hcd 0000:00:1d.7: devpath 4.1 ep4in 3strikes
> >
> > What the $##@! is that?
> >
> > Can you check for me that the USB address listed there matches the pvrusb2
> > device's location?  Check the connection messages logged when you first
> > plug it.  I suspect it is the same.  If so, I'm going to go digging
> > through the kernel sources to see what is going on here...
> >
> > Also, I'd like you send me your pvrusb2.ko binary module.  Just send it to
> > me directly.  It may be useful to cross-reference back some EIP values.
> >
> > Still looking...
> >
> >   -Mike
> >
> >
> >
> > On Tue, 20 Sep 2005, Mike Isely wrote:
> >
> > >
> > > Thanks for catching this!  I've grabbed the logs, and will investigate.
> > >
> > >   -Mike
> > >
> > >
> > > On Wed, 21 Sep 2005, Arnaud Boulan wrote:
> > >
> > > > Le Dimanche 18 Septembre 2005 20:08, Mike Isely a écrit :
> > > >
> > > > > Cute.
> > > > >
> > > > > Haven't seen any reports yet of this problem happening again,
> > > > > unfortunately....
> > > > >
> > > > >   -Mike
> > > >
> > > > Yippeeeeeeeee! It happened again!! :o)
> > > >
> > > > Sep 21 02:39:38 t42p pvrusb2 Buffer cd717f70 is bad
> > > > at /root/admin/drivers/pvr/pvrusb2-mci-20050911/driver/pvrusb2-io.c:211
> > > > Sep 21 02:39:38 t42p pvrusb2 buffer BadSig cd717f70 state=unknown
> > > > id=-706763648 status=1 stream=00004000 purb=0000001d sig=0xd5dfa200
> > > > Sep 21 02:39:38 t42p kernel BUG
> > > > at /root/admin/drivers/pvr/pvrusb2-mci-20050911/driver/pvrusb2-io.c:211!
> > > >
> > > > Here are my logs: (not attached because it's too big for the list)
> > > >
> > > > http://www.boulan.net/pvr/oops1/pvrusb2.log-443-09-21.gz
> > > > http://www.boulan.net/pvr/oops1/kern.log-09-21
> > > >
> > > > I hope it will help...
> > > > Although, i can see no logs about buffer "cd717f70" since last time my pvrusb2
> > > > was plugged. I have some logs about "stream=cd717f80", maybe it's related?
> > > > I keep the whole logs (~60MB compressed :) in case you need it.
> > > >
> > > > Arnaud
> > > >
> > > > _______________________________________________
> > > > pvrusb2 mailing list
> > > > pvrusb2 at isely.net
> > > > http://www.isely.net/cgi-bin/mailman/listinfo/pvrusb2
> > > >
> > >
> > >
> >
> >
>
>

-- 
                        |         Mike Isely          |     PGP fingerprint
     Spammers Die!!     |                             | 03 54 43 4D 75 E5 CC 92
                        |   isely @ pobox (dot) com   | 71 16 01 E2 B5 F5 C1 E8
                        |                             |


More information about the pvrusb2 mailing list