[Bug 1684273] [NEW] Missing tail in iterparse

Jason Owen 1684273 at bugs.launchpad.net
Wed Apr 19 19:38:27 UTC 2017


Public bug reported:

Given a minimal parser (below) and a particular input file (attached),
iterparse is not returning the `tail` of the last `<span>` tag.

I am listening for `end` events, which is the default, rather than
`start` events.

Changing the input, for example by deleting unrelated tags such as the
`<link>` tag in the `<head>`, causes the missing text to reappear. This
makes it hard to produce a minified input! I was able to remove
everything /after/ the element with the missing tail, which doesn't
affect the bug, so that is what I attached.

I took the silence on the mailing list to mean that I did not have any
obvious problems with the way I was using iterparse. :)
https://mailman-mail5.webfaction.com/pipermail/lxml/2017-April/007882.html

---

```python
#!/usr/bin/env python3

import sys
from lxml import etree

for _, element in etree.iterparse(sys.argv[1], html=True):
    print((
        element.tag,
        element.attrib,
        element.text,
        element.tail,
    ))
```

Invoke by:
```sh
$ ./bug.py bug.html | grep "splays their blue cards left"
```

Expected output:
```
('span', {'class': 'age e'}, '4', '.\n... Nnastya splays their blue cards left.\n')
```

Actual output: none, and `grep` exits with return code 1.
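
A possible workaround sketch, under an assumption this report does not confirm: that the tail text has simply not been parsed yet when the `end` event for the element is delivered. If so, looking at each element one event late would give the parser time to attach the tail. `lag_by_one` is a hypothetical helper written for illustration and is not part of lxml.

```python
#!/usr/bin/env python3

import sys


def lag_by_one(events):
    """Yield each item of `events` only after the following item has arrived.

    The last item is yielded once the underlying iterator is exhausted.
    """
    missing = object()  # sentinel meaning "no event is being held back yet"
    previous = missing
    for current in events:
        if previous is not missing:
            yield previous
        previous = current
    if previous is not missing:
        yield previous


if __name__ == "__main__" and len(sys.argv) > 1:
    from lxml import etree

    # Same loop as the reproducer, but each element is printed one event late.
    # Assumption: by then the parser has moved past the element's tail text.
    for _, element in lag_by_one(etree.iterparse(sys.argv[1], html=True)):
        print((
            element.tag,
            element.attrib,
            element.text,
            element.tail,
        ))
```

If the `grep` above matches with this variant but not with the original script, that would point at the timing of the `end` event rather than at the input file.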

---

Python              : sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
lxml.etree          : (3, 7, 3, 0)
libxml used         : (2, 9, 3)
libxml compiled     : (2, 9, 3)
libxslt used        : (1, 1, 29)
libxslt compiled    : (1, 1, 29)

** Affects: lxml
     Importance: Undecided
         Status: New

** Affects: lxml (Ubuntu)
     Importance: Undecided
         Status: New

** Attachment added: "Somewhat minified input that triggers bug"
   https://bugs.launchpad.net/bugs/1684273/+attachment/4865114/+files/bug.html


-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to lxml in Ubuntu.
https://bugs.launchpad.net/bugs/1684273
