[Bug 1937874] [NEW] one --accept-regex expression negates another

Fri Jul 23 19:44:58 UTC 2021

Public bug reported:

This command should theoretically fetch all PDFs on a page:

$ wget -v -d -r --level 1 --adjust-extension --no-clobber --no-directories\
       --accept-regex 'administrative-orders/.*/administrative-order-matter-'\
       --accept-regex 'administrative-orders.*.pdf'\
       --accept-regex 'administrative-orders.page[^&]*$'\
       --directory-prefix=/tmp\
       'https://www.ncua.gov/regulation-supervision/enforcement-actions/administrative-orders?page=56'

But it fails to grab any of them, giving the output:

---
Deciding whether to enqueue "https://www.ncua.gov/files/administrative-orders/AO14-0241-R4.pdf".
https://www.ncua.gov/files/administrative-orders/AO14-0241-R4.pdf is excluded/not-included through regex.
Decided NOT to load it.
---

That's bogus: the URL matches the second expression.  The workaround is to remove this option:

--accept-regex 'administrative-orders.page[^&]*$'

But that should not be necessary.  Adding an --accept-* option should
never invalidate another --accept-* option, nor should it shrink the
set of fetched files.
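
The expected "union" behaviour can be checked offline. The sketch below
is only an illustration, not wget itself: it uses grep -E as a stand-in
for wget's regex matching (assumed here to be POSIX extended), and the
helper name "matches" is made up for this example. It tests the rejected
URL from the debug output against the second pattern, the third pattern,
and a single alternation of all three.

```shell
#!/bin/sh
# URL taken from the debug output in the report.
url='https://www.ncua.gov/files/administrative-orders/AO14-0241-R4.pdf'

# The three patterns from the report, unchanged.
p1='administrative-orders/.*/administrative-order-matter-'
p2='administrative-orders.*.pdf'
p3='administrative-orders.page[^&]*$'

# matches PATTERN: hypothetical helper; succeeds if $url matches PATTERN
# as a POSIX extended regex (grep -E stands in for wget here).
matches() {
    printf '%s\n' "$url" | grep -Eq -- "$1"
}

matches "$p2" && second=accepted || second=rejected
matches "$p3" && third=accepted || third=rejected

# Union of all three patterns, written as one alternation.
matches "($p1)|($p2)|($p3)" && combined=accepted || combined=rejected

echo "second pattern alone: $second"
echo "third pattern alone:  $third"
echo "combined alternation: $combined"
```

The PDF URL matches the second pattern but not the third, which is
consistent with the guess (an assumption, not confirmed in this report)
that only the last --accept-regex given on the command line is applied.
If that guess holds, passing one --accept-regex containing the
alternation might serve as a workaround that keeps all three patterns;
this has not been tested against wget here.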

** Affects: wget (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1937874
