Having trouble finding a word in multiple files

Peter Flynn peter at silmaril.ie
Mon Jun 15 11:14:12 UTC 2020


On 15/06/2020 11:40, Liam Proven wrote:
> On Sun, 14 Jun 2020 at 09:56, Pat Brown <pat.mysterywriter at gmail.com> wrote:
>>
>> I've tried a variety of grep commands but I can't find the specific
>> word I'm searching for that is in a file or files somewhere in my
>> Dropbox folder. The word I'm trying to find is Blowback. Can someone
>> please help me with the correct command?
> 
> [Reading down the thread]
> 
> They are files from proprietary Windows/Mac apps?
> 
> Then you can't. Grep only searches plain text.
> 
> You can't search in proprietary binary files. At all. Forget all the
> ideas about converting them; you cannot efficiently convert or filter
> these -- every file would need to be converted every time, which would
> be _ludicrously_ slow.

The script I posted does the job in seconds. Word (.docx) files are just
zip files, so they unzip easily, and the document inside is XML, so it's
plain text. Finding all files containing 'Blowback' is fast.
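
A minimal sketch of that first stage in Python, using only the standard
library. This is not the script referred to above; the function names
docx_contains and find_docx_with_word are made up for illustration. It
does a case-sensitive substring test on the raw XML, so it can miss a
word that Word has split across formatting runs.

```python
import zipfile
from pathlib import Path


def docx_contains(path, word):
    """Return True if word/document.xml inside the .docx mentions `word`."""
    try:
        with zipfile.ZipFile(path) as zf:
            xml = zf.read("word/document.xml").decode("utf-8", errors="replace")
    except (zipfile.BadZipFile, KeyError, OSError):
        # Not a zip, no main document part, or unreadable: treat as no match.
        return False
    return word in xml


def find_docx_with_word(root, word):
    """List every .docx under `root` whose main document contains `word`."""
    return sorted(p for p in Path(root).rglob("*.docx") if docx_contains(p, word))
```

Something like find_docx_with_word("/path/to/Dropbox", "Blowback") would
then return the matching paths (the directory is a placeholder).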

BUT... Word and ODT XML documents are stored without line breaks: as in
HTML, the end of one paragraph can butt up against the start of the next
with no white-space, e.g. like this</w:p><w:p>Next para. So finding
*where* in the file the word occurs (as opposed to *whether* the file
contains it) is a second stage, and that would slow things down a
little. Even so, running the search below (for my name, not for
'Blowback') on every Word file on my disk (a few hundred) took under a
minute.
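
To see why the missing line breaks matter, here is a small Python
illustration on a made-up fragment (SAMPLE, strip_tags and
split_paragraphs are hypothetical names). Stripping the tags blindly
glues the two paragraphs together; splitting at each </w:p> first keeps
them apart. It is a regex toy that does not decode entities such as
&amp;, so treat it as a demonstration rather than a parser.

```python
import re

# Two paragraphs with no white-space between them, as in a real document.xml.
SAMPLE = (
    "<w:body><w:p><w:r><w:t>like this</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Next para</w:t></w:r></w:p></w:body>"
)


def strip_tags(xml):
    """Remove everything that looks like a tag, keeping only the text."""
    return re.sub(r"<[^>]+>", "", xml)


def split_paragraphs(xml):
    """Cut at each paragraph end tag, then strip tags from each piece."""
    pieces = (strip_tags(part) for part in xml.split("</w:p>"))
    return [text for text in pieces if text]


print(len(SAMPLE.splitlines()))   # the whole fragment is a single line
print(strip_tags(SAMPLE))         # paragraphs run together
print(split_paragraphs(SAMPLE))   # paragraphs kept separate
```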

So the document.xml (Word) or content.xml (ODT) file in the zip is just
one long string with markup. To search *within* it you need a tool that will
separate the markup from the text; fortunately there are dozens, if not 
hundreds, of these (every CS student at some stage writes an XML 
parser). For use in a script, the easiest I find are the LTxml2 tools 
from the Language Technology Group in Edinburgh 
(https://www.ltg.ed.ac.uk/software/ltxml2/), so after extracting the 
file from the zip into a pipe you could say

...unzip -qc "$wordfile.docx" word/document.xml |\
    lxprintf -e 'w:p[contains(.,"Blowback")]' "%s\n" -

and you'll get the text of any paragraph containing 'Blowback' (the
expression is XPath, the language used to identify pieces of an XML
document; the condition in [square brackets] is a predicate that keeps
only the matching paragraphs).
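
If LTxml2 is not installed, roughly the same second stage can be sketched
in Python with the standard library. This is an assumed equivalent, not
the pipeline above: paragraphs_containing is a hypothetical name, and
instead of XPath's contains() it joins the w:t text runs of each w:p and
tests the result, which also catches a word split across runs.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace bound to the w: prefix in WordprocessingML documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_containing(docx_path, word):
    """Return the text of each paragraph in a .docx that contains `word`."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    hits = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Concatenate every text run in the paragraph before testing.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if word in text:
            hits.append(text)
    return hits
```

Calling paragraphs_containing("somefile.docx", "Blowback") returns a list
of paragraph texts; the match is case-sensitive, and the filename is a
placeholder.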

> You need a desktop search tool. There are not many for Linux and in my
> experience they do not work well. I recently tried Catfish and it was
> unable to search inside LibreOffice files.

Then the people who wrote it need to add some code. This stuff is not 
rocket science (or if it is, I know any number of unemployed rocket 
scientists who can do it for you) — it just means knowing what scripted 
text utilities can offer. XML is easy and fast to handle when it is used 
appropriately for what we designed it for: normal running text 
documents, not rectangular or columnar data, which is the province of 
CSV and JSON.

2¢
Peter




More information about the ubuntu-users mailing list