Finding candidate text for making into reusable snippets

Please post all questions and comments regarding Help & Manual 7 here.

Moderators: Alexander Halser, Tim Green

itexnz
Posts: 137
Joined: Tue Jan 06, 2009 10:09 pm
Location: Wellington, New Zealand


Unread post by itexnz »

Single sourcing and de-duplicating content is a wonderful thing, both for reducing the risk of errors and for massively reducing workloads. I use snippets a lot and store them in a repository. It's great: it turns what would be a 40-hour-a-week, barely-keeping-head-above-water job into an easy task that I can do in 20. H&M really is a lifesaver! Thank you...

Content that is duplicated (or nearly identical) across multiple publications or topics is fairly easy to spot, and therefore fairly easy to turn into reusable snippets, when you are very familiar with your content or only have a limited amount of it. But it is hard to start finding snippet candidates when you have a lot of content, or a lot of new content that you are not familiar with, e.g. after migrating lots of material from MS Word documents (which I often do for my consulting clients).

What I would really love is a tool for finding snippet candidates: something able to crawl all the XML files and find blocks of text within <para> tags that are very similar to text in other <para> tags, either in the same XML topic or in other XML files in other projects. For example, find blocks of text within tags (i.e. the written content, not the XML itself) longer than, say, 100 characters that are more than 80% similar to others.

I imagine this kind of tool would take a fairly long time to crawl and return a reasonably accurate and useful list of candidate blocks, because it would have to load every block of text longer than 100 characters into a database and then run a similarity algorithm across it all. But it would be a great way to encourage users to start making content into snippets, and a great way to sell the feature. I think it might be a cool challenge to work on too :)
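To make the idea concrete, here is a minimal Python sketch of that crawl-and-compare loop. It is only an illustration under several assumptions: that topics are plain XML files whose body text sits inside <para> elements, that a character-level ratio from the standard library's difflib is an acceptable stand-in for "similarity", and that an in-memory list is enough instead of a database. The function names (para_texts, find_candidates, scan_folder) are made up for this example and are not part of Help & Manual.

```python
import difflib
import itertools
import xml.etree.ElementTree as ET
from pathlib import Path

MIN_CHARS = 100   # ignore blocks shorter than this (figure taken from the post)
MIN_RATIO = 0.80  # report pairs at least this similar (figure taken from the post)


def para_texts(root):
    """Yield the whitespace-normalised text of each sufficiently long <para> element."""
    for elem in root.iter():
        # Compare on the local tag name so an XML namespace does not matter.
        if elem.tag.rsplit("}", 1)[-1] == "para":
            # itertext() gathers text from nested child elements as well.
            text = " ".join("".join(elem.itertext()).split())
            if len(text) >= MIN_CHARS:
                yield text


def find_candidates(blocks):
    """blocks: list of (label, text). Return (ratio, label_a, label_b), best first."""
    hits = []
    for (label_a, a), (label_b, b) in itertools.combinations(blocks, 2):
        matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
        # Cheap upper bounds first; the expensive ratio() runs only if they pass.
        if matcher.real_quick_ratio() < MIN_RATIO or matcher.quick_ratio() < MIN_RATIO:
            continue
        ratio = matcher.ratio()
        if ratio >= MIN_RATIO:
            hits.append((ratio, label_a, label_b))
    return sorted(hits, reverse=True)


def scan_folder(folder):
    """Collect <para> blocks from every .xml file under folder and compare them all."""
    blocks = []
    for path in sorted(Path(folder).rglob("*.xml")):
        root = ET.parse(path).getroot()
        for i, text in enumerate(para_texts(root)):
            blocks.append((f"{path.name}#{i}", text))
    return find_candidates(blocks)
```

Calling scan_folder on a project folder would return a list of paragraph pairs sorted by similarity. Every block is compared with every other block, so the run time grows quadratically with the amount of content, and a character-level ratio only catches near-verbatim duplicates, not passages that have been reworded.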

I've tried a bunch of third-party tools for comparing the contents of files, but they are all pretty simple and generalised: they all seem to look at the content of each file as a whole rather than finding similar areas within a file, which is not useful here.

Thoughts? Useful?

Has anyone found a good third-party tool that can parse XML for similarities in the way I suggest?
David Scott
Documentation Infrastructure Consultant
https://www.sourceone.co.nz
SourceOne. Documentation, engineered.
Tim Green
Site Admin
Posts: 23156
Joined: Mon Jun 24, 2002 9:11 am
Location: Bruehl, Germany

Re: Finding candidate text for making into reusable snippets

Unread post by Tim Green »

Hi David,

Even though you appreciate that it would take a fairly long time, I think you're significantly underestimating the enormity of the programming task you are proposing here. Establishing similarity would have to be a sliding scale and would need to be able to handle things like different formatting, slightly different formulations and a wide range of other things. And even exact matches would be very difficult without a clear definition of the size of the text blocks you are looking at. To get something even approaching a useful result you would have to use advanced machine learning systems based on neural networks, which would need to be trained for a long time on many hundreds of thousands of texts. 8)
Regards,
Tim (EC Software Documentation & User Support)

Private support:
Please do not email or PM me with private support requests -- post to the forum directly.
itexnz
Posts: 137
Joined: Tue Jan 06, 2009 10:09 pm
Location: Wellington, New Zealand

Re: Finding candidate text for making into reusable snippets

Unread post by itexnz »

You may say that I'm a dreamer... But I'm not the only one lol
:D
David Scott
Documentation Infrastructure Consultant
https://www.sourceone.co.nz
SourceOne. Documentation, engineered.