Question : Improve Output of Python HTML Diff Script
This is for Expert "pepr." I'm trying to eliminate a couple of problems in a Python diff script. Sometimes it indicates changes in the HTML even though there are no visible changes in the text: an unchanged word is marked as deleted and then inserted again, so it shows up twice. For example,
The bottom lineline is you can't go wrongwrong with either one.
Also, the script usually makes redundant deletions before it makes insertions. For example, to insert the word "good" into the sentence "I believe oranges are better than apples.", it deletes "oranges" and then inserts "good oranges", which gives this: I believe orangesgood oranges are better than apples.
But it should simply insert the one new word, like this: I believe good oranges are better than apples.
The code to be enhanced is shown near the end of http:Q_22619022.html . Thanks.
Answer : Improve Output of Python HTML Diff Script
Then you can replace the for loop with OldFile = rowNew[-1]['doctext'].replace('\r', '')
Here rowNew[-1] is the last element of the sequence, so no loop is needed to reach it.
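As a rough illustration only: the sketch below assumes rowNew is a non-empty list of dict-like rows with a 'doctext' key, and that the original loop simply overwrote OldFile on every pass. The sample rows and the name loop_result are made up for the example.

```python
# Hypothetical sample rows standing in for the real query result.
rowNew = [
    {'doctext': 'first version\r\n'},
    {'doctext': 'second version\r\nwith two lines\r\n'},
]

# Assumed shape of the original loop: each pass overwrites OldFile,
# so only the last row survives.
for row in rowNew:
    OldFile = row['doctext'].replace('\r', '')
loop_result = OldFile

# Direct form: index the last element instead of iterating.
OldFile = rowNew[-1]['doctext'].replace('\r', '')
```

Both forms leave the same string in OldFile. Note that rowNew[-1] raises IndexError on an empty sequence, whereas the loop would just leave OldFile unset.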
As for the comparison algorithm, it seems to have limits. When the markup is removed, the text from elements that are not closely related is joined together and compared using difflib. The real problem is that a detected delete/insert sequence can then lie partly in one element and partly in the next, and there is no marker left in the text that would let the comparison synchronize on the element boundaries.
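For comparison, here is a minimal sketch of what difflib gives when it is fed whole words from a single piece of text, rather than text joined across elements. It is not the script under discussion; word_diff is a hypothetical helper, the del/ins tags are only an assumed output format, and the code is written for Python 3.

```python
import difflib
from html import escape

def word_diff(old, new):
    """Return `new` compared against `old`, word by word, with
    deletions wrapped in <del> and insertions in <ins>."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            out.append(escape(' '.join(a[i1:i2])))
        if tag in ('delete', 'replace'):
            out.append('<del>%s</del>' % escape(' '.join(a[i1:i2])))
        if tag in ('insert', 'replace'):
            out.append('<ins>%s</ins>' % escape(' '.join(b[j1:j2])))
    return ' '.join(out)

result = word_diff('I believe oranges are better than apples.',
                   'I believe good oranges are better than apples.')
```

On this pair of sentences the only non-equal opcode is a single insert of "good". Splitting on whitespace discards the original spacing, which is acceptable for a sketch but would need care in the real script.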
In my opinion, the approach should be changed: parse the HTML documents and compare the important parts of their structure, together with the content of the related elements.
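One possible shape of such an approach, as a sketch under several assumptions: it uses the standard html.parser module (Python 3), treats a hand-picked set of tags (BLOCK_TAGS) as the "important" structure, ignores inline markup, and does not handle text that sits directly inside nested containers. The names BlockTextCollector, blocks_of and changed_blocks are invented for the example.

```python
import difflib
from html.parser import HTMLParser

# Assumption: these tags delimit the units worth comparing.
BLOCK_TAGS = {'p', 'li', 'h1', 'h2', 'h3', 'td'}

class BlockTextCollector(HTMLParser):
    """Collect (tag, text) pairs, one per block-level element."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._tag = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._tag is not None:
            self._parts.append(data)

    def _flush(self):
        # Store the finished block with whitespace normalized.
        if self._tag is not None:
            text = ' '.join(''.join(self._parts).split())
            if text:
                self.blocks.append((self._tag, text))
        self._tag = None
        self._parts = []

def blocks_of(html_text):
    collector = BlockTextCollector()
    collector.feed(html_text)
    collector.close()
    collector._flush()  # pick up a block left unclosed at end of input
    return collector.blocks

def changed_blocks(old_html, new_html):
    """Return the non-equal opcodes between the two block sequences."""
    old, new = blocks_of(old_html), blocks_of(new_html)
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != 'equal']

old_doc = ('<h1>Title</h1><p>The bottom <b>line</b> is fine.</p>'
           '<p>I believe oranges are better.</p>')
new_doc = ('<h1>Title</h1><p>The bottom <i>line</i> is fine.</p>'
           '<p>I believe good oranges are better.</p>')
changes = changed_blocks(old_doc, new_doc)
```

Here the second paragraph differs only in inline markup, so it is not reported, and the third paragraph comes back as one replaced block. Each changed pair could then be compared word by word within its own element, which keeps a delete/insert sequence from spilling over an element boundary.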