Main

wordsetdiff

Description

Is one text file a subset of the other? Or is there some bit of new text that needs to be salvaged?

The basic unix diff tool is sometimes incredibly unsatisfactory for this purpose -- for example when text has been moved around, or when there are widespread whitespace differences.

This program compares two files by treating them as unstructured sets of word sequences. By default words are defined by isAlpha.
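
The idea of comparing files as sets of word sequences can be sketched in a few lines. This is a simplified illustration, not the program's actual code: it works on String rather than ByteString, and the names wordsBy, windowSet and onlyInFirst are hypothetical.

```haskell
import Data.Char (isAlpha)
import Data.List (tails)
import qualified Data.Set as S

-- Split a string into maximal runs of characters satisfying the predicate.
wordsBy :: (Char -> Bool) -> String -> [String]
wordsBy p s = case dropWhile (not . p) s of
  [] -> []
  s' -> let (w, rest) = span p s' in w : wordsBy p rest

-- Every run of n consecutive words, collected as a set, so that order
-- and position within the file no longer matter.
windowSet :: Int -> String -> S.Set [String]
windowSet n txt = S.fromList [take n t | t <- tails ws, length (take n t) == n]
  where
    ws = wordsBy isAlpha txt

-- Word sequences that occur in the first text but nowhere in the second.
onlyInFirst :: Int -> String -> String -> S.Set [String]
onlyInFirst n a b = windowSet n a `S.difference` windowSet n b
```

With n = 2, comparing "the quick brown fox" against "brown fox   the quick" leaves only the pair ["quick","brown"]: the moved text and the extra whitespace are ignored, and only the word pair that spans the cut point is reported as new.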

Run wordsetdiff with no arguments to print the help information.

Synopsis

type TupMap a = HashMap [ByteString] a

type Window = [ByteString]

pack_window :: [ByteString] -> Window

toStrict :: ByteString -> ByteString

data CmdFlag
  = NoColor
  | NWords Int
  | WithPunc
  | AlphaOnly
  | CaseInsensitive

options :: [OptDescr CmdFlag]

safeRead :: String -> Int

data Loc = Loc Int64 Int64

words_wloc :: (Char -> Bool) -> ByteString -> [(ByteString, Loc)]

clump_regions :: [Loc] -> [Loc]

combine_locs :: [Loc] -> Loc

wordmapN :: (Char -> Bool) -> Int -> ByteString -> TupMap (Set Loc)

sliding_win :: Int -> [(ByteString, Loc)] -> [[(ByteString, Loc)]]

trim_separators :: (Char -> Bool) -> ByteString -> [Loc] -> [Loc]

print_diff_regions :: Bool -> ByteString -> [Loc] -> IO ()

data Config = Cfg {
  color :: Bool,
  word_sequence_size :: Int,
  case_insensitive :: Bool,
  with_punctuation :: Bool
}

Documentation

type TupMap a = HashMap [ByteString] a

type Window = [ByteString]

pack_window :: [ByteString] -> Window

toStrict :: ByteString -> ByteString

data CmdFlag

Command line option flags

Constructors

NoColor
NWords Int
WithPunc
AlphaOnly
CaseInsensitive

options :: [OptDescr CmdFlag]

safeRead :: String -> Int

data Loc

Tracks simple source locations as (start, end) character indices, where start is inclusive and end is exclusive.

Constructors

Loc Int64 Int64

Instances

Eq Loc

Ord Loc

Show Loc
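
As a small illustration of the (start, end) convention, the sketch below assumes that the start index is inclusive and the end index exclusive. The helpers locLength and sliceLoc are hypothetical and are not part of this module.

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy.Char8 as B

-- Mirrors the documented constructor and its listed instances.
data Loc = Loc Int64 Int64
  deriving (Eq, Ord, Show)

-- Hypothetical helper: number of characters covered by a half-open Loc.
locLength :: Loc -> Int64
locLength (Loc start end) = end - start

-- Hypothetical helper: the text that a Loc refers to.
sliceLoc :: B.ByteString -> Loc -> B.ByteString
sliceLoc bs (Loc start end) = B.take (end - start) (B.drop start bs)
```

Under this reading, Loc 6 11 in "hello world" covers the five characters of "world".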

words_wloc :: (Char -> Bool) -> ByteString -> [(ByteString, Loc)]

Returns the words whose characters satisfy a predicate, along with their zero-based locations.
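
One way such a function could work is sketched below. It is an assumption-laden illustration rather than the module's implementation: the name wordsWlocSketch is hypothetical, and a word is taken to be a maximal run of characters satisfying the predicate, with a half-open Loc.

```haskell
import Data.Char (isAlpha)
import Data.Int (Int64)
import qualified Data.ByteString.Lazy.Char8 as B

data Loc = Loc Int64 Int64
  deriving (Eq, Ord, Show)

-- Sketch: walk the input, skipping separators and emitting each maximal
-- run of word characters together with its zero-based, half-open Loc.
wordsWlocSketch :: (Char -> Bool) -> B.ByteString -> [(B.ByteString, Loc)]
wordsWlocSketch isWordChar = go 0
  where
    go offset bs
      | B.null rest = []
      | otherwise   = (word, Loc start end) : go end rest'
      where
        (skipped, rest) = B.break isWordChar bs   -- leading separators
        (word, rest')   = B.span isWordChar rest  -- one word
        start           = offset + B.length skipped
        end             = start + B.length word

-- Example input: two leading spaces, then "foo", a comma, a space, "bar".
exampleWords :: [(B.ByteString, Loc)]
exampleWords = wordsWlocSketch isAlpha (B.pack "  foo, bar")
```

For this input the sketch yields "foo" at Loc 2 5 and "bar" at Loc 7 10.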

clump_regions :: [Loc] -> [Loc]

Cluster regions together if they are almost touching. Any regions within clump_distance characters of one another are joined. The result should have no overlaps.
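
The clustering step might look roughly like the following sketch. The value of clump_distance is not given here, so the sketch takes the distance as a parameter called dist. The name clumpSketch is hypothetical.

```haskell
import Data.Int (Int64)
import Data.List (sort)

data Loc = Loc Int64 Int64
  deriving (Eq, Ord, Show)

-- Sketch: sort by start index, then merge each region into its predecessor
-- when the gap between them is at most `dist` characters. Overlapping
-- regions have a non-positive gap, so they are always merged.
clumpSketch :: Int64 -> [Loc] -> [Loc]
clumpSketch dist = merge . sort
  where
    merge (Loc s1 e1 : Loc s2 e2 : rest)
      | s2 - e1 <= dist = merge (Loc s1 (max e1 e2) : rest)
      | otherwise       = Loc s1 e1 : merge (Loc s2 e2 : rest)
    merge locs = locs
```

With dist = 1, the regions Loc 10 12, Loc 0 3, Loc 4 6 and Loc 5 8 collapse to Loc 0 8 and Loc 10 12.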

combine_locs :: [Loc] -> Loc

Take the bounding box of a list of locations.
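
A bounding box over half-open locations is the smallest start paired with the largest end. The sketch below uses the hypothetical name combineSketch and, as an assumption, simply fails on an empty list. How the real function treats that case is not documented here.

```haskell
import Data.Int (Int64)

data Loc = Loc Int64 Int64
  deriving (Eq, Ord, Show)

-- Sketch: smallest start and largest end over all inputs.
-- Partial: calling it with an empty list raises an error.
combineSketch :: [Loc] -> Loc
combineSketch locs =
  Loc (minimum [s | Loc s _ <- locs]) (maximum [e | Loc _ e <- locs])
```

For example, combining Loc 4 9, Loc 1 3 and Loc 6 12 gives Loc 1 12.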

wordmapN :: (Char -> Bool) -> Int -> ByteString -> TupMap (Set Loc)

Form a map from words to the set of locations at which they occur within the bytestring. This version uses consecutive sequences of N words (represented as lists) as the keys instead of individual words.
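
The sketch below shows one way to build such a map from words that already carry locations. Several things are assumptions here: Data.Map stands in for the HashMap-based TupMap to keep dependencies small, each occurrence is recorded as the bounding Loc of its window, and n is at least 1. The name windowMapSketch is hypothetical.

```haskell
import Data.Int (Int64)
import Data.List (tails)
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Map.Strict as M
import qualified Data.Set as S

data Loc = Loc Int64 Int64
  deriving (Eq, Ord, Show)

-- Sketch: key every run of n consecutive words by the words themselves
-- and collect the bounding Loc of each occurrence. Assumes n >= 1.
windowMapSketch :: Int -> [(B.ByteString, Loc)] -> M.Map [B.ByteString] (S.Set Loc)
windowMapSketch n located =
  M.fromListWith S.union
    [ (map fst win, S.singleton (bounds (map snd win)))
    | win <- windows ]
  where
    -- all full-length windows, in order of appearance
    windows = [take n t | t <- tails located, length (take n t) == n]
    -- bounding Loc of one window
    bounds locs = Loc (minimum [s | Loc s _ <- locs]) (maximum [e | Loc _ e <- locs])
```

For the words a, b, a, b at Loc 0 1, Loc 2 3, Loc 4 5 and Loc 6 7 with n = 2, the key [a, b] maps to the two locations Loc 0 3 and Loc 4 7, and the key [b, a] maps to Loc 2 5.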

sliding_win :: Int -> [(ByteString, Loc)] -> [[(ByteString, Loc)]]

trim_separators :: (Char -> Bool) -> ByteString -> [Loc] -> [Loc]

The region of interest will end up bloated with separator characters around the edges. This will trim those down.
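
Trimming could be done along the lines of the sketch below. It assumes that the predicate identifies word characters, so that separators are the characters that fail it, and that Loc is half-open. The name trimSketch is hypothetical.

```haskell
import Data.Char (isAlpha)
import Data.Int (Int64)
import qualified Data.ByteString.Lazy.Char8 as B

data Loc = Loc Int64 Int64
  deriving (Eq, Ord, Show)

-- Sketch: shrink each Loc from both ends until it begins and ends on a
-- character satisfying the word predicate.
trimSketch :: (Char -> Bool) -> B.ByteString -> [Loc] -> [Loc]
trimSketch isWordChar bs = map trim
  where
    trim (Loc start end) =
      let region   = B.take (end - start) (B.drop start bs)
          leading  = B.length (B.takeWhile (not . isWordChar) region)
          trailing = B.length (B.takeWhile (not . isWordChar) (B.reverse region))
      in if leading >= B.length region
           then Loc start start            -- only separators: empty region
           else Loc (start + leading) (end - trailing)

-- Example: the whole of "  foo bar, " shrinks to the span of "foo bar".
exampleTrim :: [Loc]
exampleTrim = trimSketch isAlpha (B.pack "  foo bar, ") [Loc 0 11]
```

In the example the region Loc 0 11 is trimmed to Loc 2 9, which covers exactly "foo bar".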

print_diff_regions :: Bool -> ByteString -> [Loc] -> IO ()

Print the results, i.e. the distinct regions of text that appear in one file but not the other.

data Config

Constructors

Cfg

color :: Bool
word_sequence_size :: Int
case_insensitive :: Bool
with_punctuation :: Bool

Produced by Haddock version 2.6.1