Light-weight chart parser for multimodal type-logical grammars
Given an input file raw.txt, which can contain several paragraphs, the tokenization script splits it into sentences and words.
tokenize.tcl raw.txt > input.txt
You can use other tokenization scripts, but make sure the tokenization is the same as the one used for your supertagger.
As an example, we take the following input text.
Aussi ont-ils décidé d'eux-mêmes, mercredi, de se soumettre à l'arbitrage définitif des autorités monétaires.
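The tokenized file input.txt presumably contains one sentence per line with space-separated tokens; a tokenization consistent with the supertagger output shown below splits off the clitic -ils and the elided forms d' and l':
Aussi ont -ils décidé d' eux-mêmes , mercredi , de se soumettre à l' arbitrage définitif des autorités monétaires .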
Use your external supertagger to convert your input file into a supertag file; make sure it returns its results in the format of the Clark and Curran taggers.
The expected format for each word is
word|POS|N|Formula1|Prob1 ... FormulaN|ProbN
where word is the word, POS the assigned part-of-speech tag (using the TreeTagger tagset), and N the number of formulas given by the supertagger, followed by N occurrences of a formula and its probability.
This gives the following output for our example.
Aussi|ADV-ADV|1|dr(0,s,s)|0.9999778 ont|V-VER:pres|2|dr(0,dr(0,s,dl(0,np,s_ppart)),np)|0.9980058|dr(0,dr(0,s,np),dl(0,np,s_ppart))|0.0018172036 -ils|CLS-PRO:PER|1|np|0.99999213 décidé|VPP-VER:pper|9|dr(0,dl(0,np,s_ppart),dl(0,np,s_inf))|0.95984834|dr(0,dr(0,dl(0,np,s_ppart),dl(0,np,s_inf)),np)|0.012183049|dr(0,dl(0,np,s_ppart),pp_de)|0.0071665817|dr(0,dr(0,dl(0,np,s_ppart),dl(0,np,s_inf)),pp_a)|0.005813557|dr(0,dr(0,dl(0,np,s_ppart),pp_de),np)|0.0021970696|dr(0,dl(0,cl_r,dl(0,np,s_inf)),dl(0,np,s_inf))|0.0011594007|dr(0,dl(0,np,s),dl(0,np,s_inf))|0.0011494745|dr(0,dl(0,cl_r,dl(0,np,s)),dl(0,np,s_inf))|0.0010469758|dr(0,dl(0,cl_r,dl(0,np,s_ppart)),dl(0,np,s_inf))|0.0010047258 d'|P-PRP|2|dr(0,pp_de,np)|0.9892832|dr(0,pp_de,n)|0.010377718 eux-mêmes|PRO-PRO:PER|2|np|0.99766695|n|0.002082536 ,|PONCT-PUN|1|let|0.99999964 mercredi|NC-NOM|1|dl(1,s,s)|0.99971145 ,|PONCT-PUN|1|let|0.99999976 de|P-PRP|1|dr(0,dl(0,np,s_inf),dl(0,np,s_inf))|0.9999877 se|CLR-PRO:PER|1|cl_r|1.0 soumettre|VINF-VER:infi|1|dr(0,dl(0,cl_r,dl(0,np,s_inf)),pp_a)|0.99848455 à|P-PRP|1|dr(0,pp_a,np)|0.99999905 l'|DET-DET:ART|1|dr(0,np,n)|1.0 arbitrage|NC-NOM|1|n|1.0 définitif|ADJ-ADJ|1|dl(0,n,n)|1.0 des|P+D-PRP:det|1|dr(0,dl(0,n,n),n)|0.9999982 autorités|NC-NOM|1|n|1.0 monétaires|ADJ-ADJ|1|dl(0,n,n)|1.0 .|PONCT-PUN|1|dl(0,s,txt)|0.9996948
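For instance, the first token Aussi|ADV-ADV|1|dr(0,s,s)|0.9999778 assigns the word Aussi the part-of-speech tag ADV-ADV and a single formula dr(0,s,s) with probability 0.9999778, whereas the token for ont lists two candidate formulas, each with its probability.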
This output uses the MElt-TT format for part-of-speech tags; only the TreeTagger tags are used later.
The script supertag2pl converts the supertagger output into (unlemmatized) Prolog clauses for the parser. Clauses contain a list of items of the form ex_si(Word, Pos, Word, List), where List is a list of Formula-Probability pairs.
supertag2pl superpos.txt > superpos_nolem.pl
For our example, this script gives the following output.
sent(20, Result) :-
prob_parse([ ex_si('Aussi', adv-adv, 'Aussi', [dr(0,s,s)-0.9999778]),
ex_si(ont, v-ver:pres, ont, [dr(0,dr(0,s,dl(0,np,s_ppart)),np)-0.9980058,
dr(0,dr(0,s,np),dl(0,np,s_ppart))-0.0018172036]),
ex_si('-ils', cls-pro:per, '-ils', [np-0.99999213]),
ex_si(décidé, vpp-ver:pper, décidé, [dr(0,dl(0,np,s_ppart),dl(0,np,s_inf))-0.95984834,
dr(0,dr(0,dl(0,np,s_ppart),dl(0,np,s_inf)),np)-0.012183049,
dr(0,dl(0,np,s_ppart),pp_de)-0.0071665817,
dr(0,dr(0,dl(0,np,s_ppart),dl(0,np,s_inf)),pp_a)-0.005813557,
dr(0,dr(0,dl(0,np,s_ppart),pp_de),np)-0.0021970696,
dr(0,dl(0,cl_r,dl(0,np,s_inf)),dl(0,np,s_inf))-0.0011594007,
dr(0,dl(0,np,s),dl(0,np,s_inf))-0.0011494745,
dr(0,dl(0,cl_r,dl(0,np,s)),dl(0,np,s_inf))-0.0010469758,
dr(0,dl(0,cl_r,dl(0,np,s_ppart)),dl(0,np,s_inf))-0.0010047258]),
ex_si('d\'', p-prp, 'd\'', [dr(0,pp_de,np)-0.9892832,
dr(0,pp_de,n)-0.010377718]),
ex_si('eux-mêmes', pro-pro:per, 'eux-mêmes', [np-0.99766695, n-0.002082536]),
ex_si(',', ponct-pun, ',', [let-0.99999964]),
ex_si(mercredi, nc-nom, mercredi, [dl(1,s,s)-0.99971145]),
ex_si(',', ponct-pun, ',', [let-0.99999976]),
ex_si(de, p-prp, de, [dr(0,dl(0,np,s_inf),dl(0,np,s_inf))-0.9999877]),
ex_si(se, clr-pro:per, se, [cl_r-1.0]),
ex_si(soumettre, vinf-ver:infi, soumettre, [dr(0,dl(0,cl_r,dl(0,np,s_inf)),pp_a)-0.99848455]),
ex_si(à, p-prp, à, [dr(0,pp_a,np)-0.99999905]),
ex_si('l\'', det-det:art, 'l\'', [dr(0,np,n)-1.0]),
ex_si(arbitrage, nc-nom, arbitrage, [n-1.0]),
ex_si(définitif, adj-adj, définitif, [dl(0,n,n)-1.0]),
ex_si(des, p+d-prp:det, des, [dr(0,dl(0,n,n),n)-0.9999982]),
ex_si(autorités, nc-nom, autorités, [n-1.0]),
ex_si(monétaires, adj-adj, monétaires, [dl(0,n,n)-1.0]),
ex_si('.', ponct-pun, '.', [dl(0,s,txt)-0.9996948])], Result).
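The internals of supertag2pl are not shown here, but the following minimal sketch in SWI-Prolog illustrates the kind of conversion it performs for a single supertagger token; the predicates token_to_ex_si/2 and fp_pairs/2 are illustrative names, not part of the distribution.
% Illustrative sketch only (not the actual supertag2pl script): convert one
% token of the form Word|POS|N|F1|P1|...|FN|PN into an ex_si/4 term.
token_to_ex_si(Token, ex_si(Word, Pos, Word, Pairs)) :-
    split_string(Token, "|", "", [WordS, PosS, _N|FPs]),
    atom_string(Word, WordS),
    string_lower(PosS, PosLower),   % "ADV-ADV" -> "adv-adv"
    term_string(Pos, PosLower),     % parse as the Prolog term adv-adv
    fp_pairs(FPs, Pairs).

% Turn an alternating list of formula and probability strings into
% Formula-Probability pairs.
fp_pairs([], []).
fp_pairs([FS, PS|Rest], [F-P|Pairs]) :-
    term_string(F, FS),
    number_string(P, PS),
    fp_pairs(Rest, Pairs).

% Example query:
%   ?- token_to_ex_si("Aussi|ADV-ADV|1|dr(0,s,s)|0.9999778", T).
%   T = ex_si('Aussi', adv-adv, 'Aussi', [dr(0,s,s)-0.9999778]).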
It is preferable to use an external lemmatizer. Otherwise, the script lefff.pl provides basic functionality by looking up the word/POS-tag combination in the Lefff database; this is very slow.
The lemmatizer replaces the list entries of the form ex_si(Word, Pos, Word, List) by entries si(Word, Pos, Lemma, List). The lefff.pl script takes a file NAME_nolem.pl and returns a lemmatized file NAME.pl.
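A minimal sketch of this replacement step is shown below; lemma/3 stands in for the actual lookup (the Lefff database in the case of lefff.pl) and is purely illustrative.
% Illustrative sketch only: replace ex_si/4 entries by si/4 entries.
% lemma(Word, Pos, Lemma) is a hypothetical Word+POS -> Lemma lookup.
lemmatize_item(ex_si(Word, Pos, Word, List), si(Word, Pos, Lemma, List)) :-
    (   lemma(Word, Pos, Lemma)
    ->  true
    ;   Lemma = Word   % no lemma found: keep the word form
    ).

lemmatize_items(Items, Lemmatized) :-
    maplist(lemmatize_item, Items, Lemmatized).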
lefff.pl superpos_nolem.pl
The output of the lemmatizer (in the file superpos.pl) is shown below.
sent(20, A) :-
prob_parse(
[ si('Aussi', adv-adv, aussi, [dr(0, s, s)-0.9999778]),
si(ont,
v-ver:pres,
avoir,
[ dr(0, dr(0, s, dl(0, np, s_ppart)), np)-0.9980058,
dr(0, dr(0, s, np), dl(0, np, s_ppart))-0.0018172036
]),
si('-ils', cls-pro:per, ils, [np-0.99999213]),
si(décidé,
vpp-ver:pper,
décider,
[ dr(0, dl(0, np, s_ppart), dl(0, np, s_inf))-0.95984834,
dr(0, dr(0, dl(0, np, s_ppart), dl(0, np, s_inf)), np)-0.012183049,
dr(0, dl(0, np, s_ppart), pp_de)-0.0071665817,
dr(0, dr(0, dl(0, np, s_ppart), dl(0, np, s_inf)), pp_a)-0.005813557,
dr(0, dr(0, dl(0, np, s_ppart), pp_de), np)-0.0021970696,
dr(0, dl(0, cl_r, dl(0, np, s_inf)), dl(0, np, s_inf))-0.0011594007,
dr(0, dl(0, np, s), dl(0, np, s_inf))-0.0011494745,
dr(0, dl(0, cl_r, dl(0, np, s)), dl(0, np, s_inf))-0.0010469758,
dr(0, dl(0, cl_r, dl(0, np, s_ppart)), dl(0, np, s_inf))-0.0010047258
]),
si('d\'',
p-prp,
de,
[dr(0, pp_de, np)-0.9892832, dr(0, pp_de, n)-0.010377718]),
si('eux-mêmes',
pro-pro:per,
'eux-mêmes',
[np-0.99766695, n-0.002082536]),
si(',', ponct-pun, ',', [let-0.99999964]),
si(mercredi, nc-nom, mercredi, [dl(1, s, s)-0.99971145]),
si(',', ponct-pun, ',', [let-0.99999976]),
si(de,
p-prp,
de,
[dr(0, dl(0, np, s_inf), dl(0, np, s_inf))-0.9999877]),
si(se, clr-pro:per, se, [cl_r-1.0]),
si(soumettre,
vinf-ver:infi,
soumettre,
[dr(0, dl(0, cl_r, dl(0, np, s_inf)), pp_a)-0.99848455]),
si(à, p-prp, à, [dr(0, pp_a, np)-0.99999905]),
si('l\'', det-det:art, 'l\'', [dr(0, np, n)-1.0]),
si(arbitrage, nc-nom, arbitrage, [n-1.0]),
si(définitif, adj-adj, définitif, [dl(0, n, n)-1.0]),
si(des, p+d-prp:det, des, [dr(0, dl(0, n, n), n)-0.9999982]),
si(autorités, nc-nom, autorité, [n-1.0]),
si(monétaires, adj-adj, monétaire, [dl(0, n, n)-1.0]),
si('.', ponct-pun, '.', [dl(0, s, txt)-0.9996948])
],
A).
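In this file, each sentence has become a clause of the form sent(N, Result) :- prob_parse(Items, Result), where prob_parse/2 is presumably provided by the parser itself; assuming the parser code is loaded together with superpos.pl, an individual sentence could then also be parsed interactively with a query such as ?- sent(20, Result).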
Finally, the parser is run as follows.
grail_light_nd superpos.pl
This command tries to parse all sentences and outputs the results in a number of files:
semantics.pl contains the Prolog output giving the DRT semantics, both in unreduced and reduced versions.
proofs.pl contains the Prolog proofs (chart proofs, but they can be converted to natural deduction).
The proofs in proofs.pl are of the following form
proof(N, rule(RuleName, Structure, Formula-LambdaTerm, ListOfPremisses)).
where N is the sentence number, RuleName the name of the last rule of the proof, Structure the computed structure (a tree over the words in the sentence), Formula the goal formula, LambdaTerm the lambda term corresponding to the proof, and ListOfPremisses the list of proofs (all of the form rule(...)) that allow us to derive the conclusion.
Note that the easiest way to obtain the lambda term of the proof (without any substitutions) is simply to read it off the last line of the proof.
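As an illustration (assuming SWI-Prolog and that proofs.pl contains proof/2 facts of the form shown above), the goal formula and the lambda term of the complete proof can also be read off the outermost rule/4 term; the predicate name lambda_term_of_sentence/3 is illustrative only.
% Illustrative sketch, to be used after ?- consult('proofs.pl').
lambda_term_of_sentence(N, Formula, LambdaTerm) :-
    proof(N, rule(_RuleName, _Structure, Formula-LambdaTerm, _Premisses)).

% Example query:
%   ?- lambda_term_of_sentence(20, Formula, Term).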