Faculty of Language: What's in a Category? [Part 2]

Monday, July 21, 2014

What's in a Category? [Part 2]

Last week I wondered about the notion of syntactic category, aka part of speech (POS). My worry is that we have no clear idea what kind of work POS are supposed to do for syntax. We have some criteria for assigning POS to lexical items (LI) --- morphology, distribution, semantics --- but there are no clear-cut rules for how these are weighed against each other. Even worse, we have no idea why these are relevant criteria while plausible candidates such as phonological weight and arity seem to be irrelevant.¹ So what we have is an integral part of pretty much every syntactic formalism for which we cannot say

what exactly it encompasses,
why it is necessary,
why it shows certain properties but not others.

Okay, that's a pretty unsatisfying state of affairs. Actually, things are even more unsatisfying once you look at the issue from a formal perspective. But the formal perspective also suggests a way out of this mess.

The Formal Problem: POS Leak

Those of you who remember my 4-part discussion of the connection between constraints and Merge know that POS are actually the locus of an immense amount of power. Too much power, in fact. If it is a restricted theory of syntax you want, you need a theory of POS, because by default POS leak.

Don't worry, I'm not going to subject you to yet another tour de force through arcane logics and automata models, no, the basic point can be made in a much simpler fashion. Recall that the syntactic distribution of an LI is one of the criteria we use for determining its category, or rather, the category of the phrase it projects. That is to say, we obviously do not expect all LIs of category V to have the same distribution. For instance, transitives must be followed by a DP, intransitives must not. But we still treat them as Vs because the phrases they project have the same distribution. Wherever you have the VP slept, you should be able to substitute the VP killed Mary. But that is not the case for the VP killed herself, since John slept is fine yet John killed herself is not.² Thus the two VPs should actually have distinct categories, say, VP[-refl] and VP[+refl]. But if the label of a constituent is determined by the POS of its head, that means killed has different POS in killed Mary and killed herself. That is the basic idea that underlies my proof that constraints can be encoded directly in the POS of the grammar: both constraints and POS can be used to restrict the distribution of constituents, so we can switch between the two as we see fit.

This state of affairs is unsatisfying from a linguistic perspective. Distribution is a major criterion for assigning POS, yet this opens up a loop hole where categories can be freely refined to capture the distributions enforced by constraints. Categories and constraints become interchangeable, we can completely do away with one as long as we keep the other. We lose generalizations (the two VPs above become completely different syntactic objects), and we allow for massive overgeneration because categories act as a loop hole for smuggling in unwanted constraints. All because we do not have the understanding of POS that would allow us to restrict their use in an insightful way.

UG POS Doesn't Answer the Question

Now it might be tempting to plug the POS leak by fixing the set of POSs across all grammars. This was suggested by Norbert at some point, but by now you'll hopefully agree with me that this is not a helpful way to go about it.

Let's assume that fixing the POS across all languages would actually stop them from leaking by limiting category refinement (I don't think it would, but that's not the point here). The crux is that this step would weaken the notion of POSs even more, to the point of being completely useless. Basically, POSs would be mostly tied to inflectional morphology (an extra-syntactic property, mind you) and syntactic distribution modulo the syntactic constraints. That is to say, we look at the distribution of a given constituent, factor out those parts of the distribution that are already accounted for by independent constraints, and then pick the POS that accounts for the remainder (by assumption one of the fixed POSs should fit the remainder).

But what exactly does this achieve? Without a fixed set of constraints, we can tweak the grammar until the remainder is 0, leaving us with no need whatsoever for POS --- and by extension labeling. And any fixed set of constraints is ultimately arbitrary because we have no principled way of distinguishing between constraints and POS. In addition, it is rather unlikely that small set of constraints Minimalists are likely to post would yield a small set of POS that captures the full range of distribution, as is evidenced by the huge number of POS used in the average tree bank.

So what's the moral here? The link between POSs and constraints is not a funny coincidence or a loop hole that needs to be worked around. It reveals something fundamental about the role POS serve in our theory, but the fact that the connection to constraints also opens the floodgates of overgeneration shows that our notion of POS lacks clear delimiting boundaries; which shouldn't come as a surprise, it is a hodge podge of morphological and syntactic properties without a clear operational semantics in our linguistic theories.

Categories, Categories, and Categories

Now that we have established that the problem of POS isn't just one of scientific insight but also has real repercussions for the power of the formalism, what can we do about it? Until a few weeks ago, I wouldn't have had much to offer except for a shrug. But then Alex Clark, a regular in our prestigious FoL comments section, presented some very exciting work at this year's LACL on an algebraic view of categories (joint work with Ryo Yoshinaka, not available online yet). So now I can offer you some first results coupled with my enthusiastic interpretation of where we could take this.

Before we set out for our voyage through the
~~treacherous~~delightful depths of algebra, however, it is prudent to sharpen our vocabulary. Until now I've been using category to denote an LI's POS or the category of its projected phrase. This makes sense in your standard X'-setting where phrases are labeled via projection of POS. But in a framework without projection, or where projection is not directly tied to POS, we need to be more careful. Let's adopt the following terminology instead (ah, making up new jargon, the scientist's favorite pastime, in particular if it's only to be used for a few paragraphs).

A POS is a specific type of feature on an LI, e.g. V, N, A, T, C, ...
A l[exical]-category is the full feature specification of an LI. In Minimalist grammars, for instance, this includes the POS and all Merge and Move features (as well as their respective order).
A p[rojection]-category is the label assigned to an interior node, for instance V' or TP.

In a framework like standard Minimalist grammars where all dependencies are encoded via features, an LI's l-category fully determines its syntactic distribution. Its POS represents exactly that part of the distribution that is not accounted for by the other features of its l-category. That's not news, it simply reminds us of the fact discussed above that in MGs POS have no contribution to make except for restricting an LI's distribution. We're interested in a bigger question, namely what constitutes a valid POS-system? And while Alex C's work does not directly touch on that, it hints at a possible solution.

Algebra and Categories

At first sight, it seems that Alex's work has nothing to tell us about POS in Minimalist syntax. For one thing, him and Ryo are looking at Multiple Context-Free Grammars (MCFGs) rather than Minimalist grammars, let alone the standard concept of POS that goes beyond mere distribution facts. And more puzzlingly, POS never enter the picture. Since they are working with MCFGs, their concept of category is the MCFG-concept, which amounts to what I called p-categories above. What they show is a particular correspondence between syntactic distribution and p-categories of specific MCFGs. More precisely, if one takes a string language L that can generated by a multiple context-free grammar³ and groups strings into equivalence classes in some specific fashion building on their distribution, what one gets is a rich algebraic structure (a residuated lattice, for those in the know) where each node corresponds to a specific category of the smallest MCFG that generates this language L.

Okay, this was pretty dense, let's break it up into two main insights:

If one operates with the most succinct MCFG possible, its categories correspond directly to specific distribution classes.
Those distribution classes are not just a flat collection of atomic entities, they form part of a highly structured system that we can describe and study from an algebraic perspective.

Point 1 isn't all that relevant here. It's useful for learnability because it means that a learner only needs to pay attention to distribution in order to infer the right categories, and it has technical applications when it comes to converting grammars according to specific criteria --- definitely useful, but not something that would wow a crowd of linguists. Point 2, on the other hand, is truly exciting to me because it highlights what is wrong with the standard view of POS.

Linguists treat POS as equivalence classes --- the lexicon is partitioned by the set of POSs. This invariably leads to a view of POSs as atomic units that are unrelated to each other and cannot be analyzed any further. Hence there are no principled limits one how the lexicon can be carved up by POSs. We can't do much more than write down a list of valid POSs. In other words, we are limited to substantive universal with no attack vector for formal universals. It also means that the only way the concept of POS can be made meaningful is to link each POS to specific properties, which in practice mostly happen to be morphological. But that can only be done if the set of POSs is fixed across languages, which limits the flexibility of POS and once again enforces a substantive universals treatment of POS.

However, there is no reason why we should think of the lexicon as a collection of partitions. Imagine that the lexicon is ordered instead such that a < b iff LI b can be selected by any LI c that selects LI a.⁴ The verb kill would no longer be marked as a V selecting a DP, but just have an entry that shows it can select the (assuming that the is the determiner with the most permissive distribution). This is not a particularly new idea, of course, it just takes Bare Phrase Structure to its logical conclusion: there are no POS at all, only LIs. At the same time, it still allows us to express generalization across multiple LIs by virtue of the ordering relation that holds between them. In a certain sense this also captures the intuition of David Adger's suggestion that lexical items may themselves contain syntactic structure, except that we reencode the entailments of the internal structure as an ordering over the entire lexicon. Irrespective of how closely these three ideas are actually connected, the essential point is that we can think of the lexicon as an algebraic object. And algebraic structure can be studied, classified, characterized, manipulated and altered in all kinds of ways.

Towards a Solution: An Algebraic Theory of POS

Suppose that we identify a property P that is satisfied by all natural language lexicons, some basic property of the ordering of LIs. Then P would be an indispensable requirement that needs to be preserved no matter what. In this case a constraint can be coded into the lexicon only if the necessary refinement does not destroy property P. And the really neat thing is that constraints can themselves be represented algebraically via tree automata, so this all reduces to the problem of combining algebraic structures while preserving property P --- a cute math problem.

But there's more. Remember how I pointed out in my last post that we should have an explanation as to why phonology and arity do not matter for assigning POS, whereas morphology does? If POS are atomic object, this is just baffling. But if our primary object of study isn't POS but a structured lexicon, the answer might be that the orders that could by induced by these notions do not satisfy our mystical property P. Conversely, morphology and distribution should have something in common that does give rise to P, or at least does not destroy it.

Finally, POS could be given an operational semantics in terms of succinct encoding similar to what Alex C found for MCFG categories. Maybe POS correspond to nodes in the smallest algebraic structure from which one can assemble the fully structured lexicon. Or maybe they denote in a structure that is slightly lossy with respect to information (not as fine-grained as necessary). It all depends on what the linguistic practice actually turns out to be. But that is the cool thing about this perspective: exploring what linguists are doing when they assign POS reduces to figuring out how they are condensing the lexicon into a more compact (but possibly lossy) structure.

So there's tons of neat things we could be doing, and none of it is particularly challenging on a mathematical level. Alas, there is one thorny issue that currently prevents this project from getting off the ground (beyond limited time and resources, that is): Just what the heck is our mystery property P? All of the applications above assume that we already know how to restrict the orderings between LIs, and we clearly don't. Well, such is the fate of promising new ideas, you usually have to start from square 1.

Somewhat ironically, the correspondence between POS and constraints that got me to worry about the role of POS in the first place might be of great help here. As linguists, we have a pretty good idea of what kind of constraints we do not want in our grammar, e.g. the number of nodes is a multiple of 5. We can assume, then, that P is a property that would be lost if we compiled these unwanted constraints into the lexicon. That's a first hint, and if we pool that with other insights about selection and how syntactic scales and orderings tend to work, we might be able to approximate P to a sufficient degree to get some exploratory work started. Here's hoping I'll have a less speculative story for you a couple of years from now.

Actually, phonological weight has some use in predicting the split between lexical and functional categories, with the former being on average longer than the latter. Personally, I'm inclined to attribute this to extra-grammatical factors --- functional elements are more frequent, and the more frequent a word, the shorter it tends to be. But that doesn't change the fact that phonological weight has some predictive power, yet we do not put it on the same level as morphology as distribution. The intuition, it seems, is that those two are proper syntactic criteria in some sense.↩
One could of course argue that what is spelled out as killed herself is actually the more abstract VP killed REFL and that John killed REFL is perfectly fine. But you can cook up other examples involving NPIs, PPIs, movement, whatever floats your boat. The basic point is that the category of a constituent does not fully predict its syntactic distribution, which is uncontroversial in any framework that incorporates long-distance dependencies of some kind.↩
Actually their result pertains only to 2-MCFGs, the weakest kind of MCFGs. Generalizing the result to arbitrary MCFGS doesn't seem particularly difficult, though.↩
Similar ideas are used in Meaghan Fowlie's treatment of adjunction, and programmers will recognize this as coercive subtyping. Also note that the order is a pre-order but not a partial order. That is to say, a < a holds for every a (reflexivity), and a < b and b < c jointly imply a < c (transitivity). But it is not the case that a < b and b < a implies a = b (no antisymmetry). So two words can have the same distribution and still be distinct, cf. groundhog and woodchuck. Preorders seem to be a very natural construct in natural language, they form the foundation of hyperintensional semantics and also arise organically in the algebraic view of the PCC.↩

38 comments:

Alex ClarkJuly 21, 2014 at 11:55 PM
I have put a copy of this paper up, accessible from here. Pagination differs etc.
ReplyDelete
Replies
Emily M. BenderJuly 22, 2014 at 7:48 AM
It seems to me that what you're describing is closely related to ideas that have been around in unification based frameworks (GPSG, LFG, HPSG) for a long time (i.e. since the 1980s, at least). In GPSG and HPSG especially, the labels on the nodes of the trees are taken to be complex feature structures, i.e. bundles of constraints. (In LFG, the analogous structures show up on the f-structure rather than the c-structure.) This allows for the (precise) description of overlapping classes of items (words, phrases) while also maintaining any relevant distinctions.
ReplyDelete
Replies
Alex ClarkJuly 23, 2014 at 1:14 AM
Exactly -- GPSG told us, among other things, that the number of nonterminals we would need in an adequate CFG would be vast, so you need some featural representation. So where do the features come from? And why do we have one set of features rather than another? There don't seem to be principled answers to these questions at the moment -- which occur with HPSG and MGs as well.

But the representation theory of these canonical algebras seems like it could give a non-arbitrary answer. And answer the Chomskyan critique of PSGs or at least the part of it that says "these productions are arbitrary".
ReplyDelete
Replies
AveryAndrewsJuly 24, 2014 at 6:09 PM
There is no theory here that I'm aware of, but the Feature Committee of the LFG Pargram Project group has a list of ones that tend to be useful, across what is beginning to be a decent spread of languages. HPSG's LINGO must have something similar.
ReplyDelete
Replies
AveryAndrewsJuly 24, 2014 at 6:10 PM
Alex' paper linked to above seems a bit rugged for beginners, what would be the best thing(s) to read for preparation?
ReplyDelete
Replies
Emily M. BenderJuly 25, 2014 at 9:15 AM
@Alex, @Thomas: I think there are some presuppositions that I don't share with you because I'm not understanding the motivation for your question. From my point of view, the models we build are formal models that I expect to capture the generalizations we see in the data. (Where the primary data are the acceptability of individual sentences and paraphrase and other semantic relations between sentences.) It doesn't follow from that that each piece of the model itself must be rooted in anything outside of syntax. POS corresponds to the idea that there are general classes of words that share important aspects of their distribution, even though we also see subclasses (some of which overlap with each other). Why do we need any further motivation for this set of distinctions? (From a typological perspective, there is the interesting question of whether we can align such classes across languages, but there are approaches to that as well, at least for practical purposes.)

ReplyDelete
Replies
AveryAndrewsJuly 26, 2014 at 1:30 AM
BTW David Gil believes that certain urban dialects of Indonesian, such as Riau and Jakarta, have no parts of speech distinctions, e.g. one PoS. I don't think one is obliged to believe this, but the claim has been made, and he gives concrete idea of how such a language could work (there's more than just this floating around, but these languages are very hard to work on because you can't get judgements out of the speakers, who will correct everything you suggest to the standard).

http://www.lingvistika.cz/download/knihovna/gil_how_much.pdf
ReplyDelete
Replies
AveryAndrewsJuly 26, 2014 at 5:13 AM
& for Alex's syntacticconceptlattice paper, are there some missing sub/superscripts or something in the last paragraph of p7? It doesn't quite parse for me (could be me, not it).
ReplyDelete
Replies
AveryAndrewsAugust 5, 2014 at 5:49 PM
I'll try to test my understanding of the PoS leak problem with an ulta-simple example, using a theory of 'PSG with limited features', a bit like a grossly oversimplified version of LFG.

Suppose we have PS rules, and our task is to produce a grammar of English noun phrases, and we have just bumped into the problem of singular vs plural nouns. If all we have is classic PS rules, we need to split the NP rule like this (blogger doesn't support standard linguistics layout, so using disjunctive '|' instead):

NP -> {(Detsg) (AdjP)* Nsg (PP)* | (Detpl) (Adjp)* Npl (PP)*}

Scientifically, we know that this is wrong because languages
never change in such a way as to wind up with different NP
structures for different combinations of inflectional features
(gender, number, case, definiteness), which is what we would
expect if there were really multiple copies of the rules (this
is basically Martin Davies' (1987 Mind96:441-462) argument for Chris Peacocke's "level 1.5" of explanation, using naturally occurring language change rather than implausible and unethical surgeries).

So, if we're teaching class that will eventually wind up at LFG, we might at this point introduce 'features', including NUM with values SG and PL, and propose that the Det and N nodes can be annotated with something like '=' meaning 'share all the features' (looking ahead to when similar stuff comes for German or Italian etc). Now we can
revise the NP rule to:

NP = Det (AdjP)* N (PP)*
= =

But we haven't addressed the problem of what makes the learner chose the right grammar rather than the wrong one. We can introduce some kind of evaluation metric, and observe that the correct rule is shorter.

But the margin of victory for the right rule over the wrong one becomes slimmer as the lexicon gets bigger, and why should we assume that learners care about small differences in grammar size?

Indeed, if we don't do things right, it might get reversed rather quickly as the lexicon gets larger. For the wrong grammar, we might have:

the:Detsg; the:Detpl; this:Detsg; these:Detpl; ...
dog:Nsg; dogs:Npl; cat:Nsg; cats:Npl; ...

whereas for the right one, we might have:
the: Det; this:Det, NUM SG; these:Det, NUM PL; ...
dog:N, NUM SG; dogs:N, NUM PL; cat:N, NUM SG; cats:N, NUM PL; ...

which might count as bigger since the nouns are being specified for two things rather than just one. Or maybe not, if we count in information-theoretically sophisticated way, since the number of possibilities specified for each lexical item is the same.

But, the point is, contemporary syntactic theory has not sorted this kind of issue out in a generally accepted way, and it is therefore not clear why syntactic theory that needs to be able to split quantifiers into subcategories for 'many', 'enough' and 'galore', with their different ordering possibilities, can't do a similar thing for the noun phrases.
ReplyDelete
Replies

Add comment

Faculty of Language

Comments