Monday, July 21, 2014

What's in a Category? [Part 2]

Last week I wondered about the notion of syntactic category, aka part of speech (POS). My worry is that we have no clear idea what kind of work POS are supposed to do for syntax. We have some criteria for assigning POS to lexical items (LI) --- morphology, distribution, semantics --- but there are no clear-cut rules for how these are weighed against each other. Even worse, we have no idea why these are relevant criteria while plausible candidates such as phonological weight and arity seem to be irrelevant.1 So what we have is an integral part of pretty much every syntactic formalism for which we cannot say
  • what exactly it encompasses,
  • why it is necessary,
  • why it shows certain properties but not others.
Okay, that's a pretty unsatisfying state of affairs. Actually, things are even more unsatisfying once you look at the issue from a formal perspective. But the formal perspective also suggests a way out of this mess.

The Formal Problem: POS Leak

Those of you who remember my 4-part discussion of the connection between constraints and Merge know that POS are actually the locus of an immense amount of power. Too much power, in fact. If it is a restricted theory of syntax you want, you need a theory of POS, because by default POS leak.

Don't worry, I'm not going to subject you to yet another tour de force through arcane logics and automata models; no, the basic point can be made in a much simpler fashion. Recall that the syntactic distribution of an LI is one of the criteria we use for determining its category, or rather, the category of the phrase it projects. That is to say, we obviously do not expect all LIs of category V to have the same distribution. For instance, transitives must be followed by a DP, whereas intransitives must not be. But we still treat them as Vs because the phrases they project have the same distribution. Wherever you have the VP slept, you should be able to substitute the VP killed Mary. But that is not the case for the VP killed herself, since John slept is fine yet John killed herself is not.2 Thus the two VPs should actually have distinct categories, say, VP[-refl] and VP[+refl]. But if the label of a constituent is determined by the POS of its head, that means killed has a different POS in killed Mary and killed herself. That is the basic idea that underlies my proof that constraints can be encoded directly in the POS of the grammar: both constraints and POS can be used to restrict the distribution of constituents, so we can switch between the two as we see fit.
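
To make the recoding concrete, here's a tiny toy sketch in Python. Everything in it (the mini-grammar, the category names, the generate helper) is invented purely for illustration; it gives the flavor of the refinement trick, not the construction from the actual proof.

from itertools import product

# Toy illustration of the "POS leak": a constraint on reflexives gets recoded
# as a refinement of the category VP. Grammars and features are invented.

def generate(grammar, cat):
    """Enumerate all strings of a given category in a finite toy grammar."""
    for rhs in grammar.get(cat, []):
        if isinstance(rhs, str):                      # terminal entry
            yield rhs
        else:                                         # rule expanding into other categories
            expansions = [list(generate(grammar, c)) for c in rhs]
            for combo in product(*expansions):
                yield " ".join(combo)

# One undifferentiated category VP: overgenerates "John killed herself".
unrefined = {
    "S":   [("DPm", "VP"), ("DPf", "VP")],
    "VP":  ["slept", ("V", "DPf"), ("V", "REFLf")],
    "V":   ["killed"], "DPm": ["John"], "DPf": ["Mary"], "REFLf": ["herself"],
}

# VP split into VP[-refl] and VP[+reflf]: the constraint "a feminine reflexive
# needs a feminine subject" is now pure category talk.
refined = {
    "S":           [("DPm", "VP[-refl]"), ("DPf", "VP[-refl]"), ("DPf", "VP[+reflf]")],
    "VP[-refl]":   ["slept", ("V", "DPf")],
    "VP[+reflf]":  [("V", "REFLf")],
    "V":   ["killed"], "DPm": ["John"], "DPf": ["Mary"], "REFLf": ["herself"],
}

print("John killed herself" in set(generate(unrefined, "S")))   # True: the leak
print("John killed herself" in set(generate(refined, "S")))     # False: constraint compiled into categories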

This state of affairs is unsatisfying from a linguistic perspective. Distribution is a major criterion for assigning POS, yet this opens up a loophole where categories can be freely refined to capture the distributions enforced by constraints. Categories and constraints become interchangeable: we can do away with one entirely as long as we keep the other. We lose generalizations (the two VPs above become completely different syntactic objects), and we allow for massive overgeneration because categories act as a loophole for smuggling in unwanted constraints. All because we do not have the understanding of POS that would allow us to restrict their use in an insightful way.

UG POS Doesn't Answer the Question

Now it might be tempting to plug the POS leak by fixing the set of POSs across all grammars. This was suggested by Norbert at some point, but by now you'll hopefully agree with me that this is not a helpful way to go about it.

Let's assume that fixing the POS across all languages would actually stop them from leaking by limiting category refinement (I don't think it would, but that's not the point here). The crux is that this step would weaken the notion of POSs even more, to the point of being completely useless. Basically, POSs would be mostly tied to inflectional morphology (an extra-syntactic property, mind you) and syntactic distribution modulo the syntactic constraints. That is to say, we look at the distribution of a given constituent, factor out those parts of the distribution that are already accounted for by independent constraints, and then pick the POS that accounts for the remainder (by assumption one of the fixed POSs should fit the remainder).

But what exactly does this achieve? Without a fixed set of constraints, we can tweak the grammar until the remainder is 0, leaving us with no need whatsoever for POS --- and by extension labeling. And any fixed set of constraints is ultimately arbitrary because we have no principled way of distinguishing between constraints and POS. In addition, it is rather unlikely that the small set of constraints Minimalists tend to posit would yield a small set of POS that captures the full range of distributions, as is evidenced by the huge number of POS used in the average treebank.

So what's the moral here? The link between POSs and constraints is not a funny coincidence or a loophole that needs to be worked around. It reveals something fundamental about the role POS serve in our theory. But the fact that the connection to constraints also opens the floodgates of overgeneration shows that our notion of POS lacks clear delimiting boundaries. That shouldn't come as a surprise: it is a hodgepodge of morphological and syntactic properties without a clear operational semantics in our linguistic theories.

Categories, Categories, and Categories

Now that we have established that the problem of POS isn't just one of scientific insight but also has real repercussions for the power of the formalism, what can we do about it? Until a few weeks ago, I wouldn't have had much to offer except for a shrug. But then Alex Clark, a regular in our prestigious FoL comments section, presented some very exciting work at this year's LACL on an algebraic view of categories (joint work with Ryo Yoshinaka, not available online yet). So now I can offer you some first results coupled with my enthusiastic interpretation of where we could take this.

Before we set out for our voyage through the treacherous (sorry, delightful) depths of algebra, however, it is prudent to sharpen our vocabulary. Until now I've been using category to denote an LI's POS or the category of its projected phrase. This makes sense in your standard X'-setting where phrases are labeled via projection of POS. But in a framework without projection, or where projection is not directly tied to POS, we need to be more careful. Let's adopt the following terminology instead (ah, making up new jargon, the scientist's favorite pastime, in particular if it's only to be used for a few paragraphs).
  • A POS is a specific type of feature on an LI, e.g. V, N, A, T, C, ...
  • A l[exical]-category is the full feature specification of an LI. In Minimalist grammars, for instance, this includes the POS and all Merge and Move features (as well as their respective order).
  • A p[rojection]-category is the label assigned to an interior node, for instance V' or TP.
In a framework like standard Minimalist grammars, where all dependencies are encoded via features, an LI's l-category fully determines its syntactic distribution. Its POS represents exactly that part of the distribution that is not accounted for by the other features of its l-category. That's not news; it simply reminds us of the fact discussed above that in MGs POS have no contribution to make except for restricting an LI's distribution. We're interested in a bigger question: what constitutes a valid POS-system? And while Alex C's work does not directly touch on that, it hints at a possible solution.
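
For concreteness, here's a rough sketch of how one might spell out the three notions in code. The notation is assumed for illustration only (it is not any official MG implementation), but it shows how the POS is just one feature inside the larger l-category.

from dataclasses import dataclass

@dataclass(frozen=True)
class LexicalItem:
    phon: str         # phonological form
    features: tuple   # the l-category: the LI's full, ordered feature bundle

# '=d' = select a DP; 'v' = the POS feature; Move features would sit in the same
# tuple. All feature names here are placeholders.
kill  = LexicalItem("kill",  ("=d", "=d", "v"))   # transitive: selects two DPs
sleep = LexicalItem("sleep", ("=d", "v"))         # intransitive: selects one DP

def l_category(li):
    """The l-category is the entire (ordered) feature specification."""
    return li.features

def pos(li):
    """The POS is the category feature proper; by convention here, the last one."""
    return li.features[-1]

# p-categories (V', vP, TP, ...) are labels of interior nodes; in a projection-based
# setting they are computed from the head's POS rather than stored on any LI.
print(pos(kill), l_category(kill))   # v ('=d', '=d', 'v')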

Algebra and Categories

At first sight, it seems that Alex's work has nothing to tell us about POS in Minimalist syntax. For one thing, he and Ryo are looking at Multiple Context-Free Grammars (MCFGs) rather than Minimalist grammars, let alone the standard concept of POS that goes beyond mere distribution facts. And more puzzlingly, POS never enter the picture. Since they are working with MCFGs, their concept of category is the MCFG-concept, which amounts to what I called p-categories above. What they show is a particular correspondence between syntactic distribution and the p-categories of specific MCFGs. More precisely, if one takes a string language L that can be generated by a multiple context-free grammar3 and groups strings into equivalence classes in a specific fashion based on their distribution, what one gets is a rich algebraic structure (a residuated lattice, for those in the know) where each node corresponds to a specific category of the smallest MCFG that generates this language L.

Okay, this was pretty dense, let's break it up into two main insights:
  1. If one operates with the most succinct MCFG possible, its categories correspond directly to specific distribution classes.
  2. Those distribution classes are not just a flat collection of atomic entities, they form part of a highly structured system that we can describe and study from an algebraic perspective.
Point 1 isn't all that relevant here. It's useful for learnability because it means that a learner only needs to pay attention to distribution in order to infer the right categories, and it has technical applications when it comes to converting grammars according to specific criteria --- definitely useful, but not something that would wow a crowd of linguists. Point 2, on the other hand, is truly exciting to me because it highlights what is wrong with the standard view of POS.
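
To get a feel for point 1, here's a little sketch that groups the substrings of a made-up finite language by the contexts they occur in. This is the plain string version of the idea, not the MCFG construction from the paper, and the toy language is of course invented.

from collections import defaultdict

# Distribution classes over a toy finite language: two substrings land in the
# same class iff they occur in exactly the same (left, right) contexts.
language = {"john slept", "mary slept", "john killed mary", "mary killed john",
            "john killed john", "mary killed mary"}

def substrings(sentence):
    words = sentence.split()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            yield " ".join(words[i:j])

def contexts(sub):
    """All (left, right) contexts in which sub occurs across the language."""
    out, k = set(), len(sub.split())
    for s in language:
        words = s.split()
        for i in range(len(words) - k + 1):
            if " ".join(words[i:i + k]) == sub:
                out.add((" ".join(words[:i]), " ".join(words[i + k:])))
    return frozenset(out)

classes = defaultdict(set)
for s in language:
    for sub in substrings(s):
        classes[contexts(sub)].add(sub)

for members in classes.values():
    print(sorted(members))
# 'john' and 'mary' share a class (DP-like), as do 'slept', 'killed john' and
# 'killed mary' (VP-like); the full sentences form a class of their own.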

Linguists treat POS as equivalence classes --- the lexicon is partitioned by the set of POSs. This invariably leads to a view of POSs as atomic units that are unrelated to each other and cannot be analyzed any further. Hence there are no principled limits on how the lexicon can be carved up by POSs. We can't do much more than write down a list of valid POSs. In other words, we are limited to substantive universals with no attack vector for formal universals. It also means that the only way the concept of POS can be made meaningful is to link each POS to specific properties, which in practice mostly happen to be morphological. But that can only be done if the set of POSs is fixed across languages, which limits the flexibility of POS and once again enforces a substantive-universals treatment of POS.

However, there is no reason why we should think of the lexicon as a collection of partitions. Imagine that the lexicon is ordered instead such that a < b iff LI b can be selected by any LI c that selects LI a.4 The verb kill would no longer be marked as a V selecting a DP, but would just have an entry that shows it can select the (assuming that the is the determiner with the most permissive distribution). This is not a particularly new idea, of course, it just takes Bare Phrase Structure to its logical conclusion: there are no POS at all, only LIs. At the same time, it still allows us to express generalizations across multiple LIs by virtue of the ordering relation that holds between them. In a certain sense this also captures the intuition of David Adger's suggestion that lexical items may themselves contain syntactic structure, except that we reencode the entailments of the internal structure as an ordering over the entire lexicon. Irrespective of how closely these three ideas are actually connected, the essential point is that we can think of the lexicon as an algebraic object. And algebraic structure can be studied, classified, characterized, manipulated and altered in all kinds of ways.
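
Here's a quick toy rendering of that ordering; the selection facts in it are invented just to make the point.

# LI a <= b iff every LI that selects a also selects b, i.e. b's distribution
# subsumes a's as far as selection is concerned. All facts below are made up.
selects = {
    "the":    {"dog", "groundhog", "woodchuck"},
    "kill":   {"the", "every"},    # pretend 'kill' combines with either determiner
    "devour": {"the"},             # pretend 'devour' is pickier
}

def selectors(li):
    return {head for head, args in selects.items() if li in args}

def leq(a, b):
    return selectors(a) <= selectors(b)

print(leq("every", "the"), leq("the", "every"))                      # True False
print(leq("groundhog", "woodchuck"), leq("woodchuck", "groundhog"))  # True True
# The relation is reflexive and transitive but not antisymmetric: 'groundhog' and
# 'woodchuck' order both ways yet remain distinct LIs, so we get a preorder.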

Towards a Solution: An Algebraic Theory of POS

Suppose that we identify a property P that is satisfied by all natural language lexicons, some basic property of the ordering of LIs. Then P would be an indispensable requirement that needs to be preserved no matter what. In this case a constraint can be coded into the lexicon only if the necessary refinement does not destroy property P. And the really neat thing is that constraints can themselves be represented algebraically via tree automata, so this all reduces to the problem of combining algebraic structures while preserving property P --- a cute math problem.

But there's more. Remember how I pointed out in my last post that we should have an explanation as to why phonology and arity do not matter for assigning POS, whereas morphology does? If POS are atomic objects, this is just baffling. But if our primary object of study isn't POS but a structured lexicon, the answer might be that the orders that could be induced by these notions do not satisfy our mystical property P. Conversely, morphology and distribution should have something in common that does give rise to P, or at least does not destroy it.

Finally, POS could be given an operational semantics in terms of succinct encoding similar to what Alex C found for MCFG categories. Maybe POS correspond to nodes in the smallest algebraic structure from which one can assemble the fully structured lexicon. Or maybe they denote in a structure that is slightly lossy with respect to information (not as fine-grained as necessary). It all depends on what the linguistic practice actually turns out to be. But that is the cool thing about this perspective: exploring what linguists are doing when they assign POS reduces to figuring out how they are condensing the lexicon into a more compact (but possibly lossy) structure.

So there's tons of neat things we could be doing, and none of it is particularly challenging on a mathematical level. Alas, there is one thorny issue that currently prevents this project from getting off the ground (beyond limited time and resources, that is): Just what the heck is our mystery property P? All of the applications above assume that we already know how to restrict the orderings between LIs, and we clearly don't. Well, such is the fate of promising new ideas, you usually have to start from square 1.

Somewhat ironically, the correspondence between POS and constraints that got me to worry about the role of POS in the first place might be of great help here. As linguists, we have a pretty good idea of what kind of constraints we do not want in our grammar, e.g. the number of nodes is a multiple of 5. We can assume, then, that P is a property that would be lost if we compiled these unwanted constraints into the lexicon. That's a first hint, and if we pool that with other insights about selection and how syntactic scales and orderings tend to work, we might be able to approximate P to a sufficient degree to get some exploratory work started. Here's hoping I'll have a less speculative story for you a couple of years from now.

  1. Actually, phonological weight has some use in predicting the split between lexical and functional categories, with the former being on average longer than the latter. Personally, I'm inclined to attribute this to extra-grammatical factors --- functional elements are more frequent, and the more frequent a word, the shorter it tends to be. But that doesn't change the fact that phonological weight has some predictive power, yet we do not put it on the same level as morphology and distribution. The intuition, it seems, is that those two are proper syntactic criteria in some sense.
  2. One could of course argue that what is spelled out as killed herself is actually the more abstract VP killed REFL and that John killed REFL is perfectly fine. But you can cook up other examples involving NPIs, PPIs, movement, whatever floats your boat. The basic point is that the category of a constituent does not fully predict its syntactic distribution, which is uncontroversial in any framework that incorporates long-distance dependencies of some kind.
  3. Actually their result pertains only to 2-MCFGs, the weakest kind of MCFGs. Generalizing the result to arbitrary MCFGs doesn't seem particularly difficult, though.
  4. Similar ideas are used in Meaghan Fowlie's treatment of adjunction, and programmers will recognize this as coercive subtyping. Also note that the order is a preorder but not a partial order. That is to say, a < a holds for every a (reflexivity), and a < b and b < c jointly imply a < c (transitivity). But it is not the case that a < b and b < a jointly imply a = b (no antisymmetry). So two words can have the same distribution and still be distinct, cf. groundhog and woodchuck. Preorders seem to be a very natural construct in natural language; they form the foundation of hyperintensional semantics and also arise organically in the algebraic view of the PCC.

38 comments:

  1. I have put a copy of this paper up, accessible from here. Pagination differs etc.

    ReplyDelete
  2. It seems to me that what you're describing is closely related to ideas that have been around in unification based frameworks (GPSG, LFG, HPSG) for a long time (i.e. since the 1980s, at least). In GPSG and HPSG especially, the labels on the nodes of the trees are taken to be complex feature structures, i.e. bundles of constraints. (In LFG, the analogous structures show up on the f-structure rather than the c-structure.) This allows for the (precise) description of overlapping classes of items (words, phrases) while also maintaining any relevant distinctions.

    ReplyDelete
  3. Exactly -- GPSG told us, among other things, that the number of nonterminals we would need in an adequate CFG would be vast, so you need some featural representation. So where do the features come from? And why do we have one set of features rather than another? There don't seem to be principled answers to these questions at the moment -- which occur with HPSG and MGs as well.

    But the representation theory of these canonical algebras seems like it could give a non-arbitrary answer. And answer the Chomskyan critique of PSGs or at least the part of it that says "these productions are arbitrary".

    ReplyDelete
  4. There is no theory here that I'm aware of, but the Feature Committee of the LFG Pargram Project group has a list of ones that tend to be useful, across what is beginning to be a decent spread of languages. HPSG's LINGO must have something similar.

    ReplyDelete
    Replies
    1. @Emily, Alex, Avery: AVMs in GPSG and HPSG do not solve the actual problem, though, because they still include POS as one particular feature. All other features, however, have a straightforward morphological and semantic interpretation (person, number, tense, aspect, etc.; gender is a little quirky, but still fits the general theme). POS don't, they are as inscrutable in GPSG/HPSG/LFG as they are in GB and Minimalism.

      But I guess one interpretation of Emily's remark is that we should expect my algebraic proposal for POS to be similar to a system where each POS is itself a complex AVM. That would be very similar to David Adger's idea that we need a decompositional approach to POS (and lexical items in general). From this perspective, the advantage of the algebraic approach is that it would yield the relevant generalizations without committing us to a specific way of encoding them in our theory (e.g. AVMs or functional hierarchies).

      Delete
    2. Perhaps I'm missing something. Doesn't the multiple-inheritance hierarchy employed in HPSG give a useful algebraic structure not tied to a particular encoding? An item X's ARG-STR lists the smallest type that can be selected by X, and any item of greater or equal type can be selected by X just as well. If you think morphological and semantic features don't belong, take the hierarchy you get after throwing them away. Maybe it's better to have lexical items rather than lexical types in the hierarchy. But if so, just look at what lexical types have been given to lexical items, replace the lexical types in the hierarchy with lexical items, and stitch it all up. Won't this give a very similar picture?

      Delete
    3. @Josh: Yes, it's a similar picture in that HPSG implicitly incorporates coercive subtyping and thus induces orderings over lexical items and constituents.

      But I want to know what the properties of these orderings are for natural language, and as far as I can tell HPSG has nothing principled to say about this. The formalism allows for just about any order between lexical items, including none at all (give every lexical item its own POS, make sure every SUBCAT feature specifies one unique POS).

      Delete
    4. It's a fair point that, at least as far as I know, there hasn't been much theorizing in the HPSG community about what properties the orderings should have. But there's a difference between the formalism allowing any ordering and practicing linguists using every ordering. A lot of work in linguistics has proceeded by describing data in overly powerful formalisms, seeing what is and isn't used, and then developing a formalism that hits the right spot. If you're looking for a property P shared by the relevant ordering over lexical items in every language, wouldn't one very important starting point be to look at descriptively adequate orderings over lexical items in multiple languages? For that task, what is a better starting point than the large-scale inheritance hierarchies built for multiple languages within HPSG?

      Delete
    5. @Josh: In principle yes, but since HPSG grammars tend to be fairly complex (a necessary evil of wide coverage), it might be more difficult to figure out what properties they encode implicitly than to start from scratch with some very basic axioms and see where they go wrong. That's also preferable for the simple reason that the property P identified by HPSG might be so complex that it proves very hard to study mathematically, so we would have to make a lot of simplifying assumptions anyways in order to gain any traction.

      Still, even if HPSG might not be a magic bullet for this problem, I concede that there may be some important empirical generalizations hidden in the HPSG literature.

      Delete
  5. Alex' paper linked to above seems a bit rugged for beginners, what would be the best thing(s) to read for preparation?

    ReplyDelete
    Replies
    1. I'm afraid there is none. There is no easy intro to MCFGs (although the formalism is a pretty natural extension of CFGs), and residuated lattices aren't discussed in any surveys I know of, either. There's two intros to algebra and lattices for linguists, though: 1) Partee et al's Mathematical Methods in Linguistics, and 2) a textbook manuscript by Keenan and Moss. But it will take quite some time to get through those, and it's still not enough to fully appreciate the technical aspects of the paper.

      This is part of a bigger problem I lamented a while ago: Mathematical linguistics needs a lot more introductory material. I'm pretty sure that the lack of intermediate level material --- a bridge from the mathematical basics to the advanced concepts used in the actual literature --- is a major reason why we have failed to attract a bigger audience. Imagine a mathematics curriculum where you would have to go straight from calculus to differential topology. It'd be insanity.

      Delete
    2. Addendum: The last chapter of Makoto Kanazawa's lecture notes on formal grammar discusses MCFGs. It's still more technical than it needs to be, but maybe you'll find it helpful.

      That being said, it's great that you're trying to work through the paper. That's pretty much how I learned it all, reading and rereading papers over and over again and working through self-compiled reading lists until I finally understood what was going on.

      Delete
    3. @Avery: that paper is too dense, I agree. The equivalent paper for CFGs is maybe easier (non-final version) -- and the slides I used are less technical (slides)

      Delete
    4. @Benjamin, Alex: Oh boy, how come I didn't know that one yet? That's one nifty intro paper. Thanks for the link.

      Delete
    5. Having had a first pass through Alex' MCFG paper, I'll try to test my comprehension by proposing that a plausible expectation of mild context sensitivity for natural language is that it should be impossible to scramble/split a constituent such as an NP or nonfinite VP into an unbounded number of pieces located in different places. Many cases of scrambling that can be observed in texts involve pieces of NPs appearing either at the end of the clause or at the beginnings of maximal projections (S, NP, VP, PP etc), but we should not expect one piece of an NP to appear at the front of a subordinate clause and another at the front of the main clause, in addition to one in basic position (ie Enormous_i reported the newspaper that venenous_i a child found snake_i in the bathtub) (because, a part of an NP can be 'fated' to appear in some specific higher discourse position, but only one of each kind). This is consistent with what I've noticed about Homeric Greek.

      Delete
    6. @Avery: Yes, that is one prediction of the MCS hypothesis. It has been claimed that scrambling in German thus contradicts the MCS hypothesis. The argument goes as follows: a) scrambling in German allows for extraction of arbitrarily deeply embedded constituents, and b) the arguments of a verb can be scrambled out of infinitival TPs, hence c) holds: in a structure with nested infinitival TPs, an unbounded number of DPs can be scrambled into the highest TP.

      Joshi et al argue that this is overly generous and the strong performance limitations on scrambling we observe actually reflect a limitation of the grammar. I never found that story particularly convincing as I think the argument above can be debunked on syntactic grounds alone: non-local scrambling is horrible if it involves two DPs that can't be distinguished by certain features (case, animacy, definiteness, etc.), and since it seems that there is only a finitely bounded number of those, only a finitely bounded number of DPs can be scrambled at the same time.

      I feel like the idea that moving constituents must be uniquely identifiable by some feature has also been entertained in the Minimalist literature, but I can't think of any specific proposals right now.

      Delete
  6. @Alex, @Thomas: I think there are some presuppositions that I don't share with you because I'm not understanding the motivation for your question. From my point of view, the models we build are formal models that I expect to capture the generalizations we see in the data. (Where the primary data are the acceptability of individual sentences and paraphrase and other semantic relations between sentences.) It doesn't follow from that that each piece of the model itself must be rooted in anything outside of syntax. POS corresponds to the idea that there are general classes of words that share important aspects of their distribution, even though we also see subclasses (some of which overlap with each other). Why do we need any further motivation for this set of distinctions? (From a typological perspective, there is the interesting question of whether we can align such classes across languages, but there are approaches to that as well, at least for practical purposes.)

    ReplyDelete
    Replies
    1. @Emily -- so I am interested ultimately in finding or trying to construct models of the cognitively real representations, and so acquisition is crucial. I reject (on methodological grounds) the idea that the syntactic categories are innate and so we have to account for their acquisition. The categories must therefore be rooted, directly or indirectly, in the data available to the child.

      But this way of thinking about it may be oversimplified -- there could be explanations that don't fall neatly on one side or the other of the innate/learned divide. If anyone knows any...?

      But if you are trying to "capture the generalizations" or patterns that you see in the data, then what set of categories you use will be driven presumably by other more practical considerations. There are I guess arguments that one should use a principled methodology for constructing the categories in this case, but they aren't very convincing.

      Delete
    2. @Emily: the models we build are formal models that I expect to capture the generalizations we see in the data
      The crux is what one thinks qualifies as capturing the generalizations. It's not enough for a formalism to assign each sentence an approximately correct acceptability value, the formalism must also be as general and non-stipulative as possible (that's why formal universals are always preferable to substantive universals, which boil down to writing arbitrary lists), and it must be clear how the formalism does what it does. I think we actually agree on all these points, they're fairly uncontroversial in science as well as engineering.

      Now the thing about POS is that they open up major loopholes in our formalisms and we have no principled theory of POS that would allow us to plug those holes. That is the issue I'm most worried about: not whether POS are a convenient way of capturing generalizations, not whether they can be derived from extragrammatical factors, but that we have no theory of POS to limit their power in the right way. Like, at all.

      Add to the mix the learnability issue pointed out by Alex, the question why POS cluster with certain properties (e.g. morphology) but not others (e.g. syllable structure), why something like POS should exist in the first place (why is there no language with only 1 POS?), and you've got yourself a very juicy research problem.

      Delete
    3. @Thomas: "It's not enough for a formalism to assign each sentence an approximately correct acceptability value, the formalism must also be as general and non-stipulative as possible (that's why formal universals are always preferable to substantive universals, which boil down to writing arbitrary lists), and it must be clear how the formalism does what it does." Actually, I don't think we agree. Where does the force behind your "must" in that statement come from? That is, why must those things be true? I think it depends on what you expect your formal model to do. If, like @Alex, you expect your formal model to be cognitively real, then I think I see the point of disagreement: I don't see how the methodology of formal syntax (working with what we take to be primary data) can lead to models that correspond directly to what's in our wet-ware. If that's the interest, then you've got to be looking at evidence as close as possible to actual human processing: psycholinguistic experiments, neurolinguistic experiments, actual details of language acquisition (rather than abstract learnability). What formal syntax can do to help answer these big questions is to build coherent, comprehensive, working models that approximate, as best we can, the knowledge acquired when one acquires a language. These can help (I hope) illuminate the search space for those designing (expensive, tricky) experiments in actual human processing.

      Put a different way: Even if "Occam's Razor" is a good rule of thumb for science and engineering in general, I don't see how it gets us closer to cognitively real models if we're not starting from cognitive data: Biology is messy---why should grammars encoded in actual human brains be optimally simple? Also: Optimizing on the simplicity of the grammar (fewest number of categories, rules, etc) often (always?) comes at the expense of complications in processing (longer derivations, more complicated structures derived from the rules). Why should we expect human brains to put a premium on storage in this way?

      @Alex: "The categories must therefore be rooted, directly or indirectly, in the data available to the child." Why do you assume that syntactic distributional facts aren't in the data available to the child? That is, why do all syntactic categories have to be rooted outside syntax?

      Delete
    4. What came to worry me in the later years of my teaching syntax was how to distinguish the traditional 'parts of speech' from the various other kinds of features and properties that people recognize (inflectional features, presence of gaps, presence of anaphors ...). So what I came up with was the idea that the PoS features encode (a) the internal word-order possibilities for the constituent (b) the inflectional features that can be expressed on those words, eg NPs in Russian have gender, number & case with their well-known values. (c) more loosely, have a connection to the distribution of gaps, anaphors and various other things. But this is (a) not a mathematical understanding, but an intuitive one. They also have major effects on the external distribution, but the behavior of case features makes that a bit messy.

      Delete
    5. @Emily: ""The categories must therefore be rooted, directly or indirectly, in the data available to the child." Why do you assume that syntactic distributional facts aren't in the data available to the child? That is, why do all syntactic categories have to be rooted outside syntax?"

      The distributional facts are, on this model, the only facts that are available to the child. I think I misunderstood what you meant by "outside syntax" ... for me the syntax is the internal-to-the-child grammar if you like, and the raw surfacey distributional facts are external, and accessible to the child who is learning. This paper is about how under weak assumptions the internal syntactic categories will correspond to external distributional categories.

      Delete
    6. @Emily again: "Even if "Occam's Razor" is a good rule of thumb for science and engineering in general, I don't see how it gets us closer to cognitively real models if we're not starting from cognitive data: Biology is messy---why should grammars encoded in actual human brains be optimally simple? Also: Optimizing on the simplicity of the grammar (fewest number of categories, rules, etc) often (always?) comes at the expense of complications in processing (longer derivations, more complicated structures derived from the rules). "

      I agree completely with this -- though I would say that grammaticality judgments are cognitive data. But I think the cognitively real grammars may be very different, and potentially much larger and more redundant than the grammars that generative grammarians contemplate. The trade off you describe between grammar size and complexity is I think crucially important.

      Delete
    7. @Emily: I believe we're talking past each other, or at the very least I have trouble figuring out the actual point of disagreement.

      If you want to use your formalism primarily for engineering, you want to know as much about it as possible (because that will get you new techniques for implementation, normalization, extension, and so on). So then you also want to do more with POS than just treating them as inscrutable, arbitrary partitions of the lexicon. Alex C's paper does a good job highlighting the potential engineering advantages of studying POS.

      If you're mostly interested in scientific explanation, then POS matter because, among other things, your formalism has to carve out the right typology. That's provably not the case if anything goes with respect to POS. So you either fix the POS, which is unappealing and might be empirically infeasible, or you try to figure out what constitutes a valid system of POS.

      All of this is completely independent of any ontological commitments that one might want to attach to the formalism.

      Delete
    8. @Avery: That's indeed the cluster of properties that are somehow connected to POS. I think on an intuitive level, POS aren't all that mystical, linguists use them on a daily basis and there's usually very little disagreement on how to classify things (excluding contentious cases such as the status of adjectives in Navajo).

      But many things work fairly well on an intuitive level yet break quickly if you start probing deeper. I've already explained why I think POS are broken and what is missing. So now we have to actually work out a formal theory, and then it will be interesting to see how that relates to the intuition you sketch.

      Delete
    9. A partial answer I'd suggest would be 'modest UG' (meaning that we think there is a bias needed for language acquisition, but don't make grand claims about exactly what it is or where it comes from) with an evaluation metric for grammars, such that the duplication involved in accommodating agreement etc with the PS rules rules out these solutions given the existence of more highly-valued ones using different kinds of rules for agreement.

      The way this would work for LFG is that the c-structure rules (ID/LP) would have access only to the PoS features (and maybe CASE, but probably not for a clean theory), whereas agreement features would appear only in lexical items, with their distribution stated in terms of grammatical functions (and, occasionally, perhaps, notions based on 'f-precedes', a relation on f-structures induced by the c-structure). So the optimal way to get 'this dog/these dogs/*this dogs/*these dog' is to put NUM SG|PL on the nouns and determiners rather than split the NP rule.

      Putting it a bit more generally, we say that there are two kinds of features: those that are attributed to the c-structure level and mentioned in c-structure rules, and those that are attributed to f-structure, which don't appear in c-structure rules, but are 'cheap' to attribute to lexical items. The former in addition arguably have subclassification but not cross-classification, while cross-classification is a striking feature of the agreement features, noticed at least as long ago as Ancient Greece (Aristotle seems to have gotten phi features on nouns pretty much right, although he made a mess of tense/aspect/mood on verbs).

      There is I think a lot further to go in order to turn this into a real theory that generates predictions (projection of grammars from PLD), but, I do get the feeling that recent work on statistical and formal language theory has put a lot of new interesting-looking items on the workbench, which somebody sufficiently clever might be able to assemble.

      Delete
    10. @Alex: I confess I haven't read your paper, and am just reacting to the blog post and the discussion. From your most recent replies, it sounds like we are substantially in agreement. But: you write "I reject (on methodological grounds) the idea that the syntactic categories are innate and so we have to account for their acquisition. The categories must therefore be rooted, directly or indirectly, in the data available to the child." and also "The distributional facts are, on this model, the only facts that are available to the child." If the data available to the child motivates categorizing words into something that corresponds to what linguists call POS, what does it matter whether there is anything else that those categories can be reduced to or that motivates those categories? (That is the central question of the original post, as I see it.)

      Delete
    11. @Thomas: "If you want to use your formalism primarily for engineering, you want to know as much about it as possible (because that will get you new techniques for implementation, normalization, extension, and so on). So then you also want to do more with POS than just treating them as inscrutable, arbitrary partitions of the lexicon. Alex C's paper does a good job highlighting the potential engineering advantages of studying POS." I do use my formalism primarily for engineering, and a distributional notion of part of speech, without worrying about the categories being "inscrutable" has served just fine. If the partition of the lexicon they provide works for the phenomena analyzed by the grammar, good. If the next phenomena we try to add causes problems for the current conception of POS categories, we revisit it. The benefit of the engineering approach is that we can then test the new system against all previously analyzed sentence types.

      @Thomas cont: "If you're mostly interested in scientific explanation, then POS matter because, among other things, your formalism has to carve out the right typology. That's provably not the case if anything goes with respect to POS. So you either fix the POS, which is unappealing and might be empirically infeasible, or you try to figure out what constitutes a valid system of POS." From my perspective, all the formalism "has to" do is (1) be formally well-defined such that we (with or without the aid of computers) can calculate the predictions it makes with respect to particular strings, independently of what we think the right answer should be (2) be sufficiently flexible that it can be used to state (and then contrast) different theories. There seem to be several assumptions packed into your use of "scientific explanation", including the idea that there is one set of POS that predicts some typology. Why do POS systems have to be shared across all languages in order to participate in "scientific explanation"? What are the possible sources of explanatory power, for you?

      Delete
    12. @Emily: Why do POS systems have to be shared across all languages? They don't. But among the class of all logically possible POS systems, there is a proper subclass of those that are instantiated by natural language. And we know this is proper because without restrictions on how you assign POS, you predict that there are languages where trees are well-formed iff their number of nodes is a multiple of 17 (this follows from the reducibility of constraints definable in monadic second-order logic to POS). There are no such languages, hence there are some restrictions on what counts as a licit POS system. And it would be nice to know what those are. Not just out of curiosity but because this subclass will exhibit properties that the bigger class lacks and that we might be able to exploit for various purposes, including machine learning.

      If the partition of the lexicon they provide works for the phenomena analyzed by the grammar, good. If the next phenomena we try to add causes problems for the current conception of POS categories, we revisit it.
      That's fine for incremental improvements in coverage, but not much more than that. It's always important to also push our understanding of the formalisms to create completely new tools and techniques. That's pretty much what theoretical computer science is all about, and it has had a clear positive effect on the engineering side of CS.

      If the data available to the child motivates categorizing words into something that corresponds to what linguists call POS, what does it matter whether there is anything else that those categories can be reduced to or that motivates those categories? (That is the central question of the original post, as I see it.)
      First of all, the central question of this blog post is even more basic, namely what it is POS do in linguistic theories. That POS for you are just distribution classes is all nice and dandy, but that's not the role they serve in most formalisms, as I painstakingly pointed out. And that's why your whole conditional is moot: POS as used by linguists don't just correspond to distribution classes, and the algebraic connection between categories and contexts that Alex establishes does not at all carry over to POS in an obvious fashion.

      So another way of phrasing my question is: Given that there are genuine linguistic concerns about POS, and that Alex has this interesting perspective on MCFG categories (which are similar to POS but not exactly the same), can we combine the two to address the linguistic issues, and if so, what might that look like, and what would follow from it on a technical level?

      Delete
  7. BTW David Gil believes that certain urban dialects of Indonesian, such as Riau and Jakarta, have no parts of speech distinctions, i.e. one PoS. I don't think one is obliged to believe this, but the claim has been made, and he gives a concrete idea of how such a language could work (there's more than just this floating around, but these languages are very hard to work on because you can't get judgements out of the speakers, who will correct everything you suggest to the standard).

    http://www.lingvistika.cz/download/knihovna/gil_how_much.pdf

    ReplyDelete
    Replies
    1. Interesting, thanks for the link. If we already had a structural theory of POS, we could evaluate the 1POS analysis according to whether it fits the template or not. Yet another item to put on my overly long to do list.

      Delete
  8. & for Alex's syntactic concept lattice paper, are there some missing sub/superscripts or something in the last paragraph of p7? It doesn't quite parse for me (could be me, not it).

    ReplyDelete
  9. I'll try to test my understanding of the PoS leak problem with an ultra-simple example, using a theory of 'PSG with limited features', a bit like a grossly oversimplified version of LFG.

    Suppose we have PS rules, and our task is to produce a grammar of English noun phrases, and we have just bumped into the problem of singular vs plural nouns. If all we have is classic PS rules, we need to split the NP rule like this (blogger doesn't support standard linguistics layout, so using disjunctive '|' instead):

    NP -> {(Detsg) (AdjP)* Nsg (PP)* | (Detpl) (Adjp)* Npl (PP)*}

    Scientifically, we know that this is wrong because languages never change in such a way as to wind up with different NP structures for different combinations of inflectional features (gender, number, case, definiteness), which is what we would expect if there were really multiple copies of the rules (this is basically Martin Davies' (1987, Mind 96:441-462) argument for Chris Peacocke's "level 1.5" of explanation, using naturally occurring language change rather than implausible and unethical surgeries).

    So, if we're teaching a class that will eventually wind up at LFG, we might at this point introduce 'features', including NUM with values SG and PL, and propose that the Det and N nodes can be annotated with something like '=' meaning 'share all the features' (looking ahead to when similar stuff comes for German or Italian etc). Now we can revise the NP rule to:

    NP -> Det (AdjP)* N (PP)*
          =           =

    But we haven't addressed the problem of what makes the learner choose the right grammar rather than the wrong one. We can introduce some kind of evaluation metric, and observe that the correct rule is shorter.

    But the margin of victory for the right rule over the wrong one becomes slimmer as the lexicon gets bigger, and why should we assume that learners care about small differences in grammar size?

    Indeed, if we don't do things right, it might get reversed rather quickly as the lexicon gets larger. For the wrong grammar, we might have:

    the:Detsg; the:Detpl; this:Detsg; these:Detpl; ...
    dog:Nsg; dogs:Npl; cat:Nsg; cats:Npl; ...

    whereas for the right one, we might have:
    the: Det; this:Det, NUM SG; these:Det, NUM PL; ...
    dog:N, NUM SG; dogs:N, NUM PL; cat:N, NUM SG; cats:N, NUM PL; ...

    which might count as bigger since the nouns are being specified for two things rather than just one. Or maybe not, if we count in an information-theoretically sophisticated way, since the number of possibilities specified for each lexical item is the same.

    But, the point is, contemporary syntactic theory has not sorted this kind of issue out in a generally accepted way, and it is therefore not clear why syntactic theory that needs to be able to split quantifiers into subcategories for 'many', 'enough' and 'galore', with their different ordering possibilities, can't do a similar thing for the noun phrases.

    ReplyDelete
    Replies
    1. @Avery: it is therefore not clear why syntactic theory that needs to be able to split quantifiers into subcategories for 'many', 'enough' and 'galore', with their different ordering possibilities, can't do a similar thing for the noun phrases.

      Yes, that's pretty close to what I consider the problem of "POS leakage". But it's not so much that the Det category can be more fine-grained than N, but that we have no systematic restriction on the granularity of POS in general. If we want good empirical coverage, we need very fine-grained POS, as you correctly point out. But if we can have very fine-grained POS, why can't I have V_e and V_o that only differ with respect to whether the projected VP contains an odd or even number of nodes?

      Your intuition that it has to do with the complexity of the grammar might be on the right track, but at the very least this complexity cannot be measured in terms of the size of the lexicon (and the lexicon is the grammar in a lexicalized framework such as MGs). Here's why:

      In order to encode how many nodes a constituent contains, we need to know how many nodes each of its arguments contains, so their POS must be split into e and o subcategories, too. But by induction the same split must be done for the arguments of this argument, and so on. So the size of the lexicon could at least double (depending on the number of arguments per head).

      The same blow-up obtains if we want to locally encode the fact that a constituent contains a reflexive by splitting each X into X[-refl] and X[+refl]. The latter is a linguistically attested distinction (if you want to handle Principle A in a strictly local fashion, those categories are indispensable), the former is not. So what's the difference between the two? Whatever it is, potential blow-up in the size of the lexicon is an unlikely candidate imho.
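
      Since the counting is doing the work here, here's a rough made-up sketch of the blow-up; the toy lexicon and the simplifying assumption that each head contributes exactly one node are purely for illustration.

from itertools import product

# To track node-count parity in the categories, each selecting LI needs one
# variant per combination of argument parities. Entries below are invented.
base_lexicon = {
    "sleep": ("=D", "V"),          # one argument
    "kill":  ("=D", "=D", "V"),    # two arguments
    "the":   ("=N", "D"),
    "dog":   ("N",),
}

def parity_refine(lexicon):
    refined = {}
    for phon, feats in lexicon.items():
        selectees = [f for f in feats if f.startswith("=")]
        cat = feats[-1]
        for parities in product((0, 1), repeat=len(selectees)):
            own = (1 + sum(parities)) % 2          # the head's node plus its arguments
            new_feats = tuple(f"{f}{p}" for f, p in zip(selectees, parities))
            refined[(phon,) + parities] = new_feats + (f"{cat}{own}",)
    return refined

print(len(base_lexicon), "->", len(parity_refine(base_lexicon)))   # 4 -> 9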

      Delete
    2. In LFG, the sketch answer would be that there is also a binding theory along the lines of the one in Dalrymple (1993) (which can be seen as a typologically elaborated version of Chomsky's original), so that doubling the number of PoS categories would be the wrong solution for reflexives as well as for agreement categories. Part of the story also has to be different rule/constraint formats for the different kinds of features, so that PoS features:

      1. Do get mentioned in Immediate Domination/Linear Order constraints
      2. Do get mentioned in statements of what inflectional categories *can* be manifested on a word (not 'must', because of defective inflection).
      3. Do not have access to coreference information.
      (contents of list motivated by general reflection on typology)

      The sketch needs to be filled in with something that demonstrably works, by either learning grammars from data, or at least identifying the best one from amongst given alternatives (and, pace Emily, I think this implies making psychological reality claims).

      Delete
    3. This comment has been removed by the author.

      Delete
    4. Continuing the above, I'd suggest not worrying too much about the even-only language, because it might be in some sense formally possible in terms of grammatical theory alone, but not found for other reasons, such as being neither (a) useful for anything whatsoever nor (b) capable of being evolved by normal diachronic processes from anything else that is useful. Working out the details of how more-or-less correct grammars can be acquired from the kinds and amounts of data that they actually seem to be acquired from therefore seems to me to be a better way to go ... my sg/pl NP example is one of the simplest cases of this nature that I can think of.

      Delete