That is why I won't blog about this research quite yet (except for the shameless self-promotion above) and instead focus on the talks I heard, rather than the one I gave. Don't get me wrong, many of them were very interesting to me on a technical level; some of them even pierced my 90s habitus of acerbic cynicism and got me a bit excited. Quite generally, a fun time was had by all. But the talks made me aware of a gaping hole in my understanding of the field, a hole that one of you (I believe we have some readers with serious modelling chops) may be able to plug for me: Just what is the point of cognitive modelling?
Advantages of Modelling
Don't get me wrong, I understand why modelling can be useful, and human cognition is one of the most interesting things to study. But from where I'm standing, the two simply do not fit together. Or rather, I think what people are trying to do cannot be done by modelling and instead requires an approach grounded in mathematics --- theorems and proofs.

Let's first outline some clear advantages of developing computational models:
- Proof of concept
Your idea might sound batshit crazy, but if it can be turned into a working model that handles a wide range of problem instances, that demonstrates a certain level of sophistication. So maybe we shouldn't dismiss it right away.
- Getting results
Models are great from a utilitarian point of view: you have a problem, and your model solves it for you. You want to know if tomorrow's picnic will be a pleasant sunshine siesta or a rainy rancor trigger? Let's feed the data into our weather model and see what it has to say.
- Testing for problems
A model can test a much bigger set of data than any group of humans can, so models are a great way of hunting for holes in your theory.
Why Do We Study Cognition?
Just because modelling has advantages doesn't mean that its advantages are of much use for a given area: a wine cooler is a nifty thing to keep in your kitchen, but it's pretty worthless at a rehab clinic. In the case of cognitive modelling, it really depends on what you are trying to achieve. For computational linguists, cognition might be just another pesky quirk of humans that makes language needlessly complicated for computers. They just need an efficient method for constructing discourse representations, making semantic associations, and whatever else you need to simulate human-like understanding and usage of language. Given such a tool, they also need to verify that it works for a wide variety of industrial applications. Both issues are covered by advantages 2 and 3 above, so modelling does indeed fit the bill.

But I, for one, do not care that much about cognition as a problem for non-sentient machines. I am interested in how human cognition works and, most importantly, why it doesn't work in different ways. More boldly: what makes human cognition human?
If that's your main interest, it's not enough to show that some model works for a given problem. The important questions are:
- Is it guaranteed to succeed on every problem in a given problem space? In formal terms: is it sound and complete?
- Why does its solution to the problem actually work --- what does that tell us about the problem?
- How is the workload distributed across the assumptions and techniques your model incorporates?
- Are there different ways of doing it? Can we translate between different models in an automatic fashion?
But even if you're not ready to take that extreme stance, a model by itself is still a very boring thing and provides little insight. What matters is its relation to other models --- those that succeed as well as those that fail. Now since that is an infinite class, the standard strategy of designing and testing models via simulations won't be able to settle any of these questions. If you need to understand the structure of an infinite object, you are firmly within the realm of theorems and proofs. And I don't see any of that in the cognitive modelling community.
The Argument Against Theorems and Proofs
I suppose one reply to this little rant of mine would be that in an ideal world a proof-based approach would indeed be preferable, but the problem is simply too complex to be studied in this fashion; just as you can't prove many things about the behavior of leaves blowing in the wind, the system is too intricate for rigorous analysis. So rather than fruitlessly toiling away for hundreds of years, we accept the limitations of the approach and sacrifice a little bit of rigor for a huge increase in data coverage.

To this I have two replies, one personal and one more objective. On a personal level, one of my guiding credos is that a question that cannot be answered in a satisfying manner is not worth asking. So if the problems the cognitive modellers are trying to solve are indeed too complicated to be studied in an insightful manner (according to my high standards of what counts as insightful), then they simply aren't worth studying scientifically (the engineering angle is still perfectly viable, though). Pretty black and white, but a simple(minded) view of things is comforting once in a while.
More generally, though, my hunch is that the reply itself rests on an equivocation. The problems themselves may indeed be complicated, but your models are mathematical objects. So simplify them, figure out the math for the simple cases, and keep expanding your results until you reach the level of the original models again. In many cases we are dealing with the construction of special cases of hypergraphs that are evaluated over a probabilistic semiring. That's not exactly the epitome of mathematical inscrutability. Why, then, don't we see any work along those lines? Or is this actually a big research area, and the only thing to blame is my ignorance of the field?
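To give a flavor of what I mean, here is a rough sketch in the style of semiring parsing; the notation is generic and deliberately not tied to any particular model in the literature:

```latex
% Generic sketch, not tied to any specific model: a semiring is a structure
%   \langle K, \oplus, \otimes, \bar{0}, \bar{1} \rangle,
% and the probability semiring instantiates it as
%   \langle \mathbb{R}_{\geq 0}, +, \times, 0, 1 \rangle.
% A weighted hypergraph assigns each hyperedge e a weight w(e) \in K; a
% derivation D (a hyperpath) and an item v are then evaluated as
\[
  w(D) = \bigotimes_{e \in D} w(e),
  \qquad
  V(v)  = \bigoplus_{D \in \mathcal{D}(v)} w(D),
\]
% so questions about the model become questions about finitely specified
% algebraic objects, which is exactly where theorems and proofs apply.
```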
I'm not sure I understand what's at issue here. What would be an insightful cognitive modeling result in your opinion?
Let's take my own work as a starting point, because it, too, currently falls short of what I'd really like it to be. What Brad and I have done so far is simply take a few metrics over derivation trees and test what kind of predictions they make for a few phenomena.
But what I'd really like this to become eventually is a theory of mappings that define preorders over derivation trees such that one can prove things like "under mapping M derivations matching regular expression R will be ranked higher than those matching expression S", or when exactly two mappings produce the same preorder, or how derivation trees can be altered to improve their ranking under R without changing the phrase structure trees, and so on.
For now I'm happy with the few simple results we have, because I'm more like an engineer in this case: as long as there is some progress on bringing processing effects to bear on your choice of syntactic analysis, that solves part of the problem. But I would prefer for the solution to be a lot more general.
Ironically, I think I liked your paper more than you did. It's a neat direction, and I think everyone in the CMCL community would be very interested if you could show that structure-sensitive memory metrics do a better job of predicting reading times than simple linear-order-based metrics (as in the Dependency Locality Theory), and perhaps use those metrics to figure out the correct representation of that structure.
I wasn't able to follow the desiderata for a cognitive model that you mention in the second paragraph of your comment, could you try to explain what you mean by preorder, ranking, etc?
I wasn't able to follow the desiderata for a cognitive model
Sorry about that, I made it more complicated than it really needs to be. There are three problems with the project in its current state, and some of them already came up during the CMCL discussion:
1) Why these particular evaluation metrics for processing difficulty?
2) How do the metrics and derivational structures interact to yield certain processing predictions?
3) Can we keep the psycholinguistic predictions the same while altering the structure and/or metrics?
Now from an abstract perspective, what we're trying to do is order derivation trees by their processing difficulty. The empirical data fixes what the order should be: if sentence s is easier than s', then the derivation d(s) of s should be judged easier than the derivation d(s') of s' according to metric M.
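In symbols, and purely as a restatement of the condition just given:

```latex
% Restating the condition above: if M(d) is the difficulty that metric M
% assigns to derivation d, the empirical constraint is
\[
  s \ \text{easier than} \ s'
  \;\Longrightarrow\;
  M(d(s)) \leq M(d(s')),
\]
% and the preorder that M induces on derivation trees is simply
\[
  d \preceq_M d' \iff M(d) \leq M(d').
\]
```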
What order you get depends on the metric M, and what d(s) and d(s') look like. There's many ways you can shift the workload around between those two guys to get an empirically correct order of derivation trees.
What order M induces depends on certain properties P_1 ... P_n of the derivation tree. We have no clear characterization at this point of what those are --- and they probably differ between metrics. So what we need is some theorems about how each metric induces a particular order based on certain properties of derivation trees. And once we know that, we can shift between metrics by altering the shape of derivation trees.
Once all of this is in place, we can verify whether, for instance, head movement comes out as psychologically more adequate than remnant movement under any metric satisfying certain axioms that we deem desirable or psychologically plausible. That way we reduce the number of assumptions that matter for the model to a bare minimum, and we also get an understanding of how derivational structure and metrics interact.
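As a toy illustration of how the induced order depends on the metric, here is a minimal sketch with two made-up metrics; these are not the metrics from the paper, just the simplest examples I could think of:

```python
# Toy sketch: two made-up complexity metrics over simple derivation trees,
# just to show that the induced order depends on the metric.
# These are NOT the metrics from the paper under discussion.

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def size(tree):
    """Hypothetical metric 1: total number of nodes in the derivation."""
    return 1 + sum(size(c) for c in tree.children)

def depth(tree):
    """Hypothetical metric 2: length of the longest root-to-leaf path."""
    return 1 + max((depth(c) for c in tree.children), default=0)

def rank(trees, metric):
    """Order derivation trees from easiest to hardest under a metric."""
    return [t.label for t in sorted(trees, key=metric)]

# d1 is shallow but wide, d2 is narrow but deep.
d1 = Node("d1", [Node("a"), Node("b"), Node("c"), Node("d")])
d2 = Node("d2", [Node("x", [Node("y", [Node("z")])])])

print(rank([d1, d2], size))   # ['d2', 'd1']: d2 has fewer nodes
print(rank([d1, d2], depth))  # ['d1', 'd2']: d1 is shallower
```

The two metrics disagree about which derivation is "easier", which is the sense in which the empirical order constrains the combination of metric and structure rather than either one on its own.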
This research program sounds great to me, looking forward to hearing about the next stages of the work. I still don't get the distinction you're trying to make in this post between interesting and uninteresting cognitive modeling work though...
Interesting and uninteresting is really too harsh a choice of words. Let's say interesting and Interesting, instead. ;)
As I point out towards the end of the post, I have no qualms with people constructing models, running simulations, etc. Many of the CMCL talks were along those lines, and I found them very interesting. But that line of work leaves many questions open, the kind that I outline above, such as "how unique is your solution?", "what is actually doing the work?", "can we describe the solution in a way that's independent of the implementation?", "do we have a theoretical guarantee that this works in every case?" and so on. Those are the things that I find truly interesting, i.e. Interesting. They are essential because they further our understanding of why our solutions are solutions, and why other models are not good solutions. And that's what science is really about, imho.
And this attitude of mine isn't restricted to modelling. I have the same standards for linguistic analyses; that's why I do the kind of work that I do, trying to understand what the actual ideas behind all those different analyses are, how they are connected, where they diverge, and so on.
At least some of these concerns are, I think, shared by the Bayesian modeling crowd.
In particular the "can we describe the solution in a way that's independent of the implementation" bit is a big deal for some people, and as far as equivalences between models are concerned, the nicest example I can think of that highlights the importance of being explicit about the assumed model is Sharon Goldwater's analysis of Brent's 1999 Minimum-Description-Length segmentation algorithm. Check out her 2009 Cognition paper, I think you might enjoy the appendix even if word segmentation is not the sexiest of all modeling problems.
Essentially, she shows (a) that a previously proposed model is a special case of her Dirichlet process model, which provides a more elegant characterization of Brent's model, and (b), more importantly, that his heuristic search algorithm vastly overestimated the performance of the model. You may be disappointed that the latter point has to be driven home by simulation rather than mathematical analysis, but still, I think the finding that even for segmentation you need to take into account dependencies between words, or otherwise you'll end up undersegmenting severely, is quite important. In particular, it "corrects" Venkataraman 2001's finding that, if you apply the same heuristic algorithm to (essentially) Brent's model or a variant that takes bigram dependencies into account, you see virtually no improvement, suggesting incorrectly that there is no relevant difference between the models.
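For readers who don't know the model: the core of the unigram version is the standard Dirichlet process predictive distribution, which from memory looks roughly like the following (check the paper for the exact notation):

```latex
% Schematic form of the DP unigram predictive distribution (written from
% memory; the paper's notation may differ): the probability that the i-th
% word is lexical item \ell, given the previously generated words, is
\[
  P(w_i = \ell \mid w_1, \dots, w_{i-1})
  = \frac{n_\ell + \alpha_0 \, P_0(\ell)}{\,i - 1 + \alpha_0\,},
\]
% where n_\ell counts previous tokens of \ell, P_0 is the base distribution
% over word forms, and \alpha_0 is the concentration parameter. The bigram
% model conditions these counts on the preceding word as well.
```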
@Benjamin: Thanks for the pointer to Goldwater09, I'll check it out.
You may be disappointed that the latter point has to be driven home by simulation rather than mathematical analysis
When all parameters are fixed (models, data set, evaluation procedure, etc.), that's a viable approach. Similar to how a linguistic counterexample is a perfectly fine way to show that two accounts make different predictions under a given set of assumptions (phrase structure, restrictions on movement, etc.).
It's just not a very general method. As soon as you allow changes in some of the parameters you're dealing with such a big class (possibly infinite) that mathematical analysis is the only way to obtain some useful results in a manageable amount of time.
And I don't see any of that in the cognitive modelling community.
You don't see any theorems and proofs in the cognitive modeling community? Then you're not looking particularly hard. The subfield of mathematical psychology has plenty of theorems and proofs (as well as simulations, computation models, and so on).
I admitted my ignorance of a big chunk of this line of work, and to make matters worse my view is so limited that for me "cognitive modelling" equals "cognitive modelling of language"; that is really all I know. I can easily imagine that the situation might be similar to somebody looking at linguistics and concluding that there is no work building on theorems and proofs because their small sample simply didn't include any mathematical linguistics.
So what would be a good starting point? I'm looking at the Journal of Mathematical Psychology right now, but I'm having a hard time telling which of the articles would fit the bill of 1) being about language and 2) establishing properties and (non-)equivalences of models.
Well, that's a tough question to answer. As much theorem proving as math psych folks do, it's not a terribly language oriented crowd. I was mostly just responding to what seemed to be an over-generalization about cognitive modeling, but, then, I know for a fact that my view of what counts as cognitive modeling is skewed by my experience with math psych (i.e., theorems and proofs seem pretty normal to me, but I gather that's not a terribly common point of view).
For what it's worth, I'm a phonetician, so my interest in math psych is on the perceptual modeling end of things. Also for what it's worth, I find it interesting that, back in the 60s and 70s (maybe just the 60s), there was a series of Handbooks of Mathematical Psychology, and in one of them, most (maybe all) of the chapters were written by Chomsky. Which is to say that, once upon a time, there was formal linguistic work that could be classified as (part of) math psych. I've seen a few language-y talks at the conference, too, but mathematical linguistics is definitely not at the center of math psych.
You know, though, the more I think about it, the more it seems like mathematical linguistics and math psych people would probably get along swimmingly. The (psychological/cognitive) content of concern to math psych people is all over the place, the one common thread being mathematics. It's easy for me to imagine mathematical linguistics talks and posters going over well at the meeting, and math-ling papers going over well in the journal. I would certainly enjoy seeing more language-y talks at the meeting, anyway...
the more I think about it, the more it seems like mathematical linguistics and math psych people would probably get along swimmingly
I'm reading through the abstract volume for MathPsych 2014 right now, and I'm having a blast. It makes me wonder, though, why I wasn't aware of this line of work. Is MathPsych in a similar PR situation to MathLing, or is it the scarcity of language-oriented MathPsych that limits its popularity among (psycho)linguists?
I think it's the former more than the latter, but maybe a bit of both.
Hmm, I wrote a long post here but Blogger seems to have eaten it. (Or moderated?)
Anyway, the tl;dr was: (a) it was nice to meet you at CMCL, and (b) I think that the objection you provide to your real objection (i.e., that the world is too complicated for theorem-proving to provide good explanations) is not the one that cognitive modellers, such as those that appear at CMCL, would actually make. I think they would go "whole hog" and argue that the right way to study language is to construct simulations that increasingly match the behavioural output as well as whatever it is we know about neurobiology, that theorem-proving is partly a red herring, and that the infinitude of the object is not biologically relevant.
Yeah, Blogger does the same thing if I try to post in Firefox. However, it works fine in Luakit, so if you're on a Linux machine that's a valid alternative.
RE a) same here
RE b) Even if your goal is simply matching the input-output behavior, in the quest for a perfect match you need something like a meta-theory of models that guides your inquiry, otherwise you're just playing things by ear in the hope that eventually you'll end up with a good fit. I'm also not sure how neurobiology can put constraints on the choice of models without a good understanding of models --- it's conceivable that neurologically implausible models can be converted into equivalent models that are plausible. Still, thanks for bringing up this angle, I hadn't considered the option that some researchers might be happy with a blackbox theory as long as it produces the right results.
I agree with you on the need for a meta-theory, but lo and behold, we have Ted Gibson in the closing keynote providing one: that explanatory models of language production need to be guided radically by concerns of communicative efficiency. (I don't really agree but I am playing "modeller's advocate" for the people who pay my salary here.)
You remember Vera Demberg's question about the match between your measures and online processing behaviour? For many people in the room, that is the dispositive question. If you can't produce a match that is better than the existing ones, then it doesn't matter how theoretically well-grounded your algorithm is. Separation from implementation details is a red herring; if you have to implement it in a particular way to get the right results, then you have a clue as to what is in that black box!
So no, they aren't satisfied with a blackbox theory. They think that objects like "derivation trees" and their theoretical bounds *are* the black box. They implicitly reject infinitude by including probabilities as a fundamental component of their models.
I actually think most people would agree with Thomas that increasing the goodness of fit of your model isn't necessarily the most important goal of cognitive modeling - you want your model to be as simple as possible (e.g. Myung 2000, "The Importance of Complexity in Model Selection"), make plausible assumptions, be interpretable, etc.
@Asad: They implicitly reject infinitude by including probabilities as a fundamental component of their models.
I'm afraid I don't understand what you're getting at here. If anything, adding probabilities increases the number of alternative models to consider since now you have millions of ways of computing probabilities. Just to be clear, the infinity I'm talking about in the post is the infinity of competing models: you cannot carve out the class of empirically equivalent models by simulation because each class might be infinite.
@Tal: you want your model to be as simple as possible (e.g. Myung 2000, "The Importance of Complexity in Model Selection"), make plausible assumptions, be interpretable, etc.
yes, but these are very fuzzy notions and far from obvious. I think I already pointed out earlier that many implausible models can be translated into plausible ones (simply because you can take a plausible one and make it implausible, irrespective of what you count as plausible). Simplicity is similarly tricky. Mathematical linguistics shows us that these issues are not straightforward: what a theory or model looks like on the surface can be very misleading.
I think an instructive parallel can be drawn between `cognitive modeling' and `writing grammar fragments'. While there are some who disparage the latter (arguing that that is something a learner should do for us in a principled way), I suspect that you are more positively inclined to these than to those. What do you think the difference is between these two activities? Here are some questions to get you started:
1) is there a difference (for you) between writing a fragment in HPSG vs TAG?
2) between writing a small vs a large fragment?
I personally think (agreeing with Stefan Müller) that large grammar fragments are necessary (but not sufficient) for seeing whether or not the ideas that we as linguists have actually do what we think they do.
Yes, I think grammar fragments have the same three advantages that I listed for models above. But I'm also sure that we all agree that linguistics would be pretty boring if we just wrote grammar fragments in whatever formalism strikes our fancy, tinkered with them until they fit the data, and then called it a day. Writing grammar fragments is useful because there's something at stake. And, if I may boast, because mathematical linguistics has done its fair share in highlighting what some of these important issues are.
My original question was prompted by the feeling that a lot of modeling work focuses on data fitting and neglects to take the next step: working out what the model actually tells us about cognition. But as the other commenters here have pointed out, things aren't quite as black and white.
As for your two questions:
1) It depends on a lot of factors. Are you interested in grammar fragments as feasibility proofs? Then there is a difference on a technical level due to HPSG being unrestricted, so there's nothing to prove for HPSG. But we also know that a linguistically rich fragment of HPSG is weakly equivalent to TAG, so it might be the case that HPSG as she is used faces data it cannot account for. The same issue arises for TAG: are we talking about the formalism, or the linguistic ideas that come with it? And quite generally, do we care just about strings, phrase structure trees, the string-meaning mappings, etc.?
Depending on how you answer those questions, there might be little gain in writing fragments, or it doesn't matter which formalism you pick, or the two are in very different boats regarding grammar fragments. The cool thing is that we are aware of these factors and understand (at least partially) how the different formalisms align along these axes. That's what makes it interesting.
2) Once again it depends on what exactly you're going for. If all you want is to see some data that flies right in the face of your linguistic theory, a wisely chosen one-sentence corpus can be enough. A bigger corpus can increase the odds of accidentally coming across such sentences, but not necessarily (a hundred WSJ sentences vs a million makes no difference if the problematic cases simply aren't part of that register).
@Greg, Thomas
I am curious what you think the analogue of fragment writing would be in the more mature sciences. I can think of no real analogue in physics, say, of doing a fragment of the physical environment. Nobody decides to model the physics of their back yard, for example, to see how the theories fare. Or maybe they do. Is there an analogue? And if not, why not? And if this is important in linguistics, why here and not there? This really is a naive question.
Isn't all of engineering about this? We build a model of a bridge to see if it will fall down etc., and for scientific purposes, people often build models of the earth's core, or of soliton waves or whatever to test out their predictions.
DeleteI don't think writing grammar fragments is (necessarily) about isolating a "fragment of the physical environment", it's about isolating a fragment of hypothesised I-language (i.e. some particular set of rules/constraints/whatever) and seeing what that fragment of the I-language does and doesn't generate.
Of course one can imagine doing the reverse thing where you identify a fragment of the environment (e.g. a particular corpus of sentences) and then set yourself the goal of constructing a theory of that chunk of phenomena. This might make sense for engineering purposes. But I don't think that's what Greg and Thomas have in mind.
Yes, there are lots of obvious engineering reasons why one might want a model of a particular physical situation (the flow of air over a car you are designing) or a grammar of English (e.g. writing a grammar checker or machine translation system). Norbert's question is about the scientific reasons for doing this; I am not sure I know what the right analogy is.
I guess I am one of those that Greg thinks of as disparaging grammar-fragment-writing, but I think it is a useful way of seeing if the grammar formalism you are working with does the job it is meant to, and so it is a model -- at some level -- of part of the mind/brain/physical environment.
Greg's 2) is a good question -- do you get some insight from a large fragment versus a small fragment? I'd be interested in hearing what other people have to say. I think Thomas took this as being the large corpus versus small corpus question, which is different.
I actually think that `fragments' are everywhere in physics. Let us equate the field equations (the true theory of the universe) and UG. When modeling some phenomenon, the physicist will show that the relevant equations are instances of (or approximations to) the field equations with certain parameters fixed. (This is what I think Thomas' gripe was with the modeling crowd, that their models are not particular instances of a more general theory.) This is how predictions are made, and theories are tested. Your backyard is perhaps not very interesting, but the action of a spring, or swimming through water, or badminton is.
When writing a grammar fragment, one wants the resulting analysis to be an instance of a UG-licensed object. We can think of this as trying to find a specialization of the field equations which correctly models a particular aspect of the world (pace Tim, we can only see whether our theory is any good by comparing it with observable data; pace me just now, we might have other, world-independent notions of goodness, like elegance). We can have models of lots of disparate phenomena, but it could turn out that they rely on mutually incompatible assumptions about parameters of the theory. Thus, they could not all be true (or the theory could be wrong).
This, in my mind, is one point of a big fragment, which essentially sees whether the assumptions made in analysing other phenomena are consistent. The other point is that you can derive predictions; given a fragment, you will assign structures to an infinite number of strings, and not to infinitely many others. On the basis of this one can make predictions about meaning, acceptability (?), reading times (?), etc. Predictions are good, right? How else are you going to get them?
Alex: I think your disparagement might ultimately be well-placed, but that we are not advanced enough yet to have learning algs do that work for us.
I hate to be dense, but on one way of reading what you say everyone writes fragments all the time. That's what the standard syntax paper is. It takes a paradigm, presents a grammatical analysis, explores consequences for other phenomena and considers how earlier analyses need revision in light of the proposals advanced. If this is writing a fragment, well, we do it all the time. That's why I don't see what you are driving at. I get the feeling that this is not what you mean. Is it?
@Norbert: This would be what Greg considers a small fragment, I think. But afaik there's no PnP work that would qualify as a big fragment in Greg's sense (unless he considers the fragment in his thesis a big one). What would be examples? Take the "Syntax of X" book series (Thráinsson's Syntax of Icelandic, Bailyn's Syntax of Russian). Those are very fine books that present a wide range of data and analyses, but they do not present something like a unified analysis of these languages, with a grammar where all operations and constraints are precisely defined and a lexicon with fully specified lexical entries.
Now you might say "what's the point, this provides no new insights and commits us to a specific view of the discussed phenomena". Greg's worry seems to be that any realistic grammar is a complex beast, and in any sufficiently complex system it is easy to introduce inconsistencies that are hard to spot unless your description is explicit enough that you can verify it automatically. Basically, something like "control as movement works fine in Russian", "Reuland's theory of binding works fine in Russian", but when you put the two together you suddenly need some extra assumptions for some other phenomenon X, and when you buy into those, you lose your previous account of Y.
I think this is a plausible scenario, but at the same time I'm not sure how much of an issue it is for linguistic progress. It's common in the sciences to have models of distinct phenomena that depend on mutually exclusive simplifying assumptions. Just like we have no problem with syntactic and phonological theories using different models of, say, the lexicon. If both phenomena happen to be relevant for an engineering application, then you have to fix those assumptions, but otherwise there's little harm done. For linguistics this means that if you want a wide-coverage grammar, incompatible analyses are a problem. But since linguists are mostly attached to the spirit of an analysis, rather than its specific formalization, I believe this fixing can be achieved with some tinkering in the majority of cases.
@myself: Having thought about this for 5 seconds more, it occurs to me that one case where big-scale modelling might actually affect linguistic insights is typology. Since big fragments need an increased level of detail, two languages that look similar at "standard syntactic resolution" might turn out to involve different structures after all. Now whether you think this is a good thing depends on your priorities (if you don't like cartography, you'll probably also think that the increased resolution just obfuscates the truly important insights).
Norbert: on one way of reading what you say everyone writes fragments all the time. That's what the standard syntax paper is. It takes a paradigm, presents a grammatical analysis, explores consequences for other phenomena and considers how earlier analyses need revision in light of the proposals advanced.
Broadly speaking, I agree, but one thing that characterises a fragment for me, perhaps the main thing, is being concrete enough that one can enumerate all the licensed derivations. I don't think the "standard syntax paper" does this. Not that this is necessarily a bad thing, i.e. I'm not saying that everything should be this concrete all the time. But it's a useful tool that does seem to be under-utilised and/or under-appreciated. To me the status of fragment-writing seems (not coincidentally perhaps) roughly like that of weak generative capacity arguments: they're neither the be-all-and-end-all nor entirely useless, but common consensus seems to me to place them too far towards the "entirely useless" end of the spectrum.
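To make "concrete enough to enumerate the licensed derivations" vivid, here is a minimal toy sketch; the grammar below is invented purely for illustration and is not a serious fragment of anything:

```python
# Toy illustration of a "fragment" that is concrete enough to enumerate its
# licensed derivations: a tiny CFG, with strings generated up to a bound.
# The rules are made up for this example only.

TOY_GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "N":  [["dog"], ["cat"]],
    "VP": [["sleeps"], ["chases", "NP"]],
}

def expand(symbols, depth):
    """Yield all terminal strings derivable from `symbols` within `depth` rewrites."""
    if depth < 0:
        return
    if all(sym not in TOY_GRAMMAR for sym in symbols):  # all terminals: done
        yield list(symbols)
        return
    # rewrite the leftmost nonterminal with each of its right-hand sides
    i = next(i for i, sym in enumerate(symbols) if sym in TOY_GRAMMAR)
    for rhs in TOY_GRAMMAR[symbols[i]]:
        yield from expand(symbols[:i] + rhs + symbols[i + 1:], depth - 1)

for sentence in expand(["S"], depth=6):
    print(" ".join(sentence))
# e.g. "the dog sleeps", "the cat chases the dog", ...
```

The point is only that with a fragment at this level of explicitness, "what does this grammar license?" is a question you can answer mechanically rather than by appeal to the analyst's intentions.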
As an aside - goodness gracious, can we stop being diplomatic about this supposed disagreement about how much explicitness is necessary, and call a spade a spade? Linguists are lazy and sloppy, and computer nerds are arrogant twits. Arrogant twits will find any excuse to call other people dumb, and since they are willing to be maximally precise they pretend that lack of maximal precision makes you dumb, and they have a target. Lazy and sloppy people find any excuse to justify their laziness and sloppiness, and since they know that common sense (and their leader) would dictate that maximal precision isn't always necessary, they say that until they're blue in the face. Both arguments are beside the point in anything but a particular case, and then only in retrospect ("was it useful/interesting to be X degree precise/explicit in this case?"). Anything else is just a childish waste of time.
I too wrote a long post that got eaten, I think there must have been a timed out session involved.
The short version is that I think "proof of concept" doesn't do modeling enough justice. Modeling is exploratory work, but exploratory work is not something you go do once just to hash out or try out a few things before going back to the "real" work. Exploratory work and theory driven work are equal partners in science, addressing different parts of the creative cycle.
To make the analogy, modeling is like "data collection" about the space of possible models. Proofs are like "theorizing" about that same space. There needs to be an interplay between the two. Sometimes theorizing will be "data"-driven, merely an effort to explain why certain models work and others don't. Benjamin already alluded to the urge to do this among Bayesian modeling people; I'd say I hope the pressure from reviewers to explain catches up soon (which may only happen after we take some distance from the compulsively theory-hollow world of NLP). But, to take it a bit further, this would be butterfly-collecting if theory didn't also have a life of its own: logical deduction about what works and what doesn't and why needs to be free to run in parallel, the way it does in pure machine learning and theoretical statistics. The analogy doesn't quite work, because there's little chance that we'll be surprised by the new "data" we collect, i.e., collections of modeling results, if we've done our proofs right. But those proofs will only address narrow abstractions, and so there's always more to learn.
Modeling and proofs are as inherently complementary as wake/sleep. Just like everyone should avoid the trap of doing exploratory work without at least a vague idea about what's interesting in mind - ideas that will be driven, I think fundamentally, by a bit of deduction about what "must be so," i.e., theory - we should also avoid the trap of doing models without stopping to deduce why we see the results we do, and letting the theory speak and the deductions flow without doing too much new modeling work (perhaps using it only as "proof of concept"). And, conversely, we always need to avoid the trap of doing theory without constant inspiration and nudging from the real world, and so no intelligent person should think of doing theory without a constant flow of new ideas about which no proofs exist.