A native speaker judges The woman loves himself to be odd-sounding. I explain this by saying that the structure underlying this string is ungrammatical, specifically that it violates principle A of the Binding Theory. How does what I say explain this judgment? It explains it if we assume the following: grammaticality is a causally relevant variable in judgments of acceptability. It may not be the only variable relevant to acceptability, but it is one of the relevant variables that cause the native speaker to judge as s/he does (i.e. the ungrammaticality of the structure underlying the string causes a native speaker to judge the sentence unacceptable). If this is correct (which it is), the relation between acceptability and grammaticality is indirect. The aim of what follows is to consider how indirect it can be while still leaving the relation between (un)acceptability and (un)grammaticality direct enough for judgments concerning the former to be useful as probes into the structure of the latter (and into the structure of Gs, which the notion of grammaticality implicitly reflects).
The above considerations suffice to conclude that (un)acceptability need not be an infallible guide to (un)grammaticality. As the latter is but one factor, the former need not track it perfectly. And, indeed, we know that there are many strings that are quite unacceptable but are not ungrammatical. Famous examples include self-embedding (e.g. That that that Mary saw Sam is intriguing is interesting is false), ‘buffalo’ sentences (e.g. Buffalo buffalo buffalo buffalo buffalo buffalo buffalo) and multiple-negation sentences (No eye injury is too insignificant to ignore). Sentences of these kinds are hard to process and are reliably judged quite poor despite being grammatical. A favorite hobby of psycholinguists is to find other cases of grammatical strings that trouble speakers, as this allows them to investigate (via how sentences are parsed in real time) factors other than grammaticality that are psycholinguistically important. Crucially, all accept (and have since the “earliest days of Generative Grammar”) that unacceptability does not imply ungrammaticality.
Moreover, we have some reason to believe that acceptability does not imply grammaticality. There are the famous cases like More people visited Rome than I did, which are judged by speakers to be fine despite the fact that speakers cannot tell you what they mean. I personally no longer think that this shows that these sentences are acceptable. Why? Precisely because there is no interpretation that they support. There is no interpretation for these strings that native speakers consistently recognize, so I conclude from this that they are unacceptable despite “sounding” fine. In other words, “sounding fine” is at best a proxy for acceptability, one that further probing may undermine. It often is good enough, and it may be an interesting question to ask why some ungrammatical sentences “sound fine,” but the mere fact that they do is not in itself sufficient reason to conclude that these strings are acceptable (let alone grammatical).
So are there any cases of acceptability without grammaticality? I believe the best examples are those where we find subliminal island effects (see here for discussion). In such cases we find sentences that are judged acceptable under the right interpretation. Despite this, they display the kinds of super-additivity effects that characterize islands. It seems reasonable to me to describe these strings as ungrammatical (i.e. they violate island conditions) despite their being acceptable. What this means is that for cases such as these the super-additivity profile is a more sensitive measure of grammaticality than is the bare acceptability judgment. In fact, assuming that the sentence violates island restrictions explains why we find the super-additivity profile. Of course, we would love to know why in these cases (but not in many other island-violating examples) ungrammaticality does not lead to unacceptability. But not knowing why this is so does not in and of itself compromise the conclusion that these acceptable sentences are ungrammatical.
So, (un)acceptability does not imply (un)grammaticality, nor vice versa. How then can the former be used as a probe into the latter? Well, because the relation is stable often enough. In other words, over a very large domain acceptability judgments track grammaticality, and that is good enough. In fact, as I’ve mentioned more than once, Sprouse, Almeida, and Schütze have shown that these data are very robust and very reliable over a very wide range, and thus are excellent probes into grammaticality. Of course, this does not mean that such judgments are infallible indicators of grammatical structure, but then nobody ever thought that they were. Let me elaborate on this.
We’ve known for a very long time that acceptability is affected by many factors (see Aspects: 10-15 for an early sophisticated discussion of these issues), including sentence length, word frequencies, number of referential DPs employed, intonation and prosody, types of embedding, priming, and kinds of dependency resolutions required, among others. These factors combine to yield a judgment of (un)acceptability on a given occasion. And these judgments are expected to be (and acknowledged to be) a matter of degree. One of the things that linguists try to do in probing for grammaticality is to compensate for these factors by comparing sentences of similar complexity to one another, so as to isolate the grammatical contribution to the judgment in a particular case (e.g. we compare sentences of an equal degree of embedding when probing for island effects). This is frequently doable, though we currently have no detailed account of how these factors interact to produce any given judgment. Let me repeat this: though we don’t have a general theory of acceptability judgments, we have a pretty good idea what factors are involved, and when we are careful (and even when we are not, as Sprouse has shown) we can control for these and allow the grammatical factor to shine through in a particular judgment. In other words, we can set up a specific experimental situation that reliably tests for G-factors (i.e. we can test whether G-factors are causally relevant in the standard way that experiments typically do, by controlling the hell out of the other factors). This is standard practice in the real sciences, where unpacking interaction effects is the main aim of experimentation. I see no reason why the same should not hold in linguistics.
It is worth noting that the problem of understanding complex data (i.e. data that are reasonably taken to be the result of many causally interacting factors) is not limited to linguistics. It is a common feature of the real sciences (e.g. physics). Geoffrey Joseph has a nice (old) paper discussing this, where he notes (786):
Success at the construction and testing of theories often does not proceed by attempting to explain all, or even most, of the actually available data. Either by selecting appropriate naturally occurring data or by producing appropriate data in the laboratory, the theorist implicitly acknowledges …his decomposing the causal factors at work into more comprehensible components. A consequence of this feature of his methodology is that we are often in the position of having very well-confirmed fundamental theories at hand, but at the same time being unable to formulate complete deductive explanations of natural (complex) phenomena.
This said, it is interesting when (un)acceptability and (un)grammaticality diverge. Why? Because, somewhat surprisingly, the two in fact track one another very closely (much more closely than we had any reason to expect a priori). This is what makes it theoretically interesting when they diverge. Here’s what I mean.
There is no reason why (un)acceptability should have been such a good probe into (un)grammaticality. After all, this is a pretty gross judgment that we ask speakers to make and we are absolutely sure that many factors are involved. Nonetheless, it seems that the two really are closely aligned over a pretty large domain. And precisely because they are, it is interesting to probe where they diverge and why. Figuring out what’s going on is likely to be very informative.
Paul Pietroski has suggested an analogy from the real sciences. Celestial mechanics tracks the actual position of planets in space in terms of their apparent positions. Now, there is no a priori reason why a planet’s apparent position should be a reliable guide to its actual one. After all, we know that the fact that a stick in water looks bent does not mean that it is bent. But at least in the heavens, apparent position data was good enough to ground Kepler’s discoveries and Newton’s. Moreover, precisely because the fit was so good, the divergence between apparent and actual position in a few cases was rightly taken to be an interesting problem to be solved. The solutions required a complete overhaul of Newton’s laws of gravitation (actually, relativity ended up deriving Newton’s laws as limit cases, so in an important sense these laws were conserved). Note that this did not deny that apparent position was pretty good evidence of actual position. Rather it explained why in the general case this was so and why in the exceptional cases the correlation failed to hold. It would have been a very bad idea for the history of physics had physicists drawn the conclusion that the anomalies (e.g. the perihelion of Mercury) showed that Newton’s laws should be trashed. The right conclusion was that the anomaly needed explanation, and they retained the theory until something came along that explained both the old data and the anomalies.
This seems like a rational strategy, and it should be applied to our divergent cases as well. And it has been, to good effect, in some cases. The case of unacceptability-despite-grammaticality has generated interesting parsing models that try to explain why self-embedding is particularly problematic given the nature of biological memory (e.g. see work by Rick Lewis and friends). The case of acceptability-despite-ungrammaticality has led to the development of somewhat more refined tools for testing acceptability, which have given us criteria other than simple acceptability to measure grammaticality.
The most interesting instance of the divergence, IMO, is the case of Mainland Scandinavian, where detectable island violations (remember the super-additivity effects) do not yield unacceptability. Why not? Dunno. But, as in the mechanics case above, the right attitude is not that the failure of acceptability to track grammaticality shows that there are no island effects and that a UG theory of islands is clearly off the mark. Rather, the divergence indicates the possibility of an interesting problem here and that there is something that we still do not understand. Need I say that this latter observation is not a surprise to any working GGer? Need I say that this is what we should expect in normal scientific practice?
So, grammaticality is one factor in acceptability, and a reliable, robust one at that. However, like most measures, it works better in some contexts than in others, and though this fact does not undermine the general utility of the measure, it raises interesting research questions as to why.
Let me end by repeating that all of this is old hat. Indeed, Chomsky’s discussion in chapter 1 of Aspects is still a valuable intro to these issues (11):
…the scales of grammaticalness and acceptability do not coincide. Grammaticalness is only one of many factors that interact to determine acceptability. Correspondingly, although one might propose various operational tests for acceptability, it is unlikely that a necessary and sufficient operational criterion might be invented for the much more abstract and important notion of grammaticalness.
So, can we treat grammaticalness as some “kind” of acceptability? No, nor should we expect to. Can we use acceptability to probe grammaticalness? Yes, but as in all areas of inquiry there is no guarantee that these judgments are infallible guides. Should we expect to one day have a solid theory of acceptability? Well, some hope for this, but I am skeptical. Phenomena that result from the interaction of many factors are usually theoretically elusive. We can tie loose ends down well enough in particular cases, but theories that specify in advance which loose ends are most relevant are hard to come by, and not only in linguistics. There are no general theories of experimental design. Rather, there are rules of thumb about what to control for in specific cases, informed by practice and some theory. This is true in the “real” sciences, and we should expect no less in linguistics. Those who demand more in the latter case are methodological dualists, holding linguistics to uniquely silly standards.
 Other similar examples involve cases where linearly intervening non-licensing material can improve a sentence’s acceptability. There is a lot of work on this involving NPI licensing by clearly non-c-commanding negative elements. This too has been widely discussed in the parsing literature. Again, interpretations for these improved sentences are hard to come by, and so it is unclear whether these sentences are actually acceptable despite their improved sound.
 So far as I can tell, similar reasoning applies to some recent discussion of binding effects in a Cognition paper by Cole, Hermon and Yanti that I hope to discuss more fully in the near future.
 As Nancy Cartwright observes in her 1983 book (p. 83), the aim of an experiment is to find “quite specific effects peculiarly sensitive to the exact character of the causes [you] want to study.” Experiments are very context-sensitive setups developed to find these effects. And they often fail to explain a lot. See the Geoffrey Joseph quote below.
 See his “The many sciences and the one world,” Journal of Philosophy 1980: 773-791.
 I owe what follows to some discussion with Paul. He is, of course, fully responsible for my misstatements.