I hope to get some discussion going on this topic. This is one of the most common errors in the application of macroinvertebrate assessments of streams and rivers.

This is why you must replicate to evaluate change.
I’m going to start short and sweet, to encourage readers and discussion. Note that I am working on a paper on this topic, and your comments might help me head off some reviewer criticisms.
The problem is that rapid bioassessment protocols are developed by examining a large population of reference streams and a large population of “study streams.” These are often referred to as an “impaired population,” but in truth this is not a population: streams might be impaired or not, and those that are impaired might be impaired by different types of disturbances. Indeed, the only thing these streams really have in common is that they are not part of the reference population.
This development phase selects metrics that differentiate these two populations of data and then uses the box plots of the reference population to approximate the value attained by 75% (or some similar proportion) of reference streams. The layout for this analysis is very similar to a t-test of two populations. (Figure, “development box”).
However, in the application of bioassessment protocols, a single sample is used to represent the condition of a study site, and it is compared to the criteria defined by the reference population. To ensure that this one sample represents the actual condition of the stream being assessed, we collect extra-large composite samples. This is said to “homogenize the variance.” That claim was never tested by bioassessment developers, but I tested it for several years. Composite samples do not homogenize the variance (Marshall 2006). There is often as much as a 30% variation in metric scores within a site.
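As a rough illustration of the development step described above, the sketch below derives a screening criterion from a reference population. The scores are made up, the nearest-rank rule is only one way to pick a percentile, and it assumes a metric that decreases with impairment (so the criterion is a lower bound). The function names are hypothetical, not part of any published protocol.

```python
def reference_criterion(reference_scores, proportion=0.75):
    """Score met or exceeded by about `proportion` of the reference streams.

    Uses a simple nearest-rank rule on the sorted scores (an assumption;
    protocols may read this value off a box plot instead).
    """
    ordered = sorted(reference_scores)
    k = int((1 - proportion) * len(ordered))  # number of streams allowed below the criterion
    return ordered[k]

def passes_screen(sample_score, criterion):
    """Screening use: does a single sample meet the reference criterion?"""
    return sample_score >= criterion

# Hypothetical taxa richness scores from 12 reference streams
reference = [20, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 34]
criterion = reference_criterion(reference)
share = sum(score >= criterion for score in reference) / len(reference)
print(f"criterion = {criterion}, attained by {share:.0%} of reference streams")
print("single study sample of 23 passes screen:", passes_screen(23, criterion))
```

Note that nothing in this calculation uses the variation within the study site; the only spread involved is the spread among reference streams.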

All metrics show variation, even in 8-Surber composites like these.
Bioassessments were originally proposed as a screening process (Plafkin et al. 1989, p. 1), and they work for this. When you do a bioassessment you are, in effect, comparing a single sample to the reference condition (again, bottom left), but if you want to compare it more rigorously, you will want to know how much variation occurs in your biota’s distribution (lower middle of figure).
When you are interested in change, you might want to compare a site to itself over time, or to other sites in the area, or both. In fact, the variation of the reference population has absolutely nothing to do with this process. To compare a site (spatially or temporally) you must account for the within-site variation, and for both sites! (Right middle and bottom of Figure).
When you do not, your conclusions must be limited to whether the sample (not the community) deviates from reference conditions.
So what is a biologist to do? Collecting replicate samples using large composites as sample units dramatically increases the time spent at each site in the field, and large samples are expensive to identify. Yet the states and the EPA have invested heavily in biocriteria, so it would be nice to have a comparable method that lets you compare your site to the regional reference. I have found that there is a good way to do this, when necessary.
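To make the point about within-site variation concrete, here is a minimal sketch of one way to compare two sites using replicates: an exact permutation test on the difference in mean metric score. This is a stand-in chosen because it needs only the Python standard library; a t-test or ANOVA would serve the same purpose. The metric values are invented for illustration.

```python
from itertools import combinations
from statistics import mean

def permutation_test(site_a, site_b):
    """Exact two-sided permutation test on the difference in mean metric score.

    Returns the observed absolute difference and the share of all possible
    relabelings of the replicates that give a difference at least as large.
    """
    pooled = list(site_a) + list(site_b)
    n_a, n_b = len(site_a), len(site_b)
    total = sum(pooled)
    observed = abs(mean(site_a) - mean(site_b))
    extreme = count = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        count += 1
    return observed, extreme / count

# Hypothetical EPT richness from 8 replicate Surber samples at each site
upstream = [14, 16, 15, 17, 13, 16, 15, 18]
downstream = [11, 13, 12, 14, 10, 12, 13, 11]
diff, p_value = permutation_test(upstream, downstream)
print(f"mean difference = {diff:.2f}, permutation p = {p_value:.4f}")
```

With a single composite per site there is only one number on each side, so no test of this kind is possible; the replicates are what supply the within-site variation.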
I collect 8 individual Surber samples (or Hess samples, etc.) and process 200 organisms from each (some might not have 200). In my experience about 5% of the samples have about 100 organisms. Still, this results in a comparison of 1500-1600 organisms. So, once you do your comparisons of populations using replicates, you can add the samples together to represent the 8 sq. ft. level of field effort. This will inflate richness, but that can be corrected by rarefaction analysis (Marshall 2009). Furthermore, proportional measures better reflect the actual community structure of the site being evaluated, because each of the replicates is included in the subsample.
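The rarefaction correction mentioned above can be sketched with the standard hypergeometric formula for the expected number of taxa in a random subsample of n organisms. This is a generic illustration and not necessarily the procedure used in Marshall 2009; the pooled counts and the 500-organism target are assumptions.

```python
from math import comb

def rarefied_richness(counts, n):
    """Expected number of taxa in a random subsample of n organisms.

    counts: number of organisms per taxon in the pooled replicates.
    Each taxon contributes the probability that it appears at least once.
    """
    total = sum(counts)
    if not 0 < n <= total:
        raise ValueError("n must be between 1 and the total count")
    denom = comb(total, n)
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)

# Hypothetical pooled counts per taxon from 8 replicates (1,600 organisms)
pooled_counts = [600, 400, 300, 150, 80, 40, 20, 6, 3, 1]
observed_richness = sum(1 for c in pooled_counts if c > 0)
expected_at_500 = rarefied_richness(pooled_counts, 500)
print(f"observed richness in 1,600 organisms: {observed_richness}")
print(f"expected richness in a 500-organism subsample: {expected_at_500:.2f}")
```

The rarefied value is what you would compare against a criterion that was developed from fixed-count subsamples, since the raw richness of 1,600 organisms will tend to run higher.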
Hrm… what do I mean by that? Well, let’s say one of the samples in your composite hits a black fly hot-spot… say, 50,000 per sq. meter (this is not uncommon). The rest of your samples have 200-400 bugs from a variety of groups, but very few black flies. When the 500-organism subsample is processed, it will be very strongly influenced by black flies… and will end up seeming very similar to a location that has high black fly densities across the whole stream. So, smaller replicates would differentiate sites with one small concentrated aggregate of black flies (or midges or worms or …etc…) from sites that are completely dominated by them.
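The arithmetic behind the black fly example can be worked through in a few lines. The numbers below are hypothetical: a Surber sampler is taken as 1 sq. ft., one sample sits on the 50,000 per sq. meter hot-spot, and the other seven hold 300 organisms each with 5 black flies. The sketch compares the expected black fly share in a subsample of the pooled composite with the share when each replicate is capped at 200 organisms before pooling.

```python
SQ_FT_PER_SQ_M = 10.7639

# (black flies, all other organisms) per Surber sample; values are invented
hotspot_black_flies = round(50_000 / SQ_FT_PER_SQ_M)  # about 4,645 per sq. ft.
samples = [(hotspot_black_flies, 50)] + [(5, 295)] * 7

def composite_proportion(samples):
    """Expected black fly share when samples are pooled before subsampling."""
    flies = sum(f for f, _ in samples)
    total = sum(f + o for f, o in samples)
    return flies / total

def replicate_proportion(samples, cap=200):
    """Expected black fly share when each sample is subsampled to `cap` first."""
    flies = total = 0.0
    for f, o in samples:
        picked = min(cap, f + o)
        flies += picked * f / (f + o)  # expected black flies among those picked
        total += picked
    return flies / total

per_sample = [f / (f + o) for f, o in samples]
print(f"pooled composite: {composite_proportion(samples):.0%} black flies")
print(f"capped replicates, pooled: {replicate_proportion(samples):.0%} black flies")
print("per-replicate shares:", [f"{p:.0%}" for p in per_sample])
```

Under these assumptions the composite looks like a black-fly-dominated site, while the replicates show one dominated sample and seven with almost none.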
Additionally, if you measure flow with each sample, your analysis can control for it (or for other interesting covariates). Composite samples cannot be used this way. Consider the example above, where one sample had high flow (and high black fly abundance): the composite would not relate to any covariate, because most of the organisms came from only one of the samples in the composite.
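A minimal sketch of the covariate idea, assuming a current velocity was recorded with each replicate: fit an ordinary least-squares line of log abundance on velocity and look at the residuals. The data, the log transform, and the choice of a straight-line fit are all assumptions for illustration; a full analysis would more likely be an ANCOVA or similar model.

```python
from math import log10
from statistics import mean

def least_squares(x, y):
    """Slope and intercept of the ordinary least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical per-replicate measurements: velocity (m/s) and black fly counts
velocity = [0.2, 0.3, 0.3, 0.4, 0.5, 0.5, 0.6, 1.1]
black_flies = [2, 4, 3, 6, 8, 7, 12, 4600]
log_abundance = [log10(c + 1) for c in black_flies]

slope, intercept = least_squares(velocity, log_abundance)
residuals = [y - (intercept + slope * x) for x, y in zip(velocity, log_abundance)]
print(f"log10(abundance + 1) = {intercept:.2f} + {slope:.2f} * velocity")
print("residuals:", [round(r, 2) for r in residuals])
```

A composite collapses these eight pairs into a single count with no matching velocity, so there is nothing to fit.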