WEED OR NO WEED METHODOLOGY
In an attempt to quantify some aspects of the marijuana industry in Humboldt County, the Lost Coast Outpost created a crowdsourcing game – “Weed or No Weed” – in which its readers were invited to determine whether particular parcels of county land hosted grow operations. Readers were presented with freely available commercial satellite imagery of a random Humboldt County parcel and asked to vote “weed,” “no weed” or “don’t know.” After voting, they were given another random parcel.
For the game itself, see this link. For an introduction, see here.
Because of the sensitive nature of this undertaking, the context of each parcel was obscured. The ability to pan or zoom out beyond the boundaries of the parcel was disabled, and the landscape surrounding each target parcel within its bounding box was whited out, to the degree feasible. Readers were asked to base their votes completely on the non-obscured parcel presented to them.
The project took as its total population nearly every parcel in the county one acre in size or larger, according to the parcel map available on the County of Humboldt’s downloadable shapefile layer. A small number of these candidate parcels (<1%) were culled from the population before the game began because the geometries in the county’s shapefile (or their reverses) were deemed invalid after transfer to a PostGIS database. In the end, 28,545 parcels were included in the total population presented to “Weed or No Weed” players.
Over 125,000 votes have been collected at the time of this writing (Dec. 8, 2015). This is enough to draw some fairly secure conclusions, with the following caveats:
1. The exercise says nothing about the size of the grow operations. It counts them; it doesn’t measure them.
2. To repeat: This exercise only takes into account parcels one acre or greater. It says nothing whatsoever about cultivation operations in Humboldt County that take place on parcels smaller than one acre, including urban indoor grows.
3. This only counts apparent cultivation operations that are visible from space. As the industry has grown and moved a bit out into the open, this is less of a limitation than it would have been 10 years ago. Still: If anyone still grows under carefully manicured manzanita bushes in an effort to avoid spotters in helicopters, they will not be accounted for here. By far the majority of the operations tagged as such use light-deprivation hoop houses, which have an obvious aerial signature.
4. The satellite imagery for any given parcel is obviously from a point in time, and it’s not immediately evident which point in time it is from.
5. The accuracy of the conclusions we will draw is highly dependent on:
- The ability of the “Weed or No Weed” voting public, as a body, to make the correct call. This is half an exercise in crowdsourcing, and we are in part utilizing the aggregate of the mass mind.
- The ability of your correspondent to make the correct call, in the absence of crowd consensus.
We believe the results are good, but trust in the numbers we give is necessarily dependent on your faith in the methodology of the exercise. More on that below.
Methodology
To draw conclusions about the entire set of 28,545 parcels, we divide them into two cohorts:
1. Those on which the crowd reached consensus: parcels that were viewed and voted on enough times, and definitively enough one way or the other, to make the final call. Does it have weed, or does it not have weed? Specifically, we stipulate, for the purposes of this exercise, that a minimum three-vote difference is enough to place a parcel in one category or the other. Thus, a parcel in our population that received four yes votes and one no vote is assumed to host a grow. Conversely, a parcel that received zero yes votes and five no votes is marked as not hosting a grow. A parcel that received two yes votes and two no votes is not included in this cohort.
The three-vote difference is intended to overcome slips of the mouse, incompetent examiners and the small number of saboteurs who felt called to action. Does it give us good results? My own eye tells me that it does, but the sensitivity of the data – and particularly, of course, that part that we have put down as yes – renders your researcher unwilling to submit it to independent scrutiny. We will not be providing a bulk download of locations that our readers have tagged as grow operations. However, here is a mosaic raster image of tight close-ups from 64 parcels chosen randomly from parcels marked “yes”. These show the features, or some of the features, that presumably led the crowd to vote for weed.
We call this cohort “counted.”
2. Those parcels which were not viewed enough times, or on which the crowd was divided: those that failed to meet the three-vote difference. This population, roughly 45 percent of the total, was statistically sampled by your researcher. I went through a random selection of 1,102 of the remaining 12,767 parcels and, conservatively, made the call on whether or not they contained features indicative of growing operations. I then applied the sample to the whole of the non-“counted” population, resulting in a low-to-high range of weed-positive parcels.
We call this cohort “sampled.”
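The three-vote rule that separates the two cohorts can be sketched as a small function. This is only an illustration: the function name and the `min_margin` parameter are invented here, and the sketch assumes that “don’t know” votes are simply ignored, which the write-up does not spell out.

```python
def classify_parcel(yes_votes, no_votes, min_margin=3):
    """Apply the three-vote-difference rule to one parcel.

    Returns "weed" or "no weed" when one side leads by at least
    min_margin votes, and None when there is no consensus (the parcel
    then belongs to the "sampled" cohort).
    Assumption: "don't know" votes are not counted at all.
    """
    difference = yes_votes - no_votes
    if difference >= min_margin:
        return "weed"
    if difference <= -min_margin:
        return "no weed"
    return None


# The three examples given in the text above.
print(classify_parcel(4, 1))  # four yes, one no
print(classify_parcel(0, 5))  # zero yes, five no
print(classify_parcel(2, 2))  # two yes, two no
```

The first two calls fall into the “counted” cohort, one on each side; the third returns `None` and would be left for sampling.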
Sampling method
In this, our hazy memories of that one stats class we took in college were greatly assisted by the write-up on this web page, which is aimed at the meanest possible understanding (our own). The formulae therein were translated into the following Python function, which was used to calculate the confidence interval (at a 95% confidence rate) when applying our sample to the whole.
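from __future__ import division  # true division under Python 2
import math


def get_confidence_interval(
    sample=0,
    population=0,
    rate=.5
):
    # Margin of error at 95% confidence (z = 1.96) for a sample proportion.
    # Needs sample > 0 and population > 1; the zero defaults are placeholders.
    margin_of_error = 1.96 * math.sqrt(
        rate * (1 - rate) / sample
    )
    # Finite population correction: narrows the margin when the sample
    # is a sizable share of the population it was drawn from.
    finite_pop_correction = math.sqrt(
        (population - sample) / (population - 1)
    )
    return margin_of_error * finite_pop_correction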
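Written out as a formula, the function computes a margin of error with a finite population correction. The symbols are introduced here for illustration and do not appear in the original: E is the margin of error, p the positive rate in the sample, n the sample size and N the size of the population sampled from. The worked line plugs in the figures for the full “sampled” cohort (151 positives among 1,102 sampled parcels, out of 12,767).

```latex
E = 1.96 \, \sqrt{\frac{p\,(1-p)}{n}} \; \sqrt{\frac{N-n}{N-1}}

% Worked example: p = 151/1102 \approx 0.1370, n = 1102, N = 12767
E \approx 1.96 \, \sqrt{\frac{0.1370 \times 0.8630}{1102}} \;
          \sqrt{\frac{12767 - 1102}{12767 - 1}}
  \approx 0.0203 \times 0.9559
  \approx 0.0194
```

That is the 1.94 percent figure used for the whole population.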
Adding the cohorts
Finally, we add the “counted” cohort to the range of figures produced by the “sampled” cohort to draw a final conclusion about the prevalence of marijuana growing operations on Humboldt County parcels one acre in size or greater. All the relevant figures are reproduced in the following table.
ALL

Counted

| + | - | Total |
|---|---|---|
| 1,571 | 14,207 | 15,778 |

Sampled

| + | - | S. Size | Total |
|---|---|---|---|
| 151 | 951 | 1,102 | 12,767 |

| Rate | Confidence Interval |
|---|---|
| 13.7% | 1.94% |

Totals

| | Sample | Count | Combined | Pop. | % |
|---|---|---|---|---|---|
| LO | 1,502 | 1,571 | 3,073 | 28,545 | 10.76 |
| MD | 1,749 | 1,571 | 3,320 | 28,545 | 11.63 |
| HI | 1,997 | 1,571 | 3,568 | 28,545 | 12.5 |
So if the premises of the exercise are accepted, we can say with 95 percent certainty that between 3,073 and 3,568 of the parcels in our study house grow operations, or between 10.76 and 12.50 percent of the total population of parcels.
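For readers who want to check the arithmetic, here is a minimal Python 3 sketch of how the LO, MD and HI rows can be reproduced from the counts in the “ALL” table. The function names `margin_of_error` and `estimate_range` are invented for this illustration. It also assumes that the percentages are taken from the unrounded projections, which is what appears to match the published figures.

```python
import math


def margin_of_error(sample, population, rate, z=1.96):
    """Margin of error for a proportion, with finite population correction."""
    standard_error = math.sqrt(rate * (1 - rate) / sample)
    correction = math.sqrt((population - sample) / (population - 1))
    return z * standard_error * correction


def estimate_range(sample_pos, sample_size, sampled_pop, counted_pos, total_pop):
    """Project the sample onto its cohort, then add the counted positives."""
    rate = sample_pos / sample_size
    moe = margin_of_error(sample_size, sampled_pop, rate)
    rows = {}
    for label, r in (("LO", rate - moe), ("MD", rate), ("HI", rate + moe)):
        projected = r * sampled_pop          # weed-positive parcels in the sampled cohort
        combined = projected + counted_pos   # plus those the crowd agreed on
        rows[label] = {
            "sample": round(projected),
            "combined": round(combined),
            # Assumption: percentage computed before rounding the counts.
            "percent": round(100 * combined / total_pop, 2),
        }
    return rows


# Figures from the "ALL" table: 151 positives in a sample of 1,102,
# drawn from 12,767 parcels; 1,571 counted positives; 28,545 parcels in all.
rows = estimate_range(151, 1102, 12767, 1571, 28545)
for label in ("LO", "MD", "HI"):
    print(label, rows[label])
```

Run with these inputs, the sketch gives sample projections of 1,502, 1,749 and 1,997, and combined totals of 3,073, 3,320 and 3,568, in line with the table.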
Subsampling
But beyond a simple headcount, we would like to know more about the particular characteristics of today’s marijuana industry. Lawmakers at the state and county level are currently attempting to craft regulatory regimes to govern cannabis production, but they do so with little solid information about the industry as it exists. What can we say about where marijuana is currently manufactured? Which watersheds does the industry particularly impact? How are parcels associated with marijuana production currently zoned?
To answer these questions, we identify a characteristic we wish to test for, then take the subsets of our population that share that characteristic. We then run the above calculations on those subsets.
For instance: Draft county law that will govern cannabis production in the future draws several regulatory distinctions between parcels under five acres in size and those over five acres in size. How does the current cannabis industry break down along those lines?
UNDER FIVE ACRES

Counted

| + | - | Total |
|---|---|---|
| 262 | 4,803 | 5,065 |

Sampled

| + | - | S. Size | Total |
|---|---|---|---|
| 18 | 343 | 361 | 4,147 |

| Rate | Confidence |
|---|---|
| 4.99% | 2.15% |

Totals

| | Sample | Count | Combined | Pop. | % |
|---|---|---|---|---|---|
| LO | 118 | 262 | 380 | 9,212 | 4.12 |
| MD | 207 | 262 | 469 | 9,212 | 5.09 |
| HI | 296 | 262 | 558 | 9,212 | 6.05 |
OVER FIVE ACRES

Counted

| + | - | Total |
|---|---|---|
| 1,309 | 9,404 | 10,713 |

Sampled

| + | - | S. Size | Total |
|---|---|---|---|
| 133 | 608 | 741 | 8,620 |

| Rate | Confidence |
|---|---|
| 17.95% | 2.64% |

Totals

| | Sample | Count | Combined | Pop. | % |
|---|---|---|---|---|---|
| LO | 1,319 | 1,309 | 2,628 | 19,333 | 13.6 |
| MD | 1,547 | 1,309 | 2,856 | 19,333 | 14.77 |
| HI | 1,775 | 1,309 | 3,084 | 19,333 | 15.95 |
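The subsetting step itself might look something like the sketch below, which splits parcel records at the five-acre line and tallies each subset into the “counted” and “sampled” columns used in the tables. Everything about the record layout is hypothetical: the field names `acres`, `yes`, `no` and `reviewer_call` are invented for illustration, the six records are made-up toy data rather than real parcels, and the sketch assumes a parcel of exactly five acres falls in the “over” group, which the text does not specify.

```python
def split_by_acreage(parcels, threshold=5.0):
    """Split parcels into (under, over) groups at the acreage threshold.

    Assumption: a parcel of exactly `threshold` acres goes in "over".
    """
    under = [p for p in parcels if p["acres"] < threshold]
    over = [p for p in parcels if p["acres"] >= threshold]
    return under, over


def tally(parcels, min_margin=3):
    """Count positives and negatives in the counted and sampled cohorts."""
    counts = {"counted_pos": 0, "counted_neg": 0,
              "sampled_pos": 0, "sampled_neg": 0, "unreviewed": 0}
    for p in parcels:
        difference = p["yes"] - p["no"]
        if difference >= min_margin:
            counts["counted_pos"] += 1
        elif difference <= -min_margin:
            counts["counted_neg"] += 1
        elif p["reviewer_call"] is True:      # no consensus, reviewed by hand: weed
            counts["sampled_pos"] += 1
        elif p["reviewer_call"] is False:     # no consensus, reviewed by hand: no weed
            counts["sampled_neg"] += 1
        else:                                 # no consensus and not in the sample
            counts["unreviewed"] += 1
    return counts


# Toy records for illustration only; these are not real parcels.
toy_parcels = [
    {"acres": 2.0, "yes": 4, "no": 1, "reviewer_call": None},
    {"acres": 3.5, "yes": 0, "no": 5, "reviewer_call": None},
    {"acres": 4.0, "yes": 2, "no": 2, "reviewer_call": False},
    {"acres": 40.0, "yes": 3, "no": 0, "reviewer_call": None},
    {"acres": 12.0, "yes": 1, "no": 0, "reviewer_call": True},
    {"acres": 80.0, "yes": 0, "no": 0, "reviewer_call": None},
]

under, over = split_by_acreage(toy_parcels)
under_counts = tally(under)
over_counts = tally(over)
print("under five acres:", under_counts)
print("over five acres:", over_counts)
```

The tallies for each subset would then be run through the same confidence interval calculation as the full population. Because every parcel lands in exactly one subset, the subset populations should add up to the whole, as 9,212 and 19,333 add up to 28,545 in the tables above.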
Thank you
Thanks to Erick Eschker, Eli Asarian and Scott Brown for talking through number-crunching tactics with me along the way. None of them have reviewed this final draft; all errors are my own.
— Hank Sims