Data
As of 17 July 2025, we have processed 2,201 original datasets containing a total of 1,876,882 records. The map below shows all locations for which we have at least one observation.
For ease of organization, we divide the data into thematic groups. These groups are not mutually exclusive. For example, the first place to look for data on crop response to fertilizer would be the “agronomy” group. However, the “survey” and “varieties” groups may also contain fertilizer application data. Likewise, the “varieties” group has data for comparing crop varieties, but variety names are also reported in the “agronomy” group. You may therefore want to consider using data from multiple groups.
The table below shows the current groups and the number of original datasets and records in each group. We also show these numbers for the datasets that have a Creative Commons (CC) license.
Group | Datasets | Records | CC-Datasets | CC-Records |
---|---|---|---|---|
agronomy | 216 | 305087 | 175 | 155716 |
pest_disease | 8 | 3225 | 7 | 2785 |
soil_samples | 12 | 16549 | 10 | 13007 |
survey | 30 | 149730 | 12 | 77469 |
varieties | 49 | 128052 | 47 | 127984 |
varieties_cassava | 1466 | 138686 | 1466 | 138680 |
varieties_cowpea | 76 | 23193 | 76 | 23193 |
varieties_maize | 76 | 81811 | 62 | 73052 |
varieties_potato | 54 | 29141 | 54 | 29141 |
varieties_wheat | 214 | 1001408 | 4 | 19234 |
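To make the difference between the full and the CC-licensed holdings easier to see, the counts in the table can be turned into per-group shares. The following is a minimal sketch in Python that uses only the numbers from the table above. The names `GROUPS` and `cc_record_share` are ours, chosen for illustration, and are not part of any Carob tool.

```python
# Counts copied from the table above, per group:
# (datasets, records, CC-datasets, CC-records)
GROUPS = {
    "agronomy": (216, 305087, 175, 155716),
    "pest_disease": (8, 3225, 7, 2785),
    "soil_samples": (12, 16549, 10, 13007),
    "survey": (30, 149730, 12, 77469),
    "varieties": (49, 128052, 47, 127984),
    "varieties_cassava": (1466, 138686, 1466, 138680),
    "varieties_cowpea": (76, 23193, 76, 23193),
    "varieties_maize": (76, 81811, 62, 73052),
    "varieties_potato": (54, 29141, 54, 29141),
    "varieties_wheat": (214, 1001408, 4, 19234),
}


def cc_record_share(groups):
    """Return {group: fraction of that group's records that are CC-licensed}."""
    return {
        name: cc_records / records
        for name, (_, records, _, cc_records) in groups.items()
    }


# Totals across all groups (hypothetical helper variables for this sketch).
total_datasets = sum(row[0] for row in GROUPS.values())
total_records = sum(row[1] for row in GROUPS.values())
total_cc_records = sum(row[3] for row in GROUPS.values())

shares = cc_record_share(GROUPS)

# Print groups from the lowest to the highest CC-licensed share of records.
for name, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{name:18s} {share:6.1%}")

print(f"{'all groups':18s} {total_cc_records / total_records:6.1%}")
```

The totals computed this way match the 2,201 datasets and 1,876,882 records stated at the top of the page. The shares also show that the largest group, “varieties_wheat”, contributes only about 2% of its records to the CC-licensed download, so the downloadable data have a quite different composition from the full collection.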
Below, you can download the compiled standardized data that come with a Creative Commons license. You can create the full datasets yourself by following these instructions.
You can download data by group or, if you want all available data, select “everything”. If you want the data for a single dataset, you can find it here. You can use the R package caramba to integrate the data download into an R workflow.
Please note that we have so far only partially processed most of the survey data, and the original data sources may contain many more variables. The data available here are our first attempt to standardize widely variable data with many data quality issues. They still contain errors carried over from the original data, and likely also errors that we have introduced.