--- title: "Self-Hosting Guide" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Self-Hosting Guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ## Overview Many organizations publish statistical indicators derived from survey data (unemployment rates, poverty rates, innovation metrics) without disclosing how those indicators were computed. The final number is public, but the methodology -- which variables were used, what transformations were applied, what survey design was used -- remains opaque. metasurvey solves this by **separating private data from public methodology**. You can deploy the metasurvey API on your own infrastructure where: - **Microdata stays private**: raw survey files never leave your network. - **Indicators are public**: computed results (point estimates, standard errors, confidence intervals) are served via the API. - **Methodology is transparent**: anyone can query the recipe (data transformation steps) and the workflow (estimation calls and survey design) that produced each indicator. The traceability chain is: ```text Indicator Workflow Recipe (the number) --> (how it was estimated) --> (how variables were built) value: 0.082 svymean(~pd, design) step_compute(svy, pd = ...) se: 0.003 estimation_type: annual step_recode(svy, pea, ...) cv: 0.037 recipe_ids: [ech_emp_001] depends_on: [e27, f66, ...] ``` ## Architecture The system has **two R services**: a public API that serves indicators and their traceability, and a private **Worker** that has metasurvey installed and access to microdata. The worker is the only component that touches raw survey data. ```{r architecture-diagram, echo=FALSE, eval=TRUE, out.width="100%", fig.cap="metasurvey self-hosting infrastructure"} knitr::include_graphics("metasurvey-infrastructure.png") ``` **Key separation**: The Worker loads microdata, fetches recipes from MongoDB, applies them (`bake_steps`), runs the estimation (`workflow`), and posts the result back to the API. The public API never touches microdata -- it only serves pre-computed indicators and their traceability. The frontend can also request **on-demand computations** with filters (e.g., "unemployment rate for women in Montevideo") via `POST /indicators/compute`, which the API proxies to the Worker. ## Quick Start with Docker Compose The repository includes a `docker-compose.yml` that starts MongoDB, the plumber API, and the Shiny recipe explorer. No external database required. ### 1. Clone and configure ```bash git clone https://github.com/metasurveyr/metasurvey.git cd metasurvey ``` Create a `.env` file (or use the defaults for development): ```bash # .env MONGO_USER=metasurvey MONGO_PASSWORD=change-me-in-production METASURVEY_JWT_SECRET=change-me-in-production METASURVEY_ADMIN_EMAIL=admin@example.com ``` ### 2. Start the stack ```bash docker compose up --build ``` This starts four services: | Service | URL | Description | |---------|-----|-------------| | `mongo` | `localhost:27017` | MongoDB 7 with persistent volume | | `worker` | `localhost:8788` | Compute worker (internal only) | | `api` | `http://localhost:8787` | Plumber REST API | | `shiny` | `http://localhost:3838` | Recipe explorer | ### 3. Initialize the database ```bash # Create collections and indexes docker compose exec mongo mongosh \ -u metasurvey -p change-me-in-production \ --authenticationDatabase admin \ metasurvey /dev/stdin < inst/scripts/setup_mongodb.js # Seed example recipes, workflows, and indicators docker compose exec api Rscript -e ' Sys.setenv( METASURVEY_MONGO_URI = "mongodb://metasurvey:change-me-in-production@mongo:27017/?authSource=admin" ) source("/app/seed_ech_recipes.R") ' ``` ### 4. Verify ```bash curl http://localhost:8787/health ``` ```json { "status": "ok", "service": "metasurvey-api", "mongodb": "connected" } ``` ## Computing and Publishing Indicators The typical flow: load survey data, apply a recipe, run a workflow, and publish the result as an indicator with full traceability. ```{r publish-indicator} library(metasurvey) # 1. Load survey data (private -- stays on your server) svy <- Survey$new( data = my_survey_data, edition = "2024", type = "ech", engine = "data.table", weight = add_weight(annual = "W_ANO") ) # 2. Apply a recipe (defines variables like unemployment status) svy <- step_compute(svy, pd = data.table::fcase( pobpcoac == 2, 1L, pobpcoac %in% c(1, 3), 0L ), comment = "Unemployed: POBPCOAC == 2" ) svy <- bake_steps(svy) # 3. Run the estimation (workflow) result <- workflow( svy = list(svy), survey::svymean(~pd, na.rm = TRUE), estimation_type = "annual" ) # result is a data.table: # stat value se cv # svymean: pd 0.082 0.003 0.037 ``` Now publish the indicator to the API: ```{r publish-to-api} # Connect to your local API configure_api("http://localhost:8787") api_login("admin@example.com", "your-password") # Build the indicator payload indicator <- list( name = "Unemployment Rate 2024", description = "Annual unemployment rate, population 14+", recipe_id = "ech_employment_001", workflow_id = "ech_wf_labor", survey_type = "ech", edition = "2024", estimation_type = "annual", stat = result$stat[1], value = result$value[1], se = result$se[1], cv = result$cv[1], confint_lower = result$confint_lower[1], confint_upper = result$confint_upper[1], metadata = list( formula = "~pd", estimation_function = "svymean" ) ) # Publish (requires authentication) resp <- httr2::request("http://localhost:8787/indicators") |> httr2::req_headers( Authorization = paste("Bearer", Sys.getenv("METASURVEY_TOKEN")) ) |> httr2::req_body_json(indicator) |> httr2::req_perform() httr2::resp_body_json(resp) # {ok: true, id: "ind_1708099200_42"} ``` ## Consuming Indicators (Transparency) Once published, indicators and their full methodology are accessible **without authentication**. This is the transparency layer. ### Get the indicator value ```bash curl http://localhost:8787/indicators/ind_ech_unemployment_2024 ``` ```json { "ok": true, "indicator": { "id": "ind_ech_unemployment_2024", "name": "Tasa de desempleo 2024", "value": 0.082, "se": 0.003, "cv": 0.037, "confint_lower": 0.076, "confint_upper": 0.088, "survey_type": "ech", "edition": "2024", "recipe_id": "ech_employment_001", "workflow_id": "ech_wf_labor" } } ``` ### Get the workflow (how it was estimated) ```bash curl http://localhost:8787/indicators/ind_ech_unemployment_2024/workflow ``` ```json { "ok": true, "indicator_id": "ind_ech_unemployment_2024", "workflow": { "id": "ech_wf_labor", "name": "Mercado Laboral ECH", "estimation_type": "annual", "recipe_ids": ["ech_pobpcoac_000", "ech_employment_001"], "calls": [ "svymean(~pea, design, na.rm=TRUE)", "svymean(~po, design, na.rm=TRUE)", "svymean(~pd, design, na.rm=TRUE)" ], "call_metadata": [ { "type": "svymean", "formula": "~pea", "description": "Tasa de actividad" }, { "type": "svymean", "formula": "~pd", "description": "Tasa de desempleo" } ] } } ``` The workflow tells you **what statistical function was used** (`svymean`), **what formula** (`~pd`), and **which recipes** were applied to the data before estimation. ### Get the recipe (how variables were built) ```bash curl http://localhost:8787/indicators/ind_ech_unemployment_2024/recipe ``` ```json { "ok": true, "indicator_id": "ind_ech_unemployment_2024", "recipe": { "id": "ech_employment_001", "name": "Employment Status", "steps": [ "step_compute(svy, po = fcase(pobpcoac == 1, 1L, TRUE, 0L), comment = 'Employed')", "step_compute(svy, pd = fcase(pobpcoac == 2, 1L, TRUE, 0L), comment = 'Unemployed')", "step_compute(svy, pea = fcase(pobpcoac %in% 1:2, 1L, TRUE, 0L), comment = 'EAP')" ], "depends_on": ["pobpcoac"], "doc": { "input_variables": ["pobpcoac"], "output_variables": ["po", "pd", "pea"], "pipeline": [ {"step": 1, "type": "compute", "outputs": ["po"]}, {"step": 2, "type": "compute", "outputs": ["pd"]}, {"step": 3, "type": "compute", "outputs": ["pea"]} ] } } } ``` The recipe tells you **exactly how each variable was constructed** from the original survey variables. Combined with the workflow, anyone can verify the full computation chain without accessing the microdata. ## Indicator API Reference | Method | Endpoint | Auth | Description | |--------|----------|------|-------------| | `GET` | `/indicators` | No | List and search published indicators | | `GET` | `/indicators/:id` | No | Get a single indicator with metadata | | `GET` | `/indicators/:id/recipe` | No | Get the recipe that built the variables | | `GET` | `/indicators/:id/workflow` | No | Get the workflow (estimation + design) | | `POST` | `/indicators` | Yes | Publish a computed indicator | | `POST` | `/indicators/compute` | Yes | On-demand estimation via the Worker (with filters) | ### Query parameters for `GET /indicators` | Parameter | Type | Description | |-----------|------|-------------| | `search` | string | Regex search on indicator name | | `survey_type` | string | Filter by survey type | | `recipe_id` | string | Filter by recipe ID | | `workflow_id` | string | Filter by workflow ID | | `edition` | string | Filter by survey edition | | `limit` | integer | Max results (default 50) | | `offset` | integer | Skip N results (default 0) | ### Indicator document fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `id` | string | Auto | Unique identifier | | `name` | string | Yes | Indicator name | | `workflow_id` | string | Yes | Workflow that produced it | | `value` | number | Yes | Point estimate | | `recipe_id` | string | No | Recipe that built variables | | `description` | string | No | Human-readable description | | `survey_type` | string | No | Survey type | | `edition` | string | No | Survey edition | | `estimation_type` | string | No | `annual`, `quarterly`, `monthly` | | `stat` | string | No | Statistic label (e.g., `svymean: pd`) | | `se` | number | No | Standard error | | `cv` | number | No | Coefficient of variation | | `confint_lower` | number | No | Lower bound of confidence interval | | `confint_upper` | number | No | Upper bound of confidence interval | | `metadata` | object | No | Additional context (formula, notes) | | `published_at` | string | Auto | ISO timestamp | ## Deploying with Minikube For Kubernetes-based deployment, you can test locally with [Minikube](https://minikube.sigs.k8s.io/). ### Start Minikube ```bash minikube start ``` ### MongoDB deployment ```yaml # k8s/mongo.yaml apiVersion: apps/v1 kind: Deployment metadata: name: metasurvey-mongo namespace: metasurvey spec: replicas: 1 selector: matchLabels: app: metasurvey-mongo template: metadata: labels: app: metasurvey-mongo spec: containers: - name: mongo image: mongo:7 ports: - containerPort: 27017 env: - name: MONGO_INITDB_ROOT_USERNAME value: metasurvey - name: MONGO_INITDB_ROOT_PASSWORD valueFrom: secretKeyRef: name: metasurvey-secrets key: mongo-password --- apiVersion: v1 kind: Service metadata: name: mongo-service namespace: metasurvey spec: selector: app: metasurvey-mongo ports: - port: 27017 targetPort: 27017 ``` ### Worker deployment ```yaml # k8s/worker.yaml apiVersion: apps/v1 kind: Deployment metadata: name: metasurvey-worker namespace: metasurvey spec: replicas: 1 selector: matchLabels: app: metasurvey-worker template: metadata: labels: app: metasurvey-worker spec: containers: - name: worker image: ghcr.io/metasurveyr/metasurvey-worker:latest ports: - containerPort: 8788 env: - name: METASURVEY_MONGO_URI value: "mongodb://metasurvey:$(MONGO_PASSWORD)@mongo-service:27017/?authSource=admin" - name: MONGO_PASSWORD valueFrom: secretKeyRef: name: metasurvey-secrets key: mongo-password volumeMounts: - name: survey-data mountPath: /data/surveys readOnly: true resources: requests: memory: "1Gi" cpu: "500m" limits: memory: "4Gi" cpu: "2000m" volumes: - name: survey-data persistentVolumeClaim: claimName: survey-microdata-pvc --- apiVersion: v1 kind: Service metadata: name: worker-service namespace: metasurvey spec: selector: app: metasurvey-worker ports: - port: 8788 targetPort: 8788 ``` ### API deployment ```yaml # k8s/api.yaml apiVersion: apps/v1 kind: Deployment metadata: name: metasurvey-api namespace: metasurvey spec: replicas: 2 selector: matchLabels: app: metasurvey-api template: metadata: labels: app: metasurvey-api spec: containers: - name: api image: ghcr.io/metasurveyr/metasurvey-api:latest ports: - containerPort: 8787 env: - name: METASURVEY_MONGO_URI value: "mongodb://metasurvey:$(MONGO_PASSWORD)@mongo-service:27017/?authSource=admin" - name: MONGO_PASSWORD valueFrom: secretKeyRef: name: metasurvey-secrets key: mongo-password - name: METASURVEY_JWT_SECRET valueFrom: secretKeyRef: name: metasurvey-secrets key: jwt-secret - name: METASURVEY_WORKER_URL value: "http://worker-service:8788" livenessProbe: httpGet: path: /health port: 8787 initialDelaySeconds: 30 readinessProbe: httpGet: path: /health port: 8787 initialDelaySeconds: 10 resources: requests: memory: "512Mi" cpu: "250m" limits: memory: "1Gi" cpu: "500m" --- apiVersion: v1 kind: Service metadata: name: metasurvey-api namespace: metasurvey spec: type: NodePort selector: app: metasurvey-api ports: - port: 8787 targetPort: 8787 ``` ### Apply and test ```bash kubectl create namespace metasurvey kubectl create secret generic metasurvey-secrets \ --namespace metasurvey \ --from-literal=mongo-password=change-me \ --from-literal=jwt-secret=change-me kubectl apply -f k8s/mongo.yaml -f k8s/worker.yaml -f k8s/api.yaml # Access the API minikube service metasurvey-api -n metasurvey ``` ## Deploying on AWS (EKS) For production on AWS, use [EKS](https://aws.amazon.com/eks/) with DocumentDB (MongoDB compatible) as the managed database. ### 1. Create the EKS cluster ```bash # Requires: aws-cli, eksctl eksctl create cluster \ --name metasurvey-prod \ --region sa-east-1 \ --node-type t3.medium \ --nodes 2 \ --nodes-min 1 \ --nodes-max 5 ``` ### 2. Create DocumentDB (MongoDB compatible) ```bash aws docdb create-db-cluster \ --db-cluster-identifier metasurvey-db \ --engine docdb \ --master-username metasurvey_admin \ --master-user-password "$DB_PASSWORD" \ --vpc-security-group-ids "$SG_ID" \ --db-subnet-group-name "$SUBNET_GROUP" aws docdb create-db-instance \ --db-instance-identifier metasurvey-db-1 \ --db-cluster-identifier metasurvey-db \ --db-instance-class db.r6g.large \ --engine docdb ``` ### 3. Configure secrets in Kubernetes ```bash aws eks update-kubeconfig --name metasurvey-prod --region sa-east-1 kubectl create namespace metasurvey kubectl create secret generic metasurvey-secrets \ --namespace metasurvey \ --from-literal=mongo-uri="mongodb://metasurvey_admin:$DB_PASSWORD@metasurvey-db.cluster-xxx.sa-east-1.docdb.amazonaws.com:27017/?tls=true&tlsCAFile=/rds-combined-ca-bundle.pem&retryWrites=false" \ --from-literal=jwt-secret="$(openssl rand -hex 32)" ``` ### 4. Deployment with ALB Ingress ```yaml # k8s/aws-api.yaml apiVersion: apps/v1 kind: Deployment metadata: name: metasurvey-api namespace: metasurvey spec: replicas: 2 selector: matchLabels: app: metasurvey-api template: metadata: labels: app: metasurvey-api spec: containers: - name: api image: ghcr.io/metasurveyr/metasurvey-api:latest ports: - containerPort: 8787 env: - name: METASURVEY_MONGO_URI valueFrom: secretKeyRef: name: metasurvey-secrets key: mongo-uri - name: METASURVEY_JWT_SECRET valueFrom: secretKeyRef: name: metasurvey-secrets key: jwt-secret - name: METASURVEY_WORKER_URL value: "http://worker-service:8788" resources: requests: memory: "512Mi" cpu: "250m" limits: memory: "1Gi" cpu: "500m" livenessProbe: httpGet: path: /health port: 8787 initialDelaySeconds: 30 readinessProbe: httpGet: path: /health port: 8787 initialDelaySeconds: 10 --- apiVersion: v1 kind: Service metadata: name: metasurvey-api namespace: metasurvey spec: type: ClusterIP selector: app: metasurvey-api ports: - port: 8787 --- apiVersion: networking.k8s.io/v1 kind: Ingress metadata: name: metasurvey-ingress namespace: metasurvey annotations: kubernetes.io/ingress.class: alb alb.ingress.kubernetes.io/scheme: internet-facing alb.ingress.kubernetes.io/target-type: ip alb.ingress.kubernetes.io/certificate-arn: "$ACM_CERT_ARN" alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]' spec: rules: - host: api.metasurvey.example.org http: paths: - path: / pathType: Prefix backend: service: name: metasurvey-api port: number: 8787 ``` ```bash # Install AWS Load Balancer Controller (if not installed) helm repo add eks https://aws.github.io/eks-charts helm install aws-load-balancer-controller eks/aws-load-balancer-controller \ -n kube-system \ --set clusterName=metasurvey-prod kubectl apply -f k8s/aws-api.yaml ``` ## Deploying on Azure (AKS) For production on Azure, use [AKS](https://azure.microsoft.com/en-us/products/kubernetes-service/) with CosmosDB (MongoDB API) as the managed database. ### 1. Create the AKS cluster ```bash # Requires: az cli az group create --name rg-metasurvey --location brazilsouth az aks create \ --resource-group rg-metasurvey \ --name aks-metasurvey \ --node-count 2 \ --node-vm-size Standard_B2ms \ --generate-ssh-keys \ --enable-managed-identity ``` ### 2. Create CosmosDB with MongoDB API ```bash az cosmosdb create \ --name cosmos-metasurvey \ --resource-group rg-metasurvey \ --kind MongoDB \ --capabilities EnableMongo \ --default-consistency-level Session \ --locations regionName=brazilsouth # Get connection string COSMOS_URI=$(az cosmosdb keys list \ --name cosmos-metasurvey \ --resource-group rg-metasurvey \ --type connection-strings \ --query "connectionStrings[0].connectionString" -o tsv) ``` ### 3. Configure secrets in Kubernetes ```bash az aks get-credentials \ --resource-group rg-metasurvey \ --name aks-metasurvey kubectl create namespace metasurvey kubectl create secret generic metasurvey-secrets \ --namespace metasurvey \ --from-literal=mongo-uri="$COSMOS_URI" \ --from-literal=jwt-secret="$(openssl rand -hex 32)" ``` ### 4. Deployment with NGINX Ingress ```yaml # k8s/azure-api.yaml apiVersion: apps/v1 kind: Deployment metadata: name: metasurvey-api namespace: metasurvey spec: replicas: 2 selector: matchLabels: app: metasurvey-api template: metadata: labels: app: metasurvey-api spec: containers: - name: api image: ghcr.io/metasurveyr/metasurvey-api:latest ports: - containerPort: 8787 env: - name: METASURVEY_MONGO_URI valueFrom: secretKeyRef: name: metasurvey-secrets key: mongo-uri - name: METASURVEY_JWT_SECRET valueFrom: secretKeyRef: name: metasurvey-secrets key: jwt-secret - name: METASURVEY_WORKER_URL value: "http://worker-service:8788" resources: requests: memory: "512Mi" cpu: "250m" limits: memory: "1Gi" cpu: "500m" livenessProbe: httpGet: path: /health port: 8787 initialDelaySeconds: 30 readinessProbe: httpGet: path: /health port: 8787 initialDelaySeconds: 10 --- apiVersion: v1 kind: Service metadata: name: metasurvey-api namespace: metasurvey spec: type: ClusterIP selector: app: metasurvey-api ports: - port: 8787 --- apiVersion: networking.k8s.io/v1 kind: Ingress metadata: name: metasurvey-ingress namespace: metasurvey annotations: cert-manager.io/cluster-issuer: letsencrypt-prod nginx.ingress.kubernetes.io/ssl-redirect: "true" spec: ingressClassName: nginx tls: - hosts: - api.metasurvey.example.org secretName: metasurvey-tls rules: - host: api.metasurvey.example.org http: paths: - path: / pathType: Prefix backend: service: name: metasurvey-api port: number: 8787 ``` ```bash # Install NGINX Ingress Controller helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx helm install ingress-nginx ingress-nginx/ingress-nginx \ --namespace ingress-nginx --create-namespace # Install cert-manager for TLS helm repo add jetstack https://charts.jetstack.io helm install cert-manager jetstack/cert-manager \ --namespace cert-manager --create-namespace \ --set crds.enabled=true kubectl apply -f k8s/azure-api.yaml ``` ## Feature Flags The API supports feature flags via environment variables to control which modules are available. This lets you run the same API image in different configurations: | Variable | Default | Description | |----------|---------|-------------| | `METASURVEY_ENABLE_INDICATORS` | `1` | Enable `/indicators` endpoints | | `METASURVEY_ENABLE_WORKER` | `0` | Enable `POST /indicators/compute` (on-demand) | For example, the public metasurvey API on Railway only serves recipes and workflows -- indicators and worker are disabled: ```bash METASURVEY_ENABLE_INDICATORS=0 METASURVEY_ENABLE_WORKER=0 ``` A self-hosted deployment with full capabilities: ```bash METASURVEY_ENABLE_INDICATORS=1 METASURVEY_ENABLE_WORKER=1 METASURVEY_WORKER_URL=http://worker:8788 ``` ## Hybrid Deployment: Public Registry + Private Infrastructure You don't need to run your own MongoDB to access shared recipes. You can deploy **your own API + Worker** pointing to the **public metasurvey recipe registry**, so that researchers with access to microdata can apply community-published recipes locally. The organization deploys two services: 1. **plumber API** (`:8787`) -- connects to the public MongoDB to read recipes and workflows. Receives requests from the frontend or researchers, and proxies compute requests to the worker. 2. **Worker** (`:8788`) -- has metasurvey installed and access to private microdata. Fetches recipes from the public registry (via MongoDB), applies them, and runs estimations. ```text Public metasurvey MongoDB Your infrastructure (recipes, workflows) (private network) +-------------------+ +---------------------------+ | MongoDB Atlas |<----------| plumber API :8787 | | (read-only) | | - reads recipes/wf | +-------------------+ | - proxies POST /compute | +-------------+-------------+ | POST /compute| v +-------------+-------------+ | Worker :8788 | | - loads microdata | | - applies recipes | | - runs workflow() | +---------------------------+ ^ +-------------+-------------+ | Microdata (.sav, .dta) | | PRIVATE — never leaves | +---------------------------+ ``` ```bash # .env for hybrid deployment # Both API and Worker point to the public metasurvey MongoDB METASURVEY_MONGO_URI=mongodb+srv://readonly:public@metasurvey.mongodb.net METASURVEY_DB=metasurvey # Worker has local microdata SURVEY_DATA_PATH=/secure/microdata # API proxies compute requests to the worker METASURVEY_WORKER_URL=http://worker:8788 METASURVEY_ENABLE_WORKER=1 # Indicators disabled (no local MongoDB to store them) METASURVEY_ENABLE_INDICATORS=0 ``` ```bash docker compose up api worker ``` In this mode: - **Recipes and workflows** are fetched from the public registry -- any recipe published by the community is available to your API and worker. - **Microdata stays private** -- the worker loads survey files from a local volume, never exposed externally. - **Estimation is local** -- the API receives the request, proxies it to the worker, which applies recipes and runs `workflow()` on your infrastructure. - **No MongoDB to maintain** -- you use the public registry as-is. This is useful for research teams that have access to restricted microdata and want to apply standardized recipes without maintaining their own database. ## Production Deployment For production deployments with Terraform modules (AWS/Azure/GCP), Helm charts, and CI/CD pipelines, see the infrastructure repository: The infrastructure repository will include: - Terraform modules for managed databases (RDS, DocumentDB/CosmosDB, Cloud SQL, Memorystore) - Kubernetes manifests with autoscaling, TLS ingress, and secrets management - Docker image builds and registry configuration - Monitoring and health check setup ## Next Steps - **[API and Database Reference](api-database.html)** -- Full endpoint documentation, MongoDB schema, and authentication details - **[Creating and Sharing Recipes](recipes.html)** -- Build recipes for reproducible survey processing - **[Estimation Workflows](workflows-and-estimation.html)** -- Compute weighted survey estimates with `workflow()`