gisconletras: 2017

Wednesday, 18 October 2017

Testing Big Spatial Data software (Hadoop + HBase + GeoWave + GeoServer) without dying in the attempt :-)

The purpose of this article is to show the results of testing the integration of a Big Data platform with other geospatial tools. It is worth highlighting that the integration of the components used, all of them open source, also allows us to publish web services compliant with OGC standards (WMS, WFS, WPS).

Video

This article describes the installation steps, settings and development work needed to get a web mapping client application showing NO2 measurements from around 4,000 European stations over four months (observations were registered hourly), around 5 million records. Yes, I know, this volume of data hardly counts as "Big Data", but it seems big enough to check performance when applications read it using spatial and/or temporal filters.

The article does not try to teach the software used in depth; all of these products already publish good documentation for users and developers. It simply aims to share experiences and offer a simple guide that gathers resources for the software components used. For example, the comments about GeoWave and its integration with GeoServer closely follow the product guides on its website.

Data schema

The test data was downloaded from the European Environment Agency (EEA). You can search here for information or map viewers on this or other topics or, better still, use your own data. GDELT is another interesting project that offers massive data.

The schema of the test data is simple: the input is a set of CSV files (text files with their attributes separated by commas) with points in geographic coordinates (latitude/longitude) that georeference the sensor, the measurement date, and the NO2 concentration in air. There are other secondary attributes, but they are not important for our test.
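To make the schema concrete, here is a minimal Python sketch that reads rows of this kind into typed records. The column names (other than "datetime_begin", which appears later in this article) and the two sample rows are invented for illustration; the real EEA files use their own headers and carry more attributes.

```python
import csv
import io
from datetime import datetime

# Hypothetical header and sample rows; the real EEA files differ.
SAMPLE = """station,latitude,longitude,datetime_begin,no2
ES0001,40.4168,-3.7038,2017-06-01 12:00:00,41.5
FR0002,48.8566,2.3522,2017-06-01 12:00:00,37.0
"""

def read_measures(text):
    """Parse CSV text into dicts with typed coordinates, date and NO2 value."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "station": row["station"],
            "lat": float(row["latitude"]),
            "lon": float(row["longitude"]),
            "when": datetime.strptime(row["datetime_begin"], "%Y-%m-%d %H:%M:%S"),
            "no2": float(row["no2"]),
        })
    return rows

measures = read_measures(SAMPLE)
```

Each record then carries the three things the rest of the test relies on: a point, a timestamp and a value.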

Software architecture

The test consists of chaining a set of tools, each of which offers data and functionality to the next software component in the application architecture. The application workflow starts with Hadoop and its HDFS, HBase to map it as a database, the great GeoWave working as a connector between it and the popular GeoServer, which implements several OGC standards, and finally a web client application fetching data to show maps as usual (for example, using Leaflet and the Heatmap.js library).

Apache Hadoop

Apache Hadoop is, when we search a bit on Google, ... a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters. Because HDFS typically is deployed on low-cost commodity hardware, server failures are common. The file system is designed to be highly fault-tolerant, however, by facilitating the rapid transfer of data between compute nodes and enabling Hadoop systems to continue running if a node fails. That decreases the risk of catastrophic failure, even in the event that numerous nodes fail.

Our test will use Hadoop and its HDFS as the repository for the data that we are going to store and finally publish to the end-user application. You can read the project resources here, or dig around the Internet to learn about it in depth.

I used Windows for my tests. The official Apache Hadoop releases do not include Windows binaries, but you can easily build them with this great guide (it uses Maven) and configure the files needed to run at least a single-node cluster. Of course, a production environment will require us to configure a distributed multi-node cluster, use a ready-to-use distribution (Hortonworks...) or jump to the Cloud (Amazon S3, Azure...).

Following this guide, once Hadoop has been built with Maven, the configuration files edited and the environment variables defined, we can test that everything is OK by running in the console...

> hadoop version

Then we start the namenode and datanode daemons and the YARN resource manager.

> call ".\hadoop-2.8.1\etc\hadoop\hadoop-env.cmd"

> call ".\hadoop-2.8.1\sbin\start-dfs.cmd"

> call ".\hadoop-2.8.1\sbin\start-yarn.cmd"

We can see the Hadoop admin application on the configured HTTP port number, 50070 in my case:

Apache HBase

Apache HBase is, searching on Google again, ... a NoSQL database that runs on top of Hadoop as a distributed and scalable big data store. This means that HBase can leverage the distributed processing paradigm of the Hadoop Distributed File System (HDFS) and benefit from Hadoop’s MapReduce programming model. It is meant to host large tables with billions of rows with potentially millions of columns and run across a cluster of commodity hardware.

You can read here how to get started and install HBase. Again, we check the product version by running...

> hbase version

Start HBase:

> call ".\hbase-1.3.1\conf\hbase-env.cmd"

> call ".\hbase-1.3.1\bin\start-hbase.cmd"

See the HBase admin application, on port number 16010 in my case:

OK, at this point we have the big data environment working. It is time to prepare the tools that add geospatial capabilities, GeoWave and GeoServer. Let's go ahead...

LocationTech GeoWave

GeoWave is a software library that connects the scalability of distributed computing frameworks and key-value stores (Hadoop + HBase in my case) with modern geospatial software to store, retrieve and analyze massive geospatial datasets. Wow! this is a great tool :-)

From a developer's point of view, this library implements a vector data provider for the GeoTools toolkit in order to read features (geometry and attributes) from a distributed environment. When we add the corresponding plugin to GeoServer, the user will see new data store types for configuring access to the supported distributed datasets.

Currently, GeoWave supports three distributed data store types: Apache Accumulo, Google BigTable and HBase. We will use the last of them.

Let's leave GeoServer for later. According to the GeoWave user and developer guides, we first have to define the primary and secondary indexes that the layers will use, and then we can load the information into our big data storage.

Following the developer guide, we build the GeoWave toolkit with Maven in order to save geographic data to HBase:

> mvn package -P geowave-tools-singlejar

and the plugin to include in GeoServer:

> mvn package -P geotools-container-singlejar

I defined my own environment variable with a base command in order to run GeoWave processes as comfortably as possible:

> set GEOWAVE=java -cp "%GEOWAVE_HOME%/geowave-deploy-0.9.6-SNAPSHOT-tools.jar" mil.nga.giat.geowave.core.cli.GeoWaveMain

Now, we can easily run commands typing %geowave% [...]. We check the GeoWave version:

> %geowave% --version

OK, we are going to register the spatial and temporal indexes our layer needs. The client application will filter data using a spatial filter (a BBOX intersection filter) and a temporal filter to fetch only the NO2 measurements of a specific date.

We go ahead, register both indexes:

> %geowave% config addindex -t spatial eea-spindex --partitionStrategy ROUND_ROBIN

> %geowave% config addindex -t spatial_temporal eea-hrindex --partitionStrategy ROUND_ROBIN --period HOUR

And add a "store", in GeoWave terminology, for our new layer:

> %geowave% config addstore eea-store --gwNamespace geowave.eea -t hbase --zookeeper localhost:2222

Warning: in the last command, 2222 is the port number where I published my Zookeeper.

Now we can load data. Our input is CSV files, so I will use the "-f geotools-vector" option to tell the loader that GeoTools should work out which vector provider to use to read the data. There are other supported formats and, of course, we can develop a new provider to read our own specific data types.

To load a CSV file:

> %geowave% ingest localtogw -f geotools-vector ./mydatapath/eea/NO2-measures.csv eea-store eea-spindex,eea-hrindex

OK, data loaded. So far no problems, right? Well, the GeoTools CSVDataStore has some limitations when reading files. The current code does not support date/time attributes (nor boolean attributes); it treats all of them as strings. This is unacceptable for our requirements, since the measurement date has to be a properly typed attribute for the index, so I fixed it in the original Java code. Also, in order to work out the appropriate value type of each attribute, the reader reads all the rows in the file. This is the safest approach, but it can be very slow when reading big files with thousands and thousands of rows. If the file has a consistent schema, we can read a small set of rows to work out the types, so I changed that too. We have to rebuild GeoTools and GeoWave. You can download the changes from my own GeoTools fork.

After this aside, let me return to the main path of the guide. We have loaded features into our layer with the "ingest" command. We have also included the plugin in a deployed GeoServer instance (see the developer guide; it is easy, just copy the "geowave-deploy-xxx-geoserver.jar" component to the "..\WEB-INF\lib" folder and restart).
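The idea of inferring attribute types from a small sample of rows, instead of scanning the whole file, can be sketched in a few lines of Python. This is only an illustration of the approach, not the Java code in the GeoTools fork; the function name, the type labels, the sample size and the date format are all assumptions.

```python
from datetime import datetime
from itertools import islice

def _parses(func, text):
    """Return True when func(text) does not raise ValueError."""
    try:
        func(text)
        return True
    except ValueError:
        return False

# Candidate types, tried from the most specific to the least specific.
CHECKS = [
    ("boolean", lambda s: s.lower() in ("true", "false")),
    ("integer", lambda s: _parses(int, s)),
    ("double", lambda s: _parses(float, s)),
    ("date", lambda s: _parses(
        lambda t: datetime.strptime(t, "%Y-%m-%d %H:%M:%S"), s)),
]

def infer_column_types(header, rows, sample_size=100):
    """Guess a type per column by looking only at the first sample_size rows."""
    sample = list(islice(rows, sample_size))
    types = {}
    for i, name in enumerate(header):
        values = [row[i] for row in sample if row[i] != ""]
        types[name] = "string"  # fallback when no stricter type fits
        for type_name, check in CHECKS:
            if values and all(check(v) for v in values):
                types[name] = type_name
                break
    return types

header = ["station", "no2", "datetime_begin", "valid"]
rows = [
    ["ES0001", "41.5", "2017-06-01 12:00:00", "true"],
    ["FR0002", "37", "2017-06-01 13:00:00", "false"],
]
types = infer_column_types(header, rows)
```

The trade-off is the one described above: sampling is fast, but it only gives the right answer when the file really has a consistent schema.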

GeoServer

GeoServer is an open source server for sharing geospatial data. Designed for interoperability, it publishes data from any major spatial data source using open standards. GeoServer is an Open Geospatial Consortium (OGC) compliant implementation of a number of open standards such as Web Feature Service (WFS), Web Map Service (WMS), and Web Coverage Service (WCS).

Additional formats and publication options are available including Web Map Tile Service (WMTS) and extensions for Catalogue Service (CSW) and Web Processing Service (WPS).

We use GeoServer to read the layers loaded with GeoWave. The plugin just added to our GeoServer allows us to connect to this data, and we can use it like any other type of layer. Wow! :-)

To configure the access to distributed data stores, we can use two options:

Using the GeoServer admin panel as usual:

Using the "gs" command of GeoWave to register Data Stores and Layers in a started GeoServer instance.

Since we are testing things, we are going to use the second option. The first step is to tell GeoWave which GeoServer instance we want to configure.

> %geowave% config geoserver -ws geowave -u admin -p geoserver http://localhost:8080/geoserver

Similar to what we would do with the GeoServer admin panel, we run two commands to add the desired Data Store and Layer respectively.

> %geowave% gs addds -ds geowave_eea -ws geowave eea-store

> %geowave% gs addfl -ds geowave_eea -ws geowave NO2-measures

As you can see, the spatial reference system of the layer is EPSG:4326; the remaining settings are similar to other layer types. If we preview the map with the OpenLayers client of GeoServer...

Video

The performance is quite decent (you can see the video linked to the image), taking into account that I am running on a not-very-powerful PC, with Hadoop working in single-node mode, and drawing the NO2 measurements of all available dates (~5 million records). The spatial index works well: the closer the zoom, the faster the response. Also, if we run a WFS filter with a temporal criterion, we can check that the temporal index works too; GeoServer does not scan all the records of the layer.

The GeoWave user guide describes a special style named "subsamplepoints" (it uses a WPS process named "geowave:Subsample" that the GeoWave plugin implements). When drawing a map, this style performs spatial subsampling, discarding geometries that coincide at pixel level, to speed up rendering. I saw a large performance gain, so I recommend it for drawing point layers.

I also tested loading a polygon layer from a Shapefile: no problems, WMS GetMap and WFS GetFeature requests run fine. Just one note: the GeoWave loading tool automatically transforms geometries from the original spatial reference system (EPSG:25830 in my case) to EPSG:4326 in geographic coordinates.

At this point we have verified that everything fits together. We could stop here, since this data can already be used with standard web mapping libraries (Leaflet, OpenLayers, AGOL, ...) or any GIS desktop application (QGIS, gvSIG, ...).

Would you like to continue?
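To show why pixel-level subsampling helps, here is a small Python sketch that keeps at most one point per pixel of the output image. It is not how "geowave:Subsample" is implemented (GeoWave does this work on the server side, against its index); the function name and the example numbers are assumptions made purely for illustration.

```python
def subsample_points(points, bbox, width, height):
    """Keep the first point that falls in each pixel of a width x height map."""
    minx, miny, maxx, maxy = bbox
    seen = set()
    kept = []
    for x, y in points:
        if not (minx <= x <= maxx and miny <= y <= maxy):
            continue  # outside the requested map extent
        # Map coordinates to pixel column/row (row 0 is the top of the image).
        col = min(int((x - minx) / (maxx - minx) * width), width - 1)
        row = min(int((maxy - y) / (maxy - miny) * height), height - 1)
        if (col, row) not in seen:
            seen.add((col, row))
            kept.append((x, y))
    return kept

points = [(1.1, 1.1), (1.2, 1.3), (5.5, 5.5), (20.0, 20.0)]
kept = subsample_points(points, bbox=(0, 0, 10, 10), width=10, height=10)
```

In this example the first two points land on the same pixel, so only one of them is kept, and the last point is outside the extent. The number of features to draw is bounded by the number of pixels, no matter how many records the layer holds.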

Web Mapping client with Leaflet

I went on to develop a web mapping client application with Leaflet. This viewer can draw the map in two styles, as a heatmap or as a thematic color ramp. It renders all the observations or measurements of a single date, and can even animate through all the available dates.

We can also verify performance with this viewer, since it combines spatial and temporal filters in a single query. Let's go ahead.

The easiest option, and perhaps the optimal one, would have been for the client application to perform WMS GetMap requests, but I am going to send requests to GeoServer that fetch the geometries so we can draw them in the client however we like. We could use WFS GetFeature requests with the current map bounds (which produces a spatial BBOX filter) and a PropertyIsEqualTo filter on a specific date. But we should not forget that we are working with Big Data stores that can produce huge GML or JSON responses with thousands and thousands of records.

To avoid this problem I developed a pair of WPS processes, called "geowave:PackageFeatureLayer" and "geowave:PackageFeatureCollection", that return the response as a compressed binary stream. You could use other packaging logic, for example returning a special image whose pixels encode geometries and feature attributes. Anything goes to minimize the size of the information and speed up its digestion in the client application.

The WPS parameters are: first, the layer name in the current GeoServer catalog (a "SimpleFeatureCollection" for the "geowave:PackageFeatureCollection" process), the BBOX, and an optional CQL filter (in my case I am sending something similar to "datetime_begin = 2017-06-01 12:00:00").

I am not going to explain the code in detail; it is beyond the scope of this guide. If you like, you can study it at the GitHub link at the end of the article.

The client application runs a WebWorker that sends a WPS request to our GeoServer instance. The request runs the "geowave:PackageFeatureLayer" process to minimize the response size. The WebWorker then decompresses the binary stream, parses it to create JavaScript objects with points and attributes, and finally returns them to the main browser thread for drawing. The client application renders these objects using the Heatmap.js library, or by drawing on an HTML5 Canvas to create a thematic color ramp. For this second style, the application creates on-the-fly textures of the colored icons to use when drawing the points. This trick lets the map render thousands and thousands of points pretty fast.

If our client application needs to draw millions of points, we can dive into WebGL and the great WebGL Heatmap library or fantastic demos such as How I built a wind map with webgl.

The source code of the WPS module and the web mapping client application is here.
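For reference, a WFS GetFeature request that combines a bounding box with a date filter could be built as in the Python sketch below. It uses GeoServer's CQL_FILTER vendor parameter, because GeoServer does not accept the bbox and filter parameters together. The geometry attribute name "the_geom", the exact date literal syntax and the helper name are assumptions; the base URL, workspace and layer name are the ones used earlier in this article.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_getfeature_url(base_url, type_name, bbox, when,
                         geom_attr="the_geom", date_attr="datetime_begin"):
    """Build a WFS 2.0.0 GetFeature URL with a spatial AND temporal CQL filter."""
    minx, miny, maxx, maxy = bbox
    # geom_attr and the quoted date literal are assumptions for illustration.
    cql = (f"BBOX({geom_attr},{minx},{miny},{maxx},{maxy}) "
           f"AND {date_attr} = '{when}'")
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,
        "outputFormat": "application/json",
        "CQL_FILTER": cql,
    }
    return f"{base_url}/wfs?{urlencode(params)}"

url = build_getfeature_url(
    "http://localhost:8080/geoserver",
    "geowave:NO2-measures",
    bbox=(-10, 35, 5, 45),
    when="2017-06-01 12:00:00",
)
# Parse the URL back to inspect the parameters that would be sent.
query = parse_qs(urlparse(url).query)
```

A request like this works, but the JSON or GML response grows with the number of matching features, which is exactly the concern raised above.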
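The packaging idea can be illustrated with a short Python round trip: points are written as fixed-size binary records, compressed, and later decompressed and parsed back. The record layout used here (a count followed by longitude, latitude and value per point) is invented for this sketch and is not the format produced by the "geowave:PackageFeatureLayer" process.

```python
import struct
import zlib

# Hypothetical layout: lon and lat as float64, value as float32, little-endian.
RECORD = struct.Struct("<ddf")

def pack_points(points):
    """Serialize (lon, lat, value) tuples and compress the result."""
    payload = struct.pack("<I", len(points)) + b"".join(
        RECORD.pack(lon, lat, value) for lon, lat, value in points)
    return zlib.compress(payload)

def unpack_points(blob):
    """Reverse pack_points: decompress and read the records back."""
    payload = zlib.decompress(blob)
    (count,) = struct.unpack_from("<I", payload, 0)
    return [RECORD.unpack_from(payload, 4 + i * RECORD.size)
            for i in range(count)]

points = [(-3.7038, 40.4168, 41.5), (2.3522, 48.8566, 37.0)]
blob = pack_points(points)
restored = unpack_points(blob)
```

Compared with GML or JSON, a layout like this carries no attribute names or markup per feature, and the client can parse it with simple offset arithmetic.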
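The icon trick boils down to classifying each value into a small number of ramp classes and building the icon for each class only once. The Python sketch below shows that logic with a cache standing in for the pre-rendered textures; the three colors, the value range and the function names are assumptions, and the real viewer does this in JavaScript on a Canvas.

```python
from functools import lru_cache

# Hypothetical three-step ramp: green, yellow, red.
RAMP = [(0, 128, 0), (255, 255, 0), (255, 0, 0)]

def ramp_class(value, vmin, vmax, classes=len(RAMP)):
    """Map a value to a ramp class index in [0, classes - 1]."""
    if vmax <= vmin:
        return 0
    t = (value - vmin) / (vmax - vmin)
    t = min(max(t, 0.0), 1.0)  # clamp values outside the range
    return min(int(t * classes), classes - 1)

@lru_cache(maxsize=None)
def icon_for(class_index):
    """Stand-in for rendering an icon texture once per class."""
    r, g, b = RAMP[class_index]
    return f"rgb({r},{g},{b})"

values = [5.0, 50.0, 120.0]
icons = [icon_for(ramp_class(v, 0.0, 100.0)) for v in values]
```

Because thousands of points share a handful of classes, the expensive step (creating the icon) runs only a few times, and drawing each point becomes a cheap copy of an existing texture.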

Tuesday, 17 October 2017

Testing Big Spatial Data software (Hadoop + HBase + GeoWave + GeoServer) without dying in the attempt :-)

The purpose of this article is to show the results obtained from testing the integration of Big Data software with tools that provide geospatial functionality. It is worth highlighting that the integration of the components used, all of them open source, also allows us to publish web services compliant with OGC standards (WMS, WFS, WPS...).

Watch video

The article describes the installation steps, configuration and development work carried out to finally obtain a web viewer showing NO2 concentrations at around 4,000 stations in Europe over a period of four months (with measurements recorded every hour), some 5,000,000 records. Yes, I know, that volume of data is hardly "Big Data", but it seems enough to evaluate behavior when client applications consume it using spatial and/or temporal filters (you can watch the video linked to the image).

The article is not going to go deep into the components used; all the products already publish good documentation for both users and developers. It simply aims to share experiences and offer a rough guide that brings together resources for each of the tools used. For example, the part on GeoWave and its integration with GeoServer closely follows the user and developer guides available on its product website.

Data schema

The data was obtained from the European Environment Agency (EEA). You can search here for information and map viewers on this or other topics or, better still, use other data that interests you more. The GDELT project is another resource that offers massive information.

The schema of the data used is simple: the input is a set of CSV files (text files with their attributes separated by commas) with points in geographic coordinates (latitude/longitude) that locate the sensor, the measurement date, and the NO2 concentration value. There are other values that are secondary for this test.

Software architecture

The test consisted of chaining different tools, each of which publishes data and functionality to the next component of the final architecture. The application diagram starts with Hadoop and its HDFS, HBase to map it as a database, the great GeoWave as a connector between it and the well-known GeoServer as the web service that implements several OGC standards, and finally an end client that consumes the data using standard web technology (Leaflet and the Heatmap.js library, for example) to show the results.

By component:

Apache Hadoop

Apache Hadoop is, if we search a bit on Google, ... a framework that allows large volumes of data to be processed across clusters using a simple programming model. Its design makes it easy to scale from a few nodes to thousands of nodes. Hadoop is a distributed system with a master-slave architecture, which uses the Hadoop Distributed File System (HDFS) for storage and MapReduce algorithms for computation.

HDFS is the storage system, a distributed file system. It was created based on the Google File System (GFS). HDFS is optimized for large streams and for working with large files in its reads and writes. Its design reduces network I/O. Scalability and availability are other key features, thanks to data replication and fault tolerance.

Our test will use Hadoop and its HDFS as the target repository for the data that we are going to store and finally use in the end-user application. You can find project resources here, and dig into all the documentation that already exists on the Internet to learn about it in depth.

I used Windows as the operating system for the tests. No release is distributed for this OS, but there is a great guide here for building Hadoop (using Maven) and configuring the files needed to get at least a single-node system. Obviously, a real production environment will require configuring a distributed multi-node cluster, using a preconfigured distribution (Hortonworks...) or evaluating the jump to the Cloud (Amazon S3, Azure...).

Having followed the steps in the guide, with Hadoop built with Maven, the configuration files edited and the environment variables set, we can check in the console that everything is correct with the command...

> hadoop version

We then start the "daemons" that run the "namenode" and "datanode" nodes and the "yarn" resource manager.

> call ".\hadoop-2.8.1\etc\hadoop\hadoop-env.cmd"

> call ".\hadoop-2.8.1\sbin\start-dfs.cmd"
> call ".\hadoop-2.8.1\sbin\start-yarn.cmd"

Now we can see the Hadoop admin application on the port set in the configuration, in my case port 50070:

Apache HBase

Apache HBase is, searching on Google again, ... an open source, non-relational, versioned database that runs on top of Amazon S3 (with EMRFS) or the Hadoop Distributed File System (HDFS), and is designed to provide random, strictly consistent, real-time access to tables with billions of rows and millions of columns.

Data is stored in the rows of a table, while the data within a row is grouped by column family. HBase is a schemaless database in the sense that neither the columns nor the type of data stored in them need to be defined before using them.

You can read here how to get HBase up and running. Back in the console, we can verify the installation...

> hbase version

We start HBase:

> call ".\hbase-1.3.1\conf\hbase-env.cmd"

> call ".\hbase-1.3.1\bin\start-hbase.cmd"

Finally, let's look at the HBase admin application, in my case on port 16010:

Good, at this point we have the "Big Data" environment running. It is time to prepare the server tools that provide the geospatial functionality, GeoWave and GeoServer. Let's continue...

LocationTech GeoWave

GeoWave is a software library that acts as a connector between the distributed environment, Hadoop + HBase in our case, and geospatial software such as GeoServer. Wow! Now that is a connector! :-)

In technical terms, this library actually implements a GeoTools vector "Data Provider" to read the features (geographic phenomena) loaded in the distributed system. When we add the corresponding plugin to GeoServer, the user will see three new "Data Store" types with which to configure access to layers loaded in systems of this kind.

Currently three "Data Stores" for "Big Data" databases are supported: Apache Accumulo, Google BigTable and HBase, which is the one we are using.

Let's leave GeoServer for later and return to managing the data with GeoWave. According to the GeoWave user and developer guides, we must first define the primary or secondary indexes that the data layers will have, and then load the information into the "Big Data" system (remember, Hadoop + HBase in my case).

Following the developer guide, we build the GeoWave tools framework with Maven in order to load the data into HBase:

> mvn package -P geowave-tools-singlejar

and the plugin that we will include in GeoServer:

> mvn package -P geotools-container-singlejar

To run GeoWave commands comfortably, I defined an environment variable with the base command:

> set GEOWAVE=java -cp "%GEOWAVE_HOME%/geowave-deploy-0.9.6-SNAPSHOT-tools.jar" mil.nga.giat.geowave.core.cli.GeoWaveMain

Now we can easily run the configuration and loading commands by simply typing %geowave% [...]. We check the software version to see that everything is correct:

> %geowave% --version

OK, we can now register the spatial and temporal indexes that our layer will have. For this test, the client application will access the data by setting a spatial filter, the current map view (an intersection with a BBOX, in technical terms), and a temporal filter to retrieve only the NO2 observations of a given date.

Let's get to it and register the two indexes:

> %geowave% config addindex -t spatial eea-spindex --partitionStrategy ROUND_ROBIN

> %geowave% config addindex -t spatial_temporal eea-hrindex --partitionStrategy ROUND_ROBIN --period HOUR

We also add what GeoWave terminology calls the "store" of the layer:

> %geowave% config addstore eea-store --gwNamespace geowave.eea -t hbase --zookeeper localhost:2222

Note: in the previous command, 2222 is the port where I published the Zookeeper of my HBase.

Now we can load data. Our data source is CSV files, so we will use the -f parameter with "geotools-vector" to tell the loading tool that the GeoTools library should work out which data provider to read the information with. As the documentation shows, there are other supported formats, and nothing prevents us from implementing a new one ourselves to cover our own requirements.

So, for a CSV:

> %geowave% ingest localtogw -f geotools-vector ./mydatapath/eea/NO2-measures.csv eea-store eea-spindex,eea-hrindex

And we will have the point layer loaded, so far without problems. Or not :-p. The data provider that ships with GeoTools for reading CSV files has a few small shortcomings that I have tried to improve. In the current state of the code, the reader does not support date attributes (nor "boolean" types); it treats them as plain "string" values. That is unacceptable given our requirement that the date is a main attribute of the layer, so we implemented this attribute type in the original code. In addition, the attribute type check is written so that the code verifies the values of every record in the file. This behavior is safer for inferring the value type, but very slow for files of considerable size (thousands or hundreds of thousands of rows). If the file is consistent, it is enough to check a few of the records to correctly work out the value type of each attribute. We changed this in the original code too, then rebuilt GeoTools and rebuilt GeoWave again. You can find the source code changes on GitHub in my GeoTools fork.

After this aside, let's return to the main path of the guide. We have already loaded a layer with the GeoWave "ingest" tool, and we have included the plugin mentioned earlier in a running GeoServer instance (see the developer guide; you simply copy the previously built "geowave-deploy-xxx-geoserver.jar" component to the "..\WEB-INF\lib" folder of GeoServer and restart).

GeoServer

GeoServer is an open source map server that allows users to share and edit geospatial information using open standards. Indeed, it implements several standards defined by the Open Geospatial Consortium (OGC), such as the widely used Web Map Service (WMS).

We will use GeoServer to consume the layers loaded with GeoWave. The plugin we have just added to our distribution allows us to connect to this data, so we can use it like any other layer. Once again, wow! :-)

To configure access to the layers (Data Stores and Layers, in technical terms), we can use two options:

Using the GeoServer administration panel as usual:

Using the GeoWave "gs" command to register the layers in a running GeoServer.

Since we are testing things, let's go with this second option. A first step is to tell GeoWave which GeoServer instance we want to configure.

> %geowave% config geoserver -ws geowave -u admin -p geoserver http://localhost:8080/geoserver

Similar to what we would do with the GeoServer admin panel, we run two commands to add the desired "Data Store" and "Layer" respectively.

> %geowave% gs addds -ds geowave_eea -ws geowave eea-store

> %geowave% gs addfl -ds geowave_eea -ws geowave NO2-measures

As you will see, the reference system of the layer is EPSG:4326; the rest of the settings look like those of any other data source. If we preview the data in the OpenLayers client that ships with GeoServer...

Watch video

The performance is quite decent (you can watch the video linked to the image), bearing in mind that we are running on a fairly ordinary laptop, with Hadoop in "single node" mode, and displaying the NO2 observations of all possible dates (~5,000,000 records). The spatial filter works perfectly: the closer the zoom, the faster the response. If we run a WFS filter using the date attribute, we will also see that the temporal filter works; there is no brute-force scan of all the data.

The GeoWave user guide explains how to add to the layer a special style called "subsamplepoints" (internally it uses a WPS process called "geowave:Subsample" that is implemented in the GeoWave plugin). What this style does is discard, when drawing the map, those geometries that coincide at pixel level, since they add nothing to the resulting image. I must say the performance improvement is quite noticeable, so I recommend looking into it for drawing points.

I also tried loading a polygon layer from a Shapefile: zero problems, WMS GetMap and WFS GetFeature requests work. Just one detail: loading with GeoWave automatically transforms the geometries from the original spatial reference system (EPSG:25830 in my case) to the aforementioned EPSG:4326 in geographic coordinates.

Good, at this point we have checked that everything fits together. We could stop here, since this data could already be used with standard web mapping tools (Leaflet, OpenLayers, AGOL, ...) or GIS desktop tools (QGIS, gvSIG, ...).

Shall we continue?

Web mapping viewer with Leaflet

I kept squeezing a bit more out of the test and developed a web mapping viewer based on Leaflet. This viewer can show the points of the layer in "heatmap" mode or in "thematic color ramp" mode, to view the values of a given date, and to run a time simulation across all the existing dates.

With this test we will also check performance for requests that use spatial and temporal filters together. Let's get to it.

The easiest option, and perhaps the optimal one, would have been for the web client to make WMS GetMap requests, but what we are going to run are requests to GeoServer that retrieve the geometries/points so we can draw them in the client in whatever form and technology we want. We could use a WFS GetFeature request with the current BBOX of the view and an alphanumeric filter on a given date, but let's not lose sight of the fact that we are using a "Big Data" layer that can return a GML or JSON response with thousands and thousands of records.

You can expect that there will come a point where the geometry file the client application has to process becomes prohibitive. To avoid this I developed a pair of WPS modules for GeoServer called "geowave:PackageFeatureLayer" and "geowave:PackageFeatureCollection" that package the result of that same request in a compressed byte array. The result could be returned in another compact format, for example a special image that encodes the geometries in its pixels. Anything goes to minimize the size of the information and speed up its digestion in the client application.

In the image we can see that the input parameters of the module are the layer name (for the "geowave:PackageFeatureCollection" module it is a "SimpleFeatureCollection" object), the BBOX that acts as the spatial filter, and the CQL filter to apply optionally (in our case it could be something like "datetime_begin = 2017-06-01 12:00:00").

I am not going to describe the web client code in detail; it is beyond the scope of this guide. If you wish, you can study the code that I publish on GitHub at the end of the article.

What the application does is run, in a WebWorker, a WPS request to our GeoServer using the module above. The WebWorker decompresses the byte stream, parses it to generate the JavaScript objects and finally returns them to the main browser thread to be drawn. The viewer then draws these JavaScript objects using the Heatmap.js library for the "heatmap" display mode, or on an HTML5 Canvas for the "color ramp" mode. For this second mode, the application generates on the fly a texture for each of the icons colored from the ramp. This trick makes it possible to draw thousands of points in record time.

If our client needs to draw millions of points, we can dive into WebGL and the great WebGL Heatmap or impressive demos such as How I built a wind map with webgl.

The source code of the web viewer and the WPS module is here.