I wondered what temperatures in Orlando have done over this last week. You see, I just happen to have a set of IoT devices streaming data that I persist into an archive. One of those sensors is on a covered patio in Orlando, so it would be interesting to see what the data from the past week looks like.
In my setup, SmartThings sends data to a Spring Cloud Data Flow endpoint, which drops the data into files. Those files are parsed and uploaded to a Ceph host: a simple IoT data flow. Within Greenplum I have created external tables that can read the data, and the table looks like this:
gpadmin=# \d smartthings_tracking
          Table "public.smartthings_tracking"
   Column   |            Type             | Modifiers
------------+-----------------------------+-----------
 deviceid   | text                        |
 name       | text                        |
 value      | text                        |
 recordedat | timestamp without time zone |
 unit       | text                        |
Number of child tables: 170 (Use \d+ to list them.)
Distributed randomly
Partition by: (recordedat)
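I haven't included the DDL for the external piece here. For anyone curious, a readable external table over Ceph's S3-compatible gateway looks roughly like the sketch below; the endpoint, bucket, and config path are placeholders rather than my actual setup, and the s3 protocol needs to be enabled in the cluster first.

-- Sketch only: a readable external table over an S3-compatible Ceph endpoint.
CREATE READABLE EXTERNAL TABLE smartthings_tracking_ext (
    deviceid   text,
    name       text,
    value      text,
    recordedat timestamp,
    unit       text
)
-- Ceph exposes an S3-compatible interface, so Greenplum's s3 protocol can read the files.
LOCATION ('s3://ceph-rgw.example.com/smartthings/ config=/home/gpadmin/s3/s3.conf')
FORMAT 'CSV';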
The structure is simple, and I can sample a few of the data points with this query:
gpadmin=# \x
Expanded display is on.
gpadmin=# SELECT * FROM smartthings_tracking
gpadmin-#   WHERE deviceid = 'a15b41c3-8bf0-49df-8571-a5a5d138fdff'
gpadmin-#   ORDER BY recordedat DESC LIMIT 2;
-[ RECORD 1 ]------------------------------------
deviceid   | a15b41c3-8bf0-49df-8571-a5a5d138fdff
name       | temperature
value      | 72
recordedat | 2017-09-12 01:35:47.993
unit       | F
-[ RECORD 2 ]------------------------------------
deviceid   | a15b41c3-8bf0-49df-8571-a5a5d138fdff
name       | temperature
value      | 73
recordedat | 2017-09-12 00:20:46.677
unit       | F
I know I’m going to want a visual representation of the data, because looking at it as a list won’t easily convey the information I want. I also know I will be doing some interactive querying to get this right, so I switch from the command line over to Apache Zeppelin, which connects to my Greenplum server.
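Since Greenplum speaks the PostgreSQL wire protocol, hooking it up to Zeppelin is just a matter of pointing the jdbc interpreter at the master host and adding the PostgreSQL JDBC driver as an interpreter dependency. Roughly, with placeholder host and credentials:

default.driver    org.postgresql.Driver
default.url       jdbc:postgresql://gpdb-master.example.com:5432/gpadmin
default.user      gpadmin
default.password  ********

After that, any note paragraph that starts with %jdbc runs against Greenplum, and the result grid can be flipped between table and chart views.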
This is one of those cases where I can really appreciate interactive querying of IoT data. I need to do some exploring and will submit a few dozen queries, or attempts at a query. As I craft the visualization I will adjust the window to varying widths of data to optimize what I see, which generates even more queries. Getting immediate feedback greatly decreases the cycle time to a result. What I’m doing here is fairly simple though, and it doesn’t take long before I come up with a query and time window I like.
SELECT deviceid, recordedat::timestamp, value
FROM smartthings_tracking
WHERE deviceid = 'a15b41c3-8bf0-49df-8571-a5a5d138fdff'
  AND name = 'temperature'
  AND recordedat > '2017-09-05 00:00:00'
ORDER BY recordedat DESC;
A little bit of messing around with the visualization buttons and I come up with this
It’s not a bad image, but I quickly realize something is wrong with it. These nice angled lines do not show how temperatures behave over time. Why is that?
Then I remember: it is because these sensors kick out a temperature value only when they have registered a change of a full degree. So the time distance between two points could be 3 seconds or it could be 12 hours.
This causes time to not be represented with spatial accuracy along my X axis. While the chart does show the strange weather, reflected by the low values on the left, it isn’t giving me a good sense of how that actually unfolded over the past week.
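A quick way to see the unevenness (not a query from my original session, just an illustration) is to look at the gap between consecutive readings:

-- Illustration: how long since the previous reading from this sensor?
SELECT recordedat,
       recordedat - lag(recordedat) OVER (ORDER BY recordedat) AS gap_since_previous
FROM smartthings_tracking
WHERE deviceid = 'a15b41c3-8bf0-49df-8571-a5a5d138fdff'
  AND name = 'temperature'
  AND recordedat > '2017-09-05 00:00:00'
ORDER BY recordedat DESC;

Some gaps are a few seconds, others stretch to many hours.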
Frustration.
Then I remember a post by Caleb Welton from a few years ago on time series analysis. It has exactly what I need. A big part of making this work is setting up a few functions that will handle and maintain the last known values.
create function last_known_t(prev float8, current float8) returns float8 as $$
    # Carry the previous value forward when the current one is NULL.
    # (An explicit None check, so a legitimate value of 0 is not skipped.)
    return prev if current is None else current
$$ language plpythonu;

create aggregate last_known(float8) (
    stype   = float8,
    sfunc   = last_known_t,
    prefunc = last_known_t
);

create function last_known_t(prev timestamp, current timestamp) returns timestamp as $$
    # Same idea for timestamps.
    return prev if current is None else current
$$ language plpythonu;

create aggregate last_known(timestamp) (
    stype   = timestamp,
    sfunc   = last_known_t,
    prefunc = last_known_t
);
With those in place I can now write a new query that slices time into even intervals and drops the data into those slices, carrying forward the last known value.
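The query below also uses an interval_bound function, which isn’t shown above, to truncate each timestamp down to the start of its 15 minute slice. If you don’t have one handy, a minimal stand-in could look something like this (a sketch, not necessarily the exact definition used here):

CREATE OR REPLACE FUNCTION interval_bound(ts timestamp, width interval)
RETURNS timestamp AS $$
    -- Truncate the timestamp down to the nearest multiple of width,
    -- counting from the epoch.
    SELECT 'epoch'::timestamp
         + floor(extract(epoch FROM $1) / extract(epoch FROM $2))
           * extract(epoch FROM $2) * interval '1 second'
$$ LANGUAGE sql IMMUTABLE;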
WITH bounded AS (
    SELECT recordedat,
           interval_bound(recordedat, '15 minutes') AS slice,
           value
    FROM smartthings_tracking
    WHERE deviceid = 'a15b41c3-8bf0-49df-8571-a5a5d138fdff'
      AND name = 'temperature'
      AND recordedat::timestamp > '2017-09-05 00:00:00'
), dense AS (
    SELECT slice
    FROM generate_series('2017-09-05 00:00:00'::timestamp,
                         '2017-09-13 00:00:00', '15 minutes') s(slice)
)
SELECT slice,
       value::int,
       last_known(value::int) OVER (ORDER BY slice, recordedat)
FROM bounded RIGHT JOIN dense USING (slice)
ORDER BY slice, recordedat;
Looking at a piece of the raw output, it becomes easy to see how different the data appears once it is chunked into even time periods.
I will potentially double up on some numbers if the temperature jumped more than a degree within a single 15-minute period, but my visualization can show the average, so it is not a problem. This leads to a much better visualization of the temperatures.
The temperatures sure did deviate from the normal pattern there for a little while. This isn’t a fantastic visualization, but for under an hour of work it does give me a decent visual representation of the strangeness in the weather. I feel lucky that the only impact I see is in this graph, and I pray that those dealing with greater impacts stay healthy and find a path to recovery.