How I built a (tiny) real-time Telematics application on AWS
In 2017, I wrote about how to build a basic, open-source, Hadoop-driven telematics application (using Spark, Hive, HDFS, and Zeppelin) that can track your movements while driving, assess how good your driving skills are, or show how often you go over the speed limit - all without relying on third-party vendors processing and using that data on your behalf.
This time around, we will revamp this application and transform it into a more modern, “serverless”, real-time application using AWS, an actual GPS receiver, and a GNU/Linux machine.
Tiny Telematics in 2019
I recently wrote about Hadoop vs. Public Clouds. Part of the conclusion was that public cloud providers (like AWS, GCP, and Azure) can provide big benefits at the cost of autonomy over your tech stack.
In this article, we will see how these benefits can help us re-design a two-year-old project.
Sample output of the solution
Our goals are simple:
- Every trip that is taken in a car should be captured and collected; we should be able to see where we went, when we went there, what route we took, and how fast we were going
- A visualization should show us the route and our speed
- Simple queries, like “What was my top speed today?” should be possible
- The running costs should be reasonable
Our guiding principles should be the following:
- Data should be ingested and processed in real-time; if you are moving and data is being collected, the output should be available within a couple of minutes at the latest; if you have no internet connection, the data should be cached and sent later 
- We don’t want to bother with infrastructure and server management; everything should run in a fully managed environment (“serverless”)
- The architecture and code should be simple and straightforward; we want this to be ready-to-go in a couple of hours
Out of Scope
Last but not least, we also ignore some things:
- The device used will be a laptop; a similar setup will work on Android, a Raspberry Pi, or any SoC device, as long as it runs a Linux kernel
- Internet connectivity will be provided via a phone’s hotspot; no separate SIM card to provide native connectivity will be used
- Power delivery will be done either via 12V or a battery
- Certain “enterprise” components - LDAP integration, VPCs, long rule sets, ACLs etc. - are out of scope; we assume that those would be pre-existing in an enterprise cloud
- Authentication will be simplified; no OAuth/SSO flow will be used - we will use the device’s MAC address as a unique ID (even though it really isn’t one)
- We stick to querying S3 data; more scalable solutions, such as DynamoDB, won’t be in scope
This leads us to the following AWS architecture:
It comes together in the following steps:
- A mobile client collects data in real-time by using the gpsd Linux daemon
- The AWS IoT Greengrass Core library simulates a local AWS environment by running a Lambda function directly on the device. IoT Greengrass manages deployment, authentication, network and various other things for us - this makes our data collection code very simple. A local Lambda function will process the data
- Kinesis Firehose will take the data, run some basic transformation and validation using Lambda, and store it on AWS S3
- Amazon Athena + QuickSight will be used to analyze and visualize the data. The main reason for QuickSight is its capability to visualize geospatial data without the need for external tools or databases like Nominatim
Quicker processing is easily achieved with more money, for instance through shorter Kinesis buffer intervals (see below) - hence, we define a “near real-time” goal, as somebody - me - has to pay for this ;)
Step 1: Getting Real Time GPS Data
In the original article, we used SensorLog, a great little app for iOS, to grab all the iPhone’s sensor data and store it in a CSV file that was then processed in a batch-load scenario. That is an easy solution, but for an investment of ~$15, you can get your hands on an actual GPS receiver that works almost out of the box with any GNU/Linux device, such as a Raspberry Pi or a laptop.
Set up gpsd
So, this time around, we will be relying on gpsd, the GPS service daemon for GNU/Linux and other Unix-like systems. Using it and an inexpensive GPS dongle, we can get real-time GPS data straight from the USB TTY. We will also be able to use Python to parse this data.
We will be using this GPS receiver: a Diymall VK-172. As for the hardware, I am using my System76 Gazelle laptop, running Pop!_OS 19.04 x86_64 with the 5.0.0-21 kernel. Other options are available.
The dongle on my Laptop
Setting this up is straightforward:
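Something along the following lines should do (a sketch for Debian-based systems - the device path `/dev/ttyACM0` is what the VK-172 usually enumerates as, but yours may differ):

```shell
# Install gpsd and its client tools
sudo apt-get install gpsd gpsd-clients

# Stop the distribution's default gpsd instance
sudo systemctl stop gpsd.socket gpsd.service

# Find the dongle's TTY (for the VK-172, usually ttyACM0)
dmesg | grep -i tty

# Point gpsd at the receiver
sudo gpsd /dev/ttyACM0 -F /var/run/gpsd.sock
```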
Essentially, we are configuring the gpsd daemon to read from the right TTY and make the data available to its clients. The above script is just a guideline - your TTY interface might be different.
Test this with
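For instance, the cgps client that ships with gpsd-clients:

```shell
# Human-readable dashboard of the current fix
cgps -s

# Or dump the raw JSON stream gpsd produces
gpspipe -w
```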
And you should see data coming in.
Just make sure you are near a window or outside in order to get a connection, otherwise you might see timeouts:
I have to be outside to get a signal - very fun in Georgia heat and humidity
gpsd can collect the following data:
| Type | Description |
|---|---|
| DBUS_TYPE_DOUBLE | Time (seconds since Unix epoch) |
| DBUS_TYPE_DOUBLE | Time uncertainty (seconds) |
| DBUS_TYPE_DOUBLE | Latitude in degrees |
| DBUS_TYPE_DOUBLE | Longitude in degrees |
| DBUS_TYPE_DOUBLE | Horizontal uncertainty in meters, 95% confidence |
| DBUS_TYPE_DOUBLE | Altitude in meters |
| DBUS_TYPE_DOUBLE | Altitude uncertainty in meters, 95% confidence |
| DBUS_TYPE_DOUBLE | Course in degrees from true north |
| DBUS_TYPE_DOUBLE | Course uncertainty in meters, 95% confidence |
| DBUS_TYPE_DOUBLE | Speed, meters per second |
| DBUS_TYPE_DOUBLE | Speed uncertainty in meters per second, 95% confidence |
| DBUS_TYPE_DOUBLE | Climb, meters per second |
| DBUS_TYPE_DOUBLE | Climb uncertainty in meters per second, 95% confidence |
And for the sake of simplicity, we will focus on latitude, longitude, altitude, speed, and time.
Step 2: AWS IoT Core & Greengrass
AWS IoT Greengrass will deploy a Lambda function to your device. This function will run locally, collect the GPS data, and send it back to AWS via MQTT. It will also handle caching in case an internet connection is not available.
A Local Lambda Function
First off, we’ll have to write a function to do just that:
This function uses the gps and greengrasssdk modules to collect the data in pre-defined batches.
At this point, we also define default values for the common case that a certain attribute - like latitude, longitude, or speed - cannot be read. We will apply some ETL/filters to this later on.
AWS IoT Group
Next, create an AWS IoT Core Group (please see the AWS documentation for details).
Once the group has been created, download the certificate and key files and ensure you get the right client data for your respective architecture:
Deploying the Greengrass client
We can then deploy the Greengrass client to our device. The default configuration assumes a dedicated root folder, but we will run this in our user’s home directory:
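In rough strokes, assuming the Greengrass core tarball and the group's certificate archive have been downloaded (version and file names below are placeholders - use whatever the AWS console gave you):

```shell
# Extract the core software and the group's certificates into $HOME
tar -xzf greengrass-linux-x86-64-1.9.2.tar.gz -C ~/
tar -xzf <group-id>-setup.tar.gz -C ~/greengrass

# Adjust the crypto paths in ~/greengrass/config/config.json to the
# new location, then start the daemon
cd ~/greengrass/ggc/core
sudo ./greengrassd start
```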
If you are deploying this to a dedicated device (where the daemon will be running constantly, e.g. on a Raspberry Pi), I suggest sticking to the default of using /greengrass.
Deploying the Function to AWS
Next, we need to deploy our Lambda function to AWS. As we are using custom pip dependencies, please see the deploy_venv.sh script that uses a Python Virtual Environment to package dependencies:
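In essence, deploy_venv.sh does something like the following (a sketch - file names and the Python version are this project's assumptions):

```shell
#!/bin/bash
# Build a zip containing the function code plus its pip dependencies,
# as AWS Lambda expects everything at the root of the archive
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt

cd venv/lib/python3.7/site-packages
zip -r9 ../../../../lambda.zip .
cd -
zip -g lambda.zip lambda_function.py
deactivate
```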
On the AWS console, you can now upload the code:
It is important you create an alias and a version, as this will be referenced later on when configuring the IoT pipeline:
Configuring Lambda on AWS IoT Core
Next, head back to the AWS IoT Core Group we created earlier and add a Lambda function.
Go to the group
Set up the function
Keep in mind: as we won’t be able to run containers (we need to talk to the USB GPS device via TTY), ensure that this is configured correctly:
Another thing worth mentioning is the custom user ID. The client runs under a certain username, and I strongly suggest setting up a service account for it.
Once that is completed, click on deploy and the Lambda function will be deployed to your clients.
Test the Function locally
Finally, after deployment, ensure the user is running the container and check the local logs:
(This is running in my office and hence only shows lat/long as 0/0)
Great! Now our Lambda function runs locally and sends our location to AWS every second. Neat.
Step 3: Kinesis Firehose & ETL
Next, we send the data to Kinesis Firehose, which will run a Lambda function and store the data on S3 so we can query it later.
By doing this (as opposed to triggering Lambda directly), we bundle our data into manageable packages, so that we don’t have to invoke (and pay for) a Lambda function for every handful of records. We also don’t need to handle the logic to organize the keys on the S3 bucket.
Creating an ETL Lambda Function
First off, we’ll create a Lambda function again. This time, this function will run on AWS and not on the device. We’ll call it telematics-etl.
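A sketch of what this function can look like (the record layout matches what the client sends; dropping 0/0 fixes and the -999 error code are the conventions defined in this project):

```python
import base64
import json

ERROR_CODE = -999  # stand-in for speed/altitude values we could not read


def clean(reading):
    """Validates a single GPS reading; returns None if it should be dropped."""
    if reading.get("lat") == 0 and reading.get("long") == 0:
        return None  # no GPS fix - the client's default values
    for key in ("speed", "altitude"):
        value = reading.get(key)
        if value is None or str(value).lower() == "nan":
            reading[key] = ERROR_CODE
    return reading


def lambda_handler(event, context):
    """Kinesis Firehose transformation handler."""
    output = []
    for record in event["records"]:
        data = json.loads(base64.b64decode(record["data"]))
        # The client sends batches, so data may be a list of readings
        readings = data if isinstance(data, list) else [data]
        cleaned = [c for c in (clean(r) for r in readings) if c is not None]
        if not cleaned:
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        # Nasty hack: one JSON document per line instead of json.dumps(list),
        # since Athena's JSON SerDe cannot handle an array here
        payload = "".join(json.dumps(c) + "\n" for c in cleaned)
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": base64.b64encode(
                           payload.encode("utf-8")).decode("utf-8")})
    return {"records": output}
```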
The function simply filters out invalid records (those with a latitude/longitude pair of 0/0, the default we defined earlier) and changes “nan” strings for speed and altitude to an integer of -999, which we define as an error code.
The function’s output is the base64-encoded data alongside an “Ok” status, as well as the original recordId.
We also need to make sure we have one JSON document per line and no array, as would be the default with json.dumps(data). This is a limitation of the JSON Hive SerDe Athena uses. Please forgive the nasty hack in the code.
Naturally, more complex processing can be done here.
Once done, deploy the function to AWS.
Test the Function
Once done, we can test this with a test record that can look like this:
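For instance, we can build such a record with a few lines of Python (the recordId is an arbitrary dummy here, and the reading uses lat/long 0/0):

```python
import base64
import json

# A dummy reading - lat/long 0/0, which the filter will drop
reading = {"lat": 0, "long": 0, "altitude": "nan", "speed": "nan",
           "timestamp": 1565000000}

test_event = {
    "records": [
        {
            # The recordId only has to be echoed back by the function
            "recordId": "test-record-1",
            "data": base64.b64encode(
                json.dumps(reading).encode("utf-8")).decode("utf-8"),
        }
    ]
}
print(json.dumps(test_event, indent=2))
```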
Sample output - using lat/long 0/0 without the filter for privacy reasons
AWS Kinesis Firehose
Once our function is working, we want to make sure all incoming data from the device automatically invokes this function, runs the ETL, and stores the data on AWS S3.
We can configure our Firehose stream as such:
In the 2nd step, we tell the stream to use the telematics-etl Lambda function:
As well as S3 as the target.
The following settings define the threshold and delay to push the data to S3; at this point, tuning can be applied to make the pipeline run quicker or more frequently.
Connect IoT Core and Kinesis Firehose
And in order for it to be triggered automatically, all we need is an IoT Core Action that will send our queue data to Firehose:
Step 3.5: End-to-end testing
At this point, it is advisable to test the entire pipeline end-to-end by simply starting the greengrassd service and checking the output along the way.
Once the service is started, we can ensure that the function is running:
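For example (log paths on the Greengrass core follow the pattern below; region and account ID vary):

```shell
# Each deployed Lambda runs as its own process on the core
ps aux | grep -i telematics

# Follow the function's local log output
sudo tail -f ~/greengrass/ggc/var/log/user/*/*/*.log
```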
On the IoT console, we can follow along all MQTT messages:
Once we see data here, it should show up in Kinesis Firehose:
Next, check the CloudWatch logs for the telematics-etl Lambda function and finally, the data on S3.
A note on collecting real data
As you can imagine, collecting data can be tricky when using a Laptop - unless you happen to be a police officer, most commercial cars (and traffic laws ;) ) don’t account for using a Terminal on the road.
While relying on a headless box is certainly possible (and more realistic for daily use), I do suggest running at least one set of data collection with something that has a screen so you can validate the accuracy of GPS data.
Data collection on Atlanta roads
Step 4: Analyze & Visualize the Data
Once we have collected some data, we can head over to AWS Athena to attach a SQL interface to our JSON files on S3. Athena uses the Apache Hive dialect, but offers several helpers to make our lives easier. We’ll start by creating a database and mapping a table to our S3 output:
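The DDL can look roughly like this (database, table, and bucket names are placeholders; the OpenX JSON SerDe is one of the JSON SerDes Athena supports):

```sql
CREATE DATABASE IF NOT EXISTS telematics;

CREATE EXTERNAL TABLE telematics.trips (
  lat         DOUBLE,
  `long`      DOUBLE,
  altitude    DOUBLE,
  speed       DOUBLE,
  `timestamp` BIGINT
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-telematics-bucket/';
```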
We can now query the data:
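For example, with the S3 data mapped to a table called telematics.trips (a placeholder name):

```sql
SELECT lat, `long`, speed, `timestamp`
FROM telematics.trips
ORDER BY `timestamp` DESC
LIMIT 100;
```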
And see our trip output.
As you may have noticed, we are skipping a more complex, SQL based ETL step that would automatically group trips or at least organize the data in a meaningful way. For the sake of a simple process, we skipped this - but it certainly belongs on the “to do” list of things to improve.
“We should be able to see where we went, when we went there, what route we took, and how fast we were going”
As indicated in our goals, we want to know some things. For instance, what was our top speed on a trip on 2019-08-05?
Simple - we multiply the speed (in m/s) by 2.237 to get miles per hour, select the max of that speed, and group it by day:
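With speed stored in m/s and the timestamp as Unix epoch seconds in a table called telematics.trips (a placeholder name), the query can look like this:

```sql
SELECT date_format(from_unixtime(`timestamp`), '%Y-%m-%d') AS trip_date,
       max(speed * 2.237) AS max_mph
FROM telematics.trips
WHERE speed != -999  -- skip our error code
GROUP BY 1;
```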
Which gives us 58.7 mph, which seems about right for a trip on the interstate.
Max speed on a trip
Queries are nice. But what about visuals?
As highlighted in the overview, we use QuickSight to visualize the data. QuickSight is a simple choice for this use case, as it provides geospatial visualization out of the box and behaves similarly to Tableau and other enterprise visualization toolkits. Do keep in mind that a custom dashboard on e.g. Elastic Beanstalk with d3.js could provide the same value with a quicker data refresh rate - QuickSight Standard requires manual refreshes, whereas QuickSight Enterprise can refresh the data once per hour automatically.
While this does defeat the purpose of “real time”, it makes for a simple, basic analysis out of the box. Refreshing the data on the road yields about a 1 minute delay.
A trip visualized while on said trip
The set up is easy - sign up for QuickSight on the AWS console, add Athena as a data set, and drag-and-drop the fields you want.
Add a data set
Use a custom query
When editing the data set, you can define double fields as latitude and longitude for geospatial analysis:
And by simply dragging the right fields into some analysis, we get a nifty little map, showing a trip:
Oftentimes, you don’t even need SQL. If we want to show our average speed by the minute, we can build a chart using the timestamp value with a custom format (HH:mm) and changing the default sum(mph) to average(mph), like so:
Average speed by the minute
Using more customized SQL to do fancier things is trivial as well. For instance, seeing “high speed” scenarios on the dataset can be done as such:
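One way to do it, against a table called telematics.trips (the 45 mph threshold is an arbitrary choice of mine):

```sql
SELECT lat, `long`, `timestamp`,
       speed * 2.237 AS mph,
       CASE WHEN speed * 2.237 >= 45 THEN 'high speed'
            ELSE 'regular' END AS speed_class
FROM telematics.trips
WHERE speed != -999;
```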
And then added to the data set:
Trip with calculated fields
And all of a sudden, you can almost see all traffic lights on that route through the East of Atlanta.
Do keep in mind that QuickSight is a fairly simple tool that does not compare to the functionality of other “big” BI tools or even a Jupyter Notebook. But, in the spirit of the article, it is easy to use and set up quickly.
Compared to the “Tiny Telematics” project from 2 years ago, this pipeline is much simpler, runs in near real-time, scales much more easily, and requires no infrastructure setup. The whole project can be set up within a couple of hours.
Granted, we have skipped a couple of steps - for instance, a more in-depth ETL module that could prepare a much cleaner data set or a more scalable long-term storage architecture, like DynamoDB.
The focus on a “serverless” architecture enabled us to quickly spin up and use the resources we need - no time was spent on architecture management.
However, all that glitters is not gold. While we did make quick progress and have a working solution at hand (granted, driving around with a laptop maybe only qualifies as a “proof of concept” ;) ), we gave up autonomy over a lot of components. It’s not quite “vendor lock-in” - the code is easy enough to port, but it would not run out of the box on another system or cloud provider.
IoT Core Greengrass handled deployment to clients, certificates, code execution, containerization, and message queues.
Kinesis Firehose took over the role of a fully-fledged streaming framework like Spark Streaming, Kafka, or Flink; it handled code execution, transfer, scaling, ETL resources through Lambda, and sinks into the storage stage.
Athena bridges the gap a little bit - by relying on the Hive dialect and an open-source SerDe framework, the table definitions and SQL can easily be ported to a local Hive instance.
Lambda can be regarded in similar terms - it’s just Python with some additional libraries. Switching those out and using e.g. a Kafka queue would be trivial.
So, in conclusion - once again, this was a completely and utterly pointless, albeit fun, project. It shows how powerful even a small subset of AWS can be, how (relatively) easy it is to set up, how real-world hardware can be used in conjunction with “the Cloud”, and how old ideas can be translated to a more hip - a word I prefer over “modern” - infrastructure and architecture.
All development was done under Pop!_OS 19.04 on kernel 5.0.0 with 12 Intel i7-9750H vCores @ 2.6GHz and 16GB RAM on a 2019 System76 Gazelle laptop.
The full source is available on GitHub.