Creating Data “Tiles” for Massive Amounts of Price Data
Let’s say I have 5 million records with price data like this:
open high low close volume timeStart timeEnd timeDuration 10 20 5 15 500 1391895920860 1391895920920 60
So, I have 5 million records describing the price of something at the one minute resolution.
Now, I need to show a chart to the user representing all 5 million of these records. By default the chart is going to show 2 years of data at the one week interval. Two years of data at a one week interval is just 104 records. So, in order to generate my 104 weekly records I need to process all 5 million of my minute resolution records and group them into weeks.
Let’s say the user zooms in and wants to see data grouped for every three days instead of the one week interval. I’m going to need to generate three day records which group the data from my one minute resolution data.
In fact, for each interval that I am going to allow the user to zoom into I need to summarize my minute resolution data ahead of time. To save load on the database, I am going to generate CSV “tiles” that the client can quickly pull down. Each tile will have 1,000 records in it.
Here are the resolutions (“zoom levels”) I will support and the number of tiles I will need to generate ahead of time for two years of price data at the one minute resolution:
1 minute - 1440 tiles (estimated to be about 40 MB of data) 3 minute - 480 tiles 5 minute - 288 tiles 15 minute - 96 tiles 30 minute - 48 tiles 1 hour - 24 tiles 2 hours - 12 tiles 4 hours - 6 tiles 6 hours - 4 tiles 12 hours - 2 tiles 1 day - 1 tile 3 days - 1 tile 1 week - 1 tile 1 month - 1 tile 1 year - 1 tile
The idea here is that when the user first hits the chart, all I need to do is pull down the one week tile and it will have all 104 of my records. When they zoom into a section of the one week tile I will pull down the 3 day tile.
So here is where I need help. If the user zooms into the 12 hour tile from the 1 day zoom level, I need to intelligently know which of the 12 hour tiles I should grab. I think I need to have a naming scheme for my tiles that will allow the client to easily identify sub tiles for any parent tile.
Can anyone help me with a naming scheme for these tiles? I specifically need:
- A file naming scheme for all of my tiles I will generate
- An algorithm the chart will use to easily identify the correct child tile to zoom into or parent child to zoom out to
Would love to hear your thoughts and feedback!
3 Solutions collect form web for “Creating Data “Tiles” for Massive Amounts of Price Data”
This depends on almost about anything, and you don’t say much about your server side.
But, you must balance on what you compute before, and what you compute now. The before takes space and saves time, the now takes time and saves space. Do you have fast I/O (hdd/ssd/raid X) ? Do you have much space? Do you have powerful cpus? Does your server handle great bandwith? Why each tile should have 1000 records? What are the most popular zoom levels?
First of all you must start from somewhere. Your tiniest data, is the minute. So you must have details about minutes. This releaves you from holding miliseconds. From my experience a file with a range of id’s is essential like a
23211239-23242175.csv file. So that “tile” is your start. The id I got is the
miliseconds/1000/60 to become minutes. I see your data has time-range and that is a small problem. Every record should fill entirely into your timeframe, it’s like you should care only about
timeEnd and put the record in that time-frame.
To have good I/O check your file to be close to a multiple of the cluster size of your disks. For windows I think it is 4kb. The cluster size is the size that is brought back by the disk in one I/O, let’s say you have 2 csv’s that are 6kb each. It would require 2 I/O’s to bring each file to the memory, so you got 4 I/O’s to bring 2x6Kb files. But if you split these into 3 of 4kb each, it would require 3 I/O’s to load them into memory. So check if these tiles are really close to cluster size.
See if you can get rid of some fields like
timeStart,timeEnd,timeDuration because your filename will describe these fields. Try to precompute as much data as you can, to save proccessing time. From your details on zoom’s levels, each level is computed by the previous level. Try to create subfolders that go through the next levels. You don’t have to compute every zoom level, try to skip some, it will save space. Then upon requests you will find what users are most asking for and create it if it does not exist. Have folders like:
Path/1m/xxxxxxxx-xxxxxxx.csv Path/30m/xxxxxxx-xxxxxx.csv Path/6h/xxxxx-xxxxx.csv Path/w/xx-xx.csv
Be sure you only have the 1 minute file at start and then create the others. I bet that when you zoom into a level, like for 2014 it will show 2014’s months. So it is like you need at most X results for each chart. So maybe your tiles won’t have the same lenth at each zoom level/folder. It depends on what they describe. Then it will be easy for you to find anything, just check the range the file describes and return it. Popular resolutions can be cached into memory for a period of time to make even faster the output proccess.
On the client side, you should try to keep as much data as you can. Don’t ask server for data that you have previously got again. The problem with RAM caching files is that the file can be outdated, so be careful. If you watch results of past year (year is said because it is the most generic resolution level in this application) you should know that newcomers won’t affect these old fields, so you can cache them safely.
This problem does not lie only on client side programming. To make a robust and efficient chart app you should have a great backend logic to accomplish it. I had to tell these even if I don’t see any server tags (ror,nodejs,php) in this question…
I would model the data in layers. One layer per resolution. Each layer is independent nominative and functionally. Every layer use same name convention for system stability.
For non standard “resolutions” (e.g. 6 hour division) is easier understandable applying a simple model: ordinary division
- addition: Context.”Bigger” unit part of date.
- dividend: Parts of the element.
- divisor: Size of divisions.
- remainder: Index for the resolution.
Date for examples: 11 Feb 2014 08:14 22′.
Only 1 bigger element is divided. For example in the 6 hour resolution only 1 day is divided. The rest of days are the “addition” or context.
*41 is day of the year.
Split by timestamp range.
You can use a simple grid and z-value for the zoom. Then you can use a quadkey to get the right tile. On the disk you can use x/y/z folders. A good explanation is the Microsoft Bing maps tile system:http://msdn.microsoft.com/en-us/library/bb259689.aspx. The only problem is a map is 2 dimensions but you have only 1 dimension.