Creating Data “Tiles” for Massive Amounts of Price Data

I’m designing a JavaScript chart that I want to be awesome and I need your help.

Let’s say I have 5 million records with price data like this:

  • Find Shortest Binary String In Given Interval
  • Javascript code for making my browser slow down
  • Graphing Algorithm for many nodes
  • Algorithm to merge multiple sorted sequences into one sorted sequence in javascript
  • Why does this loop terminate partway through?
  • Best algorithm for determining the high and low in an array of numbers?
  • open  high  low  close  volume  timeStart      timeEnd        timeDuration
    10    20    5    15     500     1391895920860  1391895920920  60
    

    So, I have 5 million records describing the price of something at the one minute resolution.

    Now, I need to show a chart to the user representing all 5 million of these records. By default the chart is going to show 2 years of data at the one week interval. Two years of data at a one week interval is just 104 records. So, in order to generate my 104 weekly records I need to process all 5 million of my minute resolution records and group them into weeks.

    Let’s say the user zooms in and wants to see data grouped for every three days instead of the one week interval. I’m going to need to generate three day records which group the data from my one minute resolution data.

    In fact, for each interval that I am going to allow the user to zoom into I need to summarize my minute resolution data ahead of time. To save load on the database, I am going to generate CSV “tiles” that the client can quickly pull down. Each tile will have 1,000 records in it.

    Here are the resolutions (“zoom levels”) I will support and the number of tiles I will need to generate ahead of time for two years of price data at the one minute resolution:

    1 minute - 1440 tiles (estimated to be about 40 MB of data)
    3 minute - 480 tiles
    5 minute - 288 tiles
    15 minute - 96 tiles
    30 minute - 48 tiles
    1 hour - 24 tiles
    2 hours - 12 tiles
    4 hours - 6 tiles
    6 hours - 4 tiles
    12 hours - 2 tiles
    1 day - 1 tile
    3 days - 1 tile
    1 week - 1 tile
    1 month - 1 tile
    1 year - 1 tile
    

    The idea here is that when the user first hits the chart, all I need to do is pull down the one week tile and it will have all 104 of my records. When they zoom into a section of the one week tile I will pull down the 3 day tile.

    So here is where I need help. If the user zooms into the 12 hour tile from the 1 day zoom level, I need to intelligently know which of the 12 hour tiles I should grab. I think I need to have a naming scheme for my tiles that will allow the client to easily identify sub tiles for any parent tile.

    Can anyone help me with a naming scheme for these tiles? I specifically need:

    1. A file naming scheme for all of my tiles I will generate
    2. An algorithm the chart will use to easily identify the correct child tile to zoom into or parent child to zoom out to

    Would love to hear your thoughts and feedback!

  • Permutations via Heap's algorithm with a mystery comma
  • What is the fastest implementation of ColdFusion's listFindNoCase function in Javascript?
  • Sort an array of integers into odd, then even
  • javascript change elements in array
  • How to make a controlled “shuffle” order?
  • Aggregation of array data over a given dimension
  • 3 Solutions collect form web for “Creating Data “Tiles” for Massive Amounts of Price Data”

    This depends on almost about anything, and you don’t say much about your server side.

    But, you must balance on what you compute before, and what you compute now. The before takes space and saves time, the now takes time and saves space. Do you have fast I/O (hdd/ssd/raid X) ? Do you have much space? Do you have powerful cpus? Does your server handle great bandwith? Why each tile should have 1000 records? What are the most popular zoom levels?

    First of all you must start from somewhere. Your tiniest data, is the minute. So you must have details about minutes. This releaves you from holding miliseconds. From my experience a file with a range of id’s is essential like a 23211239-23242175.csv file. So that “tile” is your start. The id I got is the miliseconds/1000/60 to become minutes. I see your data has time-range and that is a small problem. Every record should fill entirely into your timeframe, it’s like you should care only about timeEnd and put the record in that time-frame.

    To have good I/O check your file to be close to a multiple of the cluster size of your disks. For windows I think it is 4kb. The cluster size is the size that is brought back by the disk in one I/O, let’s say you have 2 csv’s that are 6kb each. It would require 2 I/O’s to bring each file to the memory, so you got 4 I/O’s to bring 2x6Kb files. But if you split these into 3 of 4kb each, it would require 3 I/O’s to load them into memory. So check if these tiles are really close to cluster size.

    See if you can get rid of some fields like timeStart,timeEnd,timeDuration because your filename will describe these fields. Try to precompute as much data as you can, to save proccessing time. From your details on zoom’s levels, each level is computed by the previous level. Try to create subfolders that go through the next levels. You don’t have to compute every zoom level, try to skip some, it will save space. Then upon requests you will find what users are most asking for and create it if it does not exist. Have folders like:

    Path/1m/xxxxxxxx-xxxxxxx.csv
    Path/30m/xxxxxxx-xxxxxx.csv
    Path/6h/xxxxx-xxxxx.csv
    Path/w/xx-xx.csv
    

    Be sure you only have the 1 minute file at start and then create the others. I bet that when you zoom into a level, like for 2014 it will show 2014’s months. So it is like you need at most X results for each chart. So maybe your tiles won’t have the same lenth at each zoom level/folder. It depends on what they describe. Then it will be easy for you to find anything, just check the range the file describes and return it. Popular resolutions can be cached into memory for a period of time to make even faster the output proccess.

    On the client side, you should try to keep as much data as you can. Don’t ask server for data that you have previously got again. The problem with RAM caching files is that the file can be outdated, so be careful. If you watch results of past year (year is said because it is the most generic resolution level in this application) you should know that newcomers won’t affect these old fields, so you can cache them safely.

    This problem does not lie only on client side programming. To make a robust and efficient chart app you should have a great backend logic to accomplish it. I had to tell these even if I don’t see any server tags (ror,nodejs,php) in this question…

    I would model the data in layers. One layer per resolution. Each layer is independent nominative and functionally. Every layer use same name convention for system stability.

    If you are going to use JavaScript I would use JavaScript date convention if applicable.

    http://www.ecma-international.org/ecma-262/5.1/#sec-15.9

    For non standard “resolutions” (e.g. 6 hour division) is easier understandable applying a simple model: ordinary division

    • addition: Context.”Bigger” unit part of date.
    • dividend: Parts of the element.
    • divisor: Size of divisions.
    • remainder: Index for the resolution.

    Date for examples: 11 Feb 2014 08:14 22′.

    Solution A

    Only 1 bigger element is divided. For example in the 6 hour resolution only 1 day is divided. The rest of days are the “addition” or context.

    layer 1:

    Layer1/2014.csv
    

    layer 7:

    Layer7/2014/41/1.csv
    

    *41 is day of the year.

    Solution B

    Split by timestamp range.

    layer 7:

    Layer7/1414908000-1414929600.csv
    

    You can use a simple grid and z-value for the zoom. Then you can use a quadkey to get the right tile. On the disk you can use x/y/z folders. A good explanation is the Microsoft Bing maps tile system:http://msdn.microsoft.com/en-us/library/bb259689.aspx. The only problem is a map is 2 dimensions but you have only 1 dimension.