Skip to content

Instantly share code, notes, and snippets.

@NickSeagull
Last active November 26, 2017 22:16
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save NickSeagull/bf91cc51c47d52ff67c6fd13e3a81156 to your computer and use it in GitHub Desktop.
Save NickSeagull/bf91cc51c47d52ff67c6fd13e3a81156 to your computer and use it in GitHub Desktop.
An adaptation of Deedle in 10 minutes for Haskell, for a possible API for Analyze

Getting started with Analyze

NOTE: NOTHING OF THIS HAS BEEN IMPLEMENTED YET, THIS IS A DESIGN DOCUMENT

Taken from Deedle in 10 minutes, adapted to Analyze

import Analyze
import Analyze.Frame as Frame
import Analyze.Series as Series
import qualified Date.Time.Calendar  -- from the 'time' package

Creating series and frames

We can create a series with any type that instantiates the IsList type class.

idxes = [ "A"
        , "B"
        , "C"
        ]

values = [10, 20, 30]

firstSeries = Series.new idxes values

Also, we can create a series from a list of tuples:

secondSeries = Series.ofObservations
    [ ( "A", 10)
    , ( "B", 20)
    , ( "C", 30)
    ]

We can create a series of implicit (ordinal) keys by doing:

thirdSeries = Series.ofValues [ 10.0, 20.0, 30.0 ]

Now we can create a Frame using the first and the second Series, as they share the keys.

df1 = Frame.new ["first", "second"] [firstSeries, secondSeries]

A Frame has two type parameters: Frame rowKey columnKey. We use this to index. The types of the data itself are not specified, and instead, we do so when getting the data from the Frame.

We can create a data frame with Int indexes for rows or columns by:

df2 = Frame.ofColumns [ ("first", firstSeries), ("second", secondSeries) ]
df3 = Frames.ofRows [ ("first", firstSeries), ("second", secondSeries)]

Also, we can specify our indexes for rows and columns by specifying (rowKey, columnKey, value):

df4 = Frame.ofValues
    [ ("Monday", "John", 1.0)
    , ("Tuesday", "Joe", 2.1)
    , ("Tuesday", "John", 4.0)
    , ("Wednesday", "John", -5.4)
    ]

If your data types derive the Generic type class you can create a data frame from a list of those too:

data Price = Price
    { day :: Text
    , quantity :: Float
    } deriving (Generic)

instance Serialize Price

prices :: [Price]
prices =
    [ Price "1-1-17" 10.0
    , Price "2-1-17" 12.0
    , Price "3-1-17" 13.0
    ]

df5 = Frame.ofRecords prices

Finally, we can also load a data frame from CSV.

msftCsv = Frame.readCsv "resources/MSFT.csv"
fbCsv = Frame.readCsv "resources/FB.csv"

The types are not inferred when loading like that, the user must specify them later.

Specifying index and joining

msftOrd <- Frame.withFrame msftCsv $ do
    Frame.indexRowsDate "Date"
    Frame.sortRowsByKey

We can now get only the open and close prices, and add a new column.

msft <- Frame.withFrame msftOrd $ do
    Frame.sliceColumns [ "Open", "Close" ]
    openColumn       <- Frame.getColumn "Open"
    closeColumn      <- Frame.getColumn "Close"
    let differenceColumn = zipWith (-) openColumn closeColumn
    Frame.addColumn "Difference" differenceColumn

We can do the same thing for Facebook:

fb <- Frame.withFrame fbCsv $ do
    Frame.indexRowsDate "Date"
    Frame.sortRowsByKey
    Frame.sliceColumns [ "Open", "Close" ]
    openColumn       <- Frame.getColumn "Open"
    closeColumn      <- Frame.getColumn "Close"
    let differenceColumn = zipWith (-) openColumn closeColumn
    Frame.addColumn "Difference" differenceColumn

Let's create a single data frame that contains Microsoft and Facebook data. Before joining those data frames, we have to rename their columns so their names aren't duplicated.

let msftNames = ["MsftOpen", "MsftClose", "MsftDiff"]
msftRenamed <- Frame.withFrame msft $
    Frame.indexColumnsWith msftNames

let fbNames = ["FbOpen", "FbClose", "FbDiff"]
fbRenamed <- Frame.withFrame fb $
    Frame.indexColumnsWith fbNames

let joinedOut = Frame.withFrame msftRenamed $
    Frame.outerJoin fbRenamed

let joinedIn = Frame.withFrame msftRenamed $
    Frame.innerJoin fbRenamed

Selecting values and slicing

let val  = Frame.getRow (Data.Time.Calendar.fromGregorian 2013 1 2) joinedIn
let val' = Frame.getRow (Data.Time.Calendar.fromGregorian 2013 1 2) joinedIn
         & Series.get "FbOpen"

Projection and filtering

-- 'modifying' modifies the original 'Frame' instead of copying it
Frame.modifying joinedOut $ do
    comparison <- Frame.mapRowValues $ \row ->
        if (Series.get "msftOpen" row) > (Series.get "fbOpen" row)
            then "MSFT"
            else "FB"
    Frame.addColumn "Comparison" comparison

We can now get the number of days when Microsoft stock prices were above Facebook and the other way round:

let msftCount = Frame.withFrame joinedOut $ do
    Frames.getColumn "Comparison" (Series.as :: String)
    >>= Series.filterValues (== "MSFT")
    >>= Series.countValues
-- msftCount = 220

let fbCount = Frame.withFrame joinedOut $ do
    Frames.getColumn "Comparison" (Series.as :: String)
    >>= Series.filterValues (== "FB")
    >>= Series.countValues
-- fbCount = 103

Grouping and aggregation

Group rows by month and year:

monthly <- Frames.withFrame joinedIn $
    Frame.groupRowsUsing $ \(y,m,_) _ -> fromGregorian y m 1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment