# Reddit comment 2024-04-03
> See https://www.reddit.com/r/algotrading/comments/1bu59ql/comment/kxuil9a
There will always be some differences between the infrastructure a vendor uses to process real-time data and the infrastructure used for historical data. It takes a bit of effort to make these as identical as possible. A non-exhaustive list of issues:
1. The most common issue I’ve seen is that the vendor will clean and patch their historical data ex post in ways that are not replicable in real-time. (The most obvious tell is if you report a data error and they tell you it’s patched within the same day.) This is one area where Bloomberg is quite good despite doing it the “wrong” way - they have a strong data model and provenance/versioning. The “better” approach is to just give you the raw data and only apply corrections by changing the real-time parser behavior and regenerating from scratch - MayStreet, Databento, Pico, and Exegy take this approach.
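One way to check for this yourself is to capture the vendor’s real-time feed for a session, then diff it against the historical download of the same session a few weeks later. A minimal sketch, assuming hypothetical file names and a hypothetical column layout:

```python
# Sketch: diff a locally captured real-time feed against the vendor's
# historical download of the same session to spot ex-post patching.
# File names and columns below are hypothetical placeholders.
import pandas as pd

KEYS = ["ts_event", "symbol", "price", "size"]

live = pd.read_csv("capture_2024-04-03.csv", usecols=KEYS)
hist = pd.read_csv("vendor_history_2024-04-03.csv", usecols=KEYS)

# Rows present in one source but not the other indicate silent edits:
# patched prints, deleted trades, or backfilled gaps.
merged = live.merge(hist, on=KEYS, how="outer", indicator=True)
diffs = merged[merged["_merge"] != "both"]

print(f"{len(diffs)} of {len(merged)} rows differ between live and historical")
print(diffs["_merge"].value_counts())  # left_only = live-only, right_only = hist-only
```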
2. We’ve also seen vendors do an opaque mix-and-match of feeds and derived data, e.g. SIP historical with IEX/Nasdaq real-time, or synthetic prices (some ATSes do this with a weighted midprice, etc.). This is something that institutional providers like Databento avoid by strictly giving you the same feed or feeds. Other notable ones that are good in this regard: QuantHouse, Activ, Exegy, Pico.
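A mismatch like SIP historical vs. single-venue real-time shows up quickly in per-venue coverage. A rough sketch, again with hypothetical file and column names:

```python
# Sketch: if historical is SIP-sourced but real-time is a single-venue
# feed, per-venue message counts won't line up. The "venue" column and
# file names are hypothetical.
import pandas as pd

live = pd.read_csv("capture_2024-04-03.csv")
hist = pd.read_csv("vendor_history_2024-04-03.csv")

coverage = pd.concat({
    "live": live["venue"].value_counts(),
    "hist": hist["venue"].value_counts(),
}, axis=1).fillna(0)
print(coverage)  # venues present in only one column expose a feed mismatch
```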
3. Another egregious issue is when the vendor backfills from secondary redistributors of drastically different quality and mixes and matches sources. We often see this kind of backfill-and-rebadge done over ICE/IDS, Xignite, dxFeed, IEX Cloud, Quodd/Nanex, and Refinitiv’s data, because those are more liberal with historical redistribution. (You don’t see Bloomberg’s data getting bootlegged, since they restrict historical redistribution.) And we’ve always seen the rebadged data be much, much worse than the original.
This is a more common issue to look out for among “newer” vendors - including us - since a vendor started in, say, 2019, obviously needs another source for data dating back to, say, 2010. The telltale sign is if the data is suspiciously cheap AND the vendor is not an official licensed distributor on the exchange directories. There’s no reason good data must be expensive, but it’s easier to make it cheap when you’re rebadging, because secondary sources tend to be cheaper, so your margins are higher. Another way to tell is to compare their oldest data to their newest. (This is why Databento doesn’t have data going that far back - we only trust primary sources like the exchange or raw packet captures.)
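Comparing oldest to newest can be as simple as computing a few crude quality proxies per era and looking for a sharp break where a backfill was stitched on. A sketch with hypothetical file names, columns, and years:

```python
# Sketch: compare the vendor's oldest data against its newest on a few
# crude quality proxies. A sharp break often marks where a secondary
# backfill was stitched onto the vendor's own collection.
# File names, columns, and years are hypothetical.
import pandas as pd

def quality_report(path: str) -> pd.Series:
    df = pd.read_csv(path, parse_dates=["ts_event"])
    return pd.Series({
        "rows_per_day": len(df) / df["ts_event"].dt.date.nunique(),
        "zero_or_neg_prices": (df["price"] <= 0).sum(),
        "dup_rows": df.duplicated().sum(),
        # Coarse timestamps (whole seconds only) suggest a degraded source.
        "subsecond_share": (df["ts_event"].dt.microsecond > 0).mean(),
    })

old = quality_report("vendor_2010.csv")  # oldest available year
new = quality_report("vendor_2023.csv")  # recent year
print(pd.concat({"2010": old, "2023": new}, axis=1))
```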
4. Another issue is when the timestamps are drastically different in historical vs. real-time. This is an area where the legacy Refinitiv Tick History (non-MayStreet) is ironically quite good - they address it by being equally bad in both, consolidating historical and real-time through their Docklands hub.
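To quantify this, compare the distribution of receive-minus-event timestamp deltas between your own live capture and the vendor’s historical replay of the same session. A minimal sketch, using the same hypothetical files and columns as above:

```python
# Sketch: compare receive-minus-event timestamp deltas between a live
# capture and the vendor's historical replay of the same session. A
# large shift in the distribution means the two pipelines stamp
# differently. File and column names are hypothetical.
import pandas as pd

def latency_profile(path: str) -> pd.Series:
    df = pd.read_csv(path, parse_dates=["ts_event", "ts_recv"])
    delta_us = (df["ts_recv"] - df["ts_event"]).dt.total_seconds() * 1e6
    return delta_us.describe(percentiles=[0.5, 0.99])

print("live:\n", latency_profile("capture_2024-04-03.csv"))
print("hist:\n", latency_profile("vendor_history_2024-04-03.csv"))
```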
I’ve named the firms that are decent, but you can probably draw your own conclusions about which ones are bad by omission. I don’t mind naming good firms and giving credit where it’s due, even to competitors, but I prefer not to namedrop ones that are egregiously bad.