Mastering market dynamics: Transforming transaction cost analytics with ultra-precise Tick History – PCAP and Amazon Athena for Apache Spark


This post is cowritten with Pramod Nayak, LakshmiKanth Mannem, and Vivek Aggarwal from the Low Latency Group of LSEG.

Transaction cost analysis (TCA) is widely used by traders, portfolio managers, and brokers for pre-trade and post-trade analysis, and helps them measure and optimize transaction costs and the effectiveness of their trading strategies. In this post, we analyze options bid-ask spreads from the LSEG Tick History – PCAP dataset using Amazon Athena for Apache Spark. We show you how to access data, define custom functions to apply to the data, query and filter the dataset, and visualize the results of the analysis, all without having to worry about setting up infrastructure or configuring Spark, even for large datasets.

Background

The Options Price Reporting Authority (OPRA) serves as a crucial securities information processor, collecting, consolidating, and disseminating last sale reports, quotes, and pertinent information for US options. With 18 active US options exchanges and over 1.5 million eligible contracts, OPRA plays a pivotal role in providing comprehensive market data.

On February 5, 2024, the Securities Industry Automation Corporation (SIAC) is set to upgrade the OPRA feed from 48 to 96 multicast channels. This enhancement aims to optimize symbol distribution and line capacity utilization in response to escalating trading activity and volatility in the US options market. SIAC has recommended that firms prepare for peak data rates of up to 37.3 GBits per second.

Although the upgrade does not immediately alter the total volume of published data, it enables OPRA to disseminate data at a significantly faster rate. This transition is crucial for addressing the demands of the dynamic options market.

OPRA stands out as one of the most voluminous feeds, with a peak of 150.4 billion messages in a single day in Q3 2023 and a capacity headroom requirement of 400 billion messages over a single day. Capturing every single message is critical for transaction cost analytics, market liquidity monitoring, trading strategy evaluation, and market research.

About the data

LSEG Tick History – PCAP is a cloud-based repository, exceeding 30 PB, housing ultra-high-quality global market data. This data is meticulously captured directly within the exchange data centers, employing redundant capture processes strategically positioned in primary and backup exchange data centers worldwide. LSEG's capture technology ensures lossless data capture and uses a GPS time source for nanosecond timestamp precision. Additionally, sophisticated data arbitrage techniques are employed to seamlessly fill any data gaps. Subsequent to capture, the data undergoes meticulous processing and arbitration, and is then normalized into Parquet format using LSEG's Real Time Ultra Direct (RTUD) feed handlers.

The normalization process, which is integral to preparing the data for analysis, generates up to 6 TB of compressed Parquet data per day. The large volume of data is attributed to the encompassing nature of OPRA, spanning multiple exchanges and featuring numerous options contracts characterized by diverse attributes. Increased market volatility and market-making activity on the options exchanges further contribute to the volume of data published on OPRA.

The attributes of Tick History – PCAP enable firms to conduct various analyses, including the following:

  • Pre-trade analysis – Evaluate potential trade impact and explore different execution strategies based on historical data
  • Post-trade evaluation – Measure actual execution costs against benchmarks to assess the performance of execution strategies
  • Optimized execution – Fine-tune execution strategies based on historical market patterns to minimize market impact and reduce overall trading costs
  • Risk management – Identify slippage patterns, detect outliers, and proactively manage risks associated with trading activities
  • Performance attribution – Separate the impact of trading decisions from investment decisions when analyzing portfolio performance

The LSEG Tick History – PCAP dataset is available in AWS Data Exchange and can be accessed through AWS Marketplace. With AWS Data Exchange for Amazon S3, you can access PCAP data directly from LSEG's Amazon Simple Storage Service (Amazon S3) buckets, eliminating the need for firms to store their own copy of the data. This approach streamlines data management and storage, providing clients immediate access to high-quality PCAP or normalized data with ease of use, integration, and substantial data storage savings.
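
Because the entitled data is exposed through standard Amazon S3 APIs, a subscriber can browse it with ordinary AWS SDK calls. The following minimal sketch lists a few Parquet objects with boto3; the bucket (or access point alias) and prefix are placeholders for the values provided with your AWS Data Exchange entitlement, not actual LSEG locations.

import boto3

s3 = boto3.client("s3")

# Placeholder values -- substitute the bucket or access point alias and prefix
# from your AWS Data Exchange for Amazon S3 entitlement
BUCKET = "<entitled-bucket-or-access-point-alias>"
PREFIX = "mt=bbo_quote/f=opra/dt=2023-05-17/"

response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])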

Athena for Apache Spark

For analytical endeavors, Athena for Apache Spark offers a simplified notebook experience accessible through the Athena console or Athena APIs, allowing you to build interactive Apache Spark applications. With an optimized Spark runtime, Athena supports the analysis of petabytes of data by dynamically scaling the number of Spark engines in less than a second. Moreover, common Python libraries such as pandas and NumPy are seamlessly integrated, allowing for the creation of intricate application logic. The flexibility extends to importing custom libraries for use in notebooks. Athena for Spark accommodates most open data formats and is seamlessly integrated with the AWS Glue Data Catalog.
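
For example, a trivial cell like the following (a quick sketch, not part of the TCA analysis) runs as-is in an Athena Spark notebook because pandas and NumPy ship with the runtime:

import numpy as np
import pandas as pd

# Confirm the bundled analytics libraries are available in the notebook runtime
print("pandas", pd.__version__, "| numpy", np.__version__)
print(pd.DataFrame({"mid_price": np.array([1.15, 1.16, 1.17])}).describe())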

Dataset

For this analysis, we used the LSEG Tick History – PCAP OPRA dataset from May 17, 2023. This dataset comprises the following components:

  • Best bid and offer (BBO) – Reports the highest bid and lowest ask for a security at a given exchange
  • National best bid and offer (NBBO) – Reports the highest bid and lowest ask for a security across all exchanges
  • Trades – Records completed trades across all exchanges

The dataset includes the following data volumes:

  • Trades – 160 MB distributed across approximately 60 compressed Parquet files
  • BBO – 2.4 TB distributed across approximately 300 compressed Parquet files
  • NBBO – 2.8 TB distributed across approximately 200 compressed Parquet files

Analysis overview

Analyzing OPRA Tick History data for transaction cost analysis (TCA) involves scrutinizing market quotes and trades around a specific trade event. We use the following metrics as part of this study:

  • Quoted spread (QS) – Calculated as the difference between the BBO ask and the BBO bid
  • Effective spread (ES) – Calculated as the difference between the trade price and the midpoint of the BBO (BBO bid + (BBO ask – BBO bid)/2)
  • Effective/quoted spread (EQF) – Calculated as (ES / QS) * 100

We calculate these spreads before the trade and additionally at four intervals after the trade (just after, 1 second, 10 seconds, and 60 seconds after the trade).
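
To make these definitions concrete, the following standalone sketch computes the three metrics for a single hypothetical quote; the bid and ask values are illustrative only and are not taken from the dataset.

# Illustrative values only: a hypothetical BBO quote around a 1.16 trade
bid_price, ask_price, trade_price = 1.15, 1.20, 1.16

qs = ask_price - bid_price                           # quoted spread
midpoint = bid_price + (ask_price - bid_price) / 2   # BBO midpoint
es = trade_price - midpoint                          # effective spread
eqf = (es / qs) * 100                                # effective/quoted spread, in percent

print(f"QS={qs:.4f}, ES={es:.4f}, EQF={eqf:.1f}%")
# QS=0.0500, ES=-0.0150, EQF=-30.0%

A negative ES here simply means the trade printed below the prevailing midpoint; its magnitude relative to QS is what the EQF metric captures.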

Configure Athena for Apache Spark

To configure Athena for Apache Spark, complete the following steps:

  1. On the Athena console, under Get started, select Analyze your data using PySpark and Spark SQL.
  2. If this is your first time using Athena Spark, choose Create workgroup.
  3. For Workgroup name, enter a name for the workgroup, such as tca-analysis.
  4. In the Analytics engine section, select Apache Spark.
  5. In the Additional configurations section, you can choose Use defaults or provide a custom AWS Identity and Access Management (IAM) role and Amazon S3 location for calculation results.
  6. Choose Create workgroup.
  7. After you create the workgroup, navigate to the Notebooks tab and choose Create notebook.
  8. Enter a name for your notebook, such as tca-analysis-with-tick-history.
  9. Choose Create to create your notebook.

Launch your notebook

If you have already created a Spark workgroup, select Launch notebook editor under Get started.


After your notebook is created, you are redirected to the interactive notebook editor.


Now we can add and run the following code in our notebook.

Create an analysis

Complete the following steps to create an analysis:

  • Import the pandas and Plotly libraries used for the analysis and visualization:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

  • Create our data frames for BBO, NBBO, and trades:
bbo_quote = spark.read.parquet(f"s3://<bucket>/mt=bbo_quote/f=opra/dt=2023-05-17/*")
bbo_quote.createOrReplaceTempView("bbo_quote")
nbbo_quote = spark.read.parquet(f"s3://<bucket>/mt=nbbo_quote/f=opra/dt=2023-05-17/*")
nbbo_quote.createOrReplaceTempView("nbbo_quote")
trades = spark.read.parquet(f"s3://<bucket>/mt=trade/f=opra/dt=2023-05-17/29_1.parquet")
trades.createOrReplaceTempView("trades")

  • Now we can identify a trade to use for transaction cost analysis:
filtered_trades = spark.sql("select Product, Price, Quantity, ReceiptTimestamp, MarketParticipant from trades")

We get the following output:

+-------------------+---------------------+----------------------+-------------------+-------------------+
|Product            |Price                |Quantity              |ReceiptTimestamp   |MarketParticipant  |
+-------------------+---------------------+----------------------+-------------------+-------------------+
|QQQ 230518C00329000|1.1700000000000000000|10.0000000000000000000|1684338565538021907|NYSEArca           |
|QQQ 230518C00329000|1.1700000000000000000|20.0000000000000000000|1684338576071397557|NASDAQOMXPHLX      |
|QQQ 230518C00329000|1.1600000000000000000|1.0000000000000000000 |1684338579104713924|ISE                |
|QQQ 230518C00329000|1.1400000000000000000|1.0000000000000000000 |1684338580263307057|NASDAQOMXBX_Options|
|QQQ 230518C00329000|1.1200000000000000000|1.0000000000000000000 |1684338581025332599|ISE                |
+-------------------+---------------------+----------------------+-------------------+-------------------+

We use the highlighted trade from this output (the 1.16 trade on ISE) going forward for the trade product (tp), trade price (tpr), and trade time (tt).

  • Here we create a number of helper functions for our analysis:
def calculate_es_qs_eqf(df, trade_price):
    # Cast the quote prices to numeric types before computing spreads
    df['BidPrice'] = df['BidPrice'].astype('double')
    df['AskPrice'] = df['AskPrice'].astype('double')
    # Effective spread: trade price minus the BBO midpoint (BidPrice + (AskPrice - BidPrice)/2)
    df["ES"] = trade_price - (df["BidPrice"] + (df["AskPrice"] - df["BidPrice"]) / 2)
    # Quoted spread: ask minus bid
    df["QS"] = df["AskPrice"] - df["BidPrice"]
    # Effective/quoted spread, expressed as a percentage
    df["EQF"] = (df["ES"] / df["QS"]) * 100
    return df

def get_trade_before_n_seconds(trade_time, df, seconds=0, groupby_col=None):
    # Convert the offset to nanoseconds and add it to the trade timestamp
    nseconds = seconds * 1000000000
    nseconds += trade_time
    # Take the last quote per group (for example, per market participant) before the cutoff
    ret_df = df[df['ReceiptTimestamp'] < nseconds].groupby(groupby_col).last()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    ret_df = ret_df.reset_index()
    return ret_df

def get_trade_after_n_seconds(trade_time, df, seconds=0, groupby_col = None):
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] > nseconds].groupby(groupby_col).first()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    ret_df = ret_df.reset_index()
    return ret_df

def get_nbbo_trade_before_n_seconds(trade_time, df, seconds=0):
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] < nseconds].iloc[-1:]
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    return ret_df

def get_nbbo_trade_after_n_seconds(trade_time, df, seconds=0):
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] > nseconds].iloc[:1]
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    return ret_df

  • In the following function, we create the dataset that contains all the quotes before and after the trade. Athena Spark automatically determines how many DPUs to launch for processing our dataset.
def get_tca_analysis_via_df_single_query(trade_product, trade_price, trade_time):
    # BBO quotes
    bbos = spark.sql(f"SELECT Product, ReceiptTimestamp, AskPrice, BidPrice, MarketParticipant FROM bbo_quote WHERE Product = '{trade_product}'")
    bbos = bbos.toPandas()

    bbo_just_before = get_trade_before_n_seconds(trade_time, bbos, seconds=0, groupby_col="MarketParticipant")
    bbo_just_after = get_trade_after_n_seconds(trade_time, bbos, seconds=0, groupby_col="MarketParticipant")
    bbo_1s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=1, groupby_col="MarketParticipant")
    bbo_10s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=10, groupby_col="MarketParticipant")
    bbo_60s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=60, groupby_col="MarketParticipant")

    all_bbos = pd.concat([bbo_just_before, bbo_just_after, bbo_1s_after, bbo_10s_after, bbo_60s_after], ignore_index=True, sort=False)
    bbos_calculated = calculate_es_qs_eqf(all_bbos, trade_price)

    # NBBO quotes
    nbbos = spark.sql(f"SELECT Product, ReceiptTimestamp, AskPrice, BidPrice, BestBidParticipant, BestAskParticipant FROM nbbo_quote WHERE Product = '{trade_product}'")
    nbbos = nbbos.toPandas()

    nbbo_just_before = get_nbbo_trade_before_n_seconds(trade_time, nbbos, seconds=0)
    nbbo_just_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=0)
    nbbo_1s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=1)
    nbbo_10s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=10)
    nbbo_60s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=60)

    all_nbbos = pd.concat([nbbo_just_before, nbbo_just_after, nbbo_1s_after, nbbo_10s_after, nbbo_60s_after], ignore_index=True, sort=False)
    nbbos_calculated = calculate_es_qs_eqf(all_nbbos, trade_price)

    # Combine the BBO and NBBO results into a single data frame
    calc = pd.concat([bbos_calculated, nbbos_calculated], ignore_index=True, sort=False)

    return calc

  • Now let's call the TCA analysis function with the information from our chosen trade:
tp = "QQQ 230518C00329000"
tpr = 1.16
tt = 1684338579104713924
c = get_tca_analysis_via_df_single_query(tp, tpr, tt)
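
Before moving on to the charts, it can help to eyeball the combined result. The following optional check assumes the column names produced by the queries and helper functions above:

# Optional sanity check on the combined BBO/NBBO result before visualizing
print(c.shape)
print(c[["MarketParticipant", "ReceiptTimestamp", "QS", "ES", "EQF"]].head())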

Visualize the analysis results

Now let's create the data frames we use for our visualization. Each data frame contains quotes for one of the five time intervals for each data feed (BBO, NBBO):

bbo = c[c['MarketParticipant'].isin(['BBO'])]
bbo_bef = bbo[bbo['ReceiptTimestamp'] < tt]
bbo_aft_0 = bbo[bbo['ReceiptTimestamp'].between(tt,tt+1000000000)]
bbo_aft_1 = bbo[bbo['ReceiptTimestamp'].between(tt+1000000000,tt+10000000000)]
bbo_aft_10 = bbo[bbo['ReceiptTimestamp'].between(tt+10000000000,tt+60000000000)]
bbo_aft_60 = bbo[bbo['ReceiptTimestamp'] > (tt+60000000000)]

nbbo = c[~c['MarketParticipant'].isin(['BBO'])]
nbbo_bef = nbbo[nbbo['ReceiptTimestamp'] < tt]
nbbo_aft_0 = nbbo[nbbo['ReceiptTimestamp'].between(tt,tt+1000000000)]
nbbo_aft_1 = nbbo[nbbo['ReceiptTimestamp'].between(tt+1000000000,tt+10000000000)]
nbbo_aft_10 = nbbo[nbbo['ReceiptTimestamp'].between(tt+10000000000,tt+60000000000)]
nbbo_aft_60 = nbbo[nbbo['ReceiptTimestamp'] > (tt+60000000000)]

In the following sections, we provide example code to create different visualizations.

Plot QS and NBBO before the trade

Use the following code to plot the quoted spread and NBBO before the trade:

fig = px.bar(title="Quoted Spread Before The Trade",
    x=bbo_bef.MarketParticipant,
    y=bbo_bef['QS'],
    labels={'x': 'Market', 'y':'Quoted Spread'})
fig.add_hline(y=nbbo_bef.iloc[0]['QS'],
    line_width=1, line_dash="dash", line_color="red",
    annotation_text="NBBO", annotation_font_color="red")
%plotly fig

Plot QS for each market and NBBO after the trade

Use the following code to plot the quoted spread for each market and NBBO immediately after the trade:

fig = px.bar(title="Quoted Spread After The Trade",
    x=bbo_aft_0.MarketParticipant,
    y=bbo_aft_0['QS'],
    labels={'x': 'Market', 'y':'Quoted Spread'})
fig.add_hline(
    y=nbbo_aft_0.iloc[0]['QS'],
    line_width=1, line_dash="dash", line_color="red",
    annotation_text="NBBO", annotation_font_color="red")
%plotly fig

Plot QS for each time interval and each market for BBO

Use the following code to plot the quoted spread for each time interval and each market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['QS']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['QS']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['QS']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['QS']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['QS'])])
fig.update_layout(barmode="group", title="BBO Quoted Spread Per Market/TimeFrame",
    xaxis={'title':'Market'},
    yaxis={'title':'Quoted Spread'})
%plotly fig

Plot ES for each time interval and market for BBO

Use the following code to plot the effective spread for each time interval and market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['ES']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['ES']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['ES']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['ES']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['ES'])])
fig.update_layout(barmode="group", title="BBO Effective Spread Per Market/TimeFrame",
    xaxis={'title':'Market'},
    yaxis={'title':'Effective Spread'})
%plotly fig

Plot EQF for each time interval and market for BBO

Use the following code to plot the effective/quoted spread for each time interval and market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['EQF']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['EQF']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['EQF']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['EQF']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['EQF'])])
fig.update_layout(barmode="group", title="BBO Effective/Quoted Spread Per Market/TimeFrame",
    xaxis={'title':'Market'},
    yaxis={'title':'Effective/Quoted Spread'})
%plotly fig

Athena Spark calculation performance

When you run a code block, Athena Spark automatically determines how many DPUs it requires to complete the calculation. In the last code block, where we call the TCA analysis function, we are actually instructing Spark to process the data, and we then convert the resulting Spark DataFrames into pandas DataFrames. This constitutes the most intensive processing part of the analysis, and when Athena Spark runs this block, it shows the progress bar, elapsed time, and how many DPUs are currently processing data. For example, in the following calculation, Athena Spark is utilizing 18 DPUs.

When you configure your Athena Spark notebook, you have the option of setting the maximum number of DPUs that it can use. The default is 20 DPUs, but we tested this calculation with 10, 20, and 40 DPUs to demonstrate how Athena Spark automatically scales to run our analysis. We observed that Athena Spark scales linearly, taking 15 minutes and 21 seconds when the notebook was configured with a maximum of 10 DPUs, 8 minutes and 23 seconds when the notebook was configured with 20 DPUs, and 4 minutes and 44 seconds when the notebook was configured with 40 DPUs. Because Athena Spark charges based on DPU usage, at a per-second granularity, the cost of these calculations is similar, but if you set a higher maximum DPU value, Athena Spark can return the result of the analysis much faster. For more details, refer to the Amazon Athena pricing page.

Conclusion

In this post, we demonstrated how you can use high-fidelity OPRA data from LSEG's Tick History – PCAP to perform transaction cost analytics using Athena Spark. The timely availability of OPRA data, complemented by the accessibility innovations of AWS Data Exchange for Amazon S3, strategically reduces the time to analytics for firms looking to create actionable insights for critical trading decisions. OPRA generates about 7 TB of normalized Parquet data each day, and managing the infrastructure to provide analytics based on OPRA data is challenging.

Athena's scalability in handling large-scale data processing for Tick History – PCAP OPRA data makes it a compelling choice for organizations seeking swift and scalable analytics solutions on AWS. This post shows the seamless interaction between the AWS ecosystem and Tick History – PCAP data, and how financial institutions can take advantage of this synergy to drive data-driven decision-making for critical trading and investment strategies.


About the Authors

Pramod Nayak is the Director of Product Management of the Low Latency Group at LSEG. Pramod has over 10 years of experience in the financial technology industry, focusing on software development, analytics, and data management. Pramod is a former software engineer and is passionate about market data and quantitative trading.

LakshmiKanth Mannem is a Product Manager in the Low Latency Group of LSEG. He focuses on data and platform products for the low-latency market data industry. LakshmiKanth helps customers build the most optimal solutions for their market data needs.

Vivek Aggarwal is a Senior Data Engineer in the Low Latency Group of LSEG. Vivek works on developing and maintaining data pipelines for processing and delivery of captured market data feeds and reference data feeds.

Alket Memushaj is a Principal Architect in the Financial Services Market Development team at AWS. Alket is responsible for technical strategy, working with partners and customers to deploy even the most demanding capital markets workloads to the AWS Cloud.
