Building my Own GPT (DATA COLLECTION 1.0)

Chris Festus Otopa Ayeh-Datey
3 min read · Apr 7, 2023

Day 2

"The larger your dataset, the more accurate your prediction model" - Unknown (at least to me)

I will try to upload all the data I collect to the Google Drive folder linked below.

https://drive.google.com/drive/folders/1h10GNIYmYhmQmf-tkCubSPEmIqs61BK2?usp=sharing

The goal of this GPT is to help predict stock prices and summarize financial statements.

In order to do that, we have to be clear about the type of data we are collecting.

For the first beta of SadatGPT, I thought it would be prudent to scale down to the stock data of one company. Stock data for Apple Inc. (AAPL) has been gathered, ranging from 12/12/1980 to 06/04/2023 (day/month/year).

This includes the date, the stock price at open, the highest and lowest prices it reached that day, and the volume that was traded that day.
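As a rough illustration of the fields listed above, here is a minimal sketch of loading such a file with Python's standard library. The column names (`Date`, `Open`, `High`, `Low`, `Volume`), the date format, the `DailyBar` and `load_bars` names, and the sample rows are all assumptions for the example, not the actual layout of my dataset.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class DailyBar:
    """One trading day: the fields described in the post."""
    day: date
    open: float
    high: float
    low: float
    volume: int


def load_bars(csv_file, date_format="%Y-%m-%d"):
    """Read daily bars from an open CSV file and return them oldest first."""
    bars = []
    for row in csv.DictReader(csv_file):
        bar = DailyBar(
            day=datetime.strptime(row["Date"], date_format).date(),
            open=float(row["Open"]),
            high=float(row["High"]),
            low=float(row["Low"]),
            volume=int(row["Volume"]),
        )
        # Basic sanity check: the open must sit inside the day's range.
        if not bar.low <= bar.open <= bar.high:
            raise ValueError(f"inconsistent prices on {bar.day}")
        bars.append(bar)
    return sorted(bars, key=lambda b: b.day)


# Placeholder rows (made-up numbers, not real AAPL prices).
SAMPLE_CSV = """Date,Open,High,Low,Volume
2000-01-04,10.5,11.0,10.0,2000
2000-01-03,10.0,10.8,9.9,1500
"""

bars = load_bars(io.StringIO(SAMPLE_CSV))
print(len(bars), bars[0].day)
```

If the dates in a file are written day/month/year, passing `date_format="%d/%m/%Y"` should handle them.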

This type of data is key to one of our goals: predicting stock prices.

The next type of data to collect was macroeconomic indicators. This is also key, since macroeconomic indicators tend to correlate with how the stock market behaves. A question often posed on CNBC is how the stock market will react to the jobs report for the first quarter.

Some of the macroeconomic indicators I have collected data on are:

GDP from 1980 to 2023 (Real Gross Domestic Product)

Interest rates (Effective Federal Funds Rate)

Inflation (Consumer Price Index for All Urban Consumers)

Employment data:

Unemployment rate

Total nonfarm payrolls

Consumer confidence:

Consumer sentiment (monthly)

International trade:

Trade balance (balance of payments: goods and services trade balance)

Imports and exports (US international trade in goods and services, balance of payments basis)
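These indicators are published monthly or quarterly, while the stock data is daily, so at some point the two have to be lined up. One possible approach, sketched below, is to carry the latest known value forward to each trading day. The function name `align_to_days` and the sample values are hypothetical, and this is only one way to do it.

```python
from bisect import bisect_right
from datetime import date


def align_to_days(trading_days, observations):
    """For each trading day, return the latest observation dated on or before it.

    observations: list of (date, value) pairs, in any order.
    Days that precede the first observation get None.
    """
    observations = sorted(observations)
    obs_dates = [d for d, _ in observations]
    aligned = []
    for day in trading_days:
        idx = bisect_right(obs_dates, day)  # number of observations dated <= day
        aligned.append(observations[idx - 1][1] if idx else None)
    return aligned


# Placeholder values, not real unemployment figures.
monthly_rate = [(date(2000, 1, 1), 4.0), (date(2000, 2, 1), 4.1)]
days = [date(1999, 12, 31), date(2000, 1, 3), date(2000, 1, 31), date(2000, 2, 1)]
print(align_to_days(days, monthly_rate))  # [None, 4.0, 4.0, 4.1]
```

One caveat with this sketch: it keys on the date the observation refers to, not the date it was published. Indicators are released after the period they describe, so a model trained this way could see values earlier than the market did.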

Data on corporate earnings is also key, because an earnings report tells us how well the company is doing, and that usually shapes investor confidence. Sentiment from news articles is important too.

Now this is where I ran into confusion today. The SEC would only give me corporate earnings data and SEC filings for Apple from 2013 onward, and it is close to impossible to get news articles, even from the New York Times, covering Apple since it entered the stock market in 1980.

The corporate filings and earnings data I can get have me rethinking how large my training dataset should be. The news article search, on the other hand, has me thinking there might just be a way to get all the articles I am looking for.

My Python script definitely works, but my internet connection might be too weak for a dataset this large. My next option is to create a VM on Azure and run my data collection there.
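Before moving to a VM, one thing that may help on a weak connection is retrying failed requests with a growing delay. The sketch below is not my actual collection script. `fetch_with_retries` and `flaky_download` are hypothetical names, and the download is simulated so the example runs without a network.

```python
import time


def fetch_with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OSError as err:  # connection errors and timeouts subclass OSError
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)


calls = {"count": 0}


def flaky_download():
    # Hypothetical stand-in for a real request: fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection dropped")
    return "payload"


# The no-op sleep keeps the demo instant; real use would keep time.sleep.
result = fetch_with_retries(flaky_download, sleep=lambda seconds: None)
print(result)
```

In a real script, `fetch` would wrap a single request for one chunk of data, so a dropped connection only costs that chunk and not the whole run.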

Stay tuned for Data Collection 2.0
