As a Data Engineer and Data Scientist, I specialize in data pipeline development, migration, visualization, and statistical modeling. I am proficient in Python, SQL, and SAS, and have experience with cloud platforms such as Azure, GCP, and Oracle Cloud, as well as Snowflake and Oracle ERP systems. In my current role at Heathrow Airport Limited, I design data lakehouse architectures, migrate pipelines to Databricks Unity Catalog, and optimize database performance. I also work with big data technologies such as Spark and Hadoop and use DevOps tools for CI/CD workflows.
I am certified as a Microsoft Azure Data Engineer and a Google Cloud Professional Data Engineer. My GitHub projects showcase my skills in data engineering, machine learning, and Retrieval-Augmented Generation (RAG) with vector databases.
Feel free to explore my GitHub profile for more on my projects and contributions to the open-source community.
- 👨‍💻 All of my projects are available at https://github.com/WilsonH918
- 📫 How to reach me: wilson.hh.hsieh@gmail.com
| Project Link | Tools | Project Description |
| --- | --- | --- |
| EnergyStocks Historical Price DataPipeline | PySpark, SQL, AWS (Lambda, EC2, S3), Snowflake (CDC), PowerBI | A data pipeline that retrieves historical stock price data for S&P 500-listed energy companies, stores it in an AWS S3 bucket, and transforms it in a Snowflake data warehouse. The pipeline is automated with AWS Lambda, which triggers a Python script on a schedule. A simplified sketch of the extract-and-land step follows the table. |
| RAG-based Document Retrieval with ChromaDB and Vector Embeddings | Python, OpenAI API, ChromaDB, LangChain, BeautifulSoup, Requests | ChromaQuery is an AI-powered knowledge retrieval system that combines retrieval-augmented generation (RAG), web scraping, and ChromaDB for accurate, up-to-date responses. It uses OpenAI embeddings and vector search to retrieve relevant articles and generate contextual answers: content is scraped from the web, stored as chunks in the database, and queried to ground the generated responses. The retrieval step is sketched below the table. |
| ERC20 Data Ingestion Pipeline | Python, Airflow (DAGs), PostgreSQL, Docker, Hadoop | Extracts ERC20 token data from Web3 via the Etherscan API and loads it through an ETL pipeline built with Apache Airflow; the extracted data is fed into a local PostgreSQL database on a daily schedule. The project uses Docker, Airflow DAGs, PostgreSQL, and HDFS. A minimal DAG sketch follows the table. |
| Real-time Streaming of ERC20 Transactions with Kafka and Python | Python, SQL, Kafka, Docker, Web3 | Demonstrates how to build a real-time data pipeline that retrieves ERC20 token transactions and stores them in a local CSV file. Apache Kafka, an open-source distributed streaming platform, streams real-time data from the Etherscan API, a blockchain explorer for the Ethereum network; the data is then written to CSV locally. The producer/consumer split is sketched below the table. |
| Thesis Code - Motion Heatmap and Machine Learning for Stair Climbing Detection | PySpark, Pandas, scikit-learn, TensorFlow, Matplotlib, Seaborn | The code behind my thesis, "Motion Heatmap and Machine Learning for Stair Climbing Detection." The thesis presents a video dataset with bounding-box annotations and silhouette images, along with methods for processing this data to detect human movements, trajectories over time, and room usage in a home environment. The classification step is sketched below the table. |
| ERC20 MyToken | Solidity, Python, Web3, Blockchain | A simple ERC20 token contract written in Solidity that supports creating, transferring, and burning tokens, with an onlyOwner modifier restricting access to certain functions. Interacting with a deployed contract from Python is sketched below the table. |
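
Each row above points at a full repository; the sketches below distill one core pattern from several of them. They are minimal, self-contained examples under stated assumptions, not excerpts from the actual code: bucket names, tickers, addresses, API keys, and table names are all placeholders.

For the EnergyStocks pipeline, the extract-and-land step can be sketched as a Lambda-style handler that pulls historical prices and writes CSVs to S3, where Snowflake picks them up. This assumes the `yfinance` and `boto3` packages; the bucket name and tickers are hypothetical.

```python
# Minimal sketch of the extract-and-land step, assuming yfinance and boto3.
# Bucket name and tickers are placeholders, not the project's actual config.
import io

import boto3
import yfinance as yf

BUCKET = "energy-stocks-raw"          # hypothetical bucket
TICKERS = ["XOM", "CVX", "COP"]       # sample S&P 500 energy tickers

def handler(event, context):
    """Lambda entry point: pull one year of daily prices, write CSV to S3."""
    s3 = boto3.client("s3")
    for ticker in TICKERS:
        # Download daily OHLCV history for the past year.
        df = yf.download(ticker, period="1y", interval="1d")
        buf = io.StringIO()
        df.to_csv(buf)
        # Land raw data in S3; Snowflake ingests it from there.
        s3.put_object(Bucket=BUCKET, Key=f"raw/{ticker}.csv", Body=buf.getvalue())
    return {"status": "ok", "tickers": len(TICKERS)}
```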
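For ChromaQuery, the retrieval step can be sketched as: scrape a page, chunk it, embed the chunks into ChromaDB, then run a vector query. This assumes the `chromadb`, `requests`, and `beautifulsoup4` packages and a placeholder URL; the real system's chunking and prompting are more involved.

```python
# Minimal RAG retrieval sketch with ChromaDB and OpenAI embeddings.
# The URL and collection name are placeholders.
import os

import chromadb
import requests
from bs4 import BeautifulSoup
from chromadb.utils import embedding_functions

# Scrape and naively chunk one article (fixed-size character chunks).
html = requests.get("https://example.com/article", timeout=30).text
text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]

# Store chunks with OpenAI embeddings in an in-memory ChromaDB collection.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002",
)
collection = chromadb.Client().create_collection(
    "articles", embedding_function=openai_ef
)
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# Vector search: the top chunks become context for the generation step.
results = collection.query(query_texts=["What is the article about?"], n_results=3)
print(results["documents"][0])
```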
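The ERC20 ingestion project's orchestration boils down to a daily Airflow DAG whose task pulls token transfers from the Etherscan API and writes them to Postgres. A sketch assuming `requests` and `psycopg2`; the contract address, connection string, and table are placeholders.

```python
# Minimal daily DAG sketch: Etherscan -> PostgreSQL.
# API key, contract address, and connection string are placeholders.
import os
from datetime import datetime

import psycopg2
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_erc20():
    """Fetch recent token transfers and insert them into Postgres."""
    resp = requests.get(
        "https://api.etherscan.io/api",
        params={
            "module": "account",
            "action": "tokentx",
            "contractaddress": "0x...",  # placeholder contract address
            "apikey": os.environ["ETHERSCAN_API_KEY"],
        },
        timeout=30,
    )
    rows = resp.json()["result"]
    conn = psycopg2.connect("postgresql://airflow:airflow@postgres:5432/tokens")
    with conn, conn.cursor() as cur:
        for tx in rows:
            # Assumes a unique constraint on hash so reruns are idempotent.
            cur.execute(
                "INSERT INTO erc20_tx (hash, value) VALUES (%s, %s) "
                "ON CONFLICT DO NOTHING",
                (tx["hash"], tx["value"]),
            )
    conn.close()

with DAG("erc20_ingestion", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    PythonOperator(task_id="ingest_erc20", python_callable=ingest_erc20)
```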
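The Kafka project follows the classic producer/consumer split: one process polls Etherscan and publishes transactions to a topic, another consumes the topic and appends rows to a CSV file. A sketch using `kafka-python`; the topic name, contract address, and polling interval are simplified placeholders.

```python
# Minimal producer/consumer sketch with kafka-python (names are placeholders).
import csv
import json
import os
import time

import requests
from kafka import KafkaConsumer, KafkaProducer

TOPIC = "erc20-transactions"

def produce():
    """Poll the Etherscan API and publish each transaction as JSON."""
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    while True:
        resp = requests.get(
            "https://api.etherscan.io/api",
            params={"module": "account", "action": "tokentx",
                    "contractaddress": "0x...",  # placeholder
                    "apikey": os.environ["ETHERSCAN_API_KEY"]},
            timeout=30,
        )
        for tx in resp.json()["result"]:
            producer.send(TOPIC, tx)
        producer.flush()
        time.sleep(15)  # simple polling interval

def consume():
    """Read from the topic and append rows to a local CSV file."""
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    with open("erc20_transactions.csv", "a", newline="") as f:
        writer = csv.writer(f)
        for msg in consumer:
            writer.writerow([msg.value["hash"], msg.value["value"]])
```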
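For the thesis code, the classification step can be summarized as: derive features from bounding-box trajectories and train a scikit-learn classifier. This sketch uses synthetic features purely to show the shape of the pipeline; the thesis's actual features, labels, and models differ.

```python
# Illustrative classifier sketch on synthetic bounding-box features.
# Feature names and data are made up; they stand in for trajectory-derived features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic per-window features: mean vertical velocity, box-height change, aspect ratio.
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 1 = stair climbing, 0 = other

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```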
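Finally, a deployed ERC20 contract like MyToken can be exercised from Python with `web3.py`. A sketch assuming a local development node with unlocked accounts; the RPC URL and token address are placeholders (the address must be replaced with a real checksummed one), and the contract itself lives in Solidity in the repo.

```python
# Minimal web3.py sketch for reading a balance and transferring an ERC20 token.
# RPC URL and token address are placeholders.
from web3 import Web3

# Just the two ERC20 functions this sketch needs.
ERC20_ABI = [
    {"name": "balanceOf", "type": "function", "stateMutability": "view",
     "inputs": [{"name": "owner", "type": "address"}],
     "outputs": [{"name": "", "type": "uint256"}]},
    {"name": "transfer", "type": "function", "stateMutability": "nonpayable",
     "inputs": [{"name": "to", "type": "address"},
                {"name": "value", "type": "uint256"}],
     "outputs": [{"name": "", "type": "bool"}]},
]

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))  # e.g. a local Ganache/Hardhat node
token = w3.eth.contract(address="0x...", abi=ERC20_ABI)  # placeholder token address

owner = w3.eth.accounts[0]
print("balance:", token.functions.balanceOf(owner).call())

# Send 100 token units from the node's unlocked first account.
tx_hash = token.functions.transfer(w3.eth.accounts[1], 100).transact({"from": owner})
w3.eth.wait_for_transaction_receipt(tx_hash)
```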