A submission of the mini-scan take home assignment described below.
--
The solution was implemented using an sqlite3 database.
One challenging aspect of the project was an issue with running sqlite3 in docker with a CGO_ENABLED=0.
To overcome this obstacle, a base image that supports CGO_ENABLED=1 (frolvlad/alpine-glibc) was used.
Next steps include:
- Switching to a PostgreSQL database.
- Handling failures with a DLQ and/or passing the timestamp in the message to make sure the most recent update is saved.
- Pushing to the cloud to test horizontal scaling which the emulater does not support.
- Actually scan the local network.
To test changes to cmd/proessor/main.go
by examining the processor's collection of scan results run
docker system prune --all --volumes --force
docker compose up --build
# this runs unit tests
docker-compose run --entrypoint /bin/sh processor
sqlite3 /data/processor.db
select * from scans
Unit tests are available in cmd/processor and pkg/scanning for your enjoyment.
Thank You and have a great day!
Hello!
As you've heard by now, Censys scans the internet at an incredible scale. Processing the results necessitates scaling horizontally across thousands of machines. One key aspect of our architecture is the use of distributed queues to pass data between machines.
The docker-compose.yml
file sets up a toy example of a scanner. It spins up a Google Pub/Sub emulator, creates a topic and subscription, and publishes scan results to the topic. It can be run via docker compose up
.
Your job is to build the data processing side. It should:
- Pull scan results from the subscription
scan-sub
. - Maintain an up-to-date record of each unique
(ip, port, service)
. This should contain when the service was last scanned and a string containing the service's response.
NOTE The scanner can publish data in two formats, shown below. In both of the following examples, the service response should be stored as:
"hello world"
.{ // ... "data_version": 1, "data": { "response_bytes_utf8": "aGVsbG8gd29ybGQ=" } } { // ... "data_version": 2, "data": { "response_str": "hello world" } }
Your processing application should be able to be scaled horizontally, but this isn't something you need to actually do. The processing application should use at-least-once
semantics where ever applicable.
You may write this in any languages you choose, but Go, Scala, or Rust would be preferred. You may use any data store of your choosing, with sqlite
being one example.
Please upload the code to a publicly accessible GitHub, GitLab or other public code repository account. This README file should be updated, briefly documenting your solution. Like our own code, we expect testing instructions: whether it’s an automated test framework, or simple manual steps.
To help set expectations, we believe you should aim to take no more than 4 hours on this task.
We understand that you have other responsibilities, so if you think you’ll need more than 5 business days, just let us know when you expect to send a reply.
Please don’t hesitate to ask any follow-up questions for clarification.