
Add extra lsp features #204


Merged: 19 commits into master on Apr 17, 2025

Conversation

JonasAlaif
Contributor

No description provided.

@LaurenzV

@rayman2000

So, I believe I've found three issues. For some of them I'm not sure what the best way of fixing them is, so for now I've just implemented "hacky" workarounds to demonstrate that these should fix the issue.

Current Architecture

Just to briefly summarize my understanding of the current architecture: At the start of the communication chain, we have two kinds of "message reporting tasks", namely verification tasks and AST construction tasks. Those tasks regularly send messages to the "reporter" (for example when an exception occurs or when a new state is reached), which in this case is a Queue Actor. The queue actor receives those messages and puts them into a queue. Others can "subscribe" to those queues and receive the messages there. In our case, this is what the RelayActor does: it receives those messages and uses them to update the state in the LSP server.
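To make that concrete, here is a rough, self-contained sketch of that flow (simplified names, assuming Akka 2.6-style streams; these are not the actual ViperServer classes):

import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Sink, Source}

object MessageFlowSketch {
  final case class Report(text: String)

  // Stands in for the RelayActor: it consumes the queued messages and would
  // use them to update the LSP server's state.
  class RelaySketch extends Actor {
    def receive: Receive = {
      case Report(text) => println(s"relay received: $text")
    }
  }

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("sketch")
    val relay: ActorRef = system.actorOf(Props(new RelaySketch))

    // The queue side: reporting tasks (verification / AST construction) offer messages here...
    val queue = Source
      .queue[Report](64, OverflowStrategy.backpressure)
      // ...and the relay actor is the subscriber on the downstream end.
      .to(Sink.foreach(msg => relay ! msg))
      .run()

    queue.offer(Report("new verification state reached"))
    queue.complete() // corresponds to the reporting task finishing
  }
}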

Problem 1: Dead letters

The first problem I noticed is that I always kept getting those "dead letter" messages, even for very simple verification tasks. If I understood correctly, the reason for this is that at some point, we destroy the relay actor responsible for receiving the messages from AST constructions:

protected def discardAstOnCompletion(jid: AstJobId, jobActor: ActorRef) = {
  ast_jobs.lookupJob(jid).map(_.map(_.queue.watchCompletion().onComplete(_ => {
    ast_jobs.discardJob(jid)
    jobActor ! PoisonPill
  })))
}

We only do this once the queue has completed (i.e. AST construction has finished), but apparently akka still sends a final "Success" message after the watchCompletion callbacks have been executed, and since we kill the actor in this callback, that message can no longer be delivered. For now, I've just commented out the part where we kill the actor, but obviously this will require a more principled solution. And by the way, I might have missed something, but while debugging it seemed to me that we never kill the verification relay actors that we create, so that might also be something to look into.
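Just as an illustration of what a more principled fix could look like (this is an assumption on my part, not what my commit does): the job actor could stop itself only after it has actually processed the stream's final completion message, e.g.:

import akka.actor.{Actor, Status}

// Illustration only: the exact completion message depends on how the sink is
// wired up; matching on Status.Success here is an assumption.
class SelfStoppingJobActor extends Actor {
  def receive: Receive = {
    case Status.Success(_) =>
      // The final message has been delivered, so stopping now cannot turn it
      // into a dead letter.
      context.stop(self)
    case msg =>
      println(s"job message: $msg") // handle the regular job messages here
  }
}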

Problem 2: IDELanguageClient

Currently, in the client coordinator, we store the IdeLanguageClient as an option, which is set once on initialization and reset to None on exiting:

class ClientCoordinator(val server: ViperServerService)(implicit executor: VerificationExecutionContext) {
  val localLogger: Logger = ClientLogger(this, "Client logger", server.config.logLevel()).get
  val logger: Logger = server.combineLoggers(Some(localLogger))
  private var _client: Option[IdeLanguageClient] = None
  private val _files: ConcurrentMap[String, FileManager] = new ConcurrentHashMap() // each file is managed individually.
  private var _vprFileEndings: Option[Array[String]] = None
  private var _exited: Boolean = false

  def client: IdeLanguageClient = {
    _client.getOrElse({ assert(false, "getting client failed - client has not been set yet"); ??? })
  }

  def setClient(c: IdeLanguageClient): Unit = {
    assert(_client.isEmpty, "setting client failed - client has already been set")
    _client = Some(c)
  }

  /** called when client disconnects; the server should however remain running */
  def exit(): Unit = {
    _exited = true
    _client = None
    _files.clear()
  }

This is a bit of a problem during tests, because it can happen that we disconnect while a verification or something else is still running, and then each call to client leads to an assertion error, resulting in a whole array of very weird error messages that suddenly appear. The way I solved this for now is that we no longer store the language client as an option; instead, when resetting the coordinator, we fall back to a MockClient like the one used in testing. This makes the error messages go away for me, so I believe this approach can be used directly (though the MockClient should obviously be deduplicated).
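For illustration, the workaround boils down to something like this (the trait below is only a stand-in for the real IdeLanguageClient, which has many more methods):

// Stand-in for the real IdeLanguageClient interface.
trait LanguageClientLike {
  def notifyProgress(percent: Int): Unit
}

// Mirrors the MockClient used by the test suite: every method is a no-op, so
// messages sent after the client has disconnected are silently swallowed.
object NoOpClient extends LanguageClientLike {
  def notifyProgress(percent: Int): Unit = ()
}

class CoordinatorSketch {
  private var _client: LanguageClientLike = NoOpClient // never unset

  def client: LanguageClientLike = _client
  def setClient(c: LanguageClientLike): Unit = { _client = c }

  /** called when the client disconnects; fall back to the no-op client instead of None */
  def exit(): Unit = { _client = NoOpClient }
}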

Problem 3: Terminating a null job actor

The third problem, which caused the stress test "quickly switching backends" to get stuck when verifying a file that doesn't reach the verification stage (e.g. due to a parsing error), comes down to this part:

private def stopOnlyVerification(handle: VerHandle, combinedLogger: Logger): Future[Boolean] = {
  implicit val askTimeout: Timeout = Timeout(config.actorCommunicationTimeout() milliseconds)
  val interrupt: Future[String] = (handle.job_actor ? StopVerification).mapTo[String]
  handle.job_actor ! PoisonPill // the actor played its part.
  interrupt.map(msg => {
    combinedLogger.info(msg)
    true
  })
}

The problem is that we set the job actor of this verification to null when AST creation didn't complete:

Future.successful(VerHandle(null, null, null, prev_job_id_maybe))

And thus, we try to terminate a job actor that is null, which results in the server hanging completely. I fixed this by adding a simple null check before doing so; I believe this fix can be taken as is.
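Roughly, the guarded version looks like this (returning false when there is no actor to stop is my choice for the sketch, not necessarily the exact committed code):

private def stopOnlyVerification(handle: VerHandle, combinedLogger: Logger): Future[Boolean] = {
  if (handle.job_actor == null) {
    // AST creation never completed, so there is no verification actor to stop.
    Future.successful(false)
  } else {
    implicit val askTimeout: Timeout = Timeout(config.actorCommunicationTimeout() milliseconds)
    val interrupt: Future[String] = (handle.job_actor ? StopVerification).mapTo[String]
    handle.job_actor ! PoisonPill // the actor played its part.
    interrupt.map(msg => {
      combinedLogger.info(msg)
      true
    })
  }
}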

I've pushed my changes here, though as I said this still needs cleaning up and the first problem requires a more principled solution: LaurenzV@99b5bd4

If I run the Viper IDE tests using those changes, all tests seem to pass for me locally. :)

@rayman2000
Contributor

@LaurenzV
So problem 1 is somewhat unrelated to this pull request right? Since we do not have an easy solution, I would probably move this to a separate issue in the interest of getting things merged quickly.

Problem 3 seems to be the main thing we were looking for, great that you figured that out and that the solution is quite simple!

I am not entirely sure if problem 2 is new / related to the current issues. I do not love the solution of using a MockClient outside of tests. Can this only occur during tests?

@LaurenzV

LaurenzV commented Feb 23, 2025

So problem 1 is somewhat unrelated to this pull request right? Since we do not have an easy solution, I would probably move this to a separate issue in the interest of getting things merged quickly.

It's probably unrelated to the specific test failure, yes, but it's not unrelated to the PR, since afaik it was introduced here in the first place. But it probably should still be fixed before this PR is merged; it's an issue after all, and I'm not sure whether it could lead to other weirdness.

I am not entirely sure if problem 2 is new / related to the current issues. I do not love the solution of using a MockClient outside of tests. Can this only occur during tests?

Same as above, it might not be related to that issue specifically, but I still think it should be addressed before merging.

With that said, if you want, I can check whether just applying solution 3 fixes all of the test failures.

@rayman2000
Contributor

Ah, I was assuming (for no reason) that at least Problem 1 was pre-existing. Yeah, if these are regressions then we should definitely address them!

@rayman2000
Contributor

OK, I have talked with @ArquintL and we decided that we prefer the following solution to problem 2: instead of setting _client to a MockClient, we change the return type of client to an Option. This forces callers to decide what to do if the client is None, instead of silently dropping things.
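Roughly, the intended shape is the following (simplified; the actual signatures in the PR may differ):

class ClientCoordinatorSketch {
  private var _client: Option[IdeLanguageClient] = None

  // Callers now have to handle the "no client connected" case explicitly.
  def client: Option[IdeLanguageClient] = _client

  def setClient(c: IdeLanguageClient): Unit = {
    assert(_client.isEmpty, "setting client failed - client has already been set")
    _client = Some(c)
  }

  /** called when the client disconnects; the server keeps running */
  def exit(): Unit = {
    _client = None
  }
}

A call site then explicitly drops (or logs) a notification when no client is attached, e.g. coordinator.client.foreach(_.someNotification(...)), where someNotification is a made-up method name.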

@LaurenzV

LaurenzV commented Mar 1, 2025

@rayman2000 I took a look, and indeed it seems like on the current main branch, relay actors are never killed. I verified this by adding

    override def postStop(): Unit = {
      println("XActor killed")
    }

to the respective classes: for queue actors the message is printed after each parse/verification step, while for the relay actor such a message is never printed. So I guess if we decide to just not kill the actor, it at least wouldn't be a regression, but obviously it would still be better to fix the underlying problem.

@rayman2000
Contributor

@JonasAlaif and I have decided to just not kill the actors. The effort to fix this seems pretty large (mostly because no one really understands the actor model we use...), and it is not a regression; the issue existed beforehand as well. While having unnecessary actors lying around is not ideal, for known usages of ViperServer this has not been an issue so far.

@rayman2000
Contributor

Linard and I have done some more digging and found some more issues. First of all, I get stack traces when I verify multiple files one after the other; one of the gets here is called on a None:

if (uri == file.file_uri) this else project.left.toOption.get.get(uri).get

Second of all, we printed out all of the existing actors over time. It looks like each edit of a file causes a new AST Actor to be created, and they are never removed, even with the Poison Pills added back in. So the list of actors grows fairly quickly if you just edit a file a couple of times. This feels bad enough that we probably do not want to just merge it as is.
Having an actor for every AST job has always seemed like a questionable design decision, but now that we have lots and lots of AST jobs, we probably have to change this before this can reasonably be merged...

@LaurenzV

LaurenzV commented Mar 5, 2025

So should I prioritize looking into this over trying to rebase my existing branches on this one?

@rayman2000
Contributor

All right, gang, it's update time!
So, I think I have fixed the actor mess enough for this to progress. The final problem was a juicy race condition when you cancel an AST job and it finishes at the same time: this led to the cancellation waiting for a response from the already poisoned job actor. It now no longer waits, which means we will get a couple of dead letter messages (which is fine in my opinion, since it only happens in this race condition).
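Conceptually (illustrative names, not the exact code), the cancellation now tells instead of asking, so a message sent to an already stopped job actor just becomes a dead letter rather than a reply the caller waits for:

import akka.actor.{ActorRef, PoisonPill}

object CancellationSketch {
  // StopAstJob is a made-up message name standing in for the real stop message.
  case object StopAstJob

  // Fire-and-forget cancellation: if the job actor already finished and was
  // poisoned, both messages below end up as dead letters instead of making the
  // caller block on a reply that will never come.
  def cancelAstJob(jobActor: ActorRef): Unit = {
    jobActor ! StopAstJob
    jobActor ! PoisonPill
  }
}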
So now:

  • I think all the actors are killed appropriately, and the reflection hack to check this seems to confirm it.
  • The IDE tests run through for me.

@JonasAlaif the issue here is still open:

if (uri == file.file_uri) this else project.left.toOption.get.get(uri).get

This is all in your changes and is not akka-related, so I would be grateful if you could take a look. One of the gets is called on a None if I quickly click between multiple files in the IDE.

@ArquintL
Member

I only made a quick pass over the changes and left some comments

@JonasAlaif
Contributor Author

@ArquintL come chat about the remaining two points

@JonasAlaif
Contributor Author

@rayman2000 done with the review, happy to merge if you're ready

@rayman2000 added this pull request to the merge queue on Apr 17, 2025
Merged via the queue into master with commit 0195c6f on Apr 17, 2025
13 checks passed